Statistics for the mildly curious

2014-02-02 - Reading time: 11 minutes

So you've heard about statistics in the news, and maybe you even took a course a long, long time ago in a classroom far, far away, but the memories have all long since contributed to the inevitable heat death of the universe. But for whatever reason, you're now curious: what is this statistics thing all about?

Now sure, you could pick up a copy of "Statistics for Dummies" or some such, but I wanted to give you the essence of statistics in a single blog post using some corny examples. So let's get started!

Summarization

Say you plan on entering the local cucumber growing competition later in the growing season, and you are interested in how your current crop is faring. One way to proceed is to go outside and merely look at the cucumbers on their vines to get a feel for how big they are. This approach quickly leaves us wanting because it is completely subjective: we cannot accurately record our observations, which makes it harder to compare this year's crop with cucumbers from past or future seasons, or with our estranged cousin Lester's crop in the great cucumber growing country of Latvia.

So we need an objective quantification of what we're interested in. One way to do this is to pick a few cucumbers (we have a large crop) and then weigh them. Now if you only picked one cucumber, which weighed (or massed, if you're picky) 1.2kg, then it is easy enough to write that down, send it to our dear aunt Myrtle (another cucumber aficionado in the family), or save it in our filing cabinet. But let's say we pick 70 cucumbers. We certainly don't want to memorize all 70 numbers, or even send all 70 weights to poor aunt Myrtle.

So we want some way to summarize these 70 numbers into a few more meaningful numbers. Thus we introduce the concept of summary 'statistics' to represent this dataset in a more compact way. Most people are familiar with the average or mean, but some others include: standard deviation, variance, median, mode, minimum, maximum, quartiles, quantiles, deciles, coefficient of variation, etc. Rather than explaining all of those here, if my readership (Hi Bryan!) demands it, I could do another blog post describing some of these in more detail, or you could always check Wikipedia.

For the dataset of our 70 cucumbers' masses (which you can download here if you want to follow along at home), we can compute a few statistics to summarize the dataset as:
Summary Stats:
Mean:         2.809481
Std Dev:      0.589431
Minimum:      1.827191
1st Quartile: 2.345176
Median:       2.751632
3rd Quartile: 3.079201
Maximum:      4.628404
Good, now dear aunt Myrtle can be proud of our large average cucumber mass, and our consistent (low variance) growing results.
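
If you'd like to reproduce numbers like these at home, here is a minimal sketch in Python using numpy (the filename cucumbers.csv and the one-mass-per-line layout are my own assumptions, not part of the downloadable dataset):

import numpy as np

masses = np.loadtxt("cucumbers.csv")           # hypothetical filename: one mass (kg) per line
print("Mean:        ", masses.mean())
print("Std Dev:     ", masses.std(ddof=1))     # sample standard deviation
print("Minimum:     ", masses.min())
print("1st Quartile:", np.percentile(masses, 25))
print("Median:      ", np.median(masses))
print("3rd Quartile:", np.percentile(masses, 75))
print("Maximum:     ", masses.max())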

Parameter Inference

But we can do better: oftentimes, we can describe our data more concisely and more accurately if we make a distributional assumption.

For example, say we are a casino owner and are concerned about possible tampering with a new shipment of dice we just received. To investigate, we paid some poor soul to roll a few dice several thousand times and record the results (available to download here). We then remember the lesson we learned with our prize-winning veggies and summarize the 7500 rolls as:

Summary Stats:
Mean: 3.512000
Minimum: 1.000000
1st Quartile: 2.000000
Median: 4.000000
3rd Quartile: 5.000000
Maximum: 6.000000

Yet we're left feeling unsatisfied, because we don't really care so much about these particular statistics. Instead, we'd like some way to consider the chance of rolling the various faces of the die. To do that, we assume that the dice rolls follow some *probability distribution*. In this case, that distribution is the simple discrete or categorical distribution: we assume there is some probability p_1 of rolling a 1, p_2 of rolling a 2, and so on, such that p_1 + p_2 + p_3 + p_4 + p_5 + p_6 = 1. We call these values p_i the parameters of the model, and a given set of values for them specifies a particular realization of the general model.

At this point, we have a model and some data, but we're stuck: we don't have any way of combining these two together!

Luckily, our cousin Vinny has a few opinions on these matters and recommends a solution: count up the number of ones in your dataset, divide by the total number of rolls, and use this proportion as the value for p_1. Then rinse and repeat for the other die faces and voila, you have values for all the parameters in your probability model.

Statisticians call this general process inference, where data are used to infer, or estimate, the parameter values for an assumed model. While this technique of counting and dividing by the total seems logical, it is merely one potential way of going about this procedure, and there are much more principled ways of performing parameter inference.
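
In code, Vinny's count-and-divide recipe takes only a few lines. Here is a sketch in Python with numpy (the filename rolls.csv and its one-roll-per-line layout are my assumptions):

import numpy as np

rolls = np.loadtxt("rolls.csv", dtype=int)   # hypothetical filename: 7500 rolls, each a face from 1 to 6
for face in range(1, 7):
    p_hat = np.mean(rolls == face)           # count of this face divided by the total number of rolls
    print(f"p_{face} = {p_hat:.4f}")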

Aside: While Vinny's method feels right, it is dangerous to proceed this way in general because the method is ad hoc and has no theoretical justification. What I mean is this: we have no idea how 'good' Vinny's procedure is. It may be biased, or introduce a lot of error, or even be outright incorrect, and unless we have more theoretical justification for using an inference procedure, we must be extremely wary of the resulting estimates. Fortunately, much of statistics is focused on resolving the inference problem in a number of situations, so don't despair, but unfortunately we don't have the time or space to go into more detail here.

Model Based Inference

Unfortunately, spending so many long nights at the casino performing parameter inference on our dice shipments has caused problems at home. The household pet beetle, Sarah, is pregnant, and we're not sure the beetle cage proudly displayed in the living room is going to be large enough to house Sarah and her spawn. If only there were some way to know (or at least predict) how many babies Sarah is likely to have in the coming days, then we could sleep easier at night.

Fortunately, over the last thirty years, we've kept meticulous records of our beetles and their clutch sizes. Furthermore, we noticed that the larger and longer a beetle is, the more offspring it tends to have, so we recorded the mother's carapace length along with the number of beetle babies born in each clutch.

But now we hit a roadblock, because after searching Wikipedia, we can't seem to find any probability distribution designed to model this type of dependence between a beetle's length and the number of babies it has. Fortunately, we can create new probability distributions using the standard distributions (like the categorical distribution we used for dice, or a Gaussian, Poisson, or any other distribution) as building blocks to describe the situation we have. For example, in this case, we might assume that there is a linear relationship between the carapace length of a beetle and the number of children it is going to have.

"But wait a second, couldn't I just use the best fit line in Excel on my beetle clutch data?" The answer to that question is yes, but it leaves much to be desired: A linear best fit line will give us a prediction of the number of beetle babies born, but what confidence would we then have in that prediction?  And if our cage only can hold 15 babies, what is the chance that Sarah will have 15 or less babies?

For these types of probabilistic questions, we need a probabilistic model, which we can construct by gluing simpler distributions together to best fit the situation we have. For example, we can assume that the number of babies, B, is distributed (denoted with the '~' tilde character) according to a Poisson distribution, which has a single parameter (unlike the 6 we had earlier for our dice categorical distribution) that we set equal to the length L of the beetle times a constant a. Thus:

B ~ Poisson(a * L)

Using this, we can perform parameter inference for our model using our data, then plug Sarah's length into our model and see what the chance is that she'll have 15 babies or fewer.
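
To make that concrete, here is a rough sketch in Python with numpy and scipy. The data file beetles.csv, its two-column layout (carapace length, clutch size), Sarah's length, and the simple plug-in estimate of a (total babies divided by total length) are all illustrative assumptions of mine, not the one true inference procedure:

import numpy as np
from scipy.stats import poisson

data = np.loadtxt("beetles.csv", delimiter=",")  # hypothetical file: columns are carapace length L, clutch size B
L, B = data[:, 0], data[:, 1]

a_hat = B.sum() / L.sum()                        # simple plug-in estimate of the constant a

sarah_length = 2.3                               # Sarah's carapace length; made-up value
expected_babies = a_hat * sarah_length           # the Poisson rate for a beetle of Sarah's size
print("Expected clutch size:  ", expected_babies)
print("P(15 or fewer babies): ", poisson.cdf(15, expected_babies))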

I know I'm rushing through the details here, but I'm hoping you still grabbed the main concept: we are adding complexity by building up larger models using simple distributions as building blocks, and then using these models to learn from the data and make predictions.

Hypothesis Testing

"But wait," I hear you say, "you haven't said anything about that beloved old warhorse of mine: the T-test. Isn't that an integral part of statistics?"

Ahh yes, the t-test. Don't worry my old friend, we'll get there in good time.  But first, let's back up a little.

Lester, our cucumber-growing cousin from Latvia, is clamoring for a fight. To him, the thought of some 'foreigner' growing heavier cucumbers is a great insult to the pride of Latvia. In a smug email, he retorted that his cucumbers were obviously larger and gave us his list of 30 cucumber weights to prove it.

So the question becomes, are his cucumbers really larger than ours?

One straightforward way to proceed is to calculate the mean of his cucumbers and compare that to the mean of your cucumbers. But this method is problematic: neither of you weighed your entire crop, merely a sampling of it (we're assuming Lester took a uniform sample and didn't bias it by weighing only his largest cucumbers), and he could have just gotten lucky and randomly chosen a few good ones.

So to make progress, we need to state more precisely the question we would like to ask of our data. One of the simplest sets of assumptions we can make about our cucumber data is that it can be modeled using a Gaussian distribution.

The Gaussian distribution has two parameters, one which controls its location, and the other its spread. In fancy terminology, we can say that it is parameterized by the mean and standard deviation.

Now we can specify our question precisely enough to answer it mathematically: is the mean of the Gaussian modeling cousin Lester's cucumber masses larger than the mean of the Gaussian modeling my cucumber masses?

To do this, we can either do all the calculations ourselves or perform a t-test, which was designed to answer this exact question.
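
If you want to try this yourself, scipy provides a two-sample t-test out of the box. A minimal sketch (the filenames are my assumptions, and I've chosen Welch's unequal-variance variant since the two crops need not be equally consistent):

import numpy as np
from scipy.stats import ttest_ind

ours = np.loadtxt("cucumbers.csv")     # hypothetical filename: our 70 cucumber masses
lesters = np.loadtxt("lesters.csv")    # hypothetical filename: Lester's 30 cucumber masses

# Welch's t-test does not assume the two crops have equal variance
t_stat, p_value = ttest_ind(lesters, ours, equal_var=False)
print("t statistic:", t_stat)
print("p-value:    ", p_value)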

Summary

Now I have glossed over many details to make this whole introduction more palatable, and I couldn't even show you the result of a t-test without going into the details of how to interpret it (which are fraught with subtleties that would take time to cover properly). But I hope I have given you a high-level overview of some of the main concepts of statistics.

If you're interested in learning more, then of course there are many details to cover in each of the topics above, but I've also completely ignored the large topic of Bayesian versus Frequentist statistics. While both cover the same activities described above, they do so in philosophically different ways. This division often causes large areas of disagreement among amateurs and mid-level statisticians, but I would say that most professionals know where the dividing line is and pick a side, while acknowledging that the other side is a valid way to proceed given certain assumptions and requirements. But that is another topic for another time!