Here I'm hypothetically examining my town and an adjacent town and making some comparisons based on samples that were taken from each of these populations.
In my town I found that the mean height of female adults was 64 inches compared to 68 inches in their town.
And last year the flu incidence in my town was about 20%, based on a sample, while it was about 10% in their town.
The key question here is "Are these differences likely to be real?" I also want to note that I've selected two types of estimates.
When we compare the height of female adults, we are comparing a measurement value, and when we're comparing the incidence of flu we are comparing a frequency, not a measurement.
So I want to address how we might go about comparing these things to answer this question of whether or not these differences are real.
First, let's focus on comparing the values of a measure, such as height or weight. Typically, we begin by stating or establishing a null hypothesis. To a certain extent, this concept of a null hypothesis seems like a backwards way of looking at things, and it sort of is, but what we frequently find is that it is usually easier to disprove something by gathering evidence that refutes it.
So, for example, suppose I believe that all crows are black. If I can find one crow that is white or brown or gray, then I have sufficient evidence to discard the notion that all crows are black. And science kind of works in an analogous fashion when testing hypotheses.
Suppose that I actually believe that doing an incidental appendectomy increases post-operative wound infections.
But I want to test that hypothesis, so when I test it, I start from the hypothesis that there is no difference - that is that people who have the incidental appendectomy will have the same risk of wound infections as people who don't.
It's just a starting place from which I can determine if there is strong evidence to reject the idea that they're the same.
So, I consider what I would expect to see in the data if there were no difference between the groups, and then I want to look at the actual data to compare the groups and see if my observations are compatible with that null hypothesis. So, in the example shown here we are focusing on measurement data, and we are comparing the means in two groups, but we acknowledge that those means reflect the average in groups of people, and there might be substantial variability. Now, in the upper illustration here I'm actually just trying to depict the null hypothesis.
If I wanted to test whether smokers and non-smokers weigh the same (on average) or differ, I would first establish this null hypothesis. I would say let's suppose that they have the same mean weight.
And in depicting this I have acknowledged that there's variability in weight in both of these groups - smokers and non-smokers - but in the picture here I've drawn it so that they look similar. The distributions of the weights are very similar, and the mean values are the same.
That's just establishing our expectations if there were no difference. But then we're going to gather actual data. We are going to take samples.
We are not going to measure the whole population; we're going to take samples in order to estimate what the mean is in each group.
And then the question is whether there is sufficient evidence that the two groups differ - sufficient evidence so that I can say, well, it appears that my observations are not compatible with this upper notion that the two groups have the same mean, and therefore I have to conclude that they probably differ. So the key question when we're doing this kind of testing is "If the groups being compared really didn't differ, what would be the probability of observing differences this great or greater in a sample?"
This null hypothesis then establishes what we would expect to see if there were no difference between the groups, and we are going to compare those expectations under the null hypothesis to what we actually observed in the data collected from samples in order to determine whether our observations are consistent with "no difference."
And if there is sufficiently strong evidence that the observations are not consistent with the notion of "no difference," then we conclude that the groups probably are different.
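One way to see what the null hypothesis "expects" is to simulate it. The sketch below is only an illustration under assumed numbers: the population mean (170 lb), standard deviation (30 lb), and group size (25) are hypothetical, not values from this lecture. It repeatedly draws two samples from the very same population and records how far apart the two sample means land purely by chance.

```python
import random
import statistics

def simulate_null_differences(pop_mean, pop_sd, n_per_group, n_trials, seed=0):
    """Draw two samples from the SAME population many times and record
    the difference between the two sample means each time."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = []
    for _ in range(n_trials):
        group_a = [rng.gauss(pop_mean, pop_sd) for _ in range(n_per_group)]
        group_b = [rng.gauss(pop_mean, pop_sd) for _ in range(n_per_group)]
        diffs.append(statistics.mean(group_a) - statistics.mean(group_b))
    return diffs

# Hypothetical population: mean weight 170 lb, SD 30 lb, 25 people per group.
diffs = simulate_null_differences(170, 30, n_per_group=25, n_trials=2000)
print("average chance difference:", round(statistics.mean(diffs), 2))
print("spread of chance differences (SD):", round(statistics.stdev(diffs), 2))
```

Even though both groups come from one population, the sample means rarely match exactly. The spread of these chance differences is the yardstick against which an observed difference gets judged.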
So, while it is sort of a backwards reasoning, it actually works, and it is a very convenient way to set up a "straw dog" and test hypotheses and compare groups. Just to push this notion of sufficient evidence a little bit further, suppose I had taken samples.
And I'm showing four examples here. Let's focus first on the one at the upper left where I've got some variability in body weights, but the means are actually pretty close.
And there is a very, very large amount of overlap between the two distributions.
I would look at this and say, well, I didn't test the whole population; I just took samples, but based on the samples I'm not really convinced that these two groups have different average weights.
And, similarly, the one at the upper right shows means that are a little bit farther apart, but there is still quite a bit of overlap, and it would certainly depend on how large the samples were, but I'm not really convinced that these two groups really are different. It could be that just by the luck of the draw when I sampled people I got means that were a little bit different, and I got distributions that were a little bit different, but, in fact, I don't have sufficient evidence to conclude that the groups are different.
When we look at the illustration at the lower left, the means that I obtained from my samples are substantially different, and there is very little overlap - some overlap, but not a lot, and when I look at that, I'm thinking, hmmm, these observations are probably not consistent with the notion that there's no difference between the groups. And similarly, for the one on the lower right, there is a pretty sizable difference and not a lot of overlap. So I'm thinking that these are not compatible with the notion of no difference.
So in the lower two, I might conclude that the groups probably are different, but in the upper two I'm not so sure. They could be slightly different, but I really don't have sufficient evidence to draw that conclusion.
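The intuition about "how far apart" and "how much overlap" can be put into a single number: the difference between the sample means divided by its standard error. The sketch below computes that ratio (the Welch t statistic, used here as one common choice, not necessarily the one behind the illustrations) from summary statistics. All of the means, standard deviations, and sample sizes are hypothetical.

```python
import math

def standardized_difference(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Difference between two sample means divided by its standard error
    (the Welch t statistic). Larger absolute values mean the gap is big
    relative to what sampling variability alone would produce."""
    standard_error = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    return (mean_a - mean_b) / standard_error

# Hypothetical summaries (weights in pounds), not data from the lecture.
small_gap = standardized_difference(172, 30, 25, 170, 30, 25)
large_gap = standardized_difference(195, 30, 25, 170, 30, 25)
small_gap_big_samples = standardized_difference(172, 30, 2500, 170, 30, 2500)

print(round(small_gap, 2), round(large_gap, 2), round(small_gap_big_samples, 2))
```

With 25 people per group, a 2-pound gap is tiny relative to its standard error while a 25-pound gap is large. The same 2-pound gap becomes much more convincing with 2,500 people per group, which is why the conclusion depends on how large the samples were.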
What statistical tests are going to do is help us - guide us - in terms of determining whether we have sufficient evidence to conclude that the groups are probably different or not and whether or not we should throw out this straw dog - the null hypothesis - and conclude that, in fact, the groups probably are different.
And when we do statistical tests, we compute p-values; "p" stands for probability.
And the definition is that a p-value is simply the probability of seeing differences this big or bigger in samples if the groups in fact were not different, that is, if the null hypothesis were true. When we look at these two distributions here in the illustration, we're saying that if there really were no difference between the groups, then there would be a very low probability of observing sample differences this great, and therefore the null hypothesis - the notion that they're the same - is probably not correct. So we're going to reject that notion and conclude that the groups are probably different. There are many, many statistical tests.
They all compute p-values for us, but the particular test that is being used depends on study design, the type of measurement, and so forth. But we're not going to get into that.
We're just going to address the overall notion of hypothesis testing and interpreting p-values, because you're going to see those for all kinds of scientific studies.
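To make the definition of a p-value concrete without committing to any particular named test, here is a minimal sketch of a permutation approach. If the null hypothesis were true, the group labels would be interchangeable, so we shuffle the labels many times and count how often the shuffled difference in means is at least as big as the one observed. The function name and the weight data are hypothetical, chosen only for illustration.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Estimate the probability of a difference in means this big or bigger
    (in either direction) if the group labels were assigned at random."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_big = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign people to groups at random
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_big += 1
    return at_least_as_big / n_shuffles

# Hypothetical weights in pounds, invented for this example.
smokers = [150, 162, 158, 171, 149, 165, 155, 160]
non_smokers = [178, 185, 169, 190, 174, 181, 188, 176]
p = permutation_p_value(smokers, non_smokers)
print("estimated p-value:", p)
```

Because these two made-up samples barely overlap, very few random relabelings reproduce a gap as large as the observed one, so the estimated p-value comes out small. The estimate is approximate and can never be finer than one over the number of shuffles.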
Scientists generally use a criterion of 5%: a low p-value, that is, one less than or equal to 5%, means that it is unlikely that the differences we saw occurred just as a result of sampling variability. It is an arbitrary criterion, but most would say that, if the p-value is less than or equal to 0.05 (that is, 5%), then random error (sampling error) is an unlikely explanation for the observed differences. So, the difference would then be described as statistically significant. If the p-value is greater than 0.05, that doesn't mean that the groups are the same.
It just means that we do not have sufficient evidence to conclude that the groups are different. They might be different, but we have not gathered sufficient evidence to convince us of that.
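The decision rule just described is simple enough to write down directly. This is only a sketch of the conventional rule. The function name is hypothetical, and the default threshold of 0.05 is the customary criterion, not a law of nature.

```python
def interpret_p_value(p_value, alpha=0.05):
    """Apply the conventional decision rule to a p-value."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must be between 0 and 1")
    if p_value <= alpha:
        return "statistically significant: reject the null hypothesis"
    # A large p-value does NOT show that the groups are the same.
    return "not statistically significant: insufficient evidence of a difference"

for p in (0.01, 0.05, 0.08):
    print(p, "->", interpret_p_value(p))
```

Note the wording of the second outcome: it reports a lack of evidence, not proof that the groups are equal.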
Again, this 0.05 criterion (5%) is a very arbitrary criterion. One might see a comparison that has a p-value of 0.06 or 0.07, or 0.08.
A p-value of 0.08, for example, means that there was only an 8% chance of observing differences this big or bigger just as a result of sampling variability.
So, those might be considered borderline or suspicious, but they still don't meet the usual criterion of having sufficient evidence to conclude that the groups are probably different. And bear in mind that p-values also don't take into account other sources of error in the data such as bias and confounding.
There are limitations to using p-values, but they provide some guidance in evaluating the strength of the evidence when we take samples.
To conclude, if the p-value is less than or equal to 0.05, we say there is a low probability, i.e., it is unlikely, that the differences we have observed are just due to sampling error - chance.
So, if that is the case, we reject the null hypothesis and conclude that the groups probably are different.
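The same logic applies to the frequency comparison from the beginning, the flu incidence of about 20% versus about 10%. The sketch below uses a two-proportion z-test with a normal approximation, which is one common choice for comparing frequencies and is not a method named in this lecture. The sample sizes were never stated, so the counts below (100 people per town, then 50 per town) are pure assumptions made for illustration.

```python
import math

def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p-value for a difference between two proportions,
    using the pooled two-proportion z-test (normal approximation)."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)  # shared rate under the null
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))

# ASSUMED sample sizes: the lecture gives only the percentages.
p_with_100 = two_proportion_p_value(20, 100, 10, 100)  # 20% vs 10%, n = 100 each
p_with_50 = two_proportion_p_value(10, 50, 5, 50)      # 20% vs 10%, n = 50 each
print(round(p_with_100, 3), round(p_with_50, 3))
```

Under these assumed sample sizes, the identical 20% versus 10% gap falls just under the 0.05 criterion with 100 people per town and well above it with 50 per town, so whether the difference looks "real" depends heavily on how many people were sampled.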