P-Values

A test statistic enables us to determine a p-value, which is the probability (ranging from 0 to 1) of observing sample data as extreme (different) or more extreme if the null hypothesis were true. The smaller the p-value, the more incompatible the data are with the null hypothesis.

A p-value ≤ 0.05 is an arbitrary but commonly used criterion for determining whether an observed difference is "statistically significant" or not. While it does not take into account the possible effects of bias or confounding, a p-value of ≤ 0.05 suggests that there is a 5% probability or less that the observed differences were the result of sampling error (chance). Furthermore, while it does not indicate certainty, it suggests that the null hypothesis is probably not true, so we reject the null hypothesis and accept the alternative hypothesis if the p-value is less than or equal to 0.05. The 0.05 criterion is also called the "alpha level," indicating the probability of incorrectly rejecting the null hypothesis.

A p-value > 0.05 would be interpreted by many as "not statistically significant," meaning that there was not sufficiently strong evidence to reject the null hypothesis and conclude that the groups are different. This does not mean that the groups are the same. If the evidence for a difference is weak (not statistically significant), we fail to reject the null, but we never "accept the null," i.e., we cannot conclude that they are the same – only that there is insufficient evidence to conclude that they are different.

While commonly used, p-values have fallen into some disfavor recently because the 0.05 criterion tends to devolve into a hard and fast rule that distinguishes "significantly different" from "not significantly different."

"A P value of 0.05 does not mean that there is a 95% chance that a given hypothesis is correct. Instead, it signifies that if the null hypothesis is true, and all other assumptions made are valid, there is a 5% chance of obtaining a result at least as extreme as the one observed. And a P value cannot indicate the importance of a finding; for instance, a drug can have a statistically significant effect on patients' blood glucose levels without having a therapeutic effect."

[Monya Baker: Statisticians issue warning over misuse of P values. Nature, March 7,2016]

Consider two studies evaluating the same hypothesis. Both studies find a small difference between the comparison groups, but for one study the p-value =0.06, and the authors conclude that the groups are "not significantly different"; the second study finds p=0.04, and the authors conclude that the groups are significantly different. Which is correct? Perhaps one solution is to simply report the p-value and let the reader come to their own conclusion.

Cautions Regarding Interpretation of P-Values

There is an unfortunate tendency for p-values to devolve into a conclusion of "significant" or "not significant" based on the p-value.
If an effect is small and clinically unimportant, the p-value can be "significant" if the sample size is large. Conversely, an effect can be large, but fail to meet the p ≤ 0.05 criterion if the sample size is small. Therefore, p-values cannot determine clinical significance or relevance.
When many possible associations are examined using a criterion of p ≤ 0.05, the probability of finding at least one that meets the this criterion increases in proportion to the number of associations that are tested.
Statistical significance does not take into account the evaluation of bias and confounding.
P-values do not imply causation.
P-values do not indicate whether the null or alternative hypothesis is really true.
P-values do not indicate the strength or direction of an effect, i.e., the "magnitude of effect."
P-values do not provide a way of assessing the precision of an estimated difference, and do not provide a range of possible values for the measure of effect that are compatible with the observed data.

Many researchers and practitioners now prefer confidence intervals, because they focus on the estimated effect size and how precise the estimate is rather than "Is there an effect?"

Also note that the meaning of "significant" depends on the audience. To scientists it mean "statistically significant," i.e., that p ≤ 0.05, but to a lay audience significant means "important."

What to Report

Measure of effect: the magnitude of the difference between the groups, e.g., difference in means, risk ratio, risk difference, odds ratio, etc.
P-value: The probability of observing differences this great or greater if the null hypothesis is true.
Confidence interval: a measure of the precision of the measure of effect. The confidence interval estimates the range of values compatible with the evidence.

Many public health researchers and practitioners prefer confidence intervals, since p-values give less information and are often interpreted inappropriately. When reporting results one should provide all three of these.

return to top | previous page | next page