Module 7 - Comparing Continuous Outcomes
When trying to accurately estimate the strength of association between an exposure and an outcome, there are several sources of error that must be considered:
Random errors are errors that arise from sampling. Just by chance, the estimates from samples may not accurately reflect the association in the population from which the samples were drawn. In this module we will discuss how to evaluate random error when dealing with continuous outcomes. The statistical significance of these comparisons is often evaluated by computing a confidence interval or a p-value.
We will consider the relevance of p-values for population health science research, and we will also consider the limitations of using p-values to determine "statistical significance" when comparing groups. We will specifically address the common use of t-tests for comparing a continuous outcome between two groups, for example, to determine whether people who are obese have higher mean blood pressure than people who are not obese.
Essential Questions
After successfully completing this section, the student will be able to:
Our general strategy is to specify a null and alternative hypothesis and then select and calculate the appropriate test statistic, which we will use to determine the p-value, i.e., the probability of finding differences this great or greater due to sampling error (chance). If the p-value is very small, it indicates a low probability of observing differences this large as a result of sampling error, i.e., the null hypothesis is probably not true, so we will reject the null hypothesis and accept the alternative.
For continuous outcomes, there are three fundamental comparisons to make:

1. Comparing the mean of a single sample to a known or historical mean (one-sample t-test):
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
2. Comparing the means of two independent samples (two independent sample t-test):
$$t = \frac{\bar{x}_1 - \bar{x}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
3. Comparing the means of two dependent (paired) samples (paired t-test):
$$t = \frac{\bar{x}_d}{s_d / \sqrt{n}}$$

where $\bar{x}$ is the sample mean, $\mu_0$ is the known or historical mean, $s$ is the sample standard deviation, $n$ is the sample size (the number of pairs for the paired test), $S_p$ is the pooled standard deviation of the two independent samples, and $\bar{x}_d$ and $s_d$ are the mean and standard deviation of the within-pair differences.
(There is also a procedure called analysis of variance (ANOVA) for comparing the means among more than two groups, but we will not address ANOVA in this course.)
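Each of these comparisons can be carried out with R's t.test() function. The sketch below shows the three forms; the variable names are illustrative placeholders, not from a real data set.

> t.test(x, mu = 100)                        # one sample vs. a known/historical mean
> t.test(outcome ~ group, var.equal = TRUE)  # two independent samples (equal variances)
> t.test(after, before, paired = TRUE)       # two dependent (paired) samples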
Notice that all three of these tests generate a test statistic that takes into account:

- the magnitude of the difference between the means being compared,
- the variability in the data (the standard deviation), and
- the sample size.
A test statistic enables us to determine a p-value, which is the probability (ranging from 0 to 1) of observing sample data as extreme (different) or more extreme if the null hypothesis were true. The smaller the p-value, the more incompatible the data are with the null hypothesis.
A p-value ≤ 0.05 is an arbitrary but commonly used criterion for determining whether an observed difference is "statistically significant" or not. While it does not take into account the possible effects of bias or confounding, a p-value of ≤ 0.05 suggests that there is a 5% probability or less that the observed differences were the result of sampling error (chance). A p-value less than or equal to 0.05 does not indicate that the groups are definitely different; it suggests that the null hypothesis is probably not true, so we reject the null hypothesis and accept the alternative hypothesis, i.e., that the groups probably differ. The 0.05 criterion is also called the "alpha level," indicating the probability of incorrectly rejecting the null hypothesis.
A p-value > 0.05 would be interpreted by many as "not statistically significant," meaning that there was not sufficiently strong evidence to reject the null hypothesis and conclude that the groups are different. This does not mean that the groups are the same. If the evidence for a difference is weak (not statistically significant), we fail to reject the null, but we never "accept the null," i.e., we cannot conclude that they are the same – only that there is insufficient evidence to conclude that they are different.
While commonly used, p-values have fallen into some disfavor recently because the 0.05 criterion tends to devolve into a hard and fast rule that distinguishes "significantly different" from "not significantly different."
"A P value of 0.05 does not mean that there is a 95% chance that a given hypothesis is correct. Instead, it signifies that if the null hypothesis is true, and all other assumptions made are valid, there is a 5% chance of obtaining a result at least as extreme as the one observed. And a P value cannot indicate the importance of a finding; for instance, a drug can have a statistically significant effect on patients' blood glucose levels without having a therapeutic effect."
[Monya Baker: Statisticians issue warning over misuse of P values. Nature, March 7, 2016]
Consider two studies evaluating the same hypothesis. Both studies find a small difference between the comparison groups, but for one study the p-value =0.06, and the authors conclude that the groups are "not significantly different"; the second study finds p=0.04, and the authors conclude that the groups are significantly different. Which is correct? Perhaps one solution is to simply report the p-value and let the reader come to their own conclusion.
Cautions Regarding Interpretation of P-Values
Many researchers and practitioners now prefer confidence intervals, because they focus on the estimated effect size and how precise the estimate is rather than "Is there an effect?"
Also note that the meaning of "significant" depends on the audience. To scientists it means "statistically significant," i.e., that p ≤ 0.05, but to a lay audience significant means "important."
Many public health researchers and practitioners prefer confidence intervals, since p-values give less information and are often interpreted inappropriately. When reporting results, one should provide all three: the estimated difference between groups, its confidence interval, and the p-value.
Example: IQ Among Children Born Prematurely

Investigators wanted to determine whether children who were born very prematurely have poorer cognitive ability than children born at full term. To investigate, they enrolled 100 children who had been born prematurely and measured their IQ in order to compare the sample mean to a normative IQ of 100. In essence, the normative IQ is a historical or external comparison mean (μ0). In the sample of 100 children who had been born prematurely, the data for IQ were as follows: X̄ = 95.8, SD = 17.5. The investigators wish to use a one-tailed test of significance.
A research hypothesis that states that two groups differ without specifying direction, i.e., which is greater, is a two-tailed hypothesis. In contrast, a research hypothesis that specifies direction, e.g., that the mean in a sample group will be less than the mean in a historic comparison group is a one-tailed hypothesis. Similarly, a hypothesis that a mean in a sample group will be greater than the mean in a historic comparison group is also a one-tailed hypothesis.
Suppose we conduct a test of significance to compare two groups (call them A and B) using large samples. The test statistic could be either a t score or a Z score, depending on the test we choose, but if the sample is very large the t or Z scores will be similar. Suppose also that we have specified an "alpha" level of 0.05, i.e., an error rate of 5% for concluding that the groups differ when they really don't. In other words, we will use p≤0.05 as the criterion for statistical significance.
Two-tailed test: The first figure below shows that with a two-tailed test in which we acknowledge that one group's mean could be either above or below the other, the alpha error rate has to be split into the upper and lower tails, i.e., with half of our alpha (0.025) in each tail. Therefore, we need to achieve a test statistic that is either less than -1.96 or greater than +1.96.
One-tailed test (lower tail): In the middle figure the hypothesis is that group A has a mean less than group B, perhaps because it is unreasonable to think the mean IQ in group A would be greater than that in group B. If so, all of the 5% alpha is in the lower tail, and we only need to achieve a test statistic less than -1.645 to achieve "statistical significance."
One-tailed test (upper tail): The third image shows a one-tailed test in which the hypothesis is that group A has a mean value greater than that of group B, so all of the alpha is in the upper tail, meaning that we need a test statistic greater than +1.645 to achieve statistical significance.
Clearly, the two-tailed test is more conservative because, regardless of direction, the test statistic has to be more than 1.96 units away from the null. The vast majority of tests that are reported are two-tailed tests. However, there are occasionally situations in which a one-tailed test hypothesis can be justified.
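These cutoffs come from the standard normal distribution and can be verified in R:

> qnorm(c(0.025, 0.975))   # two-tailed cutoffs: -1.96 and +1.96
> qnorm(0.05)              # one-tailed (lower tail) cutoff: -1.645
> qnorm(0.95)              # one-tailed (upper tail) cutoff: +1.645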
A one-tailed test could be justified in the study examining whether children who had been born very prematurely have lower IQ scores than children who had a normal, full term gestation, since there is no reason to believe that those born prematurely would have higher IQs.
First, we set up the hypotheses:
Null hypothesis: H0: μprem = 100 (i.e., that there is no difference or association)
Children born prematurely have mean IQs that are not different from those of the general population (μ=100)
Alternative hypothesis (i.e., the research hypothesis): HA: μprem < 100
Children born prematurely have lower mean IQ than the general population (μ<100).
The one-sample t statistic is

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{95.8 - 100}{17.5 / \sqrt{100}} = \frac{-4.2}{1.75} = -2.4$$

with df = n - 1 = 99. We can look up the corresponding p-value, or we can use R to compute the probability.
> pt(-2.4,99)
[1] 0.009132283
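The same computation can also be done directly from the summary statistics; a minimal sketch (the variable names are illustrative):

> xbar <- 95.8; mu0 <- 100; s <- 17.5; n <- 100   # summary statistics from the example
> t_stat <- (xbar - mu0) / (s / sqrt(n))          # t = -2.4
> pt(t_stat, df = n - 1)                          # one-tailed p-value = 0.009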
Since p=0.009, we reject the null hypothesis and accept the alternative hypothesis. We conclude that children born prematurely have lower mean IQ scores than children in the general population who had a full term gestation.
What if we had performed a two-tailed test of hypothesis on the same data?
The null hypothesis would be the same, but the alternative hypothesis would be:
Alternative hypothesis (research hypothesis): HA: μprem ≠ 100
Children born prematurely have IQ scores that are different (either lower or higher) from children from the general population who had full term gestation.
The calculation of the t statistic would also be the same, but the p-value would be different.
> 2*pt(-2.4,99)
[1] 0.01826457
The probability is the area under the t distribution (with 99 degrees of freedom) that is either lower than -2.40 or greater than +2.40. So the probability is 0.009 + 0.009 = 0.018, i.e., a probability of 0.009 in each of the lower and upper tails. We still conclude that children born prematurely have a significantly lower mean IQ than the general population (mean=95.8, s=17.5, p=0.018).
When performing two-tailed tests, direction is not specified in the hypothesis, but one should report the direction in any report, publication, or presentation, e.g., "Children who had been born prematurely had lower mean IQ scores than in the general population."
Student's t-test: The t-statistic was developed and published in 1908 by William Gosset, a chemist/statistician working for the Guinness brewery in Dublin. Guinness employees were not allowed to publish under their own names, so Gosset published under the pseudonym "Student."
In the example for IQ tests above we were given the mean IQ and standard deviation for the children who had been born prematurely. However, suppose we were given an Excel spreadsheet with the raw data listed in a column for "iq"?
In Excel we could save this data as a .CSV file using the "Save as" function. We could then import the data in the .CSV file into R and analyze the data as follows:
[Note that R defaults to performing the more conservative two-tailed test unless a one-tailed test is specified as we will describe below.]
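A minimal sketch of the import step (the file name "iq_data.csv" is a hypothetical placeholder):

> iqdata <- read.csv("iq_data.csv")   # read the .CSV file into a data frame
> iq <- iqdata$iq                     # extract the column named "iq"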
> t.test(iq,mu=100)
One Sample t-test
data: iq
t = -2.3801, df = 99, p-value = 0.01922
alternative hypothesis: true mean is not equal to 100
95 percent confidence interval:
92.35365 99.30635
sample estimates:
mean of x
95.83
In order to perform a one-tailed test, you need to specify the alternative hypothesis. For example:
> t.test(iq,mu=100, alternative="less")
or
> t.test(iq,mu=100, alternative="greater")
Hypothesis tests do not provide certainty, only an indication of the strength of the evidence. In essence, alpha = 0.05 is an error rate of 5% for rejecting the null hypothesis and accepting the alternative hypothesis when the null hypothesis is actually true. This is referred to as a Type I error. In contrast, a Type II error occurs when there really is a difference, but we fail to reject the null hypothesis.
| Truth | H0 Not Rejected (Insufficient Evidence of a Difference) | H0 Rejected (Difference) |
|---|---|---|
| H0 true | Correct | Type I error |
| H0 false | Type II error | Correct |
Type I Error: Concluding that there is a difference when there isn't
Type II Error: Concluding no difference when there really is one.
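The 5% Type I error rate can be illustrated with a small simulation; here is a minimal sketch in which both groups are drawn from the same population, so the null hypothesis is true by construction:

> set.seed(1)                                                      # for reproducibility
> pvals <- replicate(10000, t.test(rnorm(30), rnorm(30))$p.value)  # 10,000 tests under a true null
> mean(pvals <= 0.05)                                              # proportion "significant": approximately 0.05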
Suppose investigators want to assess the effectiveness of a new drug in lowering cholesterol. They conduct a small clinical trial in which patients are randomized to receive the new drug or a placebo, and total cholesterol is measured after six weeks on the assigned treatment. The table below shows the results.
| | Sample Size (n) | Mean Cholesterol (6 wk) | SD |
|---|---|---|---|
| New drug | 15 | 195.9 | 28.7 |
| Placebo | 15 | 227.4 | 30.3 |
Is there sufficiently strong evidence of a difference in mean cholesterol for patients on the new drug compared to patients receiving the placebo?
Note that SD is used to characterize individual variability within each group, although we will use SE when conducting a hypothesis test to compare the means.
H0: μ1 = μ2
H1: μ1 ≠ μ2
Since we are comparing the means in two independent groups (i.e., different patients in the two treatment groups), we will use the two independent sample t-test (aka, unpaired t-test) shown below. "Sp" is the pooled standard deviation for the two groups.
$$t = \frac{\bar{x}_1 - \bar{x}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

where

$$S_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$
Use of this unpaired t-test assumes that the variances of the two groups are more or less equal. Recall that variance is the square of the standard deviation, and note that $s_1^2$ and $s_2^2$ are the variances of the two samples; both are included in the calculation of Sp. We can test the assumption of equal variance by computing the ratio of the two variances: if the ratio is in the range of 0.5 to 2.0, the assumption is adequately met. If it is not, a different equation must be used (the Welch t-test), but if we are using R, there is a simple way to make this adjustment, as you will see.
In the example we are currently using, the ratio of the variances is 28.7² / 30.3² = 823.7 / 918.1 = 0.90, so the assumption of equal variances is met.
Next, we need to compute the pooled standard deviation:

$$S_p = \sqrt{\frac{(15-1)(28.7)^2 + (15-1)(30.3)^2}{15 + 15 - 2}} = \sqrt{\frac{24{,}384.9}{28}} = 29.5$$

So Sp = 29.5 will be plugged into the equation for the test statistic, as shown in the next step.
The test statistic is

$$t = \frac{\bar{x}_1 - \bar{x}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{195.9 - 227.4}{29.5\sqrt{\frac{1}{15} + \frac{1}{15}}} = \frac{-31.5}{10.77} = -2.92, \qquad df = n_1 + n_2 - 2 = 28$$
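As a check, here is a minimal R sketch of the same computation from the summary statistics (the variable names are illustrative):

> n1 <- 15; xbar1 <- 195.9; s1 <- 28.7                  # new drug
> n2 <- 15; xbar2 <- 227.4; s2 <- 30.3                  # placebo
> Sp <- sqrt(((n1-1)*s1^2 + (n2-1)*s2^2) / (n1+n2-2))   # pooled SD = 29.5
> t_stat <- (xbar1 - xbar2) / (Sp * sqrt(1/n1 + 1/n2))  # t = -2.92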
We can use R to compute the two-tailed p-value:
> 2*pt(-2.92,28)
[1] 0.006840069
The p-value is 0.007, so there is sufficient evidence to reject the null hypothesis.
Conclusion:
Mean cholesterol was statistically significantly lower in patients on the new drug than in patients on placebo (195.9 ± SD 28.7 vs. 227.4 ± SD 30.3; p=0.007).
In order to test whether the mean age of people who drank alcohol differed from that of non-drinkers, the following code was used in R:
> t.test(Age~Drink,var.equal=TRUE)
Two Sample t-test
data: Age by Drink
t = 2.4983, df = 2189, p-value = 0.01255
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.140054 1.1624532
sample estimates:
mean in group No mean in group Yes
25.54535 24.89410
In the example above, the two means are 25.54535 and 24.89410, so the difference in means = 0.65125. Notice that R gives the 95% confidence interval for the difference in the means of the two groups (0.140054, 1.1624532). The null value for the difference in means is 0, and if the 95% confidence interval for the difference does not contain the null value, then the p-value must be < 0.05 for a two-tailed test. Therefore, the observation that the 95% confidence interval does not contain the null value of 0 is consistent with the p-value of 0.01255.
Example: Long-term Developmental Outcomes in Infants with Iron Deficiency
Lozoff and colleagues compared developmental outcomes in children who had been anemic in infancy to outcomes in children who had not. Physical development was assessed by computing the mean gross motor composite score, and mental development was assessed by measuring the mean IQ score in each group. Statistical significance was evaluated using the two independent sample t-test. Some results are shown in the table below.
| | Gross Motor | Verbal IQ |
|---|---|---|
| Children w/ iron deficiency (n=30) | 52.2 ± 13.0 | 101.4 ± 13.2 |
| Children w/o iron deficiency (n=133) | 58.8 ± 12.5 | 102.9 ± 12.4 |
| Difference in means (95% CI) | -6.6 (-11.6, -1.6) | -1.5 (-6.5, 3.5) |
| p-value | 0.010 | 0.556 |
The difference in means for gross motor scores was -6.6 with a 95% confidence interval from -11.6 to -1.6. The null value of 0 difference is not included in the confidence interval, indicating that the p-value must be <0.05, and as you can see it is 0.010. However, for verbal IQ the mean difference was -1.5 with a 95% confidence interval from -6.5 to 3.5. This confidence interval does include the null value of 0, indicating that the p-value must be >0.05; the p-value for verbal IQ is 0.556, indicating that the difference is not statistically significant.
If the sample variances do not meet the equal variance assumption, R can easily compute this using the Welch t-test by simply changing "var.equal=TRUE" to "var.equal=FALSE" as shown below.
> t.test(Age~Drink,var.equal=FALSE)
The third application of a t-test that we will consider is for two dependent (paired or matched) samples. This can be applied in either of two types of comparisons: (1) measurements taken on the same subjects before and after an intervention, or (2) measurements taken on pairs of subjects who are matched on other characteristics.
Example 1: Does an intervention increase knowledge about HIV transmission?
Early in the HIV epidemic, there was poor knowledge of HIV transmission risks among health care staff. A short training was developed to improve knowledge and attitudes around HIV disease. Was the training effective in improving knowledge?
Table - Mean (± SD) knowledge scores, pre- and post-intervention, n=15

| Pre-Intervention | Post-Intervention | Change |
|---|---|---|
| 18.3 ± 3.8 | 21.9 ± 4.4 | 3.53 ± 3.93 |
The raw data for this comparison is shown in the next table.
| Subject | Kscore1 | Kscore2 | difference |
|---|---|---|---|
| 1 | 17 | 22 | 5 |
| 2 | 17 | 21 | 4 |
| 3 | 15 | 21 | 6 |
| 4 | 19 | 26 | 7 |
| 5 | 18 | 20 | 2 |
| 6 | 14 | 14 | 0 |
| 7 | 27 | 31 | 4 |
| 8 | 20 | 18 | -2 |
| 9 | 12 | 22 | 10 |
| 10 | 21 | 20 | -1 |
| 11 | 20 | 27 | 7 |
| 12 | 24 | 23 | -1 |
| 13 | 17 | 15 | -2 |
| 14 | 17 | 24 | 7 |
| 15 | 17 | 24 | 7 |

mean difference = 3.533333
sd of differences = 3.925497
p-value = 0.003634
The strategy is to calculate the pre-/post- difference in knowledge score for each person and determine whether the mean difference differs from 0.
First, establish the null and alternative hypotheses:

Null hypothesis: H0: μd = 0 (the mean pre-/post- difference is 0)

Alternative hypothesis: HA: μd ≠ 0

Then compute the test statistic for paired or matched samples:

$$t = \frac{\bar{x}_d}{s_d / \sqrt{n}}, \qquad df = n - 1$$

where $\bar{x}_d$ and $s_d$ are the mean and standard deviation of the within-pair differences, and n is the number of pairs.
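Before using R's built-in paired test, note that the statistic can be computed directly from the difference column in the table above; a minimal sketch:

> diffs <- c(5, 4, 6, 7, 2, 0, 4, -2, 10, -1, 7, -1, -2, 7, 7)  # differences from the table
> t_stat <- mean(diffs) / (sd(diffs) / sqrt(length(diffs)))     # t = 3.49
> 2 * pt(-abs(t_stat), df = length(diffs) - 1)                  # two-tailed p = 0.0036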
For the example above we can compute the p-value using R. First, we compute the means.
> mean(Kscore1)
[1] 18.33333
> mean(Kscore2)
[1] 21.86667
Then we perform the t-test for two dependent samples.
> t.test(Kscore2,Kscore1,paired=TRUE)
Paired t-test
data: Kscore2 and Kscore1
t = 3.4861, df = 14, p-value = 0.003634
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.359466 5.707201
sample estimates:
mean of the differences
3.533333
The null hypothesis is that the mean change in knowledge scores from before to after is 0. However, the analysis shows that the mean difference is 3.53 with a 95% confidence interval that ranges from 1.36 to 5.71. Since the confidence interval does not include the null value of 0, the p-value must be < 0.05, and in fact it is 0.003634.
The report by Lozoff et al. found the following results:
| | Gross Motor | Verbal IQ |
|---|---|---|
| Children w/ iron deficiency (n=30) | 52.2 ± 13.0 | 101.4 ± 13.2 |
| Children w/o iron deficiency (n=133) | 58.8 ± 12.5 | 102.9 ± 12.4 |
| Difference in means (95% CI) | -6.6 (-11.6, -1.6) | -1.5 (-6.5, 3.5) |
| p-value | 0.010 | 0.556 |
One good way of reporting this would be:
Children with a history of anemia had statistically significantly lower gross motor function scores than comparison children who did not have anemia (difference in means: -6.6 points; 95% confidence interval: -11.6, -1.6; p=0.010). There was no statistically significant difference in verbal IQ scores (difference in means: -1.5 points; 95% confidence interval: -6.5, 3.5; p=0.556).