2.1 Confidence Intervals for a Single Group
2.1.1 Confidence interval for a mean
The t.test( ) function performs one-sample and two-sample t-tests. In performing a one-sample t-test, this function also gives a confidence interval for the population mean.
> t.test(agewalk)
One Sample t-test
data: agewalk
t = 57.9405, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
10.74397 11.51603
sample estimates:
mean of x
11.13
The t.test( ) function can be used to conduct several types of t-tests, so it's a good idea to check the title in the output ('One Sample t-test') and the degrees of freedom (which for a CI for a mean are n-1) to be sure R is performing a one-sample t-test.
If we are interested in a confidence interval for the mean, we can ignore the t-value and p-value given by this procedure (which are discussed in Section 2.2) and focus on the 95% confidence interval. Here, the mean age at walking for the sample of n=50 infants was 11.13 months, with a 95% confidence interval of (10.74 , 11.52).
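If only the interval itself is needed, it can be pulled out of the object that t.test( ) returns, since the interval is stored in the 'conf.int' component. A brief sketch:
# save the test result and extract just the confidence interval
result <- t.test(agewalk)
result$conf.int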
R calculates a 95% confidence interval by default, but we can request other confidence levels using the 'conf.level' option. For example, the following requests the 90% confidence interval for the mean age at walking:
> t.test(agewalk,conf.level=.90)
One Sample t-test
data: agewalk
t = 57.9405, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
10.80795 11.45205
sample estimates:
mean of x
11.13
Note that R changes the label for the confidence interval (90 percent …) to reflect the specified confidence level.
2.1.2 Confidence interval for a proportion
The prop.test( ) command performs one- and two-sample tests for proportions, and gives a confidence interval for a proportion as part of the output. For example, in the Age at Walking data, 26/50 = .52 of the infants were girls. I first used the table( ) function to find these frequencies, then calculated the proportion, and then calculated the confidence interval using the prop.test( ) function.
NOTE: When using the prop.test( ) function, specifying 'correct=TRUE' tells R to apply a continuity correction when calculating the confidence interval (a slightly different formula intended for small samples), and specifying 'correct=FALSE' tells R to use the usual large-sample formula for the confidence interval. (Since categorical data are not normally distributed, the usual z-statistic formula for the confidence interval for a proportion is only reliable with large samples, with at least 5 events and 5 non-events in the sample.)
> table(sexmale)
sexmale
0 1
26 24
> 26/(26+24)
0.52
> prop.test(26,50,correct=FALSE)
1-sample proportions test without continuity correction
data: 26 out of 50, null probability 0.5
X-squared = 0.02, df = 1, p-value = 0.8875
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.3851174 0.6520286
sample estimates:
p
0.52
The prop.test( ) procedure can be used for several scenarios, so it's a good idea to check the labeling ('1-sample proportions test') to make sure we set things up correctly. The procedure also tests a hypothesis about the proportion (see Section 2.3), but we can focus on the 'p' of 0.52 (the sample proportion) and the confidence interval (0.385 , 0.652). This procedure uses a slightly different formula for the CI (a score-based interval) than the one presented in class, so the results of the two versions may differ slightly. With small samples, it is more appropriate to use the 'correct=TRUE' option, which applies the correction factor. There is also a 'binom.exact( )' function, which calculates a confidence interval for a proportion using an exact formula appropriate for small sample sizes.
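For comparison, a brief sketch of the small-sample alternatives mentioned above; base R's binom.test( ) gives an exact interval comparable to the binom.exact( ) function:
# CI with the continuity correction, for smaller samples
prop.test(26, 50, correct = TRUE)
# exact binomial confidence interval from base R
binom.test(26, 50)$conf.int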
Confidence Intervals for Comparing Means
2.1.3 Confidence interval for a difference in means, independent samples
The t.test( ) function can also be used to compare means between two samples; it gives the confidence interval for the difference in means from two independent samples as well as performing the independent-samples t-test. For the following syntax, the underlying data set includes the subjects from both samples, with one variable containing the outcome (dependent) variable and another indicating which group each subject is in. The outcome variable and grouping variable are identified using the 'outcome ~ group' syntax. For the usual pooled-variance version of the t-test:
> t.test(agewalk ~ group,var.equal=TRUE)
Two Sample t-test
data: agewalk by group
t = -3.1812, df = 48, p-value = 0.002571
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.9331253 -0.4358587
sample estimates:
mean in group 1 mean in group 2
10.72727 11.91176
The t.test( ) function can be used to conduct several types of t-tests, with several different data setups, and it's a good idea to check the title in the output ('Two Sample t-test') and the degrees of freedom (n1 + n2 – 2) to be sure R is performing the pooled-variance version of the two-sample t-test.
The t-statistic and p-value are discussed under Section 2.2.2. The 95% confidence interval that is given is for the difference in the means for the two groups (10.73 – 11.91 gives a difference in means of -1.18, and the CI that R gives is a CI for this difference in means). By default, R gives the 95% CI; the 'conf.level' option can be used to change the confidence level (see Section 2.1.1). Note that the output gives the means for each of the two groups being compared, but not the standard deviations or sample sizes. This additional information can be obtained using the tapply( ) function as described in Section 7 (in this example, tapply(agewalk,group,sd) will give standard deviations, and table(group) will give the n's).
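For example, a short sketch of getting the group-specific descriptive statistics to accompany the confidence interval:
# means, standard deviations, and sample sizes by study group
tapply(agewalk, group, mean)
tapply(agewalk, group, sd)
table(group)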
To calculate the confidence interval for the difference in means using the unequal variance formula:
> t.test(agewalk ~ group,var.equal=FALSE)
Welch Two Sample t-test
data: agewalk by group
t = -3.1434, df = 31.39, p-value = 0.003635
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.9526304 -0.4163536
sample estimates:
mean in group 1 mean in group 2
10.72727 11.91176
Again, it's good to check the title (Welch Two Sample t-test) and degrees of freedom (which often take on decimal values for the unequal variance version of the t-test) to be sure R is using the unequal variance formula for the confidence interval and t-test.
2.1.4 Confidence interval for a mean difference, paired samples
The t.test( ) function can also be used to calculate the confidence interval for a mean difference from a paired (pre-post) sample, and to perform the paired-sample t-test. In this situation, we need to specify the two data vectors representing the two variables to be compared. The following example compares the means of a pre-test score (score1) and a post-test score (score2) from a sample of 5 subjects. The t.test( ) function does not give the means of the two underlying variables (it does give the mean difference), so I used the mean( ) function to get this descriptive information. Generally, standard deviations and sample sizes would also be reported; these can be obtained from the sd( ) and length( ) functions.
> mean(score1)
[1] 20.2
> mean(score2)
[1] 21
> t.test(score1,score2,paired=TRUE)
Paired t-test
data: score1 and score2
t = -0.4377, df = 4, p-value = 0.6842
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-5.874139 4.274139
sample estimates:
mean of the differences
-0.8
The t.test( ) function can be used for several different types of t-tests, and so it's a good idea to check the title (Paired t-test) and degrees of freedom (n-1, where n is the number of pairs in the study) to be sure R is performing a paired sample analysis.
The confidence interval here is the confidence interval for the mean difference; the confidence interval should agree with the p-value in that the CI should not contain 0 when p<0.05, and the CI should contain 0 when p>0.05.
Note that the t.test( ) procedure gives the mean difference, but does not give the standard deviations of the difference or the standard deviations of the two variables. Generally, standard deviations are reported as part of the data summary for a comparison of means, and these standard deviations can be found using the 'sd( )' command.
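A brief sketch of the descriptive statistics that would usually accompany the paired analysis:
# standard deviations of each score and of the within-pair differences
sd(score1)
sd(score2)
sd(score1 - score2)
# number of pairs
length(score1)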
Confidence Intervals for Comparing Frequencies
2.1.5 Confidence interval for the difference in proportions, independent samples
The prop.test( ) command also performs a two-sample test for proportions, and gives a confidence interval for the difference in proportions as part of the output. The example below uses data from the Age at Walking example, comparing the proportion of infants walking by 1 year in the exercise group (group=1) and the control group (group=2). The table( ) command is used to find the number of infants walking by 1 year in each study group, and the proportion walking can be calculated from these frequencies. The prop.test( ) command will calculate a confidence interval for the difference between two proportions; for the two-sample situation, first enter a vector giving the number of successes in each of the two groups (using the c( ) function to create the vector), and then a vector giving the number of subjects in each of the two groups. To use the usual large-sample formula in calculating the confidence interval, include the 'correct=FALSE' option to turn off the small-sample correction factor in the calculation (although in this example, with only 17 subjects in the control group, the small-sample version of the confidence interval might be more appropriate).
> table(by1year,group)
group
by1year 1 2
0 5 9
1 28 8
> 28/33
0.848
> 8/17
0.470
> prop.test(c(28,8),c(33,17),correct=FALSE)
2-sample test for equality of proportions without continuity
correction
data: c(28, 8) out of c(33, 17)
X-squared = 7.9478, df = 1, p-value = 0.004815
alternative hypothesis: two.sided
95 percent confidence interval:
0.1109476 0.6448456
sample estimates:
prop 1 prop 2
0.8484848 0.4705882
Warning message:
In prop.test(c(28, 8), c(33, 17), correct = FALSE) :
Chi-squared approximation may be incorrect
The prop.test( ) command does several different analyses, and it's a good idea to check the title to make sure R is comparing two groups ('2-sample test for equality…'). The procedure also gives the results of a chi-square test comparing the two proportions (see Section 2.5), but here we are interested in the confidence interval and the proportions in each study group. For this example, 84.8% of the exercise group and 47.1% of the control group were walking by 1 year. The difference in these two proportions is 84.8 – 47.1 = 37.7 percentage points, and the 95% CI for this difference is (11.1% , 64.5%). Since this interval does not contain 0, we are 95% confident that more infants walk by 1 year in the exercise group; we are 95% confident that the additional percentage of kids walking by 1 year is between 11.1% and 64.5%.
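Since prop.test( ) reports the two proportions but not their difference, the difference and the interval can be pulled from a saved result. A brief sketch:
# save the result, then compute the difference in proportions and show its CI
ptest <- prop.test(c(28, 8), c(33, 17), correct = FALSE)
ptest$estimate[1] - ptest$estimate[2]   # about 0.848 - 0.471 = 0.378
ptest$conf.int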
2.1.6 Confidence interval for a risk ratio
Epidemiologic analyses are available through 'epitools', an add-on package to R. To use the epitools functions, you must first do a one-time installation. In R, click on the 'Packages' menu, then 'Install Package(s)', then select a download site (from the US), then select the epitools package. This will install the add-on package onto your computer. To use the package, you must also load it into R: click on the 'Packages' menu, then 'Load Package', then select epitools. While you only need to install the package once onto your computer, you will need to load the package into R each time you want to use it.
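The same installation and loading can also be done from the R command line. A brief sketch:
# one-time installation of the epitools package
install.packages("epitools")
# load the package in each R session where it is needed
library(epitools)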
The data layout matters for calculating RRs. For the riskratio( ) function from epitools, data should be set up in the following format:
|         | No Disease | Disease |
|---------|------------|---------|
| Control |            |         |
| Exposed |            |         |
This data layout corresponds to the usual 0/1 coding for the exposure and disease variables, but is slightly different than the layout traditionally used in the Introductory Epidemiology class (so be careful!). The riskratio( ) command calculates the RR of disease for those in the exposed group relative to the control group.
2.1.6.1 Confidence interval for a RR from a per-subject data set
Using the Age at Walking example, I'll find the relative risk of being a late walker (walking at 12 months or older) for those in the non-exercise group compared to those in the exercise group. I first created two 0/1 dichotomous variables (see Section 1.4.2 on creating new variables) to reflect the RR of interest: NoExercise is coded 1 for those in the non-exercise control group and 0 for those in the exercise group; LateWalker is coded 1 for those walking at 12 months or later and 0 for those walking before 12 months. With the variables defined in this manner, the table is oriented correctly for the RR of interest. I first printed the 2x2 table as a check, then used the riskratio.wald( ) function to calculate the relative risk and the large-sample 95% confidence interval.
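As a sketch of how these two variables might be created (assuming the agewalk and group variables from the earlier sections, with group coded 1 for the exercise group and 2 for the control group):
# 1 = non-exercise (control) group, 0 = exercise group
NoExercise <- ifelse(group == 2, 1, 0)
# 1 = walked at 12 months or later, 0 = walked before 12 months
LateWalker <- ifelse(agewalk >= 12, 1, 0)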
> table(NoExercise,LateWalker)
LateWalker
NoExercise 0 1
0 28 5
1 8 9
> riskratio.wald(NoExercise,LateWalker)
$data
Outcome
Predictor 0 1 Total
0 28 5 33
1 8 9 17
Total 36 14 50
$measure
risk ratio with 95% C.I.
Predictor estimate lower upper
0 1.000000 NA NA
1 3.494118 1.387688 8.797984
$p.value
two-sided
Predictor midp.exact fisher.exact chi.square
0 NA NA NA
1 0.008000253 0.007949207 0.004814519
$correction
[1] FALSE
attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
Warning message:
In chisq.test(xx, correct = correction) :
Chi-squared approximation may be incorrect
The RR here is 3.49 ( (9/17) / (5/33) ), with a 95% CI of (1.39 , 8.80). There are several versions of a CI for a relative risk: 'riskratio.wald( )' requests the standard normal-approximation formula, while 'riskratio.small( )' uses a correction to the CI for small samples (the 'Warning message' that R gave in the above example, that the 'Chi-squared approximation may be incorrect', is a small-sample-size warning). The generic 'riskratio( )' function also has a 'method' option for choosing among these versions.
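Because riskratio.wald( ) returns its results as a list (the components shown above, such as $measure and $p.value), individual pieces can be extracted if only the estimate and CI are needed. A brief sketch:
# save the result and pull out the estimate, confidence limits, and p-values
rr <- riskratio.wald(NoExercise, LateWalker)
rr$measure    # risk ratio with lower and upper 95% confidence limits
rr$p.value    # mid-p exact, Fisher exact, and chi-square p-values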
2.1.6.2 Inputting counts from a 2x2 table into R for calculation of a RR
Cell counts from a 2x2 table (or larger tables) can also be entered directly into R for analysis (RR, OR, or chi-square analysis). For example, the following table presents data on adverse side effects for patients undergoing robot-assisted vs. traditional surgery:
|                | No Side Effect | Side Effect | Total |
|----------------|----------------|-------------|-------|
| Traditional    | 5169           | 111         | 5280  |
| Robot-assisted | 3355           | 165         | 3520  |
The rate of side effects was 2.1% (111/5280) vs. 4.7% (165/3520) for those undergoing traditional vs. robot-assisted surgery. Table orientation matters for the RR (see Section 2.1.6.1), and this table is set up to find the RR of a side effect, for those undergoing robot-assisted compared to traditional surgery.
The matrix(c( ), nrow=, ncol= ) command can be used to enter cell counts from a table directly into R. By default, R fills a matrix created with matrix( ) column by column, so the counts must be entered by column – the counts for the first column followed by the counts for the second column. The 'nrow=' and 'ncol=' arguments specify the dimensions of the table (here, 2 rows and 2 columns). The following commands enter and save the above table as 'sideeffects', print the table as a check to be sure it is oriented correctly, and then find the RR and 95% CI:
> sideeffects <- matrix(c(5169,3355,111,165),nrow=2,ncol=2)
> sideeffects
[,1] [,2]
[1,] 5169 111
[2,] 3355 165
> riskratio.wald(sideeffects)
$data
Outcome
Predictor Disease1 Disease2 Total
Exposed1 5169 111 5280
Exposed2 3355 165 3520
Total 8524 276 8800
$measure
risk ratio with 95% C.I.
Predictor estimate lower upper
Exposed1 1.00000 NA NA
Exposed2 2.22973 1.759603 2.825463
$p.value
two-sided
Predictor midp.exact fisher.exact chi.square
Exposed1 NA NA NA
Exposed2 1.857736e-11 2.056211e-11 9.338045e-12
$correction
[1] FALSE
attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
Those given robot-assisted surgery had 2.23 times the risk of an adverse side effect compared to those given traditional surgery; we are 95% confident that the true RR is between 1.76 and 2.83.
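If it is more natural to read the counts off the table row by row, the 'byrow=TRUE' option of matrix( ) can be used instead. A sketch that builds the same table:
# counts entered row by row: traditional row first, then robot-assisted
sideeffects <- matrix(c(5169, 111, 3355, 165), nrow = 2, ncol = 2, byrow = TRUE)
sideeffects   # print as a check that the table is oriented correctly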
2.1.7 Confidence interval for an odds ratio
The epitools add-on package also has a function to calculate odds ratios and confidence intervals for odds ratios. You must first load the epitools package into R (see Section 2.1.6). Orientation of the table matters when calculating the OR, and the orientation described above for the relative risk also applies for the odds ratio. The oddsratio.wald( ) command can be used with a per-subject data set or can be used to find the OR and CI from summarized cell counts entered directly into R (see the matrix( ) command described in Section 2.1.6.2).
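For example, a brief sketch applying oddsratio.wald( ) to the 'sideeffects' matrix entered in Section 2.1.6.2 (assuming the epitools package has been loaded):
# OR and 95% CI for a side effect, robot-assisted vs. traditional surgery
oddsratio.wald(sideeffects)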
Calculating the odds ratio ( (9/8) / (5/28) = 6.3 ) and 95% CI for late walkers (see the example in 2.1.6 above), for non-exercisers vs. exercisers in the Age at Walking example:
> oddsratio.wald(NoExercise,LateWalker)
$data
Outcome
Predictor 0 1 Total
0 28 5 33
1 8 9 17
Total 36 14 50
$measure
odds ratio with 95% C.I.
Predictor estimate lower upper
0 1.0 NA NA
1 6.3 1.639283 24.2118
$p.value
two-sided
Predictor midp.exact fisher.exact chi.square
0 NA NA NA
1 0.008000253 0.007949207 0.004814519
$correction
[1] FALSE
attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
Warning message:
In chisq.test(xx, correct = correction) :
Chi-squared approximation may be incorrect
The 'oddsratio.wald( )' function gives the usual estimate for the odds ratio, with OR=6.3 and a 95% CI of (1.64 , 24.21); 'oddsratio.small( )' uses a correction for small sample size in calculating the CI.