Confidence Intervals for a Single Mean or Proportion
Authors:
Lisa Sullivan, PhD, Professor of Biostatistics, Boston University School of Public Health
Wayne LaMorte, MD, PhD, MPH, Professor of Epidemiology, Boston University School of Public Health
In previous modules we have stressed the importance of recognizing that samples provide us with estimates of various health-related parameters in a population. Estimates based on samples are, of course, subject to sampling error (random error), and it is important to evaluate the precision of our estimates. In public health this is most commonly done by computing a confidence interval. One can compute confidence intervals for all types of estimates, but this short module will provide the conceptual background for computing confidence intervals and will then focus on the computation and interpretation of confidence intervals for a mean or a proportion in a single group. This is particularly relevant for the analysis and presentation of descriptive studies, such as a case series, in which one is simply trying to accurately report characteristics of a single group. Later modules will address the computation and interpretation of confidence intervals for estimates from analytical studies (e.g., risk ratios, odds ratios, etc.) in which one is conducting hypothesis testing.
Key Questions:
How do I gauge the precision of an estimated mean or an estimated proportion in a single sample?
How do I interpret and calculate a confidence interval for an estimate in a single sample?
After successfully completing this unit, the student will be able to:
The goal of exploratory or descriptive studies is not to formally compare groups in order to test for associations between exposures and health outcomes, but to estimate and summarize the characteristics of a particular population of interest. Typical examples would be a case series of humans who had been diagnosed and treated for bird flu or a cross-sectional study in a community for the purpose of better understanding the current health status and potential challenges for the future. The variables being estimated would logically include both continuous variables (e.g., age, systolic and diastolic blood pressure, body mass index, serum cholesterol levels, household income, etc.) and dichotomous (Yes/No) variables (e.g., raising poultry, getting a flu shot, etc.).
There are two types of estimates for each population parameter: the point estimate and confidence interval (CI) estimate. For both continuous variables (e.g., population mean) and dichotomous variables (e.g., population proportion) one first computes the point estimate from a sample. Recall that sample means and sample proportions are unbiased estimates of the corresponding population parameters.
For both continuous and dichotomous variables, the confidence interval estimate (CI) is a range of likely values for the population parameter based on:
Strictly speaking a 95% confidence interval means that if we were to take 100 different samples and compute a 95% confidence interval for each sample, then approximately 95 of the 100 confidence intervals will contain the true mean value (μ).
In practice, however, we select one random sample and generate one confidence interval, which may or may not contain the true mean. The observed interval may over- or underestimate μ. Consequently, the 95% CI is the likely range of the true, unknown parameter.
Key Concept: A confidence interval does not reflect the variability in the unknown parameter. Rather, it reflects the amount of random error in the sample and provides a range of values that are likely to include the unknown parameter. Another way of thinking about a confidence interval is that it is the range of likely values of the parameter with a specified level of confidence (which is similar to a probability).
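The repeated-sampling interpretation can be checked with a small simulation. The sketch below is only an illustration: the population mean, standard deviation, sample size, and number of simulated samples are made-up values, not figures from this module.

```r
# Simulate repeated sampling from a population with a known mean.
# mu, sigma, n and n_sims are illustrative assumptions.
set.seed(1)
mu <- 120
sigma <- 15
n <- 50
n_sims <- 1000

covers <- replicate(n_sims, {
  x  <- rnorm(n, mean = mu, sd = sigma)  # draw one random sample
  ci <- t.test(x)$conf.int               # 95% confidence interval for that sample
  ci[1] <= mu && mu <= ci[2]             # TRUE if the interval contains mu
})

coverage <- mean(covers)  # fraction of the intervals that contain mu
print(coverage)
```

The printed fraction should be close to 0.95, although any single interval either contains μ or does not.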

Suppose we want to generate a 95% confidence interval estimate for an unknown population mean. The Central Limit Theorem states that, for large samples, the distribution of the sample means is approximately normally distributed with a mean:
$$\mu_{\bar{X}} = \mu$$
and a standard deviation (also called the standard error):
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
[NOTE: There is often confusion regarding standard deviations and standard errors. Standard deviations describe variability in a measure among experimental units (e.g., among participants in a clinical sample). Standard errors represent variability in estimates of a mean or proportion; i.e., if one had taken many samples to estimate a mean or proportion, the standard error is the estimated standard deviation of the sampling means or sampling proportions. Another way to think of this is that standard deviations describe the variability in a population while standard errors represent variability in the sampling means or proportions.]
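The Central Limit Theorem also states that
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$
follows the standard normal distribution. For the standard normal distribution there is a 95% probability that a standard normal variable, Z, will fall between -1.96 and 1.96. In other words,
$$P(-1.96 < Z < 1.96) = 0.95$$
We can substitute the equation for Z from the central limit theorem into this equation in order to derive an expression for computing the 95% confidence interval for the population mean, as follows:
$$P\left(\bar{X} - 1.96 \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96 \frac{\sigma}{\sqrt{n}}\right) = 0.95$$
The step-by-step derivation of this equation is given at the end of this module.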
So, the general form of a confidence interval is:
point estimate ± margin of error
or
point estimate ± Z (SE)
where Z is the value from the standard normal distribution for the selected confidence level (e.g., for a 95% confidence level, Z=1.96) and SE is the standard error of the point estimate. In practice, we often do not know the value of the population standard deviation (σ). However, if the sample size is large (n > 30), then the sample standard deviation (s) can be used to estimate the population standard deviation.
Key Concept: In health-related publications a 95% confidence interval is most often used, but this is an arbitrary value, and other confidence levels can be selected. Note that for a given sample, the 99% confidence interval would be wider than the 95% confidence interval, because it allows one to be more confident that the unknown population parameter is contained within the interval.

Table - Z-Scores for Commonly Used Confidence Intervals

| Desired Confidence Interval | Z Score |
|---|---|
| 90% | 1.645 |
| 95% | 1.96 |
| 99% | 2.576 |
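These Z values do not have to be read from a printed table. As a brief sketch, R's qnorm() returns the standard normal quantile, and a two-sided confidence level leaves half of the remaining probability in each tail (the variable names below are chosen only for illustration):

```r
# Two-sided critical values: the upper-tail area is (1 - confidence level) / 2.
conf_levels <- c(0.90, 0.95, 0.99)
z_values <- qnorm(1 - (1 - conf_levels) / 2)

print(round(z_values, 3))
```

Rounded to three decimals, the results agree with the table (1.645, 1.960, 2.576).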
Key Concept: A point estimate for a population parameter is the "best" single number estimate of that parameter. A confidence interval is a range of values for the population parameter with a level of confidence attached (e.g., 95% confidence that the range or interval contains the parameter).

For Z: 95% of z-values are between -1.96 and 1.96
For X: 95% of individuals have X within ±1.96 sd of µ
For X̄: 95% of samples have X̄ within ±1.96 SE of µ
Suppose we wish to estimate the mean systolic blood pressure, body mass index, total cholesterol level or white blood cell count in a single target population. We select a sample and compute descriptive statistics including the sample size (n), the sample mean, and the sample standard deviation (s). The formulas for confidence intervals for the population mean depend on the sample size and are given below.
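For large samples (n > 30): $\bar{X} \pm z \frac{s}{\sqrt{n}}$. Use the Z table for the standard normal distribution.
For small samples (n < 30): $\bar{X} \pm t \frac{s}{\sqrt{n}}$. Use the t table with df = n-1.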
Example: Descriptive statistics on variables measured in a sample of n=3,539 participants attending the 7th examination of the offspring in the Framingham Heart Study are shown below.
| Characteristic | n | Sample Mean | Standard Deviation (s) |
|---|---|---|---|
| Systolic Blood Pressure | 3,534 | 127.3 | 19.0 |
| Diastolic Blood Pressure | 3,532 | 74.0 | 9.9 |
| Total Serum Cholesterol | 3,310 | 200.3 | 36.8 |
| Weight | 3,506 | 174.4 | 38.7 |
| Height | 3,326 | 65.957 | 3.749 |
| Body Mass Index | 3,326 | 28.15 | 5.32 |
Because the sample is large, we can generate a 95% confidence interval for systolic blood pressure using the following formula:
$$\bar{X} \pm z \frac{s}{\sqrt{n}}$$
The Z value for 95% confidence is Z=1.96. [Note: Both the table of Z-scores and the table of t-scores can also be accessed from the "Other Resources" on the right side of the page.]
Substituting the sample statistics and the Z value for 95% confidence, we have
$$127.3 \pm 1.96 \frac{19.0}{\sqrt{3534}} = 127.3 \pm 0.63$$
So the confidence interval is
(126.7,127.9)
Interpretation: A point estimate for the true mean systolic blood pressure in the population is 127.3, and we are 95% confident that the true mean is between 126.7 and 127.9. The margin of error is very small here because of the large sample size.
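The same arithmetic can be scripted when only summary statistics are available. The helper below, ci_mean_z(), is a hypothetical function written for this illustration (it is not a built-in R function); it applies the large-sample formula to the systolic blood pressure row of the table.

```r
# Large-sample (Z-based) confidence interval from summary statistics.
# ci_mean_z is a hypothetical helper defined here for illustration.
ci_mean_z <- function(xbar, s, n, conf_level = 0.95) {
  z  <- qnorm(1 - (1 - conf_level) / 2)  # critical value, e.g. 1.96 for 95%
  se <- s / sqrt(n)                      # standard error of the mean
  c(lower = xbar - z * se, upper = xbar + z * se)
}

# Systolic blood pressure: mean 127.3, s = 19.0, n = 3,534
sbp_ci <- ci_mean_z(xbar = 127.3, s = 19.0, n = 3534)
print(round(sbp_ci, 1))
```

Rounded to one decimal, this reproduces the interval (126.7, 127.9). Passing a different conf_level changes only the critical value.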
What is the 90% confidence interval for BMI? (Note that Z=1.645 to reflect the 90% confidence level.)
Answer
With smaller samples (n < 30) the Central Limit Theorem does not apply, and another distribution called the t distribution must be used. The t distribution is similar to the standard normal distribution but takes a slightly different shape depending on the sample size. In a sense, one could think of the t distribution as a family of distributions for smaller samples. Instead of "Z" values, there are "t" values for confidence intervals which are larger for smaller samples, producing larger margins of error, because small samples are less precise. t values are listed by degrees of freedom (df) which take into account the sample size. Just as with large samples, the t distribution assumes that the outcome of interest is approximately normally distributed.
A table of t values can be accessed from the "Other Resources" on the left side of the page. Scroll down to the bottom of the table and note that as the sample size becomes larger, the t values become closer to the z value listed at the bottom of the table. Consequently, one can always use a t score, even with a large sample.
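The convergence of t values toward the Z value can be seen directly with R's qt() function. In this sketch the degrees of freedom are arbitrary values chosen for illustration.

```r
# 95% two-sided critical values from the t distribution for increasing df.
# The df values are arbitrary illustrative choices.
df <- c(5, 9, 30, 100, 1000)
t_values <- qt(0.975, df)
z_value  <- qnorm(0.975)

print(round(data.frame(df = df, t = t_values), 3))
print(round(z_value, 3))
```

Each t value exceeds the Z value, and the gap shrinks as the degrees of freedom grow, which is why using t with a large sample gives almost the same interval as using Z.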
The table below shows data on a subsample of n=10 participants in the 7th examination of the Framingham Offspring Study.
| Characteristic | n | Sample Mean | Standard Deviation (s) |
|---|---|---|---|
| Systolic Blood Pressure | 10 | 121.2 | 11.1 |
| Diastolic Blood Pressure | 10 | 71.3 | 7.2 |
| Total Serum Cholesterol | 10 | 202.3 | 37.7 |
| Weight | 10 | 176.0 | 33.0 |
| Height | 10 | 67.175 | 4.205 |
| Body Mass Index | 10 | 27.26 | 3.10 |
Suppose we compute a 95% confidence interval for the true mean systolic blood pressure using data in the subsample. Because the sample size is small, we must now use the confidence interval formula that involves t rather than Z.
The sample size is n=10, so the degrees of freedom (df) = n-1 = 9. The t value for 95% confidence with df = 9 is t = 2.262.
Substituting the sample statistics and the t value for 95% confidence, we have the following expression:
$$121.2 \pm 2.262 \frac{11.1}{\sqrt{10}} = 121.2 \pm 7.94$$
Interpretation: Based on this sample of size n=10, our best estimate of the true mean systolic blood pressure in the population is 121.2. Based on this sample, we are 95% confident that the true mean systolic blood pressure in the population is between 113.3 and 129.1. Note that the margin of error is larger here primarily due to the small sample size.
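As a sketch, the small-sample calculation can also be done from the summary statistics alone. The function ci_mean_t() is a hypothetical helper defined here for illustration; it takes the critical value from qt() with n-1 degrees of freedom.

```r
# Small-sample (t-based) confidence interval from summary statistics.
# ci_mean_t is a hypothetical helper defined here for illustration.
ci_mean_t <- function(xbar, s, n, conf_level = 0.95) {
  t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)  # t value with n-1 df
  se     <- s / sqrt(n)                               # standard error of the mean
  c(lower = xbar - t_crit * se, upper = xbar + t_crit * se)
}

# Systolic blood pressure in the subsample: mean 121.2, s = 11.1, n = 10
sbp_small_ci <- ci_mean_t(xbar = 121.2, s = 11.1, n = 10)
print(round(sbp_small_ci, 1))
```

Rounded to one decimal, this gives the interval (113.3, 129.1) reported in the interpretation.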
Using the subsample in the table above, what is the 90% confidence interval for BMI?
Answer
To illustrate let's first take a small subset of the adult respondents to the Weymouth Health Survey and create a small data set consisting of their ages.
> ages <- c(56,30,34,77,55,67,45,65,44,47,49,60,63,64,55,67,88)
We can compute both the point estimate (mean) and the 95% confidence interval using the t.test() command.
> t.test(ages)
One Sample t-test
data: ages
t = 15.9265, df = 16, p-value = 3.1e-11
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
49.25999 64.38707
sample estimates:
mean of x
56.82353
It's a good idea to check the title in the output ('One Sample t-test') and the degrees of freedom (which for a CI for a mean are n-1) to be sure R is performing a one-sample t-test.
If we are interested in a confidence interval for the mean, we can ignore the t-value and p-value, and focus on the 95% confidence interval. Here, the mean age for the sample of n=17 (degrees of freedom are n-1=16) was 56.82353 with a 95% confidence interval of (49.25999, 64.38707).
R calculates a 95% confidence interval by default, but we can request other confidence levels using the 'conf.level' option. For example, the following requests the 90% confidence interval for "ages."
> t.test(ages,conf.level=.90)
One Sample t-test
data: ages
t = 15.9265, df = 16, p-value = 3.1e-11
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
50.59445 63.05261
sample estimates:
mean of x
56.82353
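To connect the t.test() output to the formula for small samples, the 95% interval can be rebuilt by hand from the mean, the sample standard deviation, and the t value with n-1 degrees of freedom. This is a minimal sketch using the same ages vector; the variable names are chosen for illustration.

```r
# Rebuild the 95% confidence interval for the mean age by hand.
ages <- c(56,30,34,77,55,67,45,65,44,47,49,60,63,64,55,67,88)

n      <- length(ages)
xbar   <- mean(ages)             # point estimate
se     <- sd(ages) / sqrt(n)     # standard error of the mean
t_crit <- qt(0.975, df = n - 1)  # t value for 95% confidence, df = 16

manual_ci <- c(xbar - t_crit * se, xbar + t_crit * se)
print(manual_ci)

# Interval reported by t.test(), for comparison
print(as.numeric(t.test(ages)$conf.int))
```

Both lines should print the same limits, matching the (49.25999, 64.38707) interval shown in the output above.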
A town-wide health survey was conducted in Weymouth, MA in 2002. Some of the data can be found in Weymouth_Adult1.csv, which is available under Other Resources at the left. Import this data file into R, and compute the mean and 95% confidence interval for the variable "weight," which is the weight of the adult household respondent in pounds, and interpret the result in a sentence.
Answer
Suppose we wish to estimate the proportion of people with diabetes in a population or the proportion of people with hypertension or obesity. These diagnoses are defined by specific levels of laboratory tests and measurements of blood pressure and body mass index, respectively. Subjects are defined as having these diagnoses or not, based on the definitions. When the outcome of interest is dichotomous like this, the record for each member of the sample indicates having the condition or characteristic of interest or not. Recall that for dichotomous outcomes the investigator defines one of the outcomes a "success" and the other a failure. The sample size is denoted by n, and we let x denote the number of "successes" in the sample.
For example, if we wish to estimate the proportion of people with diabetes in a population, we consider a diagnosis of diabetes as a "success" (i.e., an individual who has the outcome of interest), and we consider lack of diagnosis of diabetes as a "failure." In this example, x represents the number of people with a diagnosis of diabetes in the sample. The sample proportion is p̂ (called "p-hat"), and it is computed by taking the ratio of the number of successes in the sample to the sample size, that is:
$$\hat{p} = \frac{x}{n}$$
If there are more than 5 successes and more than 5 failures, then the confidence interval can be computed with this formula:
$$\hat{p} \pm z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
The point estimate for the population proportion is the sample proportion, and the margin of error is the product of the Z value for the desired confidence level (e.g., Z=1.96 for 95% confidence) and the standard error of the point estimate. In other words, the standard error of the point estimate is:
$$SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
This formula is appropriate for samples with at least 5 successes and at least 5 failures in the sample. This was a condition for the Central Limit Theorem for binomial outcomes. If there are fewer than 5 successes or failures then alternative procedures, called exact methods, must be used to estimate the population proportion.
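For a sample with very few successes, one commonly used exact method is the Clopper-Pearson interval, which R's binom.test() reports. The counts below (3 successes out of 40) are invented purely for illustration and do not come from any study in this module.

```r
# Exact (Clopper-Pearson) confidence interval for a proportion.
# x and n are made-up illustrative counts with fewer than 5 successes.
x <- 3
n <- 40

exact_ci <- binom.test(x, n, conf.level = 0.95)$conf.int
print(x / n)                  # point estimate
print(as.numeric(exact_ci))   # exact 95% confidence limits
```

Unlike the normal-approximation formula, the exact interval is generally not symmetric around the sample proportion, and its lower limit cannot fall below 0.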
Example: During the 7th examination of the Offspring cohort in the Framingham Heart Study there were 1,219 participants being treated for hypertension and 2,313 who were not on treatment. If we call treatment a "success", then x=1,219 and n=3,532. The sample proportion is:
$$\hat{p} = \frac{x}{n} = \frac{1219}{3532} = 0.345$$
This is the point estimate, i.e., our best estimate of the proportion of the population on treatment for hypertension is 34.5%. The sample is large, so the confidence interval can be computed using the formula:
$$\hat{p} \pm z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Substituting our values we get
$$0.345 \pm 1.96 \sqrt{\frac{0.345(1-0.345)}{3532}}$$
which is
$$0.345 \pm 0.016$$
So, the 95% confidence interval is (0.329, 0.361).
Thus we are 95% confident that the true proportion of persons on antihypertensive medication is between 32.9% and 36.1%.
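The hypertension example can be reproduced with a few lines of R. The function ci_prop_wald() is a hypothetical helper written for this illustration; it implements the normal-approximation formula for a proportion from the counts x and n.

```r
# Normal-approximation confidence interval for a single proportion.
# ci_prop_wald is a hypothetical helper defined here for illustration.
ci_prop_wald <- function(x, n, conf_level = 0.95) {
  p_hat <- x / n                            # sample proportion
  z     <- qnorm(1 - (1 - conf_level) / 2)  # critical value
  se    <- sqrt(p_hat * (1 - p_hat) / n)    # standard error of p-hat
  c(estimate = p_hat, lower = p_hat - z * se, upper = p_hat + z * se)
}

# 1,219 of 3,532 participants on treatment for hypertension
htn_ci <- ci_prop_wald(x = 1219, n = 3532)
print(round(htn_ci, 3))
```

Rounded to three decimals, this gives 0.345 with limits 0.329 and 0.361, as in the worked example.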
Key Concept: There are several types of estimates in a single population that are proportions for which one can compute confidence intervals using these methods. These include:

The table below, from the 5th examination of the Framingham Offspring cohort, shows the number of men and women found with or without cardiovascular disease (CVD). Estimate the prevalence of CVD in men using a 95% confidence interval.

| | Free of CVD | Prevalent CVD | Total |
|---|---|---|---|
| Men | 1,548 | 244 | 1,792 |
| Women | 1,872 | 135 | 2,007 |
| Total | 3,420 | 379 | 3,799 |
Answer
In the Weymouth, MA health survey there were 333 adult respondents who reported a history of diabetes out of 3573 respondents (333/3573=0.0932 or 9.32%).
We can use the Weymouth health survey data to get the counts of those with or without a history of diabetes using the table() function:
> table(hx_dm)
hx_dm
   0    1
3240  333
Then find the denominator (sum of those with or without diabetes).
> 3240+333
[1] 3573
To get the 95% confidence interval, use the prop.test() function.
> prop.test(333,3573,correct=FALSE)
1-sample proportions test without continuity correction
data: 333 out of 3573, null probability 0.5
X-squared = 2365.141, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.0840988 0.1031730
sample estimates:
p
0.09319899
So, the point estimate (proportion with diabetes in the sample) was 9.3%, and with 95% confidence the true proportion lies between 0.084 and 0.103, or 8.4% to 10.3%.
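Note that the interval from prop.test() is not computed with the normal-approximation formula given earlier in this module; prop.test() reports a Wilson score interval, so the limits differ slightly from a hand calculation. The sketch below compares the two for the Weymouth diabetes counts (the variable names are chosen for illustration).

```r
# Compare the hand-computed normal-approximation interval with prop.test().
x <- 333
n <- 3573
p_hat <- x / n
z <- qnorm(0.975)

# Normal-approximation (Wald) interval: p-hat +/- z * SE
wald_ci <- p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)

# Wilson score interval, as reported by prop.test() without continuity correction
score_ci <- as.numeric(prop.test(x, n, correct = FALSE)$conf.int)

print(round(wald_ci, 4))
print(round(score_ci, 4))
```

With a sample this large the two intervals agree to about three decimal places; the difference becomes more noticeable with small samples or proportions near 0 or 1.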
This module focuses on computing the point estimates and 95% confidence limits for estimating a population mean or a population proportion from a sample. Point estimates are the best single-valued estimates of an unknown population parameter. Because these can vary from sample to sample, most investigations start with a point estimate and build in a margin of error. The margin of error quantifies sampling variability and includes a value from the Z or t distribution reflecting the selected confidence level as well as the standard error of the point estimate. It is important to remember that the confidence interval contains a range of likely values for the unknown population parameter; a range of values for the population parameter consistent with the data. It is also possible, although the likelihood is small, that the confidence interval does not contain the true population parameter. This is important to remember in interpreting intervals. The precision of a confidence interval is defined by the margin of error (or the width of the interval). A larger margin of error (wider interval) is indicative of a less precise estimate.
What is the 90% confidence interval for BMI? (Note that Z=1.645 to reflect the 90% confidence level.)
$$28.15 \pm 1.645 \frac{5.32}{\sqrt{3326}} = 28.15 \pm 0.15$$
So, the 90% confidence interval is (28.00, 28.30).
=======================================================
Answer to BMI Problem on page 3
Question: Using the subsample in the table above, what is the 90% confidence interval for BMI?
Solution: Once again, the sample size was 10, so we go to the t-table and use the row with 10 minus 1 degrees of freedom (so 9 degrees of freedom). But now you want a 90% confidence interval, so you would use the column with a two-tailed probability of 0.10. Looking down to the row for 9 degrees of freedom, you get a t-value of 1.833.
Once again you will use this equation:
$$\bar{X} \pm t \frac{s}{\sqrt{n}}$$
Plugging in the values for this problem we get the following expression:
$$27.26 \pm 1.833 \frac{3.10}{\sqrt{10}} = 27.26 \pm 1.80$$
Therefore the 90% confidence interval ranges from 25.46 to 29.06.
=======================================================
The table below, from the 5th examination of the Framingham Offspring cohort, shows the number of men and women found with or without cardiovascular disease (CVD). Estimate the prevalence of CVD in men using a 95% confidence interval.

| | Free of CVD | Prevalent CVD | Total |
|---|---|---|---|
| Men | 1,548 | 244 | 1,792 |
| Women | 1,872 | 135 | 2,007 |
| Total | 3,420 | 379 | 3,799 |
The prevalence of cardiovascular disease (CVD) among men is 244/1792=0.1362. The sample size is large and satisfies the requirement that the number of successes is greater than 5 and the number of failures is greater than 5. Therefore, the following formula can be used again.
$$\hat{p} \pm z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Substituting, we get
$$0.1362 \pm 1.96 \sqrt{\frac{0.1362(1-0.1362)}{1792}} = 0.1362 \pm 0.0159$$
So, the 95% confidence interval is (0.120, 0.152).
With 95% confidence the prevalence of cardiovascular disease in men is between 12.0% and 15.2%.
==================================================================================================
> t.test(weight)
One Sample t-test
data: weight
t = 204.6426, df = 3324, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
167.5318 170.7731
sample estimates:
mean of x
169.1525
Interpretation:
The mean weight of adult household respondents in Weymouth was 169 pounds. With 95% confidence, the true mean is in the range of 167.5 to 170.8 pounds.
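From the central limit theorem we know that, for large samples, the sample mean is approximately normally distributed with mean
$$\mu_{\bar{X}} = \mu$$
and the sample mean has a standard deviation (also called the standard error):
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
It also states that
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$
follows the standard normal distribution. For the standard normal distribution there is a 95% probability that a standard normal variable, Z, will fall between -1.96 and 1.96. In other words,
$$P(-1.96 < Z < 1.96) = 0.95$$
We can substitute the equation for Z from the central limit theorem into this equation in order to derive an expression for computing the 95% confidence interval for the population mean, as follows:
$$P\left(-1.96 < \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} < 1.96\right) = 0.95$$
Using algebra, we can rework this inequality such that the mean (μ) is the middle term, as shown below.
$$P\left(-1.96 \frac{\sigma}{\sqrt{n}} < \bar{X} - \mu < 1.96 \frac{\sigma}{\sqrt{n}}\right) = 0.95$$
then
$$P\left(-\bar{X} - 1.96 \frac{\sigma}{\sqrt{n}} < -\mu < -\bar{X} + 1.96 \frac{\sigma}{\sqrt{n}}\right) = 0.95$$
and finally
$$P\left(\bar{X} - 1.96 \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96 \frac{\sigma}{\sqrt{n}}\right) = 0.95$$
Therefore, the 95% confidence interval for the population mean is
$$\bar{X} \pm 1.96 \frac{\sigma}{\sqrt{n}}$$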