Confidence Intervals for a Single Mean or Proportion

Authors:

Lisa Sullivan, PhD, Professor of Biostatistics, Boston University School of Public Health

Wayne LaMorte, MD, PhD, MPH, Professor of Epidemiology, Boston University School of Public Health

Introduction

In previous modules we have stressed the importance of recognizing that samples provide us with estimates of various health-related parameters in a population. Estimates based on samples are, of course, subject to sampling error (random error), and it is important to evaluate the precision of our estimates. In public health this is most commonly done by computing a confidence interval. One can compute confidence intervals for all types of estimates, but this short module will provide the conceptual background for computing confidence intervals and will then focus on the computation and interpretation of confidence intervals for a mean or a proportion in a single group. This is particularly relevant for the analysis and presentation of descriptive studies, such as a case series, in which one is simply trying to accurately report characteristics of a single group. Later modules will address the computation and interpretation of confidence intervals for estimates from analytical studies (e.g., risk ratios, odds ratios, etc.) in which one is conducting hypothesis testing.

Key Questions:

How do I gauge the precision of an estimated mean or an estimated proportion in a single sample?

How do I interpret and calculate a confidence interval for an estimate in a single sample?

Learning Objectives

After successfully completing this unit, the student will be able to:

• Explain what a confidence interval is.
• Interpret the confidence interval for a mean or a proportion from a single group.
• Use R to compute a confidence interval for the mean in a single group.
• Use R to compute a confidence interval for a proportion in a single group.

Estimating Population Parameters in a Single Group

The goal of exploratory or descriptive studies is not to formally compare groups in order to test for associations between exposures and health outcomes, but to estimate and summarize the characteristics of a particular population of interest. Typical examples would be a case series of humans who had been diagnosed and treated for bird flu or a cross-sectional study in a community for the purpose of better understanding the current health status and potential challenges for the future. The variables being estimated would logically include both continuous variables (e.g., age, systolic and diastolic blood pressure, body mass index, serum cholesterol levels, household income, etc.) and dichotomous (Yes/No) variables (e.g., raising poultry, getting a flu shot, etc.).

There are two types of estimates for each population parameter: the point estimate and confidence interval (CI) estimate. For both continuous variables (e.g., population mean) and dichotomous variables (e.g., population proportion) one first computes the point estimate from a sample. Recall that sample means and sample proportions are unbiased estimates of the corresponding population parameters.

Confidence Intervals

For both continuous and dichotomous variables, the confidence interval estimate (CI) is a range of likely values for the population parameter based on:

• the point estimate, e.g., the sample mean
• the investigator's desired level of confidence (most commonly 95%, but any level between 0-100% can be selected)
• and the sampling variability or the standard error of the point estimate.

Strictly speaking a 95% confidence interval means that if we were to take 100 different samples and compute a 95% confidence interval for each sample, then approximately 95 of the 100 confidence intervals will contain the true mean value (μ).

In practice, however, we select one random sample and generate one confidence interval, which may or may not contain the true mean. The observed interval may over- or underestimate μ. Consequently, the 95% CI is the likely range of the true, unknown parameter.
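
The repeated-sampling interpretation can be illustrated with a small simulation in R. This is only a sketch under stated assumptions: the population mean (100), standard deviation (15), sample size (50), and number of samples (1,000) are arbitrary values chosen for illustration and are not data from this module.

```r
# Simulated illustration of the repeated-sampling meaning of a 95% CI.
# All population values below are hypothetical.
set.seed(2024)
true_mean <- 100
true_sd   <- 15
n         <- 50
n_samples <- 1000

# For each simulated sample, compute a 95% CI and record whether it
# contains the true mean.
covers <- replicate(n_samples, {
  x  <- rnorm(n, mean = true_mean, sd = true_sd)
  ci <- t.test(x)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

# Proportion of intervals that contain the true mean (expected near 0.95)
coverage <- mean(covers)
coverage
```

Any single interval either contains the true mean or it does not; the 95% refers to the long-run proportion of such intervals that do.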

 Key Concept: A confidence interval does not reflect the variability in the unknown parameter. Rather, it reflects the amount of random error in the sample and provides a range of values that are likely to include the unknown parameter. Another way of thinking about a confidence interval is that it is the range of likely values of the parameter with a specified level of confidence (which is similar to a probability).

Background on Confidence Intervals for an Unknown Population Mean

Suppose we want to generate a 95% confidence interval estimate for an unknown population mean. The Central Limit Theorem states that, for large samples, the distribution of the sample means is approximately normal with a mean:

$$\mu_{\bar{x}} = \mu$$

and a standard deviation (also called the standard error):

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

[NOTE: There is often confusion regarding standard deviations and standard errors.  Standard deviations describe variability in a measure among experimental units (e.g., among participants in a clinical sample). Standard errors represent variability in estimates of a mean or proportion; i.e., if one had taken many samples to estimate a mean or proportion, the standard error is the estimated standard deviation of the sampling means or sampling proportions. Another way to think of this is that standard deviations describe the variability in a population while standard errors represent variability in the sampling means or proportions.]
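
The distinction in the note can be checked numerically. The sketch below, which uses hypothetical population values (mean 100, standard deviation 15) and a hypothetical sample size of 25, draws many samples and compares the standard deviation of the resulting sample means with σ/√n.

```r
# Standard deviation vs. standard error, using simulated (hypothetical) data.
set.seed(1)
sigma <- 15   # population standard deviation (assumed for illustration)
n     <- 25   # size of each sample

# Means of 5,000 independent samples of size n
sample_means <- replicate(5000, mean(rnorm(n, mean = 100, sd = sigma)))

sd_of_means    <- sd(sample_means)   # variability of the sample means
theoretical_se <- sigma / sqrt(n)    # standard error, sigma / sqrt(n) = 3

c(simulated = sd_of_means, theoretical = theoretical_se)
```

The individual observations vary with a standard deviation of 15, while the sample means vary with a standard deviation close to 3.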

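The Central Limit Theorem also states that

$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

For the standard normal distribution there is a 95% probability that a standard normal variable, Z, will fall between -1.96 and 1.96. In other words,

$$P(-1.96 < Z < 1.96) = 0.95$$

We can substitute the equation for Z from the central limit theorem into this equation in order to derive an expression for computing the 95% confidence interval for the population mean, as follows:

$$P\left(-1.96 < \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} < 1.96\right) = 0.95$$

which can be rearranged to give

$$P\left(\bar{x} - 1.96 \frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96 \frac{\sigma}{\sqrt{n}}\right) = 0.95$$

A step-by-step derivation of this equation is given at the end of this module.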

So, the general form of a confidence interval is:

point estimate ± margin of error

or

point estimate ± Z (SE)

where Z is the value from the standard normal distribution for the selected confidence level (e.g., for a 95% confidence level, Z=1.96) and SE is the standard error of the point estimate. In practice, we often do not know the value of the population standard deviation (σ). However, if the sample size is large (n ≥ 30), then the sample standard deviation (s) can be used to estimate the population standard deviation.

 Key Concept: In health-related publications a 95% confidence interval is most often used, but this is an arbitrary value, and other confidence levels can be selected. Note that for a given sample, the 99% confidence interval would be wider than the 95% confidence interval, because it allows one to be more confident that the unknown population parameter is contained within the interval.

Table - Z-Scores for Commonly Used Confidence Intervals

Desired Confidence Interval    Z Score
90%                            1.645
95%                            1.96
99%                            2.576
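
The Z scores in the table can be reproduced in R with qnorm(), the quantile function of the standard normal distribution. The variable names below are illustrative only.

```r
# Z scores for two-sided confidence intervals.
# For confidence level C, the Z score is the (1 - (1 - C)/2) quantile
# of the standard normal distribution.
conf_levels <- c(0.90, 0.95, 0.99)
z_scores    <- qnorm(1 - (1 - conf_levels) / 2)

round(z_scores, 3)   # approximately 1.645, 1.960, 2.576
```

Higher confidence levels give larger Z scores and therefore wider intervals for the same sample.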

 Key Concept: A point estimate for a population parameter is the "best" single number estimate of that parameter. A confidence interval is a range of values for the population parameter with a level of confidence attached (e.g., 95% confidence that the range or interval contains the parameter).

Three Ways to Think About the 95% Confidence Interval for a Mean

For Z? 95% of z-values are between -1.96 and 1.96

For X? 95% of individuals have X within ±1.96 sd of µ

For x̄? 95% of samples have x̄ within ±1.96 SE of µ

Calculating Confidence Intervals for a Mean from One Sample

Suppose we wish to estimate the mean systolic blood pressure, body mass index, total cholesterol level or white blood cell count in a single target population. We select a sample and compute descriptive statistics including the sample size (n), the sample mean, and the sample standard deviation (s). The formulas for confidence intervals for the population mean depend on the sample size and are given below.

Confidence Intervals for μ

• For n ≥ 30

$$\bar{x} \pm Z \frac{s}{\sqrt{n}}$$

Use the Z table for the standard normal distribution.

• For n < 30

$$\bar{x} \pm t \frac{s}{\sqrt{n}}$$

Use the t table with df=n-1

Example: Descriptive statistics on variables measured in a sample of n=3,539 participants attending the 7th examination of the offspring in the Framingham Heart Study are shown below.

Characteristic              n        Sample Mean    Standard Deviation (s)
Systolic Blood Pressure     3,534    127.3          19.0
Diastolic Blood Pressure    3,532    74.0           9.9
Total Serum Cholesterol     3,310    200.3          36.8
Weight                      3,506    174.4          38.7
Height                      3,326    65.957         3.749
Body Mass Index             3,326    28.15          5.32

Because the sample is large, we can generate a 95% confidence interval for systolic blood pressure using the following formula:

$$\bar{x} \pm Z \frac{s}{\sqrt{n}}$$

The Z value for 95% confidence is Z=1.96. [Note: Both the table of Z-scores and the table of t-scores can also be accessed from the "Other Resources" on the right side of the page.]

Substituting the sample statistics and the Z value for 95% confidence, we have

$$127.3 \pm 1.96 \frac{19.0}{\sqrt{3534}} = 127.3 \pm 0.63$$

So the confidence interval is

(126.7, 127.9)

Interpretation: A point estimate for the true mean systolic blood pressure in the population is 127.3, and we are 95% confident that the true mean is between 126.7 and 127.9. The margin of error is very small here because of the large sample size.
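
The same calculation can be done in R directly from the summary statistics in the table. This is a minimal sketch; the variable names are illustrative.

```r
# 95% CI for mean systolic blood pressure from summary statistics
n    <- 3534    # sample size
xbar <- 127.3   # sample mean
s    <- 19.0    # sample standard deviation

z  <- qnorm(0.975)   # about 1.96 for 95% confidence
se <- s / sqrt(n)    # standard error of the mean

ci <- xbar + c(-1, 1) * z * se
round(ci, 1)   # 126.7 127.9
```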

What is the 90% confidence interval for BMI? (Note that Z=1.645 to reflect the 90% confidence level.)

Confidence Interval Estimates for Smaller Samples

With smaller samples (n< 30) the Central Limit Theorem does not apply, and another distribution called the t distribution must be used. The t distribution is similar to the standard normal distribution but takes a slightly different shape depending on the sample size. In a sense, one could think of the t distribution as a family of distributions for smaller samples. Instead of "Z" values, there are "t" values for confidence intervals which are larger for smaller samples, producing larger margins of error, because small samples are less precise. t values are listed by degrees of freedom (df) which take into account the sample size. Just as with large samples, the t distribution assumes that the outcome of interest is approximately normally distributed.

A table of t values can be accessed from the "Other Resources" on the left side of the page. Scroll down to the bottom of the table and note that as the sample size becomes larger, the t values become closer to the z value listed at the bottom of the table. Consequently, one can always use a t score, even with a large sample.
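
The convergence of t values toward the Z value can also be seen in R using qt() and qnorm(). The degrees of freedom listed below are arbitrary choices for illustration.

```r
# t values for 95% confidence at increasing degrees of freedom,
# compared with the corresponding Z value
df       <- c(5, 9, 29, 100, 1000)
t_values <- qt(0.975, df)
z_value  <- qnorm(0.975)

data.frame(df = df, t = round(t_values, 3), z = round(z_value, 3))
```

The t value is always larger than the Z value, but the difference shrinks as the degrees of freedom grow.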

The table below shows data on a subsample of n=10 participants in the 7th examination of the Framingham Offspring Study.

Characteristic              n     Sample Mean    Standard Deviation (s)
Systolic Blood Pressure     10    121.2          11.1
Diastolic Blood Pressure    10    71.3           7.2
Total Serum Cholesterol     10    202.3          37.7
Weight                      10    176.0          33.0
Height                      10    67.175         4.205
Body Mass Index             10    27.26          3.10

Suppose we compute a 95% confidence interval for the true mean systolic blood pressure using data in the subsample. Because the sample size is small, we must now use the confidence interval formula that involves t rather than Z.

$$\bar{x} \pm t \frac{s}{\sqrt{n}}$$

The sample size is n=10, the degrees of freedom (df) = n-1 = 9. The t value for 95% confidence with df = 9 is t = 2.262.

Substituting the sample statistics and the t value for 95% confidence, we have the following expression:

$$121.2 \pm 2.262 \frac{11.1}{\sqrt{10}} = 121.2 \pm 7.94$$

Interpretation: Based on this sample of size n=10, our best estimate of the true mean systolic blood pressure in the population is 121.2. Based on this sample, we are 95% confident that the true mean systolic blood pressure in the population is between 113.3 and 129.1. Note that the margin of error is larger here primarily due to the small sample size.
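
For a small sample, the calculation in R is the same except that qt() supplies the t value in place of the Z value. The sketch below uses the subsample statistics for systolic blood pressure; the variable names are illustrative.

```r
# 95% CI for mean systolic blood pressure in the subsample (n = 10)
n    <- 10
xbar <- 121.2
s    <- 11.1

t_value <- qt(0.975, df = n - 1)   # about 2.262 for df = 9
se      <- s / sqrt(n)

ci <- xbar + c(-1, 1) * t_value * se
round(ci, 1)   # 113.3 129.1
```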

Using the subsample in the table above, what is the 90% confidence interval for BMI?

Computing the 95% Confidence Interval for the Mean in One Sample with R

To illustrate let's first take a small subset of the adult respondents to the Weymouth Health Survey and create a small data set consisting of their ages.

> ages <- c(56,30,34,77,55,67,45,65,44,47,49,60,63,64,55,67,88)

We can compute both the point estimate (mean) and the 95% confidence interval using the t.test() command.

> t.test(ages)

One Sample t-test

data: ages

t = 15.9265, df = 16, p-value = 3.1e-11

alternative hypothesis: true mean is not equal to 0

95 percent confidence interval:

49.25999 64.38707

sample estimates:

mean of x

56.82353

It's a good idea to check the title in the output ('One Sample t-test') and the degrees of freedom (which for a CI for a mean are n-1) to be sure R is performing a one-sample t-test.

If we are interested in a confidence interval for the mean, we can ignore the t-value and p-value, and focus on the 95% confidence interval. Here, the mean age for the sample of n=17 (degrees of freedom are n-1=16) was 56.82353 with a 95% confidence interval of (49.25999, 64.38707).

R calculates a 95% confidence interval by default, but we can request other confidence levels using the 'conf.level' option. For example, the following requests the 90% confidence interval for "ages."

> t.test(ages,conf.level=.90)

One Sample t-test

data: ages

t = 15.9265, df = 16, p-value = 3.1e-11

alternative hypothesis: true mean is not equal to 0

90 percent confidence interval:

50.59445 63.05261

sample estimates:

mean of x

56.82353
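
When only the estimate and interval are needed, they can be pulled out of the object returned by t.test() instead of being read from the printed output. The sketch below also recomputes the interval by hand with the t formula as a check; the names res and manual_ci are illustrative.

```r
ages <- c(56,30,34,77,55,67,45,65,44,47,49,60,63,64,55,67,88)

# Store the result and extract its components
res <- t.test(ages)
res$estimate   # sample mean
res$conf.int   # 95% confidence interval

# The same interval computed from the formula: mean +/- t * s / sqrt(n)
n <- length(ages)
manual_ci <- mean(ages) + c(-1, 1) * qt(0.975, df = n - 1) * sd(ages) / sqrt(n)
manual_ci
```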

A town-wide health survey was conducted in Weymouth, MA in 2002. Some of the data can be found in Weymouth_Adult1.csv, which is available under Other Resources at the left. Import this data file into R, compute the mean and 95% confidence interval for the variable "weight," which is the weight of the adult household respondent in pounds, and interpret the result in a sentence.

Confidence Interval for a Proportion in One Sample

Suppose we wish to estimate the proportion of people with diabetes in a population or the proportion of people with hypertension or obesity. These diagnoses are defined by specific levels of laboratory tests and measurements of blood pressure and body mass index, respectively. Subjects are defined as having these diagnoses or not, based on the definitions. When the outcome of interest is dichotomous like this, the record for each member of the sample indicates having the condition or characteristic of interest or not. Recall that for dichotomous outcomes the investigator defines one of the outcomes as a "success" and the other as a "failure." The sample size is denoted by n, and we let x denote the number of "successes" in the sample.

For example, if we wish to estimate the proportion of people with diabetes in a population, we consider a diagnosis of diabetes as a "success" (i.e., an individual who has the outcome of interest), and we consider lack of diagnosis of diabetes as a "failure." In this example, x represents the number of people with a diagnosis of diabetes in the sample. The sample proportion is p̂ (called "p-hat"), and it is computed by taking the ratio of the number of successes in the sample to the sample size, that is:

$$\hat{p} = \frac{x}{n}$$

Confidence Interval for the Population Proportion

If there are at least 5 successes and at least 5 failures, then the confidence interval can be computed with this formula:

$$\hat{p} \pm Z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

The point estimate for the population proportion is the sample proportion, and the margin of error is the product of the Z value for the desired confidence level (e.g., Z=1.96 for 95% confidence) and the standard error of the point estimate. In other words, the standard error of the point estimate is:

$$SE(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

This formula is appropriate for samples with at least 5 successes and at least 5 failures in the sample. This was a condition for the Central Limit Theorem for binomial outcomes. If there are fewer than 5 successes or failures then alternative procedures, called exact methods, must be used to estimate the population proportion.
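
As one illustration of an exact method, R's binom.test() function reports a Clopper-Pearson exact confidence interval for a proportion. The counts below (3 successes among 40 participants) are hypothetical and chosen only to show a case with fewer than 5 successes.

```r
# Exact (Clopper-Pearson) 95% CI for a proportion with few successes.
# Hypothetical data: 3 "successes" among 40 participants.
x <- 3
n <- 40

exact <- binom.test(x, n, conf.level = 0.95)
exact$estimate   # sample proportion, x / n
exact$conf.int   # exact 95% confidence interval
```

Unlike the Z-based interval, the exact interval is generally not symmetric around the sample proportion.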

Example: During the 7th examination of the Offspring cohort in the Framingham Heart Study there were 1,219 participants being treated for hypertension and 2,313 who were not on treatment. If we call treatment a "success", then x=1,219 and n=3,532. The sample proportion is:

$$\hat{p} = \frac{x}{n} = \frac{1219}{3532} = 0.345$$

This is the point estimate, i.e., our best estimate of the proportion of the population on treatment for hypertension is 34.5%. The sample is large, so the confidence interval can be computed using the formula:

$$\hat{p} \pm Z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

Substituting our values we get

$$0.345 \pm 1.96 \sqrt{\frac{0.345(1 - 0.345)}{3532}}$$

which is

$$0.345 \pm 0.016$$

So, the 95% confidence interval is (0.329, 0.361).

Thus we are 95% confident that the true proportion of persons on antihypertensive medication is between 32.9% and 36.1%.
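
The hypertension example can be reproduced in R by applying the formula directly to the counts. This is a minimal sketch with illustrative variable names.

```r
# 95% CI for the proportion on treatment for hypertension
x <- 1219   # number on treatment ("successes")
n <- 3532   # sample size

p_hat <- x / n
se    <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of p-hat
z     <- qnorm(0.975)                    # about 1.96

ci <- p_hat + c(-1, 1) * z * se
round(p_hat, 3)   # 0.345
round(ci, 3)      # 0.329 0.361
```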

 Key Concept: There are several types of estimates in a single population that are proportions for which one can compute confidence intervals using these methods. These include:

• prevalence
• cumulative incidence
• incidence rates

The table below, from the 5th examination of the Framingham Offspring cohort, shows the number of men and women found with or without cardiovascular disease (CVD). Estimate the prevalence of CVD in men using a 95% confidence interval.

         Free of CVD    Prevalent CVD    Total
Men      1,548          244              1,792
Women    1,872          135              2,007
Total    3,420          379              3,799

Computing the 95% Confidence Interval for a Proportion in One Sample with R

In the Weymouth, MA health survey there were 333 adult respondents who reported a history of diabetes out of 3,573 respondents (333/3573=0.0932 or 9.32%).

We can use the Weymouth health survey data to get the counts of those with or without a history of diabetes using the table() function:

> table(hx_dm)

hx_dm

   0    1
3240  333

Then find the denominator (sum of those with or without diabetes).

> 3240+333

[1] 3573

To get the 95% confidence interval, use the prop.test() function.

> prop.test(333,3573,correct=FALSE)

1-sample proportions test without continuity correction

data: 333 out of 3573, null probability 0.5

X-squared = 2365.141, df = 1, p-value < 2.2e-16

alternative hypothesis: true p is not equal to 0.5

95 percent confidence interval:

0.0840988 0.1031730

sample estimates:

p

0.09319899

So, the point estimate (proportion with diabetes in the sample) was 9.3%, and with 95% confidence the true proportion lies between 0.084 and 0.103, or 8.4% to 10.3%.
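
Note that prop.test() does not use the p̂ ± Z(SE) formula given earlier; it computes the interval by inverting the score test (often called the Wilson score interval). With a sample this large the two approaches agree closely, as the following sketch shows. The names wald_ci and score_ci are illustrative.

```r
# Compare the formula-based interval with the interval from prop.test()
x <- 333
n <- 3573
p_hat <- x / n

# Interval from the formula: p-hat +/- Z * sqrt(p-hat * (1 - p-hat) / n)
wald_ci <- p_hat + c(-1, 1) * qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)

# Interval reported by prop.test() without continuity correction
score_ci <- as.numeric(prop.test(x, n, correct = FALSE)$conf.int)

round(wald_ci, 3)    # 0.084 0.103
round(score_ci, 3)   # 0.084 0.103
```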

Summary

This module focuses on computing point estimates and 95% confidence limits for a population mean or a population proportion from a single sample. Point estimates are the best single-valued estimates of an unknown population parameter. Because these can vary from sample to sample, most investigations start with a point estimate and build in a margin of error. The margin of error quantifies sampling variability and includes a value from the Z or t distribution reflecting the selected confidence level as well as the standard error of the point estimate. It is important to remember that the confidence interval contains a range of likely values for the unknown population parameter, that is, a range of values for the population parameter consistent with the data. It is also possible, although the likelihood is small, that the confidence interval does not contain the true population parameter. This is important to remember in interpreting intervals. The precision of a confidence interval is defined by the margin of error (or the width of the interval). A larger margin of error (wider interval) is indicative of a less precise estimate.

Solutions to Selected Problems

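Answer to First BMI Problem (Full Sample)

What is the 90% confidence interval for BMI? (Note that Z=1.645 to reflect the 90% confidence level.)

Using the full-sample statistics for body mass index (n=3,326, sample mean 28.15, s=5.32):

$$28.15 \pm 1.645 \frac{5.32}{\sqrt{3326}} = 28.15 \pm 0.15$$

So, the 90% confidence interval is (28.00, 28.30).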

=======================================================

Answer to BMI Problem for the Subsample (n=10)

Question: Using the subsample in the table above, what is the 90% confidence interval for BMI?

Solution: Once again, the sample size was 10, so we go to the t-table and use the row with 10 minus 1 degrees of freedom (so 9 degrees of freedom). But now you want a 90% confidence interval, so you would use the column with a two-tailed probability of 0.10. Looking down to the row for 9 degrees of freedom, you get a t-value of 1.833.

Once again you will use this equation:

$$\bar{x} \pm t \frac{s}{\sqrt{n}}$$

Plugging in the values for this problem we get the following expression:

$$27.26 \pm 1.833 \frac{3.10}{\sqrt{10}} = 27.26 \pm 1.80$$

Therefore the 90% confidence interval ranges from 25.46 to 29.06.
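
This answer can be checked in R from the subsample statistics for body mass index. Note that a 90% two-sided interval uses the 0.95 quantile of the t distribution. The variable names are illustrative.

```r
# 90% CI for mean BMI in the subsample (n = 10)
n    <- 10
xbar <- 27.26
s    <- 3.10

t_value <- qt(0.95, df = n - 1)   # about 1.833 for df = 9
ci <- xbar + c(-1, 1) * t_value * s / sqrt(n)
round(ci, 2)   # 25.46 29.06
```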

=======================================================

Answer to CVD Prevalence Problem

The table below, from the 5th examination of the Framingham Offspring cohort, shows the number of men and women found with or without cardiovascular disease (CVD). Estimate the prevalence of CVD in men using a 95% confidence interval.

         Free of CVD    Prevalent CVD    Total
Men      1,548          244              1,792
Women    1,872          135              2,007
Total    3,420          379              3,799

The prevalence of cardiovascular disease (CVD) among men is 244/1792=0.1362. The sample size is large and satisfies the requirement that there are at least 5 successes and at least 5 failures. Therefore, the following formula can be used again.

$$\hat{p} \pm Z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

Substituting, we get

$$0.1362 \pm 1.96 \sqrt{\frac{0.1362(1 - 0.1362)}{1792}} = 0.1362 \pm 0.0159$$

So, the 95% confidence interval is (0.120, 0.152).

With 95% confidence the prevalence of cardiovascular disease in men is between 12.0% and 15.2%.
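
The same answer can be obtained in R from the counts for men in the table. This is a minimal sketch with illustrative variable names.

```r
# 95% CI for the prevalence of CVD among men
x <- 244    # men with prevalent CVD
n <- 1792   # total men

p_hat <- x / n
ci <- p_hat + c(-1, 1) * qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)

round(p_hat, 4)   # 0.1362
round(ci, 3)      # 0.120 0.152
```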

==================================================================================================

Answer to 95% Confidence Interval for Weight of Adult Respondents in Weymouth, MA

> t.test(weight)

One Sample t-test

data: weight

t = 204.6426, df = 3324, p-value < 2.2e-16

alternative hypothesis: true mean is not equal to 0

95 percent confidence interval:

167.5318 170.7731

sample estimates:

mean of x

169.1525


Interpretation:

The mean weight of adult household respondents in Weymouth was 169 pounds. With 95% confidence, the true mean is in the range of 167.5 to 170.8 pounds.

Derivation of the 95% Confidence Interval for a Mean

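From the central limit theorem we know that

$$\mu_{\bar{x}} = \mu$$

and the sample mean has a standard deviation (also called the standard error):

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

It also states that

$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

For the standard normal distribution there is a 95% probability that a standard normal variable, Z, will fall between -1.96 and 1.96. In other words,

$$P(-1.96 < Z < 1.96) = 0.95$$

We can substitute the equation for Z from the central limit theorem into this equation in order to derive an expression for computing the 95% confidence interval for the population mean, as follows:

$$P\left(-1.96 < \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} < 1.96\right) = 0.95$$

Using algebra, we can rework this inequality such that the mean (μ) is the middle term, as shown below. Multiplying each term by σ/√n gives

$$P\left(-1.96 \frac{\sigma}{\sqrt{n}} < \bar{x} - \mu < 1.96 \frac{\sigma}{\sqrt{n}}\right) = 0.95$$

then, subtracting x̄ from each term,

$$P\left(-\bar{x} - 1.96 \frac{\sigma}{\sqrt{n}} < -\mu < -\bar{x} + 1.96 \frac{\sigma}{\sqrt{n}}\right) = 0.95$$

and finally, multiplying each term by -1 (which reverses the inequalities),

$$P\left(\bar{x} - 1.96 \frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96 \frac{\sigma}{\sqrt{n}}\right) = 0.95$$

Therefore, the 95% confidence interval for the population mean is

$$\bar{x} \pm 1.96 \frac{\sigma}{\sqrt{n}}$$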