Introduction
Consider two examples in which samples are to be used to estimate some parameter in a population:
- Suppose I wish to estimate the mean weight of the freshman class entering Boston University in the fall, and I select the first five freshmen who agree to be weighed. Their mean weight is 153 pounds. Is this an accurate estimate of the mean value for the entire freshman class? Intuitively, you know that the estimate might be off by a considerable amount, because the sample is very small and may not be representative of the entire class. In addition, if I were to repeat this process, taking multiple samples of five students and computing the mean for each sample, I would likely find that the estimates varied from one another by quite a bit. This also implies that some of the estimates would be very inaccurate, i.e., far from the true mean for the class.
- Suppose I have a box of marbles that are either blue or yellow, and I want you to estimate the proportion of blue marbles without looking into the box. I shake up the box and allow you to select 4 marbles and examine them to compute the proportion of blue marbles in your sample. Again, you know intuitively that the estimate might be very inaccurate, because the sample size is so small. If you were to repeat this process and take multiple samples of 4 marbles to estimate the proportion of blue marbles, you would likely find that the estimates varied from one another by quite a bit, and many of the estimates would be very inaccurate.
The parameters being estimated differ in these two examples. The first is based on a measurement variable, body weight, which can take any of an infinite number of values on a continuous scale. In the second example the marbles are either blue or yellow (i.e., color is a discrete variable that can take only a limited number of values), and the proportion of blue marbles in each sample is used to estimate the proportion of blue marbles in the entire box. Although these variables are of different types, both examples illustrate the problem of random error when using a sample to estimate a parameter in a population.
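The marble example can be sketched as a small simulation. The box contents below (60 blue and 40 yellow marbles), the number of repeated samples, and the function name `sample_proportions` are assumptions chosen for illustration; none of them come from the text. The sketch draws many samples of 4 marbles and, for comparison, many samples of 40, then reports how much the estimated proportions vary.

```python
import random
import statistics

def sample_proportions(box, sample_size, n_samples, rng):
    """Draw repeated samples without replacement; return the proportion of blue in each."""
    estimates = []
    for _ in range(n_samples):
        drawn = rng.sample(box, sample_size)
        estimates.append(drawn.count("blue") / sample_size)
    return estimates

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical box (an assumption): 60 blue, 40 yellow, so the true proportion is 0.60
box = ["blue"] * 60 + ["yellow"] * 40

small = sample_proportions(box, 4, 1000, rng)    # samples of 4 marbles, as in the text
large = sample_proportions(box, 40, 1000, rng)   # larger samples, for comparison

# Standard deviation of the estimates measures how much they vary from sample to sample
sd_small = statistics.pstdev(small)
sd_large = statistics.pstdev(large)

print("n = 4 : estimates range", min(small), "to", max(small), " SD =", round(sd_small, 3))
print("n = 40: estimates range", min(large), "to", max(large), " SD =", round(sd_large, 3))
```

With samples of 4, the only possible estimates are 0, 0.25, 0.5, 0.75, and 1, so individual estimates can land far from the true proportion. The larger samples should cluster much more tightly around it.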
The problem of random error also arises in epidemiologic investigations. The basic goals of epidemiologic studies are a) to measure a disease frequency or b) to compare disease frequency in two exposure groups in order to measure the extent to which an exposure is associated with a health outcome. However, both of these estimates might be inaccurate because of random error.
Essential Questions
- How do we distinguish differences that are real from those that are just due to chance?
- How do we assess the uncertainty from samples?
Examples of Where This is Leading:
- Computing the probability of a continuous outcome:
- The BMIs in a population of 60-year-old men are normally distributed with μ = 29 and σ = 6. What is the probability of a man having a BMI less than 30? (Note: μ denotes the mean of a variable in the population and σ is the standard deviation for that variable.)
- Determining whether groups differ with respect to a continuous measurement outcome:
- Do children born prematurely have lower IQs than average?
- Do patients randomized to a new drug have lower cholesterol levels than those receiving a placebo?
- A program to improve knowledge about HIV transmission was developed. Did the knowledge score of participants improve after going through the program?
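As a preview of the first example above, the BMI question can be worked with Python's standard library. This is a minimal sketch, assuming Python 3.8 or later for `statistics.NormalDist`; the variable names are illustrative. It computes the probability directly from the stated distribution and again by converting 30 to a z-score and using the standard normal distribution.

```python
from statistics import NormalDist

mu = 29      # population mean BMI, from the example
sigma = 6    # population standard deviation, from the example
cutoff = 30  # we want P(BMI < 30)

# Direct approach: cumulative probability under Normal(mu, sigma)
p_direct = NormalDist(mu=mu, sigma=sigma).cdf(cutoff)

# Equivalent approach: standardize to a z-score, then use the standard normal
z = (cutoff - mu) / sigma
p_from_z = NormalDist().cdf(z)

print("z =", round(z, 3))
print("P(BMI < 30) =", round(p_direct, 4))
```

Both approaches give the same probability, roughly 0.57. The z-score route is the one that corresponds to looking up a value in a Z-table.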
Learning Objectives
After completing this section, you will be able to:
- Define the properties of a normal distribution
- Describe the relationship between the standard normal distribution and the Z-table
- Use the standard normal distribution to compute probabilities
- Define and distinguish between point estimate and confidence interval
- Define, calculate and interpret the 95% confidence interval for a mean and a proportion