# Random Error

Consider two examples in which a sample is used to estimate a parameter in a population:

- Suppose I wish to estimate the mean weight of the freshman class entering Boston University in the fall, and I select the first five freshmen who agree to be weighed. Their mean weight is 153 pounds. Is this an accurate estimate of the mean value for the entire freshman class? Intuitively, you know that the estimate might be off by a considerable amount, because the sample is very small and may not be representative of the entire class. In addition, if I were to repeat this process, taking multiple samples of five students and computing the mean for each sample, I would likely find that the estimates varied from one another by quite a bit. This also implies that some of the estimates would be very inaccurate, i.e. far from the true mean for the class.
- Suppose I have a box of colored marbles and I want you to estimate the proportion of blue marbles without looking into the box. I shake up the box and allow you to select 4 marbles and examine them to compute the proportion of blue marbles in your sample. Again, you know intuitively that the estimate might be very inaccurate, because the sample size is so small. If you were to repeat this process and take multiple samples of 4 marbles to estimate the proportion of blue marbles, you would likely find that the estimates varied from one another by quite a bit, and many of the estimates would be very inaccurate.
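
Both thought experiments can be tried out in a short simulation. The sketch below is only an illustration: the population of 4,000 weights (mean 150 pounds, standard deviation 25 pounds) and the box of 100 marbles with 30% blue are assumed values, not data from the examples above. It repeats each sampling process 1,000 times and compares how much the estimates vary for a small sample and for a larger one.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical freshman class: 4,000 weights in pounds (assumed values).
population = [random.gauss(150, 25) for _ in range(4000)]
true_mean = statistics.mean(population)

# Hypothetical box of 100 marbles, 30 of them blue (assumed values).
box = ["blue"] * 30 + ["other"] * 70


def sample_means(pop, n, repeats=1000):
    """Means of `repeats` random samples of size n, drawn without replacement."""
    return [statistics.mean(random.sample(pop, n)) for _ in range(repeats)]


def sample_proportions(marbles, n, repeats=1000):
    """Proportion of blue marbles in each of `repeats` random samples of size n."""
    return [random.sample(marbles, n).count("blue") / n for _ in range(repeats)]


means_n5 = sample_means(population, 5)
means_n100 = sample_means(population, 100)
props_n4 = sample_proportions(box, 4)
props_n40 = sample_proportions(box, 40)

# The standard deviation of the estimates measures how much they vary
# from one sample to the next.
spread_n5 = statistics.stdev(means_n5)
spread_n100 = statistics.stdev(means_n100)
spread_p4 = statistics.stdev(props_n4)
spread_p40 = statistics.stdev(props_n40)

print(f"true mean weight: {true_mean:.1f}")
print(f"spread of sample means, n=5:   {spread_n5:.1f}")
print(f"spread of sample means, n=100: {spread_n100:.1f}")
print(f"spread of blue proportions, n=4:  {spread_p4:.3f}")
print(f"spread of blue proportions, n=40: {spread_p40:.3f}")
```

Under these assumptions the estimates from the small samples scatter much more widely around the true value than the estimates from the larger samples, even though every sample is drawn at random from the same population.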

The parameters being estimated differed in these two examples. The first was a measurement variable, i.e. body weight, which could have been any one of an infinite number of measurements on a continuous scale. In the second example the marbles were either blue or some other color (i.e., a discrete variable that can only have a limited number of values), and in each sample it was the frequency of blue marbles that was computed in order to estimate the proportion of blue marbles. Nevertheless, while these variables are of different types, they both illustrate the problem of random error when using a sample to estimate a parameter in a population.

The problem of random error also arises in epidemiologic investigations. We noted that basic goals of epidemiologic studies are a) to measure a disease frequency or b) to compare measurements of disease frequency in two exposure groups in order to measure the extent to which there is an association. However, both of these estimates might be inaccurate because of random error. Here are two examples that illustrate this.

- For the most part, bird flu has been confined to birds, but it is well-documented that humans who work closely with birds can contract the disease. Suppose we wish to estimate the probability of dying among humans who develop bird flu. In this case we are not interested in comparing groups in order to measure an association. We just want to have an accurate estimate of how frequently death occurs among humans with bird flu. It isn't known how many humans have gotten bird flu, but suppose an investigator in Hong Kong identified eight cases and confirmed that they had bird flu by laboratory testing. Four of the eight victims died of their illness, meaning that the incidence of death (the case-fatality rate) was 4/8 = 50%. Does this mean that 50% of all humans infected with bird flu will die? How precise is this estimate?
- Suppose investigators wish to estimate the association between frequent tanning and risk of skin cancer. A cohort study is conducted that follows 150 subjects who tan frequently throughout the year and 124 subjects who report that they limit their exposure to sun and regularly use sun block with SPF 15 or greater. At the end of ten years of follow-up the risk ratio is 2.5, suggesting that those who tan frequently have 2.5 times the risk of skin cancer compared with those who limit their exposure. How precise is this estimate? Does it accurately reflect the association in the population at large?
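
To get a feel for how little eight cases can tell us, it helps to ask how often 4 deaths in 8 cases would be observed under different true case-fatality rates. The sketch below assumes a simple binomial model, i.e. that each case has the same probability of dying, independently of the others. That model and the candidate true rates are assumptions made for illustration; they are not part of the bird flu example itself.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent trials with event probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Candidate "true" case-fatality rates (assumed values for illustration).
for true_cfr in (0.2, 0.3, 0.5, 0.7, 0.8):
    prob = binomial_pmf(4, 8, true_cfr)
    print(f"true case-fatality rate {true_cfr:.0%}: P(4 deaths in 8 cases) = {prob:.3f}")
```

Under this model, a true case-fatality rate of 30% or 70% would still produce exactly 4 deaths among 8 cases about one time in seven, so the observed 50% is compatible with a wide range of true values.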

Certainly there are a number of factors that might detract from the accuracy of these estimates. There might be systematic error, such as biases or confounding, that could make the estimates inaccurate. However, even if we were to minimize systematic errors, it is possible that the estimates might be inaccurate just based on who happened to end up in our sample. This source of error is referred to as random error or sampling error.

In the bird flu example, we were interested in estimating a proportion in a single group, i.e. the proportion of deaths occurring in humans infected with bird flu. In the tanning study the incidence of skin cancer was measured in two groups, and the two incidences were expressed as a ratio in order to estimate the magnitude of association between frequent tanning and skin cancer. When the estimate of interest is a single value (e.g., a proportion in the first example and a risk ratio in the second) it is referred to as a *point estimate*. For each of these point estimates, a confidence interval can be used to indicate its precision.
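
As a rough illustration of what such intervals look like, the sketch below computes an approximate 95% confidence interval for each kind of point estimate. The methods are one common choice among several and are my own selection, not something specified above: a Wilson score interval for the proportion, and a large-sample interval on the log scale for the risk ratio. The bird flu counts (4 deaths among 8 cases) come from the example. The tanning study reports only the group sizes and the risk ratio, so the case counts used here (15 of 150 and 5 of 124) are invented purely for illustration and were chosen to give a ratio close to 2.5.

```python
from math import exp, log, sqrt

Z_95 = 1.96  # approximate standard normal quantile for a 95% interval


def wilson_interval(events, n, z=Z_95):
    """Wilson score confidence interval for a single proportion."""
    p = events / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


def risk_ratio_ci(cases_exposed, n_exposed, cases_unexposed, n_unexposed, z=Z_95):
    """Risk ratio with a large-sample confidence interval computed on the log scale."""
    rr = (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)
    se_log_rr = sqrt(
        1 / cases_exposed - 1 / n_exposed + 1 / cases_unexposed - 1 / n_unexposed
    )
    return rr, exp(log(rr) - z * se_log_rr), exp(log(rr) + z * se_log_rr)


# Bird flu example: 4 deaths among 8 confirmed cases.
cfr_low, cfr_high = wilson_interval(4, 8)
print(f"case-fatality rate 50%, 95% CI {cfr_low:.0%} to {cfr_high:.0%}")

# Tanning example: HYPOTHETICAL case counts (not reported in the example).
rr, rr_low, rr_high = risk_ratio_ci(15, 150, 5, 124)
print(f"risk ratio {rr:.2f}, 95% CI {rr_low:.2f} to {rr_high:.2f}")
```

With only eight cases, the interval around the 50% case-fatality rate runs from roughly 22% to 78%, which conveys how imprecise that point estimate is. The width of the risk ratio interval depends entirely on the invented counts, so it should be read only as a demonstration of the calculation.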