Defining & Finding Cases and Controls
Case Definitions
Careful thought should be given to the case definition to be used. If the definition is too broad or vague, it is easier to capture people with the outcome of interest, but a loose case definition will also capture people who do not have the outcome of interest. On the other hand, an overly restrictive case definition will exclude potential cases, and the sample size may be limited. Investigators frequently wrestle with this problem during outbreak investigations. Initially, they will often use a somewhat broad definition in order to identify potential cases. However, as an outbreak investigation progresses, there is a tendency to narrow the case definition to make it more precise and specific, for example by requiring confirmation of the diagnosis by laboratory testing. In general, investigators conducting case-control studies should thoughtfully construct a definition that is as clear and specific as possible without being overly restrictive.
The same care applies to defining exposure. For example, if one were conducting a case-control study on the association between smoking and heart disease, simply classifying subjects as people who smoke or people who don't smoke would raise a lot of questions. Does smoking one or two cigarettes a year make one a smoker? Should someone who used to smoke regularly but quit be classified as a smoker, a non-smoker, or neither?
The CDC suggests the following definitions regarding classification of tobacco smoking:
- Current smoker: An adult who has smoked at least 100 cigarettes in his or her lifetime and who currently smokes cigarettes. Beginning in 1991 this group was divided into every day smokers and somedays smokers.
- Every day smoker: An adult who has smoked at least 100 cigarettes in his or her lifetime, and who now smokes every day. Previously called a regular smoker.
- Somedays smoker: An adult who has smoked at least 100 cigarettes in his or her lifetime, who smokes now, but does not smoke every day. Previously called an occasional smoker.
- Former smoker: An adult who has smoked at least 100 cigarettes in his or her lifetime but who had quit smoking at the time of interview.
- Never smoker: An adult who has never smoked, or who has smoked fewer than 100 cigarettes in his or her lifetime.
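The categories above can be written as a simple decision rule. The sketch below is only an illustration: the function and parameter names are hypothetical stand-ins for survey questions, and the labels follow the wording of the list.

```python
def classify_smoking_status(lifetime_cigarettes, smokes_now, smokes_every_day=False):
    """Assign one of the smoking categories listed above.

    All parameter names are hypothetical; they represent answers to
    interview questions, not fields from any real survey dataset.
    """
    # Fewer than 100 lifetime cigarettes: never smoker, regardless of current use.
    if lifetime_cigarettes < 100:
        return "never smoker"
    # At least 100 lifetime cigarettes but not smoking at the time of interview.
    if not smokes_now:
        return "former smoker"
    # Current smokers are split into two subgroups.
    return "every day smoker" if smokes_every_day else "somedays smoker"


print(classify_smoking_status(2, smokes_now=True))      # one or two cigarettes a year
print(classify_smoking_status(5000, smokes_now=False))  # smoked regularly, then quit
```

Under these rules, someone who smokes one or two cigarettes a year is a never smoker until the lifetime total reaches 100, and someone who smoked regularly and then quit is a former smoker.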
Another classic example of the importance of a clear case definition is a case-control study trying to determine whether use of a particular drug by pregnant women increases the risk of birth defects in their offspring. Should the investigators define a case as a child with any congenital defect, large or small? Different drugs and other exposures have different effects and may influence one organ system but not others. Using an all-encompassing case definition like "any congenital defect" might lead to an underestimate of an important association or even a failure to recognize the association at all.
Finding Cases
Typical sources for cases include:
- Patient rosters at medical facilities
- Death certificates
- Disease registries (e.g., cancer or birth defect registries; the SEER Program [Surveillance, Epidemiology and End Results] is a federally funded program that identifies newly diagnosed cases of cancer in population-based registries across the US)
- Cross-sectional surveys (e.g., NHANES, the National Health and Nutrition Examination Survey)
- Health insurer records (e.g., Blue Cross-Blue Shield, Kaiser-Permanente)
Selecting Controls
Selection of control subjects hinges on how the cases are selected. The purpose of the controls is to estimate the exposure distribution in the source population, i.e., to estimate the odds of exposure in the overall source population from which the cases came. It is important to remember that these controls are not the unexposed controls in a laboratory experiment. Some of the controls in a case-control study will have the exposure of interest, and what they provide is an estimate of how prevalent the exposure is in the overall source population.
Selection of an appropriate control group is one of the most difficult aspects of conducting a case-control study. There are two key principles that should be followed in selecting controls:
- The comparison group ("controls") should be representative of the source population that produced the cases. The method of selecting and enrolling control subjects should meet the "would criterion," i.e., if the controls had experienced the outcome, would they have been identified as cases in this study? If the answer is yes, then the controls are likely to be representative of the source population. If the answer is no, there is likely to be selection bias.
- The "controls" must be sampled in a way that is independent of the exposure, meaning that their selection should not be more (or less) likely if they have the exposure of interest.
If either of these principles is not adhered to, selection bias can result. Selection bias will be discussed in detail in the module on bias.
Consider the hypothetical example in the figure below, which summarizes the exposure distribution in diseased and non-diseased people in a source population and compares it to the exposure distributions in the samples of cases and controls that were selected for a study.
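The second principle can be illustrated with a little arithmetic. The sketch below uses entirely hypothetical numbers and a hypothetical function name: it shows the odds ratio one would expect when exposed members of the source population are selected as controls with a different probability than unexposed members.

```python
def expected_odds_ratio(case_exposure_odds, source_exposure_odds,
                        p_select_exposed, p_select_unexposed):
    """Expected odds ratio when control selection may depend on exposure.

    All inputs are hypothetical illustration values, not data from a study.
    """
    # The exposure odds among selected controls equal the source population
    # odds multiplied by the ratio of the two selection probabilities.
    control_odds = source_exposure_odds * (p_select_exposed / p_select_unexposed)
    return case_exposure_odds / control_odds


# Selection independent of exposure: both groups sampled with probability 0.01.
independent = expected_odds_ratio(2.0, 0.25, 0.01, 0.01)
# Exposed people are twice as likely to be selected as controls.
dependent = expected_odds_ratio(2.0, 0.25, 0.02, 0.01)

print(f"Selection independent of exposure: OR = {independent:.1f}")
print(f"Exposed twice as likely to be selected: OR = {dependent:.1f}")
```

In this sketch, oversampling exposed controls by a factor of two halves the estimated odds ratio, even though nothing about the true association has changed.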
Suppose the investigators were dealing with a rare disease that was present in only 24 people in a source population with 3.6 million non-diseased people. Suppose also that the true exposure distribution in the 24 cases was 17:8, or 2.1 to 1, and the exposure distribution in the non-diseased people in the source population was 421,101:3,178,899, or 0.13 to 1. If so, the true odds ratio in the population would be 2.1/0.13 = 16.15.
Suppose further that the investigators could only identify 12 cases who were willing to participate in the study, and they selected three times as many control subjects. Among the 12 sampled cases the exposure distribution was 8:4, or 2 to 1, and among the 36 sampled controls, the exposure distribution was 5:31, or 0.16 to 1. If so, the estimated odds ratio from the samples would be 2.0/0.16 = 12.5. Despite the fact that this was a very small sample, the sampling methodology provided exposure distributions that were similar to those in the entire source population, and these provided an estimated odds ratio that was reasonably close to the true value in this population.
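The two odds ratios in this example can be recomputed directly from the counts. The sketch below uses a hypothetical helper function and takes the counts exactly as quoted above (the 17:8 split is used as given, although it sums to 25 rather than 24). Because it divides unrounded odds, its results differ slightly from the 16.15 and 12.5 obtained above from rounded odds.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    case_odds = exposed_cases / unexposed_cases
    control_odds = exposed_controls / unexposed_controls
    return case_odds / control_odds


# Source population: diseased 17:8, non-diseased 421,101:3,178,899.
population_or = odds_ratio(17, 8, 421_101, 3_178_899)
# Study sample: 12 cases (8:4) and 36 controls (5:31).
sample_or = odds_ratio(8, 4, 5, 31)

print(f"Odds ratio in the source population: {population_or:.1f}")  # 16.0
print(f"Odds ratio estimated from the sample: {sample_or:.1f}")     # 12.4
```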
Test Yourself
Investigators conducted a case-control study to determine whether having an induced abortion increases the risk of a subsequent spontaneous abortion (JAMA 243:2495-2499, 1980). Cases were women who entered Boston City Hospital from 1976-1978 with a spontaneous abortion <20 weeks gestation. Controls were women who delivered live-born infants at Boston City Hospital during the same time period. Both groups were asked whether they had had a prior induced abortion.
"We identified patients entering Boston City Hospital from July 1976 until February 1978 with a spontaneous abortion at less than 20 weeks' gestation or premature delivery between 20 to 27 weeks' gestation (the case group). We used obstetric patients whose dates of delivery coincided with the cases' dates of spontaneous loss as a comparison group."
Do the controls in this study fulfill the would criterion?
Investigators conducted a case-control study to measure the association between condom use and acute gonorrhea in men. Cases were male patients with acute gonorrhea seen at a local health clinic from June 1 through June 30. Controls were male patients diagnosed with a sexually transmitted disease other than gonorrhea during the same time period. Are the controls in this study selected independently of exposure?
Sources of Controls
There are three main sources of control subjects:
- Population Controls
- Hospital/Clinic Controls
- Friend, Neighbor, Spouse, and Relative Controls
Population Controls
A population-based case-control study is one in which the cases come from a precisely defined population, such as a fixed geographic area, and the controls are sampled directly from the same population. In this situation cases might be identified from a state cancer registry, for example, and the comparison group would logically be selected at random from the same source population.
Population controls can be identified from voter registration lists, tax rolls, driver's license lists, and telephone directories or by "random digit dialing" (which has the advantage that it includes unlisted numbers). High response rates are important regardless of the method of invitation to participate, because non-response bias can be introduced if response rates are low and non-responders differ from responders. For example, people of lower socioeconomic status might be less likely to respond if they have to work multiple low-paying jobs.
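Random selection from a population list can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the roster of ID numbers, the case IDs, and the function name are all hypothetical, and a real study would also have to handle non-response and eligibility checks.

```python
import random


def sample_population_controls(roster, case_ids, controls_per_case, seed=0):
    """Draw a simple random sample of controls from a population roster.

    The roster and IDs are hypothetical; exposure status is never consulted,
    so selection is independent of exposure by construction.
    """
    cases = set(case_ids)
    # Known cases are not eligible to serve as controls.
    eligible = [person_id for person_id in roster if person_id not in cases]
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    return rng.sample(eligible, controls_per_case * len(cases))


roster = list(range(1, 10_001))         # hypothetical list of 10,000 residents
case_ids = list(range(1, 13))           # hypothetical IDs of 12 enrolled cases
controls = sample_population_controls(roster, case_ids, controls_per_case=3)
print(len(controls), "controls selected")
```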
Hospital/Clinic Controls
If cases are obtained from a medical facility, the controls should be obtained from the same facility provided they meet two criteria:
- Control patients must have diseases that are unrelated to the exposure being studied. For example, for a study examining the association between smoking and lung cancer, it would not be appropriate to include patients with cardiovascular disease or emphysema as controls, since smoking is a risk factor for these conditions. Including patients who are more likely to have the exposures of interest than the source population will result in an underestimate of the true association.
- Control patients should have diseases with similar referral patterns as the cases, in order to minimize selection bias. For example, if the cases are women with cervical cancer who have been referred from all over the state, it would be inappropriate to use controls consisting of women with diabetes who had been referred primarily from local health centers in the immediate vicinity of the hospital. Similarly, it would be inappropriate to use patients from the emergency room, because the selection of a hospital for an emergency is different from the selection of a hospital for cancer care, and this difference might be related to the exposure of interest.
The advantages of using controls who are patients from the same facility are:
- They are easier to identify.
- They are more likely to participate than general population controls.
- They minimize selection bias because they generally come from the same source population (provided referral patterns are similar).
- Recall bias (remembering past exposures to a different degree than the cases) is minimized, because the controls are also sick, but with a different diagnosis.
Friend, Neighbor, Spouse, and Relative Controls
Occasionally investigators will ask cases to nominate controls who are in one of these categories because they have similar characteristics, such as genotype, socioeconomic status, or environment, i.e., factors that can cause confounding but are hard to measure and adjust for. Matching cases and controls on these factors controls the confounding they would otherwise cause.
Test Yourself
In 1948 two British investigators conducted a case-control study to examine the association between smoking and lung cancer. Cases were patients who were being treated for lung cancer in 20 London hospitals. Controls were patients in the same hospitals who were being treated for non-cancer medical problems such as heart disease, pneumonia, emphysema, and bronchitis. [Doll R, Hill AB: Smoking and carcinoma of the lung: preliminary report. British Medical Journal 1950;2:739-48.]
Is this an appropriate control group?
How Many Controls Are Needed?
For rare outcomes the number of cases that can be enrolled may be limited, making it difficult to achieve a precise estimate of the odds ratio. Statistical power can be increased somewhat by enrolling more controls than cases. Investigators will sometimes enroll 2, 3, or even 4 times as many controls as cases to increase statistical power, but there is very little advantage in exceeding a 4:1 ratio of controls to cases. Selecting more than four controls for each case usually means a lot more work to collect the additional data without any meaningful increase in statistical power.
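The diminishing return can be made concrete with a common rule of thumb. This rule is an assumption brought in for illustration and is not stated above: when the true odds ratio is close to 1, a study with r controls per case has roughly r/(r+1) of the statistical efficiency of a study with an unlimited number of controls. The function name below is hypothetical.

```python
def relative_efficiency(controls_per_case):
    """Approximate efficiency relative to unlimited controls: r / (r + 1).

    This is a rule-of-thumb approximation that holds best when the
    true odds ratio is near 1; it is not an exact power calculation.
    """
    return controls_per_case / (controls_per_case + 1)


for r in range(1, 7):
    print(f"{r}:1 controls to cases -> {relative_efficiency(r):.0%} relative efficiency")
```

Under this approximation, going from one to four controls per case raises relative efficiency from 50% to 80%, while a fifth control adds only about three percentage points.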