# Selection Bias

Selection bias can result when the selection of subjects into a study, or their likelihood of being retained in the study, leads to a result that is different from what you would have gotten if you had enrolled the entire target population. If one enrolled the entire population and collected accurate data on exposure and outcome, then one could compute the true measure of association. We generally don't enroll the entire population; instead we take samples. However, if one samples the population in a fair way, such that the sampling from all four cells of the contingency table is fair and representative of the distribution of exposure and outcome in the overall population, then one can obtain an accurate estimate of the true association (assuming a large enough sample, so that random error is minimal, and assuming there are no other biases or confounding). Conceptually, this might be visualized by equal-sized ladles (sampling) for each of the four cells.

The contingency table has columns (diseased and non-diseased) and rows (exposed and non-exposed). In this illustration the four exposure/disease categories have equal-sized ladles in them to convey the idea of unbiased sampling.

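Fair Sampling

| | Diseased | Non-diseased |
|---|---|---|
| Exposed | equal-sized ladle | equal-sized ladle |
| Non-exposed | equal-sized ladle | equal-sized ladle |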

However, if sampling is not representative of the exposure-outcome distributions in the overall population, then the measures of association will be biased, and this is referred to as selection bias. Consequently, selection bias can result when the selection of subjects into a study or their likelihood of being retained in a cohort study leads to a result that is different from what you would have gotten if you had enrolled the entire target population. One example of this might be represented by the table below, in which the enrollment procedures resulted in disproportionately large sampling of diseased subjects who had the exposure.

This contingency table has a larger ladle in the cell tabulating the number of exposed subjects with disease. This indicates a tendency to over-sample this category, as in, for example, a case-control study in which cases were more likely to be selected if they had been exposed.

Selection Bias

| | Diseased | Non-diseased |
|---|---|---|
| Exposed | larger ladle | ladle |
| Non-exposed | ladle | ladle |

There are several mechanisms that can produce this unwanted effect:

1. Selection of a comparison group ("controls") that is not representative of the population that produced the cases in a case-control study. (Control selection bias)
2. Differential loss to follow up in a cohort study, such that the likelihood of being lost to follow up is related to outcome status and exposure status. (Loss to follow-up bias)
3. Refusal, non-response, or agreement to participate that is related to the exposure and disease. (Self-selection bias)
4. Using the general population as a comparison group for an occupational cohort study ("Healthy worker effect")
5. Differential referral or diagnosis of subjects

# Selection Bias in Case-Control Studies

## 1. Control Selection Bias

In a case-control study selection bias occurs when subjects for the "control" group are not truly representative of the population that produced the cases. Remember that in a case-control study the controls are used to estimate the exposure distribution (i.e., the proportion having the exposure) in the population from which the cases arose. The exposure distribution in cases is then compared to the exposure distribution in the controls in order to compute the odds ratio as a measure of association.

In the module on Overview of Analytic Studies and in the module on Measures of Association we considered a rare disease in a source population that looked like this:

| | Diseased | Non-diseased | Total |
|---|---|---|---|
| Exposed | 7 | 1,000 | 1,007 |
| Non-exposed | 6 | 5,634 | 5,640 |

Given the entire population, we could compute the risk ratio = 6.53. However, one often conducts a case-control study when the outcome is rare like this, because it is much more efficient. Consequently, in order to estimate the risk ratio we could use the relative distribution of exposure in a sample of the population (the controls), provided that these controls are selected by procedures such that the sample provides an accurate estimate of the exposure distribution in the overall population.
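The risk ratio quoted above can be reproduced from the source population counts. The following is a minimal sketch; the function name `risk_ratio` and its parameter names are chosen here for illustration and do not come from the module.

```python
def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk in the exposed divided by risk in the non-exposed."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed

# Counts from the source population table above.
rr = risk_ratio(7, 1007, 6, 5640)
print(round(rr, 2))  # 6.53
```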

If a control sample was selected appropriately, i.e., such that it was representative of exposure status in the population, then the case-control results might look like the table below.

| | Cases | Controls |
|---|---|---|
| Exposed | 7 | 10 |
| Non-exposed | 6 | 56 |

Note that the sample of controls represents only 1% of the overall population, but the exposure distribution in the controls (10/56) is representative of the exposure status in the overall population (1,000/5,634). As a result, the odds ratio = 6.53 gives an unbiased estimate of the risk ratio.
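As a quick check of this reasoning, the sketch below compares the exposure odds in the controls with those in the non-diseased source population and then computes the cross-product odds ratio. The helper name `odds_ratio` and the variable names are illustrative assumptions, not part of the module.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Cross-product odds ratio for a 2x2 case-control table."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)

# Exposure odds among the controls vs. among non-diseased people in the population.
control_exposure_odds = 10 / 56
population_exposure_odds = 1000 / 5634

or_representative = odds_ratio(7, 6, 10, 56)

print(round(control_exposure_odds, 3), round(population_exposure_odds, 3))
print(round(or_representative, 2))  # 6.53
```

Because the two exposure odds are nearly identical, the odds ratio from the sample matches the risk ratio in the full population.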

In contrast, suppose that in the same hypothetical study controls were somewhat more likely to be chosen if they had the exposure being studied. The data might look something like this:

| | Cases | Controls |
|---|---|---|
| Exposed | 7 | 16 |
| Non-exposed | 6 | 50 |

Here we have the same number of controls, but the investigators used selection procedures that were somewhat more likely to select controls who had the exposure. As a result, the estimate of effect, the odds ratio, was biased (OR = 3.65).
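To see how far this control sample departs from fair sampling, the sketch below computes how many of the 66 controls would be expected to be exposed if controls mirrored the non-diseased source population, and then recomputes the odds ratio from the biased table. The variable names are illustrative and are not taken from the module.

```python
# Non-diseased people in the source population, by exposure status.
nondiseased_exposed = 1000
nondiseased_unexposed = 5634
n_controls = 16 + 50  # same total number of controls as in the unbiased sample

# Exposed controls expected if the controls mirrored the source population.
expected_exposed_controls = n_controls * nondiseased_exposed / (
    nondiseased_exposed + nondiseased_unexposed
)

# Odds ratio from the biased table:
# cases 7 exposed / 6 non-exposed, controls 16 exposed / 50 non-exposed.
biased_or = (7 * 50) / (6 * 16)

print(round(expected_exposed_controls, 1))  # about 9.9, versus 16 observed
print(round(biased_or, 2))  # 3.65
```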

Conceptually, the bias here might be represented by the table below, in which the large ladle indicates that non-diseased subjects with the exposure were over-sampled.

In this table the greater tendency to enroll non-diseased controls who had been exposed is represented by a larger ladle in that cell.

Selection Bias

| | Diseased | Controls |
|---|---|---|
| Exposed | ladle | larger ladle |
| Non-exposed | ladle | ladle |

Depending on which category is over-sampled or under-sampled, this type of bias can result in either an underestimate or an overestimate of the true association.
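The direction of the bias follows from the cross-product form of the odds ratio: if each exposure/disease cell is sampled with its own sampling fraction, the observed odds ratio equals the true odds ratio multiplied by the cross-product of those fractions. The sketch below illustrates this with hypothetical sampling fractions (0.1 for fairly sampled cells, 0.2 for the one over-sampled cell); the function name `bias_factor` and these values are assumptions made for illustration only.

```python
def bias_factor(s_exp_dis, s_exp_non, s_unexp_dis, s_unexp_non):
    """Factor by which the observed odds ratio differs from the true one,
    given the sampling fraction of each exposure/disease cell."""
    return (s_exp_dis * s_unexp_non) / (s_exp_non * s_unexp_dis)

base = 0.1  # hypothetical sampling fraction for fairly sampled cells
big = 0.2   # hypothetical sampling fraction for the over-sampled cell

scenarios = {
    "fair sampling": (base, base, base, base),
    "exposed diseased over-sampled": (big, base, base, base),
    "exposed non-diseased over-sampled": (base, big, base, base),
    "non-exposed diseased over-sampled": (base, base, big, base),
    "non-exposed non-diseased over-sampled": (base, base, base, big),
}

for label, fractions in scenarios.items():
    factor = bias_factor(*fractions)
    print(f"{label}: observed OR = {factor:.2f} x true OR")
```

Over-sampling either cell on the main diagonal (exposed diseased or non-exposed non-diseased) inflates the odds ratio, while over-sampling either off-diagonal cell shrinks it. Equal fractions in all four cells leave it unchanged.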

Example:

A hypothetical case-control study was conducted to determine whether lower socioeconomic status (the exposure) is associated with a higher risk of cervical cancer (the outcome). The "cases" consisted of 250 women with cervical cancer who were referred to Massachusetts General Hospital (MGH) for treatment. They were referred from all over the state. The cases were asked a series of questions relating to socioeconomic status (household income, employment, education, etc.). The investigators identified control subjects by going door-to-door in the community around MGH from 9:00 AM to 5:00 PM. Many residents were not home, but the investigators persisted and eventually enrolled enough controls. The problem is that the controls were selected by a different mechanism than the cases (immediate neighborhood for controls compared to statewide for cases), AND the door-to-door recruitment mechanism may have tended to select individuals of different socioeconomic status, since women who were at home may have been somewhat more likely to be unemployed. In other words, the controls were more likely to be enrolled (selected) if they had the exposure of interest (lower socioeconomic status).

### The "Would" Criterion

Epidemiologists sometimes use the "would" criterion to test for the possibility of selection bias; they ask, "If a control had had the disease, would they have been likely to be enrolled as a case?" If the answer is 'yes', then selection bias is unlikely.

## 2. Self-Selection Bias

Selection bias can be introduced into case-control studies with low response or participation rates if the likelihood of responding or participating is related to both the exposure and the outcome.

Table 10-4 in the Aschengrau and Seage text shows a scenario with differential participation rates in which diseased subjects who had the exposure had a participation rate of 80%, while the other three categories had participation rates of 60%. This might be depicted as follows:

In this contingency table greater participation by subjects who had the exposure and the outcome of interest is represented by the larger ladle in that cell.

Selection Bias

| | Diseased | Non-diseased |
|---|---|---|
| Exposed | larger ladle | ladle |
| Non-exposed | ladle | ladle |
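The effect of these participation rates on the odds ratio can be worked out directly. As a sketch, let $a$, $b$, $c$, and $d$ denote the numbers of exposed diseased, exposed non-diseased, non-exposed diseased, and non-exposed non-diseased subjects who were eligible (these symbols are introduced here for illustration and are not taken from the text). With 80% participation in the first category and 60% in the other three:

$$
\mathrm{OR}_{\text{observed}} = \frac{(0.8\,a)(0.6\,d)}{(0.6\,b)(0.6\,c)} = \frac{0.8}{0.6} \times \frac{a\,d}{b\,c} \approx 1.33 \times \mathrm{OR}_{\text{true}}
$$

Under these assumptions the observed odds ratio overstates the true odds ratio by about one third, whatever the underlying counts are.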

Question: Can self-selection bias occur in prospective cohort studies? Reflect on your answer.

## 3. Differential Surveillance, Referral, or Diagnosis of Subjects

Aschengrau and Seage give an example in which investigators conducted a case-control study to determine whether use of oral contraceptives increased the risk of thromboembolism. The case group consisted of women who had been admitted to the hospital for venous thromboembolism. The controls were women of similar age who had been hospitalized for unrelated problems at the same hospitals. The interviews indicated that 70% of the cases used oral contraceptives, but only 20% of the controls used them. The odds ratio was 10.2, but in retrospect, this was an overestimate. There had been reports suggesting such an association. As a result, health care providers were vigilant about their patients on oral contraceptives and were more likely to admit them to the hospital if they developed venous thrombosis or any signs or symptoms suspicious for thromboembolism. Consequently, the study had a tendency to over-sample women who had both the exposure and the outcome of interest.

Over-sampling of women with the exposure and the outcome is represented by a larger ladle for that category.

Selection Bias

| | Diseased | Non-diseased |
|---|---|---|
| Exposed | larger ladle | ladle |
| Non-exposed | ladle | ladle |

Aschengrau and Seage suggest that this selection bias could have been minimized by more restrictive case selection criteria, such that only women who clearly required hospitalization would be enrolled in the case group.