Summarizing Data in a Cohort Study
Investigators often use contingency tables to summarize data. In essence, the table is a matrix that displays the combinations of exposure and outcome status. If one were summarizing the results of a study with two possible exposure categories and two possible outcomes, one would use a "two by two" table in which the numbers in the four cells indicate the number of subjects within each of the four possible categories of exposure and disease status.
For example, consider data from a retrospective cohort study conducted by the Massachusetts Department of Public Health (MDPH) during an investigation of an outbreak of Giardia lamblia in Milton, MA in 2003. The descriptive epidemiology indicated that almost all of the cases belonged to a country club in Milton. The club had an adult swimming pool and a wading pool for toddlers, and the investigators suspected that the outbreak may have occurred when an infected child with a dirty diaper contaminated the water in the kiddy pool. This hypothesis was tested by conducting a retrospective cohort study. The cases of Giardia lamblia had already occurred and had been reported to MDPH via the infectious disease surveillance system (for more information on surveillance, see the Surveillance module). The investigation focused on an obvious cohort: the 479 members of the country club who agreed to answer the MDPH questionnaire. The questionnaire asked, among many other things, whether the subject had been exposed to the kiddy pool. The incidence of subsequent Giardia infection was then compared between subjects who had been exposed to the kiddy pool and those who had not.
The table below summarizes the findings. A total of 479 subjects completed the questionnaire, and 124 of them indicated that they had been exposed to the kiddy pool. Of these, 16 subsequently developed Giardia infection, but 108 did not. Among the 355 subjects who denied kiddy pool exposure, 14 developed Giardia infection, and the other 341 did not.
| Swam in Kiddy Pool? | Giardia | No Giardia | Total | Cumulative Incidence |
|---|---|---|---|---|
| Yes | 16 | 108 | 124 | 16/124 = 12.9% |
| No | 14 | 341 | 355 | 14/355 = 3.9% |
Organizing the data this way makes it easier to compute the cumulative incidence in each group (12.9% and 3.9%, respectively). The incidence in each group provides an estimate of risk, and the groups can be compared in order to estimate the magnitude of association. (This will be addressed in much greater detail in the module on Measures of Association.) One way of quantifying the association is to calculate the relative risk (also called the risk ratio), i.e., to divide the incidence in the exposed group by the incidence in the unexposed group. In this case, the risk ratio is (12.9% / 3.9%) = 3.3. This suggests that subjects who swam in the kiddy pool had 3.3 times the risk of getting Giardia infection compared to those who did not, which supports the hypothesis that the kiddy pool was the source.
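The arithmetic above can be sketched in a few lines of Python. This is only an illustration using the counts reported in the text (16 cases among 124 exposed subjects, 14 cases among 355 unexposed subjects); the function and variable names are made up for this example and are not part of the MDPH analysis.

```python
# Counts reported in the text: exposed = swam in the kiddy pool.
exposed_cases, exposed_total = 16, 124
unexposed_cases, unexposed_total = 14, 355


def cumulative_incidence(cases, total):
    """Proportion of a group that developed the outcome (hypothetical helper)."""
    return cases / total


ci_exposed = cumulative_incidence(exposed_cases, exposed_total)
ci_unexposed = cumulative_incidence(unexposed_cases, unexposed_total)

# Risk ratio = incidence in the exposed / incidence in the unexposed.
risk_ratio = ci_exposed / ci_unexposed

print(f"Cumulative incidence, exposed:   {ci_exposed:.1%}")
print(f"Cumulative incidence, unexposed: {ci_unexposed:.1%}")
print(f"Risk ratio: {risk_ratio:.1f}")
```

Rounded to one decimal place, the result agrees with the 3.3 computed by hand above.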
Unanswered Questions
If the kiddy pool was the source of contamination responsible for this outbreak, why was it that:
- Only 16 people exposed to the kiddy pool developed the infection?
- There were 14 Giardia cases among people who denied exposure to the kiddy pool?
Before you look at the answer, think about it and try to come up with a possible explanation.
Possible Pitfall: Contingency tables can be oriented in several ways, and this can cause confusion when calculating measures of association.
There is no standard rule about how to set up contingency tables, and you will see them set up in different ways.
- With exposure status in rows and outcome status in columns
- With exposure status in columns and outcome status in rows
- With exposed group first followed by non-exposed group
- With non-exposed group first followed by exposed group
If you aren't careful, these different orientations can result in errors in calculating measures of association. One way to avoid confusion is to always set up your contingency tables in the same way. For example, in these learning modules the contingency tables almost always show outcome status in columns, with subjects who have the outcome of interest to the left of those who do not, and exposure status in rows, with the exposed (or most exposed) group in the row above the unexposed (or less exposed) group.
The table below illustrates this arrangement.
| | Those With the Outcome | Those Without the Outcome | Total |
|---|---|---|---|
| Exposed (or most exposed) | | | |
| Non-exposed (or least exposed) | | | |
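To make the orientation pitfall concrete, here is a minimal Python sketch using the Giardia counts from earlier in this section. The `risk_ratio` function and its `exposure_in_rows` and `exposed_first` flags are hypothetical names chosen for this illustration. The sketch assumes the group with the outcome is always listed first along the outcome axis, and it rearranges any table into the standard layout before applying the formula.

```python
def risk_ratio(table, exposure_in_rows=True, exposed_first=True):
    """Risk ratio from a 2x2 table of counts (nested lists).

    Assumes subjects WITH the outcome come first along the outcome axis.
    The two flags describe how the caller's table is oriented.
    """
    # If exposure is in columns, transpose so that exposure is in rows.
    if not exposure_in_rows:
        table = [list(column) for column in zip(*table)]
    # If the non-exposed group comes first, swap the two rows.
    if not exposed_first:
        table = table[::-1]
    (a, b), (c, d) = table  # a, b = exposed; c, d = non-exposed
    return (a / (a + b)) / (c / (c + d))


standard = [[16, 108], [14, 341]]    # exposed row first, outcome in columns
transposed = [[16, 14], [108, 341]]  # exposure in columns, outcome in rows
flipped = [[14, 341], [16, 108]]     # non-exposed row first

# All three give the same risk ratio once the orientation is declared.
print(round(risk_ratio(standard), 2))
print(round(risk_ratio(transposed, exposure_in_rows=False), 2))
print(round(risk_ratio(flipped, exposed_first=False), 2))

# Treating the transposed table as if it were in the standard layout
# silently produces a different, incorrect value.
print(round(risk_ratio(transposed), 2))
```

The point of the sketch is that the formula itself never changes; only the bookkeeping about which cell is which does, which is why keeping a single, consistent layout is the simpler safeguard.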