Comparing More Than 2 Proportions

In many data sets the categories are ordered, so you might expect the proportions to increase or decrease steadily with the group number. Let's look at a data set from a case-control study of esophageal cancer in Ille-et-Vilaine, France, available in R under the name "esoph".

Variable	Description
agegp	Age divided into the following categories: 25-34, 35-44, 45-54, 55-64, 65-74, 75+
alcgp	Alcohol consumption divided into the following categories: 0-39g/day, 40-79, 80-119, 120+
tobgp	Tobacco consumption divided into the following categories: 0-9g/day, 10-19, 20-29, 30+
ncases	Number of cases
ncontrols	Number of controls

These data do not contain the exact age of each individual in the study, but rather the age group. Similarly, there are groups for alcohol and tobacco use. One question of interest is whether the occurrence of esophageal cancer shows any trend as age, tobacco use, or alcohol use increases. The commands below refer to the columns by name, which assumes the data frame has been attached first, for example with attach(esoph).
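A minimal sketch for a first look at the data, using with() so that nothing needs to be attached. It only relies on the esoph data frame that ships with R; the printed counts are not reproduced here because they depend on the installed copy of the data.

```r
# esoph ships with base R in the 'datasets' package
data(esoph)

# One row per observed combination of age, alcohol and tobacco group
str(esoph)

# The grouping variables are stored as ordered factors,
# which is what makes a trend test meaningful later on
is.ordered(esoph$agegp)
levels(esoph$agegp)

# with() evaluates an expression using the data frame's columns,
# so attach(esoph) is not required for a one-off command
with(esoph, table(agegp, ncases))
```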

> table(agegp, ncases)

       ncases
agegp    0  1  2  3  4  5  6  8  9 17
  25-34 14  1  0  0  0  0  0  0  0  0
  35-44 10  2  2  1  0  0  0  0  0  0
  45-54  3  2  2  2  3  2  2  0  0  0
  55-64  0  0  2  4  3  2  2  1  2  0
  65-74  1  4  2  2  2  2  1  0  0  1
  75+    1  7  3  0  0  0  0  0  0  0…

To compare k (> 2) proportions, there is a test based on the normal approximation. It consists of the calculation of a weighted sum of squared deviations between the observed proportions in each group and the overall proportion for all groups. The test statistic has an approximate χ² distribution with k − 1 degrees of freedom.

To use prop.test on a table with multiple categories or groups, we need to convert it to a vector of "successes" and a vector of "trials", one for each group. In the esoph data, each age group has multiple levels of alcohol and tobacco doses, so we need to total the number of cases and controls for each group. First, what does the following plot show?

> boxplot(ncases/(ncases + ncontrols) ~ agegp)

To total the number of cases and the number of observations in each age group, we use the tapply command:

> case.vector = tapply(ncases, agegp, sum)

> total.vector = tapply(ncontrols+ncases, agegp, sum)

> case.vector

25-34 35-44 45-54 55-64 65-74   75+
    1     9    46    76    55    13

> total.vector
25-34 35-44 45-54 55-64 65-74   75+
  117   208   259   318   216    57
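The same per-group totals can also be built without attaching the data frame. The sketch below is one possible alternative, not the method used on this page: it wraps tapply in with() and cross-checks the result against aggregate(). The name by.age is introduced here only for illustration, and the printed totals may not match the ones above exactly if the installed R version ships a revised copy of the data set.

```r
data(esoph)

# Sum cases and totals within each age group, without attach()
case.vector  <- with(esoph, tapply(ncases, agegp, sum))
total.vector <- with(esoph, tapply(ncases + ncontrols, agegp, sum))

# aggregate() returns the same per-group sums as a data frame
# (by.age is a hypothetical name used only in this sketch)
by.age <- aggregate(cbind(ncases, ncontrols) ~ agegp, data = esoph, FUN = sum)
by.age$total <- by.age$ncases + by.age$ncontrols
by.age
```

Either form produces one "successes" value and one "trials" value per age group, which is the shape prop.test expects.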

After this it is easy to perform the test:

> prop.test(case.vector, total.vector)

    6-sample test for equality of proportions without continuity correction

data:  case.vector out of total.vector
X-squared = 68.3825, df = 5, p-value = 2.224e-13
alternative hypothesis: two.sided
sample estimates:
     prop 1      prop 2      prop 3      prop 4      prop 5      prop 6
0.008547009 0.043269231 0.177606178 0.238993711 0.254629630 0.228070175
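The weighted sum of squared deviations described earlier can be checked by hand. This sketch copies the per-group counts from the output above (the names cases, totals, p.hat and p.pool are introduced here for illustration) and weights each squared deviation by the group size, dividing by the pooled variance term.

```r
# Per-age-group counts copied from the output above
cases  <- c(1, 9, 46, 76, 55, 13)
totals <- c(117, 208, 259, 318, 216, 57)

p.hat  <- cases / totals            # observed proportion in each group
p.pool <- sum(cases) / sum(totals)  # overall proportion across all groups

# Weighted sum of squared deviations from the overall proportion
x2 <- sum(totals * (p.hat - p.pool)^2) / (p.pool * (1 - p.pool))
df <- length(cases) - 1
p.value <- pchisq(x2, df, lower.tail = FALSE)

c(X.squared = x2, df = df, p.value = p.value)

# Should agree with the statistic reported by the built-in test
prop.test(cases, totals)$statistic
```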

H₀: The proportion of cases is the same in each age group: p₁ = p₂ = p₃ = p₄ = p₅ = p₆
Hₐ: The proportion of cases is not the same in each age group: at least one pᵢ differs from the others

Conclusion: When testing the null hypothesis that the proportion of cases is the same for each age group, we reject the null hypothesis (χ₅² = 68.38, p-value = 2.22e-13). The sample estimates of the proportion of cases in each age group are as follows:

Age group	25-34	35-44	45-54	55-64	65-74	75+
Proportion of cases	0.0085	0.043	0.178	0.239	0.255	0.228

You can test for a linear trend in the proportions using prop.trend.test. The null hypothesis is that there is no trend in the proportions; the alternative is that the proportion increases or decreases linearly as you move up through the categories. Note: this test only makes sense when the categorical variable is ordinal. You would not use it for, say, political party affiliation or eye color.

> prop.trend.test(case.vector, total.vector)

    Chi-squared Test for Trend in Proportions

data:  case.vector out of total.vector ,
 using scores: 1 2 3 4 5 6
X-squared = 57.1029, df = 1, p-value = 4.136e-14
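To see where the trend statistic comes from, the sketch below recomputes it by hand in the form often called the Cochran-Armitage trend test, using the default scores 1 to 6 shown in the output. The counts are copied from this page; the variable names are illustrative, and the formula is offered as a check on the built-in function rather than as a description of its internals.

```r
cases  <- c(1, 9, 46, 76, 55, 13)
totals <- c(117, 208, 259, 318, 216, 57)
score  <- seq_along(cases)                    # default scores 1, 2, ..., 6

p.pool <- sum(cases) / sum(totals)            # overall proportion of cases
s.bar  <- sum(totals * score) / sum(totals)   # mean score weighted by group size

# Squared covariation between cases and scores, scaled by its null variance
num <- sum(cases * (score - s.bar))^2
den <- p.pool * (1 - p.pool) * sum(totals * (score - s.bar)^2)
x2.trend <- num / den
p.value  <- pchisq(x2.trend, df = 1, lower.tail = FALSE)

c(X.squared = x2.trend, p.value = p.value)

# Should agree with the built-in test
prop.trend.test(cases, totals)$statistic
```

Because the statistic depends only on centered scores, any equally spaced set of scores (for example 30, 40, ..., 80 passed through the score argument) gives the same result as 1 to 6.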

H₀: There is no linear trend in the proportion of cases across age groups
Hₐ: There is a linear trend in the proportion of cases across age groups

We reject the null hypothesis that there is no linear trend in the proportion of cases across age groups (χ₁² = 57.10, p-value = 4.14e-14). The sample estimates of the proportion of cases in each age group are as follows:

Age group	25-34	35-44	45-54	55-64	65-74	75+
Proportion of cases	0.0085	0.043	0.178	0.239	0.255	0.228

There appears to be an increasing trend in the proportion of cases with age group: the proportion rises from the 25-34 group through the 65-74 group and dips slightly in the 75+ group.
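One way to judge how well a straight line describes these proportions is to compare the two tests. The sketch below subtracts the trend statistic (1 degree of freedom) from the overall statistic (5 degrees of freedom) and refers the remainder to a χ² distribution with 4 degrees of freedom. This decomposition is a common companion to the trend test but is not part of the analysis on this page, so treat it as an optional check; the names x2.rest, df.rest and p.rest are illustrative.

```r
cases  <- c(1, 9, 46, 76, 55, 13)
totals <- c(117, 208, 259, 318, 216, 57)

overall <- prop.test(cases, totals)        # equality of all six proportions
trend   <- prop.trend.test(cases, totals)  # linear trend with scores 1..6

# Part of the overall chi-squared not accounted for by a straight line
x2.rest <- unname(overall$statistic - trend$statistic)
df.rest <- unname(overall$parameter - trend$parameter)
p.rest  <- pchisq(x2.rest, df.rest, lower.tail = FALSE)

c(X.squared = x2.rest, df = df.rest, p.value = p.rest)
```

A small p-value here would suggest that the proportions do not follow a straight line exactly, which would be consistent with the dip in the 75+ group.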
