Comparing More Than 2 Proportions
In many data sets the categories are ordered, so you would expect the proportions to show an increasing or decreasing trend with the group number. Let's look at a data set from a case-control study of esophageal cancer in Ille-et-Vilaine, France, available in R under the name "esoph".
Variable | Description |
---|---|
agegp | Age divided into the following categories: 25-34, 35-44, 45-54, 55-64, 65-74, 75+ |
alcgp | Alcohol consumption divided into the following categories: 0-39g/day, 40-79, 80-119, 120+ |
tobgp | Tobacco consumption divided into the following categories: 0-9g/day, 10-19, 20-29, 30+ |
ncases | Number of cases |
ncontrols | Number of controls |
These data do not contain the exact age of each individual in the study, but rather the age group. Similarly, there are groups for alcohol and tobacco use. One question of interest is whether there is any trend of occurrence of esophageal cancer as age increases, or the level of tobacco or alcohol use increases.
> table(agegp, ncases)
       ncases
agegp    0  1  2  3  4  5  6  8  9 17
  25-34 14  1  0  0  0  0  0  0  0  0
  35-44 10  2  2  1  0  0  0  0  0  0
  45-54  3  2  2  2  3  2  2  0  0  0
  55-64  0  0  2  4  3  2  2  1  2  0
  65-74  1  4  2  2  2  2  1  0  0  1
  75+    1  7  3  0  0  0  0  0  0  0…
To compare k (> 2) proportions there is a test based on the normal approximation. It consists of the calculation of a weighted sum of squared deviations between the observed proportions in each group and the overall proportion for all groups. The test statistic has an approximate χ^{2} distribution with k − 1 degrees of freedom.
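As a rough illustration of the statistic just described, the sketch below computes the weighted sum of squared deviations by hand. The counts are the per-age-group case and total counts that appear later in this section, entered manually so the block runs on its own; the variable names (cases, totals, p.pooled, chisq) are illustrative only.

```r
# Cases and total observations per age group (values taken from the
# tapply output shown later in this section)
cases  <- c(1, 9, 46, 76, 55, 13)
totals <- c(117, 208, 259, 318, 216, 57)

k        <- length(cases)             # number of groups
p.group  <- cases / totals            # observed proportion in each group
p.pooled <- sum(cases) / sum(totals)  # overall proportion across all groups

# Weighted sum of squared deviations from the overall proportion;
# each group's weight is n_i / (p (1 - p))
chisq <- sum(totals * (p.group - p.pooled)^2) / (p.pooled * (1 - p.pooled))

# Upper-tail probability of a chi-squared distribution with k - 1 df
p.value <- pchisq(chisq, df = k - 1, lower.tail = FALSE)

chisq    # about 68.38
p.value
```

With more than two groups, prop.test applies no continuity correction, so this hand computation should agree with the X-squared value it reports.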
To use prop.test on a table with multiple categories or groups, we need to convert it to a vector of "successes" and a vector of "trials", one for each group. In the esoph data, each age group has multiple levels of alcohol and tobacco doses, so we need to total the number of cases and controls for each group. First, what does the following plot show?
> boxplot(ncases/(ncases + ncontrols) ~ agegp)
To total the number of cases and the number of observations in each age group, we use the tapply command:
> case.vector = tapply(ncases, agegp, sum)
> total.vector = tapply(ncontrols+ncases, agegp, sum)
> case.vector
25-34 35-44 45-54 55-64 65-74 75+
1 9 46 76 55 13
> total.vector
25-34 35-44 45-54 55-64 65-74 75+
117 208 259 318 216 57
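The commands above refer to agegp, ncases and ncontrols directly, which presumably assumes that esoph has been attached beforehand. A minimal sketch that avoids this by using with() follows; the totals you obtain may differ from those printed here if your R installation ships a different revision of the dataset.

```r
# Load the built-in dataset (ships with R in the datasets package)
data(esoph)

# Sum cases, and cases + controls, within each age group
case.vector  <- with(esoph, tapply(ncases, agegp, sum))
total.vector <- with(esoph, tapply(ncases + ncontrols, agegp, sum))

# Observed proportion of cases per age group
round(case.vector / total.vector, 3)
```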
After this it is easy to perform the test:
> prop.test(case.vector, total.vector)
6-sample test for equality of proportions without continuity correction
data: case.vector out of total.vector
X-squared = 68.3825, df = 5, p-value = 2.224e-13
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4 prop 5 prop 6
0.008547009 0.043269231 0.177606178 0.238993711 0.254629630 0.228070175
- H_{0}: The proportion of cases is the same in each age group: p_{1} = p_{2} = p_{3} = p_{4} = p_{5} = p_{6}
- H_{a}: The proportion of cases is not the same in each age group: at least one p_{i} is different from the others
Conclusion: We reject the null hypothesis that the proportion of cases is the same in each age group (χ_{5}^{2} = 68.38, p-value = 2.22e-13). The sample estimates of the proportion of cases in each age group are as follows:
Age group | 25-34 | 35-44 | 45-54 | 55-64 | 65-74 | 75+ |
---|---|---|---|---|---|---|
Proportion of cases | 0.0085 | 0.043 | 0.178 | 0.239 | 0.255 | 0.228 |
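If you need the reported numbers programmatically, the object returned by prop.test can be stored and its components extracted. The sketch below re-enters the counts by hand so that it runs on its own; the names res and est are illustrative.

```r
# Cases and totals per age group, entered by hand from the output above
cases  <- c("25-34" = 1, "35-44" = 9, "45-54" = 46,
            "55-64" = 76, "65-74" = 55, "75+" = 13)
totals <- c(117, 208, 259, 318, 216, 57)

res <- prop.test(cases, totals)

res$statistic   # X-squared
res$parameter   # degrees of freedom
res$p.value

# Sample proportions, relabelled by age group instead of "prop 1", "prop 2", ...
est <- setNames(unname(res$estimate), names(cases))
round(est, 4)
```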
You can test for a linear trend in the proportions using prop.trend.test. The null hypothesis is that there is no trend in the proportions; the alternative is that there is a linear increase or decrease in the proportion as you move up through the categories. Note: you would only want to perform this test if your categorical variable is ordinal. You would not do this for, say, political party affiliation or eye color.
> prop.trend.test(case.vector, total.vector)
Chi-squared Test for Trend in Proportions
data: case.vector out of total.vector ,
using scores: 1 2 3 4 5 6
X-squared = 57.1029, df = 1, p-value = 4.136e-14
- H_{0}: There is no linear trend in the proportion of cases across age groups
- H_{a}: There is a linear trend in the proportion of cases across age groups
We reject the null hypothesis that there is no linear trend in the proportion of cases across age groups (χ_{1}^{2} = 57.10, p-value = 4.14e-14). The sample estimates of the proportion of cases in each age group are as follows:
Age group | 25-34 | 35-44 | 45-54 | 55-64 | 65-74 | 75+ |
---|---|---|---|---|---|---|
Proportion of cases | 0.0085 | 0.043 | 0.178 | 0.239 | 0.255 | 0.228 |
There appears to be an increasing linear trend in the proportion of cases as the age group increases.
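To make the trend test less of a black box, the following sketch reproduces its statistic from the counts and the default scores 1 to 6, using the usual closed form for a chi-squared test for trend in proportions. This is an illustration with hand-entered counts and made-up variable names, not necessarily how prop.trend.test computes the value internally.

```r
# Cases and totals per age group, entered by hand from the output above
cases  <- c(1, 9, 46, 76, 55, 13)
totals <- c(117, 208, 259, 318, 216, 57)
score  <- seq_along(cases)           # default scores 1, 2, ..., 6

N <- sum(totals)
p <- sum(cases) / N                  # overall proportion of cases

# Squared difference between the observed and expected score-weighted case counts
num <- (sum(score * cases) - p * sum(score * totals))^2
# Variance of that difference under the null hypothesis of no trend
den <- p * (1 - p) * (sum(totals * score^2) - sum(totals * score)^2 / N)

trend.chisq <- num / den
trend.chisq                                        # about 57.10
pchisq(trend.chisq, df = 1, lower.tail = FALSE)    # p-value on 1 df

# Compare with the built-in function
prop.trend.test(cases, totals)$statistic
```

Passing a different score argument to prop.trend.test (for example, approximate age-group midpoints) changes what "linear" means, so the choice of scores is worth stating alongside the result.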