Comparing More Than 2 Proportions


In many data sets the categories are ordered, so you might expect the proportions to increase or decrease with the group number. Let's look at a data set from a case-control study of esophageal cancer in Ile-et-Vilaine, France, available in R under the name "esoph".

Variable     Description
agegp        Age divided into the following categories: 25-34, 35-44, 45-54, 55-64, 65-74, 75+
alcgp        Alcohol consumption divided into the following categories: 0-39g/day, 40-79, 80-119, 120+
tobgp        Tobacco consumption divided into the following categories: 0-9g/day, 10-19, 20-29, 30+
ncases       Number of cases
ncontrols    Number of controls

 These data do not contain the exact age of each individual in the study, but rather the age group. Similarly, there are groups for alcohol and tobacco use. One question of interest is whether there is any trend of occurrence of esophageal cancer as age increases, or the level of tobacco or alcohol use increases.

> attach(esoph)   # make the esoph columns available by name
> table(agegp, ncases)
       ncases
agegp    0  1  2  3  4  5  6  8  9 17
  25-34 14  1  0  0  0  0  0  0  0  0
  35-44 10  2  2  1  0  0  0  0  0  0
  45-54  3  2  2  2  3  2  2  0  0  0
  55-64  0  0  2  4  3  2  2  1  2  0
  65-74  1  4  2  2  2  2  1  0  0  1
  75+    1  7  3  0  0  0  0  0  0  0…

To compare k (> 2) proportions there is a test based on the normal approximation. It computes a weighted sum of squared deviations between the observed proportion in each group and the overall proportion for all groups. The test statistic has an approximate χ² distribution with k − 1 degrees of freedom.

 

To use prop.test on a table with multiple categories or groups, we need to convert it to a vector of "successes" and a vector of "trials", one for each group. In the esoph data, each age group has multiple levels of alcohol and tobacco doses, so we need to total the number of cases and controls for each group. First, what does the following plot show?

>  boxplot(ncases/(ncases + ncontrols) ~ agegp)

To total the number of cases and the number of observations in each age group, we use the tapply command:

> case.vector = tapply(ncases, agegp, sum)

> total.vector = tapply(ncontrols+ncases, agegp, sum)

> case.vector

 

25-34 35-44 45-54 55-64 65-74   75+

    1     9    46    76    55    13

> total.vector

25-34 35-44 45-54 55-64 65-74   75+

  117   208   259   318   216    57
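
As a small illustration of what tapply is doing here, the sketch below applies the same two calls to a made-up data frame. The data frame `toy`, its column `grp`, and all of its values are hypothetical and are not taken from esoph; they only mimic its layout of several rows per group.

```r
# Hypothetical toy data: two rows per group, mimicking the esoph layout
toy <- data.frame(
  grp       = factor(c("a", "a", "b", "b", "c", "c")),
  ncases    = c(1, 2, 0, 4, 3, 3),
  ncontrols = c(9, 8, 10, 6, 7, 7)
)

# Sum the cases, and the cases plus controls, within each level of grp
toy.cases  <- with(toy, tapply(ncases, grp, sum))
toy.totals <- with(toy, tapply(ncases + ncontrols, grp, sum))

toy.cases                # one total per group, named by the factor levels
toy.totals
toy.cases / toy.totals   # proportion of cases in each group
```

Using with() instead of attach() is a design choice in this sketch: it keeps the column lookups local to each call.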

 

After this it is easy to perform the test:

> prop.test(case.vector, total.vector)

 

        6-sample test for equality of proportions without continuity correction

 

data:  case.vector out of total.vector

X-squared = 68.3825, df = 5, p-value = 2.224e-13

alternative hypothesis: two.sided

sample estimates:

     prop 1      prop 2      prop 3      prop 4      prop 5      prop 6

0.008547009 0.043269231 0.177606178 0.238993711 0.254629630 0.228070175

 

Conclusion: We reject the null hypothesis that the proportion of cases is the same in every age group (χ² = 68.38, df = 5, p-value = 2.22e-13). The sample estimates of the proportion of cases in each age group are as follows:

Age group     25-34    35-44    45-54    55-64    65-74    75+
Proportion    0.0085   0.043    0.178    0.239    0.255    0.228
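
The X-squared value reported by prop.test can be reproduced from the description given earlier: a weighted sum of squared deviations from the overall proportion. The following is a minimal sketch with the counts typed in from the case.vector and total.vector output; the names x, n, p.hat, p.pool and stat are ours, chosen only for illustration.

```r
# Counts copied from the case.vector and total.vector output
x <- c(1, 9, 46, 76, 55, 13)          # cases per age group
n <- c(117, 208, 259, 318, 216, 57)   # cases + controls per age group

p.hat  <- x / n            # observed proportion in each group
p.pool <- sum(x) / sum(n)  # overall proportion, all groups combined

# Weighted sum of squared deviations, scaled by the pooled binomial variance
stat <- sum(n * (p.hat - p.pool)^2) / (p.pool * (1 - p.pool))
stat

# Upper-tail p-value from the chi-squared distribution with k - 1 df
pchisq(stat, df = length(x) - 1, lower.tail = FALSE)

# The same statistic is Pearson's chi-squared on the 6 x 2 table of counts
chisq.test(cbind(cases = x, noncases = n - x))$statistic
```

Both computed statistics should agree with the X-squared value printed by prop.test, up to rounding.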

You can test for a linear trend in the proportions using prop.trend.test. The null hypothesis is that there is no trend in the proportions; the alternative is that there is a linear increase/decrease in the proportion as you go up/down in categories. Note: you would only want to perform this test if your categorical variable was an ordinal variable. You would not do this, for, say, political party affiliation or eye color.

> prop.trend.test(case.vector, total.vector)

        Chi-squared Test for Trend in Proportions

data:  case.vector out of total.vector ,

 using scores: 1 2 3 4 5 6

X-squared = 57.1029, df = 1, p-value = 4.136e-14

We reject the null hypothesis that there is no linear trend in the proportion of cases across age groups (χ² = 57.10, df = 1, p-value = 4.14e-14). The sample estimates of the proportion of cases in each age group are as follows:

Age group     25-34    35-44    45-54    55-64    65-74    75+
Proportion    0.0085   0.043    0.178    0.239    0.255    0.228
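
By default prop.trend.test scores the groups 1, 2, ..., k, as the "using scores" line in its output shows. Its score argument accepts other values. The sketch below is illustrative only: the age scores are rough group midpoints that we made up, and 80 (or 90) for the open-ended 75+ group is an arbitrary assumption.

```r
# Counts copied from the case.vector and total.vector output
x <- c(1, 9, 46, 76, 55, 13)
n <- c(117, 208, 259, 318, 216, 57)

# Default scores 1..6
t.default <- prop.trend.test(x, n)

# Equally spaced midpoints are a linear rescaling of 1..6,
# so the statistic should not change
t.mid <- prop.trend.test(x, n, score = c(30, 40, 50, 60, 70, 80))

# Unequally spaced (hypothetical) scores generally give a different statistic
t.uneven <- prop.trend.test(x, n, score = c(30, 40, 50, 60, 70, 90))

c(default   = unname(t.default$statistic),
  midpoints = unname(t.mid$statistic),
  uneven    = unname(t.uneven$statistic))
```

The point of the comparison is that only the relative spacing of the scores matters, not their units or origin.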

 

The proportion of cases tends to increase with age group, although the estimates level off in the oldest groups.
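
One way to gauge how much of the difference between groups the straight-line trend accounts for is to compare the two chi-squared statistics. The sketch below computes the trend statistic directly and subtracts it from the overall statistic; the remainder, on k − 2 degrees of freedom, is commonly read as a test of departure from a linear trend. This decomposition is our addition and is not part of the output shown in these notes; the names s, s.bar, trend, overall and departure are chosen only for illustration.

```r
# Counts copied from the case.vector and total.vector output
x <- c(1, 9, 46, 76, 55, 13)
n <- c(117, 208, 259, 318, 216, 57)
s <- seq_along(x)              # scores 1..6, as used by prop.trend.test

p.pool <- sum(x) / sum(n)      # overall proportion of cases
s.bar  <- sum(n * s) / sum(n)  # mean score, weighted by group size

# Trend statistic (1 df): squared cross-product of cases and centred scores,
# scaled by the pooled variance and the weighted spread of the scores
trend <- sum(x * (s - s.bar))^2 /
  (p.pool * (1 - p.pool) * sum(n * (s - s.bar)^2))

# Overall statistic (k - 1 df), as in prop.test
overall <- sum(n * (x / n - p.pool)^2) / (p.pool * (1 - p.pool))

# Remainder (k - 2 df): between-group variation not captured by the line
departure <- overall - trend
c(trend = trend, overall = overall, departure = departure)
pchisq(departure, df = length(x) - 2, lower.tail = FALSE)
```

A small p-value in the last line would suggest that the proportions do not follow a straight line exactly, even when the trend itself is strong.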