Analysis of Multi-way (r × c) Tables
Our previous analyses only allow us to compare two or more proportions with each other. However, we may be interested in seeing whether two factors are independent of one another, in which case we need to consider all levels of each factor. This leads to a table with r rows and c columns (where both r and c can be bigger than 2, depending on the number of levels). For example, in the esophageal cancer data, we may want to determine whether the effects of tobacco and alcohol intake are independent with respect to cancer outcome.
For the analysis of tables with more than two classes on both sides, you can use chisq.test or fisher.test, although the latter can be very computationally demanding if the cell counts are large and there are more than two rows or columns.
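As a minimal sketch of how the computational cost of fisher.test can be sidestepped, the p-value can be estimated by Monte Carlo simulation instead of exact enumeration. The sketch assumes the built-in esoph data frame and builds the tobacco-by-alcohol table of case counts with xtabs; the seed and the number of replicates B are arbitrary choices made for illustration.

```r
# Tobacco-by-alcohol table of cancer cases from the built-in esoph data
tab <- xtabs(ncases ~ tobgp + alcgp, data = esoph)

# Exact enumeration can be slow (or exhaust the workspace) for larger tables,
# so estimate the p-value from B simulated tables with the same margins
set.seed(704)  # arbitrary seed, only so the simulated p-value is reproducible
fisher.sim <- fisher.test(tab, simulate.p.value = TRUE, B = 10000)
fisher.sim$p.value
```

Because the p-value is simulated, it varies slightly from run to run unless the seed is fixed.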
Design
An r × c table can arise from several different sampling plans, and the notion of "no relation between rows and columns" is correspondingly different. The total in each row might be fixed in advance, and you would be interested in testing whether the distribution over columns is the same for each row, or vice versa if the column totals were fixed. It might also be the case that only the total number is chosen and the individuals are grouped randomly according to the row and column criteria. In the latter case, you would be interested in testing the hypothesis of statistical independence, that the probability of an individual falling into the ij-th cell is the product p_i × p_j of the marginal probabilities. However, mathematically the analysis of the table turns out to be the same in all cases!
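The independence hypothesis can be made concrete by computing the expected counts by hand: under independence, the expected count in cell ij is n × p_i × p_j, where p_i and p_j are estimated by the row and column proportions. The following is a sketch that assumes the built-in esoph data frame; the object names (tab, p.row, p.col, expected, X2) are chosen here for illustration.

```r
# Tobacco-by-alcohol table of cancer cases from the built-in esoph data
tab <- xtabs(ncases ~ tobgp + alcgp, data = esoph)

n     <- sum(tab)            # total number of cases
p.row <- rowSums(tab) / n    # estimated marginal probabilities for rows
p.col <- colSums(tab) / n    # estimated marginal probabilities for columns

# Expected counts under independence: n * p_i * p_j for every cell
expected <- n * outer(p.row, p.col)

# Pearson chi-square statistic and its degrees of freedom, (r - 1)(c - 1)
X2 <- sum((tab - expected)^2 / expected)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
p.value <- pchisq(X2, df = df, lower.tail = FALSE)
c(X2 = X2, df = df, p.value = p.value)
```

The statistic computed this way should match the one reported by chisq.test(tab), since no continuity correction is applied to tables larger than 2 × 2.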
Example
For the esoph data, test whether the effects of tobacco and alcohol intake are independent in terms of cancer case status.
First, construct a two-way contingency table for the data using the tapply command:
> tob.alc.table <- with(esoph, tapply(ncases, list(tobgp, alcgp), sum))
## note that the two grouping factors are passed together in a "list"
> tob.alc.table
         0-39g/day 40-79 80-119 120+
0-9g/day         9    34     19   16
10-19           10    17     19   12
20-29            5    15      6    7
30+              5     9      7   10
> chisq.test(tob.alc.table)   ## what can you conclude about independence?
In some cases, you may get a warning that the χ² approximation may be incorrect. R issues this warning when some cells have an expected count less than 5.
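One way to see where such a warning comes from is to inspect the expected counts stored in the test result. The sketch below re-enters the counts from the table above as a matrix so that it stands alone; using simulate.p.value = TRUE as a fallback is one possible option (fisher.test is another), and the seed and B are arbitrary.

```r
# Counts copied from the tobacco-by-alcohol table above
tab <- matrix(c( 9, 34, 19, 16,
                10, 17, 19, 12,
                 5, 15,  6,  7,
                 5,  9,  7, 10),
              nrow = 4, byrow = TRUE,
              dimnames = list(tobacco = c("0-9g/day", "10-19", "20-29", "30+"),
                              alcohol = c("0-39g/day", "40-79", "80-119", "120+")))

# Run the test quietly and look at the expected counts under independence
res <- suppressWarnings(chisq.test(tab))
round(res$expected, 2)
n.small <- sum(res$expected < 5)   # number of cells with expected count below 5
n.small

# One possible fallback: a simulated p-value that does not rely on the
# chi-square approximation
set.seed(704)  # arbitrary seed for reproducibility
res.sim <- chisq.test(tab, simulate.p.value = TRUE, B = 10000)
res.sim$p.value
```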
Perform an appropriate test to determine whether the effects of age and alcohol independently lead to the occurrence of cancer.
To summarize, let's review the tests for categorical data that we have looked at so far, where they are used, and what form the input data should be in.
Tests for categorical data
| | Single Proportion | Two Proportions | > 2 Proportions | Two-way tables | Input | Comments |
|---|---|---|---|---|---|---|
| prop.test | yes | yes | yes | no | vectors of successes and trials | accurate for large samples only |
| fisher.test | no | yes | no | yes | matrix or contingency table | exact test, but time-consuming for large tables |
| chisq.test | no | yes | no | yes | matrix or contingency table | expected cell frequencies should be > 5 for accuracy |
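To illustrate the different input forms, here is a small sketch that runs all three tests on the same two-group comparison. The counts are made up purely for illustration and do not come from the course data.

```r
# Made-up counts for two groups (illustration only, not course data)
successes <- c(15, 25)
trials    <- c(50, 50)

# prop.test takes vectors of successes and trials
pt <- prop.test(x = successes, n = trials)

# chisq.test and fisher.test take a matrix (or table) of counts:
# one row per group, columns for successes and failures
counts <- cbind(success = successes, failure = trials - successes)
ct <- chisq.test(counts)
ft <- fisher.test(counts)

c(prop.test = pt$p.value, chisq.test = ct$p.value, fisher.test = ft$p.value)
```

For two groups, prop.test and chisq.test should report the same p-value, since both apply the same continuity-corrected chi-square statistic by default; fisher.test will generally differ slightly because it is an exact test.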
For more on this topic, SPH offers BS 821: Categorical Data Analysis, taught by Prof. David Gagnon, or BS 852: Statistical Methods for Epidemiology, taught by Profs. Paola Sebastiani or Tim Heeren.
Chi-Square Test, Fishers Exact Test, and Cross Tabulations in R (R Tutorial 4.7) MarinStatsLectures
Relative Risk, Odds Ratio and Risk Difference (aka Attributable Risk) in R (R Tutorial 4.8) MarinStatsLectures
Reading:
- BS 704 R Notes 2.1, 2.3 and 2.5
Assignment:
- Homework 6 assigned
- Final Project due in 2 weeks