Analysis of Multi-way (r × c) Tables

Our previous analyses only allowed us to compare two or more proportions with each other. However, we may be interested in whether two factors are independent of one another, in which case we need to consider all levels of each factor. This leads to a table with r rows and c columns, where both r and c can be larger than 2, depending on the number of levels. For example, in the esophageal cancer data, we may want to determine whether the effects of tobacco and alcohol intake are independent with respect to cancer outcome.

For the analysis of tables with more than two classes on both sides, you can use chisq.test or fisher.test, although the latter can be very computationally demanding if the cell counts are large and there are more than two rows or columns.
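As a minimal sketch of the two calls on a table larger than 2 × 2, the following uses a small made-up 3 × 3 matrix (the name `tab` and its counts are hypothetical, not taken from the course data). When the exact computation is too slow, `fisher.test` can instead estimate the p-value by Monte Carlo simulation through its `simulate.p.value` argument; the number of replicates `B = 10000` is an arbitrary choice here.

```r
# Hypothetical 3 x 3 table of counts (illustrative only)
tab <- matrix(c(12,  5,  7,
                 8, 10,  6,
                 4,  9, 14),
              nrow = 3, byrow = TRUE,
              dimnames = list(exposure = c("low", "medium", "high"),
                              outcome  = c("A", "B", "C")))

chi.res <- chisq.test(tab)    # large-sample chi-squared test
fis.res <- fisher.test(tab)   # exact test; feasible because the counts are small

# Monte Carlo version of the exact test for tables where the exact
# computation is too demanding; B is the number of simulated tables
set.seed(1)
fis.sim <- fisher.test(tab, simulate.p.value = TRUE, B = 10000)

# Compare the three p-values side by side
c(chisq = chi.res$p.value,
  fisher = fis.res$p.value,
  simulated = fis.sim$p.value)
```

With small tables like this one the three p-values should be broadly similar; the simulated value changes slightly with the seed.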

Design

An r × c table can arise from several different sampling plans, and the notion of "no relation between rows and columns" is correspondingly different. The total in each row might be fixed in advance, and you would be interested in testing whether the distribution over columns is the same for each row, or vice versa if the column totals were fixed. It might also be the case that only the total number is chosen and the individuals are grouped randomly according to the row and column criteria. In the latter case, you would be interested in testing the hypothesis of statistical independence, that the probability of an individual falling into the ij-th cell is the product p_i × p_j of the marginal probabilities. However, mathematically the analysis of the table turns out to be the same in all cases!
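Whatever the sampling plan, the expected count in cell (i, j) under the null hypothesis is (row i total × column j total) / grand total, and the chi-squared statistic sums (observed - expected)^2 / expected over all cells, with (r - 1)(c - 1) degrees of freedom. A minimal sketch of that arithmetic on a made-up 2 × 3 table (the name `obs` and its counts are hypothetical), checked against what `chisq.test` reports:

```r
# Hypothetical 2 x 3 table of observed counts
obs <- matrix(c(20, 30, 50,
                30, 30, 40),
              nrow = 2, byrow = TRUE)

n <- sum(obs)

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(obs), colSums(obs)) / n

# Pearson chi-squared statistic, degrees of freedom, and p-value
X2 <- sum((obs - expected)^2 / expected)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)
p.value <- pchisq(X2, df, lower.tail = FALSE)

# chisq.test performs the same computation
# (no continuity correction is applied outside the 2 x 2 case)
res <- chisq.test(obs)
all.equal(unname(res$statistic), X2)
all.equal(res$p.value, p.value)
```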

Example

For the esoph data, test whether the effects of tobacco and alcohol intake are independent in terms of cancer case status.

First, construct a two-way contingency table for the data using the tapply command (the variables are referenced by name, so the esoph data frame must be attached):

> tob.alc.table<-tapply(ncases,list(tobgp,alcgp),sum)

## notice the grouping using "list"

> tob.alc.table

         0-39g/day 40-79 80-119 120+
0-9g/day         9    34     19   16
10-19           10    17     19   12
20-29            5    15      6    7
30+              5     9      7   10

> chisq.test(tob.alc.table) ## what can you conclude about independence?
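The same table can also be built without relying on attached variables. The following is a sketch using `xtabs` on the built-in `esoph` data frame (the object names `tab2` and `res` are introduced here for illustration only), followed by a look at the individual pieces of the test result:

```r
# esoph ships with R in the datasets package
# Sum the number of cases within each tobacco x alcohol combination
tab2 <- xtabs(ncases ~ tobgp + alcgp, data = esoph)
tab2

# Chi-squared test of independence between rows and columns
# (R may print a warning if some expected counts are small)
res <- chisq.test(tab2)

res$statistic   # Pearson chi-squared statistic
res$parameter   # degrees of freedom: (4 - 1) * (4 - 1) = 9
res$p.value     # p-value for the null hypothesis of independence
```

A small p-value would be evidence against independence of the row and column classifications; a large one means the data are compatible with independence.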

In some cases, you may get a warning that the χ² approximation may be incorrect, which is triggered when some cells have an expected count less than 5.
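One way to follow up on such a warning is to inspect the expected counts stored in the test result and, if some fall below 5, fall back on a simulated p-value. A sketch, building the table from the built-in `esoph` data (the names `tab`, `res`, and `res.sim` are illustrative, and `B = 10000` replicates is an arbitrary choice):

```r
# Cases cross-classified by tobacco and alcohol group
tab <- xtabs(ncases ~ tobgp + alcgp, data = esoph)

# Run the test quietly so the expected counts can be inspected
res <- suppressWarnings(chisq.test(tab))
round(res$expected, 1)       # expected counts under independence
sum(res$expected < 5)        # how many cells fall below 5

# Monte Carlo p-value: does not rely on the large-sample approximation
set.seed(42)
res.sim <- chisq.test(tab, simulate.p.value = TRUE, B = 10000)
res.sim$p.value
```

When the simulated and asymptotic p-values agree closely, the warning is of little practical consequence; `fisher.test` is the other alternative if the table is small enough for it to run.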

Perform an appropriate test to determine whether the effects of age and alcohol independently lead to the occurrence of cancer.

To summarize, let's review the tests for categorical data that we have looked at so far, where they are used, and what form the input data should be in.

Tests for categorical data

Function	Single Proportion	Two Proportions	> 2 Proportions	Two-way tables	Input	Comments
prop.test	yes	yes	yes	no	vectors of successes and trials	accurate for large samples only
fisher.test	no	yes	no	yes	matrix or contingency table	exact test, but time-consuming for large tables
chisq.test	no	yes	no	yes	matrix or contingency table	expected cell frequencies should be at least 5 for accuracy
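To make the Input column concrete, here is a sketch of one hypothetical two-group comparison (15 successes out of 50 versus 25 out of 50; the numbers and object names are made up) passed to each function in the form it expects:

```r
# Hypothetical data: successes and trials in two groups
successes <- c(15, 25)
trials    <- c(50, 50)

# prop.test takes vectors of successes and trials
pt <- prop.test(successes, trials)

# fisher.test and chisq.test take a matrix (or table) of counts,
# here successes and failures by group
m <- cbind(success = successes, failure = trials - successes)
rownames(m) <- c("group1", "group2")
ft <- fisher.test(m)
ct <- chisq.test(m)

# For two groups, prop.test and chisq.test compute the same statistic
# (both apply a continuity correction by default)
all.equal(pt$p.value, ct$p.value)

c(prop = pt$p.value, fisher = ft$p.value, chisq = ct$p.value)
```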

For more on this topic, SPH offers BS 821: Categorical Data Analysis, taught by Prof. David Gagnon, or BS 852: Statistical Methods for Epidemiology, taught by Profs. Paola Sebastiani or Tim Heeren.

Chi-Square Test, Fishers Exact Test, and Cross Tabulations in R (R Tutorial 4.7) MarinStatsLectures

Relative Risk, Odds Ratio and Risk Difference (aka Attributable Risk) in R (R Tutorial 4.8) MarinStatsLectures

Reading:

BS 704 R Notes 2.1, 2.3 and 2.5

Assignment:

Homework 6 assigned
Final Project due in 2 weeks
