2.5 Chi-square tests for categorical outcomes


2.5.1 The chi-square goodness-of-fit test for one sample

This section gives the syntax needed to calculate a chi-square goodness-of-fit test from a set of tabled frequencies. As an example, 45 subjects are asked which of 3 screening tests they prefer; 10 subjects prefer Test A, 15 prefer Test B, and 20 prefer Test C. We wish to test the null hypothesis that the three screening tests are equally preferred, or equivalently, that 1/3 of subjects prefer each test. The data:

Preference    Observed Frequency    Expected Proportion Under the Null
Test A        10                    0.333
Test B        15                    0.333
Test C        20                    0.333

To analyze these data in R, first create an object (arbitrarily named 'obsfreq' in the example) that contains the observed frequencies. Second, create an object that contains the expected probabilities under the null (arbitrarily named 'nullprobs'). The third probability is rounded to .334 because the probabilities must sum to 1.00; giving the probabilities as 1/3, 1/3, 1/3 would also work and avoids the rounding. Third, compare the observed frequencies to the expected probabilities with the chisq.test( ) function:

> obsfreq <- c(10,15,20)

> nullprobs <- c(.333,.333,.334)

> chisq.test(obsfreq,p=nullprobs)

Chi-squared test for given probabilities

data: x

X-squared = 3.3018, df = 2, p-value = 0.1919

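The p-value R reports is the upper-tail area of the chi-square distribution with 2 degrees of freedom. The test itself is non-directional: a departure from the null proportions in any direction counts as evidence against the null hypothesis.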
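As mentioned above, supplying the null probabilities as exact fractions avoids the rounding to .334. The sketch below reuses the example's counts, passes 1/3 for each category, and also recomputes the statistic by hand from the observed and expected counts. The object names 'fit', 'expected' and 'x2' are illustrative only.

```r
# Observed counts for Tests A, B and C (from the example above)
obsfreq <- c(10, 15, 20)

# Exact null probabilities, so no rounding is needed
nullprobs <- rep(1/3, 3)

fit <- chisq.test(obsfreq, p = nullprobs)
fit

# The same statistic by hand: sum of (observed - expected)^2 / expected
expected <- sum(obsfreq) * nullprobs          # 15 in each category
x2 <- sum((obsfreq - expected)^2 / expected)
x2

# p-value: upper-tail area of the chi-square distribution, df = categories - 1
pchisq(x2, df = length(obsfreq) - 1, lower.tail = FALSE)
```

With exact thirds the statistic is 50/15 = 3.33, slightly larger than the 3.3018 obtained with the rounded probabilities.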

2.5.2 Contingency table analysis and the chi-square test of independence

2.5.2.1 The chi-square test of independence from per-subject data

Using the Age at Walking example, suppose we want to compare the percent of males (coded sexmale=1) between the two groups. We can first use the 'table( )' function to get the observed counts for the underlying frequency table:

> table(group,sexmale)
     sexmale
group  0  1
    1 17 16
    2  9  8

In group 1, there are 16 males and 17 females, so 48.5% (16/33) of group 1 is male.

In group 2, 47.1% (8/17) are male. The 'prop.table( )' function will calculate these proportions in R:

> prop.table(table(group,sexmale),1)
     sexmale
group         0         1
    1 0.5151515 0.4848485
    2 0.5294118 0.4705882

The 'prop.table( )' function calculates proportions from the indicated table. The second argument sets the direction: '1' requests proportions within each level of the first variable in the table (here, group), so we get the proportions of males and females within group 1 and within group 2, and each row sums to 1. Had we indicated '2', R would have calculated proportions within each level of the second variable (sex), giving the proportions of females in groups 1 and 2 and the proportions of males in groups 1 and 2, so that each column sums to 1.

Specifying the orientation for the prop.table( ) function can be confusing, and it may be easier (or safer) to calculate proportions directly from the table of counts. R can be used as a calculator to find these proportions:
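To see the effect of the orientation argument without the per-subject data, the sketch below rebuilds the table of counts by hand and requests proportions in both directions. The object names 'counts', 'by_group' and 'by_sex' are illustrative and do not come from the Age at Walking data set.

```r
# Rebuild the group-by-sex table from the counts printed above
# (matrix() fills column by column: sexmale=0 first, then sexmale=1)
counts <- matrix(c(17, 9, 16, 8), nrow = 2,
                 dimnames = list(group = c("1", "2"), sexmale = c("0", "1")))

# Margin 1: proportions within each group (each row sums to 1)
by_group <- prop.table(counts, 1)
by_group

# Margin 2: proportions within each sex (each column sums to 1)
by_sex <- prop.table(counts, 2)
by_sex

# Quick check on the orientation
rowSums(by_group)
colSums(by_sex)
```

Checking which margin sums to 1 is a simple way to confirm that the proportions were calculated in the intended direction.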

> 16/(16+17)

[1] 0.4848485

> 8/(8+9)

[1] 0.4705882

 

The chisq.test() function applied to a table object compares these two percentages through the chi-square test of independence:

> chisq.test(table(group,sexmale),correct=FALSE)

Pearson's Chi-squared test

data: table(group, sexmale)

X-squared = 0.0091, df = 1, p-value = 0.9238

The 'correct=FALSE' option in the chisq.test function turns off Yates' continuity correction, which R applies by default to 2x2 tables and which is sometimes recommended for small sample sizes, and gives the standard chi-square test statistic. R gives a two-sided p-value. Note that the title for the output, 'Pearson's Chi-squared test', indicates that these results are for the uncorrected (not Yates' adjusted) chi-square test.
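To make the calculation behind the test concrete, the following sketch computes the expected counts and the uncorrected statistic by hand from the 2x2 table of counts, then compares the result with chisq.test( ) with and without Yates' correction. The table is re-entered as a matrix, and the object names are illustrative.

```r
# Group-by-sex counts from the table above
counts <- matrix(c(17, 9, 16, 8), nrow = 2,
                 dimnames = list(group = c("1", "2"), sexmale = c("0", "1")))

# Expected count in each cell: (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
expected

# Uncorrected (Pearson) statistic and its p-value on 1 df
x2 <- sum((counts - expected)^2 / expected)
x2
pchisq(x2, df = 1, lower.tail = FALSE)

# chisq.test() with and without the continuity correction
uncorrected <- chisq.test(counts, correct = FALSE)
yates <- chisq.test(counts)   # correct=TRUE is the default for 2x2 tables
c(uncorrected = unname(uncorrected$statistic), yates = unname(yates$statistic))
```

The hand calculation should agree with the uncorrected statistic shown above, and the Yates-corrected statistic can never be larger than the uncorrected one.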

2.5.2.2 The chi-square test of independence from tabled data

R can also perform a chi-square test on frequencies from a contingency table. For example, suppose we want to compare the percent of subjects testing positive on a marker for an exposure across three groups:

 

                 Group 1     Group 2      Group 3
Test Positive    20 (40%)    5 (33.3%)    40 (50%)
Test Negative    30          10           40

 

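First, we create an object ('obsfreq' in the example) containing the observed frequencies from the table. The matrix( ) function fills the matrix column by column, so the counts are entered one group at a time (positive, then negative). I printed the object as a check that it was created correctly: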

> obsfreq <- matrix(c(20,30, 5,10, 40,40),nrow=2,ncol=3)
> obsfreq
     [,1] [,2] [,3]
[1,]   20    5   40
[2,]   30   10   40

The 'chisq.test( )' function will then calculate the chi-square statistic for the test of independence for this table:

> chisq.test(obsfreq)

Pearson's Chi-squared test

data: obsfreq

X-squared = 2.1378, df = 2, p-value = 0.3434
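A labelled version of the same table can make the output easier to read and check. The sketch below is one way to do this: it enters the counts row by row with byrow=TRUE, adds row and column labels (the label text is my own choice), and pulls the expected counts and degrees of freedom out of the chisq.test( ) result.

```r
# byrow = TRUE lets the counts be typed row by row, as they appear in the table
obsfreq <- matrix(c(20,  5, 40,
                    30, 10, 40),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Test  = c("Positive", "Negative"),
                                  Group = c("Group 1", "Group 2", "Group 3")))
obsfreq

# Percent testing positive or negative within each group (column percentages)
round(100 * prop.table(obsfreq, 2), 1)

fit <- chisq.test(obsfreq)
fit$expected    # expected counts under independence
fit$parameter   # degrees of freedom: (2 - 1) * (3 - 1) = 2
```

Inspecting fit$expected is also a convenient way to check whether any expected count falls below 5 before relying on the chi-square approximation.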

2.5.2.3 Fisher's exact test for small cell sizes

The usual chi-square test is appropriate for large sample sizes. For 2x2 tables with small samples (an expected frequency less than 5), the usual chi-square test exaggerates significance, and Fisher's exact test is generally considered to be a more appropriate procedure. The fisher.test() function performs Fisher's exact test in R:

> fisher.test(group,sexmale)

Fisher's Exact Test for Count Data

data: group and sexmale

p-value = 1

alternative hypothesis: true odds ratio is not equal to 1

95 percent confidence interval:

0.2480199 3.5592990

sample estimates:

odds ratio

0.9455544

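R gives the two-tailed p-value, as indicated by the wording of the alternative hypothesis. The odds ratio and a 95% confidence interval for the odds ratio are also given. Note that fisher.test( ) reports the conditional maximum likelihood estimate of the odds ratio rather than the sample odds ratio, so the estimate (0.9456) differs slightly from the value calculated directly from the table, (17 x 8) / (16 x 9) = 0.9444. The CI is likewise based on the exact conditional distribution rather than on a large-sample approximation.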
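fisher.test( ) also accepts a table of counts, which is useful when the per-subject data are not available. The sketch below re-enters the group-by-sex counts as a matrix (object names are illustrative) and compares the odds ratio estimate returned by fisher.test( ), which is a conditional maximum likelihood estimate, with the cross-product odds ratio calculated directly from the table.

```r
# Group-by-sex counts from the earlier table
counts <- matrix(c(17, 9, 16, 8), nrow = 2,
                 dimnames = list(group = c("1", "2"), sexmale = c("0", "1")))

fit <- fisher.test(counts)
fit$p.value     # two-sided exact p-value
fit$estimate    # conditional maximum likelihood estimate of the odds ratio
fit$conf.int    # 95% confidence interval for the odds ratio

# Sample (cross-product) odds ratio, for comparison with fit$estimate
sample_or <- (counts[1, 1] * counts[2, 2]) / (counts[1, 2] * counts[2, 1])
sample_or
```

The two odds ratio estimates are usually close, as they are here, but they are not expected to match exactly.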

2.5.2.4 Relative risk and confidence interval for the RR

Epidemiologic analyses are available through 'epitools', an add-on package to R. To use the epitools functions, you must first do a one-time installation. In R, click on the 'Packages' menu, then 'Install Package(s)', then select a download site (from the US), then select the epitools package. This will install the add-on package onto your computer. To use the package, you must also load it into R: click on the 'Packages' menu, then 'Load Package', then select epitools. While you only need to install the package once onto your computer, you will need to load the package into R each time you want to use it.
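The same two steps can be done from the R console instead of the menus, which is helpful on systems where the 'Packages' menu is not available. This is a minimal sketch: the installation line is commented out because it needs internet access and only has to be run once, and the requireNamespace( ) check is my own addition so that the code does not stop with an error if the package is missing.

```r
# One-time installation (needs internet access); uncomment to run
# install.packages("epitools")

# Load the package in each new R session
if (requireNamespace("epitools", quietly = TRUE)) {
  library(epitools)
} else {
  message("epitools is not installed; run install.packages('epitools') first")
}
```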

The data layout matters for calculating RRs. For the riskratio( ) function from epitools, data should be set up in the following format:

 

             No Disease    Disease
Control
Exposed

 

 

 

riskratio( ) calculates the RR of disease for those in the exposed group relative to the control group.

For the Age at Walking example, I categorized age at walking as early walking (under 12 months, coded 0) and late walking (12 months or older, coded 1). To find the relative risk for late walking, for kids in Group 2 vs. Group 1, I first printed the 2x2 table as a check, then used the riskratio() function to calculate the relative risk and large sample 95% confidence interval.

> table(group,LateWalker)
     LateWalker
group FALSE TRUE
    1    28    5
    2     8    9

> riskratio.wald(group,LateWalker)
$data
         Outcome
Predictor FALSE TRUE Total
    1        28    5    33
    2         8    9    17
    Total    36   14    50

$measure
         risk ratio with 95% C.I.
Predictor estimate    lower    upper
        1 1.000000       NA       NA
        2 3.494118 1.387688 8.797984

$p.value
         two-sided
Predictor  midp.exact fisher.exact  chi.square
        1          NA           NA          NA
        2 0.008000253  0.007949207 0.004814519

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

Warning message:
In chisq.test(xx, correct = correction) :
  Chi-squared approximation may be incorrect

 

The RR here is 3.49 ( (9/17) / (5/33) ), with a 95% CI of (1.39, 8.80). There are several versions of a CI for a relative risk: 'riskratio.wald( )' requests the standard normal approximation formula, while 'riskratio.small( )' uses a small-sample adjustment. If 'riskratio( )' is called without specifying a method, it uses the Wald (normal approximation) interval by default; it does not choose a version based on the sample size, so request 'riskratio.small( )' explicitly when cell counts are small.
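The relative risk and its large-sample CI can also be checked by hand in base R. The sketch below uses the usual normal approximation on the log scale; the variable names are illustrative, and the result agrees with the riskratio.wald( ) output above. It also prints the expected counts under independence, one of which is below 5, which is the condition that makes chisq.test( ) issue the warning shown in the output.

```r
# Counts from the table above: number of late walkers and group size
late_g1 <- 5;  n_g1 <- 33    # Group 1 (reference group)
late_g2 <- 9;  n_g2 <- 17    # Group 2

# Relative risk of late walking, Group 2 vs. Group 1
rr <- (late_g2 / n_g2) / (late_g1 / n_g1)

# Standard error of log(RR), large-sample (Wald) formula
se_log_rr <- sqrt(1/late_g2 - 1/n_g2 + 1/late_g1 - 1/n_g1)

# 95% CI: build the interval on the log scale, then exponentiate
z <- qnorm(0.975)
ci <- exp(log(rr) + c(-1, 1) * z * se_log_rr)
round(c(estimate = rr, lower = ci[1], upper = ci[2]), 2)

# Expected counts under independence; chisq.test() warns when any is below 5
counts <- matrix(c(28, 8, 5, 9), nrow = 2)
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
expected
```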

2.5.2.5 Odds ratios and 95% CI for the OR

The epitools add-on package also has a function to calculate odds ratios and confidence intervals for odds ratios. You must first load the epitools package into R (see Section 2.5.2.4). Orientation of the table matters when calculating the OR, and the orientation described above for the relative risk also applies for the odds ratio. Calculating the odds ratio ( (9/8) / (5/28) = 6.3 ) and 95% CI for late walkers, for Group 2 vs. Group 1 in the Age at Walking example:

> oddsratio.wald(group,LateWalker)
$data
         Outcome
Predictor FALSE TRUE Total
    1        28    5    33
    2         8    9    17
    Total    36   14    50

$measure
         odds ratio with 95% C.I.
Predictor estimate    lower   upper
        1      1.0       NA      NA
        2      6.3 1.639283 24.2118

$p.value
         two-sided
Predictor  midp.exact fisher.exact  chi.square
        1          NA           NA          NA
        2 0.008000253  0.007949207 0.004814519

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

Warning message:
In chisq.test(xx, correct = correction) :
  Chi-squared approximation may be incorrect

 

The 'oddsratio.wald( )' function gives the usual estimate for the odds ratio, with OR=6.3 and a 95% CI of (1.64, 24.21). 'oddsratio.small( )' uses a correction for small sample size in calculating the CI.
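As a check on the Wald results, the odds ratio and its 95% CI can be reproduced in base R from the four cell counts. This is a sketch using the standard large-sample formula on the log scale, with illustrative object names; it reproduces the oddsratio.wald( ) estimate and interval shown above.

```r
# Group-by-LateWalker counts from the table above
counts <- matrix(c(28, 8, 5, 9), nrow = 2,
                 dimnames = list(group = c("1", "2"),
                                 LateWalker = c("FALSE", "TRUE")))

# Odds of late walking in each group
odds_g1 <- counts["1", "TRUE"] / counts["1", "FALSE"]   # 5/28
odds_g2 <- counts["2", "TRUE"] / counts["2", "FALSE"]   # 9/8
odds_ratio <- odds_g2 / odds_g1

# Standard error of log(OR): square root of the sum of reciprocal cell counts
se_log_or <- sqrt(sum(1 / counts))

# 95% CI: build the interval on the log scale, then exponentiate
z <- qnorm(0.975)
ci <- exp(log(odds_ratio) + c(-1, 1) * z * se_log_or)
round(c(estimate = odds_ratio, lower = ci[1], upper = ci[2]), 2)
```

Because the standard error depends on the reciprocals of the cell counts, a single small cell (here, 5) is enough to make the interval wide.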