Chi-Square Test of Independence

chisq.test(varname1, varname2, correct = FALSE)

chisq.test(table(varname1, varname2), correct = FALSE)

Performs a chi-squared test on a contingency table. Use the correct = FALSE option with reasonably large sample sizes, i.e., if the expected counts in all cells of the contingency table are 5 or more. Use the correct = TRUE option if the expected count in any cell of the contingency table is less than 5. When the correct = TRUE option is used, R performs the chi-squared test with Yates' continuity correction. The correction applies only to 2x2 tables; for larger tables the correct option has no effect. If neither option is specified, R applies the Yates correction to 2x2 tables automatically.
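One way to check the expected counts before choosing an option is to look at the "expected" component of the object that chisq.test() returns. The following is a minimal sketch, not part of the original instructions; the object names "counts", "result", and "all_large" are arbitrary, and the counts are those of the diabetes example in the next section.

```r
# 2x2 table of counts, entered column by column
# (counts taken from the diabetes / myocardial infarction example)
counts <- matrix(c(2832, 233, 142, 65), nrow = 2, ncol = 2)

# Run the test once and keep the result so its components can be inspected
result <- chisq.test(counts, correct = FALSE)

# Expected counts under the null hypothesis of independence
result$expected

# TRUE if every expected count is at least 5
all_large <- all(result$expected >= 5)
all_large
```

If all_large is FALSE for a 2x2 table, the guidance above would suggest rerunning the test with correct = TRUE.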

Chi-squared Test from a Raw Data Set in R

If you are working with a raw data set, the simplest way to do a chi-squared test is to use the first form of the command shown above. If I am working with the Weymouth Health Survey data set, and I want to conduct a chi-squared test to determine whether there is an association between having a history of diabetes (hx_dm) and having had a myocardial infarction, i.e., a heart attack (hx_mi), I can use the following line of code:

chisq.test(hx_dm,hx_mi, correct=FALSE)

Pearson's Chi-squared test

data: hx_dm and hx_mi
X-squared = 132.67, df = 1, p-value < 2.2e-16

The output provides the chi-squared statistic (132.67), the degrees of freedom (df=1), and the p-value, which is reported as less than 2.2x10^-16. When this line of code is run, R creates the contingency table in the background and then does the chi-squared test.

I can also do the chi-squared test from a contingency table that has already been created using the table() command.

table(hx_dm,hx_mi)
     hx_mi
hx_dm    0    1
    0 2832  142
    1  233   65

chisq.test(table(hx_dm,hx_mi), correct=FALSE)

Pearson's Chi-squared test

data: table(hx_dm, hx_mi)
X-squared = 132.67, df = 1, p-value < 2.2e-16

Notice that this method gives the same results as the first method.
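If you do not have the survey data at hand, the equivalence of the two methods can still be checked with a short sketch. The code below rebuilds one record per subject from the four cell counts in the table above. The rebuilt vectors are only a stand-in for the real data set, and the names "n", "from_vectors", and "from_table" are arbitrary.

```r
# Cell counts from the table above, in the order
# (dm=0, mi=0), (dm=0, mi=1), (dm=1, mi=0), (dm=1, mi=1)
n <- c(2832, 142, 233, 65)

# Rebuild one value per subject (a stand-in for the raw survey variables)
hx_dm <- rep(c(0, 0, 1, 1), times = n)
hx_mi <- rep(c(0, 1, 0, 1), times = n)

# Method 1: pass the two variables directly
from_vectors <- chisq.test(hx_dm, hx_mi, correct = FALSE)

# Method 2: pass a contingency table built with table()
from_table <- chisq.test(table(hx_dm, hx_mi), correct = FALSE)

# Compare the chi-squared statistics from the two methods
all.equal(unname(from_vectors$statistic), unname(from_table$statistic))
```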

Creating a Contingency Table in R When You Don't Have the Raw Data Set

Suppose you don't have the raw data set. Instead, you are given the counts in a contingency table like the one below. The columns show whether HIV-positive individuals disclosed their HIV status to their partners, and the rows indicate how the subject became infected with HIV.

Transmission            Disclosure   No Disclosure
Intravenous drug use        35            17
Gay contact                 13            12
Heterosexual contact        29            21

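Since we do not have access to the raw data, we need to construct the contingency table in R using the matrix() command. First, I create the contingency table, which I have arbitrarily named "datatable". Note that matrix() fills the table column by column, so the three counts for the first column are listed first, followed by the three counts for the second column. I then give the command "datatable" so that R displays the table and I can check it for errors.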

datatable <- matrix(c(35,13,29,17,12,21),nrow=3,ncol=2)
datatable

     [,1] [,2]
[1,]   35   17
[2,]   13   12
[3,]   29   21

This agrees with the counts that I started with, so I can go ahead and run the chi-squared test by referring to the name I gave to the contingency table and specifying correct=FALSE, since the expected counts in all cells under the null hypothesis are greater than or equal to 5.

chisq.test(datatable,correct=FALSE)

Pearson's Chi-squared test

data: datatable
X-squared = 1.8963, df = 2, p-value = 0.3875
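To see where these numbers come from, the statistic can also be computed by hand. The following is a sketch under a few assumptions: the table is rebuilt here with row and column labels (the label text is taken from the table of counts above, but the dimnames are my own addition), it is entered row by row with byrow = TRUE, and the names "expected", "x2", "df", and "p_value" are arbitrary.

```r
# Same counts as datatable, entered row by row, with labels added for readability
datatable <- matrix(c(35, 17,
                      13, 12,
                      29, 21),
                    nrow = 3, ncol = 2, byrow = TRUE,
                    dimnames = list(
                      Transmission = c("Intravenous drug use",
                                       "Gay contact",
                                       "Heterosexual contact"),
                      Status = c("Disclosure", "No Disclosure")))

# Expected count for each cell: (row total * column total) / grand total
expected <- outer(rowSums(datatable), colSums(datatable)) / sum(datatable)

# Chi-squared statistic: sum over cells of (observed - expected)^2 / expected
x2 <- sum((datatable - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(datatable) - 1) * (ncol(datatable) - 1)

# P-value: area under the chi-squared distribution to the right of x2
p_value <- pchisq(x2, df, lower.tail = FALSE)

c(x2 = x2, df = df, p_value = p_value)
```

The three values should agree with the X-squared, df, and p-value lines that chisq.test() printed above.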

Computing the P-value from the Chi-squared Statistic and Degrees of Freedom

1-pchisq(chi-squared, df)

If you have a chi-squared statistic and the number of degrees of freedom (df), you can compute the p-value using the pchisq() command in R. However, the p-value is the area under the chi-squared distribution to the right of the observed statistic, and pchisq() by default gives the area to the left. Therefore you have to subtract the pchisq() result from 1 to get the correct p-value.

Suppose a report you are reading indicates that the chi-squared value was 11.11 and there was 1 degree of freedom (df). The p-value would be computed as follows:

1-pchisq(11.11,1)
[1] 0.0008586349
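As an alternative to subtracting from 1, pchisq() accepts a lower.tail argument; setting lower.tail = FALSE asks for the area to the right directly. The sketch below compares the two approaches; the names "p_subtract" and "p_upper" are arbitrary.

```r
# Upper-tail area computed two ways for the example above
p_subtract <- 1 - pchisq(11.11, df = 1)
p_upper    <- pchisq(11.11, df = 1, lower.tail = FALSE)

# The two results agree
all.equal(p_subtract, p_upper)

# For a very large statistic, subtracting from 1 can lose the tiny p-value
# to rounding, whereas lower.tail = FALSE keeps it
1 - pchisq(132.67, df = 1)
pchisq(132.67, df = 1, lower.tail = FALSE)
```

For everyday values such as 11.11 either form is fine; the lower.tail form mainly matters when the p-value is extremely small.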
