Using R for the Chi-squared Test of Independence


The calculations for the chi-squared test can be tedious, but R can do these quite easily. Here are two useful techniques.

For Tabulated Frequencies

Suppose you have data that is already summarized in a contingency table as in the table below. (Note this is the same data that was used on the preceding page.)

 

                        Disclosed   Non-Disclosed
Injection drug use          35           17
Homosexual contact          13           12
Heterosexual contact        29           21

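One can create a "matrix" in R and perform a chi-squared test as follows. Note that matrix() fills the table column by column, so the first three values (35, 13, 29) form the "Disclosed" column and the last three (17, 12, 21) form the "Non-Disclosed" column: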

> datatable<-matrix(c(35,13,29,17,12,21),nrow=3,ncol=2)

> datatable

     [,1] [,2]
[1,]   35   17
[2,]   13   12
[3,]   29   21

> chisq.test(datatable,correct=FALSE)
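# Note: the correct=FALSE option turns off Yates' continuity correction. R applies that correction only to 2 x 2 tables, so the option has no effect on this 3 x 2 table. All of the expected frequencies here are greater than 5, so the chi-squared approximation is reasonable.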

Pearson's Chi-squared test

data: datatable

X-squared = 1.8963, df = 2, p-value = 0.3875
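The claim that all expected frequencies exceed 5 can be checked directly from the object that chisq.test returns. The following is a minimal sketch using the same counts; the row and column labels (and the names Exposure, Status, and result) are added here for readability and are not part of the original commands.

```r
# Counts copied from the contingency table above; dimnames are illustrative labels.
datatable <- matrix(c(35, 13, 29, 17, 12, 21), nrow = 3, ncol = 2,
                    dimnames = list(
                      Exposure = c("Injection drug use",
                                   "Homosexual contact",
                                   "Heterosexual contact"),
                      Status = c("Disclosed", "Non-Disclosed")))

result <- chisq.test(datatable, correct = FALSE)

# Expected counts under independence: (row total x column total) / grand total
result$expected

# Smallest expected count; the usual rule of thumb asks for at least 5
min(result$expected)

# The test statistic recomputed by hand from observed and expected counts
sum((datatable - result$expected)^2 / result$expected)
```

The hand-computed sum should agree with the X-squared value that chisq.test reports, which is a quick way to see what the function is doing.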

 

For Raw Data from a CSV File

The dataset FramCHDStudy.CSV has a variable "hypert" for hypertension (high blood pressure), which is coded 0 if absent and 1 if present. The development of coronary heart disease ("chd") is coded 0 if it did not occur and 1 if it occurred. Suppose we want to conduct a chi-squared test to assess whether individuals with hypertension have a greater risk of developing coronary heart disease. We could use the following code:

> FramCHDStudy <- read.csv("C:/Users/wlamorte/Desktop/Quant Core/Data sets/FramCHDStudy.csv")

> View(FramCHDStudy)

> fram2<-FramCHDStudy

> attach(fram2)

> table(hypert,chd)

      chd
hypert   0   1
     0 748 140
     1 380 142

> prop.table(table(hypert,chd),1)

# Note: the 1 at the end of the preceding command asks R to compute proportions across each row. 

# For example, among those who were not hypertensive, 84.23% did not develop CHD and 15.77% did develop CHD.

      chd
hypert         0         1
     0 0.8423423 0.1576577
     1 0.7279693 0.2720307
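To see what the second argument of prop.table does, the sketch below rebuilds the same 2 x 2 table from the counts shown above (so it runs without the CSV file) and compares row and column proportions. The object names tab, row_props, and col_props are chosen here for illustration.

```r
# Counts copied from the table(hypert, chd) output above
tab <- matrix(c(748, 380, 140, 142), nrow = 2,
              dimnames = list(hypert = c("0", "1"), chd = c("0", "1")))

# margin = 1: each row sums to 1 (CHD status within each hypertension group)
row_props <- prop.table(tab, margin = 1)
round(row_props, 4)
rowSums(row_props)

# margin = 2: each column sums to 1 (hypertension status within each CHD group)
col_props <- prop.table(tab, margin = 2)
round(col_props, 4)
colSums(col_props)
```

Row proportions are the natural choice here because the question is about the risk of CHD within each hypertension group.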

> chisq.test(table(hypert,chd),correct=FALSE)

Pearson's Chi-squared test

data: table(hypert, chd)

X-squared = 26.8777, df = 1, p-value = 2.168e-07

The data table has 2 rows and 2 columns, so df = (rows - 1) x (columns - 1) = (2-1) x (2-1) = 1.

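The resulting p-value is 2.168 x 10^-7, so there is strong evidence of an association between hypertension and CHD. The row proportions above show the direction of that association: 27.2% of subjects with hypertension developed CHD, compared with 15.8% of those without hypertension.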
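Because this is a 2 x 2 table, the correct argument does matter here. As a rough sketch, the code below rebuilds the table from the counts reported above (so the CSV file is not needed) and runs the test both ways; the object names hyp_chd, uncorrected, and corrected are illustrative.

```r
# Counts copied from the table(hypert, chd) output above
hyp_chd <- matrix(c(748, 380, 140, 142), nrow = 2,
                  dimnames = list(hypert = c("0", "1"), chd = c("0", "1")))

# Without the continuity correction, as in the analysis above
uncorrected <- chisq.test(hyp_chd, correct = FALSE)
uncorrected$statistic
uncorrected$parameter   # degrees of freedom
uncorrected$p.value

# With Yates' continuity correction (R's default for 2 x 2 tables)
corrected <- chisq.test(hyp_chd)
corrected$statistic
corrected$p.value
```

With a sample this large, the corrected statistic is only slightly smaller and the conclusion does not change.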