Chi-Square Analysis Using R
Analysis With Contingency Table Data
If you are given frequency data in a contingency table, you can create a data matrix to analyze the data. For example, here are the observed frequencies from the examples above.
Transmission Risk | Disclosed | No Disclosure | Total |
IV Drug Use | 35 (67%) | 17 | 52 |
(not labeled) | 13 (52%) | 12 | 25 |
(not labeled) | 29 (58%) | 21 | 50 |
Total | 77 (60.6%) | 50 | 127 |
To create the contingency table in R, we create a data object (let's arbitrarily call it "datatable") with the matrix function, entering the frequencies in a very specific order. By default, matrix fills the table one column at a time, so we first enter the three observed counts in the first column in order and then the three observed counts in the second column. We also have to specify the number of rows and the number of columns. Once the data are entered, we can check the result by simply giving the command "datatable". Lastly, we use the chisq.test function. The argument correct=FALSE turns off the continuity correction, which R would otherwise apply to 2 x 2 tables.
> datatable <- matrix(c(35,13,29,17,12,21),nrow=3,ncol=2)
> datatable
     [,1] [,2]
[1,]   35   17
[2,]   13   12
[3,]   29   21
> chisq.test(datatable,correct=FALSE)
Output:
Pearson's Chi-squared test
data: datatable
X-squared = 1.8963, df = 2, p-value = 0.3875
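The same table can also be entered one row at a time, which some people find easier to check against a printed table. The sketch below is one way to do that, assuming the counts shown above; the object name "datatable2" and the row labels "Group1" to "Group3" are placeholders chosen for illustration, not the study's category names. It also shows how to pull the expected counts, the test statistic, and the p-value out of the result.

```r
# Same counts as above, entered row by row (byrow = TRUE) instead of column by column.
# The row labels are hypothetical placeholders, not the study's category names.
datatable2 <- matrix(c(35, 17,
                       13, 12,
                       29, 21),
                     nrow = 3, ncol = 2, byrow = TRUE,
                     dimnames = list(Risk = c("Group1", "Group2", "Group3"),
                                     Disclosure = c("Disclosed", "NoDisclosure")))

result <- chisq.test(datatable2, correct = FALSE)

result$expected   # expected counts under the null hypothesis of no association
result$statistic  # the chi-square statistic
result$parameter  # the degrees of freedom
result$p.value    # the p-value
```

Looking at the expected counts is a quick way to confirm that none of the cells is too sparse for the chi-square approximation.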
Analysis with Raw Data from a Data Set
Alternatively, if you have raw data, you can create a contingency table using the "table" function and then use chisq.test, as shown in the example below. The example uses raw data from the Framingham Heart Study to look at the association between hypertension (hypert) and risk of developing coronary heart disease (chd). Note that when R creates the contingency table from raw data, it always lists the lowest alphanumeric category first. In this case the rows list the categories of hypertension as 0 (no) first and then 1 (yes). Similarly, the columns of the table list chd=0 first and chd=1 second.
> fram <- FramCHDStudy
> attach(fram)
> table(hypert,chd)
      chd
hypert   0   1
     0 748 140
     1 380 142
> prop.table(table(hypert,chd),1)
      chd
hypert         0         1
     0 0.8423423 0.1576577
     1 0.7279693 0.2720307
> chisq.test(table(hypert,chd),correct=FALSE)
Output:
Pearson's Chi-squared test
data: table(hypert, chd)
X-squared = 26.8777, df = 1, p-value = 2.168e-07
The p-value is 2.168 x 10^-7, or 0.0000002168. This is a very small p-value, indicating an extremely low probability of seeing differences at least as large as those observed if the null hypothesis were true.
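If you do not have the Framingham data set at hand, the analysis can be tried on a stand-in built from the four cell counts shown above. The sketch below assumes those counts; the data frame name "fram_demo" is hypothetical and is not the real data set. It uses with() rather than attach(), so the variables are looked up inside the data frame without changing the search path.

```r
# Rebuild one row per subject from the four cell counts shown above.
# 'fram_demo' is a hypothetical stand-in for the real Framingham data set.
fram_demo <- data.frame(
  hypert = rep(c(0, 0, 1, 1), times = c(748, 140, 380, 142)),
  chd    = rep(c(0, 1, 0, 1), times = c(748, 140, 380, 142))
)

# with() evaluates the expression inside the data frame, so attach() is not needed
tab <- with(fram_demo, table(hypert, chd))
tab

prop.table(tab, 1)                         # row proportions
test <- chisq.test(tab, correct = FALSE)   # Pearson chi-square, no continuity correction
test
```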
Conclusion:
Hypertensive adults have a significantly higher risk of developing coronary heart disease (chd) compared to non-hypertensive adults (27.2% vs. 15.8%, chi-square (1 df) = 26.88, p<0.0001). [Note that when reporting results with such a small p-value, one should simply report it as p<0.0001. If the p-value had been 0.0002168, we could report that as p<0.001.]
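The reporting convention in the note above can be written as a small function. This is only a sketch: "report_p" is a hypothetical helper, not part of R, and the choice to show larger p-values to three decimal places is an assumption.

```r
# Hypothetical helper: format a p-value using the reporting convention described above
report_p <- function(p) {
  if (p < 0.0001) {
    "p<0.0001"
  } else if (p < 0.001) {
    "p<0.001"
  } else {
    # Assumption: larger p-values are shown to three decimal places
    paste0("p=", formatC(p, format = "f", digits = 3))
  }
}

report_p(2.168e-07)   # the p-value from the hypertension example
report_p(0.0002168)   # the alternative value mentioned in the note
report_p(0.3875)      # the p-value from the contingency table example
```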
Instead of using prop.table, you can compute proportions directly in R using the math functions, e.g., the proportion of non-hypertensive adults who developed CHD = 140/(140+748) as shown below.
> 140/(140+748)
[1] 0.1576577
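Because R arithmetic works on whole vectors, both proportions can be computed in one step. The sketch below uses the counts from the table above; the object names "chd_cases", "group_size", and "risk" are chosen for illustration.

```r
# Number who developed CHD and total number in each hypertension group
chd_cases  <- c(nonhypertensive = 140, hypertensive = 142)
group_size <- c(nonhypertensive = 140 + 748, hypertensive = 142 + 380)

risk <- chd_cases / group_size   # element-by-element division
risk
round(100 * risk, 1)             # the same proportions as percentages
```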
Test Yourself
Investigators tested a new intervention to promote smoking cessation. Among the 120 subjects exposed to the new intervention, 40 had quit at 6 months. Among the 80 subjects who did not receive the intervention, 10 had quit at 6 months. What was the "risk ratio" for quitting in the intervention group compared to the non-intervention group? Was the success rate significantly higher in the intervention group compared to the non-intervention group?