Using R to Generate Contingency Tables from a Data Set

So far you have been given the counts needed to summarize data in a contingency table, but what do you do if you don't have the counts and you have to get them from a large data set? Statistical packages like R make this simple. In order to illustrate, I created a subset of data from the Framingham Heart Study that I restricted to non-smokers with BMI > 20, and I called the file "fram-nosmoke-nolow.csv".

My goal was to create a contingency table with three categories of exposure based on BMI (normal, overweight, or obese) and two categories of outcome based on the variable MI_FCHD (whether the subject had been hospitalized for a myocardial infarction or had died of coronary heart disease). MI_FCHD is coded 1 if the event occurred and 0 if it did not. BMI is a continuous variable in the data set, so I created a new variable with the three categories listed above using the "ifelse" function in R.

The "ifelse" function is very useful for this kind of task. Here it examines the continuous variable BMI and creates a new variable called bmicat, which places each individual in one of three BMI categories.

The ifelse function uses the format:

ifelse(test_expression, x, y)

"test_expression" is a statement that can be evaluated as true or false. If it is true, "x" is used, and if false "y" is used.
In the code below I create a new variable called "bmicat" whose value is defined using a nested ifelse function:

> bmicat<-ifelse(BMI>29.99, "obese", ifelse(BMI>24.99, "over", "normal"))

This looks at the value of BMI in each record in the data set; if BMI is greater than 29.99, it assigns a value of "obese" to bmicat and moves on to the next record. If BMI is not >29.99, it evaluates a second (nested) ifelse function: if BMI>24.99, it assigns a value of "over" to bmicat and moves on to the next record. However, if BMI is not >24.99, it assigns a value of "normal" to bmicat.
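
To see how the nested ifelse behaves at the cut points, here is a minimal sketch using a handful of made-up BMI values (the vector "bmi" is hypothetical and is not taken from the Framingham file). It also shows that a missing BMI produces a missing category.

```r
# Hypothetical BMI values chosen to sit on and around the cut points
bmi <- c(21.0, 24.99, 25.0, 29.99, 30.0, NA)

# Same nested ifelse as in the text, applied to the toy vector
bmicat <- ifelse(bmi > 29.99, "obese", ifelse(bmi > 24.99, "over", "normal"))

# Show each BMI next to the category it was assigned
print(data.frame(bmi, bmicat))
```

Because the tests use ">" rather than ">=", a BMI of exactly 24.99 is classified as "normal" and a BMI of exactly 29.99 as "over".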

The R code used to do this is shown below. Statements that begin with the # symbol are comments embedded in the code that are ignored by R, i.e., not executed. Executed statements begin with the > prompt, and the resulting output is shown below them.

# I imported fram-nosmoke-nolow.csv and nicknamed it "fr"
# na.omit removes records that have missing values
> fr<-na.omit(fram_nosmoke_nolow)
> attach(fr)
# The next command uses the ifelse command to create three categories of BMI
> bmicat<-ifelse(BMI>29.99, "obese", ifelse(BMI>24.99, "over", "normal"))
> table(bmicat,MI_FCHD)

(Output)
        MI_FCHD
bmicat     0   1
  normal 650  72
  obese  258  44
  over   767 114
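
The same steps can be tried without the Framingham file. The sketch below is only an illustration: the data frame "toy" and its values are invented, and it refers to columns with the $ operator instead of using attach, which is one common way to avoid name clashes.

```r
# Hypothetical records standing in for the imported file; values are invented
toy <- data.frame(
  BMI     = c(22.1, 26.4, 31.2, NA, 28.0, 23.5, 33.9, 25.5),
  MI_FCHD = c(0, 0, 1, 0, 1, 0, 0, NA)
)

# na.omit drops every record that has a missing value in any column
fr_toy <- na.omit(toy)

# Create the three BMI categories as a new column of the data frame
fr_toy$bmicat <- ifelse(fr_toy$BMI > 29.99, "obese",
                        ifelse(fr_toy$BMI > 24.99, "over", "normal"))

# Cross-tabulate exposure (rows) by outcome (columns)
tab <- table(fr_toy$bmicat, fr_toy$MI_FCHD, dnn = c("bmicat", "MI_FCHD"))
print(tab)
```

Two of the eight toy records contain a missing value, so the table is built from the remaining six.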

# You can also generate the other format for the contingency table by switching the order of the variables
> table(MI_FCHD,bmicat)

(Output)
       bmicat
MI_FCHD normal obese over
      0    650   258  767
      1     72    44  114

Note that MI_FCHD was coded 1 if the event occurred and 0 if it did not. By default, R lists the lower value first, so the 0 (no event) counts appear before the 1 (event) counts. It also defaults to listing the categories of bmicat in alphabetical order.
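
If a different ordering is wanted, one option is to convert the variables to factors with explicitly ordered levels before calling table. This is a minimal sketch with made-up values; the names "bmicat_f" and "mi_f" and the labels "MI" and "No MI" are my own choices for illustration.

```r
# Hypothetical category and outcome values, invented for illustration
bmicat <- c("over", "normal", "obese", "normal", "over")
mi     <- c(1, 0, 1, 0, 0)

# List the levels in the order they should appear in the table
bmicat_f <- factor(bmicat, levels = c("obese", "over", "normal"))

# Put the event (1) first and give both values readable labels
mi_f <- factor(mi, levels = c(1, 0), labels = c("MI", "No MI"))

tab <- table(mi_f, bmicat_f)
print(tab)
```

The table now lists "MI" before "No MI" and the BMI categories in the order obese, over, normal, rather than in R's default ordering.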

From this output I created the following contingency table:

            Obese   Overweight   Normal BMI   Total
MI             44          114           72     230
No MI         258          767          650    1675
Total         302          881          722    1905
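
Row and column totals do not have to be added by hand. As a sketch, the cell counts from the R output can be entered as a matrix and passed to addmargins; the object names "counts" and "with_totals" and the dimension labels are assumptions made for this example.

```r
# Cell counts taken from the table(MI_FCHD, bmicat) output above
counts <- as.table(matrix(c(44, 114, 72,
                            258, 767, 650),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(Outcome = c("MI", "No MI"),
                                          BMI = c("Obese", "Overweight", "Normal"))))

# addmargins appends a "Sum" row and a "Sum" column holding the totals
with_totals <- addmargins(counts)
print(with_totals)
```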

 

We can also ask R for the proportion of "events" in each category of BMI as follows. Notice the ",1" at the end of the command: this second argument tells prop.table to compute the proportions within each row.

> prop.table(table(bmicat,MI_FCHD),1)

(Output)
        MI_FCHD
bmicat            0          1
  normal 0.90027701 0.09972299
  obese  0.85430464 0.14569536
  over   0.87060159 0.12939841
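
The row proportions are simply each cell count divided by its row total. The sketch below re-enters the counts from the table output as a matrix (the object names are my own) and computes the proportions both with prop.table and by hand.

```r
# Counts from the table(MI_FCHD, bmicat) output, arranged with BMI in the rows
counts <- matrix(c(650,  72,
                   258,  44,
                   767, 114),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(bmicat = c("normal", "obese", "over"),
                                 MI_FCHD = c("0", "1")))

# margin = 1 means "within each row"
row_props <- prop.table(counts, margin = 1)
print(row_props)

# The same calculation by hand: divide every cell by its row total
by_hand <- counts / rowSums(counts)
print(by_hand)
```

For example, the proportion of events among subjects with normal BMI is 72/722, which is about 0.0997.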

 

Test Yourself

Using the contingency table above do the following before looking at the answers:

Answers