Categorical Predictor Variables

In regression analyses, categorical predictors are represented using 0 and 1 for dichotomous variables or using indicator (or dummy) variables for ordinal or categorical variables.

Suppose we wanted to conduct an analysis to determine whether systolic blood pressure is lower in people who exercise regularly compared to people who don't after adjusting for AGE (in years), use of blood pressure lowering medications (BPMEDS), and BMI. However, instead of treating BMI as a continuous variable, we want to collapse BMI into four categories (underweight, normal, overweight, and obese).

How to Collapse a Continuous Variable into Categories

I want to make the "normal" BMI category the reference, so I need to make indicator variables for being underweight, overweight, or obese. To do this in R, we can use ifelse() statements to create the indicator variables. The general form of an ifelse() statement is:

(new_variable_name)<-ifelse(conditional_test, value_assigned_if_true, value_assigned_if_not_true)

For example, I can create new dummy variables to create indicators for the underweight, overweight, and obese BMI categories as follows:

> underwgt<-ifelse(BMI<18.5, 1, 0)
> overwgt<-ifelse(BMI>=25 & BMI<30, 1, 0)
> obese<-ifelse(BMI>=30, 1, 0)

The first command above evaluates each subject's BMI, and if it is less than 18.5, it assigns a value of 1 to a new variable called "underwgt"; if BMI is not less than 18.5, R assigns a value of 0 to "underwgt". Similarly, the second command creates a new variable called "overwgt" and assigns it a value of 1 if BMI is greater than or equal to 25 and less than 30, and it assigns a value of 0 if BMI is either below or above this range. Finally, the third command creates a variable called "obese", which has a value of 1 if BMI is greater than or equal to 30 and a value of 0 if BMI is less than 30.
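To see how the three commands behave exactly at the cutoffs, here is a minimal sketch using a handful of made-up BMI values (they are illustrative only and do not come from the data set). It also includes one missing value, on the assumption that the real BMI variable may contain some: ifelse() returns NA when the test is NA, so a subject with missing BMI gets NA on all three indicators, and lm() would by default drop that subject from the regression.

```r
# Hypothetical BMI values sitting on and around the cutoffs; NA is a missing BMI
BMI <- c(17.0, 18.5, 24.9, 25.0, 29.9, 30.0, NA)

# Same three commands as above
underwgt <- ifelse(BMI < 18.5, 1, 0)
overwgt  <- ifelse(BMI >= 25 & BMI < 30, 1, 0)
obese    <- ifelse(BMI >= 30, 1, 0)

# One row per subject: BMI alongside its three indicators
check <- data.frame(BMI, underwgt, overwgt, obese)
print(check)
```

A BMI of exactly 18.5 is not flagged as underweight, 25.0 is flagged as overweight, and 30.0 is flagged as obese, which matches the inequalities in the commands.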

Note that I don't need to include a statement to define "normal" because all the normal subjects (who are not in any of the other three categories) will have a value of 0 for all three indicator variables, so they become the reference group for the other three categories by default. The "ifelse" commands above would assign the values shown in the table below.

                      BMI Categories
Indicator   Underweight   Normal   Overweight   Obese
underwgt         1           0          0          0
overwgt          0           0          1          0
obese            0           0          0          1
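One way to confirm that the indicators follow this pattern is to label each subject's BMI category separately and list the distinct combinations of indicator values. The sketch below does this with made-up BMI values; the use of cut() to build the label and the names bmi_cat and patterns are my own choices for illustration, not part of the original analysis.

```r
# Hypothetical BMI values covering all four categories, including the cutoffs
BMI <- c(17.2, 22.0, 27.5, 33.1, 18.5, 25.0, 30.0)

underwgt <- ifelse(BMI < 18.5, 1, 0)
overwgt  <- ifelse(BMI >= 25 & BMI < 30, 1, 0)
obese    <- ifelse(BMI >= 30, 1, 0)

# Label each subject; right = FALSE makes the intervals [lower, upper),
# which matches the >= and < comparisons used in the ifelse() commands
bmi_cat <- cut(BMI, breaks = c(-Inf, 18.5, 25, 30, Inf), right = FALSE,
               labels = c("Underweight", "Normal", "Overweight", "Obese"))

# Distinct indicator patterns, one row per BMI category
patterns <- unique(data.frame(bmi_cat, underwgt, overwgt, obese))
patterns <- patterns[order(patterns$bmi_cat), ]
print(patterns, row.names = FALSE)
```

Each category should appear once, with the "Normal" row showing 0 for all three indicators.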

Once the dummy variables have been created, we can perform a multiple linear regression that includes this set of indicators in addition to other independent variables. Note that BPMEDS is a dichotomous variable coded 1 if any BP medications are used and coded 0 if not. The variable MALE was coded 1 for males and 0 for females. To perform this regression analysis in R, we use the following code:

> lm3<-lm(SYSBP~underwgt+overwgt+obese+AGE+MALE+BPMEDS)
> summary(lm3)

Call:
lm(formula = SYSBP ~ underwgt + overwgt + obese + AGE + MALE + BPMEDS)

Residuals:
    Min      1Q  Median      3Q     Max
-57.521 -12.845  -2.209   9.979 139.606

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 84.67507    1.73970  48.672  < 2e-16 ***
underwgt    -6.46970    2.58809  -2.500   0.0125 *
overwgt      7.20117    0.64426  11.177  < 2e-16 ***
obese       14.85372    0.92743  16.016  < 2e-16 ***
AGE          0.87289    0.03431  25.444  < 2e-16 ***
MALE        -2.40472    0.59802  -4.021 5.89e-05 ***
BPMEDS      25.02151    1.65839  15.088  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.24 on 4347 degrees of freedom
  (80 observations deleted due to missingness)
Multiple R-squared: 0.2585, Adjusted R-squared: 0.2574
F-statistic: 252.5 on 6 and 4347 DF, p-value: < 2.2e-16

Interpretation:

The overall p-value for the model on the last row of output indicates that the model is highly significant, so we can interpret the results for individual variables in the model. After controlling for confounding with multiple linear regression, each of these predictor variables has a statistically significant association with systolic blood pressure. Subjects who are underweight have systolic blood pressures that are about 6 mm Hg lower than those of subjects with BMI in the normal range after adjusting for other variables in the model. Those who are overweight have systolic blood pressures that are about 7 mm Hg higher than those of normal individuals, and obese people have systolic blood pressures about 15 mm Hg higher than subjects with a normal BMI. AGE is associated with a small but statistically significant increase in systolic blood pressure, i.e., an increase of just less than 1 mm Hg for each additional year of age. Males have systolic blood pressures about 2 mm Hg lower than females. Those using blood pressure lowering medications had pressures about 25 mm Hg higher than those not using such medications; this most likely reflects the fact that these medications are prescribed to people who have high blood pressure, not an effect of the medications themselves. [There is a lot of undiagnosed/untreated high blood pressure in the population.]
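To make the meaning of the indicator coefficients concrete, the following sketch plugs the estimates from the output above into the regression equation for a made-up subject: a 50-year-old woman who is not taking blood pressure medication. The subject and the names b, x, pred_obese, and pred_normal are hypothetical and are only for illustration.

```r
# Coefficient estimates copied from the summary(lm3) output above
b <- c(intercept = 84.67507, underwgt = -6.46970, overwgt = 7.20117,
       obese = 14.85372, AGE = 0.87289, MALE = -2.40472, BPMEDS = 25.02151)

# Hypothetical subject: obese, age 50, female, not on BP medication
x <- c(intercept = 1, underwgt = 0, overwgt = 0,
       obese = 1, AGE = 50, MALE = 0, BPMEDS = 0)

# Predicted systolic blood pressure = sum of coefficient * value
pred_obese <- sum(b * x)

# The same subject with a normal BMI (all three indicators equal to 0)
x_normal <- x
x_normal["obese"] <- 0
pred_normal <- sum(b * x_normal)

print(c(obese = pred_obese, normal = pred_normal,
        difference = pred_obese - pred_normal))
```

The predicted values are about 143.2 mm Hg for the obese subject and 128.3 mm Hg for the otherwise identical subject with normal BMI. The difference is exactly the coefficient for "obese", which is what "compared to the reference group, holding the other variables fixed" means.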

Dummy Variable Analysis When the Data Set Has a Categorical Variable

Suppose that instead of a continuous variable for BMI, the data set already has a categorical variable called "BMIcat" that is coded 1, 2, 3, 4 to indicate those who are underweight, normal weight, overweight, or obese.

In this situation you can use the factor( ) command in R to create dummy variables using a coding statement like the one shown below.

> summary(lm(sysbp ~ age + studygrp + factor(BMIcat)))

The factor( ) command makes the lowest coded category the reference unless you specify otherwise. For example, if the data is coded 1, 2, 3, 4, R will use category 1 (underweight subjects) as the reference. However, you can use the relevel( ) command to specify the reference. Here it makes sense to use 'normal weight', i.e., BMIcat = 2, as the reference, as below.

> summary(lm(sysbp~age + studygrp + relevel(factor(BMIcat),ref="2")))

This would produce the following output:

Call:
lm(formula = sysbp ~ age + studygrp + relevel(factor(BMIcat), ref = "2"))

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)
(Intercept)                        86.3514     6.1109  14.131  < 2e-16 ***
age                                 0.6676     0.1065   6.266 4.04e-09 ***
studygrp                            4.4550     2.6644   1.672  0.09668 .
relevel(factor(BMIcat),ref="2")1  -30.3576    11.0720  -2.742  0.00689 **
relevel(factor(BMIcat),ref="2")3    2.0878     2.6448   0.789  0.43118
relevel(factor(BMIcat),ref="2")4   15.4479     6.0609   2.549  0.01186 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.29 on 144 degrees of freedom
Multiple R-squared: 0.2884, Adjusted R-squared: 0.2637
F-statistic: 11.67 on 5 and 144 DF, p-value: 1.767e-09
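The trailing 1, 3, and 4 in the coefficient names identify the BMIcat category being compared with the reference category 2. To see the dummy variables that R builds behind the scenes, one option is to inspect the design matrix with model.matrix(). The sketch below uses a short made-up vector of category codes, and the names bmi_f and X are hypothetical; it assumes R's default treatment contrasts for unordered factors.

```r
# Hypothetical category codes: 1 = underweight, 2 = normal, 3 = overweight, 4 = obese
BMIcat <- c(1, 2, 2, 3, 4, 3, 2, 4)

# Convert to a factor and make category 2 (normal weight) the reference
bmi_f <- relevel(factor(BMIcat), ref = "2")
print(levels(bmi_f))  # the reference level is listed first

# Design matrix that lm() would use for this term:
# an intercept plus one 0/1 column for each non-reference category
X <- model.matrix(~ bmi_f)
print(cbind(BMIcat, X))
```

Rows with BMIcat equal to 2 have 0 in every dummy column, just like the "normal" subjects in the hand-built indicators earlier.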