4.2 Multiple linear regression for a measurement outcome


4.2.1 Multiple regression analysis

Multiple regression analysis is also performed through the 'lm( )' function. The syntax is the same as for simple regression except that more than one predictor variable is specified:

> summary(lm(fev1_litres ~ ht_cm + sexM))

Call:
lm(formula = fev1_litres ~ ht_cm + sexM)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.02900 -0.33864 -0.08322  0.36404  1.45297 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) -10.27991    4.42528  -2.323  0.03284 * 
ht_cm         0.08196    0.02531   3.238  0.00484 **
sexM         -0.28059    0.29034  -0.966  0.34738   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6159 on 17 degrees of freedom
Multiple R-Squared: 0.3903,     Adjusted R-squared: 0.3186 
F-statistic: 5.441 on 2 and 17 DF,  p-value: 0.01491
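Rather than re-fitting the model inside summary( ) each time, the fitted model can be saved as an object and queried; a small sketch, assuming the fev1_litres, ht_cm, and sexM variables are in the workspace as above:

> fit <- lm(fev1_litres ~ ht_cm + sexM)   # save the fitted model
> coef(fit)                               # just the regression coefficients
> confint(fit)                            # 95% confidence intervals for the coefficients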

4.2.2 Multiple regression with categorical predictors

In regression analyses, categorical predictors are represented through a set of 0/1 indicator (or dummy) variables. The factor( ) command in R identifies a variable as categorical and creates the corresponding dummy variables. For example, the variable 'BMIcat' is coded 1, 2, 3, 4 to indicate those who are underweight, normal weight, overweight, or obese. When creating dummy variables, the factor( ) command chooses the lowest-coded category as the reference category, so here it will choose 'underweight' as the reference group (see below for how to change the reference category). The factor( ) command can be used inside the lm( ) command to indicate categorical predictors:

> summary(lm(sysbp ~ age + studygrp + factor(BMIcat)))

Call:
lm(formula = sysbp ~ age + studygrp + factor(BMIcat))

Residuals:
    Min      1Q  Median      3Q     Max 
-38.276  -8.802  -1.720   6.922  56.383 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      55.9938    12.6939   4.411 2.00e-05 ***
age               0.6676     0.1065   6.266 4.04e-09 ***
studygrp          4.4550     2.6644   1.672 0.096681 .  
factor(BMIcat)2  30.3576    11.0720   2.742 0.006885 ** 
factor(BMIcat)3  32.4454    11.0745   2.930 0.003946 ** 
factor(BMIcat)4  45.8055    12.3038   3.723 0.000282 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.29 on 144 degrees of freedom
Multiple R-squared: 0.2884,     Adjusted R-squared: 0.2637 
F-statistic: 11.67 on 5 and 144 DF,  p-value: 1.767e-09

 

Note that three dummy variables were included in the regression to represent the four BMI categories (the reference category, underweight, does not get its own dummy variable).
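To see the 0/1 dummy coding that factor( ) sets up, you can display the contrasts for the factor; a quick check, assuming BMIcat is in the workspace as above:

> contrasts(factor(BMIcat))
  2 3 4
1 0 0 0
2 1 0 0
3 0 1 0
4 0 0 1

Each column is one of the dummy variables in the regression; the reference category (the row of all zeros) has no dummy of its own.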

In this example, it would be more natural to use 'normal weight', which is coded as BMIcat of 2, as the reference group. You can specify the reference group for a categorical variable with the 'relevel( )' command (short for 'reference level'). Here, to specify '2' as the reference category, we would use relevel(factor(BMIcat), ref="2") (getting a bit involved, with R functions nested within functions within functions):

> summary(lm(sysbp ~ age + studygrp + relevel(factor(BMIcat), ref="2")))

Call:
lm(formula = sysbp ~ age + studygrp + relevel(factor(BMIcat), ref = "2"))

Coefficients:
                                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)                           86.3514     6.1109  14.131  < 2e-16 ***
age                                    0.6676     0.1065   6.266 4.04e-09 ***
studygrp                               4.4550     2.6644   1.672  0.09668 .  
relevel(factor(BMIcat), ref = "2")1  -30.3576    11.0720  -2.742  0.00689 ** 
relevel(factor(BMIcat), ref = "2")3    2.0878     2.6448   0.789  0.43118    
relevel(factor(BMIcat), ref = "2")4   15.4479     6.0609   2.549  0.01186 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.29 on 144 degrees of freedom
Multiple R-squared: 0.2884,     Adjusted R-squared: 0.2637 
F-statistic: 11.67 on 5 and 144 DF,  p-value: 1.767e-09

 

These results use normal weight as the reference category.
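If the releveled factor will be used in several models, it can be cleaner to save it as its own variable first rather than nesting the functions inside lm( ) each time; a sketch (the name 'bmifac' is just an illustration):

> bmifac <- relevel(factor(BMIcat), ref="2")   # normal weight as reference
> summary(lm(sysbp ~ age + studygrp + bmifac))

This fits the same model as above, with shorter variable labels in the output.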

4.2.3 Finding standardized regression coefficients in R

R gives (unstandardized) regression coefficients and the model R-square as part of the standard output from a regression analysis, but does not include the standardized regression coefficients as part of the standard output.

For example, the following regression model predicts systolic blood pressure from sex, age, BMI, and cholesterol levels:

> summary(lm(SYSBP ~ AGE + SEX + BMI + TOTCHOL))

Call:
lm(formula = SYSBP ~ AGE + SEX + BMI + TOTCHOL)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 40.98768   11.81398   3.469 0.000642 ***
AGE          0.89031    0.16052   5.546 9.41e-08 ***
SEX          5.49818    2.81491   1.953 0.052222 .  
BMI          1.43825    0.29289   4.910 1.92e-06 ***
TOTCHOL      0.00636    0.03330   0.191 0.848733    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.19 on 195 degrees of freedom
Multiple R-squared: 0.2879,     Adjusted R-squared: 0.2733 
F-statistic: 19.71 on 4 and 195 DF,  p-value: 1.219e-13

 

One way to calculate standardized regression coefficients in R is to do it 'by hand'. To find the standardized coefficients, we first convert every variable in the analysis to a z-score, using the 'scale( )' function (I've named these new z-score variables with a '.z' suffix as a reminder that they are z-scores, but this naming is not necessary):

> sysbp.z <- scale(SYSBP)
> age.z <- scale(AGE)
> sex.z <- scale(SEX)
> bmi.z <- scale(BMI)
> totchol.z <- scale(TOTCHOL)

Just to check, the mean for these z-score variables should be 0 and the standard deviation should be 1:

> mean(totchol.z)
[1] -2.196409e-16
> sd(totchol.z)
[1] 1
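One small detail worth knowing: scale( ) returns a one-column matrix rather than a plain numeric vector. This rarely matters for lm( ), but if a plain vector is needed the result can be converted with as.numeric( ); a sketch:

> totchol.z <- as.numeric(scale(TOTCHOL))   # plain numeric vector instead of a one-column matrix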

We can find the standardized coefficients by running the regression analysis on the z-score version of the variables:

> summary(lm(sysbp.z ~ age.z + sex.z + bmi.z + totchol.z))

Call:
lm(formula = sysbp.z ~ age.z + sex.z + bmi.z + totchol.z)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7566 -0.6151 -0.1565  0.5738  3.2835 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 1.902e-16  6.028e-02 3.15e-15   1.0000    
age.z       3.529e-01  6.363e-02    5.546 9.41e-08 ***
sex.z       1.200e-01  6.141e-02    1.953   0.0522 .  
bmi.z       3.030e-01  6.170e-02    4.910 1.92e-06 ***
totchol.z   1.188e-02  6.223e-02    0.191   0.8487    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8525 on 195 degrees of freedom
Multiple R-squared: 0.2879,     Adjusted R-squared: 0.2733 
F-statistic: 19.71 on 4 and 195 DF,  p-value: 1.219e-13

 

The 'slopes' from this analysis are the standardized slopes. Note that the p-values for the (now standardized) slopes match the p-values from the original version of the analysis, and that the model R-square is the same as in the original version of the analysis.
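An equivalent shortcut, which avoids re-running the regression, uses the standard relationship between the two sets of coefficients: each standardized slope equals the unstandardized slope multiplied by the ratio of the predictor's standard deviation to the outcome's standard deviation. A sketch, assuming the original (unstandardized) variables are in the workspace:

> fit <- lm(SYSBP ~ AGE + SEX + BMI + TOTCHOL)
> coef(fit)["AGE"] * sd(AGE) / sd(SYSBP)   # should match the age.z slope above

This gives the same value (about 0.353) as the coefficient for age.z in the z-score regression.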