Multiple Variable Regression
Multiple Linear Regression
In the example below I am using part of the data from the Framingham Heart Study to determine whether body mass index (bmi) is associated with systolic blood pressure (sysbp).
First, I run a simple linear regression to assess the crude (unadjusted) association between bmi and sysbp.
# Simple Linear Regression
> summary(lm(sysbp~bmi))
Call:
lm(formula = sysbp ~ bmi)
Residuals:
Min 1Q Median 3Q Max
-49.960 -15.973 -2.991 13.389 116.933
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 108.9627 2.6391 41.29 <2e-16 ***
bmi 1.1959 0.1008 11.87 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 22.12 on 2996 degrees of freedom
Multiple R-squared: 0.04489, Adjusted R-squared: 0.04457
F-statistic: 140.8 on 1 and 2996 DF, p-value: < 2.2e-16
Interpretation: BMI is significant related to systolic blood pressure (p<0.0001). Each increment in BMI is associated with an increase in sysbp of 1.2 mm Hg.
> #Multiple linear Regression
> summary(lm(sysbp~bmi + age + male + ldl + hdl +cursmoke))
Call:
lm(formula = sysbp ~ bmi + age + male + ldl + hdl + cursmoke)
Residuals:
Min 1Q Median 3Q Max
-54.981 -14.886 -2.319 11.703 105.840
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 45.885480 4.723993 9.713 <2e-16 ***
bmi *1.263627 0.097482 *12.963 <2e-16 ***
age 0.925972 0.047140 *19.643 <2e-16 ***
male -0.896683 0.818306 *-1.096 0.273
ldl 0.021030 0.008265 2.545 *0.011 *
hdl 0.041834 0.026281 1.592 *0.112
cursmoke -0.499081 0.836412 **-0.597 *0.551
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.68 on 2991 degrees of freedom
Multiple R-squared: 0.166, Adjusted R-squared: 0.1643
F-statistic: 99.22 on 6 and 2991 DF, p-value: < 2.2e-16
Interpetation: After adjusting for age, sex, LDL, HDL, and current smoking, BMI was still significantly associated with systolic blood pressure. Each unit increase in BMI was associated with a modest increase in systolic blood pressure of about 1.3 mm Hg on average (p<0.0001). The adjusted estimate for BMI was similar to the crude estimate.
Multiple Logistic Regression
I am again using the Framinghams data set, but now my goal is to test whether there is an association between diabetes and odds of being hospitalized for a myocardial infarction (hospmi). I begin by looking at the crude association. So, now the outcome of interest (hospmi) is a dichotomous variable, and I have to use logistic regression instead of linear regression.
# First I will examine the crude association with a simple chi-squared test
> table(diabetes,hospmi)
hospmi
diabetes 0 1
0 **2557 210
1 *183 48
> library(epitools)
> oddsratio.wald(table(diabetes,hospmi))
$data
hospmi
diabetes 0 1 Total
***0 2557 210 2767
***1 183 48 231
*Total **2740 *258 2998
$measure
odds ratio with 95% C.I.
diabetes estimate lower upper
0 1.000000 NA **NA
1 3.193755 2.256038 4.521233
$p.value
two-sided
diabetes midp.exact fisher.exact chi.square
0 NA NA NA
1 1.945462e-09 1.88176e-09 6.548224e-12
$correction
[1] FALSE
attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
Interpretation: Subjects with diabetes had 1.95 times the odds of being hospitalized for a myocardial infarction compared to subjects without diabetes (p<0.0001)
# Multiple Logistic Regression
> log.out <-glm(hospmi~diabetes + age + male + bmi + sysbp + diabp + hdl + ldl,
+ family=binomial (link=logit))
> summary(log.out)
Call:
glm(formula = hospmi ~ diabetes + age + male + bmi + sysbp +
diabp + hdl + ldl, family = binomial(link = logit))
Deviance Residuals:
Min 1Q Median *3Q *Max
-1.3615 -0.4634 -0.3282 -0.2277 2.7531
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.946249 0.961247 -6.186 6.17e-10 ***
diabetes 0.946976 0.192002 4.932 8.13e-07 ***
age 0.016407 0.009186 1.786 0.074088 .
male 1.197212 0.153976 7.775 7.52e-15 ***
bmi -0.003890 0.018470 -0.211 0.833172
sysbp 0.012896 0.004114 3.135 0.001721 **
diabp -0.005405 0.007965 -0.679 0.497354
hdl -0.017798 0.005306 -3.354 0.000795 ***
ldl 0.007328 0.001374 *5.334 9.62e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1758.7 on 2997 degrees of freedom
Residual deviance: 1579.5 on 2989 degrees of freedom
AIC: 1597.5
Number of Fisher Scoring iterations: 6
# The next command asks R to provide the adjusted odds ratios for each variable in the model
> exp(log.out$coeff)
(Intercept) diabetes age male **bmi sysbp diabp
0.002615633 2.577901246 1.016542819 3.310874851 0.996117094 1.012979017 0.994609173
**hdl ldl
0.982359551 1.007354740
# The next command asks for the 95% confidence intervals for the adjusted odds ratios.
> exp(confint(log.out))
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.000392781 0.01703735
diabetes 1.755767291 3.73167495
age 0.998370449 1.03500180
male 2.457227924 4.49622942
bmi 0.960202678 1.03233392
sysbp 1.004774973 1.02112476
diabp 0.979236759 1.01032182
hdl 0.972038277 0.99246563
ldl 1.004631473 1.01006555
Interpretation: After adjusting for age, sex, BMI systolic and diastolic blood pressure, HDL cholesterol, and LDL cholesterol, diabetics had 0.95 times the odds of being hospitalized for a myocardial infarction compared to non-diabetics. Since the crude odds ratio was 1.95, this indicates that the association between diabetes and hospitalizaiton for MI was confounded by one or more of these other risk factors.