4.3 Logistic regression for a Yes/No outcome
In R, logistic regression is performed using the glm( ) function, for general linear model. This function can fit several regression models, and the syntax specifies the request for a logistic regression model.
As an example, we will look at factors associated with smoking among a sample of n=300 high school students from the Youth Risk Behavior Survey. The outcome variable is 'eversmokedaily1', coded as 1 for those who have smoked vs. 0 for those who have not. As a preliminary analysis, we calculate the proportion of respondents who have ever smoked daily:
> table(eversmokedaily1)
eversmokedaily1
0 1
229 69
> 69/(69+229)
[1] 0.2315436
In the following example, the glm( ) function performs the logistic regression, and the summary( ) function requests the default output summarizing the analysis. The 'family=binomial(link=logit)' syntax specifies a logistic regression model. As with the linear regression routine and the ANOVA routine in R, the 'factor( )' command can be used to declare a categorical predictor (with more than two categories) in a logistic regression; R will create dummy variables to represent the categorical predictor using the lowest coded category as the reference group.
In entering this command, I hit the 'return' to type things in over 2 lines; R will allow you to continue a command onto a second or third line.
In this example, 'bmi_cat' is a categorical variable coded 1,2,3 or 4 for those in BMI categories of underweight, normal weight, overweight, or obese. By default, R creates 3 dummy variables to represent BMI category, using the lowest coded group (here 'underweight') as the reference. You can change the reference category by using the 'relevel( )' command (see dummy variables in multiple linear regression, above). The format of the relevel( ) command is:
relevel(factor(bmi_cat,ref="2")
This command would treat bmi_cat as a categorical predictor, and use category '2' (normal weight) as the reference category when creating dummy variables:
> summary(glm(eversmokedaily1 ~ age + sex1F2M +
relevel(factor(bmi_cat),ref='2') + alc_30days,
family=binomial(link=logit)))
Call:
glm(formula = eversmokedaily1 ~ age + sex1F2M + relevel(factor(bmi_cat),
ref = "2") + alc_30days, family = binomial(link = logit))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2067 -0.8476 -0.3406 -0.1983 2.5519
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.46565 1.98469 -1.242 0.2141
age -0.07688 0.12163 -0.632 0.5273
sex1F2M 0.55572 0.32236 1.724 0.0847 .
relevel(factor(bmi_cat), ref = "2")1 0.24505 0.55573 0.441 0.6593
relevel(factor(bmi_cat), ref = "2")3 0.19814 0.40889 0.485 0.6280
relevel(factor(bmi_cat), ref = "2")4 -0.70254 0.60298 -1.165 0.2440
alc_30days 2.30101 0.45628 5.043 4.58e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 292.83 on 268 degrees of freedom
Residual deviance: 247.20 on 262 degrees of freedom
(31 observations deleted due to missingness)
AIC: 261.20
Number of Fisher Scoring iterations: 5
In logistic regression, slopes can be converted to odds ratios for interpretation. Below we calculate the odds ratio for those in the BMI overweight category, and we calculate the OR and the 95% CI for the OR for those having had a drink in the past month vs. those not having had a drink in the past month (the # indicates a comment that is ignored by R):
> exp(0.55572) #OR for males compared to females
[1] 1.743196
> exp(0.55572 - 1.96*0.32236) # lower limit of 95% CI for OR
[1] 0.9267183
> exp(0.55572 + 1.96*0.32236) # upper limit of 95% CI for OR
[1] 3.279023
The C-statistic (also called the AUC statistic) for the logistic regression can be obtained from the lroc( ) command, which is in the 'epicalc' add-on package. To find the C-statistic, you must first install and then load the epicalc package. Once the package is loaded, you can find the C-statistic by first saving the results of the logistic regression, and then using the lroc( ) command:
> logisticresults <- glm(eversmokedaily1 ~ age + sex1F2M, family=binomial(link=logit)))
> lroc(logisticresults)
$model.description
[1] "eversmokedaily1 ~ age + sex1F2M"
$auc
[1] 0.5787582
The lroc( ) command gives a lot of additional output (more detail than we generally need) and a graph of the ROC curve; the C-statistic is given at the top of the output, labeled 'auc'.