Logistic Regression and Survival Analysis

Table of Contents»

 

Contributing Authors:

Ching-Ti Liu, PhD, Associate Professor, Biostatistics

Jacqueline Milton, PhD, Clinical Assistant Professor, Biostatistics

Avery McIntosh, doctoral candidate

 

 

This module will enable you to perform logistic regression and survival analysis in R. The following topics will be introduced:

1. Why use logistic regression?

2. Overview of logistic regression.

3. How to perform logistic regression in R.

4. Why use survival analysis?

5. Overview of survival analysis (Kaplan-Meier plots and Cox regression)

6. How to perform survival analysis in R.

Learning Objectives


By the end of this session students will be able to:

  1. Use R to perform logistic regression analysis and interpret the results.
  2. Use R to perform survival analysis and interpret the results. 

 

 

Logistic Regression


Why use logistic regression?

Previously we discussed how to determine the association between two categorical variables (odds ratio, risk ratio, chi-square/Fisher test). Suppose we want to explore a situation in which the dependent variable is dichotomous (1/0, yes/no, case/control) and the independent variable is continuous. 

Let's examine the Outbreak dataset in the epicalc library in R. Suppose we want to examine the association between vomiting (yes/no) as a dependent variable and age.  Suppose we try to examine this association using linear regression.

 

E[ Vomiting ] = β0 + β1*age

If we run the linear regression model in R, what happens?

> library(epicalc)

> data(package="epicalc",Outbreak)

> attach(Outbreak)

  

> model.1<-lm(vomiting~age)

> summary(model.1)

> plot(model.1,which=1:4)

We see that just about all the assumptions of linear regression are violated in these plots. This is due to the fact that linear regression can only work well if the dependent variable is continuous. In this case the predicted values still fall between 0 and 1, which is good, but in many cases we can get an expected value of the probability of being a case that is above 1, or below 0; this is clearly erroneous.

Overview of Logistic Regression


When the assumptions of linear regression are violated, oftentimes researchers will transform the independent or dependent variables. In logistic regression the dependent variable is transformed using what is called the logit transformation:

 

Then the new logistic regression model becomes:

Covariates can be of any type:

and the outcome is binary: 0/1. Since Y is either 0 or 1, expected value of Y for a set of covariates X is thought of as "the probability that event Y occurs, given the covariates X." So we have that if p is the probability of vomiting, then

In linear regression, we were able to predict the outcome Y given new data by plugging in covariates on new data into the model. Here, we can do that for odd, odds ratios, or predicted probabilities (more on this later).

In standard linear regression, the coefficients are estimated based on the "least-squares" criterion. Here, the estimates for the coefficients are performed via Maximum Likelihood Estimation.

Due to a transformed outcome, there is a concomitant change in the interpretation of the regression coefficients:

log(odds of vomiting for those aged 20) =  β0 + β1*20

log(odds of vomiting for those aged 21) =  β0 + β1*21

 

odds of vomiting for those aged 20 =  e β0 + β1*20

odds of vomiting for those aged 21 =  e β0 + β1*21

 

If we want to determine the odds ratio to compare the odds of vomiting for those who are 20 years old versus the odds of vomiting for those who are 21 years old we can do the following:

By laws of exponents, we can combine terms, and we get:

Thus  = exp(β1) represents the odds ratio of vomiting comparing groups that differ by a one-unit change in age.

If we want to get a predicted probability of the event of interest (p), given our covariates, then we can do a little algebra, and we have that

 

Generalized Linear Models

There is an entire sub-field of statistical modeling called generalized linear models, where the outcome variable undergoes some transformation to enable the model to take the form of a linear combination, i.e. f (E[Y]) = β0 + β1X1+…+ βkXk.

Logistic regression is just one such type of model; in this case, the function f (・) is

f (E[Y]) = log[ y/(1 - y) ].

 

There is Poisson regression (count data), Gamma regression (outcome strictly greater than 0), Multinomial regression (multiple categorical outcomes), and many, many more.

If you are interested in these topics, SPH offers

·       BS 853: Generalized Linear Models (logistic regression is just one class)

·       BS 820: Logistic Regression and Survival Analysis

·       BS 852: Statistical Methods in Epidemiology (covers some logistic and survival)

 

And in the math department there is

·       MA 575: Linear Models

·       MA 576: Generalized Linear Models

 

Logistic Regression in R


To perform logistic regression in R, you need to use the glm() function.  Here, glm stands for "general linear model." Suppose we want to run the above logistic regression model in R, we use the following command:

> summary( glm( vomiting ~ age, family = binomial(link = logit) ) )

 

Call:

glm(formula = vomiting ~ age, family = binomial(link = logit))

 

Deviance Residuals:

    Min       1Q   Median       3Q      Max 

-1.0671  -1.0174  -0.9365   1.3395   1.9196 

 

Coefficients:

             Estimate Std. Error z value Pr(>|z|)   

(Intercept) -0.141729   0.106206  -1.334    0.182   

age         -0.015437   0.003965  -3.893 9.89e-05 ***

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 

(Dispersion parameter for binomial family taken to be 1)

 

Null deviance: 1452.3  on 1093  degrees of freedom

Residual deviance: 1433.9 on 1092  degrees of freedom

 

AIC: 1437.9

 

Number of Fisher Scoring iterations: 4

 

To get the significance for the overall model we use the following command:

> 1-pchisq(1452.3-1433.9, 1093-1092)

 [1] 1.79058e-05

The input to this test is:

This is analogous to the global F test for the overall significance of the model that comes automatically when we run the lm() command. This is testing the null hypothesis that the model is no better (in terms of likelihood) than a model fit with only the intercept term, i.e. that all beta terms are 0.

Thus the logistic model for these data is:

E[ odds(vomiting) ] = -0.14 – 0.02*age

 

This means that for a one-unit increase in age there is a 0.02 decrease in the log odds of vomiting. This can be translated to e-0.02 =  0.98. Groups of people in an age group one unit higher than a reference group have, on average, 0.98 times the odds of vomiting.

How do we test the association between vomiting and age?

When testing the null hypothesis that there is no association between vomiting and age we reject the null hypothesis at the 0.05 alpha level (z = -3.89, p-value = 9.89e-05).

On average, the odds of vomiting is 0.98 times that of identical subjects in an age group one unit smaller.

 

Finally, when we are looking at whether we should include a particular variable in our model (maybe it's a confounder), we can include it based on the "10% rule," where if the change in our estimate of interest changes more than 10% when we include the new covariate in the model, then we that new covariate in our model. When we do this in logistic regression, we compare the exponential of the betas, not the untransformed betas themselves!

 

Test the hypothesis that being nauseated was not associated with sex and age (hint: use a multiple logistic regression model).  Test the overall hypothesis that there is no association between nausea and sex and age.  Then test the individual main effects hypothesis (i.e. no association between sex and nausea after adjusting for age, and vice versa).

 

Survival Analysis


Why use survival analysis?

Let's examine the dataset luekemia in the "survival" package.

> rm(list = ls())

> library(survival)

> data(leukemia)

> detach(Outbreak)

> data(package="survival",leukemia)

> attach(leukemia)

 

Survival in patients with Acute Myelogenous Leukemia. The question at the time was whether the standard course of chemotherapy should be extended ('maintenance') for additional cycles.

 

Variable

Description

time

Survival or censoring time

status

censoring status (0 if an individual was censored, 1 otherwise)

x

was maintenance chemotherapy given? ("Maintained" if yes, "Not maintained" if no

 

Suppose we want to examine the association between the length of survival of a patient (how long they survived leukemia) and whether or not chemotherapy was maintained.  Suppose we create a histogram of the survival times.

> hist(time, xlab="Length of Survival Time", main="Histogram of Survial Time in AML Patients")

We see that the survival times are highly skewed due to the fact that there is a person who survived much longer than everyone else. In addition, there were quite a few people who survived for fewer than 10 years. This type of skewed distribution is typical when dealing with survival data and thus the normality assumption of linear regression is often violated, making it inappropriate to use.

In addition to this problem, we also see a problem known as censoring with survival data.

> leukemia[1:5,]

 

   time  status     x

1    9      1 Maintained

2   13      1 Maintained

3   13      0 Maintained

4   18      1 Maintained

5   23      1 Maintained

 

The third observation has a status of 0. This means that the person was followed for 13 months and after that was lost to follow up. So we only know that the patient survived AT LEAST 13 months, but we have no other information available about the patient's status.  This type of censoring (also known as "right censoring") makes linear regression an inappropriate way to analyze the data due to censoring bias.

Overview of Survival Analysis


One way to examine whether or not there is an association between chemotherapy maintenance and length of survival is to compare the survival distributions. This is done by comparing Kaplan-Meier plots.

> survfit(Surv(time, status)~1)

 

Call: survfit(formula = Surv(time, status) ~ 1)

 

records   n.max n.start  events  median 0.95LCL 0.95UCL

     23      23      23      18      27      18      45

This tells us that for the 23 people in the leukemia dataset, 18 people were uncensored (followed for the entire time, until occurrence of event) and among these 18 people there was a median survival time of 27 months (the median is used because of the skewed distribution of the data). The 95% confidence interval for the median survival time for the 18 uncensored individuals is (18, 45).

>summary(survfit(Surv(time, status)~1))

 

Call: survfit(formula = Surv(time, status) ~ 1)

 

 time  n.risk  n.event  survival   std.err   lower 95% CI upper 95% CI

    5     23       2        0.9130   0.0588       0.8049        1.000

    8     21       2        0.8261   0.0790       0.6848        0.996

    9     19       1        0.7826   0.0860       0.6310        0.971

   12     18       1        0.7391   0.0916       0.5798        0.942

   13     17       1        0.6957   0.0959       0.5309        0.912

   18     14       1        0.6460   0.1011       0.4753        0.878

   23     13       2        0.5466   0.1073       0.3721        0.803

   27     11       1        0.4969   0.1084       0.3240        0.762

   30      9       1        0.4417   0.1095       0.2717        0.718

   31      8       1        0.3865   0.1089       0.2225        0.671

   33      7       1        0.3313   0.1064       0.1765        0.622

   34      6       1        0.2761   0.1020       0.1338        0.569

   43      5       1        0.2208   0.0954       0.0947        0.515

   45      4       1        0.1656   0.0860       0.0598        0.458

   48      2       1        0.0828   0.0727       0.0148        0.462

The following summary goes through each time point in the study in which an individual was lost to follow up or died and re-computes the total number of people still at risk (n.risk), the number of events at that time point (n.event), the proportion of individuals who survived up until that point (survival) and the standard error (std.err) and 95% confidence interval (lower 95% CI, upper 95% CI) for the proportion of individuals who survived at that point.

 

You can also plot the survival curves using the following commands:

> surv.aml <- survfit(Surv(time,status)~1)

> plot(surv.aml, main = "Plot of Survival Curve for AML Patients",

xlab = "Length of Survival", ylab = "Proportion of Individuals who have Survived")

This plot shows the survival curve (also known as a Kaplan-Meier plot), the proportion of individual who have survived up until that particular time as a solid black line and the 95% confidence interval (the dashed lines).

 

Our original question was to examine the association between chemotherapy maintenance and length of survival. To do this in the context of survival analysis we compare the survival curves of those who received chemotherapy maintenance and those who did not.  First, let's examine how to compare the survival statistics and create Kaplan-Meier plots for each chemotherapy group.

> print(survfit(Surv(time,status)~x))

 

Call: survfit(formula = Surv(time, status) ~ x)

 

                                       records n.max n.start events median 0.95LCL 0.95UCL

x=Maintained               11           11      11           7          31            18            NA

x=Nonmaintained       12           12      12          11        23              8             NA

 

In this output we see that there is a total of 11 people who were on maintained chemotherapy, 7 died, the median follow up was 31. The 95% confidence interval of survival time for those on maintained chemotherapy is (18, NA); NA in this case means infinity. A 95% upper confidence limit of NA/infinity is common in survival analysis due to the fact that the data is skewed.

>plot(survfit(Surv(time,status)~x), main = "Plot of Survival Curves by Chemotherapy Group", xlab = "Length of Survival",ylab="Proportion of Individuals who have Survived",col=c("blue","red"))

>legend("topright", legend=c("Maintained", "Nonmaintained"),fill=c("blue","red"),bty="n")

In order to determine if there is a statistically significant difference between the survival curves, we perform what is known as a log-rank test, which tests the following hypothesis:

The test is performed in R as follows:

> survdiff(Surv(time,status)~x)

 

Call:

survdiff(formula = Surv(time, status) ~ x)

 

                  N Observed Expected   (O-E)^2/E    (O-E)^2/V

x=Maintained      11        7           10.69          1.27         3.4

x=Nonmaintained   12       11           7.31           1.86         3.4

 

Chisq= 3.4 on 1 degrees of freedom, p= 0.0653

We see that when testing the null hypothesis that there is no difference in the survival function for those who were on chemotherapy maintenance versus those who were not on chemotherapy maintenance we fail to reject the null hypothesis χ12 = 3.4 with a p-value = 0.0653.

Suppose we want to see if there is a difference in survival functions between two groups after adjusting for a potential confounder.

We will not go into much detail on this model here, but briefly, a proportional hazard model is given as follows:

We define the hazard of an event as the risk of that event, as the time frame shrinks to 0. Usually when we calculate risk ratios, we have some time in mind, either cross-sectional, or, say, risk of dying after a year for two groups. Hazard is the risk, taken as the time frame vanishes to time t = 0.

h0(t) is the "baseline hazard," which we don't worry too much about, because when we look at the ratio of hazards for two conditions, we get the following:

 

Hazard ratio for individual with X = x vs. X = (x+1):

This term is the hazard ratio for the event of interest for people with covariate x+1 vs. people with covariate x. If the term is >1, then those people who have a one-unit increases in their covariate compared against a reference group are at a higher "risk" (hazard) for the event.

This is just the bare-bones basics of Cox Proportional Hazards models. There is a lot more to these models, including various assumptions that need to be tested for the model validity to hold, and issues in interpretation. However, for our purposes, just seeing how to run these models is enough.

Using a new dataset in the survival package called "cancer" we want to examine the survival in 228 patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities.

Variable Name

Description

inst

Institution code

time

Survival time in days

status

censoring status, 1=censored, 2=dead

age

age in years

sex

1=Male, 2= Female

ph.ecog

ECOG performance score (0= good, 5=dead)

ph.karno

Karnofsky performance score (0=bad, 100=good) rated by physician

pat.karno

Karnofsky performance score as rated by patient

meal.cal

calories consumed at meals

wt.loss

Weight loss in last six months

 First, let's do some non-adjusted analysis:

Test the null hypothesis that there is no difference in the survival function of patients with advanced lung cancer between males and females.

 

Suppose we want to determine if there is an association between length of survival and gender after adjusting for age. The log-rank test discussed previously will only compare groups, it does not take into account adjusting for other covariates/confounding variables.  To adjust for other covariates we perform the Cox's proportional hazards regression using the coxph() function in R. To perform exercise 2 using Cox regression we use the following commands:

> detach(leukemia)

> attach(cancer)

 

> coxph(Surv(time,status)~sex)

 

Call:

coxph(formula = Surv(time, status) ~ sex)

 

      coef     exp(coef)  se(coef)     z      p

sex -0.531     0.588     0.167   -3.18   0.0015

 

Likelihood ratio test=10.6  on 1 df, p=0.00111  n= 228, number of events= 165

We see that when using the Cox regression to perform the test, the results are very similar to the log rank test (χ12 = 10.6 with p-value = 0.00111).  The Cox regression estimates the hazard ratio of dying when comparing males to females. Here, sex is significantly related to survival (p-value = 0.00111), with better survival in females in comparison to males (hazard ratio of dying  = 0.588). Females have 0.588 times the hazard of dying in comparison to males.

 

To extend the cox regression to adjust for other covariates, we will extend this to test the following hypothesis:

> coxph(formula = Surv(time, status) ~ sex + age)

 

Call:

coxph(formula = Surv(time, status) ~ sex + age)

 

 

           coef       exp(coef) se(coef)     z      p

sex   -0.513     0.599  0.16746 -3.06 0.0022

age    0.017     1.017  0.00922  1.85 0.0650

 

Likelihood ratio test = 14.1  on 2 df, p=0.000857  n= 228, number of events= 165

When testing the null hypothesis that there is no difference in the survival function between males and females after adjusting for age we see that we reject the null hypothesis (z = -3.06, p-value = 0.0022). After adjusting for age, females have significantly better survival in comparison to males. Females have 0.599 times the hazard of dying in comparison to males, adjusting for age (HR<1).

 

Assignment:

Things we did not cover (or only touched on)


 

 

For those who want to continue with R, SPH offers BS 845: Applied Statistical Modeling and Programming in R, taught by Professor Yang.