Section 4: Association between variables and multivariable methods to control for confounding


4.1 Simple correlation and regression

As an example of a study examining the association between two measurement variables, we will look at the association between forced expiratory volume in one second (FEV1, a measure of lung function, in litres) and height (measured in centimeters) in a sample of 20 young adults. Data for the first 5 subjects:

ID  sexM  ht_cm  fev1_litres
 1     1    174          4.3
 2     1    181          4.8
 3     0    184          4.7
 4     1    177          5.4
 5     1    177          3.1
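To make the layout of the data concrete, here is a minimal sketch that enters the five rows listed above as a data frame. The object name 'fev5' is an assumption made for illustration, and the full sample of 20 subjects is not reproduced here.

```r
# First five subjects only, copied from the table above.
# 'fev5' is a hypothetical name; the full data set has 20 rows.
fev5 <- data.frame(
  ID          = 1:5,
  sexM        = c(1, 1, 0, 1, 1),            # 1 = male, 0 = female
  ht_cm       = c(174, 181, 184, 177, 177),  # height in centimeters
  fev1_litres = c(4.3, 4.8, 4.7, 5.4, 3.1)   # FEV1 in litres
)

str(fev5)                 # column names and types
with(fev5, mean(ht_cm))   # with() lets columns be used by bare name
```

Commands that refer to the columns by bare name, such as ht_cm, assume the columns are visible in the workspace, for example through with( ) or attach( ).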

4.1.1 Scatterplots

The plot( ) function draws a scatterplot. In R's formula notation the Y variable goes to the left of the tilde (~) and the X variable to the right. To plot FEV1 (the dependent or outcome variable) on the Y axis, and height (the independent or predictor variable) on the X axis:

> plot(fev1_litres ~ ht_cm)
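As a sketch of the same plot with the variables taken from a data frame and with labelled axes, the code below uses only the five rows listed in Section 4.1. The object name 'fev5' and the axis label text are assumptions made for illustration.

```r
# Five rows from the table in Section 4.1 ('fev5' is a hypothetical name)
fev5 <- data.frame(
  ht_cm       = c(174, 181, 184, 177, 177),
  fev1_litres = c(4.3, 4.8, 4.7, 5.4, 3.1)
)

# Formula form: outcome (Y) ~ predictor (X), columns looked up in 'data'
plot(fev1_litres ~ ht_cm, data = fev5,
     xlab = "Height (cm)", ylab = "FEV1 (litres)")

# Equivalent x, y form: X is given first, then Y
plot(fev5$ht_cm, fev5$fev1_litres,
     xlab = "Height (cm)", ylab = "FEV1 (litres)")
```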

4.1.2 Correlation

The 'cor( )' function calculates correlation coefficients between every pair of variables in a data set (the columns of a matrix or data frame). For our height and lung function example, where 'fevheight' is the matrix object representing the data set:

> cor(fevheight)

                     ID        sexM      ht_cm fev1_litres
ID           1.00000000  0.02726935 -0.1624661  -0.4339991
sexM         0.02726935  1.00000000  0.1044337  -0.1196384
ht_cm       -0.16246613  0.10443368  1.0000000   0.5973320
fev1_litres -0.43399905 -0.11963840  0.5973320   1.0000000
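The matrix above includes ID, which is only a subject identifier, so its correlations carry no meaning. One way to restrict the output to the measurement variables is sketched below. It uses only the five rows listed in Section 4.1 under the hypothetical name 'fev5', so the values it prints will differ from the 20-subject results shown above.

```r
# Five rows from the table in Section 4.1 ('fev5' is a hypothetical name)
fev5 <- data.frame(
  ID          = 1:5,
  sexM        = c(1, 1, 0, 1, 1),
  ht_cm       = c(174, 181, 184, 177, 177),
  fev1_litres = c(4.3, 4.8, 4.7, 5.4, 3.1)
)

# Correlation matrix for the two measurement columns only
r_mat <- cor(fev5[, c("ht_cm", "fev1_litres")])
r_mat

# A single coefficient: pass the two vectors directly
r_xy <- cor(fev5$ht_cm, fev5$fev1_litres)
r_xy
```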

 

The 'cor.test( )' function gives more detail about the correlation coefficient between two measurement variables: it tests the null hypothesis of zero correlation (no association) and gives a confidence interval (CI) for the correlation coefficient. For our height and lung function example:

> cor.test(ht_cm,fev1_litres)

        Pearson's product-moment correlation

data:  ht_cm and fev1_litres
t = 3.16, df = 18, p-value = 0.005419
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2104363 0.8224525
sample estimates:
     cor
0.597332
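To show where the numbers in this output come from, the sketch below recomputes the t statistic, the p-value and the confidence interval from the reported correlation and sample size alone. It uses the standard formulas, the t test for a Pearson correlation and Fisher's z transformation for the interval. The variable names are assumptions made for illustration.

```r
r <- 0.597332   # correlation reported in the output above
n <- 20         # sample size
df <- n - 2     # degrees of freedom

# t statistic for testing zero correlation
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
round(t_stat, 2)                      # 3.16

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df)
p_val                                 # about 0.0054

# 95% CI via Fisher's z: transform, add margin, back-transform
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
round(ci, 4)                          # 0.2104 0.8225
```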

4.1.3 Simple regression analysis

Regression analysis is performed through the 'lm( )' function. LM stands for Linear Models, and this function can be used to perform simple regression, multiple regression, and Analysis of Variance.

For simple regression (with just one independent or predictor variable), predicting FEV1 from height:

> summary(lm(fev1_litres ~ ht_cm) )

Call:
lm(formula = fev1_litres ~ ht_cm)

Residuals:
     Min       1Q   Median       3Q      Max
-1.12043 -0.36014 -0.02043  0.32223  1.35898

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.01429    4.40863  -2.272  0.03562 *
ht_cm         0.07941    0.02513   3.160  0.00542 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6148 on 18 degrees of freedom
Multiple R-Squared: 0.3568,     Adjusted R-squared: 0.3211
F-statistic: 9.985 on 1 and 18 DF,  p-value: 0.005419

The syntax here actually calls two functions: lm( ) performs the regression analysis, and summary( ) prints selected output from the regression. The 'Estimate' column in the output gives the intercept and slope for the regression:

fev1 = -10.014 + 0.079 (ht_cm).
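As a worked illustration of the fitted equation, the sketch below plugs heights into it using the coefficients from the output above. The function name 'predict_fev1' and the example heights are assumptions made for illustration. It also checks that, in simple regression, the squared correlation coefficient equals the Multiple R-squared.

```r
# Coefficients copied from the 'Estimate' column above
b0 <- -10.01429   # intercept
b1 <-  0.07941    # slope: litres of FEV1 per cm of height

# Hypothetical helper: predicted FEV1 (litres) for a given height (cm)
predict_fev1 <- function(ht_cm) b0 + b1 * ht_cm

predict_fev1(175)                       # about 3.88 litres
predict_fev1(180) - predict_fev1(175)   # 5 * slope, about 0.40 litres

# Squared Pearson correlation matches Multiple R-squared
round(0.597332^2, 4)                    # 0.3568
```

The negative intercept is the predicted FEV1 at a height of 0 cm, which has no physical meaning; the equation is only useful over heights like those in the sample.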

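The Pr(>|t|) column in the output gives the p-value for each coefficient, testing the null hypothesis that the coefficient is zero. Here, the p-value for the slope for height is 0.00542. This agrees with the p-value from cor.test( ) in Section 4.1.2 (0.005419), because with a single predictor the test of zero slope and the test of zero correlation are equivalent.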
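The two steps, fitting and summarizing, can also be separated by storing the fitted model in an object, which makes individual results easy to extract. The sketch below uses only the five rows listed in Section 4.1, so its estimates will differ from the 20-subject output above. The object names 'fev5' and 'fit' are assumptions made for illustration.

```r
# Five rows from the table in Section 4.1 ('fev5' is a hypothetical name)
fev5 <- data.frame(
  ht_cm       = c(174, 181, 184, 177, 177),
  fev1_litres = c(4.3, 4.8, 4.7, 5.4, 3.1)
)

# Step 1: fit the model and keep the result
fit <- lm(fev1_litres ~ ht_cm, data = fev5)

# Step 2: pull out pieces of the result as needed
coef(fit)                          # intercept and slope
coef_tab <- coef(summary(fit))     # the 'Coefficients' table as a matrix
coef_tab["ht_cm", "Pr(>|t|)"]      # p-value for the slope
confint(fit)                       # 95% CIs for intercept and slope
```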

4.1.4 Spearman's nonparametric correlation coefficient

The cor.test( ) function that calculates the usual Pearson's correlation will also calculate Spearman's nonparametric correlation coefficient (rho), which is based on the ranks of the observations. With small samples and no ties, an exact p-value is calculated; otherwise an asymptotic approximation is used to calculate the p-value. In this example, Lactate and Alanine are two variables measured on a sample of n=16 subjects.

> cor.test(Lactate,Alanine,method='spearman')

        Spearman's rank correlation rho

data:  Lactate and Alanine
S = 196, p-value = 0.002663
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.7117647
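The statistic S in this output is the sum of squared differences between the ranks of the two variables. When there are no ties, rho follows from S and n by the standard formula rho = 1 - 6S / (n(n^2 - 1)). The sketch below checks this against the reported values; the no-ties assumption is ours, since the raw Lactate and Alanine data are not shown.

```r
n <- 16    # number of subjects
S <- 196   # sum of squared rank differences, from the output above

# Spearman's rho from S (valid when there are no tied ranks)
rho <- 1 - 6 * S / (n * (n^2 - 1))
rho        # 0.7117647
```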