Introduction to R Statistical Programming

Introduction


The Framingham Heart Study (FHS) began in 1948 when investigators enrolled 5,209 men and women aged 30-62 from Framingham, Massachusetts. Subjects provided baseline information on many variables, and they returned to the study office every two years for a detailed medical history, physical examination, and lab tests. The investigators also monitored the cohort carefully and recorded adverse health outcomes, focusing primarily on cardiovascular diseases. In 1971 they enrolled a second cohort consisting of the offspring of the original cohort, and in 2002 they enrolled a third cohort consisting of the grandchildren of the original cohort. The data that was collected, together with the many subsequent analyses, led to the identification of the major risk factors for cardiovascular disease: high blood pressure, high blood cholesterol, smoking, obesity, diabetes, physical inactivity, and many others.

One can think of the data collected in the Framingham study and others as a series of tables with the variables (demographic information, exposures, and outcomes) in the columns and each subject listed on a separate row. The figure below shows a portion of an Excel spreadsheet containing the first 10 subjects in a very small subset of the FHS data. There is a random identification number in the first column, followed by columns indicating each subject's sex, age in years, systolic blood pressure, and total cholesterol. There are many more variables, but we created a very small subset just to introduce you to using the R statistical package.

 Portion of a spreadsheet showing some data from the Framingham Heart Study with subjects listed in rows and variables for each listed in columns

Excel spreadsheets can be used to collect and store tables of data of this type, and Excel can also be used for a variety of simple statistical tests. For example, Excel can perform chi-squared tests, t-tests, correlations, and simple linear regression. However, it has a number of limitations. Public health researchers and practitioners frequently perform more complex analyses, such as multiple linear regression and multiple logistic regression, which Excel does not handle well, so it is useful to become familiar with more sophisticated packages like R.

[NOTE: An online learning module is available for students who would like to learn to use Excel at Using Spreadsheets - Excel & Numbers]

One key advantage to collecting or importing data into Excel is that the data can then be saved as a .CSV file, that is as a "comma-separated values" file that can easily be imported into the R statistical package. We will be using a number of data files that have been saved as .CSV files for this course.

The R statistical package is a powerful open-source program that is free. It will allow you to perform all of the necessary statistical procedures for this course, and it will likely be useful to many of you for professional projects. In addition, R will enable you to produce excellent graphics (even better than those produced with SAS).

This exercise will introduce you to using R, even if you have no prior experience. The exercise will walk you through how to install R, how to import data sets, and how to analyze your data.

Learning Objectives


After completing this section, you will be able to:
  1. Install R and RStudio on your personal computer
  2. Be familiar with some basic R programming functions:
     - Navigate the R console
     - Enter R commands
     - Bring a dataset into R
  3. Perform basic descriptive statistics
  4. Create some simple graphs

Installing R and R Studio


You will first install the base system for R and then install RStudio, which provides a much more user-friendly interface.

Install the base system for R

In your browser go to https://cran.r-project.org/

Select Download R for Mac or for Windows

 Instructions for downloading and installing R

Install RStudio

In your browser go to https://www.rstudio.com/products/rstudio/

Instructions for installing R Studio  

Once RStudio is installed, it will have several windows as shown in the image below.

Arrangement of windows in R Studio

The modules for this course have many examples on how to use R for specific tasks, and you will be using these from week to week. The examples embedded into each week's learning modules will provide you with all of the necessary instructions for using R in this course, and it shouldn't be necessary to seek other instruction. However, if you wish to learn more about R, here are links to additional resources.

Learn more about R

Data Sets

All data sets used in PH717 are comma separated value (.csv) files. For example, in this session we will be importing two data set files: framstudy.csv and Weymouth_Adult_Part.csv.

.csv files are like spreadsheets, but are saved in a simpler format that makes it possible for R to read them. If we want to create our own dataset, we could enter the data into an Excel spreadsheet and use "Save as..." to save it in .csv format.
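To make the format concrete, here is a minimal sketch that builds a tiny table in R, saves it as a .csv file, prints the raw text of the file, and reads it back. The data frame toy, its column names, and its values are made up for illustration (they only mimic the layout of the FHS subset), and the file is written to a temporary location rather than to a course folder.

```r
# Hypothetical example data: three made-up subjects (not real FHS records)
toy <- data.frame(
  RANDID = c(101, 102, 103),
  SEX    = c(1, 2, 2),
  AGE    = c(39, 52, 61)
)

# Save the table as a comma-separated values file in a temporary location
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Show the raw text: one header line, then one line per subject
csv_lines <- readLines(csv_path)
cat(csv_lines, sep = "\n")

# Read the file back into R as a data frame
toy_again <- read.csv(csv_path)
print(toy_again)
```

The raw text is nothing more than the column titles on the first line and the values for one subject on each following line, separated by commas.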

Simple Math Functions and Commands in R


For the exercise below:

Try entering the following commands in the R Console at the lower left window. You do not have to enter the comments (the lines beginning with #, shown in green). Just enter the commands (shown in blue after the > prompt; do not type the > itself) and hit "Enter". You should see the result written in black after the [1].

# Addition
> 7+3
[1] 10

# Subtraction
> 7-3
[1] 4

# Multiplication
> 8*7
[1] 56

# Division
> 100/50
[1] 2

# Square root
> sqrt(81)
[1] 9

# Exponents
> 9^2
[1] 81
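Results do not have to be read off the screen each time. As a small sketch, the assignment arrow (<-) stores a result under a name so that it can be reused in later commands. The names x, y, and total below are made up for illustration.

```r
# Store the result of a calculation under a name
x <- 7 + 3
y <- sqrt(81)

# Reuse the stored values in a later calculation
total <- x + y
print(total)

# Assigning to an existing name replaces its old value
x <- 100
print(x)
```

In a script, these lines are typed exactly as shown, without a leading > character.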

Practice Exercise

First, download the framstudy.csv data set onto your local computer. Then import the data set into R. There are two ways to do this.

The Easy Way

The easiest way is to click on the "Import Dataset" button in the Environment tab of the upper right RStudio window, then browse your computer and follow the instructions. If the dialog asks whether the data set has a header, say "yes", since this data set and the others for this course have headers (i.e., titles) for their columns.

Once the file has been imported, we usually give it a short nickname to cut down on typing:

> fram <- framstudy

"fram" is a short name or a nickname that I made up for the data set to reduce typing during programming. The "less than" character followed by a dash (<- ) looks a bit like an arrow and functions as an assigner to tell R that we are assigning a name to a data set we are importing.

The Traditional Way (harder)

You can also use the traditional R command below as an alternative.

> fram <- read.csv(file.choose())

Then click "Run" or hit the "Enter" key.

Here file.choose() opens a dialog for browsing to the file, and read.csv() reads the chosen .csv file into R. Each is a function, i.e., a set of hard-coded instructions that tells R how to complete a task.

Once the data set has been imported, you should "attach" the data so that R defaults to performing actions on this particular data set.

> attach(fram)

If you finish with an attached data set and want to work with another data set, you should detach the first one [ > detach(fram)] and then attach the new one.
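Attaching is a convenience rather than a requirement. As a hedged alternative sketch, a column can be named explicitly with the $ operator, or a command can be wrapped in with(), so that it is always clear which data set a variable comes from. The small data frame toy below is made up for illustration and stands in for an imported data set.

```r
# Made-up stand-in for an imported data set (not real FHS records)
toy <- data.frame(SEX = c(1, 2, 2, 1), AGE = c(39, 52, 61, 45))

# Refer to a column explicitly with the $ operator
mean_age_dollar <- mean(toy$AGE)

# Or evaluate a command "inside" the data set with with()
mean_age_with <- with(toy, mean(AGE))

print(mean_age_dollar)
print(mean_age_with)
```

Both commands give the same answer as mean(AGE) would after attach(toy). The explicit forms are mainly useful when more than one data set is open at the same time.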

View the Data Set in R

> View(fram) [Then hit the "Enter" key or click on the "Run" tab to execute the command]

View() is a function that tells R to open a new window so that we can look at the data set. To view the data you can also just click on the data set name in the upper right window.

Once you have imported the data set, it will be listed in the "Environment" tab in the window at the upper right. If you have a large data set, it is better to just click on the small blue arrow next to the data set name in the Environment section to view it.
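For a quick check that does not open a new window, a few standard functions print a compact description of a data set in the console. The sketch below uses a made-up data frame called toy in place of the imported data. With the course data you would put fram inside the parentheses instead.

```r
# Made-up stand-in for an imported data set (not real FHS records)
toy <- data.frame(
  RANDID = 101:106,
  SEX    = c(1, 2, 2, 1, 2, 1),
  AGE    = c(39, 52, 61, 45, 58, 65)
)

print(dim(toy))      # number of rows, then number of columns
print(head(toy, 3))  # first three rows only
str(toy)             # name and type of every column
```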

Writing Code in R - Important Notes

You can execute code one step at a time in the console (lower left window), and this is useful for quick one-step math calculations, but it is usually more convenient to list all of your coding statements in a script in the window at the upper left and then save the script for future use. When you enter code into the script window at the upper left, do not begin the line with a > character. You can also execute the code from the script editing window by clicking on the "Run" tab at the top of the editing window. To save a script, click on the "File" tab at the upper left and select either Save as or Save.
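A saved script is just a plain text file of commands, so it can also be run all at once from the console. The sketch below illustrates this with source(), a base R function. The file name and the two-line script are made up for the example and are written to a temporary location.

```r
# Write a short script to a temporary file (a script is just plain text)
script_path <- tempfile(fileext = ".R")
writeLines(c("# A comment line", "answer <- 7 * 3", "print(answer)"), script_path)

# Run every line of the saved script, echoing each command as it runs
source(script_path, echo = TRUE)
```

For this course, clicking "Run" in the editor is all that is needed. source() is only a convenience for rerunning a finished script in one step.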

To start a new script, click on the File tab, then on New File, then on R Script.

Beginning a new script in R

Then type in your coding statements.

Here is a short sample script:

# (A comment) Give the imported framstudy data set the short name "fram"
########################

fram <- framstudy
# Next attach the file to tell R it is the "go to" file (the default)
attach(fram)
View(fram)

# Determine the mean, median, and distribution of continuous variables
quantile(AGE)
summary(AGE)

# Determine the number and proportion of males and females
table(SEX)
prop.table(table(SEX))

# Make a mistake in naming a variable (case)
summary(Age)

Enter the script above into the R editor and save it as follows:

Saving a script in R  

Then execute the script by hitting the "Run" tab repeatedly in the editor. When you get to the "View(fram)" step, the editor will show the data file, but you can return to the script by clicking on the tab for the script at the top of the window.

Enter code into a script

 

The output from the executed script will appear in the Console window at the lower left, and it should look like this:

> fram <- framstudy
> attach(fram)
> View(fram)
> quantile(AGE)
  0%  25%  50%  75% 100%
  39   45   52   59   65
> summary(AGE)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  39.00   45.00   52.00   52.41   59.00   65.00
> table(SEX)
SEX
 1  2
19 30
> prop.table(table(SEX))
SEX
        1         2
0.3877551 0.6122449
> summary(Age)
Error in summary(Age) : object 'Age' not found

Note that there was an error message because we gave the command "summary(Age)", but the variable for age is all upper case (AGE), so R did not find it.
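R treats upper and lower case letters as different characters, so AGE, Age, and age are three different names. When a "not found" error appears, one simple check is to list the exact column names. The sketch below does this on a made-up data frame called toy. With the course data you would use names(fram).

```r
# Made-up stand-in for the imported data set (not real FHS records)
toy <- data.frame(SEX = c(1, 2, 2), AGE = c(39, 52, 61))

# List the exact column names, including their case
print(names(toy))

# Check whether a name exists before using it
has_upper <- "AGE" %in% names(toy)
has_mixed <- "Age" %in% names(toy)
print(c(AGE = has_upper, Age = has_mixed))
```

Only the all-upper-case spelling matches a column, which is why summary(AGE) works and summary(Age) does not.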

 

Interdisciplinary Case Study for the Core Curriculum


We are in the process of creating a series of integrated case studies to be used in all of the core courses to demonstrate the contributions each core course can make to addressing public health problems.

One interdisciplinary case study that we are developing focuses on efforts to improve health in Weymouth, MA (based on a real case).

The working draft of the case can be accessed at:

http://sphweb.bumc.bu.edu/otlt/MPH-Modules/Weymouth/

The online module begins by giving a brief history of the development of Weymouth from its colonial origins to the present day.

In 2003-2004 Weymouth conducted a town-wide health survey, the results of which raised concerns about the general health of the residents. The data set below is a subset of the actual data that was collected and analyzed by JSI.

Import the data set Weymouth_Adult_Part.csv.

Begin a new script in the RStudio editor with the following code:

### Part 1 ###
# The first command below creates a data frame object called 'wey'.
# The "na.omit" part asks R to omit rows with missing data points, i.e., those for which data is "not available."
wey <- na.omit(Weymouth_Adult_Part)

# Next, attach the data set
attach(wey)

# Create a derived variable for the respondent's age in 2002 (when the data was collected) based on their reported birth year.
age <- 2002 - birth_yr

# Compute body mass index (BMI) as shown below.
bmi <- weight / (hgt_inch)^2 * 703

# Compute the mean and standard deviation of bmi and quantiles
mean(bmi)
# [1] 26.03121
sd(bmi)
# [1] 5.393322
quantile(bmi)
#       0%      25%      50%      75%     100%
# 10.96235 23.23926 25.84559 29.69308 54.54867

summary(bmi)
#  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# 10.96   23.26   25.85   26.94   29.70   54.55
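To see what na.omit() and the two derived variables do, it can help to try them on a few rows that are small enough to check by hand. The sketch below uses made-up values (not the Weymouth data) and assumes that weight is recorded in pounds and hgt_inch in inches, which is what the factor 703 in the BMI formula implies. Because ^ is evaluated before / and *, the expression divides weight by the square of height and then multiplies by 703.

```r
# Made-up survey rows; NA marks a missing answer (not real Weymouth data)
toy <- data.frame(
  birth_yr = c(1950, 1972, NA, 1965),
  weight   = c(150, 180, 140, NA),  # assumed to be pounds
  hgt_inch = c(65, 70, 62, 68)      # inches
)

# na.omit() drops every row that has at least one missing value
toy_complete <- na.omit(toy)
print(nrow(toy))           # rows before
print(nrow(toy_complete))  # rows after

# Derived variables, computed with the same formulas as in the script above
age <- 2002 - toy_complete$birth_yr
bmi <- toy_complete$weight / toy_complete$hgt_inch^2 * 703

print(age)
print(round(bmi, 2))
```

Two of the four made-up rows contain an NA, so only two rows remain and only two ages and two BMI values are computed.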


The next page continues the exercise with code that produces histograms and boxplots.

Graphs


# Make a histogram of the age distribution
hist(age)

# Make a histogram and a boxplot of bmi
hist(bmi)


boxplot(bmi)

# Next, modify the boxplot to a horizontal orientation
boxplot(bmi, horizontal = TRUE)
boxplot(age, horizontal = TRUE)


# Modify the BMI histogram with the code below
hist(bmi, main="BMI Distribution in Weymouth Adults",
xlab = "BMI", border="blue", col="green", xlim=c(10,50),
las=1, breaks=8)
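In the hist() call above, main sets the title, xlab sets the x-axis label, border and col set the outline and fill colors of the bars, xlim sets the range of the x-axis, and las=1 draws the axis labels horizontally. The breaks=8 argument is only a suggestion: R picks "pretty" bin edges close to that number. The sketch below, which uses simulated values rather than the Weymouth data, shows one way to inspect the bins R actually chose and the five numbers that a boxplot draws.

```r
# Simulated BMI-like values (made up; not the Weymouth data)
set.seed(1)
bmi_sim <- rnorm(200, mean = 26, sd = 5)

# plot = FALSE returns the histogram's bins and counts without drawing it
h <- hist(bmi_sim, breaks = 8, plot = FALSE)
print(h$breaks)  # bin edges R actually chose
print(h$counts)  # number of values in each bin

# boxplot.stats() returns the five numbers a boxplot draws:
# lower whisker, lower hinge, median, upper hinge, upper whisker
b <- boxplot.stats(bmi_sim)
print(b$stats)
```

Removing plot = FALSE draws the histogram as usual.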

Once you have everything working, save your script for future reference. You will be using these coding functions throughout the course.

Finally, go back to page 8 of the online case on Weymouth and read pages 1-4 of the Executive Summary in the JSI report based on the survey data (http://sphweb.bumc.bu.edu/otlt/MPH-Modules/PH/SWNAS/Weymouth%20Health%20Survey%20Report.pdf).

Also scan the news article from the Boston Globe embedded on page 8 of the Weymouth case module. You can also access it from

http://archive.boston.com/yourlife/health/children/articles/2003/11/16/health_safety_survey_raises_concerns/