Getting Started with "R"
Data Basics


Contributing Authors:

Ching-Ti Liu, PhD, Associate Professor, Biostatistics

Jacqueline Milton, PhD, Clinical Assistant Professor, Biostatistics

Avery McIntosh, doctoral candidate

  

Why Learn R?


  1. It's free!
  2. It is a new, cutting-edge tool for biostatistics.
  3. Many authors are using R and reporting their results in the context of R.
  4. Online support and resources are readily available for R.
  5. It is highly extensible; it provides:

- Interactive Web Search

- Use in Unix environments for scientific computing: BU Secure Computing Cluster (SCC)

- Ability to be used in conjunction with Python, SAS, Excel, JAGS, OpenBUGS, and others

This introductory module will provide:

  1. An Introductory R Session
  2. R as a Calculator
  3. Import, export and manipulate datasets
  4. Getting Help and Loading Packages

Learning Objectives


By the end of this session students will be able to:

  1. Identify the components of the R interface for Windows
  2. Conduct standard arithmetic calculations, both numerical and matrix
  3. Import, export and manipulate datasets
  4. Access R help and load packages in R

 

Downloading, Installing and Running R


An up-to-date version of R may be downloaded from CRAN (the Comprehensive R Archive Network): http://cran.r-project.org/. On that site, click on "Download R for Windows" or "Download R for Mac" and follow the installation instructions provided there. The instructions in this module assume that you are using a Windows system.

Once you have installed R, you will see a blue R icon on your desktop. To run R, just click on this icon. If you don't see it on your desktop, go to Start and click on the R icon under Programs. The R console will then open with a startup message, followed by a line that begins with the command prompt symbol (>) in the left-hand margin. This prompt invites you to type in your commands or expressions. For example,

 Enter:

> 2+2

After you hit the "Enter" key, you should get the following response:

[1] 4

 

To quit the R program, use the command:

> q()

 

Also take note of the Stop Sign button in the toolbar, which interrupts a computation that is still running and will come in handy at times.

 

Note also that we usually don't save the workspace when quitting; instead, we save our code in a script, which is stored as a .R file.

RStudio

There is an alternative interface to R that some people prefer, called RStudio. It is an integrated development environment that runs R inside a more graphical user interface (GUI) than the standard R console. It looks nice, and you can make very nice documents with it:

 

We won't be using this program directly, but feel free to use it on your own for homework. If you have questions about it, let me know.

Introduction to R Programming: Download, Install and Setup R & RStudio (R Tutorial 1.0) MarinStatsLectures

Customizing The Look of R Studio (R Tutorial 1.11) MarinStatsLectures

 

R as a Calculator


The simplest way to use R is as a calculator. For example, if you want to know what two times two is, you may type

 

> 2*2;

[1] 4

The notation [1] indicates that the result is the first element. This is useful later when we have many elements for one variable such as a vector. If you carefully compare this to the previous command, you will notice that we added a semi-colon (;) here. However, the result remains the same since the semicolon is used as the separator for multiple commands. Sometimes, you may want to use built-in functions in your calculations. Here are some examples:

Natural logs

> log(10)

[1] 2.302585

Note that this is returning the "natural log" of 10, which uses the base "e," which is a constant with an approximate value of 2.71828.

Log using base 10

> log10(10)

[1] 1

Exponential function (e raised to a power)

> exp(2)

[1] 7.389056

Square Root

> sqrt(4)

[1] 2

Absolute Value

> abs(-4)

[1] 4
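As a brief sketch, the built-in functions above can be combined with a few related ones. The base argument of log() and the functions log2() and round() do not appear elsewhere in this module; they are standard base R functions and are shown here only for illustration.

```r
# exp() is the inverse of the natural log, so this recovers 10
# (up to floating-point rounding error)
recovered <- exp(log(10))

# log() accepts an optional base argument; this matches log10(100)
log_base10 <- log(100, base = 10)

# log2() is the base-2 logarithm
log_base2 <- log2(8)

# Function calls can be nested: absolute value first, then square root
root <- sqrt(abs(-16))

# round() limits the number of decimal places
rounded <- round(exp(2), 2)

print(c(recovered, log_base10, log_base2, root, rounded))
```

Storing each result under a name, rather than only printing it, makes it easy to reuse the value in a later calculation.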

Type in the following commands and describe your observations.

 

> # case 1 : 1+2

> 2+2; 2*3; 2/5

> # case 2

> 8/2-2*(2-3)

> # case 3:

> 3*5  * 4   /2

When you observe the results, remember the rules for the order of operations: "PEMDAS."

 

PEMDAS (From http://www.mathsisfun.com/operation-order-pemdas.html)

Order of Operations

Do things in Parentheses first. Examples:

yes   6 × (5 + 3) = 6 × 8 = 48

no   6 × (5 + 3) = 30 + 3 = 33 (wrong)

Exponents (Powers, Roots) before Multiply, Divide, Add or Subtract.

yes   5 × 2² = 5 × 4 = 20

no   5 × 2² = 10² = 100 (wrong)

Multiply or Divide before you Add or Subtract.

yes   2 + 5 × 3 = 2 + 15 = 17

no   2 + 5 × 3 = 7 × 3 = 21 (wrong)

Otherwise just go left to right.

yes   30 ÷ 5 × 3 = 6 × 3 = 18

no   30 ÷ 5 × 3 = 30 ÷ 15 = 2 (wrong)

How Do I Remember It All ... ? PEMDAS !

P    Parentheses first

E    Exponents (i.e., Powers and Square Roots, etc.)

MD   Multiplication and Division (left-to-right)

AS   Addition and Subtraction (left-to-right)
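The PEMDAS examples can be checked directly in R. The sketch below repeats them with R's operators (* for multiplication, / for division, ^ for powers); the variable names are made up for illustration. The last two lines show an R-specific detail: ^ is applied before a leading minus sign.

```r
# Parentheses are evaluated first
case_paren <- 6 * (5 + 3)      # 6 * 8

# Exponents come before multiplication
case_power <- 5 * 2^2          # 5 * 4

# Multiplication and division come before addition and subtraction
case_muldiv <- 2 + 5 * 3       # 2 + 15

# Operators of equal precedence are evaluated left to right
case_left <- 30 / 5 * 3        # 6 * 3

# In R, ^ binds more tightly than a leading minus sign
case_minus <- -2^2             # read as -(2^2)
case_minus_paren <- (-2)^2     # parentheses force the minus to apply first

print(c(case_paren, case_power, case_muldiv, case_left,
        case_minus, case_minus_paren))
```

When in doubt about how R will read an expression, adding parentheses costs nothing and makes the intent explicit.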

The video below provides a nice introduction to R, including a review of mathematical operations that can be performed using R as a calculator.

Getting Started With R (R Tutorial 1.1) MarinStatsLectures

 

The Assignment Function


When performing calculations, we may want to save intermediate results for later use. This can be achieved by assigning values to named objects (variables) with the assign() function. Once you assign a value to a name, the object stays in working memory until you remove it or close the program. To see what objects are in working memory, type ls(), or select the Show Workspace command from the dropdown menu.

So to create a scalar constant x with value of 2, we type

> assign("x", 2)

 

This can also be simplified by using the operator <-

[Note that the symbol <- is made up from "less than" and "minus" with NO space between them.]

For example,

> x <- 2

> x

[1] 2

 

Assignment can also be done in the opposite direction using the symbol ->. For example,

> 3 -> y

The arrow in the assignment symbol always points to the name that receives the value.

 

We can also assign multiple objects the same value, as follows:

> x <- y <- 2

> x

[1] 2

> y

[1] 2
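The following sketch puts the three forms of assignment side by side and then inspects the workspace. The object names a, b and d are arbitrary, and exists() is a base R function not otherwise used in this module.

```r
# Three equivalent ways to store the value 2 in an object
assign("a", 2)
b <- 2
2 -> d

# All three objects now hold the same value
same <- identical(a, b) && identical(b, d)

# ls() returns the names of the objects in working memory
present <- c("a", "b", "d") %in% ls()

# rm() removes a single object by name
rm(d)
d_exists <- exists("d", inherits = FALSE)

print(same)
print(present)
print(d_exists)
```

In practice the <- form is by far the most common; assign() is mainly useful when the name of the object is itself stored in a character string.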

Other commonly used operators are:

1.     arithmetic           + - * / ^

2.     relational           > >= < <=

                            ==   (equals)

                            !=   (not equal)

3.     logical              !    (NOT)

                            &    (AND)

                            |    (OR)

4.     assignment           <- ->

5.     create a sequence    :

A complete listing can be found here: http://stuff.mit.edu/afs/sipb/project/r-project/arch/i386_rhel3/lib/R/library/graphics/html/plotmath.html.

Vectors

Vectors are variables with one or more values of the same type, e.g., numerical, logical or character values. For example, a numeric vector might consist of the numbers (1.2, 2.3, 0.2, 1.1). A vector can also hold just a single value.

Concatenation

Vectors with multiple values can be created using concatenation, as indicated by the function c(). The c stands for concatenate. The values themselves are placed inside round parentheses, i.e., ( ), not square [ ] or curly { } brackets.

 

To create a vector named x consisting of four numbers, namely 1.2, 2.3, 0.2 and 1.1, we can use the R command

> x <- c(1.2, 2.3, 0.2, 1.1)

> x

[1] 1.2 2.3 0.2 1.1

 

Also, we can use the function length() to find out how many elements the vector has.

> length(x)

[1] 4

 

If we want to select only some elements in the vector, then we can use indices. For example, if we want to know what the last three elements of the vector x are, then we can type

> x[c(2,3,4)]

[1] 2.3 0.2 1.1

What do you see on the screen if you type the following commands?

> x[-1]

> x[2:4]

Based on your observation, what does the negative sign do within the index?

Also, what does the colon operator (:) do? (This one will come in handy when we get to loops!)
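To compare with your own observations, here is a small sketch of the same indexing forms applied to a different, made-up vector v.

```r
v <- c(10, 20, 30, 40, 50)

# The colon operator builds a sequence of consecutive integers
idx <- 2:4

# Positive indices keep the listed positions
kept <- v[idx]

# A negative index drops the listed position and keeps the rest
dropped_first <- v[-1]

# Several positions can be dropped at once
dropped_ends <- v[-c(1, 5)]

print(idx)
print(kept)
print(dropped_first)
print(dropped_ends)
```

Note that positive and negative indices cannot be mixed in the same pair of square brackets.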

Selecting "Clear console" from the Edit dropdown menu clears only the text displayed in the console; the objects you created remain in the workspace. To remove all objects from the workspace, type:

> rm(list=ls())

Logical Vectors and Logical Operators

R allows us to create logical vectors and to manipulate logical quantities as well. To create logical vectors, you may use TRUE, FALSE, or NA (for missing / not available) directly, or type in the condition/logic operation. Note that in order to be used in arithmetic calculations, R treats TRUE as 1 and FALSE as 0.

Let's look at some examples to see how these operators work.

> 1>=3

[1] FALSE

 

> !(1>3)

[1] TRUE

 

> (3 != 1) & (2 >= 1.9)

[1] TRUE

 

> y <- c(TRUE, FALSE, 5 > 2)

> y

[1]  TRUE FALSE  TRUE

 

> sum(y)

[1] 2
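Because TRUE counts as 1 and FALSE as 0, sum() and mean() applied to a logical vector give the number and the proportion of TRUE values. The sketch below uses made-up temperature readings; the na.rm argument is a standard option of sum() that is not discussed elsewhere in this module.

```r
temps <- c(67, 72, 74, 62, 56)   # made-up readings for illustration

# A comparison returns one TRUE/FALSE per element
warm <- temps > 65

# Number and proportion of readings above 65
n_warm <- sum(warm)
prop_warm <- mean(warm)

# A missing value (NA) makes the total unknown unless it is removed
flags <- c(TRUE, NA, FALSE, TRUE)
n_unknown <- sum(flags)
n_true <- sum(flags, na.rm = TRUE)

print(n_warm)
print(prop_warm)
print(n_unknown)
print(n_true)
```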

 

If you type the following commands into R, what will x, y, z and w be?

 

> x <- !(5>=3)

> y <- ((2^4) > (2*3))

> z<- x|y

> w <- x&y

 


 

Vector Arithmetic


Let's go back to a vector from the previous page.

> x <- c(1.2, 2.3, 0.2, 1.1)

This vector consists of four numbers. In some circumstances, we may want to apply certain operations or calculations to each element in the vector. For example, suppose we wanted to use the vector x to create a new vector "y" with elements that are 2 times each element of x plus 3. One could do this element by element with the following command:

> y <- c(2*x[1]+3, 2*x[2]+3, 2*x[3]+3, 2*x[4]+3)

which gives y the values

[1] 5.4 7.6 3.4 5.2

but a simpler way to do this is to use the command:

> y <- 2*x+3

Then if we enter

> y

We get back:

[1] 5.4 7.6 3.4 5.2

 

Logical operators can also be used to modify or select subsets of a data set. For example, in the previous example, we saw that x[c(2,3,4)], x[-1] and x[2:4] work exactly the same and select the last three elements of the vector x.

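You can also use a logical vector as the index:

> x[c(FALSE, TRUE, TRUE)]

This instructs R to skip the first element and then select the next two. Because the index has only three values while x has four, R recycles the index from its beginning, so the fourth element is matched with FALSE and is also skipped. The command therefore returns the following: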

[1] 2.3 0.2

 

If we wanted to select the elements with values greater than 1, we could use the command: 

> x[x>1]

[1] 1.2 2.3 1.1
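The selection by condition works in two steps, which the following sketch makes explicit. The function which() and the names above_one, positions, between and y are introduced here only for illustration.

```r
x <- c(1.2, 2.3, 0.2, 1.1)

# Step 1: the comparison returns one TRUE/FALSE per element of x
above_one <- x > 1

# Step 2: the logical vector is used as an index
selected <- x[above_one]

# which() reports the positions of the TRUE values
positions <- which(above_one)

# Conditions can be combined with & (AND) and | (OR)
between <- x[x > 1 & x < 2]

# A logical index can also be used to modify elements (here, in a copy)
y <- x
y[y < 1] <- 0

print(above_one)
print(selected)
print(positions)
print(between)
print(y)
```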

 

In summary, R can perform functions over entire vectors and can be used to select certain elements within a vector. In addition to the elementary arithmetic operations, R can also use the vector functions listed below.

Create the vector x as follows:

> x <- c(1.2, 2.3, 0.2, 1.1)

 

Then evaluate each of the vector functions listed above.

 

Matrices


The vectors that have been discussed previously in this module were one-dimensional, i.e., they consisted of a simple series of elements that you could imagine being organized in a single row or in a single column. Matrices are multi-dimensional vectors. A two-dimensional matrix might be envisioned as a table with columns and rows, and a three-dimensional one (which R calls an array) might be envisioned as a series of tables stacked on top of one another.

A matrix can be created in a number of ways. For example,

> x <- matrix(c(1,2,3,4,5,6), ncol=2)

> x

     [,1] [,2]

[1,]    1    4

[2,]    2    5

[3,]    3    6

where the parameter ncol is the number of columns in this matrix, and the numbers are entered by default column-wise. There are two indices, row and column. We can use the commands below to extract portions of the matrix:

> x[3,2]

[1] 6

> x[3,]

[1] 3 6

The first command returns the element in the 3rd row and 2nd column of matrix x. If we leave one of the two indices blank, as in the second command, R returns all the elements along that dimension, here the entire 3rd row.

> p <- matrix(c(1,2,3,4,5,6), nrow=2)

> p

     [,1] [,2] [,3]

[1,]    1    3    5

[2,]    2    4    6

Other commonly used approaches to create a matrix are cbind() and rbind(), which are counterparts of the c() function. The cbind() function combines the variables so that its output contains the original variables as columns, while rbind() combines the variables so that its output contains the original variables as rows.

For example, the following commands will create the same matrix as x.

> cbind(c(1,2,3), c(4:6))

> rbind(c(1,4), c(2,5), c(3,6))
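As a sketch, the different ways of building the same matrix can be compared directly, and a few basic matrix operations tried on the result. The byrow argument of matrix(), and the functions dim() and t() and the operator %*%, are standard base R features that are not introduced elsewhere in this module; the object names are arbitrary.

```r
# Default: values fill the matrix column by column
by_col <- matrix(1:6, ncol = 2)

# byrow = TRUE fills row by row instead
by_row <- matrix(c(1, 4, 2, 5, 3, 6), ncol = 2, byrow = TRUE)

# cbind() glues vectors together as columns
from_cbind <- cbind(1:3, 4:6)

# All three hold the same values in the same positions
same_values <- all(by_col == by_row) && all(by_col == from_cbind)

dims <- dim(by_col)              # number of rows, number of columns
flipped <- t(by_col)             # transpose: rows become columns
doubled <- 2 * by_col            # arithmetic is applied element by element
cross <- t(by_col) %*% by_col    # matrix multiplication

print(same_values)
print(dims)
print(cross)
```

Note the difference between * (element-by-element multiplication) and %*% (matrix multiplication).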

Suppose we have two vectors, x and y, defined as follows.

 

> x <- c(-3:3)

> y <- c(2, 5, -6, 3, -2, 10, -4)

 

  1. Create a matrix, say z, composed of x as the first column and y as the second column.
  2. What's the mean of the first row of matrix z?


Logic Statements (TRUE/FALSE) and cbind and rbind Command in R (R Tutorial 1.7) MarinStatsLectures

Types of Data

R has data "types," which are sometimes manipulated differently from each other under certain operations.

The data types are:

There are also lists and arrays, but these will not be covered here.

The str() Command

A useful command is the str() command, which tells you what data type is in a given object, as shown in these two examples:

Example 1

 

> k <- 3

> k


[1] 3


> str(k)

  num 3

Example 2

 

> w <- "Homer"

> w


[1] "Homer"


> str(w)

   chr "Homer"
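A related function is class(), which returns the type as a character string instead of printing a summary. The sketch below is only illustrative (class() and the object names are not part of the examples above); it also shows what happens when values of different types are combined in one vector.

```r
# class() reports the type of an object as a character string
num_type <- class(3)
int_type <- class(3L)        # the L suffix marks an integer
chr_type <- class("Homer")
lgl_type <- class(5 > 2)

# A vector holds values of a single type, so mixed input is converted
# to the most general type present (here, character)
mixed <- c(1, "a", TRUE)
mixed_type <- class(mixed)

print(c(num_type, int_type, chr_type, lgl_type))
str(mixed)
```

This conversion is why a column of numbers that contains a stray text entry ends up being stored as character.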

 

The video below provides a nice overview of vectors and matrices and the operations that can be performed on them.

Creating Vectors, Matrices, and Other Intro Topics (R Tutorial 1.2) MarinStatsLectures

 

Matrices vs. Data Frames

Matrices and data frames are two ways to structure 2-dimensional information. They are different in a few ways. Generally, use data frames if the variable types are not all numeric.

> matrix.1<- matrix(1:16,4,4)

> matrix.1

     [,1] [,2] [,3] [,4]

[1,]    1    5    9   13

[2,]    2    6   10   14

[3,]    3    7   11   15

[4,]    4    8   12   16

> str(matrix.1) #str() asks for the structure of the object

 int [1:4, 1:4] 1 2 3 4 5 6 7 8 9 10 ...

 

> is.matrix(matrix.1) # The command form is.type(object) is basically asking if the object matrix.1 is a matrix.

[1] TRUE

> is.data.frame(matrix.1)

[1] FALSE

 

> data.1 <- as.data.frame(matrix.1)

> data.1

   V1 V2 V3 V4

1  1  5  9 13

2  2  6 10 14

3  3  7 11 15

4  4  8 12 16

> str(data.1)

'data.frame':  4 obs. of  4 variables:

 $ V1: int  1 2 3 4

 $ V2: int  5 6 7 8

 $ V3: int  9 10 11 12

 $ V4: int  13 14 15 16

> object.size(matrix.1)

264 bytes

> object.size(data.1)

1048 bytes

The object.size commands indicate how much memory the two types of data structure take up in the computer; note that the matrix takes up much less memory than the data frame.
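The advice to use data frames when the variables are not all numeric can be illustrated with a small sketch. The ids and names below are made up, and stringsAsFactors = FALSE is set explicitly because its default differs between R versions.

```r
ids <- c(1, 2, 3)
people <- c("Ann", "Bob", "Cy")   # made-up names for illustration

# A matrix can hold only one type, so the numbers are converted to text
as_matrix <- cbind(ids, people)
matrix_type <- typeof(as_matrix)

# A data frame keeps a separate type for each column
as_frame <- data.frame(id = ids, name = people, stringsAsFactors = FALSE)
frame_types <- sapply(as_frame, class)

# Numeric columns of the data frame can still be used in calculations
id_mean <- mean(as_frame$id)

print(matrix_type)
print(frame_types)
print(id_mean)
```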

 

Reading and Writing Data to and from R


Reading files into R

Usually we will be using data already in a file that we need to read into R in order to work on it. R can read data from a variety of file formats—for example, files created as text, or in Excel, SPSS or Stata. We will mainly be reading files in text format .txt or .csv (comma-separated, usually created in Excel).

To read an entire data frame directly, the external file will normally have a special form.

Here we use example datasets stored in the files airquality.csv and airquality.txt.

  

Input file form with names and row labels:

  Ozone Solar.R Wind Temp Month Day

1    41     190  7.4   67     5   1

2    36     118  8.0   72     5   2

3    12     149 12.6   74     5   3

4    18     313 11.5   62     5   4

5    NA      NA 14.3   56     5   5

   ...

By default numeric items (except row labels) are read as numeric variables. This can be changed if necessary.

The function read.table() can then be used to read the data frame directly:

> airqual <- read.table("C:/Desktop/airquality.txt")

 

Similarly, to read .csv files, the read.csv() function can be used to read in the data frame directly.

[Note: On Windows, write file paths with forward slashes (/) or doubled backslashes (\\). A single backslash is treated as an escape character inside an R string, so a path copied from Windows Explorer will not work as-is.]

> airqual <- read.csv("C:/Desktop/airquality.csv")

In addition, you can read in files using the file.choose() function in R, e.g., read.csv(file.choose()). After entering this command, a dialog box opens in which you can manually select the directory and file where your dataset is located.

  1. Read the airquality.csv file into R using the read.csv() command.
  2. Read the airquality.txt file into R using the file.choose() command.
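If the course files are not at hand, the two reading functions can still be tried out. The sketch below assumes that the built-in airquality dataset is an acceptable stand-in for airquality.csv and airquality.txt: it first writes the first six rows to temporary files (tempfile() and the path names are illustrative choices) and then reads them back.

```r
# Create small stand-in files from the built-in airquality dataset
csv_path <- tempfile(fileext = ".csv")
txt_path <- tempfile(fileext = ".txt")
write.csv(head(airquality), csv_path, row.names = FALSE)
write.table(head(airquality), txt_path)   # includes row labels

# read.csv() expects comma-separated values with a header row
from_csv <- read.csv(csv_path)

# read.table() expects whitespace-separated values; because the first line
# has one fewer field than the others, it is treated as the header and the
# first column as row labels
from_txt <- read.table(txt_path)

str(from_csv)
print(from_txt)
```

In both cases the missing values written as NA in the file are read back as missing values, not as text.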

Occasionally, you will need to read in data that does not already have column name information.  For example, the dataset BOD.txt looks like this:

1    8.3

2   10.3

3   19.0

4   16.0

5   15.6

7   19.8

Initially, there are no column names associated with the dataset. We can use the colnames() command to assign column names to the dataset. Suppose that we want to assign the column names "Time" and "demand" to the BOD.txt dataset. To do so, we do the following:

> bod <- read.table("BOD.txt", header=F)

> colnames(bod) <- c("Time","demand")

> colnames(bod)

[1] "Time"   "demand"

The first command reads in the dataset; the argument header=F specifies that the file contains no column names. The second command assigns the column names, and the third displays them.
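As an alternative sketch, the column names can be supplied while reading the file, through the col.names argument of read.table(). The example recreates the BOD values shown above in a temporary file, so the path bod_path is only a stand-in for BOD.txt.

```r
# Recreate the BOD values shown above in a temporary file
bod_path <- tempfile(fileext = ".txt")
writeLines(c("1 8.3", "2 10.3", "3 19.0", "4 16.0", "5 15.6", "7 19.8"),
           bod_path)

# Without names, R labels the columns V1, V2, ...
bod_default <- read.table(bod_path, header = FALSE)
default_names <- colnames(bod_default)

# col.names assigns the names in the same step as reading
bod <- read.table(bod_path, header = FALSE, col.names = c("Time", "demand"))

print(default_names)
print(colnames(bod))
print(bod)
```

Either approach gives the same data frame; naming the columns at read time simply saves one step.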

 

Read in the cars.txt dataset and call it cars1. Make sure you use the "header=F" option to specify that there are no column names associated with the dataset. Next, assign "speed" and "dist" to be the first and second column names of the cars1 dataset.

The two videos below provide nice explanations of different methods to read data from a spreadsheet into an R dataset.

Import Data, Copy Data from Excel to R, Both .csv and .txt Formats (R Tutorial 1.3) MarinStatsLectures

Importing Data and Working With Data in R (R Tutorial 1.4) MarinStatsLectures

 

Writing Data to a File


After working with a dataset, we might like to save it for future use. Before we do this, let's first set up a working directory so we know where we can find all our data sets and files later.

Setting up a Directory

In the R window, click on "File" and then on "Change dir". You should then see a box pop up titled "Choose directory". For this class, choose the directory "Desktop" by clicking on "Browse", then select "Desktop" and click "OK". In the future, you may want to create a directory on your computer where you keep your data sets and codes for this class.

Alternatively, you can use the setwd() function to set the working directory.

> setwd("C:/Desktop")

To find out what your current working directory is, type

> getwd()

 

Setting Up Working Directories in R (R Tutorial 1.8) MarinStatsLectures

 

In R, we can write data frames easily to a file, using the write.table() command.

> write.table(cars1, file="cars1.txt", quote=F)

The first argument is the data frame to be written to the output file; the second is the name of the output file. By default R surrounds character entries, including the row and column names, with quotes, so we use quote=F to suppress them.

Now, let's check whether R created the file on the Desktop, by going to the Desktop and clicking to open the file. You should see a file with three columns, the first giving the index (or row number) and the other two the speed and distance. R by default creates a column of row indices. If we wanted to create a file without the row indices, we would use the command:

> write.table(cars1, file="cars1.txt", quote=F, row.names=F)
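A quick way to check what was written is to read the file back in. The sketch below assumes that the built-in cars dataset (columns speed and dist) can stand in for cars1, and it writes to R's temporary directory rather than the Desktop so that it runs on any machine.

```r
# Stand-in for the cars1 data frame: the built-in cars dataset
cars1 <- cars

# Write to a temporary location instead of the working directory
out_path <- file.path(tempdir(), "cars1.txt")
write.table(cars1, file = out_path, quote = FALSE, row.names = FALSE)

# Look at the raw text of the first few lines of the file
first_lines <- readLines(out_path, n = 3)

# Read the file back; header = TRUE because the first line holds the names
cars_back <- read.table(out_path, header = TRUE)

print(first_lines)
print(dim(cars_back))
```

Because row.names=F was used, the file has only two columns, and the header line has as many fields as the data lines, so header=TRUE must be given explicitly when reading it back.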

Datasets in R


Watch the video below for a concise introduction to working with the variables in an R dataset.

Working with Variables and Data in R (R Tutorial 1.5) MarinStatsLectures

Around 100 datasets are supplied with R (in the package datasets), and others are available.

To see the list of datasets currently available use the command:

> data()

We will first look at a data set on CO2 (carbon dioxide) uptake in grass plants available in R.

> CO2

[Note: capitalization matters here; also: it's the letter O, not zero. Typing this command should display the entire dataset called CO2, which has 84 observations (in rows) and 5 variables (columns).]

To get more information on the variables in the dataset, type in

> help(CO2)

Evaluate and report the mean and standard deviation of the variables "Concentration" and "Uptake" (stored in the columns named conc and uptake).
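One possible approach, sketched below, is to extract each column with the $ operator and pass it to mean() and sd(). It assumes that "Concentration" and "Uptake" refer to the columns named conc and uptake in the CO2 dataset; the result names are arbitrary.

```r
# dataset$column extracts one variable as a vector
conc_mean <- mean(CO2$conc)
conc_sd <- sd(CO2$conc)

uptake_mean <- mean(CO2$uptake)
uptake_sd <- sd(CO2$uptake)

# Collect the four results in a named vector and round for display
results <- round(c(conc_mean = conc_mean, conc_sd = conc_sd,
                   uptake_mean = uptake_mean, uptake_sd = uptake_sd), 2)
print(results)
```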

 

Subsetting Data in R With Square Brackets and Logic Statements (R Tutorial 1.6) MarinStatsLectures

 

Some Basic Tips


Dataset Files

Space-delimited text file (.txt):

gender id race ses schtyp prgtype read write math science socst

0 70 4 1 1 general 57 52 41 47 57

1 121 4 2 1 vocati 68 59 53 63 61

0 86 4 3 1 general 44 33 54 58 31

0 141 4 3 1 vocati 63 44 47 53 56

Comma-separated file (.csv):

gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst

0,70,4,1,1,general,57,52,41,47,57

1,121,4,2,1,vocati,68,59,53,63,61

0,86,4,3,1,general,44,33,54,58,31

0,141,4,3,1,vocati,63,44,47,53,56

 

 

Using > dim(dataset name), we get the dimensions of the dataset, i.e., the number of observations(rows) and variables(columns)

Using > str(dataset name), we get the structure of the dataset, including the class(type) of all variables
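As a short sketch, these commands can be tried on the built-in airquality dataset. The functions names(), head() and summary() are additional base R functions shown here for illustration only.

```r
# Number of observations (rows) and variables (columns)
dims <- dim(airquality)

# Names of the variables
vars <- names(airquality)

# Structure: the class of each variable and its first few values
str(airquality)

# First three rows, useful as a quick look at the data
print(head(airquality, 3))

# Numerical summary of a single variable
print(summary(airquality$Temp))

print(dims)
print(vars)
```

Running dim() and str() right after reading a file is a quick check that the data were imported as expected.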

 

Getting Help


Before we get deeper into the use of R, it is good to know how to seek help when we get stuck. Two functions are illustrated here. If you know the name of a function or the topic, you may use the function help(…) with your function name or topic inside the parentheses. For example, if you are interested in the function plot(), you can type

> help(plot)

 

and a help page for the plot function will pop up. This can be done more quickly by typing a question mark in front of the function in question.

> ?plot

 

Sometimes, you may want to find something related to a certain keyword. Then you may find help.search() useful. The function help.search() searches the help documentation for functions whose name, title, alias, concept or keyword entries match the word you specify. For example,

> help.search("sort")

will list all functions that have the word 'sort' as an alias or in their title. The function help.search() also has a shortcut consisting of two question marks preceding the keyword (e.g., ??sort). If you want to learn more about the help function, you may type help(help). Besides the official help pages, you can also explore the Internet, which has many resources.
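Two further functions can complement the help pages; they are mentioned here only as a sketch and are not part of the material above. apropos() lists the objects on the search path whose names contain a given string, and args() displays the arguments of a function without opening its help page.

```r
# Names of available functions and objects that contain "sort"
hits <- apropos("sort")
print(hits)

# Argument list of sort(), printed in the console
print(args(sort))
```

Unlike help.search(), apropos() looks only at object names, so it is faster but finds less.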

 

Loading Packages


R is open source software, so many packages are freely available. However, not every package is installed or loaded when you open R.

Watch the following video.

Installing Packages in R (R Tutorial 1.10) MarinStatsLectures

If you want to use a package that is not yet installed, you need to install it and then load it into R. For example, you may want to use package ISwR, which is a package used in the textbook, to help you while reading along with the textbook. Here is the procedure:

  1. Install package ISwR:

                  Click Set CRAN Mirror… in the Packages options

                  -> Pick the closest server site, eg USA (MA), and click OK

                  -> Click Install Package(s)… in the Packages options

                  -> Pick ISwR

Once it is downloaded you will see something like the following statement. Note that I have deleted some output to save space.  Don't panic if you see a warning here; you may ignore this for the time being.

….

package 'ISwR' successfully unpacked and MD5 sums checked

The downloaded packages are in

C:\Users\ctliu\AppData\Local\Temp\RtmpjNPIgz\downloaded_packages

  2. Load the package ISwR

Once you install the package, you still need to load it into R before use. To load the package, type

> library(ISwR)

Now package ISwR is ready for you to use.

 

You can also install a package directly from the R console by typing

> install.packages("packagename")
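The two steps, install and load, can be combined in a small helper so that a package is downloaded only when it is missing. This is a sketch under assumptions: load_or_install is a made-up name, not a function provided by R, and the call for ISwR is left commented out because it may download from CRAN.

```r
# Hypothetical helper: install a package only if needed, then load it
load_or_install <- function(pkg) {
  # requireNamespace() returns FALSE, instead of stopping, if pkg is missing
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE tells library() that pkg holds the name as a string
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# Usage (not run here, since it may download the package from CRAN):
# load_or_install("ISwR")
```

Installation is needed only once per computer, whereas library() must be called in every new R session in which the package is used.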

 
