1.3 Bringing data into R from an Excel file or a text file

 


For an analysis of a single variable, with a small number of observations, it is easy to enter a column vector directly into R as described above. But with larger data sets, it is easier to first create and save the data set in Excel, and then to bring information from the Excel file into R. There are several ways to do this. I find it easiest to use the 'read.csv(file.choose))' command, which is described first and uses a Windows-like file menu to find the data file and then bring data into R.

1.3.1 Bringing data into R from an Excel file using the read.csv(file.choose()) command

MS Excel is an excellent tool for entering and managing data from a small statistical study. Data are arranged with variables as columns and subjects as rows. The first row of the Excel file (the 'header') can be used to provide variable names (object names for vectors in R). For example, the following are data from the first 5 subjects in a study to compare age first walking between two groups of infants:

Subject

group

sexmale

agewalk

1

1

1

9.00

2

1

1

9.50

3

1

1

9.75

4

1

0

10.00

5

1

1

13.00

 

Here, "Subject" is an id code; "group" is coded 1 or 2 for the two study groups; "sexmale" is coded 1 for males and 0 for females; and "agewalk" is the age when the infant first walked, in months. Note that I've used single-word (no spaces) variable names; using the underscore '_' or period '.' are nice ways to separate words in a variable name (for example, age_years or age.years are viewed as one-word variable names by R).

To bring an Excel data file into R, it first has to be saved as a comma-delimited file. In Excel, click on 'Save as', and select '.csv' as the file type. Save the file and exit Excel. The .csv file can then be brought into R as a 'data frame' using the 'read.csv(file.choose())' command. Entering

>kidswalk <- read.csv(file.choose())

will open a menu with a file listing for the default directory. See Section 1.3.6 below for instructions on changing the default directory (Link to Changing Default Directory). Double clicking on the data file will bring it into R under the name 'kidswalk'. You can also navigate in the file menu to open .csv files saved in other directories. R will use these object names to identify data, and so the same name cannot be used for both a data frame and a variable name.

The file menu from the 'read.csv(file.choose())' command is illustrated below:

NOTE: Depending on your operating system, R may not be able to read a data file that is opened in another application, and so you may have to close the data set in Excel before being able to read it into R.

NOTE: While the 'read.cvs(file.choose())' function brings a data set into R, there are still some issues with accessing an individual variable from within the data set. Section 1.3.3 below discusses accessing individual variables within a data set.

1.3.2 (Optional) Bringing data into R from an Excel file using the read.csv() command

If you know the name of the file that you want to bring into R, you can read a .csv file directly into R. For example, suppose we saved the data for the Age at Walking example as the file 'agewalk4R.csv' in the R default directory. It can be read in as:

> kidswalk <- read.csv('agewalk4R.csv')

Here, the data set is being saved as a 'data frame' object named 'kidswalk'; the function 'read.csv' reads in the specified .csv file and creates the corresponding R object.

Data sets saved outside the default directory can also be read directly into R, by specifying the folder path (although it may be easier to use the 'file.choose()' command described above). For example:

> kidswalk <- read.csv("C:/Users/tch/Documents/BS703/Data Sets/agewalk4R.csv")

would bring in a data set saved under the BS703/Data Sets folder. Note that forward slashes ( / ) are used in giving the file directory, rather than the backslash ( \ ) used by Windows. R will not recognize paths designated using the usual backslash, and so you must change the slash when cutting-and-pasting directory paths from Windows.

1.3.3. Accessing individual variables from an imported data set

The 'dataframename$variablename' convention

The 'read.csv' command creates an object (dataframe) for the entire data set represented by an Excel file, but it does not create objects for the individual variables. So, while we could perform some analyses on this entire data set, we cannot yet perform analyses on specific variables. When variable names are specified as the first row of the imported Excel file, R creates objects using the 'dataframename$variablename' convention. For example, in the Age First Walking example, after reading in the data set

> kidswalk <- read.csv(file.choose())

the 'agewalk' variable is named 'kidswalk$agewalk', and the 'group' variable is named 'kidswalk$group'. So, to find the mean age at walking, we could enter

> mean(kidswalk$agewalk)

[1] 11.13

The attach( ) command

For convenience, the individual variables in a data set can also be named without the dataframename prefix. The 'attach()' function creates individual objects for each variable, where the data frame name is specified in the parentheses:

> attach(kidswalk)

This function does not give any visible output, but creates objects (column vectors) for each individual variable in the data set, using the variable names specified in the first row as the object names. For the Age at Walking example, it creates data objects named Subject, group, sexmale, and agewalk. We could then use any of these variable objects in analyses:

> mean(agewalk)

[1] 11.13

Note that R is case-sensitive, and so 'Subject' is a different name than 'subject'. Also, two objects cannot have the same name in R, and so you cannot use the same name for both a dataframe and a variable.

1.3.4 Viewing or editing a data frame using the R data editor

An R dataframe can be viewed and edited as a spreadsheet within R using the R data editor. In R, click on the 'Editor' menu at the top of the R screen, then click on 'Data editor'; this leads to a prompt for the name of the dataframe to view/edit. Or, from the command line, the fix( ) function will open the data editor:

> fix(kidswalk)

The data set appears in a spreadsheet format. Analyses cannot be performed while the data editor is open.

1.3.5 (Optional) Bringing data into R from a space-delimited text file

When using Excel to organize data, it is easiest to bring data into R as .csv files. But data may be computerized through other programs, and R can read data saved through other programs as well. The read.table()function reads data that was saved as a text file (with a .txt extension) through MS Word or other programs, with spaces separating the entries in each line of data. As with Excel files, the data set should be set up with columns representing variables and rows representing subjects, and it is helpful to specify variable names as the first row of the document.

> kidswalk <- read.table("agewalk4R.txt")

The same conventions apply to naming individual variables in the data set, as described in 1.3.3 above.

1.3.6 (Optional) Specifying the default folder for R

In order to import a saved data set into R, R needs to know which directory (or folder) the data set is saved under on your computer. When you use the 'read.csv(file.choose())' command, you can navigate through folders just as you can with most Windows menus. But you can specify the folder that R first open. The steps for setting the default folder in R differ for PCs and Macs, and instructions for both are given below.

The default folder only needs to be set once, and R will continue to look for files in the default folder.

The default folder for R can be over-written for a single session. After starting R, click on the 'File' menu in the R screen, then select 'Change dir', and specify the directory to be used for this session. R will look for files in this directory for the current session, but will go back to the default directory in future sessions. However, if you 'save the workspace', and the start R by clicking on the saved workspace, settings can be carried over to future sessions.