1.10 Handling missing data in R


Many research studies involve missing data: not all study variables are measured on all study subjects. Most functions in R handle missing data appropriately by default, but a few basic functions require care when missing data are present. It is always a good idea to check for missing data in a data set.

When inputting data directly into R, 'NA' is used to designate missing data. For example,

> xvar <- c(2,NA,3,4,5,8)

creates a variable ('xvar') for a sample of 6 subjects, in which the second subject is missing data for this variable. NA is also used to indicate missing data when R prints data:

> xvar

[1] 2 NA 3 4 5 8
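Missing values can also be located directly rather than by reading the printed vector. The following is a minimal sketch using the base R functions is.na( ), anyNA( ) and which( ) on the same 'xvar' vector:

```r
xvar <- c(2, NA, 3, 4, 5, 8)

# is.na() returns TRUE at each missing position
is.na(xvar)
# [1] FALSE  TRUE FALSE FALSE FALSE FALSE

# anyNA() answers "is anything missing?" in one step
anyNA(xvar)
# [1] TRUE

# which() gives the positions of the missing values
which(is.na(xvar))
# [1] 2

# Comparing with == does not detect NA: the result is itself NA
xvar[2] == NA
# [1] NA
```

Because any comparison with NA returns NA, is.na( ) is the reliable way to test for missing values.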

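When setting up a dataset using Excel, missing data can be represented either by 'NA' or by just leaving the cell blank in Excel. In either case, numeric data will be treated as missing when imported into R. For text columns, a blank cell may be read as an empty string rather than NA, depending on the import function and its settings, so it is worth checking after import.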
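As an illustration, the sketch below assumes the Excel sheet has been saved as a CSV file and is read with read.csv( ). The column names and values are hypothetical, and the file contents are supplied inline through the 'text' argument so the example is self-contained. The 'na.strings' argument lists the strings that should be read as missing:

```r
# Hypothetical CSV content, as it might look after saving an Excel sheet as .csv.
# Subject 2 has a blank cell for age; subject 3 has 'NA' typed in two cells.
csv_text <- "id,age,currsmoke
1,34,0
2,,1
3,NA,NA"

# Treat both 'NA' and blank cells as missing, for every column type
dat <- read.csv(text = csv_text, na.strings = c("NA", ""))

# Count the missing values in each column
colSums(is.na(dat))
#        id       age currsmoke
#         0         2         1
```

With a real file, the 'text' argument would be replaced by the file name.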

To check for missing data in a measurement (numeric) variable, we can use the summary( ) command:

> summary(xvar)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's

    2.0     3.0     4.0     4.4     5.0     8.0     1.0

Along with the minimum value, first quartile (25th percentile), median, mean, third quartile (75th percentile) and maximum value, the summary command also lists the number of observations with missing data under the NA's column (here there is one subject with missing data). For a categorical variable, we can check for missing data using the useNA='always' option in the table( ) command (see sections 15 through 17 for more on the table( ) command):

> table(currsmoke,useNA='always')

currsmoke

   0    1 <NA>

  11    6    3

In this example of current smoking status, there are 11 non-smokers, 6 smokers, and 3 with missing data.
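The 'currsmoke' data themselves are not listed here, so the sketch below builds a hypothetical vector with the same counts (11 non-smokers, 6 smokers, 3 missing) to show how the table( ) result compares with a direct count of missing values:

```r
# Hypothetical data constructed to match the counts in the table above
currsmoke <- c(rep(0, 11), rep(1, 6), rep(NA, 3))

# Tabulate, including a column for missing values
tab <- table(currsmoke, useNA = "always")
tab

# Direct count of missing values
sum(is.na(currsmoke))
# [1] 3

# Without useNA, table() drops the missing values, so only 17 subjects are counted
sum(table(currsmoke))
# [1] 17
```

The last line shows why the useNA option matters: by default the missing subjects disappear from the table without any warning.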

Most R functions appropriately handle missing data, excluding it from analysis. There are a couple of basic functions where extra care is needed with missing data.

The length( ) command gives the number of observations in a data vector, including missing data. For example, there were 6 subjects in the data set for the 'xvar' variable in the example above, although there were only 5 subjects with actual data and one had a missing value. Using the length( ) function gives

> length(xvar)

[1] 6

which can be misleading, since there are only 5 subjects with valid values for this variable. To find the number of non-missing observations for a variable, we can combine the length( ) function with the na.omit( ) function. The na.omit( ) function returns its argument with the missing values removed. So, listing the values of xvar gives:

> xvar

[1] 2 NA 3 4 5 8

while listing the non-missing values of xvar gives (R also prints attributes recording which positions were dropped; these are not shown here)

> na.omit(xvar)

[1] 2 3 4 5 8

To find the number of non-missing observations for xvar,

> length(na.omit(xvar))

[1] 5
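An alternative that avoids na.omit( ) is to count the non-missing values directly with is.na( ). This is a minimal sketch; the names 'n_valid' and 'n_missing' are introduced here only for illustration:

```r
xvar <- c(2, NA, 3, 4, 5, 8)

# TRUE counts as 1 when summed, so this counts the non-missing values
n_valid <- sum(!is.na(xvar))
n_valid
# [1] 5

# The number of missing values
n_missing <- sum(is.na(xvar))
n_missing
# [1] 1

# The two counts add up to the full length of the vector
n_valid + n_missing == length(xvar)
# [1] TRUE
```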

Another common function that does not automatically deal with missing data is the mean( ) function. Trying to calculate a mean for a variable with missing data gives the following:

> mean(xvar)

[1] NA

We can calculate the mean for the non-missing values using the na.omit( ) function:

> mean(na.omit(xvar))

[1] 4.4

Some functions also have options to deal with missing data. For example, the mean( ) function has the na.rm=TRUE option to remove missing values from the calculation. So another way to calculate the mean of the non-missing values for a variable is:

> mean(xvar, na.rm=TRUE)

[1] 4.4
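The na.rm option is not specific to mean( ). As a brief sketch, several other base R summary functions accept the same argument and, like mean( ), return NA by default when missing values are present:

```r
xvar <- c(2, NA, 3, 4, 5, 8)

# Without na.rm, the missing value propagates to the result
sum(xvar)
# [1] NA

# With na.rm = TRUE, the missing value is dropped before calculating
sum(xvar, na.rm = TRUE)
# [1] 22

median(xvar, na.rm = TRUE)
# [1] 4

range(xvar, na.rm = TRUE)
# [1] 2 8

# Standard deviation of the 5 non-missing values
sd(xvar, na.rm = TRUE)
```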

See the R help pages (for example, help(mean) or ?mean) for the missing-data options available for specific functions and analyses.