Summary Statistics


Can you explain each of the following data types and give examples of each?

 Can you outline the summary statistics one would use for each of these data types?

The first module in this series provided an introduction to working with datasets and computing some descriptive statistics. We will continue this with the airquality data.

To recall the components of the data set, print out the first 5 rows.

> airquality[1:5,]

 

  Ozone Solar.R Wind Temp Month Day

1    41     190  7.4   67     5   1

2    36     118  8.0   72     5   2

3    12     149 12.6   74     5   3

4    18     313 11.5   62     5   4

5    NA      NA 14.3   56     5   5

Recall that we can compute the mean Temp by "extracting" the variable Temp from the dataset using the $ operator as follows:

> mean(airquality$Temp)

[1] 77.88235

Similarly, we can compute the median Temp:

> median(airquality$Temp)

[1] 79

The variance is computed in the same way, here for Wind:

> var(airquality$Wind)

[1] 12.41154
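
The standard deviation is the square root of the variance, so the result above can be cross-checked with base R's sd() function. A minimal sketch, assuming the built-in airquality data; the names wind_var and wind_sd are ours, chosen only for illustration:

```r
# Variance and standard deviation of Wind (this column has no missing values)
wind_var <- var(airquality$Wind)
wind_sd  <- sd(airquality$Wind)

print(wind_var)
print(wind_sd)

# sd() should agree with the square root of var()
print(all.equal(wind_sd, sqrt(wind_var)))
```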

The Attach Command 

If we don't want to keep using the "$" sign to point to the data set, we can use the attach command, which adds the data set to R's search path so that its variables can be called by name. For example, the above can then be accomplished by:

> attach(airquality)

> var(Wind)

Once we are finished working with this data set, we can use the detach() command to remove it from the search path.

Never attach two data sets that share the same variable names: this could lead to confusion and errors! A good idea is to detach a data set as soon as you have finished working with it.
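
One way to sidestep these pitfalls, which is not part of the original walkthrough, is base R's with() function. It evaluates an expression inside a data frame without attaching it. A small sketch (the name wind_var is hypothetical):

```r
# Evaluate var(Wind) inside airquality without attach()/detach()
wind_var <- with(airquality, var(Wind))
print(wind_var)

# Should match the value obtained with the "$" form
print(all.equal(wind_var, var(airquality$Wind)))
```

Because nothing is added to the search path, there is nothing to detach afterwards.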

For now, let's keep this data set attached, while we test out some other functions.

The quantile function computes the quantiles of a numeric variable. By default you get the minimum, the maximum, and the three quartiles (the 0.25, 0.50, and 0.75 quantiles). The difference between the first and third quartiles is called the interquartile range (IQR) and is sometimes used as an alternative to the standard deviation.

> quantile(airquality$Temp)

  0%  25%  50%  75% 100%

  56   72   79   85   97
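
The IQR described above can be read off the quartiles directly, or obtained with base R's IQR() function. A short sketch, assuming the built-in airquality data (q and iqr_by_hand are illustrative names):

```r
# Quartiles of Temp (default quantile() output)
q <- quantile(airquality$Temp)

# IQR as the third quartile minus the first quartile
iqr_by_hand <- unname(q["75%"] - q["25%"])

# Base R's IQR() should give the same number
print(iqr_by_hand)
print(IQR(airquality$Temp))
```

With the quartiles printed above (72 and 85), both values should come out as 13.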

It is also possible to obtain other quantiles; this is done by adding an argument containing the desired percentage cut points. To get the deciles, use the sequence function:

> pvec <- seq(0,1,0.1) #sequence of numbers from 0 to 1, by 0.1

> pvec

[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

 

 

> quantile(airquality$Temp, pvec)

  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%

56.0 64.2 69.0 74.0 76.8 79.0 81.0 83.0 86.0 90.0 97.0

 

How would you use this method to get quintiles?

Apply() Commands 

We can also get summary statistics for multiple columns at once, using the apply() command. apply() is extremely useful, as are its cousins tapply() and lapply() (more on these functions later).

Let's first find the column means using the apply command:

> apply(airquality, 2, mean) #do for multiple columns at once

    Ozone   Solar.R      Wind      Temp     Month       Day

       NA        NA  9.957516 77.882353  6.993464 15.803922

We get NA for Ozone and Solar.R because those columns contain missing observations! R will not skip missing values unless explicitly requested to do so. You can give the na.rm argument (not available, remove) to request that missing values be removed:

> apply(airquality, 2, mean, na.rm=T)

     Ozone    Solar.R       Wind       Temp      Month        Day

 42.129310 185.931507   9.957516  77.882353   6.993464  15.803922
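
For the specific case of column means, base R also provides colMeans(), which accepts the same na.rm argument. The sketch below compares the two approaches; it reads the data as datasets::airquality into a local copy named aq (our name) so the comparison does not depend on any changes made to airquality elsewhere in a session:

```r
# Column means with missing values removed, computed two ways
aq <- datasets::airquality
means_apply    <- apply(aq, 2, mean, na.rm = TRUE)
means_colmeans <- colMeans(aq, na.rm = TRUE)

print(means_colmeans)

# The two results should agree
print(all.equal(means_apply, means_colmeans))
```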

The same argument works when calculating a summary for a single column:

> mean(airquality$Ozone, na.rm=T)

[1] 42.12931
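
Before removing missing values, it can help to see how many there are. A minimal sketch using is.na(), which flags missing entries, together with colSums() (na_counts is an illustrative name):

```r
# Count missing values in each column
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Proportion of Ozone values that are missing
print(mean(is.na(airquality$Ozone)))
```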

Summary Function

There is also a summary function that gives a number of summaries on a numeric variable (or even the whole data frame!) in a nice vector format:

> summary(airquality$Ozone)  #note we don't need "na.rm" here

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's

   1.00   18.00   31.50   42.13   63.25  168.00   37.00

> summary(airquality)

     Ozone           Solar.R           Wind             Temp           Month            Day      

 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000   Min.   : 1.00 

 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000   1st Qu.: 8.00 

 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000   Median :16.00 

 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993   Mean   :15.80 

 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000   3rd Qu.:23.00 

 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000   Max.   :31.00 

 NA's   : 37.00   NA's   :  7.0  

 

Notice that "Month" and "Day" are coded as numeric variables even though they are clearly categorical. This can be fixed by converting the variable to a factor, e.g. for Month:

> airquality$Month = factor(airquality$Month)

> summary(airquality)

     Ozone           Solar.R           Wind             Temp       Month       Day      

 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   5:31   Min.   : 1.00 

 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   6:30   1st Qu.: 8.00 

 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   7:31   Median :16.00 

 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   8:31   Mean   :15.80 

 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   9:30   3rd Qu.:23.00 

 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00          Max.   :31.00 

 NA's   : 37.00   NA's   :  7.0        

 

 Notice how the display changes for the factor variables.
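
The per-level counts that summary() shows for a factor can also be obtained with table(). The sketch below additionally attaches readable labels to the levels; the labels, the copy named aq, and the name month_counts are our own choices for illustration, not part of the original tutorial:

```r
# Work on a copy so the built-in data set is left untouched
aq <- datasets::airquality

# Convert Month to a factor with readable labels (month.abb is built into R)
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.abb[5:9])

# Frequency table: number of daily records per month
month_counts <- table(aq$Month)
print(month_counts)
```

The counts should match the Month column of the summary above (31, 30, 31, 31, 30).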

 

Find the standard deviations (SDs) of all the numeric variables in the air quality data set, using the apply function.

 

Summary Statistics in R: Mean, Standard Deviation, Frequencies, etc (R Tutorial 2.7) MarinStatsLectures