Summary Statistics
Can you explain each of the following data types and give examples of each?
- Categorical (Binary as a special case)
- Ordinal
- Continuous
- Time-to-Event
Can you outline the summary statistics one would use for each of these data types?
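As a rough aid for the questions above, the sketch below shows one way the first three data types might be represented and summarized in R. The vectors smoker, severity, and height are hypothetical toy data invented for illustration. Time-to-event data is left out because it needs tools beyond the base functions used in this module.

```r
# Hypothetical toy vectors, invented for illustration only.
smoker   <- factor(c("yes", "no", "no", "yes", "no"))          # categorical (binary)
severity <- factor(c("mild", "severe", "moderate", "mild", "mild"),
                   levels = c("mild", "moderate", "severe"),
                   ordered = TRUE)                              # ordinal
height   <- c(162.5, 171.0, 180.2, 158.4, 175.9)               # continuous

table(smoker)                  # counts per category
prop.table(table(smoker))      # proportions per category
table(severity)                # counts, listed in level order
median(as.integer(severity))   # median level, as an integer code
mean(height)                   # centre of a continuous variable
sd(height)                     # spread of a continuous variable
```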
The first module in this series provided an introduction to working with datasets and computing some descriptive statistics. We will continue this with the airquality data.
To recall the components of the data set, print out the first 5 rows.
> airquality[1:5,]
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
Recall that we can compute the mean Temp by "extracting" the variable Temp from the dataset using the $ operator as follows:
> mean(airquality$Temp)
[1] 77.88235
Similarly, we can compute the median Temp:
> median(airquality$Temp)
[1] 79
We can also compute the variance of Wind:
> var(airquality$Wind)
[1] 12.41154
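The standard deviation is the square root of the variance. As a minimal sketch, it can be obtained either from var() or directly with the base R function sd(). The names wind_var and wind_sd are chosen here for illustration.

```r
# Standard deviation of Wind, computed two equivalent ways.
wind_var <- var(airquality$Wind)     # sample variance (divides by n - 1)
wind_sd  <- sd(airquality$Wind)      # sample standard deviation

sqrt(wind_var)                       # same value as wind_sd
all.equal(wind_sd, sqrt(wind_var))   # checks that the two agree
```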
The Attach Command
If we don't want to keep using the "$" sign to point to the data set, we can use the attach() command to make the data set the current or working one in R, and then just call the variables by name. For example, the variance above can then be computed by:
> attach(airquality)
> var(Wind)
Once we are finished working with this data set, we can use the detach() command to remove it from R's search path.
Never attach two data sets that share the same variable names: this could lead to confusion and errors! A good idea is to detach a data set as soon as you have finished working with it.
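If name clashes are a concern, one alternative is the base R function with(), which evaluates an expression inside a data frame without attaching it. This is only a sketch of that option, not a replacement for the approach used in this module. The name wind_var_with is chosen for illustration.

```r
# Evaluate an expression inside the data frame without attaching it.
wind_var_with <- with(airquality, var(Wind))
wind_var_with

# Several summaries at once, still without attach().
with(airquality, c(mean = mean(Temp), median = median(Temp)))
```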
For now, let's keep this data set attached, while we test out some other functions.
The quantile() function summarizes the distribution of a variable. By default you get the minimum, the maximum, and the three quartiles (the 0.25, 0.50, and 0.75 quantiles). The difference between the third and first quartiles is called the interquartile range (IQR) and is sometimes used as an alternative to the standard deviation.
> quantile(airquality$Temp)
  0%  25%  50%  75% 100%
  56   72   79   85   97
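The interquartile range can be read off the output above as 85 - 72 = 13. As a small sketch, it can also be computed from quantile() or with the base R function IQR(). The names q, temp_iqr_manual, and temp_iqr are chosen for illustration.

```r
# Interquartile range of Temp, computed two ways.
q <- quantile(airquality$Temp, c(0.25, 0.75))   # first and third quartiles
temp_iqr_manual <- unname(q[2] - q[1])          # third quartile minus first
temp_iqr <- IQR(airquality$Temp)                # built-in equivalent

temp_iqr_manual
temp_iqr
```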
It is also possible to obtain other quantiles; this is done by adding an argument containing the desired percentage cut points. To get the deciles, use the sequence function:
> pvec <- seq(0,1,0.1) #sequence of numbers from 0 to 1, in steps of 0.1
> pvec
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> quantile(airquality$Temp, pvec)
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
56.0 64.2 69.0 74.0 76.8 79.0 81.0 83.0 86.0 90.0 97.0
How would you use this method to get quintiles?
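One possible answer, sketched under the assumption that quintiles are the cut points at multiples of 0.2, is to change the step of the sequence. The names qvec and temp_quintiles are chosen for illustration.

```r
# Quintiles: cut points at 0%, 20%, 40%, 60%, 80% and 100%.
qvec <- seq(0, 1, 0.2)
temp_quintiles <- quantile(airquality$Temp, qvec)
temp_quintiles
```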
Apply() Commands
We can also get summary statistics for multiple columns at once, using the apply() command. apply() is extremely useful, as are its cousins tapply() and lapply() (more on these functions later).
Let's first find the column means using the apply command:
> apply(airquality, 2, mean) #do for multiple columns at once
    Ozone   Solar.R      Wind      Temp     Month       Day
       NA        NA  9.957516 77.882353  6.993464 15.803922
We get NA for Ozone and Solar.R because these columns contain missing observations! R will not skip missing values unless explicitly requested to do so. You can give the na.rm argument (not available, remove) to request that missing values be removed:
> apply(airquality, 2, mean, na.rm=TRUE)
     Ozone    Solar.R       Wind       Temp      Month        Day
 42.129310 185.931507   9.957516  77.882353   6.993464  15.803922
The same argument works when calculating a summary for a single column:
> mean(airquality$Ozone, na.rm=TRUE)
[1] 42.12931
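For column means specifically, base R also offers colMeans(), and sapply() can apply a function to each column of a data frame. The sketch below assumes every column is still numeric, so it should be run before any column is converted to a factor. The name col_means is chosen for illustration.

```r
# Column means with missing values removed, two alternative ways.
col_means <- colMeans(airquality, na.rm = TRUE)
col_means

# sapply() applies mean() to each column and passes na.rm along.
sapply(airquality, mean, na.rm = TRUE)
```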
Summary Function
There is also a summary function that gives a number of summaries on a numeric variable (or even the whole data frame!) in a nice vector format:
> summary(airquality$Ozone) #note we don't need "na.rm" here
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   1.00   18.00   31.50   42.13   63.25  168.00   37.00
> summary(airquality)
     Ozone          Solar.R          Wind            Temp           Month            Day
 Min.   :  1.00  Min.   :  7.0  Min.   : 1.700  Min.   :56.00  Min.   :5.000  Min.   : 1.00
 1st Qu.: 18.00  1st Qu.:115.8  1st Qu.: 7.400  1st Qu.:72.00  1st Qu.:6.000  1st Qu.: 8.00
 Median : 31.50  Median :205.0  Median : 9.700  Median :79.00  Median :7.000  Median :16.00
 Mean   : 42.13  Mean   :185.9  Mean   : 9.958  Mean   :77.88  Mean   :6.993  Mean   :15.80
 3rd Qu.: 63.25  3rd Qu.:258.8  3rd Qu.:11.500  3rd Qu.:85.00  3rd Qu.:8.000  3rd Qu.:23.00
 Max.   :168.00  Max.   :334.0  Max.   :20.700  Max.   :97.00  Max.   :9.000  Max.   :31.00
 NA's   : 37.00  NA's   :  7.0
Notice that "Month" and "Day" are coded as numeric variables even though they are clearly categorical. This can be fixed by converting the variable to a factor, for example for Month:
> airquality$Month <- factor(airquality$Month)
> summary(airquality)
     Ozone          Solar.R          Wind            Temp      Month       Day
 Min.   :  1.00  Min.   :  7.0  Min.   : 1.700  Min.   :56.00  5:31  Min.   : 1.00
 1st Qu.: 18.00  1st Qu.:115.8  1st Qu.: 7.400  1st Qu.:72.00  6:30  1st Qu.: 8.00
 Median : 31.50  Median :205.0  Median : 9.700  Median :79.00  7:31  Median :16.00
 Mean   : 42.13  Mean   :185.9  Mean   : 9.958  Mean   :77.88  8:31  Mean   :15.80
 3rd Qu.: 63.25  3rd Qu.:258.8  3rd Qu.:11.500  3rd Qu.:85.00  9:30  3rd Qu.:23.00
 Max.   :168.00  Max.   :334.0  Max.   :20.700  Max.   :97.00        Max.   :31.00
 NA's   : 37.00  NA's   :  7.0
Notice how the display changes for the factor variables.
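For a factor, the natural summary is a frequency table. The sketch below uses table() for the counts and tapply(), one of the cousins of apply() mentioned earlier, for a per-month mean. It works on a copy named aq (a hypothetical name) so that the original data frame is left unchanged.

```r
# Work on a copy so the original airquality data frame is not modified.
aq <- airquality
aq$Month <- factor(aq$Month)

# Number of rows recorded in each month.
month_counts <- table(aq$Month)
month_counts

# Mean temperature within each month.
temp_by_month <- tapply(aq$Temp, aq$Month, mean)
temp_by_month
```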
Find the standard deviations (SDs) of all the numeric variables in the air quality data set, using the apply function.
Summary Statistics in R: Mean, Standard Deviation, Frequencies, etc (R Tutorial 2.7) MarinStatsLectures