1.9 Subgroup analyses: finding means and standard deviations for subgroups


There are (at least) three ways to do subgroup analyses in R.

  1. First (and I think easiest), we can use a logical condition in square brackets to restrict an analysis to a subgroup of subjects.
  2. Second, the tapply() function can be used to perform analyses across a set of subgroups in a data frame.
  3. Third, we can create a new data frame for a particular subgroup using the subset() function, and then perform analyses on this new data frame.

An analysis can be restricted to a subset of subjects using the 'varname[subset]' format. For example,

> mean(agewalk[group==1])

[1] 10.72727

finds the mean of the variable 'agewalk' for those subjects with group equal to 1. When specifying the condition for inclusion in the subset analysis ('group==1' in this example), two equal signs '==' are needed to test for equality. Less than (<) and greater than (>) operators can also be used. For example, the following command would find the mean systolic blood pressure for subjects with age over 50:

> mean(sysbp[age>50])
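
As a self-contained illustration of the bracket format, the following sketch uses a small made-up data set. The values are invented for illustration and do not come from any of the course data sets; the names 'over50', 'mean_over50' and 'n_over50' are likewise hypothetical.

```r
# Hypothetical data, invented for illustration only
age   <- c(34, 52, 67, 45, 71, 58)
sysbp <- c(118, 135, 142, 121, 150, 128)

# age > 50 is a logical vector: FALSE TRUE TRUE FALSE TRUE TRUE
over50 <- age > 50

# Mean systolic blood pressure among subjects older than 50
mean_over50 <- mean(sysbp[over50])
print(mean_over50)   # 138.75

# Summing the logical vector counts the subjects in the subgroup
n_over50 <- sum(over50)
print(n_over50)      # 4
```

Storing the condition in its own vector is optional, but it makes it easy to check which subjects were selected before running the analysis.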

Another approach is to use the tapply() function to perform an analysis on subsets of the data set. The tapply() function takes three arguments: 1) the outcome variable (data vector) to be analyzed, 2) the categorical variable (data vector) that defines the subsets of subjects, and 3) the function to be applied to the outcome variable. To find the means, standard deviations, and n's for the two study groups in the 'kidswalk' data set:

> tapply(agewalk, group, mean)

       1        2

10.72727 11.91176

> tapply(agewalk, group, sd)

       1        2

1.231684 1.277636

> tapply(agewalk, group, length)

1 2

33 17
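
The three tapply() calls can also be collected into a single table. The sketch below is one possible way to do this, using made-up values rather than the kidswalk data; the object names ('group_means', 'summary_table', and so on) are hypothetical.

```r
# Hypothetical data, invented for illustration only
agewalk <- c(9.5, 10.0, 11.5, 12.0, 12.5, 13.0)
group   <- c(1, 1, 1, 2, 2, 2)

# One call per summary statistic, each returning a value for every group
group_means <- tapply(agewalk, group, mean)
group_sds   <- tapply(agewalk, group, sd)
group_ns    <- tapply(agewalk, group, length)

# Bind the three results side by side: one row per group, one column per statistic
summary_table <- cbind(mean = group_means, sd = group_sds, n = group_ns)
print(summary_table)
```

Here length() is applied to each group's vector of outcome values, so it returns the number of subjects in that group.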

The subset() function creates a new data frame, restricting observations to those that meet some criteria. For example, the following creates a new data frame for kids in Group 2 of the kidswalk data frame (named 'group2kids'), and finds the n and mean Age_walk for this subgroup:

> group2kids <- subset(kidswalk,Group==2)

> nrow(group2kids)

[1] 17

> mean(group2kids$Age_walk)

[1] 11.91176

In this example, there are two data sets open in R (kidswalk for the overall sample and group2kids for the subsample) that use the same set of variable names. In this situation, it is helpful to use the 'dataframe$variablename' format to specify a variable name for the appropriate sample.
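
To see the subset() workflow from start to finish, here is a minimal sketch on a made-up data frame. The data frame 'kids' and its values are invented for illustration and are not the kidswalk data. Note that for a data frame, nrow() counts subjects (rows), while length() counts variables (columns).

```r
# Hypothetical data frame, invented for illustration only
kids <- data.frame(
  Group    = c(1, 1, 2, 2, 2),
  Age_walk = c(10.0, 11.0, 12.0, 11.5, 12.5)
)

# Keep only the rows where Group equals 2
group2 <- subset(kids, Group == 2)

n_rows <- nrow(group2)    # number of subjects (rows) in the subgroup
n_cols <- length(group2)  # number of variables (columns), not subjects
print(c(rows = n_rows, columns = n_cols))   # rows 3, columns 2

# The data frame prefix makes clear which sample each mean comes from
mean_all    <- mean(kids$Age_walk)
mean_group2 <- mean(group2$Age_walk)
print(c(all = mean_all, group2 = mean_group2))   # all 11.4, group2 12.0
```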

When specifying the condition for inclusion in the subsample ('Group==2' in this example), two equal signs '==' are needed to test for equality. The operators less than (<), less than or equal to (<=), greater than (>), greater than or equal to (>=), and not equal to (!=) can also be used. For example,

> age65plus <- subset(allsubjects,age>=65)

would create a data frame of subjects aged 65 and older.
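
The choice of operator can matter when a variable is not recorded in whole numbers. The following sketch, using a made-up data frame ('subjects', with invented values), compares a few conditions; it assumes ages may include decimals.

```r
# Hypothetical data frame, invented for illustration only
subjects <- data.frame(
  id  = 1:6,
  age = c(64, 64.5, 65, 70, 58, 81),
  sex = c("F", "M", "F", "M", "M", "F")
)

older_ge <- subset(subjects, age >= 65)   # aged 65 and older
older_gt <- subset(subjects, age > 64)    # also includes the subject aged 64.5
not_male <- subset(subjects, sex != "M")  # everyone whose sex is not "M"

print(nrow(older_ge))   # 3
print(nrow(older_gt))   # 4
print(nrow(not_male))   # 3
```

When ages are stored as whole numbers, 'age>64' and 'age>=65' select the same subjects.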