Analyzing Data in Subsets Using R

The tapply() command

The tapply() function applies a function (e.g., a descriptive statistic) to subsets of a data set. In effect, it lets you split the data by one or more classifying factors and then compute something (e.g., the mean or standard deviation of a given variable) within each subset. Note that tapply() is used for descriptive statistics (e.g., mean, sd, summary) of continuously distributed variables. For categorical variables, use the table() function to get counts and the prop.table() function to get proportions. The basic structure of the tapply command is:

> tapply(<var>, <by.var>, <function>)

where <var> is the variable that you want to analyze, <by.var> is the variable that you want to subset by, and <function> is the function or computation that you want to apply to <var>.

For example, suppose I have a data set with continuous variables Dubow (Dubowitz score) and Ppregwt (pre-pregnancy weight) and a classifying variable DrugExp (drug exposure). My goal is to group the data by DrugExp and then compute the mean and standard deviation of Dubowitz scores and pre-pregnancy weights for each category of DrugExp.

> tapply(Dubow,DrugExp,mean) # Gives the means of Dubowitz score by drug exposure
> tapply(Dubow,DrugExp,sd) # Gives the standard deviations of Dubowitz score by drug exposure
> tapply(Ppregwt,DrugExp,mean) # Gives the means of pre-pregnancy weight by drug exposure
> tapply(Ppregwt,DrugExp,sd) # Gives the standard deviations of pre-pregnancy weight by drug exposure
> tapply(Birthwt,DrugExp,t.test) # Gives 95% confidence intervals for birth weight in the exposed and unexposed groups in one output
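
To see what tapply() returns, here is a minimal sketch that can be run on its own. The data frame dat and its values are invented for illustration; they are not from the data set described above. The columns are referenced with $ so that the example does not depend on a data set being attached.

```r
# Hypothetical toy data; the values are invented for illustration only
dat <- data.frame(
  DrugExp = c(0, 0, 0, 1, 1, 1),
  Dubow   = c(40, 42, 44, 30, 35, 40)
)

# Mean and standard deviation of Dubow within each level of DrugExp
grp_mean <- tapply(dat$Dubow, dat$DrugExp, mean)
grp_sd   <- tapply(dat$Dubow, dat$DrugExp, sd)

print(grp_mean)  # one mean per exposure group, labeled 0 and 1
print(grp_sd)    # one standard deviation per exposure group
```

The result has one entry per category of the grouping variable, labeled with the category value.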

An Alternate Method of Subset Analysis

Getting descriptive statistics by category can also be achieved as follows:

> mean(Birthwt[DrugExp==1]); mean(Birthwt[DrugExp==0]) # means for each exposure group
> sd(Birthwt[DrugExp==1]); sd(Birthwt[DrugExp==0]) # standard deviations for each exposure group
> t.test(Birthwt[DrugExp==1]) # 1-sample t-test to get 95% CI for those exposed to drugs
> t.test(Birthwt[DrugExp==0]) # 1-sample t-test to get 95% CI for those unexposed to drugs

The double equal sign (==) tests for equality, so Birthwt[DrugExp==1] selects only the birth weights of subjects for whom DrugExp equals 1.
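
The following sketch shows what the comparison produces and how it is used for subsetting. The vectors and their values are invented for illustration; only the variable names are taken from the examples above.

```r
# Hypothetical toy data; the values are invented for illustration only
Birthwt <- c(3200, 2400, 3500, 2800, 2600, 3100)
DrugExp <- c(0, 1, 0, 1, 1, 0)

exposed <- DrugExp == 1   # logical vector: TRUE where DrugExp equals 1
print(exposed)

# Indexing with the logical vector keeps only the TRUE positions
mean_exposed   <- mean(Birthwt[exposed])
mean_unexposed <- mean(Birthwt[DrugExp == 0])

# The same two means computed in a single call with tapply()
by_group <- tapply(Birthwt, DrugExp, mean)
print(by_group)
```

Both approaches give the same group means; tapply() is simply more compact when there are several categories.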

Creating a Dichotomous Variable from a Continuous Variable

Suppose my data set has a continuously distributed variable called "Birthwt", which is each child's weight in grams at birth, but I wish to create a new variable (lowBW) that categorizes children as having low birth weight, i.e., less than 2500 grams, or not. I can do this using the ifelse() function, which has the following format:

> ifelse(<logical statement>, <if true>, <if false>)


> lowBW <- ifelse(Birthwt<2500,1,0)

If Birthwt is less than 2500, then the new variable lowBW will have a value of 1, meaning "true"; if not, it will have a value of 0, meaning "false". When this command is executed, you should see the new variable show up in the global environment window at the upper right corner of RStudio. Note that if you add the new variable to your data set, you should reattach the data set so that the new variable will be recognized.

If you want the lowBW category to include those whose weight was exactly 2500 grams, then use <= (less than or equal to) as below.

> lowBW <- ifelse(Birthwt<=2500,1,0)
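
As a minimal sketch of the difference between < and <=, the example below applies both versions to a few invented birth weights, one of which is exactly 2500 grams. The names lowBW_strict and lowBW_inclusive are hypothetical and used only to keep the two results side by side. Because the new variable is categorical, it is then summarized with table() and prop.table() rather than with tapply().

```r
# Hypothetical birth weights in grams; the values are invented for illustration only
Birthwt <- c(3200, 2400, 2500, 2800, 1900, 3100)

lowBW_strict    <- ifelse(Birthwt <  2500, 1, 0)  # the 2500 g child is coded 0
lowBW_inclusive <- ifelse(Birthwt <= 2500, 1, 0)  # the 2500 g child is coded 1

counts <- table(lowBW_strict)   # number of children in each category
props  <- prop.table(counts)    # proportion of children in each category

print(counts)
print(props)
```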

Crude Measures of Association in a Cohort Study (or Intervention Study)

After generating the descriptive statistics for an epidemiologic study, the next step is to generate estimates for the magnitude of association between the primary exposure of interest (e.g., physical activity level in the Manson study) and the primary outcome of interest (e.g., development of cardiovascular disease). As noted above, there may be confounding factors that can distort the estimated measure of association, but one still begins by generating crude measures of association, i.e., estimates that have not yet been adjusted for confounding factors.
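
For person-time data, two common crude measures are the incidence rate ratio and the incidence rate difference. The sketch below shows the arithmetic under stated assumptions: the counts and person-years are invented and are not the data from Manson et al., and all variable names are hypothetical.

```r
# Hypothetical counts and person-time; NOT the data from Manson et al.
events_exposed   <- 30      # events in the exposed group
py_exposed       <- 10000   # person-years of follow up in the exposed group
events_unexposed <- 20      # events in the reference (unexposed) group
py_unexposed     <- 10000   # person-years of follow up in the reference group

# Incidence rate in each group: events divided by person-time
rate_exposed   <- events_exposed / py_exposed
rate_unexposed <- events_unexposed / py_unexposed

rate_ratio      <- rate_exposed / rate_unexposed   # relative measure
rate_difference <- rate_exposed - rate_unexposed   # absolute measure

print(rate_ratio)
print(rate_difference * 10000)  # expressed per 10,000 person-years
```

With these made-up numbers, the exposed group has 1.5 times the rate of the reference group, or 10 additional events per 10,000 person-years.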

Test Yourself

The table below shows data from the top portion of Figure 2 from the study by Manson et al.

Table – Relative Risk of Coronary Events According to Quintile Group for Total Physical Activity


Quintile Group Based on Physical Activity
Number of coronary events
Person-years of follow up

Using the data in the table above, a) compute the incidence rate ratio and the incidence rate difference for moderate activity compared to the least active subjects, and b) write an interpretation of your findings. Complete both parts before comparing your answers to those at the link below.