1.4 Creating new variables in R


Many research studies involve some data management before the data are ready for statistical analysis. For example, in using R to manage grades for a course, 'total score' for homework may be calculated by summing scores over 5 homework assignments. Or, in a study examining the age of a group of patients, we may have recorded age in years but want to categorize age for analysis as under 30 years vs. 30 or more years. R can be used for these data management tasks.

1.4.1 Calculating new variables

New variables can be calculated and stored using the assignment operator '<-'. For example, to create a total score by summing 4 scores:

> totscore <- score1+score2+score3+score4

The operators *, /, and ^ can be used to multiply, divide, and raise to a power (var^2 will square a variable). As another example, weight in kilograms can be calculated from weight in pounds:

> weight.kg <- 0.4536*weight.lb
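
As a minimal sketch of the two calculations above, the following runs them on made-up data for three subjects. The score and weight values are illustrative assumptions, not real data; the point is that arithmetic in R works element by element, giving one result per subject.

```r
# Hypothetical homework scores for three students (illustrative values)
score1 <- c(8, 9, 7)
score2 <- c(10, 6, 9)
score3 <- c(7, 8, 8)
score4 <- c(9, 10, 6)

# Addition is applied element by element: one total per student
totscore <- score1 + score2 + score3 + score4
totscore
# 34 33 30

# Hypothetical weights in pounds, converted to kilograms
weight.lb <- c(150, 200, 120)
weight.kg <- 0.4536 * weight.lb
weight.kg
```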

1.4.2 Creating categorical variables

The 'ifelse( )' function can be used to create a two-category variable. The following example creates an age group variable that takes on the value 1 for those under 30, and the value 0 for those 30 or over, from an existing 'age' variable:

> ageLT30 <- ifelse(age < 30,1,0)

The arguments for the ifelse( ) function are 1) a conditional expression (here, is age less than 30), then 2) the value taken on if the expression is true, then 3) the value taken on if the expression is false. The expression 'age<=30' would indicate those less than or equal to 30. Logical expressions can be combined as AND or OR with the & and | symbols, respectively. For example, the expression '30 <= age & age <=39' would indicate those aged 30 to 39 (age greater than or equal to 30 and less than or equal to 39), and 'age<20 | age>70' would indicate those either under 20 or over 70.
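
The following sketch applies ifelse( ) with a single condition, an AND condition, and an OR condition to a small made-up 'age' vector. The ages and the variable names 'age30to39' and 'ageExtreme' are assumptions chosen for illustration.

```r
# Hypothetical ages (illustrative values)
age <- c(18, 25, 30, 39, 45, 72)

# 1 for those under 30, 0 for those 30 or over
ageLT30 <- ifelse(age < 30, 1, 0)
ageLT30
# 1 1 0 0 0 0

# AND: 1 for those aged 30 to 39 inclusive, 0 otherwise
age30to39 <- ifelse(30 <= age & age <= 39, 1, 0)
age30to39
# 0 0 1 1 0 0

# OR: 1 for those under 20 or over 70, 0 otherwise
ageExtreme <- ifelse(age < 20 | age > 70, 1, 0)
ageExtreme
# 1 0 0 0 0 1
```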

In logical expressions, two equal signs are needed for 'is equal to', for example:

> obese <- ifelse(BMIgroup==4,1,0)

The 'not equal to' sign in R is '!='.
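
A short sketch of both comparison operators, assuming a made-up 'BMIgroup' variable coded 1 to 4 (the values and the name 'notobese' are illustrative assumptions):

```r
# Hypothetical BMI group codes (illustrative values)
BMIgroup <- c(1, 4, 2, 4, 3)

# '==' tests equality; a single '=' is not a comparison operator
obese <- ifelse(BMIgroup == 4, 1, 0)
obese
# 0 1 0 1 0

# '!=' tests inequality
notobese <- ifelse(BMIgroup != 4, 1, 0)
notobese
# 1 0 1 0 1
```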

A series of commands can be used to create a categorical variable that takes on more than two categories. For example, to create an agecat variable that takes on the values 1, 2, 3, or 4 for those under 20, 20 to 39, 40 to 59, and 60 or over, respectively:

> agecat <- rep(99, length(age))

> agecat[age<20] <- 1

> agecat[20<=age & age<=39] <- 2

> agecat[40<=age & age<=59] <- 3

> agecat[60 <= age] <- 4

The first line creates an 'agecat' variable and assigns each subject a value of 99. The square brackets [ ] (further described in Section 7 below) are used to indicate that an operation is restricted to cases that meet the condition in the brackets. So the 'agecat[age<20] <- 1' statement will assign the value of 1 to the variable agecat only for those subjects with age less than 20 (overriding the 99's assigned in the first line of code). Together, the four bracketed commands assign the values 1 through 4 to the appropriate age groups.
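
The sketch below runs the four-step recoding on made-up ages that include the boundary values 20, 39, 40, 59, and 60. It initializes agecat with rep( ) so that there is one 99 per subject, and it counts any 99's left over as a check. The last command shows one possible alternative using the base R function cut( ); this alternative is not part of the approach described above, and the name 'agecat2' is an assumption for illustration.

```r
# Hypothetical ages (illustrative values), including boundary cases
age <- c(15, 20, 39, 40, 59, 60, 85)

# Start every subject at 99, then overwrite group by group
agecat <- rep(99, length(age))
agecat[age < 20] <- 1
agecat[20 <= age & age <= 39] <- 2
agecat[40 <= age & age <= 59] <- 3
agecat[60 <= age] <- 4
agecat
# 1 2 2 3 3 4 4

# Any 99 left over would flag an age that matched none of the conditions
sum(agecat == 99)
# 0

# Alternative sketch: cut() with right = FALSE uses intervals closed on
# the left, [0,20), [20,40), [40,60), [60,Inf), and labels = FALSE
# returns the integer codes 1 to 4 (assumes ages are not negative)
agecat2 <- cut(age, breaks = c(0, 20, 40, 60, Inf), right = FALSE, labels = FALSE)
agecat2
# 1 2 2 3 3 4 4
```

For whole-number ages the two versions agree. For fractional ages they can differ: an age of 39.5 would stay at 99 in the bracket version but fall in group 2 with cut( ).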