Creating New Variables from Variables in the Data Set

Using a Calculation to Create a New Continuously Distributed Variable

Suppose I have data from a survey conducted in 2002 in which one of the variables is birth_yr, i.e., the year in which the subject was born. I want to create a variable called "age" that shows how old each subject was at the time the survey was conducted. To do this, I can create a derived variable by including the following line in my script:

# Create a derived variable for the respondent's age in 2002 (when the data was collected) based on their reported birth year.
age=(2002-birth_yr)
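
As a minimal sketch of the same calculation, the lines below use a few made-up birth years (hypothetical values, not taken from the survey) to show that R applies the subtraction to every subject at once:

```r
# Hypothetical birth years for four respondents (not from the survey data)
birth_yr <- c(1950, 1968, 1975, 1984)

# Age at the time of the 2002 survey, computed for all respondents at once
age <- 2002 - birth_yr
print(age)
```

Because the arithmetic is vectorized, no loop over subjects is needed; the new variable has one value per subject.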

Suppose the data set also has a column for the subject's height in inches (hgt_inch) and weight in pounds (weight), but I want to compute each subject's body mass index (bmi). There is a formula for computing bmi from height in inches and weight in pounds, and I can compute bmi by including a line of code in my script.

# Compute body mass index (bmi)
bmi=weight/(hgt_inch)^2*703

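This divides weight in pounds by the square of height in inches and then multiplies the result by 703. R evaluates the exponent first and then works from left to right through the division and multiplication, so the line is equivalent to (weight/hgt_inch^2)*703.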
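
To check the order of operations, here is a small sketch with two made-up subjects (the heights and weights are hypothetical, chosen only for illustration):

```r
# Hypothetical subjects: weight in pounds and height in inches
weight   <- c(150, 200)
hgt_inch <- c(65, 68)

# Same expression as in the script line above
bmi <- weight/(hgt_inch)^2*703

# Fully parenthesized version, to make the order of operations explicit
bmi_explicit <- (weight / hgt_inch^2) * 703

# Round for display; the two versions agree
print(round(bmi, 1))
print(all.equal(bmi, bmi_explicit))
```

Note that weight/(hgt_inch^2*703) would be a different calculation, since it divides by 703 instead of multiplying by it.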

Collapsing a Continuous Variable into Categories

Body mass index is a continuously distributed variable, i.e., it can have an infinite number of values within a certain range. However, suppose I also wanted to categorize subjects as to whether they were obese (bmi of 30 or more) versus non-obese (bmi<30). I can use the ifelse() function in R to create a new variable called "obese" with a value of 1 if bmi is greater than or equal to 30 and a value of 0 if bmi is less than 30.

ifelse(conditional statement, result if true, result if false)

For each subject, the conditional statement is evaluated to determine whether it is true for that subject. If it is true, R assigns the first value (result if true) to the new variable; if it is false, R assigns the second value (result if false).

The first line of code below is a conditional statement that creates a new variable called "obese". It says to evaluate each subject's bmi, and, if it is less than 30, assign a value of 0 to obese (meaning that the individual is not obese). If that condition is not met, i.e. if the value of bmi is greater than or equal to 30, obese will have a value of 1, meaning that the individual is in the obese category.

obese=ifelse(bmi<30,0,1)

I can then include a line of code that asks R to give a tabulation of the variable obese.

table(obese)

The output in the Console indicates that there were 3272 subjects, and only 3 of them were in the obese category.

obese
   0    1
3269    3
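
The whole sequence can be tried on a handful of made-up bmi values (hypothetical numbers, not the survey data) to see how the cutoff at 30 is handled:

```r
# Hypothetical bmi values, including one exactly at the cutoff
bmi <- c(22.4, 30.0, 27.9, 35.1, 29.9)

# 0 = not obese (bmi < 30), 1 = obese (bmi of 30 or more)
obese <- ifelse(bmi < 30, 0, 1)
print(obese)

# Count how many subjects fall into each category
counts <- table(obese)
print(counts)
```

A subject with a bmi of exactly 30.0 fails the test bmi<30 and is therefore coded 1, which matches the definition of obese as a bmi of 30 or more.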

Collapsing Categorical Variables

One of the surveys conducted by the Youth Risk Behavior Surveillance System (YRBSS) had variables that reflected student responses to two questions about bullying and three questions about suicide.

Variable Name       Question and Coding
bully.sch           During the past 12 months, have you been bullied on school property? 1=yes, 0=no
bully.online        During the past 12 months, have you been electronically bullied (e-mail, chat rooms, instant messaging, Web sites, or texting)? 1=yes, 0=no
suicide.consider    During the past 12 months, did you ever seriously consider attempting suicide? 1=yes, 0=no
suicide.plan        During the past 12 months, did you make a plan about attempting suicide? 1=yes, 0=no
suicide.attempt     During the past 12 months, did you actually attempt suicide? 1=yes, 0=no

For part of my analysis I want to examine whether any form of bullying (in school or online) was associated with any indications of suicide risk (i.e., considering, planning or attempting suicide).

I can collapse the two questions about bullying into a single variable I will call "anyB", meaning any form of bullying, and I can collapse the three questions about suicide into a single variable I will call "anyS", i.e., whether there was any indication of suicide risk. I will use two conditional statements to do this.

anyB<-ifelse(bully.sch+bully.online>0,1,0)

This adds the variables bully.sch and bully.online. If both are positive responses, the sum is 2; if one or the other is positive, the sum is 1; if neither is positive, the sum is 0. The conditional statement evaluates the sum of the two variables, and if the sum is greater than 0, it gives anyB a value of 1, meaning that some form of bullying was reported. If neither form of bullying occurred, the sum will be 0, and R will give anyB a value of 0.
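
A short sketch with four made-up students (hypothetical responses covering every combination of the two questions) shows how the sum is collapsed. The variable anyB_alt is not part of the original script; it is an assumed alternative that uses a logical "or" and gives the same result when there are no missing responses:

```r
# Hypothetical responses: 1 = yes, 0 = no
bully.sch    <- c(0, 1, 0, 1)
bully.online <- c(0, 0, 1, 1)

# The sum is 0, 1, or 2; any sum above 0 means some bullying was reported
anyB <- ifelse(bully.sch + bully.online > 0, 1, 0)
print(anyB)

# Assumed alternative (not in the original script): a logical "or"
anyB_alt <- ifelse(bully.sch == 1 | bully.online == 1, 1, 0)
print(identical(anyB, anyB_alt))
```

Only the first student, who answered no to both questions, receives a value of 0.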

Similarly, I can create a new variable called "anyS", indicating whether the subject gave any responses indicating risk of suicide.

anyS<-ifelse(suicide.consider+suicide.plan+suicide.attempt>0,1,0)

This evaluates the sum of the three variables indicating some risk of suicide. If the sum is 0, anyS is assigned a value of 0, but if one or more of the three suicide indicators is true, then anyS is assigned a value of 1.

Having done this, I can then use the table() command to generate a contingency table for the association between anyB and anyS.

table(anyB,anyS)

    anyS
anyB   0   1
   0 649 108
   1 165  78
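
One way to read this table is to convert the counts to row proportions. The sketch below re-enters the four counts from the output above by hand (an assumption made so that the example runs without the survey data) and uses prop.table() to compute the proportion with any suicide risk within each bullying group:

```r
# Counts copied from the table(anyB, anyS) output above
# (matrix() fills column by column: the anyS=0 column first, then anyS=1)
tab <- matrix(c(649, 165, 108, 78), nrow = 2,
              dimnames = list(anyB = c("0", "1"), anyS = c("0", "1")))

# Row proportions: each row (bullying group) sums to 1
row_props <- prop.table(tab, margin = 1)
print(round(row_props, 3))
```

With the actual data, the same result would come from prop.table(table(anyB,anyS), margin=1). The proportions describe the association in this sample only; they are not a test of statistical significance.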