Accounting for Missing Data
- Missing numeric data are denoted in SAS by a period (.)
- Missing character data are denoted in SAS by an empty space(' ')
Suppose we have collected the following data on the ages of 5 people:
id age
1 35
2 missing
3 24
4 38
5 29
Individual 2 has a missing age value, so the data would be entered as follows:
data ages;
input id age;
cards;
1 35
2 .
3 24
4 38
5 29
;
run;
If we run proc means, we would get the following:
SAS will automatically exclude missing values from calculations, if they are coded correctly. Notice that N=4 for the age variable, because there are only 4 observations with non-missing ages.
However, some data sources will code missing values as 9 or -9 or 99, or some other numeric value. If this is the case, you should immediately re-code these to periods. If you don't recode the missing values, here is what will happen:
Example:
data ages;
input id age;
cards;
1 35
2 -9
3 24
4 38
5 29
;
run;
If we run proc means, we would get the following:
Notice that the minimum age is negative! And the mean age is less than the youngest age in the data set.
Important: Always check output for results that make no sense. |
If a data set has missing values coded as anything other than a period, you need to convert these before running the SAS program. This can be done easily by adding an if statement to the data step as illustrated in the example below.
data ages;
input id age;
if age eq -9 then age=.;
cards;
1 35
2 -9
3 24
4 38
5 29
;
run;
If we run proc means, we would get the following:
The observation with missing age coded as -9, now has age coded correctly with a period. Notice that the minimum age is now (correctly) 24, and that the mean is actually in the range of the ages! Also notice that N=4 for the age variable, because there are only 4 observations with non-missing ages.
Creating a Derived Variable
Now we will create a variable called agecat which takes on the value of 1 if the age is less than or equal to 30 and 2 if the age is greater than 30.
We have checked (how?) and the missing age has been correctly coded.
data ages;
input id age;
if age<=30 then agecat = 1;
else if age>30 then agecat=2;
cards;
1 35
2 .
3 24
4 38
5 29
;
run;
What is wrong with this?
The 2nd observation had a missing age value but was categorized as "young". This person should have a missing age category!
The problem here is that SAS treats missing numeric values as negative infinity. Here, SAS treats the missing age value as negative infinity, which is definitely less than 30, so this observation will be assigned agecat=1.
To fix this problem we need to recode the agecat variable to specifically account for missing values:
data ages;
input id age;
if age = . then agecat = .;
else if age<=30 then agecat = 1;
else if age>30 then agecat=2;
cards;
1 35
2 .
3 24
4 38
5 29
;
run;
Now the observation with missing age also has a missing agecat variable.
Variable |
N |
Mean |
Std Dev |
Minimum |
Maximum |
id |
5 |
3.0000000 |
1.5811388 |
1.0000000 |
5.0000000 |
age |
4 |
32.5000000 |
6.2449980 |
24.0000000 |
38.0000000 |
agecat |
4 |
1.5000000 |
0.5773503 |
1.0000000 |
2.0000000 |
If your data had missing age coded as -9, you would first have to re-code missing age to a period, and then account for missing ages in creating agecat.
data ages;
input id age;
if age eq -9 then age=.;
if age = . then agecat = .;
else if age<=30 then agecat = 1;
else if age>30 then agecat=2;
cards;
1 35
2 -9
3 24
4 38
5 29
run;