Analyzing and Presenting Results from Descriptive Studies

Introduction


Disease surveillance systems and health data sources provide the raw information necessary to monitor trends in health and disease. Descriptive epidemiology provides a way of organizing and analyzing these data in order to understand how disease frequency varies geographically and over time, and how disease (or health) varies among people based on a host of personal characteristics (person, place, and time). This makes it possible to identify trends in health and disease and also provides a means of planning resources for populations. In addition, descriptive epidemiology is important for generating hypotheses (possible explanations) about the determinants of health and disease. By generating hypotheses, descriptive epidemiology also provides the starting point for analytic epidemiology, which formally tests associations between potential determinants and health or disease outcomes. The key questions below outline the specific tasks of descriptive epidemiology that this module addresses.

 

Key Questions:

How can I summarize data?

How do I produce basic figures and tables?

How can I analyze the correlation between two continuous variables?

How can I apply this to the analysis and description of an ecologic study?

How can I use R to do descriptive analyses?

Learning Objectives


After successfully completing this unit, the student will be able to:

 

 

 

Basic Concepts


Types of Variables

Procedures to summarize data and to perform subsequent analysis differ depending on the type of data (or variables) that are available. As a result, it is important to have a clear understanding of how variables are classified.

There are three general classifications of variables:

1) Discrete Variables: variables that assume only a finite number of values, for example, race categorized as non-Hispanic white, Hispanic, black, Asian, other. Discrete variables are summarized by the frequency of observations and can be presented as the number, the percentage, or the proportion of observations within a given category.

Discrete variables may be further subdivided into:

• Nominal variables: categories with no inherent order, such as race or blood type.

• Ordinal variables: categories with an inherent order, such as symptom severity classified as mild, moderate, or severe.

• Dichotomous variables: a special case with only two categories, such as current smoker (yes/no) or type 2 diabetes (present/absent).

2) Continuous Variables: These are sometimes called quantitative or measurement variables; they can take on any value within a range of plausible values. Total serum cholesterol level, height, weight, and systolic blood pressure are examples of continuous variables. Continuous variables are summarized by finding a central measure, such as a mean or a median, as appropriate, and by characterizing the variability or spread around the central measure.

3) Time-to-Event Variables: these reflect the time to a particular event, such as a heart attack, cancer remission, or death. This module will focus primarily on summarizing and presenting discrete and continuous variables; time-to-event variables will be addressed in a later module.

This module will introduce basic concepts for analyzing and presenting data from exploratory (descriptive) studies that are essential for disease surveillance, for assessing the health and health-related behaviors in a population, or for generating hypotheses about the determinants of health or disease. However, students may want to refer to other learning modules that address these concepts in greater detail. These can be found using the following links:

Link to module - Basic Concepts for Biostatistics

Link to module - Summarizing Data

Link to module - Data Presentation

Population Parameters versus Sample Statistics

A descriptive measure for an entire population is a ''parameter.'' There are many population parameters, for example, the population size (N) is one parameter, and the mean diastolic blood pressure or the mean body weight of a population would be other parameters that relate to continuous variables. Other population parameters focus on discrete variables, such as the percentage of current smokers in the population or the percentage of people with type 2 diabetes mellitus. Health-related behaviors can also be thought of this way, such as the percentage of the population that gets vaccinated against the flu each year or the percentage who routinely wear a seatbelt when driving.

However, it is generally not feasible to directly measure parameters, since doing so requires collecting information from all members of the population. We, therefore, take samples from the population, and the descriptive measures for a sample are referred to as ''sample statistics'' or simply ''statistics.'' For example, the mean diastolic blood pressure, the mean body weight, and the percentage of smokers in a sample from the population would be sample statistics. In the image below the true mean diastolic blood pressure for the population of adults in Massachusetts is 78 millimeters of mercury (mm Hg); this is a population parameter. The image also shows the mean diastolic blood pressure in three separate samples. These means are sample statistics which we might use in order to estimate the parameter for the entire population. However, note that the sample statistics are all a little bit different, and none of them are exactly the same as the population parameter.

Figure: Map of Massachusetts with thousands of person icons overlaid. Three random samples are drawn from the population, and each sample has a slightly different mean value.
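The idea in the figure can also be illustrated with a short simulation. This is a Python sketch (the rest of the module uses R) with made-up, simulated blood pressure values, where the population mean is set to 78 mm Hg by assumption; it is not real Massachusetts data.

```python
import random

# Hypothetical illustration (not real data): simulate a population whose
# true mean diastolic blood pressure is 78 mm Hg, then draw three random
# samples. Each sample mean is a statistic that estimates the parameter,
# and each one comes out slightly different.
random.seed(42)
population = [random.gauss(78, 10) for _ in range(100_000)]

true_mean = sum(population) / len(population)
print(f"Population mean (parameter): {true_mean:.1f} mm Hg")

for i in range(3):
    sample = random.sample(population, 50)
    sample_mean = sum(sample) / len(sample)
    print(f"Sample {i + 1} mean (statistic): {sample_mean:.1f} mm Hg")
```

Running this shows three sample means that cluster around, but do not exactly equal, the population parameter.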

In order to illustrate some fundamentals, let's consider a very small sample with data shown in the table below.

Table - Data Values for a Small Sample

| Subject ID | Age | Length of Stay in Hospital (days) | Current Smoker | Body Mass Index | Type 2 Diabetes |
|------------|-----|-----------------------------------|----------------|-----------------|-----------------|
| 1          | 63  | 2                                 | 0              | 29.6            | 1               |
| 2          | 74  | 2                                 | 1              | 26.4            | 0               |
| 3          | 75  | 2                                 | 1              | 24.5            | 0               |
| 4          | 74  | 2                                 | 0              | 31.9            | 1               |
| 5          | 70  | 3                                 | 0              | 22.8            | 0               |
| 6          | 72  | 3                                 | 0              | 19.8            | 0               |
| 7          | 81  | 3                                 | 0              | 27.6            | 1               |
| 8          | 68  | 5                                 | 1              | 26.8            | 1               |
| 9          | 67  | 7                                 | 0              | 24.7            | 1               |
| 10         | 77  | 9                                 | 0              | 23.0            | 0               |

Note that the data table has continuous variables (age, length of stay in the hospital, body mass index) and discrete variables that are dichotomous (type 2 diabetes and current smoking). Let's focus first on the continuous variables which we will summarize by computing a central measure and an indication of how much spread there is around that central estimate.

Measures of Central Tendency and Variability

There are three sample statistics that describe the center of the data for a continuous variable: the mean, the median, and the mode.

The mean and the median will be most useful to us for analyzing and presenting the results of exploratory studies.

One way to summarize age for the small data set above would be to determine the frequency of subjects by age group, as shown in the table below.

| Age Group | Number of Subjects | Relative Frequency |
|-----------|--------------------|--------------------|
| 60-64     | 1                  | 0.1                |
| 65-69     | 2                  | 0.2                |
| 70-74     | 4                  | 0.4                |
| 75-79     | 2                  | 0.2                |
| 80-84     | 1                  | 0.1                |

This makes it easier to understand the age structure of the group. One could also summarize the age structure by creating a frequency histogram as shown in the figure below.

Figure: Frequency histogram of age groups showing that the greatest frequency is in the middle group (ages 70-74), with fewer subjects in the lower and higher age groups. The histogram is symmetrical.

If there are no extreme or outlying values of the variable (as in this case), the mean is the most appropriate summary of a typical value.
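The frequency table above is easy to reproduce programmatically. Here is a minimal Python sketch (the rest of the module uses R) that tallies the ten ages from the sample into the same age groups:

```python
# The ten ages from the small sample, tallied into the same age groups
# as the frequency table above.
ages = [63, 74, 75, 74, 70, 72, 81, 68, 67, 77]
bins = [(60, 64), (65, 69), (70, 74), (75, 79), (80, 84)]

for low, high in bins:
    count = sum(low <= a <= high for a in ages)
    print(f"{low}-{high}: {count} subjects, relative frequency {count / len(ages):.1f}")
```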

The Mean

The sample mean is computed by summing all of the values for a particular variable in the sample and dividing by the number of values in the sample. 

So, the general formula is

x̄ = Σxᵢ / n

The X with the bar over it (x̄) represents the sample mean, and it is read as "X bar." The Σ indicates summation (i.e., the sum of the x values, which is the sum of the ages in this example), and n is the sample size.

So, in the sample above the mean is

x̄ = (63 + 74 + 75 + 74 + 70 + 72 + 81 + 68 + 67 + 77) / 10 = 721 / 10 = 72.1 years

Sample Variance and Standard Deviation 

When the mean is appropriate to characterize the central values, the variability or spread of values around the mean can be characterized with the variance or the standard deviation. If all of the observed values in a sample are close to the sample mean, the standard deviation will be small (i.e., close to zero), and if the observed values vary widely around the sample mean, the standard deviation will be large.  If all of the values in the sample are identical, the sample standard deviation will be zero.

To compute the sample standard deviation we begin by computing the sample variance (s²) as follows:

s² = Σ(xᵢ - x̄)² / (n - 1)

The variance is essentially the mean of the squared deviations, although we divide by n - 1 rather than n in order to avoid underestimating the population variance. We can compute this manually by first computing the deviations from the mean, then squaring them, and finally adding the squared deviations, as shown in the table below.

Table - Computation of Variance for Age

| Subject ID | Age | Deviation from the Mean | Squared Deviation from the Mean |
|------------|-----|-------------------------|---------------------------------|
| 1          | 63  | -9.1                    | 82.81                           |
| 2          | 74  | 1.9                     | 3.61                            |
| 3          | 75  | 2.9                     | 8.41                            |
| 4          | 74  | 1.9                     | 3.61                            |
| 5          | 70  | -2.1                    | 4.41                            |
| 6          | 72  | -0.1                    | 0.01                            |
| 7          | 81  | 8.9                     | 79.21                           |
| 8          | 68  | -4.1                    | 16.81                           |
| 9          | 67  | -5.1                    | 26.01                           |
| 10         | 77  | 4.9                     | 24.01                           |
| Totals     | 721 | 0.0                     | 248.9                           |

Therefore,

s² = 248.9 / 9 = 27.66

However, the more common measure of variability in a sample is the sample standard deviation (s), defined as the square root of the sample variance:

s = √s²

In this example the standard deviation is:

s = √27.66 = 5.26 years

 Computing Mean, Variance, and Standard Deviation in R

These computations are easy using the R statistical package. First, I will create a data set with the ten observed ages in the example above using the concatenation function in R.

 

> agedata <- c(63, 74, 75, 74, 70, 72, 81, 68, 67, 77)

>

To calculate the mean:

> mean(agedata)

[1] 72.1

To calculate the variance:

> var(agedata)

[1] 27.65556

To calculate the standard deviation for age:

> sd(agedata)

[1] 5.258855
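For readers who want to verify the arithmetic outside of R, the same three statistics can be computed directly from the defining formulas. This Python sketch uses the same ten ages:

```python
import math

# Compute the mean, variance, and standard deviation of the ten ages
# directly from the defining formulas rather than with built-in functions.
agedata = [63, 74, 75, 74, 70, 72, 81, 68, 67, 77]
n = len(agedata)

mean = sum(agedata) / n                                     # x-bar = (sum of x) / n
variance = sum((x - mean) ** 2 for x in agedata) / (n - 1)  # s^2, dividing by n - 1
sd = math.sqrt(variance)                                    # s = square root of s^2

print(mean)      # 72.1
print(variance)  # 27.655... (matches var() in R)
print(sd)        # 5.258... (matches sd() in R)
```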

 

Next, we will examine length of stay in the hospital (days) which is also a continuous variable. As we did with age, we could summarize hospital length of stay by looking at the frequency, e.g., how many patients stayed 1, 2, 3, 4, etc. days.

| Days in Hospital | Number of Subjects | Relative Frequency |
|------------------|--------------------|--------------------|
| 1                | 0                  | 0                  |
| 2                | 4                  | 0.4                |
| 3                | 3                  | 0.3                |
| 4                | 0                  | 0                  |
| 5                | 1                  | 0.1                |
| 6                | 0                  | 0                  |
| 7                | 1                  | 0.1                |
| 8                | 0                  | 0                  |
| 9                | 1                  | 0.1                |

And once again, we could also present the same information with a frequency histogram, as shown below.

Frequency histogram of length of stay showing a skewed distribution with most patients staying 2 or 3 days. However, three patients stayed for 5, 7, and 9 days.

Here, most patients stayed in the hospital for only 2 or 3 days, but there were outliers who stayed 5, 7, and 9 days. This is a skewed distribution, and in this case the mean would be a misleading characterization of the central value. Rather than compute a mean, it would be more informative to compute the median value, i.e., the "middle" value, such that half of the observations are below this value, and half are above.

To compute the median one would first order the data: 2, 2, 2, 2, 3, 3, 3, 5, 7, 9. With an even number of observations (n = 10), the median is the average of the two middle values (the 5th and 6th), i.e., (3 + 3)/2 = 3 days.

However, R is a more convenient way to do this, because it will also enable you to see the interquartile range (IQR) which is a useful way of characterizing the variability or spread of the data.

Computing Median and Interquartile Range with R

We can again create a small data set for hospital length of stay using the concatenation function in R:

 

> hospLOS <- c(2,2,2,2,3,3,3,5,7,9)

and we can then compute the median.

> median(hospLOS)

[1] 3

However, it is more useful to use the "summary()" command.

> summary(hospLOS)

Min. 1st Qu. Median Mean 3rd Qu. Max.

2.0 2.0 3.0 3.8 4.5 9.0

>

The quartiles divide the data into 4 roughly equal groups as illustrated below.

An ordered array of the observed lengths of stay in hospital showing the minimum (2), quartile 1 (2), median (3), quartile 3 (4.5), and the maximum (9 days).

When a data set has outliers or extreme values, we summarize a typical value using the median as opposed to the mean.  When a data set has outliers, variability is often summarized by a statistic called the interquartile range, which is the difference between the first and third quartiles. The first quartile, denoted Q1, is the value in the data set that holds 25% of the values below it. The third quartile, denoted Q3, is the value in the data set that holds 25% of the values above it. 

To summarize:

• When there are no outliers, the sample mean and standard deviation best summarize location and variability.

• When there are outliers or the data are skewed, the median and interquartile range (IQR) best summarize location and variability, where IQR = Q3 - Q1.
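As a cross-check on the R output above, the median and quartiles can also be computed with the Python standard library. The "inclusive" quantile method interpolates the same way as R's default quantile type, so the quartiles should agree with the summary() results:

```python
import statistics

hospLOS = [2, 2, 2, 2, 3, 3, 3, 5, 7, 9]

med = statistics.median(hospLOS)
# method="inclusive" interpolates like R's default quantile type (type 7),
# so Q1 and Q3 match the summary() output shown earlier.
q1, _, q3 = statistics.quantiles(hospLOS, n=4, method="inclusive")
iqr = q3 - q1

print(med)     # 3.0
print(q1, q3)  # 2.0 4.5
print(iqr)     # 2.5
```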

Box-Whisker Plots

Box-whisker plots are very useful for comparing distributions. A box-whisker plot divides the observations into four roughly equal groups. The whiskers represent the minimum and maximum observed values. The left side of the box indicates Q1, below which lie the lowest 25% of observations, and the right side of the box indicates Q3, above which lie the highest 25% of observations. The median value is shown by a line within the box.

Figure: A box-whisker plot of the observations, with whiskers at the minimum and maximum, box edges at Q1 and Q3, and a line at the median.

Data Presentation

There are two fundamental methods for presenting summary information: tables and graphs.

For examples of how to create effective tables and graphs and how to avoid pitfalls in data presentation, please refer to the following two online learning modules:

Link to module - Summarizing Data

Link to module - Data Presentation

Case Series - Summary of Findings and Presentation


Nguyen Duc Hien, Nguyen Hong Ha, et al. Human infection with highly pathogenic avian influenza virus (H5N1) in Northern Vietnam, 2004–2005. Emerg Infect Dis. 2009 Jan;15(1):19–23.

Link to the complete article

This is a small, but important case series reported in 2009. Shown below are the abstract and slightly modified versions of the two tables presented in the report.

Abstract

"We performed a retrospective case-series study of patients with influenza A (H5N1) admitted to the National Institute of Infectious and Tropical Diseases in Hanoi, Vietnam, from January 2004 through July 2005 with symptoms of acute respiratory tract infection, a history of high-risk exposure or chest radiographic findings such as pneumonia, and positive findings for A/H5 viral RNA by reverse transcription–PCR. We investigated data from 29 patients (mean age 35.1 years) of whom 7 (24.1%) had died. Mortality rates were 20% (5/25) and 50% (2/4) among patients treated with or without oseltamivir (p = 0.24), respectively, and were 33.3% (5/15) and 14.2% (2/14) among patients treated with and without methylprednisolone (p = 0.39), respectively. After exact logistic regression analysis was adjusted for variation in severity, no significant effectiveness for survival was observed among patients treated with oseltamivir or methylprednisolone."

 

 

Note that both continuous and discrete variables are reported, and note that the authors used the mean and standard deviation for variables like age, but they used median and IQR for many other variables because their distributions were skewed. Note also that discrete variables and continuous variables can be presented in the same table, but it is essential to specify how each characteristic is being presented.

Table 1. Characteristics of 29 patients infected with highly pathogenic avian influenza virus (H5N1), northern Vietnam, 2004–2005*

| Characteristic | Value |
|----------------|-------|
| Age, y, mean ± SD | 35.1 ± 14.4 |
| M:F sex (%) | 15:14 (52:48) |
| High-risk exposure, no. (%)† | |
| — Poultry | 19 (65.5) |
| — Sick poultry | 12 (41.4) |
| — Family infected with H5N1 virus subtype | 6 (20.7) |
| — Sick poultry or person | 15 (51.7) |
| Hospitalization after disease onset, median, d (IQR) | 6 (4–8) |
| Hospital stay, median, d (IQR) | 14 (9–17) |
| Treated with oseltamivir, no. (%) | 25 (86.2) |
| Began treatment with oseltamivir after disease onset, median, d (IQR) | 7 (5–10) |
| Treated with methylprednisolone, no. (%) | 15 (51.7) |
| Died, no. (%) | 7 (24.1) |

Table Legend: *IQR, interquartile range;

†Poultry, a history of exposure to sick or healthy poultry; sick poultry or person, a history of exposure to sick poultry or a family infected with avian influenza (H5N1).

 

Table 2 below shows selected laboratory findings among survivors versus patients who died. Leukocytes are white blood cells, and neutrophils are a specific type of white blood cell; the lower numbers of these two counts in those who died suggests that the immune system was overwhelmed. Hemoglobin is a measure of red blood cells and oxygen carrying capacity. Platelets are essential elements for blood clotting. Albumin is the most abundant protein in blood. AST is an abbreviation for aspartate aminotransferase, an enzyme that is abundant in the liver; high levels of AST in the blood frequently indicate liver damage. Urea nitrogen is a measure of kidney function; high levels of urea nitrogen suggest compromised kidney function but could also be indicative of dehydration.

Table 2. Initial laboratory results for 29 patients infected with highly pathogenic avian influenza virus

| Characteristic | Survived, median (IQR) | Died, median (IQR) | p-value |
|----------------|------------------------|--------------------|---------|
| Leukocytes, x10³/μL | 7.8 (7.1–12.0) | 3.4 (1.7–5.6)† | 0.0093 |
| Neutrophils, x10³/μL | 6.8 (4.8–9.9) | 2.3 (1.1–3.8)† | 0.0101 |
| Hemoglobin, grams/L | 130 (107–137) | 121 (103–138) | 0.6102 |
| Platelets, x10³/μL | 214 (181–284) | 86 (38–139)† | 0.0101 |
| Albumin, grams/L | 34.5 (31.2–35.1) | 21.7 (10.4–29.4)† | 0.0265 |
| AST, U/L | 45 (28–69) | 327 (77–352) | 0.0077 |
| Total bilirubin, μmol/L | 10.3 (7.6–16.8) | 11.4 (7.0–27.1) | 0.7921 |
| Urea nitrogen, mmol/L | 4.5 (3.4–5.5) | 9.0 (3.4–14.3) | 0.0462 |

†p<0.05, by Wilcoxon test or Fisher exact test.

We will address p-values and statistical tests like the Wilcoxon test and the Fisher exact test in subsequent modules.

A Cross-Sectional Survey


In 2002 John Snow, Inc. (JSI) worked with the town of Weymouth, Massachusetts to identify unmet health needs in the town and devise a plan to prioritize unmet health needs and key risk factors that may be modified through lifestyle changes. The project conducted a mail survey of a random sample of 5,000 households as well as a survey of all 3,400 Weymouth public school students in grades nine through twelve. The information assisted the Town's decision-making about priorities for improving services and designing interventions that may prevent or reduce the incidence of ill health. Below you will find links to PDF versions of the full surveys and a link to a subset of the data and a key for identifying the variables and the coded responses.

Link to the Adult Survey Questionnaire

Link to a subset of the data from the Adult Survey

Link to description of the variable names and codes for the adult survey data

Link to the Student Survey Questionnaire

 

Open the link to the Adult Survey Questionnaire and scan through it to get an idea of how a carefully constructed survey tool looks. Note the efforts to make the questions explicit and clear.


 

 

Ecologic Studies


In ecologic studies the unit of observation for the exposure of interest is the average level of exposure in different populations or groups, and the outcome of interest is the overall frequency of disease for those populations or groups. In this regard, ecologic studies are different from all other epidemiologic studies, for which the unit of observation is exposure status and outcome status for individual people. As a result, ecologic studies need to be interpreted with caution. Nevertheless, they can be informative, and this module will focus on their analysis, interpretation, and presentation using correlation and simple linear regression.

Computing the Correlation Coefficient

The module on Descriptive Studies showed an ecologic study correlating per capita meat consumption and incidence of colon cancer in women from 22 countries. Investigators used commerce data to compute the overall consumption of meat by various nations. They then calculated the average (per capita) meat consumption per person by dividing total national meat consumption by the number of people in a given country. There is a clear linear trend; countries with the lowest meat consumption have the lowest rates of colon cancer, and the colon cancer rate among these countries progressively increases as meat consumption increases.

Figure: Graph of colon cancer incidence in women in 22 countries as a function of per capita meat consumption. Countries that eat more meat have greater colon cancer incidence.

Note that in reality, people's meat consumption probably varied widely within nations, and the exposure that was calculated was an average that assumes that everyone ate the average amount of meat. This average exposure was then correlated with the overall disease frequency in each country. The example here suggests that the frequency of colon cancer increases as meat consumption increases.

How can we analyze and present this type of information?

As noted in the module on Descriptive Studies, ecologic studies invite us to assess the association between the independent variable (in this case, per capita meat consumption) and the dependent variable (in this case, the outcome, incidence of colon cancer in women) by computing the correlation coefficient ("r"). This section will provide a brief outline of correlation analysis and demonstrate how to use the R statistical package to compute correlation coefficients. Correlation analysis and simple linear regression are described in a later module for this course.

Link to module on Correlation and Linear Regression.

The most commonly used type of correlation is Pearson correlation, named after Karl Pearson, who introduced this statistic around the turn of the 20th century. Pearson's r measures the linear relationship between two variables, say X and Y. A correlation of 1 indicates that the data points lie perfectly on a line for which Y increases as X increases. A value of -1 also implies the data points lie on a line; however, Y decreases as X increases. The formula for r is

r = Cov(x,y) / (sx · sy)

where Cov(x,y) is the covariance of x and y, defined as

Cov(x,y) = Σ(xᵢ - x̄)(yᵢ - ȳ) / (n - 1)

and sx² and sy² are the sample variances of x and y, defined as follows:

sx² = Σ(xᵢ - x̄)² / (n - 1)

and

sy² = Σ(yᵢ - ȳ)² / (n - 1)

The variances of x and y measure the variability of the x scores and y scores around their respective sample means, considered separately. The covariance measures the variability of the (x,y) pairs around the mean of x and the mean of y, considered simultaneously.

We can combine all of this into the following equation:

r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √[Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)²]

While this looks quite tedious, one can use RStudio to compute the correlation coefficient quite easily.
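To make the formula concrete, here is a short Python sketch (the analysis below uses R) that computes r from the covariance and the sample variances. The x and y values are hypothetical, made up purely to illustrate the computation; they are not the meat-consumption data.

```python
import math

# Hypothetical x (exposure) and y (disease frequency) values, invented
# solely to illustrate the formulas above.
x = [10, 25, 40, 55, 70]
y = [4.0, 8.5, 14.0, 19.5, 30.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Cov(x,y) = sum of (x - x-bar)(y - y-bar), divided by n - 1
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
var_x = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
var_y = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)

# r = Cov(x,y) / (sx * sy)
r = cov_xy / math.sqrt(var_x * var_y)
print(round(r, 4))  # 0.9847
```

The n - 1 divisors cancel in the ratio, which is why the combined equation above has no 1/(n - 1) terms.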

In the "Other Resources" listed to the left of this page there is a link to a data file called "Meat-CancerEcologic.csv" which has three columns: Country, Grams (per capita meat consumption), and Incidence (Incidence of colon cancer per 100,000 women).

If I import this data set into RStudio, I can compute the correlation coefficient and then plot the points using the following commands. First, I created a data frame called "meat" by reading the CSV file, and then I computed the correlation coefficient.

> meat <- read.csv("Meat-CancerEcologic.csv")

> attach(meat)

> cor(Grams, Incidence)

[1] 0.9005721

 

The correlation coefficient of 0.9005721 indicates a strong positive correlation between national per capita meat consumption and national incidence of colon cancer in women.

Next, I created a scatter plot of the data.

> plot(Grams, Incidence, col = "red", pch = 24)

Visual inspection of the plot suggests a linear relationship with a strong positive correlation, and the correlation coefficient r=0.90 confirms this.


Download the data set and try it yourself.

 

Brief Comments About Data Presentation


In order to be useful, data must be organized and analyzed in a thoughtful, structured way, and the results must be communicated in a clear, effective way to both the public health workforce and the community at large. Some simple standards are useful to promote clear presentation. Compiled data are commonly summarized in tables, graphs, or some combination.

Simple guidelines for tables.

  1. Provide a concise descriptive title.
  2. Label the rows and columns.
  3. Provide the units in the column headers.
  4. Provide the column total, if appropriate.
  5. If necessary, additional explanatory information may be provided in a footnoted legend immediately beneath the title.

Table - Treatment with Anti-hypertensive Medication in Men and Women

| Sex    | Number on Treatment / n | Relative Frequency, % |
|--------|-------------------------|-----------------------|
| Male   | 611/1,622               | 37.7                  |
| Female | 608/1,910               | 31.8                  |
| Total  | 1,219/3,532             | 34.5                  |

Simple guidelines for figures:

  1. Include a concise descriptive title.
  2. Label the axes clearly showing units where appropriate.
  3. Use appropriate scales for the vertical and horizontal axes that display the results without exaggerating them with ranges that are either too expansive or too restrictive.
  4. For line graphs with multiple groups include a simple legend if necessary.

Figure - Relative Frequency of Anti-hypertensive Medication Use in Men and Women

Bar graph of frequency of antihypertension medication in males and females

Additional resources for summarizing and presenting data:

  1. Online learning module on "Data Presentation." (Link to Data Presentation module)
  2. Online learning module on "Summarizing Data". (Link to the Summarizing Data module)
  3. The CDC also provides another good resource for advice about organizing data. (Link to CDC page on organizing data.)

 

Answer to Question on Page 3 Regarding Confidence Interval for the Body Mass Index

Question:

The Framingham Heart Study reported that in a sample of 3,326 subjects the mean body mass index was 28.15, and the standard deviation was 5.32. What was the 95% confidence interval for the population's mean body mass index?

Answer:

We can use the general formula for a confidence interval for a mean:

x̄ ± z (s/√n)

with z = 1.96 for 95% confidence.

So the 95% confidence interval is

28.15 ± 1.96 (5.32/√3326) = 28.15 ± 0.18, i.e., 27.97 to 28.33

Interpretation:

Our estimate of the mean BMI in the population is 28.15. With 95% confidence the true mean is likely to be between 27.97 and 28.33.
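The arithmetic can be checked with a few lines of Python:

```python
import math

# 95% CI for a mean: x-bar ± z * s / sqrt(n), with z = 1.96
mean, sd, n = 28.15, 5.32, 3326
z = 1.96

margin = z * sd / math.sqrt(n)
lower, upper = mean - margin, mean + margin

print(round(lower, 2), round(upper, 2))  # 27.97 28.33
```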

 

Answer to 95% Confidence Interval for the Case-Fatality Rate from Bird Flu - page 3

The point estimate is

p̂ = 7/29 = 0.241

There are 7 persons who died and 22 who did not (both counts are at least 5), so we can use the following formula:

p̂ ± 1.96 √[p̂(1 - p̂)/n]

Substituting:

0.241 ± 1.96 √[(0.241)(0.759)/29] = 0.241 ± 0.156

So, the 95% confidence interval is 0.086 to 0.397.

Interpretation:

Our best estimate of the case-fatality rate from bird flu is 24.1%. With 95% confidence the true case-fatality rate is likely to be between 8.6% and 39.7%.

Note that this 95% confidence interval is quite broad because of the small sample size (n=29).
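Carrying full precision through the computation can be checked in Python:

```python
import math

# 95% CI for a proportion: p-hat ± z * sqrt(p-hat * (1 - p-hat) / n)
deaths, n = 7, 29
p_hat = deaths / n
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)

lower, upper = p_hat - margin, p_hat + margin
print(round(p_hat, 3), round(lower, 3), round(upper, 3))  # 0.241 0.086 0.397
```

With unrounded intermediate values the upper bound works out to about 0.397; hand calculations that round p̂ along the way will differ slightly.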