Comparing Frequencies

Introduction


One of the key steps in establishing the determinants of health and disease is to accurately estimate the degree to which specific exposures are associated with specific health outcomes.

Previous modules have provided concepts and tools for comparing groups with respect to continuously distributed variables (measurements). This module will similarly provide concepts and tools for comparing groups with respect to dichotomous, categorical, and ordinal outcomes for which we are comparing frequencies rather than mean values of a measurement, and we will focus on evaluating sampling error when making these types of comparisons.

We will first explore the utility of two very versatile and commonly used methods: the chi-squared goodness-of-fit test and the chi-squared test of independence.

We will then turn our attention to computing and interpreting confidence intervals for commonly used measures of association.

 

Key Questions:

How do we measure the association between categorical outcomes and exposures?

How do we determine whether an exposure increases the risk of a particular health outcome?

What information can we assess from categorical data? What cannot be assessed?

When is it most appropriate to assess data categorically?

Learning Objectives


After successfully completing this module, the student will be able to:

 

Chi-Squared Tests


Chi-squared tests are used to test for differences between observed frequencies and the frequencies that would be expected under the null hypothesis. The general form of the test statistic is:

χ² = Σ (O - E)²/E

where O = the number of observed events, and E = the number of expected events under the null hypothesis. The probability of observing these differences under the null hypothesis can then be estimated using the chi-squared distribution. The chi-squared distribution can be thought of as a family of distributions that vary based on the degrees of freedom (df). Two examples are shown below for four degrees of freedom and ten degrees of freedom.

Critical values of chi-squared for different levels of significance (α levels) can be obtained from tables of the chi-squared distribution, such as the one shown below. Note that the first column on the left indicates the degrees of freedom, and the next five columns list the corresponding critical values of χ² for the α-levels listed in the row at the top. Therefore, for a two-by-two contingency table, which has one degree of freedom, the critical value is 3.84 at an alpha level of 0.05. One can also estimate p-values from the table by interpolating based on χ² and the degrees of freedom. Alternatively, p-values can be obtained using R or the "CHITEST" function in Excel.
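Rather than interpolating from a printed table, these critical values can be reproduced computationally. The module's own examples use R; as a quick cross-check, here is a sketch in Python using scipy (an assumption on my part, since the course materials rely on R):

```python
from scipy.stats import chi2

# Critical value = the chi-squared quantile with alpha in the upper tail
crit_1df = chi2.ppf(1 - 0.05, df=1)  # two-by-two table (1 df)
crit_2df = chi2.ppf(1 - 0.05, df=2)  # e.g., a 3x2 table (2 df)

print(round(crit_1df, 2), round(crit_2df, 2))  # 3.84 5.99
```

In R, the equivalent call is qchisq(0.95, df).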

Chi-Squared Goodness of Fit Test


The chi-square goodness of fit test assesses whether the observed frequencies fit a specified distribution. 

Example:

A university conducted a survey of its recent graduates to collect demographic and health information for future planning purposes, as well as to assess students' satisfaction with their undergraduate experiences. The survey revealed that a substantial proportion of students were not engaging in regular exercise, many felt their nutrition was poor, and a substantial number were smoking. In response to a question on regular exercise, 60% of all graduates reported getting no regular exercise, 25% reported exercising sporadically, and 15% reported exercising regularly as undergraduates. The next year the university launched a health promotion campaign on campus in an attempt to improve health behaviors among undergraduates. The program included modules on exercise, nutrition, and smoking cessation. To evaluate the impact of the program, the university again surveyed graduates and asked the same questions. The survey was completed by 470 graduates, and the following data were collected on the exercise question:

 

 

             No Regular Exercise   Sporadic Exercise   Regular Exercise   Total
Observed #           255                  125                 90           470

Based on the data, is there evidence of a shift in the distribution of responses to the exercise question following the implementation of the health promotion campaign on campus?

In this example, we have one sample and a discrete (ordinal) outcome variable (with three response options). We specifically want to compare the distribution of responses in the sample to the distribution reported the previous year (i.e., 60%, 25%, 15% reporting no, sporadic and regular exercise, respectively). We now run the test using the five-step approach.  

First, we set up the hypotheses and determine level of significance. The null hypothesis again represents the "no change" or "no difference" situation. If the health promotion campaign has no impact then we expect the distribution of responses to the exercise question to be the same as that measured prior to the implementation of the program.

H0: p1=0.60, p2=0.25, p3=0.15,  or equivalently H0: The distribution of responses is 0.60, 0.25, 0.15  

H1:   H0 is false.          α =0.05

Notice that the research hypothesis is written in words rather than in symbols. The research hypothesis as stated captures any difference in the distribution of responses from that specified in the null hypothesis. We do not specify a specific alternative distribution, instead we are testing whether the sample data "fit" the distribution in H0 or not. With the χ2 goodness-of-fit test there is no upper or lower tailed version of the test.

 Based on the expected distribution, we can calculate the number of students we would have expected to see in each exercise category.

 

 

             No Exercise   Sporadic Exercise   Regular Exercise   Total
# Observed       255              125                 90           470
# Expected       282              117.5               70.5         470

Then, for each category, we compute (O-E)²/E:

             No Exercise   Sporadic Exercise   Regular Exercise   Total
# Observed       255              125                 90           470
# Expected       282              117.5               70.5         470
(O-E)²/E         2.59             0.48                5.39         8.46

 

Since there are three categories, the degrees of freedom = 2. (df=k-1=3-1=2).

From the chi-squared table above we find that the critical value is 5.99 for 2 degrees of freedom at the α=0.05 level. Since 8.46 is greater than the critical value, we reject the null hypothesis, and conclude that the distribution of exercise has changed; it is no longer 60%, 25%, 15%. 

Goodness of Fit Chi-Squared Test with R

The analysis above could also be performed using R by providing the observed number of responses and the expected frequencies as follows:

> obs <-c(255,125,90)

> null_p<-c(0.60,0.25,0.15)

> chisq.test(obs,p=null_p)

Chi-squared test for given probabilities

data: obs

X-squared = 8.4574, df = 2, p-value = 0.01457
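The same goodness-of-fit test can be reproduced outside of R. As a cross-check on the arithmetic, a short Python sketch using scipy.stats.chisquare (assumed available here) gives the same statistic and p-value:

```python
from scipy.stats import chisquare

obs = [255, 125, 90]                  # observed counts
null_p = [0.60, 0.25, 0.15]           # distribution under H0
expected = [470 * p for p in null_p]  # 282, 117.5, 70.5

stat, pval = chisquare(obs, f_exp=expected)
print(round(stat, 4), round(pval, 5))  # 8.4574 0.01457
```

The statistic and p-value match R's chisq.test output shown above.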

The Chi-Squared Test of Independence


The chi-squared test of independence also uses the chi-squared statistic and chi-squared distribution, but it is used to test whether frequencies differ among two or more groups. The outcome is categorical (2 or more levels) or ordinal. Therefore, there can be multiple rows and columns in the contingency table, and the degrees of freedom are

df = (r - 1) x (c - 1)

where r = the number of rows in the contingency table, and c = the number of columns.

For example, in the following contingency table, df=(r-1)*(c-1)= (3-1)*(3-1)=4:

 

                  Good   Fair   Poor
High Exposure
Medium Exposure
Low Exposure

There are 3 exposure categories and 3 outcome categories, so df= (3-1) * (3-1) = 2*2 = 4

The research question can be phrased either as a question of association (is the outcome associated with the exposure?) or as a question of independence (is the distribution of the outcome independent of the exposure group?). Therefore, the null hypothesis is that the outcome and the exposure are independent, i.e., that the frequency of the outcome is the same across exposure groups, and the alternative hypothesis is that they are not independent.

Example 1:

Investigators wanted to study factors related to whether an HIV-positive individual would disclose the fact that they were HIV+ to their sexual partners.

[Stein MD, Freedberg KA, Sullivan LM, Savetsky J, Levenson SM, Hingson R, Samet JH. Sexual ethics. Disclosure of HIV-positive status to partners. Arch Intern Med. 1998 Feb 9;158(3):253-7.]

The abstract stated:

"We interviewed 203 consecutive patients presenting for primary care for HIV at 2 urban hospitals. One hundred twenty-seven reported having sexual partners during the previous 6 months. The primary outcome of interest was whether patients had told all the sexual partners they had been with over the past 6 months that they were HIV positive.

The investigators sought to determine whether the frequency of disclosure varied depending on the mode of HIV transmission risk; their findings are shown in the table below.

Table 1: Observed Data

HIV Transmission Risk   Disclosed    Not Disclosed   Total
Injection Drug Use      35 (67%)          17           52
Homosexual contact      13 (52%)          12           25
Heterosexual contact    29 (58%)          21           50
Total                   77 (60.6%)        50          127

 

Note that a total of 77 individuals out of 127 reported disclosure, and the other 50 did not. Therefore, the overall frequency of disclosure was 77/127 = 60.6%. If there were no differences among the three groups, one would expect the frequency of disclosure to be 60.6% in each of the three groups. We can then calculate the expected number of disclosures in each of the three risk categories by multiplying the number of subjects in each category by 0.606. For example, there were 52 injection drug users, so the expected number of disclosures would be 52 x 0.606 = 31.5. And we can compute the expected number of non-disclosures in this category by simply subtracting 31.5 from 52, so the expected number of non-disclosures for injection drug use is 52 - 31.5 = 20.5. If we repeat this procedure for the other two risk categories, we can create the table of frequencies that would be expected if the null hypothesis were true, as shown in Table 2 below.

Table 2: Expected Under the Null Hypothesis

HIV Transmission Risk   Disclosed      Not Disclosed    Total
Injection Drug Use      31.5 (60.6%)   52-31.5 = 20.5     52
Homosexual contact      15.2 (60.6%)   25-15.2 = 9.8      25
Heterosexual contact    30.3 (60.6%)   50-30.3 = 19.7     50
Total                   77 (60.6%)     50                127

 

Now we can compute the chi-squared statistic using the formula

χ² = Σ (O - E)²/E

Next, we need to compute the degrees of freedom, which is

df = (r - 1) x (c - 1)

where r = the number of category rows and c = the number of category columns. In this case, df = (3 - 1) x (2 - 1) = 2.

We can see from the chi-squared table that the critical value of χ² with 2 degrees of freedom and α=0.05 is 5.99, but our computed χ² is only 1.95, so we fail to reject the null hypothesis and conclude that there is insufficient evidence that the frequency of disclosure varies among these three risk categories.

However, we can get a better idea of the actual p-value by using the 1-pchisq() command in R and providing the chi-squared statistic and the degrees of freedom in parentheses.

> 1-pchisq(1.95,2)

[1] 0.3771924

Therefore, the p-value is 0.38.

Note that we use 1-pchisq because we want the probability given by the upper tail of the chi-squared distribution.
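As a cross-check on the hand calculation, Python's scipy (assumed here; the module itself works in R) can run the whole test of independence in one call, returning the statistic, p-value, degrees of freedom, and the expected counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed disclosure data: rows = risk category, cols = disclosed / not disclosed
observed = np.array([[35, 17],
                     [13, 12],
                     [29, 21]])

stat, pval, df, expected = chi2_contingency(observed, correction=False)
print(round(stat, 4), df, round(pval, 4))  # 1.8963 2 0.3875
```

The full-precision statistic is 1.8963 (the hand calculation rounds the expected counts, giving 1.95), so p ≈ 0.38 either way.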

 

Using R for the Chi-squared Test of Independence


The calculations for the chi-squared test can be tedious, but R can do these quite easily. Here are two useful techniques.

For Tabulated Frequencies

Suppose you have data that are already summarized in a contingency table, as in the table below. (Note: these are the same data that were used in the previous example.)

 

                      Disclosed   Non-Disclosed
Injection drug use       35            17
Homosexual contact       13            12
Heterosexual contact     29            21

One can create a "matrix" in R and perform a chi-squared test as follows:

> datatable<-matrix(c(35,13,29,17,12,21),nrow=3,ncol=2)

> datatable

[,1] [,2]

[1,] 35 17

[2,] 13 12

[3,] 29 21

> chisq.test(datatable,correct=FALSE)
# Note: the correct=FALSE option indicates that we do not need a correction for small sample size since all of the expected frequencies are greater than 5.

Pearson's Chi-squared test

data: datatable

X-squared = 1.8963, df = 2, p-value = 0.3875

 

For Raw Data from a CSV File

The dataset FramCHDStudy.CSV has a variable "hypert" for hypertension (high blood pressure), which is coded 0 if absent and 1 if present. The development of coronary heart disease ("chd") is coded 0 if it did not occur and 1 if it occurred. Suppose we want to conduct a chi-squared test to assess whether individuals with hypertension have a greater risk of developing coronary heart disease. We could use the following code:

> FramCHDStudy <- read.csv("C:/Users/wlamorte/Desktop/Quant Core/Data sets/FramCHDStudy.csv")

> View(FramCHDStudy)

> fram2<-FramCHDStudy

> attach(fram2)

> table(hypert,chd)

      chd
hypert   0   1
     0 748 140
     1 380 142

> prop.table(table(hypert,chd),1)

# Note: the 1 at the end of the preceding command asks R to compute proportions across each row. 

# For example, among those who were not hypertensive, 84.23% did not develop CHD and 15.77% did.

      chd
hypert         0         1
     0 0.8423423 0.1576577
     1 0.7279693 0.2720307

> chisq.test(table(hypert,chd),correct=FALSE)

Pearson's Chi-squared test

data: table(hypert, chd)

X-squared = 26.8777, df = 1, p-value = 2.168e-07

The data table has 2 rows and 2 columns, so df=(2-1) x (2-1) = 1.

The resulting p-value is 2.168 x 10^-7, so there is strong evidence that the risk of CHD was greater in subjects with hypertension.

Application of the Chi-Squared Test of Independence


The chi-squared test of independence can be used to analyze data from cross-sectional surveys, retrospective and prospective cohort studies, randomized clinical trials, and case-control studies.

A Cross-sectional Survey

Consider the results of a cross-sectional survey in which assistant professors at colleges were asked to indicate their sex (the exposure of interest) and whether their starting salary was greater or less than $60,000 per year.

 

         < $60,000   > $60,000
Male        122          75
Female       64          50

The prevalence of a salary less than $60,000 per year was 122/(122+75) = 0.619 = 61.9% in males, compared to a prevalence of 64/(64+50) = 0.561 = 56.1% in females. The prevalence ratio for lower salary in males compared to females was therefore 0.619/0.561 = 1.10.

We can compare the frequencies with the following code:

> salarytable<-matrix(c(122,64,75,50),nrow=2,ncol=2)

> salarytable

[,1] [,2]

[1,] 122 75

[2,] 64 50

> chisq.test(salarytable,correct=FALSE)

Pearson's Chi-squared test

data: salarytable

X-squared = 1.0066, df = 1, p-value = 0.3157

Since the p-value is 0.3157, there is not sufficient evidence to conclude that the frequency of a starting salary above $60,000 per year differs between males and females.

A Prospective Cohort Study

Antonia Trichopoulou, M.D., et al: Adherence to a Mediterranean Diet and Survival in a Greek Population. N Engl J Med 2003;348:2599-608.

From 1994 to 1999 a study was conducted to identify nutritional and lifestyle behaviors associated with survival in Greek adults. A total of 28,572 participants, 20 to 86 years old, were recruited from all regions of Greece. One goal was to study the extent to which close adherence to a traditional Mediterranean (Greek) diet was associated with survival, but the investigators also examined a number of other potential risk factors. After enrollment (i.e., at the baseline or beginning of the study), subjects completed extensive questionnaires administered in person by specially trained interviewers. The dietary questionnaire documented food intake during the past year using a semi-quantitative food-frequency questionnaire that included 150 foods and beverages commonly consumed in Greece. Adherence to the traditional Mediterranean diet was assessed by a 10-point Mediterranean-diet scale. Some of the results in men are shown in the table below.

Adherence to Greek Diet   Died During Study   Not Dead   Total
Low                              74             2383      2457
Medium                           61             3747      3808
High                             44             2586      2630

> diettable<-matrix(c(74,61,44,2383,3747,2586),nrow=3,ncol=2)

> diettable

[,1] [,2]

[1,] 74 2383

[2,] 61 3747

[3,] 44 2586

> chisq.test(diettable,correct=FALSE)

Pearson's Chi-squared test

data: diettable

X-squared = 17.2361, df = 2, p-value = 0.0001808

Therefore, we would reject the null hypothesis and conclude that the frequency of death among Greek males does differ significantly among the three categories of dietary adherence.

We could also compute risk ratios by using the men with high adherence as a reference group and comparing the other two categories to them. For example, the cumulative incidence in men with low adherence was 74/2457 = 0.0301 = 30.1 deaths per 1,000 men over the five years of observation. The cumulative incidence in men with high adherence was 44/2630 = 0.0167 = 16.7 deaths per 1,000 men over the five years of observation. Therefore, the risk ratio for men with low adherence compared to those with high adherence was 0.0301/0.0167 = 1.8.

One might interpret these results as follows: Men with low adherence to a traditional Greek diet had 1.8 times the risk of dying during a five year period of observation, and this difference was statistically significant (p=0.0002).
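The risk-ratio arithmetic above is simple enough to script. A minimal Python sketch (the course's own examples use R):

```python
# Cumulative incidence of death over follow-up, by adherence (men)
ci_low  = 74 / 2457   # low adherence  -> ~30.1 deaths per 1,000
ci_high = 44 / 2630   # high adherence -> ~16.7 deaths per 1,000

rr = ci_low / ci_high
print(round(rr, 1))  # 1.8
```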

A Randomized Clinical Trial

In 1982 the Physicians' Health Study enrolled over 22,000 male physicians in the US between the ages of 40 and 84 in order to study whether low-dose aspirin (one tablet every other day) was protective against myocardial infarctions (heart attacks). The subjects were randomly assigned to take low-dose aspirin or a placebo, and they were followed for about five years. One of the endpoints of interest was whether aspirin reduced the incidence of fatal myocardial infarctions. Their findings are summarized in the table below.

 

          Fatal MI   No Fatal MI    Total
Aspirin      10        11,027      11,037
Placebo      26        11,008      11,034

 

> MItable<-matrix(c(10,26,11027,11008),nrow=2,ncol=2)

> MItable

[,1] [,2]

[1,] 10 11027

[2,] 26 11008

> chisq.test(MItable, correct=FALSE)

Pearson's Chi-squared test

data: MItable

X-squared = 7.1271, df = 1, p-value = 0.007593

The cumulative incidence in the men treated with aspirin was 10/11037 = 0.00090604 = 9 per 10,000 over 5 years.

The cumulative incidence in the men receiving the placebo was 26/11034 = 0.002356 = 23 per 10,000 over 5 years.

Therefore, the risk ratio was 0.00091/0.00236 = 0.39.

Interpretation: Male physicians who took an aspirin every other day had 0.39 times the risk (or a 61% reduction in risk) compared to male physicians treated with placebo (p=0.008).
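A sketch of the same arithmetic in Python (illustrative; the module's code is in R). Full precision gives 0.38, and rounding the incidences first gives the 0.39 quoted above:

```python
# Cumulative incidence of fatal MI in each arm over ~5 years
ci_aspirin = 10 / 11037   # ~0.00091, i.e., about 9 per 10,000
ci_placebo = 26 / 11034   # ~0.00236, i.e., about 23.6 per 10,000

rr = ci_aspirin / ci_placebo
print(round(rr, 2))  # 0.38
```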

A Case-Control Study

D'Souza et al. conducted a study on the association between human papillomavirus and oropharyngeal cancer (N Engl J Med 2007;356:1944-56). They identified 100 patients with newly diagnosed squamous-cell carcinomas of the head and neck in Baltimore from 2000 through 2005. The comparison group consisted of 200 patients without a history of cancer who were seen for benign conditions between 2000 and 2005 in the same clinic. All patients completed a computer-assisted self-administered interview that recorded information about demographic characteristics, past oral hygiene, medical history, family history of cancer, lifetime sexual behaviors, and lifetime history of marijuana, tobacco, and alcohol use. Part of their results focused on the association between oral hygiene and oropharyngeal cancer, as shown in this table.

 

Tooth Loss   Patients with Oropharyngeal Cancer (N=100)   Control Patients (N=200)
None                            62                                  163
Some                            16                                   20
Complete                        22                                   17

 

> CCtable<-matrix(c(62,16,22,163,20,17),nrow=3,ncol=2)

> CCtable

[,1] [,2]

[1,] 62 163

[2,] 16 20

[3,] 22 17

> chisq.test(CCtable,correct=FALSE)

Pearson's Chi-squared test

data: CCtable

X-squared = 14.7262, df = 2, p-value = 0.0006342

Since this is a case-control study, we cannot calculate incidence, and we cannot calculate a risk ratio per se. However, we can compute odds ratios, for example, using the subjects with no tooth loss as the reference group and comparing each of the other two exposure groups to them. For example, comparing the group with complete tooth loss to those with no tooth loss, the odds ratio is (22/17)/(62/163) = 3.4.

Interpretation: Those who had complete tooth loss had 3.4 times the odds of having oropharyngeal cancer compared to those with no tooth loss.
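The odds-ratio calculation can be sketched as follows (Python used for illustration; the module's examples are in R):

```python
# Case-control data: odds of being a case within each tooth-loss group
odds_complete = 22 / 17    # complete tooth loss (cases / controls)
odds_none     = 62 / 163   # no tooth loss (reference group)

odds_ratio = odds_complete / odds_none   # same as (22 * 163) / (17 * 62)
print(round(odds_ratio, 1))  # 3.4
```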

Confidence Interval for the Risk Ratio


When comparing frequencies, the chi-squared test is a useful way of assessing the strength of evidence for rejecting the null hypothesis, and the magnitude of association can be summarized by computing a risk ratio, odds ratio, risk difference, etc., depending on the study design and what we are interested in evaluating. However, these measures of association are also estimates of population parameters, and our interpretation of the findings is greatly enhanced by computing a confidence interval for these estimates as well.

The samples obtained for a cohort study or a randomized clinical trial provide us with estimates of the cumulative incidence in comparison groups, and the differences can be summarized by computing a risk ratio as an estimate of the magnitude of association. The confidence interval for a risk ratio provides us with a range of plausible values.

An earlier module noted that the general form of a 95% confidence interval is:

Estimate ± 1.96 x SE(estimate)

where SE(estimate) is the standard error of the estimate.

One of the assumptions in building a confidence interval is that the possible values are normally distributed. However, risk ratios and odds ratios are not normally distributed; the distribution of possible values is skewed toward higher values. One can get around this problem by taking the natural logarithm (log) of the risk ratio, because log(RR) will be normally distributed.

The distribution of risk ratio values is skewed to the right.

If one takes the natural log of the risk ratio, the distribution will be normally distributed.

So, to compute a confidence interval for the risk ratio, we have to work on the log-scale and then take the antilogarithm of the lower and upper confidence limits to compute the confidence limits on the risk ratio scale.

The formula for the 95% confidence interval for the risk ratio is as follows:

exp[ ln(RR) ± 1.96 x SE(ln(RR)) ],  where SE(ln(RR)) = sqrt(1/a - 1/n1 + 1/c - 1/n2)

and a and c are the numbers of events in the exposed and unexposed groups of sizes n1 and n2.

The steps are:

  1. Convert from RR to ln(RR)
  2. Find CI for ln(RR)
  3. Convert CI from ln(RR) to RR

Consider the following results from a cohort study in which the investigators compared the risk of developing cardiovascular disease (CVD) in subjects with hypertension (HTN) compared to subjects without hypertension.

Table - Association Between Hypertension (HTN) and Cardiovascular Disease (CVD)

         CVD           No CVD   Total
HTN      992 (0.305)    2260     3252
No HTN   165 (0.140)    1017     1182
Total    1157           3277     4434

RR=.305/.140 = 2.18      

Step 1: Take the natural log of the RR (using R):

> log(2.18)

[1] 0.7793249

Step 2: We compute the 95% confidence interval for ln(RR), using Z = 1.96:

0.779 ± 1.96 x SE(ln(RR)) = 0.779 ± 1.96 x 0.077 = (0.628, 0.930)

Step 3: We convert the log limits back to a normal scale for risk ratios by taking the antilog using R.

> exp(0.628)

[1] 1.873859

> exp(0.930)

[1] 2.534509

 

Therefore, the 95% CI for RR: (1.87, 2.53)

Note: The confidence interval is not symmetric about RR   

Interpretation: In this study, subjects with hypertension had 2.18 times the risk of developing cardiovascular disease compared to subjects without hypertension. With 95% confidence, the true risk ratio lies in the range of 1.87-2.53.
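The three steps (log, build the interval, exponentiate) can be sketched end-to-end. This Python version (an illustration; the module works in R) mirrors the hand calculation, including rounding the proportions before forming the risk ratio:

```python
import math

a, n1 = 992, 3252   # CVD events / total among hypertensives
c, n2 = 165, 1182   # CVD events / total among non-hypertensives

# Round the proportions as in the table (0.305 and 0.140), so RR = 2.18
rr = round(round(a / n1, 3) / round(c / n2, 3), 2)

# Standard error of ln(RR)
se_log_rr = math.sqrt(1/a - 1/n1 + 1/c - 1/n2)

lo = math.exp(math.log(rr) - 1.96 * se_log_rr)
hi = math.exp(math.log(rr) + 1.96 * se_log_rr)
print(rr, round(lo, 2), round(hi, 2))  # 2.18 1.87 2.53
```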

Confidence Interval for an Odds Ratio


Note that while we have discussed using the odds ratio as a measure of association in the context of a case-control study, odds ratios can also be computed in other types of study designs as well. Recall our example of a prospective cohort study to examine the association of hypertension and cardiovascular disease. In this study the risk ratio was RR=2.18, but we can also compute an odds ratio and then use these data to illustrate how to compute a confidence interval for an odds ratio.

Table - Association Between Hypertension (HTN) and Cardiovascular Disease (CVD)

         CVD    No CVD   Total
HTN      992     2260     3252
No HTN   165     1017     1182
Total    1157    3277     4434

Just as we noted for risk ratios, odds ratios are also not normally distributed. They too are skewed toward the upper end of possible values. As a result, we must once again take the natural log of the odds ratio and first compute the confidence limits on a logarithmic scale, and then convert them back to the normal odds ratio scale.

The formula for the 95% confidence interval for the odds ratio is as follows:

exp[ ln(OR) ± 1.96 x SE(ln(OR)) ]

The standard error for ln(OR) is computed using the following equation:

SE(ln(OR)) = sqrt(1/a + 1/b + 1/c + 1/d)

where a, b, c, and d are the four cell counts in the two-by-two table. We will illustrate computation of a 95% confidence interval for the data in the contingency table shown above. The odds ratio is OR = (992 x 1017)/(2260 x 165) = 2.71, and the limits on the log scale are 0.997 ± 1.96 x 0.0922, i.e., (0.816, 1.178).

> log(2.71)

[1] 0.9969486

> exp(0.816)

[1] 2.261436

> exp(1.178)

[1] 3.247872

Interpretation: In this study, subjects with hypertension had 2.71 times the odds of developing cardiovascular disease compared to non-hypertensive subjects. With 95% confidence, the true odds ratio lies in the range of 2.26-3.24.
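The same pattern works for the odds ratio. A Python sketch (illustrative; the SE is computed as sqrt(1/a + 1/b + 1/c + 1/d), the usual large-sample formula):

```python
import math

a, b = 992, 2260   # hypertensive: CVD, no CVD
c, d = 165, 1017   # non-hypertensive: CVD, no CVD

odds_ratio = (a * d) / (b * c)                 # cross-product ratio
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)   # SE of ln(OR)

lo = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
hi = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
print(round(odds_ratio, 2), round(lo, 2), round(hi, 2))  # 2.71 2.26 3.24
```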

NOTE: The following sections are OPTIONAL and are only provided as a future resource.

Rate Ratios


Rate ratios are closely related to risk ratios, but they are computed as the ratio of the incidence rate in an exposed group divided by the incidence rate in an unexposed (or less exposed) comparison group.

Rate Ratio = (IRe) / (IRu)

Consider an example from The Nurses' Health Study. This prospective cohort study was used to investigate the effects of hormone replacement therapy (HRT) on coronary artery disease in post-menopausal women. The investigators calculated the incidence rate of coronary artery disease in post-menopausal women who had been taking HRT and compared it to the incidence rate in post-menopausal women who had not taken HRT. The findings are summarized in this table:

Post-Menopausal   Number With Coronary   Person-Years of
Hormone Use       Artery Disease         Disease-Free Follow-up
Yes                       30                   54,308.7
No                        60                   51,477.5

The incidence rate in HRT users was 30/54,308.7 person-years = 55.2 per 100,000 person-years, and the rate in non-users was 60/51,477.5 person-years = 116.6 per 100,000 person-years. So, the rate ratio was 55.2 / 116.6 = 0.47.

Interpretation: Women who used postmenopausal hormones had 0.47 times the rate of coronary artery disease compared to women who did not use postmenopausal hormones.

(Rate ratios are often interpreted as if they were risk ratios, e.g., post-menopausal women using HRT had 0.47 times the risk of CAD compared to women not using HRT, but it is more precise to refer to the ratio of rates rather than risk.)

Confidence Interval for a Rate Ratio

We can also calculate a 95% confidence interval for the rate ratio to give us an idea of the range of plausible values for the measure based on our sample. Like the risk ratio, the rate ratio is not normally distributed, but the natural log of the rate ratio, ln(IRR), where IRR stands for incidence rate ratio, is normally distributed. So, again, we have to work on the log scale to satisfy the normality requirement, and then take the antilogarithm of the lower and upper confidence limits at the end to compute the confidence limits on the rate ratio scale. The formula for the 95% confidence interval for the rate ratio is as follows:

exp[ ln(IRR) ± 1.96 x SE(ln(IRR)) ],  where SE(ln(IRR)) = sqrt(1/E1 + 1/E2)

and E1 and E2 are the numbers of events in the exposed and unexposed groups.

As with the risk ratio, let's walk through this formula step-by-step, using the hormone replacement therapy and coronary artery disease data as an example.

 

- The lower bound of the 95% confidence interval of the log rate ratio is: ln(0.47) - 1.96 x sqrt(1/30 + 1/60) = -0.7550 - 0.4383 = -1.1933

- The upper bound of the 95% confidence interval of the log rate ratio is: -0.7550 + 0.4383 = -0.3167

Taking the antilogs, exp(-1.1933) = 0.303 and exp(-0.3167) = 0.729.

 

In this study, those on hormone replacement therapy had 0.47 times the rate of developing coronary artery disease compared to those who did not take hormone replacement therapy. Based on this sample, we are 95% confident that the true rate ratio lies between 0.303 and 0.729.

As a rule of thumb, you should use the following wording to interpret the 95% confidence interval for the rate ratio: "Based on this sample, we are 95% confident that the true rate ratio lies between [lower bound] and [upper bound]."
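A compact sketch of the rate-ratio interval in Python (illustrative; rounding the rate ratio to 0.47 mirrors the hand calculation, and the SE uses the usual sqrt of the reciprocal event counts):

```python
import math

events_hrt, py_hrt = 30, 54308.7   # CAD events / person-years, HRT users
events_no,  py_no  = 60, 51477.5   # CAD events / person-years, non-users

irr = round((events_hrt / py_hrt) / (events_no / py_no), 2)  # 0.47
se_log_irr = math.sqrt(1/events_hrt + 1/events_no)           # sqrt(1/30 + 1/60)

lo = math.exp(math.log(irr) - 1.96 * se_log_irr)
hi = math.exp(math.log(irr) + 1.96 * se_log_irr)
print(round(lo, 3), round(hi, 3))  # 0.303 0.729
```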

Hypothesis Test for the Rate Ratio

Hypothesis testing for the rate ratio can also be done, and its null and alternative hypotheses are similar to those for the risk ratio. For the rate ratio, the hypotheses are as follows:

H0: IRR = 1 vs. HA: IRR ≠ 1

Or, in words

H0: There is no association between the exposure and the rate of disease

vs.

H1: There is an association between the exposure and the rate of disease

Again, both forms of the null and alternative are equivalent since, if there is no association between the exposure and the rate of disease, the rate of disease among the exposed will be similar to that among the unexposed, which would make the rate ratio close to the null value of 1. In the example of hormone replacement therapy and coronary artery disease, the null and alternative hypotheses would be

H0: There is no association between hormone replacement therapy and the rate of coronary artery disease (IRR = 1)

vs.

H1: There is an association between hormone replacement therapy and the rate of coronary artery disease (IRR ≠ 1)

A statistical test can be performed to evaluate these hypotheses. While the hand calculations of the test statistic are beyond the scope of this class, in later pages of this module we will see how to perform a statistical test for the rate ratio using the R statistical software.

For example, in a study of magnetic field exposure and leukemia, the null and alternative hypotheses would be that there is no association between magnetic field exposure and leukemia versus that there is an association. Again we can use a chi-squared statistic:

χ² = Σ (O - E)²/E

where O represents the observed frequencies and E represents the expected frequencies. Recall that the expected frequency of a given cell is calculated as the product of the row and column totals divided by the total sample size. For the magnetic field exposure and leukemia example, the expected values are shown below.

Observed and Expected Frequencies in the Leukemia Study

Category                       Observed Frequency   Expected Frequency
High Exposure, Leukemia                    30        (674 x 2,355)/69,567 = 22.82
High Exposure, No Leukemia                644        (674 x 67,212)/69,567 = 651.18
Medium Exposure, Leukemia                  61        (1,469 x 2,355)/69,567 = 49.73
Medium Exposure, No Leukemia            1,408        (1,469 x 67,212)/69,567 = 1,419.27
Low Exposure, Leukemia                  2,264        (67,424 x 2,355)/69,567 = 2,282.46
Low Exposure, No Leukemia              65,160        (67,424 x 67,212)/69,567 = 65,141.55

Using these values, we can now calculate the chi-squared statistic:

χ² = 2.26 + 0.08 + 2.55 + 0.09 + 0.15 + 0.01 = 5.14

Now that we have calculated the chi-squared statistic, we need to determine whether this value would lead us to reject or fail to reject the null hypothesis at an α-level of 0.05. To do this, we compare our calculated chi-squared statistic to the critical value with (3-1) x (2-1) = 2 x 1 = 2 df in the 0.05 column. From the chi-squared table shown earlier, we see that this value is 5.99. Since our calculated test statistic of 5.14 is less than the critical value of 5.99, we fail to reject the null hypothesis and conclude that there is insufficient evidence of an association between magnetic field exposure and the development of leukemia. This result would not be considered "statistically significant."
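The cell-by-cell arithmetic above can be verified with a few lines of Python (illustrative; the module's computations elsewhere use R):

```python
# Observed and expected counts for the six exposure-by-outcome cells
observed = [30, 644, 61, 1408, 2264, 65160]
expected = [22.82, 651.18, 49.73, 1419.27, 2282.46, 65141.55]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 2))  # 5.14 -- below the 2-df critical value of 5.99
```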

Confidence Interval for a Risk Difference


As with the risk ratio and the rate ratio, a 95% confidence interval for the risk difference can be calculated to provide us with a range of plausible risk differences based on our sample. Recall that one of the assumptions in building a confidence interval is that the measure you are building a confidence interval around is normally distributed. Unlike the risk ratio and the rate ratio, the risk difference is normally distributed, so we can calculate the confidence interval for the risk difference directly and do not need to transform it to the log scale. The formula for the 95% confidence interval for the risk difference is as follows:

Risk Difference ± [1.96 x SE(Risk Difference)]

To illustrate we will use data from a prospective cohort study in which one of the associations that was examined was the association between smoking and lung cancer. To simplify we regarded anyone who smoked regularly as a "smoker."

Table - Incidence of Lung Cancer in Smokers and Non-Smokers

 

              Lung Cancer   No Lung Cancer   Total
Smokers            40             20           60
Non-Smokers        10            130          140
Total              50            150          200

 

 
The risk difference is 40/60 - 10/140 = 0.667 - 0.071 = 0.596, and its standard error is 0.067.

Note: You may notice that this formula is different from the one in your textbook. The formula shown here is preferred; the one in the textbook generates an approximation.

- The lower bound of the 95% confidence interval of the risk difference is:
Risk Difference - [1.96 x SE[Risk Difference]] = 0.596 - [1.96 x 0.067] = 0.596 - 0.131 = 0.465

- The upper bound of the 95% confidence interval of the risk difference is:
Risk Difference + [1.96 x SE[Risk Difference]] = 0.596 + [1.96 x 0.067] = 0.596 + 0.131 = 0.727

There were 596 excess lung cancer cases per 1000 subjects in the group that smoked, compared to the group who did not smoke during the study period. Based on this sample, we are 95% confident that the true risk difference lies between 465/1000 and 727/1000 excess cases of lung cancer.

As a rule of thumb, you should use the following wording to interpret the 95% confidence interval for the risk difference: "Based on this sample, we are 95% confident that the true risk difference lies between [lower bound] and [upper bound] excess/fewer cases of [disease]."
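The hand calculation above can be reproduced in a few lines of R (a minimal sketch using the counts from the table; the pooled standard error shown here matches the 0.067 used in the hand calculation):

```r
# 2x2 table counts: smokers (40 cases / 60 total), non-smokers (10 / 140)
a <- 40; n1 <- 60
b <- 10; n2 <- 140

p1 <- a / n1          # risk in smokers, about 0.667
p2 <- b / n2          # risk in non-smokers, about 0.071
rd <- p1 - p2         # risk difference, about 0.596

# Pooled standard error, consistent with the hand calculation
p  <- (a + b) / (n1 + n2)
se <- sqrt(p * (1 - p) * (1/n1 + 1/n2))   # about 0.067

c(lower = rd - 1.96 * se, upper = rd + 1.96 * se)   # roughly (0.46, 0.73)
```

Carrying full precision through the calculation (rather than rounding to 0.596 and 0.067 first) gives bounds that differ slightly in the third decimal place from the hand calculation.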

Confidence Interval for a Rate Difference


Similar to the other measures of association covered so far, the 95% confidence interval for the rate difference can be calculated to provide us with a range of plausible rate differences based on our sample. Like the risk difference, the rate difference is normally distributed, so we can calculate the confidence interval for the rate difference directly and do not need to transform it to the log-scale.

The formula for the 95% Confidence Interval for the rate difference is as follows:

Rate Difference ± [1.96 x SE(Rate Difference)]

Let's walk through this formula step-by-step, using data on hormone replacement therapy (HRT) and coronary artery disease in post-menopausal women as an example.

-The lower bound of the 95% confidence interval of the rate difference is:

- The upper bound of the 95% confidence interval of the rate difference is:

Post-menopausal women who received HRT had 62 fewer cases of coronary artery disease per 100,000 person-years compared to post-menopausal women who did not receive hormone replacement therapy. Based on this sample, we are 95% confident that the true rate difference lies between 27 and 97 fewer cases of coronary artery disease per 100,000 person-years among post-menopausal women who received HRT compared to those who did not.

As a rule of thumb, you should use the following wording when interpreting the 95% confidence interval for the rate difference. "Based on this sample, we are 95% confident that the true rate difference lies between [lower bound] and [upper bound] excess/fewer cases of [disease] per [number] person-years."
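The calculation generalizes to any pair of rates. Below is a minimal R sketch, assuming the standard large-sample standard error for a rate difference, sqrt(a/PT1^2 + b/PT2^2), where a and b are the case counts and PT1 and PT2 are the person-time totals in each group (the function name and arguments here are illustrative, not from the module):

```r
# 95% CI for a rate difference
# a, b   = case counts in the two groups
# pt1, pt2 = person-time totals in the two groups
rate_diff_ci <- function(a, pt1, b, pt2, conf = 0.95) {
  rd <- a / pt1 - b / pt2                 # rate difference
  se <- sqrt(a / pt1^2 + b / pt2^2)       # large-sample standard error
  z  <- qnorm(1 - (1 - conf) / 2)         # 1.96 for a 95% interval
  c(lower = rd - z * se, estimate = rd, upper = rd + z * se)
}
```

Calling the function with the exposed and unexposed case counts and person-time from a study yields the point estimate and both bounds in one step; multiply by 100,000 to express the result per 100,000 person-years.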

Using R to Evaluate Relative and Absolute Measures


To save time and to reduce the risk of making an error, we can use R to compute these measures of association. R can also calculate the 95% confidence interval for each measure of association and compute the test statistic for the corresponding hypothesis test.

Note that while R can compute the rate difference as well as its corresponding confidence interval and hypothesis test, those computations are beyond the scope of this class. Here, we will be covering risk ratios, rate ratios, risk differences, and odds ratios.

To perform these calculations in R, we will use the epi.2by2 function in the epiR package. To install the epiR package in RStudio, you can simply type the following into the console:

>install.packages("epiR")

Note that the package only needs to be installed once, the first time you want to use it. After installation, you need to load the epiR package into the current RStudio session. This step must be performed every time you open a new R session in which you want to use the epiR package.

To load the epiR package, type the following into your console:

>library(epiR)

[Insert video clip on installing and loading packages in R.]

 

To see how to use R to perform the hand calculations in this module, download smoking_lungca.csv, and then watch and follow along with the video below.

[Insert epi.2by2 video here]
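For reference, the call demonstrated in the video looks roughly like the following (a minimal sketch with argument names taken from the epiR documentation; note that the ordering — exposed group in the first row, cases in the first column — matters for how the measures are computed):

```r
library(epiR)

# 2x2 table: rows = exposure (exposed first), columns = outcome (cases first)
tab <- matrix(c(40, 20, 10, 130), nrow = 2, byrow = TRUE,
              dimnames = list(c("Smoker", "Non-smoker"),
                              c("Lung cancer", "No lung cancer")))

# Cohort count data: reports the risk ratio, odds ratio, and risk difference,
# along with their 95% confidence intervals and a chi-square test
epi.2by2(dat = tab, method = "cohort.count", conf.level = 0.95)
```

The output mirrors the hand calculations from earlier in this module, which is a useful way to check your work.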

 

Important note: The z-test comparing two proportions, which we calculated by hand to test the risk difference, is equivalent to the chi-square test of independence, and the prop.test() function (which we used in the video to test whether the risk difference differs from 0) formally calculates the chi-square test.

The p-value from the z-test for two proportions is equal to the p-value from the chi-square test, and the z-statistic is equal to the square root of the chi-square statistic in this situation.
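You can verify this equivalence directly in R (a minimal sketch using the smoking and lung cancer table; prop.test() without the continuity correction reproduces the chi-square test of independence):

```r
# Cases and group sizes from the smoking / lung cancer table
cases  <- c(40, 10)      # smokers, non-smokers
totals <- c(60, 140)

# Chi-square test of two proportions (no continuity correction)
pt <- prop.test(x = cases, n = totals, correct = FALSE)

chi_sq <- unname(pt$statistic)   # about 79.4
z      <- sqrt(chi_sq)           # about 8.9, the two-proportion z statistic
c(chi_sq = chi_sq, z = z)
```

Squaring the z statistic recovers the chi-square statistic exactly, and both tests return the same p-value.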