Module 5: Using Epidemiology to Identify the Cause of Disease

Complete/Work for your practice and review these problems
• Gordis Ch 7 Complete and Review Problems 1-11, PLEASE PROVIDE RATIONALES FOR CORRECT AND INCORRECT ANSWERS


• From Ch 9-review concepts and math required to calculate Sensitivity, Specificity, PPV, NPV.
• Review the supplied Practice Problems Worksheet
From Gordis:
• Discussion – Locate appropriate article related to concepts of Ch 7 or 8 such as:
1. case-control study and if possible, offer examples of information bias
2. cohort study with the design as retrospective approach and values of this design
3. cohort study with the design as prospective approach and values of this design
May be on any healthcare-related topic.
• Use Headings and Post with bullet or paragraph discussion format to communicate your findings of each of these topics to other group members for:
1. topic of Epidemiology concepts in the article,
2. article focus (health condition, intervention, what was explored).
3. population,
4. outcomes,
5. challenges or limitations,
6. what you learned about the topic.
• Attach a copy of the article.
• Cite source(s) in APA 7th ed.
• Post in Discussion Section for this Module on Canvas and respond to 1 other student.
The attached article is a prospective cohort study of children aged 0-2 years who were followed every two years for eight years through cycles one to five of the National Longitudinal Survey of Children and Youth (NLSCY).

Module 5 – 1 Week Module – Week of March 8-14 (Sunday), 2021

1 Week Module – March 8-14, 2021

Module Objectives:

1.  The student will identify and locate appropriate population-focused sources of data and resources available to HCP. (CO 1,2,3,6)

2.  The student will synthesize evidence to evaluate health programs of population of interest at population level. (CO 1,3)

3.  The student will analyze a current peer-reviewed journal article and report related to the cohort and case-control studies. (CO 1,3)

4.  The student will compute and interpret epidemiologic problems and apply to population-based health strategies and/or outcomes.

CO: Course Objectives

Required Readings:

·    Read and Review Gordis Chapters 7, 8, 9

Complete/Work for your practice and review these problems

·    Gordis Ch 7 Complete and Review Problems 1-11

·    Gordis Ch 8 Complete and Review Problems 1-6

·    From Ch 9-review concepts and math required to calculate Sensitivity, Specificity, PPV, NPV.

·    Review the supplied Practice Problems Worksheet


·   Powerpoint Lecture

·   Supplied Practice Problems Worksheet – Attachment here and same one separately in Module 5

From Gordis:

·     Discussion – Locate appropriate article related to concepts of Ch 7 or 8 such as:

1.   case-control study and if possible, offer examples of information bias

2.   cohort study with the design as retrospective approach and values of this design

3.   cohort study with the design as prospective approach and values of this design

May be on any healthcare-related topic.

·     Use Headings and Post with bullet or paragraph discussion format to communicate your findings of each of these topics to other group members for:

1.      topic of Epidemiology concepts in the article,

2.      article focus (health condition, intervention, what was explored).

3.      population,

4.      outcomes,

5.      challenges or limitations,

6.      what you learned about the topic.

·     Attach a copy of the article.

·     Cite source(s) in APA 7th ed.

·     Post in Discussion Section for this Module on Canvas and respond to 1 other student.


Gordis Homework Problems of Practice Problems Worksheet:

·     Post to Assignments:

·     Complete the supplied Practice Problems Worksheet on Sensitivity, Specificity, PPV, NPV. May be typed or hand-written. Show work for credit.

·     Post completed Practice Problems Worksheet to Canvas Assignments. (Need to know for Epi Quiz 2)


Ø Discussion: Post to Discussions using Headings with bullet or paragraph format (see separate postings headings):

A) Post to Discussion – 1 appropriate article related to concepts of Ch 7, 8, or 9 as listed below

B) Post your article and information related to natural history of disease or survival data with article focus, population, outcomes, what you learned and copy of article. Include APA Citation.

C) Primary post should be done by Thursday this week (March 11). Respond to at least 1 other student by Sunday night module due date (March 14). (1 week module)


Ø  Gordis Complete Epi Practice Worksheet problems:

A)     Complete supplied Practice Problems Worksheet on Sensitivity, Specificity, PPV, NPV. May be typed or hand-written. Show work

B)     Discuss your findings with perspective of impact of disease prevalence and your answers. Review text as needed.

C)     Submit to Assignments-in Word document or scan and upload. Need your Name on the Worksheet

Post to Canvas Assignments. Due by closing date of Module 5 on March 14 (Sunday-is 1 wk module).

D)     *No submission of text chapters problems this week—just answer and review.


Gordis Rubric:    (Submit to Discussion)

Using Epidemiology to Identify the Cause of Disease Ch 7-8-9 Met Not Met
Locate an appropriate article related to content listed above of Ch 6, 7, or 8. 15 points Not Done


1-Communicate and discuss: topic of Epidemiology concepts in the article 15 points 0
2- Communicate and discuss: article focus (health condition, intervention, what was explored) 10 points 0
3- Communicate and discuss: population 10 points 0
4- Communicate and discuss: outcomes 15 points 0
5- Communicate and discuss: challenges or limitations 10 points 0
6- Communicate and discuss: what you learned about topic 10 points 0
·     Attach a copy of the article. 5 points 0
·     Cite source(s).  APA 7th ed. 5 points 0
Respond to at least 1 other group member w meaningful post               5 points 0
Submit to Discussions. 100 points tbd


Gordis Rubric:    (Submit to Assignments)

Practice Problems Worksheet Met Not Met
A-Complete Practice Problems Worksheet – Problems each worked correctly (3.75 pts x 12). 45 points Not Done
B-Complete Practice Problems Worksheet – showing your work on each problem (3.75 pts x 12). 45 points
C-Discuss your findings with perspective of impact of disease prevalence and your answers. 10 points 0
On Time submission. Late is -5 pts per day 0 tbd
Submit to Assignments-in Word document or scan & upload 100 points tbd


Observational Studies


case reports and case series; ecological studies; cross-sectional studies; case-control studies; information bias; case cross-over studies; matching



  • To describe the motivations for and the design of observational studies.
  • To discuss early origins of the research question including case reports, case series, and ecologic studies.
  • To describe the cross-sectional study design and its importance.
  • To discuss case-control studies, including selection of cases and controls.
  • To discuss potential selection biases in case-control studies.
  • To discuss information biases in case-control studies, including limitations in recall and recall bias.
  • To describe other issues in case-control studies, including matching and the use of multiple controls.
  • To introduce the case cross-over study design.


Case Reports and Case Series


Perhaps one of the most common and early origins of medical research questions is through careful observations by physicians and other health care providers of what they see during their clinical practice. Such individual-level observations can be documented in a case report, describing a particular clinical phenomenon in a single patient, or in a case series that describes more than one patient with similar problems. Both case reports and case series are considered the simplest of study designs (although some assert that they are merely “prestudy designs”). The main objective of case reports and case series is to provide a comprehensive and detailed description of the case(s) under observation. This allows other physicians to identify and potentially report similar cases from their practice, especially when they share geographic or specific clinical characteristics. For example, 2015 witnessed an outbreak of the Zika virus in Latin America. Zika virus is a flavivirus transmitted by Aedes mosquitoes, most commonly Aedes aegypti and possibly Aedes albopictus, and originally isolated from a rhesus monkey in the Zika forest in Uganda in 1947. 1 In early 2016, following increasing numbers of infants born with microcephaly in Zika virus-affected areas, the Centers for Disease Control and Prevention (CDC) published a descriptive case series from Brazil on the possible association between Zika virus infection and microcephaly, a condition in which the baby’s head is significantly smaller than expected, potentially due to incomplete brain development. 2 Another case report was published about the offspring of a Slovenian woman who lived and worked in Brazil and became pregnant in February 2015. She got ill with a high fever, followed by severe musculoskeletal and retro-ocular pain and an itching and generalized maculopapular rash. No virologic testing for Zika virus was performed. 
She returned to Europe in the 28th week of gestation when ultrasonographic imaging showed fetal anomalies. The pregnancy was terminated in the 32nd week of gestation at the mother’s request, following the approval of national and institutional ethical committees, and the Zika virus was found in the fetal brain tissue.


Despite the fact that case reports and case series are merely descriptive in nature with no reference group to make a strict comparison, the Brazilian case series was instrumental in the development of the CDC’s guidelines 4 (Fig. 7.1) for the evaluation and testing, by health care providers, of infants whose mothers traveled to or resided in an area with ongoing Zika virus transmission during their pregnancies (Fig. 7.2).


Case reports and case series are key hypothesis-generating tools, especially when they are simple, inexpensive, and easy to conduct in the course of busy clinical settings. However, the lack of a comparison group is a major disadvantage. Furthermore, external validity (generalizability) is limited, given the biased selection of cases (all identified in clinical practice). Finally, any association observed in a case report or a case series is prone to potentially unmeasured confounding unbeknown to the investigators.


Ecologic Studies

The first approach in determining whether an association exists may be a study of group characteristics, the so-called ecologic studies. Fig. 7.3 shows the correlation of each country’s level of chocolate consumption and its number of Nobel laureates per capita. 5 In this figure, each dot represents a different country. As seen in this figure, the higher the average chocolate consumption for a country, the higher the number of Nobel laureates per capita. Chocolate, high in dietary flavanols, is thought to improve cognitive function and reduce the risk of dementia. We might therefore be tempted to conclude that chocolate consumption may be a causal factor for being awarded a Nobel Prize. What is the problem with drawing such a conclusion from this type of study? Consider Switzerland, for example, which has the highest number of Nobel laureates per capita and the highest average consumption of chocolate. The problem is that we do not know whether the individuals who won the Nobel Prize in that country actually had a high chocolate intake. All we have are average values of chocolate consumption and the number of Nobel laureates per capita for each country. In fact, one might argue that, given the same overall picture, it is conceivable that those who won the Nobel Prize ate very little chocolate. Fig. 7.3 alone does not reveal whether this might be true; in effect, individuals in each country are characterized by the average figures (level of consumption and per capita Nobel laureates) for that country. No account is taken of variability between individuals in that country with regard to chocolate consumption. This problem is called the ecologic fallacy: we may be ascribing to members of a group some characteristic that they in fact do not possess as individuals. This problem arises in an ecologic study because data are only available for groups; we do not have exposure and outcome data for each individual in the population.
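A small numeric sketch may help make the ecologic fallacy concrete. The three countries and the individual records below are invented for illustration only (they are not the data behind Fig. 7.3). They are constructed so that country-level averages of chocolate intake and laureate rates rise together, even though every laureate eats less chocolate than the average non-laureate.

```python
# Hypothetical individual records: (chocolate intake in kg/year, is_laureate).
# These values are invented for illustration and are NOT from Fig. 7.3.
countries = {
    "A": [(2, False), (2, False), (2, False), (2, False)],
    "B": [(1, True), (7, False), (6, False), (6, False)],
    "C": [(1, True), (1, True), (15, False), (15, False)],
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Group-level (ecologic) view: one average intake and one laureate rate per country.
group_view = {
    name: (
        mean([kg for kg, _ in people]),
        mean([1 if laureate else 0 for _, laureate in people]),
    )
    for name, people in countries.items()
}

# Individual-level view: compare the intake of laureates and non-laureates.
everyone = [person for people in countries.values() for person in people]
laureate_intake = mean([kg for kg, laureate in everyone if laureate])
other_intake = mean([kg for kg, laureate in everyone if not laureate])

for name, (avg_kg, rate) in group_view.items():
    print(f"Country {name}: mean intake {avg_kg:.1f} kg, laureate rate {rate:.2f}")
print(f"Mean intake of laureates: {laureate_intake:.1f} kg")
print(f"Mean intake of non-laureates: {other_intake:.1f} kg")
```

In this made-up data set, only the individual-level rows reveal that the laureates are the low consumers; the country averages alone would suggest the opposite.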

Table 7.1 shows data from a study in Northern California exploring a possible relation between prenatal exposure to influenza during an influenza outbreak and the later development of acute lymphocytic leukemia in a child. 6 The table shows incidence data for children who were not in utero during a flu outbreak and for children who were in utero in the first, second, or third trimester of the pregnancy during the outbreak. Below these figures, the data are presented as relative risks, with the risk being set at 1.0 for those who were not in utero during the outbreak and the other rates being set relative to this. The data indicate a high relative risk for leukemia in children who were in utero during the flu outbreak in the first trimester.



TABLE 7.1 Average Annual Crude Incidence Rates and Relative Risks of Acute Lymphocytic Leukemia by Cohort and Trimester of Flu Exposure for Children Younger Than 5 Years, San Francisco/Oakland (1969–1973)

                               No Flu      Flu Exposure
                               Exposure    1st Trimester   2nd Trimester   3rd Trimester   Total
Incidence rates per 100,000    3.19        10.32           8.21            2.99            6.94
Relative risks                 1.0         3.2             2.6             0.9             2.2

Modified from Austin DF, Karp S, Dworsky R, et al. Excess leukemia in cohorts of children born following influenza epidemics. Am J Epidemiol. 1977;10:77–83.
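The relative risks in the table can be reproduced by dividing each incidence rate by the rate in the group with no flu exposure. The sketch below does that arithmetic; the dictionary keys and the function name are illustrative labels chosen here, not names from the source.

```python
# Incidence rates per 100,000 as reported in Table 7.1.
# The dictionary keys are illustrative labels, not names from the source.
rates = {
    "no_flu_exposure": 3.19,
    "first_trimester": 10.32,
    "second_trimester": 8.21,
    "third_trimester": 2.99,
    "total_exposed": 6.94,
}


def relative_risks(rates, reference="no_flu_exposure"):
    """Divide each group's rate by the reference group's rate (one decimal place)."""
    ref = rates[reference]
    return {group: round(rate / ref, 1) for group, rate in rates.items()}


rr = relative_risks(rates)
for group, value in rr.items():
    print(f"{group}: {value}")
```

The results match the relative risks row of the table (1.0, 3.2, 2.6, 0.9, and 2.2).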


What is the problem? The authors themselves stated, “The observed association is between pregnancy during an influenza epidemic and subsequent leukemia in the offspring of that pregnancy. It is not known if the mothers of any of these children actually had influenza during their pregnancy.” 6 What we are missing are individual data on exposure (influenza infection). One might ask, why didn’t the investigators obtain the necessary exposure data? The likely reason is that the investigators used birth certificates and data from a cancer registry; both types of data are relatively easy to obtain. This approach did not require follow-up of the children and direct contact with individual subjects. If we are impressed by these ecologic data, we might want to carry out a study specifically designed to explore the possible relationship of prenatal flu and leukemia. However, such a study would probably be considerably more difficult and more expensive to conduct.

In view of these problems, are ecologic studies of value? Yes, they can suggest avenues of research that may be promising in casting light on etiologic relationships. In and of themselves, however, they do not demonstrate conclusively that a true association exists.


For many years, legitimate concerns about the possibility of ecologic fallacy gave ecologic studies a bad name and diverted attention from the importance of studying potentially meaningful relationships that can only be studied ecologically, such as those between the individual and the community in which he or she lives. For example, Rose and associates 7 studied the relationship of socioeconomic and racial characteristics of a neighborhood and the receipt of angiography in a community-based sample who had a myocardial infarction (MI). Among the 9,941 people with MI participating in the Atherosclerosis Risk in Communities Study, compared to whites from high neighborhood-level income areas, blacks from low and medium neighborhood-level income areas as well as whites from low neighborhood-level income areas were less likely to undergo an angiographic examination. On the other hand, blacks from high neighborhood-level income areas and whites from medium neighborhood-level income areas were not disadvantaged with respect to receiving angiography. Thus future studies addressing both individual risk factors and ecologic risk factors, such as neighborhood characteristics, and the possible interactions of both types of factors may improve our understanding of access to an angiographic examination.

Another example of the importance of ecologic data is given by schistosomiasis, a disease caused by freshwater parasites called schistosomes that can affect the genitourinary or gastrointestinal tracts as well as the central nervous system, and that is also a risk factor for bladder and liver cancer. Individuals are exposed through contact with infested water. Those in rural communities are at highest risk for contracting schistosomiasis; exposure may occur among agricultural or fishing populations, women washing clothes, or children playing in infested water. Egypt has the highest endemic worldwide prevalence of schistosomiasis, dating back to its dynastic period. Parenteral antischistosomal therapy (PAT) with potassium antimony tartrate, commonly called tartar emetic, was used for mass treatment in Egypt beginning in the 1920s, given as 12 weekly intravenous injections. These injections were given with reusable glass syringes, generally without proper sterilization procedures, which may have been responsible for Egypt having the highest hepatitis C prevalence in the world. 9 (Tartar emetic was the only treatment for schistosomiasis until praziquantel [Biltricide], a highly effective oral treatment, was introduced in the 1980s.) In 2000, Frank et al. 10 studied the ecologic association in Egypt governorate areas between annual PAT use with tartar emetic and seroprevalence of antibodies to hepatitis C virus (HCV) in 8,499 Egyptians aged 10 to 50 years. Overall, age-adjusted prevalence of antibodies to HCV was found to be 21.9%. Fig. 7.4 shows the association between region-specific prevalence of antibodies to HCV and region-specific PAT exposure, which suggests that the variation in seroprevalence of antibodies to HCV between regions may be explained by PAT exposure (odds ratio 1.31 [95% confidence interval {CI}: 1.08−1.59]; P = .007).
To date, massive HCV transmission through PAT use in Egypt is considered the largest iatrogenic transmission of a blood-borne pathogen ever recorded.

It has been claimed that because epidemiologists generally show tabulated data and refer to characteristics of groups, the data in all epidemiologic studies are group data. This is not true. In cross-sectional, case-control, cohort studies and randomized trials, data on exposure and disease outcome are available for every individual in the study, even though these data are commonly grouped in tables and figures. On the other hand, only grouped data are available in ecologic studies, such as, for example, country-by-country data on average salt consumption and average systolic blood pressure.


Interestingly, when variability of an exposure is limited, ecologic correlations may provide a more valid answer with regard to the presence of an association than studies based on individuals. Wynder and Stellman have summarized this phenomenon as follows: “If cases and controls are drawn from a population in which the range of exposures is narrow, then a study may yield little information about potential health effects.”

An example is the relationship of salt intake and blood pressure, which has not been consistently found in case-control and cohort studies; however, in an ecologic correlation using country populations as the analytic units, a strong and graded correlation has been observed. This phenomenon can be explained by the narrow range of salt intake in individuals within each country, but a fairly large variability of average salt intake between countries.

Cross-Sectional Studies


Another common study design used in initially investigating the association between a specific exposure and a disease of interest is the cross-sectional study. Let’s assume that we are interested in the possible relationship of increased serum cholesterol level (the exposure) to electrocardiographic (ECG) evidence of coronary heart disease (CHD, the disease). We survey a population, and for each participant we determine the serum cholesterol level and perform an ECG for evidence of CHD. The presence of CHD defines a prevalent case. This type of study design is called a cross-sectional study because both exposure and disease outcome are determined simultaneously for each study participant; it is as if we were viewing a snapshot of the population at a certain point in time. Another way to describe a cross-sectional study is to imagine that we have sliced through the population, capturing levels of cholesterol and evidence of CHD at the same time. Note that in this type of approach, the cases of disease that we identify are prevalent cases of the disease in question (which is the reason why a cross-sectional study is also called a “prevalence study”), because we know that they existed at the time of the study, but we do not know their duration (the interval between the onset of the disease and “today”), or whether the exposure happened before the outcome. The impossibility of determining a temporal sequence “exposure-disease” may result in temporal bias when it is the disease that causes the exposure. 
For example, prevalent cases of CHD may engage in leisure physical activity more often than normal subjects, as the occurrence of an acute episode of CHD may prompt physicians to recommend physical exercise to their CHD patients, a phenomenon that is also known as “reverse causality.” (Note, however, that when information on exposure is obtained by a questionnaire, it is possible to find out whether a given exposure [e.g., sedentary habits, smoking, or excessive alcohol drinking] was present prior to the disease onset, thus allowing the identification of the temporal sequence between the exposure and the disease.)

In addition to temporal bias, survival/selection bias may also occur in a cross-sectional study when the exposure is related to the duration of the disease; thus, for example, if exposure-induced incident cases have a shorter survival than unexposed incident cases, prevalent cases, which are by definition survivors, may have a lower proportion of past exposure than would have been observed if incident cases had been included in the study. In other words, identifying only prevalent cases would exclude those who died soon after the disease developed but before the study was carried out. For example, we know that a high serum cholesterol level causes CHD. However, when doing a cross-sectional study, the observed association may be a function of both the risk of developing CHD and survival after CHD onset.


Another example of survival bias is given by smoking-induced lung emphysema. Smoking not only causes emphysema, but in addition, survival of patients with smoking-induced emphysema is worse than that of patients whose emphysema results from other causes (e.g., asthma or chronic bronchitis). As a result, past smoking will be observed less frequently in prevalent than in incident cases of emphysema. This type of survival bias is also known as prevalence-incidence bias.
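The arithmetic behind prevalence-incidence bias can be sketched with made-up numbers. The case counts and survival fractions below are hypothetical values chosen only to illustrate the direction of the bias; they do not come from any study cited here.

```python
def exposure_share(exposed_cases, unexposed_cases):
    """Proportion of cases with a history of the exposure."""
    return exposed_cases / (exposed_cases + unexposed_cases)


# Hypothetical incident emphysema cases (invented for illustration).
incident_smokers = 700
incident_nonsmokers = 300

# Hypothetical fraction of each group still alive when the survey is done.
# Smoking-induced cases are assumed to survive less often, as in the text.
survival_smokers = 0.40
survival_nonsmokers = 0.80

# Prevalent cases are the incident cases that survive to the survey date.
prevalent_smokers = incident_smokers * survival_smokers
prevalent_nonsmokers = incident_nonsmokers * survival_nonsmokers

share_incident = exposure_share(incident_smokers, incident_nonsmokers)
share_prevalent = exposure_share(prevalent_smokers, prevalent_nonsmokers)

print(f"Past smoking among incident cases:  {share_incident:.0%}")
print(f"Past smoking among prevalent cases: {share_prevalent:.0%}")
```

Under these assumptions, past smoking is seen in 70% of incident cases but only about 54% of prevalent cases, so a survey of survivors would understate the exposure.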

In view of its biases, results of a cross-sectional study should be used to generate hypotheses that can then be evaluated using a study design that includes incident cases and allows establishing the temporal sequence of the exposure and the outcome. Nevertheless, cross-sectional studies, like political polls and sample surveys, are widely used and are often the first studies conducted before moving on to more valid study designs.


The general design of a cross-sectional (or prevalence) study is seen in Fig. 7.5. We define a population and determine the presence or absence of exposure and the presence or absence of disease for each individual at the same time. Each subject then can be categorized into one of four possible subgroups.

As seen in the 2 × 2 table in the top portion of Fig. 7.6, there will be a persons, who have been exposed and have the disease; b persons, who have been exposed but do not have the disease; c persons, who have the disease but have not been exposed; and d persons, who have neither been exposed nor have the disease.

FIG. 7.6 Design of a hypothetical cross-sectional study—II: (top) A 2 × 2 table of the findings from the study; (bottom) two possible approaches to the analysis of results: (A) Calculate the prevalence of disease in exposed persons compared to the prevalence of disease in nonexposed persons, or (B) Calculate the prevalence of exposure in persons with disease compared to the prevalence of exposure in persons without disease.


In order to determine whether there is evidence of an association between exposure and disease from a cross-sectional study, we have a choice between two possible approaches, which in Fig. 7.6 are referred to as (A) and (B). If we use (A), we can calculate the prevalence of disease in persons with the exposure, $\frac{a}{a+b}$, and compare it with the prevalence of disease in persons without the exposure, $\frac{c}{c+d}$. If we use (B), we can compare the prevalence of exposure in persons with the disease, $\frac{a}{a+c}$, to the prevalence of exposure in persons without the disease, $\frac{b}{b+d}$. The details of both approaches are shown in the bottom portion of Fig. 7.6.
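Approaches (A) and (B) can be sketched in a few lines, using the cell labels a, b, c, and d from the 2 × 2 table (a: exposed with disease; b: exposed without disease; c: not exposed with disease; d: not exposed without disease). The counts and the function name below are hypothetical, chosen only to show the calculation.

```python
def cross_sectional_prevalences(a, b, c, d):
    """Return the four prevalences used in approaches (A) and (B).

    a: exposed, diseased        b: exposed, not diseased
    c: not exposed, diseased    d: not exposed, not diseased
    """
    return {
        # Approach (A): prevalence of disease by exposure status.
        "disease_in_exposed": a / (a + b),
        "disease_in_nonexposed": c / (c + d),
        # Approach (B): prevalence of exposure by disease status.
        "exposure_in_diseased": a / (a + c),
        "exposure_in_nondiseased": b / (b + d),
    }


# Hypothetical survey counts, invented for illustration.
counts = {"a": 40, "b": 160, "c": 30, "d": 270}
prev = cross_sectional_prevalences(**counts)

# A prevalence ratio compares the two values from approach (A).
prevalence_ratio = prev["disease_in_exposed"] / prev["disease_in_nonexposed"]

for label, value in prev.items():
    print(f"{label}: {value:.3f}")
print(f"prevalence ratio: {prevalence_ratio:.2f}")
```

With these invented counts, disease prevalence is 20% in the exposed and 10% in the nonexposed, giving a prevalence ratio of 2.0.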

If we determine in such a study that there appears to be an association between increased cholesterol level and CHD, we are left with several issues we have to consider. First, in this cross-sectional study, we are identifying prevalent (existing) cases of CHD rather than incident (new) cases; such prevalent cases may not be representative of all cases of CHD that have developed in this population. For example, identifying only prevalent cases would exclude those who died after the disease developed but before the study was carried out. Therefore, even if an association of exposure and disease is observed, the association may be with survival after CHD rather than with the risk of developing CHD. Second, because the presence or absence of both exposure and disease was determined at the same time in each participant in the study, it is often not possible to establish a temporal relationship between the exposure and the onset of disease. Thus, in the example given at the beginning of this section, it is not possible to tell whether or not the increased cholesterol level preceded the development of CHD. Without information on temporal relationships, it is conceivable that the increased cholesterol level could have occurred as a result of the CHD, in which case we call it “reverse causality,” or perhaps both may have occurred as a result of another factor. If it turns out that the exposure did not precede the development of the disease, the association cannot reflect a causal relationship.

Farag et al. used data from the National Health and Nutrition Examination Survey (NHANES), a nationally representative survey of the noninstitutionalized US civilian population, to examine a potential association between vitamin D and erectile dysfunction in men who were free from cardiovascular disease. 12 A dose-response relationship was found between vitamin D deficiency and erectile dysfunction (prevalence ratio 1.30, 95% CI: 1.08−1.57; Fig. 7.7). Notwithstanding the biases inherent to the cross-sectional design, the study’s findings suggest the need to perform a randomized trial on the association of vitamin D deficiency and erectile function.

FIG. 7.7 Restricted cubic spline of 25(OH)D and adjusted prevalence ratio of erectile dysfunction (ED), NHANES 2001–2004. Curves represent adjusted prevalence ratio (solid line) and the 95% confidence intervals (dashed lines) based on restricted cubic splines for 25(OH)D level with knots at 10, 20, 30, and 40 ng/mL. The reference values were set at 20 ng/mL. Model is adjusted for age, race, smoking, alcohol consumption, body mass index, physical activity, hypertension, diabetes, hypercholesterolemia, estimated glomerular filtration rate, C-reactive protein, and the use of antidepressants and beta blockers. (From Farag YM, Guallar E, Zhao D, et al. Vitamin D deficiency is independently associated with greater prevalence of erectile dysfunction: the National Health and Nutrition Examination Survey (NHANES) 2001-2004. Atherosclerosis. 2016;252:61–67.)

Serial cross-sectional studies are also useful to evaluate trends in disease prevalence over time in order to inform health care policy and planning. Murphy and colleagues used annual NHANES data, yearly from 1988 to 1994 and every 2 years from 1999 to 2012, to examine trends in chronic kidney disease (CKD) prevalence. 13 Fig. 7.8 shows the temporal trends in adjusted prevalence of stages 3 and 4 CKD from NHANES 1988–1994 through 2011–2012, categorized by the presence or absence of diabetes. As shown in the figure, there was an initial increase in adjusted prevalence of stages 3 and 4 CKD that leveled off in the early 2000s among nondiabetic individuals but continued to increase in diabetic individuals.

FIG. 7.8 Adjusted prevalence of stage 3 and 4 chronic kidney disease (estimated glomerular filtration rate of 15 to 59 mL/min/1.73 m² calculated with Chronic Kidney Disease Epidemiology Collaboration equation) in US adults, NHANES 1988–1994 through 2011–2012. (From Murphy D, McCulloch CE, Lin F, et al. Trends in prevalence of chronic kidney disease in the United States. Ann Intern Med. 2016;165:473–481.)

To minimize health research costs, researchers often depend on self-reported data. Weight and height are the most common self-reported variables. However, self-reports are prone to under- or overreporting. Cross-sectional data can help validate and correct errors in self-reported weight and height. For example, Jain compared self-reported with measured cross-sectional weight and height data from the NHANES for the period 1999–2006. This comparison allowed him to estimate a correction factor, which was then applied to the prevalence of obesity based on self-reported weight and height obtained from the Behavioral Risk Factor Surveillance System. Jain estimated that the weight/height self-reporting bias resulted in an approximately 5% lower obesity prevalence in both men and women.


Case-Control Studies

Suppose you are a clinician and you have seen a few patients with a certain disease. You observe that many of them have been exposed to a particular agent—biological or chemical. You hypothesize that their exposure is related to their risk of developing this disease. How would you go about confirming or refuting your hypothesis?


Let’s consider a real-life example:


It was long thought that hyperacidity is the cause of peptic ulcer disease (PUD). In 1982, Australian physicians Barry Marshall and Robin Warren discovered Helicobacter pylori (H. pylori) in the stomachs of PUD patients, and showed that H. pylori is able to adapt to the acidic environment of the stomach. However, their observations were not enough to establish the causal association between H. pylori and PUD. Subsequently, they suggested that antibiotics, not antacids, are the effective treatments for PUD, a suggestion that was heavily criticized at that time. It was not until 1994 that the National Institutes of Health reached an expert consensus, based on the available evidence, that detection and eradication of H. pylori are key in the treatment of PUD. Drs. Marshall and Warren were awarded the Nobel Prize in Physiology or Medicine in 2005. 15

To determine the significance of clinical observations in a group of cases reported by physicians, a comparison (sometimes called a control or reference) group is needed. Observations based on case series would have been intriguing, but no firm conclusion would be possible without comparing these observations in cases to those from a series of controls who are similar in most respects to the cases but are free of the disease under study. Comparison is an essential component of epidemiologic investigation and is well exemplified by the case-control study design.

Design of a Case-Control Study

Fig. 7.9 shows the design of a case-control study. To examine the possible relation of an exposure to a certain disease, we identify a group of individuals with that disease (called cases) and, for purposes of comparison, a group of people without that disease (called controls). We then determine what proportion of the cases was exposed and what proportion was not. We also determine what proportion of the controls was exposed and what proportion was not. In the example of children with cataracts, the cases would consist of children with cataracts, and the controls would consist of children without cataracts. For each child, it would then be necessary to ascertain whether or not the mother was exposed to rubella during her pregnancy with that child. We anticipate that if the exposure (rubella) is in fact related to the disease (cataracts), the prevalence of history of exposure among the cases (children with cataracts) will be greater than that among the controls (children with no cataracts). Thus in a case-control study, if there is an association of an exposure with a disease, the prevalence of history of exposure should be higher in persons who have the disease (cases) than in those who do not have the disease (controls).

Table 7.2 presents a hypothetical schema of how a case-control study is conducted. We begin by selecting cases (with the disease) and controls (without the disease), and then measure past exposure by interview or by review of medical or employee records or of results of chemical or biologic assays of blood, urine, or tissues. If the exposure is dichotomous, that is, exposure has either occurred (yes) or not occurred (no), breakdown into four groups is possible. There are a cases who were exposed and c cases who were not exposed. Similarly, there are b controls who were exposed and d controls who were not exposed. Thus the total number of cases is (a + c) and the total number of controls is (b + d). If exposure is associated with disease, we would expect the proportion of the cases who were exposed, a/(a + c), to be greater than the proportion of the controls who were exposed, b/(b + d).

Design of Case-Control Studies
                                First, Select:
                                Cases (With Disease)    Controls (Without Disease)
Then, Measure Past Exposure:
Were exposed                    a                       b
Were not exposed                c                       d
Totals                          a + c                   b + d
Proportions who were exposed    a/(a + c)               b/(b + d)


A hypothetical example of a case-control study is seen in Table 7.3. We are conducting a case-control study of whether smoking is related to CHD. We start with 200 people with CHD (cases) and compare them to 400 people without CHD (controls). If there is a relationship between a lifetime history of smoking and CHD, we would anticipate that a greater proportion of the CHD cases than of the controls would have been smokers (exposed). Let’s say we find that of the 200 CHD cases, 112 were smokers and 88 were nonsmokers. Of the 400 controls, 176 were smokers and 224 were nonsmokers. Thus 56% of CHD cases were smokers compared to 44% of the controls. This calculation is only a first step. Further calculations to determine whether or not there is an association of the exposure with the disease will be discussed later. This chapter focuses exclusively on issues of design in case-control studies.




A Hypothetical Example of a Case-Control Study of CHD and Cigarette Smoking

                           CHD Cases    Controls
Smoke cigarettes           112          176
Do not smoke cigarettes    88           224
Totals                     200          400
% Smoking cigarettes       56           44
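The comparison of exposure proportions in this table can be checked with a few lines of code. The following is a minimal sketch, not part of the original text; the function name `exposure_proportions` is an illustrative assumption, and the cell labels a, b, c, and d follow the schema of Table 7.2.

```python
def exposure_proportions(a, b, c, d):
    """Return (proportion of cases exposed, proportion of controls exposed).

    a = exposed cases,   b = exposed controls,
    c = unexposed cases, d = unexposed controls.
    """
    cases_total = a + c
    controls_total = b + d
    if cases_total == 0 or controls_total == 0:
        raise ValueError("need at least one case and one control")
    return a / cases_total, b / controls_total


# Hypothetical CHD and cigarette smoking counts from the table above.
p_cases, p_controls = exposure_proportions(a=112, b=176, c=88, d=224)
print(f"CHD cases who smoked: {p_cases:.0%}")
print(f"Controls who smoked:  {p_controls:.0%}")
```

This reproduces only the first descriptive step, a/(a + c) versus b/(b + d). It does not test whether the difference reflects an association.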


Parenthetically, it is of interest to note that if we use only the data from a case-control study, we cannot estimate the prevalence of the disease. In this example we had 200 cases and 400 controls, but this does not imply that the prevalence is 33%, or 200/(200 + 400). The decision as to the number of controls to select per case in a case-control study is in the hands of the investigator and does not reflect the prevalence of disease in the population. In this example, the investigator could have selected 200 cases and 200 controls (1 control per case), or 200 cases and 800 controls (4 controls per case). Because the proportion of the entire study population that consists of cases is determined by the ratio of controls per case, and this proportion is determined by the investigator, it clearly does not reflect the true prevalence of the disease in the population in which the study is carried out.
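To make this point concrete, the short sketch below (an illustration added here, with the hypothetical helper name `case_fraction`) computes the share of the study population made up of cases for the three designs mentioned: 1, 2, and 4 controls per case.

```python
def case_fraction(n_cases, controls_per_case):
    """Share of the study sample that consists of cases.

    This is set by the investigator's choice of controls per case,
    so it says nothing about disease prevalence in the population.
    """
    n_controls = n_cases * controls_per_case
    return n_cases / (n_cases + n_controls)


# Same 200 cases, three different design choices.
fractions = {ratio: case_fraction(200, ratio) for ratio in (1, 2, 4)}
for ratio, fraction in fractions.items():
    print(f"{ratio} control(s) per case: cases are {fraction:.0%} of the sample")
```

The same 200 cases make up one half, one third, or one fifth of the sample depending only on the design, which is why this fraction cannot be read as a prevalence.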

At this point, we should emphasize that the hallmark of the case-control study is that it begins with people with the disease (cases) and compares them to people without the disease (controls). This is in contrast to the design of a cohort study that will be discussed in Chapter 8, which begins with a group of exposed people and compares them to an unexposed group. Some people have the erroneous impression that the distinction between the two types of study design is that cohort studies go forward in time and case-control studies go backward in time. Such a distinction is not correct; in fact, it is unfortunate that the term retrospective has been used for case-control studies, because the term incorrectly implies that calendar time is the characteristic that distinguishes case-control from cohort design. As will be shown in an upcoming chapter, a retrospective cohort study also uses data obtained in the past. Thus calendar time is not the characteristic that distinguishes a case-control from a cohort study. What distinguishes the two study designs is whether the study begins with diseased and nondiseased people (case-control study) or with exposed and unexposed people (cohort study). One of the earliest studies of cigarette smoking and lung cancer was conducted by Sir Richard Doll (1912–2005) and Sir Austin Bradford Hill (1897–1991). Doll was an internationally known epidemiologist, and Hill was a well-known statistician and epidemiologist who developed the “Bradford Hill” guidelines for evaluating whether an observed association is causal. 16 Both men were knighted for their scientific work in epidemiology and biostatistics.

Table 7.4 presents data from their frequently cited study of 1,357 males with lung cancer and 1,357 controls according to the average number of cigarettes smoked per day in the 10 years preceding the present illness. 16 We see that there are fewer heavy smokers among the controls and very few nonsmokers among the lung cancer cases, a finding strongly suggestive of an association between smoking and lung cancer. In contrast to the previous example, exposure in this study is not just dichotomized (exposed or not exposed), but the exposure data are further stratified in terms of dose, as measured by the usual number of cigarettes smoked per day. Because many of the environmental exposures about which we are concerned today are not all-or-nothing exposures, the possibility of doing a study and analysis that takes into account the dose of the exposure is very important.



Distribution of 1,357 Male Lung Cancer Patients and a Male Control Group According to Average Number of Cigarettes Smoked Daily Over the 10 Years Preceding Onset of the Current Illness

Average Daily Cigarettes    Lung Cancer Patients    Control Group
0                           7                       61
1–4                         55                      129
5–14                        489                     570
15–24                       475                     431
25–49                       293                     154
50+                         38                      12
Total                       1,357                   1,357
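Because exposure here is graded by dose, it can help to convert the counts in this table to percentage distributions for each group. The sketch below is an added illustration; the variable and function names are assumptions, and the counts are those shown in the table.

```python
# Counts from the table above, by average daily cigarettes.
categories = ["0", "1-4", "5-14", "15-24", "25-49", "50+"]
cases = [7, 55, 489, 475, 293, 38]       # lung cancer patients
controls = [61, 129, 570, 431, 154, 12]  # control group


def percent_distribution(counts):
    """Convert a list of counts to percentages of their total."""
    total = sum(counts)
    return [100 * n / total for n in counts]


case_pct = percent_distribution(cases)
control_pct = percent_distribution(controls)

print(f"{'Cigarettes/day':<16}{'Cases %':>10}{'Controls %':>12}")
for label, pc, pk in zip(categories, case_pct, control_pct):
    print(f"{label:<16}{pc:>10.1f}{pk:>12.1f}")
```

Read this way, nonsmokers are about 0.5% of cases but about 4.5% of controls, while those smoking 25 or more cigarettes a day are about 24% of cases and about 12% of controls. This is the pattern the text describes.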


Potential Biases in Case-Control Studies

Selection Bias

Sources of Cases.

In a case-control study, cases can be selected from a variety of sources, including hospital patients, patients in physicians’ practices, or clinic patients. Many communities maintain registries of patients with certain diseases, such as cancer, and such registries can serve as valuable sources of cases for such studies.


Several problems must be kept in mind when selecting cases for a case-control study. If cases are selected from a single hospital, any risk factors that are identified may be unique to that hospital as a result of referral patterns or other factors, and the results may not be generalizable to all patients with the disease. Consequently, if hospitalized cases are to be used, it is desirable to select the cases from several hospitals in the community. Furthermore, if the hospital from which the cases are drawn is a tertiary care facility, which selectively admits a large number of severely ill patients, any risk factors identified in the study may be risk factors only in persons with severe forms of the disease. In any event, it is essential that in case-control studies, just as in randomized trials, the criteria for eligibility be carefully specified in writing before the study is begun.


Using Incident or Prevalent Cases.


An important consideration in case-control studies is whether to include incident cases of a disease (newly diagnosed cases) or prevalent cases of the disease (people who may have had the disease for some time). The problem with use of incident cases is that we must often wait for new cases to be diagnosed; whereas if we use prevalent cases, which have already been diagnosed, a larger number of cases is often available for study. However, despite this practical advantage of using prevalent cases, it is generally preferable to use incident cases of the disease in case-control studies of disease etiology. The reason is that any risk factors we may identify in a study using prevalent cases may be related more to survival with the disease than to the development of the disease (incidence). If, for example, most people who develop the disease die soon after diagnosis, they will be underrepresented in a study that uses prevalent cases, and such a study is more likely to include longer-term survivors. This would constitute a highly nonrepresentative group of cases, and any risk factors identified with this nonrepresentative group may not be a general characteristic of all patients with the disease, but only of survivors.

Even if we include only incident cases (patients who have been newly diagnosed with the disease) in a case-control study, we will of course be excluding any patients who may have died before the diagnosis was made. There is no easy solution to this problem or to certain other problems in case selection, but it is important that we keep these issues in mind when we finally interpret the data and derive conclusions from the study. At that time, it is critical to take into account possible selection biases that may have been introduced by the study design and by the manner in which the study was conducted.


Selection of Controls

In 1929, Raymond Pearl, professor of biostatistics at Johns Hopkins University in Baltimore, Maryland, conducted a study to test the hypothesis that tuberculosis protected against cancer. 17 From 7,500 consecutive autopsies at Johns Hopkins Hospital, Pearl identified 816 cases of cancer. He then selected a control group of 816 from among the others on whom autopsies had been carried out at Johns Hopkins and determined the percentages of the cases and of the controls who had findings of tuberculosis on autopsy. Pearl’s findings are seen in Table 7.5.



Summary of Data From Pearl’s Study of Cancer and Tuberculosis

                                          Cases (With Cancer)    Controls (Without Cancer)
Total no. of autopsies                    816                    816
No. (%) of autopsies with tuberculosis    54 (6.6)               133 (16.3)

From Pearl R. Cancer and tuberculosis. Am J Hyg. 1929;9:97–159.

Of the 816 autopsies of patients with cancer, 54 had tuberculosis (6.6%), whereas of the 816 controls with no cancer, 133 had tuberculosis (16.3%). From the finding that the prevalence of tuberculosis was considerably higher in the control group (no cancer findings) than in the case group (cancer diagnoses), Pearl concluded that tuberculosis had an antagonistic or protective effect against cancer.


Was Pearl’s conclusion justified? The answer to this question depends on the adequacy of his control group. If the prevalence of tuberculosis in the noncancer patients was similar to that of all people who were free of cancer, his conclusion would be valid. But that was not the case. At the time of the study, tuberculosis was one of the major reasons for hospitalization at Johns Hopkins Hospital. Consequently, what Pearl had inadvertently done in choosing the cancer-free control group was to select a group in which many of the patients had been diagnosed with and hospitalized for tuberculosis. Pearl thought that the control group’s rate of tuberculosis would represent the level of tuberculosis expected in the general population, but because of the way he selected the controls, they came from a pool that was heavily weighted with tuberculosis patients, which did not represent the general population. He was, in effect, comparing the prevalence of tuberculosis in a group of patients with cancer with the prevalence of tuberculosis in a group of patients in which many had already been diagnosed with tuberculosis. Clearly his conclusion was not justified on the basis of these data.


How could Pearl have overcome this problem in his study? Instead of comparing his cancer patients with a group selected from all other autopsied patients, he could have compared the patients with cancer to a group of patients admitted for some specific diagnosis other than cancer (and not tuberculosis). In fact, Carlson and Bell 18 repeated Pearl’s study but compared the patients who died of cancer with patients who died of heart disease at Johns Hopkins Hospital. They found no difference in the prevalence of tuberculosis at autopsy between the two groups. (It is of interest, however, that despite the methodologic limitations of Pearl’s study, bacillus Calmette-Guérin [BCG], a vaccine against tuberculosis, is used today as a form of immunotherapy in several types of cancer.)

The problem with Pearl’s study exemplifies the challenge of selecting appropriate controls as the fundamental component in drawing epidemiologically sound conclusions from case-control studies. Yet it remains one of the most difficult problems we confront in the conduct of epidemiologic studies using the case-control approach. The challenge is this: If we conduct a case-control study and find more exposure in the cases than in the controls, we would like to be able to conclude that there is an association between the exposure and the disease in question. The way the controls are selected is a major determinant of whether such a conclusion is valid.


A fundamental conceptual issue relating to selection of controls is whether the controls should be similar to the cases in all respects other than having the disease in question, or whether they should be representative of all persons without the disease in the population from which the cases are selected. This question has stimulated considerable discussion, but in actuality, the characteristics of the nondiseased people in the population from which the cases are selected are often not known, because the reference population may not be well defined.

Consider, for example, a case-control study using hospitalized cases. We want to identify the reference population that is the source of the cases so that we can then sample this reference population to select controls. Unfortunately, it is usually either not easy or not possible to identify such a reference population for hospitalized patients. Patients admitted to a hospital may come from the surrounding neighborhood, may live farther away in the same city, or may, through a referral process, come from another city or another country. Under these circumstances it is virtually impossible to define a specific reference population from which the cases emerged and from which we might select controls. Nevertheless, we want to design our study so that when it is completed, we can be reasonably certain that if we find a difference in exposure history between cases and controls, there are not likely to be any other important differences between them that might limit the inferences we may derive.


Sources of Controls.

Controls may be selected from nonhospitalized persons living in the community, from outpatient clinics, or from hospitalized patients admitted for diseases other than that for which the cases were admitted.


Use of Nonhospitalized People as Controls.


Nonhospitalized controls may be selected from several sources in the community. Ideally, a probability sample of the total population might be selected, but as a practical matter, this is rarely possible. Other sources include school rosters, registered voters lists, and insurance company lists. Another option is to select, as a control for each case, a resident of a defined area, such as the neighborhood in which the case lives. Such neighborhood controls have been used for many years. In this approach, interviewers are instructed to identify the home of a case as a starting point, and from there walk past a specified number of houses in a specified direction and seek the first household that contains an eligible control. Because of increasing problems of security in urban areas of the United States, however, many people will no longer open their doors to interviewers. Nevertheless, in many other countries, particularly in developing countries, the door-to-door approach to obtaining controls may be ideal.

Because of the difficulties in many cities in the United States in obtaining neighborhood controls using the door-to-door approach, an alternative for selecting such controls is to use telephone survey methods. Among these is random-digit dialing. Because telephone exchanges generally match neighborhood boundaries (being in the same area code), a case’s seven-digit telephone number, of which the first three digits are the exchange, can be used to select a control telephone number, in which the terminal four digits of the phone number are randomly selected and the same three-digit exchange is used. In many developing countries this approach is impractical, as only government offices and business establishments are likely to have telephones. With the nearly universal mobile telephone coverage that now exists almost worldwide, the telephone is an intriguing method of control selection. Nevertheless, many persons screen their calls, and response rates are woefully low in many cases.
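The random-digit dialing step described above (keep the case’s three-digit exchange, randomize the last four digits) can be sketched as follows. This is an illustration only: the function name `random_digit_dial` and the sample number are hypothetical, and a real survey would also need rules for nonworking numbers, business lines, and repeat call attempts.

```python
import random


def random_digit_dial(case_number, rng=random):
    """Generate a candidate control number from a case's 7-digit number.

    The three-digit exchange is kept and the last four digits are
    drawn at random, as in the procedure described in the text.
    """
    digits = "".join(ch for ch in case_number if ch.isdigit())
    if len(digits) != 7:
        raise ValueError("expected a seven-digit telephone number")
    exchange = digits[:3]
    suffix = f"{rng.randrange(10_000):04d}"  # 0000 through 9999
    return f"{exchange}-{suffix}"


# Hypothetical case number; a seeded generator makes the run repeatable.
rng = random.Random(0)
control_number = random_digit_dial("555-0123", rng)
print(control_number)
```

The draw can occasionally return the case’s own number, so an implementation would normally redraw in that situation.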

Another approach to control selection is to use a best friend control. In this approach, a person who has been selected as a case is asked for the name of a best friend who may be more likely to participate in the study knowing that his or her best friend is also participating. However, there are also disadvantages to this method of selecting controls. A best friend control obtained in this fashion may be similar to the case in age and in many other demographic and social characteristics. A resulting problem may be that the controls are too similar to the cases in regard to many variables, including the variables that are being investigated in the study. Sometimes, however, it may be useful to select a spouse or sibling control; a sibling may provide some control over genetic differences between cases and controls.

Use of Hospitalized Patients as Controls.


Hospital inpatients are often selected as controls because of the extent to which they are a “captive population,” easily accessible and clearly identified; it should therefore be relatively more economical to carry out a study using such controls. However, as just discussed, they represent a sample of an ill-defined reference population that usually cannot be characterized and thus to which results cannot be generalized. Moreover, hospital patients differ from people in the community. For example, the prevalence of cigarette smoking is known to be higher in hospitalized patients than in community residents; many of the diagnoses for which people are admitted to the hospital are smoking related.


Given that we generally cannot characterize the reference population from which hospitalized cases come, there is a conceptual attractiveness to comparing hospitalized cases with hospitalized controls from the same institution, who presumably would tend to come from the same reference population (Fig. 7.10). Whatever selection factors in the referral system affected the cases’ admission to a particular hospital would also pertain to the controls. However, referral patterns at the same hospital may differ for various clinical services, so this assumption may be questionable, and it is often impossible to know whether it has been met.

When the decision has been made to use hospital controls, the question arises of whether to use a sample of all other patients admitted to the hospital (other than those with the cases’ diagnosis) or whether to select a specific “another diagnosis” or “other diagnoses.” If we wish to choose specific diagnostic groups, on what basis do we select those groups, and on what basis do we exclude others? The problem is that although it is attractive to select as hospitalized controls a disease group that is obviously unrelated to the putative causative factor under investigation, such controls are unlikely to be representative of the general reference population of noncases. Taken to its logical end, it will not be clear whether it is the cases or the controls who differ from the general population.

The issue of which diagnostic groups would be eligible for use as controls and which would be ineligible (and therefore excluded) is very important. Let’s say we are conducting a case-control study of lung cancer and smoking: we select as cases patients who have been hospitalized with lung cancer, and as controls we select patients who have been hospitalized with emphysema. What problem would this present? Because we know that there is a strong relationship between smoking and emphysema, our controls, the emphysema patients, would include a high number of smokers. Consequently, any relationship of smoking to lung cancer would not be easy to detect in this study, because we would have selected as controls a group of persons in which there is a greater-than-expected prevalence of smoking than exists in the population. We might therefore want to exclude from our control group those persons who have other smoking-related diagnoses, such as CHD, bladder cancer, pancreatic cancer, and emphysema. Such exclusions might yield a control group with a lower-than-expected prevalence of smoking, and the exclusion process becomes overly complex. One alternative is to not exclude any groups from selection as controls in the design of the study, but to analyze the study data separately for different diagnostic subgroups that constitute the control group. This, of course, will drive up the numbers of controls necessary and the expense that accompanies a larger sample size.

Problems in Control Selection.


In a classic study published in 1981, the renowned epidemiologist Brian MacMahon and coworkers 19 reported a case-control study of cancer of the pancreas. The cases were patients with a histologically confirmed diagnosis of pancreatic cancer in 11 Boston and Rhode Island hospitals from 1974 to 1979. Controls were selected from patients who were hospitalized at the same time as the cases; they were selected from other inpatients hospitalized by the attending physicians who had hospitalized the cases. Excluded were nonwhites; those older than 79 years; patients with pancreatic, hepatobiliary tract, and smoking-related or alcohol-related diseases; and patients with cardiovascular disease, diabetes, respiratory or bladder cancer, and peptic ulcer. However, the authors did not exclude patients with other kinds of gastrointestinal diseases, such as diaphragmatic hernia, reflux, gastritis, and esophagitis.


One finding in this study was an apparent dose-response relationship between coffee drinking and cancer of the pancreas, particularly in women (Table 7.6). When such a relationship is observed, it is difficult to know whether the disease is caused by the coffee drinking or by some factor closely related to the coffee drinking. Because smoking is a known risk factor for cancer of the pancreas, and because coffee drinking was closely related to cigarette smoking at that time (it was rare to find a smoker who did not drink coffee), did MacMahon and others observe an association of coffee drinking with pancreatic cancer because the coffee caused the pancreatic cancer, or because coffee drinking is related to cigarette smoking, and cigarette smoking is known to be a risk factor for cancer of the pancreas? Recognizing this problem, the authors analyzed the data after stratifying for smoking history. The relationship with coffee drinking held both for current smokers and for those who had never smoked (Table 7.7).




Distribution of Cases and Controls by Coffee-Drinking Habits and Estimates of Risk Ratios

                                         Coffee Drinking (Cups/Day)
Sex    Category                    0       1–2        3–4        ≥5         Total
M      No. of cases                9       94         53         60         216
       No. of controls             32      119        74         82         307
       Adjusted relative risk a    1.0     2.6        2.3        2.6        2.6
       95% confidence interval     —       1.2–5.5    1.0–5.3    1.2–5.8    1.2–5.4
F      No. of cases                11      59         53         28         151
       No. of controls             56      152        80         48         336
       Adjusted relative risk a    1.0     1.6        3.3        3.1        2.3
       95% confidence interval     —       0.8–3.4    1.6–7.0    1.4–7.0    1.2–4.6

a Chi-square (Mantel extension) with equally spaced scores, adjusted over age in decades: 1.5 for men, 13.7 for women. Mantel-Haenszel estimates of risk ratios, adjusted over categories of age in decades. In all comparisons, the referent category was subjects who never drank coffee.




Estimates of Relative Risk a of Cancer of the Pancreas Associated With Use of Coffee and Cigarettes

                                  Coffee Drinking (Cups/Day)
Cigarette Smoking Status    0      1–2              ≥3               Total b
Never smoked                1.0    2.1              3.1              1.0
Ex-smokers                  1.3    4.0              3.0              1.3
Current smokers             1.2    2.2              4.6              1.2 (0.9–1.8)
Total a                     1.0    1.8 (1.0–3.0)    2.7 (1.6–4.7)

Values in parentheses are 95% confidence intervals of the adjusted estimates.


a The referent category is the group that uses neither cigarettes nor coffee. Estimates are adjusted for sex and age in decades.


b Values are adjusted for the other variables, in addition to age and sex, and are expressed in relation to the lowest category of each variable.

This report aroused great interest in both the scientific and lay communities, particularly among coffee manufacturers. Given the widespread exposure of human beings to coffee, if the reported relationship were true, it would have major public health implications.


Let’s examine the design of this study. The cases were white patients with cancer of the pancreas at 11 Boston and Rhode Island hospitals. The controls are of particular interest: After some exclusions, they were patients with other diseases who were hospitalized by the same physicians who had admitted the pancreatic cancer cases. That is, when a case had been identified, the attending physician was asked if another of his or her patients who was hospitalized at the same time for another condition could be interviewed as a control. This unusual method of control selection had a practical advantage: One of the major obstacles in obtaining participation of hospital controls in case-control studies is that permission to contact the patient is usually requested of the attending physician. The physicians are often not motivated to have their patients serve as controls, because the patients do not have the disease that is the focus of the study. By asking physicians who had already given permission for patients with pancreatic cancer to participate, the likelihood was increased that permission would be granted for patients with other diseases to participate as controls.

Did that practical decision introduce any problems? The underlying question that the investigators wanted to answer was whether patients with cancer of the pancreas drank more coffee than people without cancer of the pancreas in the same population (Fig. 7.11). What MacMahon and coworkers found was that the level of coffee drinking in cases was greater than the level of coffee drinking in controls.

The investigators would like to be able to establish that the level of coffee drinking observed in the controls is what would be expected in the general population without pancreatic cancer and that cases therefore demonstrate excessive coffee drinking (Fig. 7.12A). But the problem is this: Which physicians are most likely to admit patients with cancer of the pancreas to the hospital? Gastroenterologists are often the admitting physicians. Many of their other hospitalized patients (who served as controls) also have gastrointestinal problems, such as esophagitis and gastritis (as mentioned previously, patients with peptic ulcer were excluded from the control group). So, in this study, the persons who served as controls may very well have reduced their intake of coffee, either because of a physician’s instructions or because of their own realization that reducing their coffee intake could relieve their symptoms. We cannot assume that the controls’ levels of coffee drinking are representative of the level of coffee drinking expected in the general population; their rate of coffee drinking may be abnormally low. Thus the observed difference in coffee drinking between pancreatic cancer cases and controls may not necessarily have been the result of cases drinking more coffee than expected, but rather of the controls drinking less coffee than expected (see Fig. 7.12B).


FIG. 7.12 Interpreting the results of a case-control study of coffee drinking and pancreatic cancer. (A) Is the lower level of coffee drinking in the controls the expected level of coffee drinking in the general population? OR (B) Is the higher level of coffee drinking in the cases the expected level of coffee drinking in the general population?

MacMahon and his colleagues subsequently repeated their analysis but separated controls with gastrointestinal illness from controls with other conditions. They found that the risk associated with coffee drinking was indeed higher when the comparison was with controls with gastrointestinal illness but that the relationship between coffee drinking and pancreatic cancer persisted, albeit at a lower level, even when the comparison was with controls with other illnesses. This became a classic example of how problematic selection of controls can distort the interpretation of a case-control study. Several years later, Hsieh and coworkers reported a new study that attempted to replicate these results; it did not support the original findings. 20


In summary, when a difference in exposure is observed between cases and controls, we must ask whether the level of exposure observed in the controls is really the level expected in the population in which the study was carried out or whether, perhaps given the manner of selection, the controls may have a particularly high or low level of exposure that might not be representative of that population.

Information Bias

Problems of Recall.

A major problem in case-control studies is that of recall of a history of past exposure. Recall problems are of two types: limitations in recall and recall bias. Recall bias is the main form of information bias in case-control studies. The problem of recall is not limited to the case-control study design. Most epidemiologic studies inquire about life histories and are thus subject to recall biases. Survey research has identified many ways to mitigate the amount of bias associated with interviewing participants about events in their lives. However, many study participants forget about exposures or other events, tend to bring events that happened long ago forward in time (“telescoping”), and may be reluctant to admit to practices that might be considered stigmatizing.


Limitations in Recall.

Much of the exposure information in case-control studies is collected by interviewing subjects. Because virtually all human beings are limited to varying degrees in their ability to recall information, limitations in recall are an important issue in such studies. A related but somewhat different issue is that persons being interviewed may simply not have the information being requested.


This was demonstrated years ago in an historic study carried out by Abraham Lilienfeld and Saxon Graham published in 1958. 21 At that time, considerable interest centered on the observation that cancer of the cervix was highly unusual in two groups of women: Jewish women and Catholic nuns. This observation suggested that an important risk factor for cervical cancer could be sexual intercourse with an uncircumcised man, and a number of studies were carried out to confirm this hypothesis. However, the authors were skeptical about the validity of the responses regarding circumcision status. To address this question they asked a group of men whether or not they had been circumcised. The men were then examined by a physician. As seen in Table 7.8, of the 56 men who stated they were circumcised, 19, or 33.9%, were found to be uncircumcised. Of the 136 men who stated they were not circumcised, 47, or 34.6%, were found to be circumcised. These data demonstrate that the findings from studies using interview data may not always be clear-cut.




TABLE 7.8

Comparison of Patients’ Statements With Examination Findings Concerning Circumcision Status, Roswell Park Memorial Institute, Buffalo, New York

                       Patients’ Statements Regarding Circumcision
                       Yes                  No
Examination Finding    No.       %          No.       %
Circumcised            37        66.1       47        34.6
Not circumcised        19        33.9       89        65.4
Total                  56        100.0      136       100.0

Modified from Lilienfeld AM, Graham S. Validity of determining circumcision status by questionnaire as related to epidemiologic studies of cancer of the cervix. J Natl Cancer Inst. 1958;21:713–720.

Table 7.9 shows more recent data (2002) regarding the relationship of self-reported circumcision to actual circumcision status. These data suggest that men have improved in their knowledge and reporting of their circumcision status, or the differences observed may be due to the studies having been conducted in different countries. There may also have been methodological differences, which could have accounted for the different results between the two studies.



TABLE 7.9

Comparison of Patients’ Statements With Physicians’ Examination Findings Concerning Circumcision Status in the Study of Circumcision, Penile Human Papillomavirus, and Cervical Cancer

                                 Patients’ Statements Regarding Circumcision
                                 Yes                  No
Physician Examination Findings   No.       %          No.       %
Circumcised                      282       98.3       37        7.4
Not circumcised                  5         1.7        466       92.6
Total                            287       100.0      503       100.0

Modified from Castellsague X, Bosch FX, Munoz N, et al. Male circumcision, penile human papillomavirus infection, and cervical cancer in female partners. N Engl J Med. 2002;346:1105–1112.
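The comparison between the two studies can be checked with a short calculation. The sketch below uses only the counts printed in Tables 7.8 and 7.9; the function and variable names are illustrative assumptions, not part of either study.

```python
# Counts copied from Tables 7.8 and 7.9. For each self-report ("yes" or "no"),
# the tuple holds (found circumcised, found not circumcised) on examination.
tables = {
    "Lilienfeld and Graham, 1958": {"yes": (37, 19), "no": (47, 89)},
    "Castellsague et al., 2002": {"yes": (282, 5), "no": (37, 466)},
}


def contradicted_share(table):
    """Fraction of each self-report group that the examination contradicted."""
    yes_circumcised, yes_not_circumcised = table["yes"]
    no_circumcised, no_not_circumcised = table["no"]
    return {
        "yes": yes_not_circumcised / (yes_circumcised + yes_not_circumcised),
        "no": no_circumcised / (no_circumcised + no_not_circumcised),
    }


results = {name: contradicted_share(table) for name, table in tables.items()}

for name, shares in results.items():
    print(
        f"{name}: said yes but uncircumcised {shares['yes']:.1%}, "
        f"said no but circumcised {shares['no']:.1%}"
    )
```

In the 1958 data roughly one third of statements in each group were contradicted by examination; in the 2002 data the corresponding shares are under 8%.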

If a limitation of recall regarding exposure affects all subjects in a study to the same extent, regardless of whether they are cases or controls, a misclassification of exposure status may result. Some of the cases or controls who were actually exposed will be erroneously classified as unexposed, and some who were actually not exposed will be erroneously classified as exposed. For exposures that have only two categories (e.g., “yes” vs. “no”), this leads to an underestimate of the true risk of the disease associated with the exposure (that is, there will be a tendency to bias the results toward a null finding).
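A small numerical sketch may make the bias toward the null concrete. All counts, and the recall accuracy figures (labeled sensitivity and specificity here), are hypothetical values chosen for illustration; the odds ratio is used as a standard measure of association for case-control data and is not computed in the passage itself.

```python
def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """Odds ratio from the four cells of a case-control 2x2 table."""
    return (cases_exposed * controls_unexposed) / (cases_unexposed * controls_exposed)


def apply_misclassification(exposed, unexposed, sensitivity, specificity):
    """Expected counts classified as exposed and unexposed under imperfect recall."""
    classified_exposed = exposed * sensitivity + unexposed * (1 - specificity)
    classified_unexposed = exposed * (1 - sensitivity) + unexposed * specificity
    return classified_exposed, classified_unexposed


# Hypothetical true exposure counts: (exposed, unexposed).
true_cases = (60, 40)
true_controls = (30, 70)

# Hypothetical recall accuracy, applied equally to cases and controls.
SENSITIVITY = 0.8  # share of truly exposed subjects who report exposure
SPECIFICITY = 0.9  # share of truly unexposed subjects who report no exposure

true_or = odds_ratio(*true_cases, *true_controls)
observed_cases = apply_misclassification(*true_cases, SENSITIVITY, SPECIFICITY)
observed_controls = apply_misclassification(*true_controls, SENSITIVITY, SPECIFICITY)
observed_or = odds_ratio(*observed_cases, *observed_controls)

print(f"True odds ratio:     {true_or:.2f}")
print(f"Observed odds ratio: {observed_or:.2f}")
```

Because the same recall error is applied to both groups, the observed odds ratio falls between 1 and the true value, which is the underestimate described above.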

Recall Bias.

A more serious potential problem in case-control studies is that of recall bias. Suppose that we are studying the possible relationship of congenital malformations to prenatal infections. We conduct a case-control study and interview mothers of children with congenital malformations (cases) and mothers of children without malformations (controls). Each mother is questioned about infections she may have had during the pregnancy.


A mother who has had a child with a birth defect often tries to identify some unusual event that occurred during her pregnancy with that child. She wants to know whether the abnormality was caused by something she did. Why did it happen? Such a mother may even recall an event, such as a mild respiratory infection, that a mother of a child without a birth defect may not even notice or may have forgotten entirely. This type of bias is known as recall bias; Ernst Wynder, a well-known epidemiologist, also called it “rumination bias.”

In the study just mentioned, let’s assume that the true infection rate during pregnancy in mothers of malformed infants and in mothers of normal infants is 15%—that is, there is no difference in infection rates. Suppose that mothers of malformed infants recall 60% of any infections they had during pregnancy, and mothers of normal infants recall only 10% of infections they had during pregnancy. As seen in Table 7.10, the apparent infection rate estimated from this case-control study using interviews would be 9% for mothers of malformed infants and 1.5% for mothers of control infants. Thus the differential recall between cases and controls introduces a recall bias into the study that could artifactually suggest a relation of congenital malformations and prenatal infections. Although a potential for recall bias is self-evident in case-control studies, in point of fact, few actual examples demonstrate that recall bias has been a major problem in case-control studies and has led to erroneous conclusions regarding associations. The small number of examples available could reflect infrequent occurrence of such bias, or the fact that the data needed to clearly demonstrate the existence of such bias in a certain study are frequently not available. Nevertheless, the potential problem cannot be disregarded, and the possibility for such bias must always be kept in mind.

TABLE 7.10

Example of an Artificial Association Resulting From Recall Bias: A Hypothetical Study of Maternal Infections During Pregnancy and Congenital Malformations

                                                   Cases (With Congenital    Controls (Without Congenital
                                                   Malformations)            Malformations)
Assume That:
  True incidence of infection (%)                  15                        15
  Infections recalled (%)                          60                        10
Result Will Be:
  Infection rate as ascertained by interview (%)   9.0                       1.5
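The arithmetic behind Table 7.10 can be reproduced in a few lines. The rates are those assumed in the text; the odds ratio at the end is an added illustration of how large the spurious association would appear, not a figure from the table.

```python
# Assumptions stated in the text and in Table 7.10.
true_infection_rate = 0.15  # identical in cases and controls
recall_in_cases = 0.60      # share of infections recalled by mothers of cases
recall_in_controls = 0.10   # share of infections recalled by mothers of controls

# Apparent infection rate = true rate x proportion of infections recalled.
apparent_rate_cases = true_infection_rate * recall_in_cases
apparent_rate_controls = true_infection_rate * recall_in_controls


def odds(probability):
    """Convert a probability to odds."""
    return probability / (1 - probability)


# Added illustration: the odds ratio an analyst would compute from the interviews.
apparent_odds_ratio = odds(apparent_rate_cases) / odds(apparent_rate_controls)

print(f"Apparent rate in cases:    {apparent_rate_cases:.1%}")
print(f"Apparent rate in controls: {apparent_rate_controls:.1%}")
print(f"Apparent odds ratio:       {apparent_odds_ratio:.1f}")
```

Although the true infection rates are equal, so that the true odds ratio is 1, differential recall alone produces an apparent odds ratio above 6.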

Other Issues in Case-Control Studies


A major concern in conducting a case-control study is that cases and controls may differ in characteristics or exposures other than the one that has been targeted for study. If more cases than controls are found to have been exposed, we may be left with the question of whether the observed association could be due to differences between the cases and controls in factors other than the exposure being studied. For example, if more cases than controls are found to have been exposed, and if most of the cases are of low income and most of the controls are of high income, we would not know whether the factor determining development of disease is exposure to the factor being studied or another characteristic associated with having low income. To avoid such a situation, we would like to ensure that the distribution of the cases and controls by socioeconomic status is similar, so that a difference in exposure will likely constitute the critical difference, and the presence or absence of disease is not likely to be attributable to a difference in socioeconomic status.

One approach to dealing with this problem in the design and conduct of the study is to match the cases and controls for factors about which we may be concerned, such as income, as in the preceding example. Matching is defined as the process of selecting the controls so that they are similar to the cases in certain characteristics, such as age, race, sex, socioeconomic status, and occupation. Matching may be of two types: (1) group matching and (2) individual matching. It is very important to distinguish between the two types, since each has its own implications for the statistical analysis of the case-control study, which is not discussed in this book.

Group Matching.

Group matching (or frequency matching) consists of selecting the controls in such a manner that the proportion of controls with a certain characteristic is identical to the proportion of cases with the same characteristic. Thus if 25% of the cases are married, the controls will be selected so that 25% of that group is also married. This type of selection generally requires that all of the cases be selected first. After calculations are made of the proportions of certain characteristics in the group of cases, then a control group, in which the same characteristics occur in the same proportions, is selected. In general, when group matching, we never achieve exactly the same proportions of the key characteristic in cases and controls. When group matching is done for age, for example, the distribution that is the same in cases and controls is of the age groups (e.g., 45 to 49, 50 to 54); within each group, however, there may still be differences between cases and controls that must be considered: for example, although 10% of cases and controls are 50 to 54 years old, there may be a higher proportion of cases closer to age 54 than that of controls.
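The selection procedure for group matching can be sketched as follows. This is a minimal illustration under stated assumptions: the subjects are invented, the helper name frequency_match is hypothetical, and controls are drawn at random within each stratum, which is only one of several ways the selection could be done.

```python
import random
from collections import Counter


def frequency_match(cases, pool, key, n_controls, seed=0):
    """Draw controls so each stratum has the same proportion as among the cases."""
    rng = random.Random(seed)
    case_counts = Counter(key(case) for case in cases)
    controls = []
    for stratum, count in case_counts.items():
        # Number of controls needed in this stratum to mirror the cases.
        quota = round(n_controls * count / len(cases))
        candidates = [person for person in pool if key(person) == stratum]
        if len(candidates) < quota:
            raise ValueError(f"Not enough potential controls in stratum {stratum!r}")
        controls.extend(rng.sample(candidates, quota))
    return controls


# Hypothetical data: 25% of the 20 cases are married, as in the example in the text.
cases = [{"married": i < 5} for i in range(20)]
# Hypothetical pool of 200 potential controls, half of them married.
pool = [{"married": i % 2 == 0} for i in range(200)]

controls = frequency_match(cases, pool, key=lambda p: p["married"], n_controls=40)
married_share = sum(c["married"] for c in controls) / len(controls)
print(f"Married among controls: {married_share:.0%}")
```

Note that the cases must all be known before the quotas can be computed, which is why this type of selection generally requires that the cases be selected first.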

Individual Matching.

A second type of matching is individual matching (or matched pairs). In this approach, for each case selected for the study, a control is selected who is similar to the case in terms of the specific variable or variables of concern. For example, if the first case enrolled in our study is a 45-year-old white woman, we will seek a 45-year-old white female control. If the second case is a 24-year-old black man, we will select a control who is also a 24-year-old black man. This type of control selection yields matched case-control pairs—that is, each case is individually matched to a control. In our hypothetical case, we would absolutely match the cases by gender and race/ethnicity, but we might use a 3- or 5-year bound for age. Thus we might match a 45-year-old white woman with a 42- to 48-year-old white woman control. The implications of this method of control selection for the estimation of excess risk are discussed in Chapter 12.

Individual matching is often used in case-control studies that use hospital controls. The reason for this is more practical than conceptual. Let’s say that sex and age are considered important variables, and it is thought to be important that the cases and the controls be comparable in terms of these two characteristics. There is generally no practical way to dip into a pool of hospital patients to select a group with certain sex and age characteristics. Rather, it is easier to identify a case and then choose the next hospital admission that matches the case for sex and age. Thus individual matching is most expedient in studies using hospital controls.
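Individual matching with an age bound can be sketched in the same spirit. The subjects, field names, and the helper match_individually are hypothetical; taking the first eligible person in the list loosely mirrors choosing the next hospital admission that matches the case.

```python
def match_individually(cases, pool, caliper_years=3):
    """Pair each case with the first unused control of the same sex and race
    whose age is within the caliper. Unmatched cases are paired with None."""
    used = set()
    pairs = []
    for case in cases:
        match = None
        for index, candidate in enumerate(pool):
            if index in used:
                continue
            if (
                candidate["sex"] == case["sex"]
                and candidate["race"] == case["race"]
                and abs(candidate["age"] - case["age"]) <= caliper_years
            ):
                used.add(index)
                match = candidate
                break
        pairs.append((case, match))
    return pairs


# The two cases from the example in the text.
cases = [
    {"age": 45, "sex": "F", "race": "white"},
    {"age": 24, "sex": "M", "race": "black"},
]
# Hypothetical sequence of potential controls, in order of admission.
pool = [
    {"age": 50, "sex": "F", "race": "white"},  # outside the 3-year age bound
    {"age": 43, "sex": "F", "race": "white"},
    {"age": 24, "sex": "F", "race": "black"},  # wrong sex for the second case
    {"age": 26, "sex": "M", "race": "black"},
]

pairs = match_individually(cases, pool)
for case, control in pairs:
    print(case, "->", control)
```

Sex and race are matched exactly, while age is matched within a 3-year bound, as in the example of matching a 45-year-old woman with a control aged 42 to 48.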


What are the problems with matching? The problems with matching are of two types: practical and conceptual.

Practical Problems With Matching.

If an attempt is made to match according to too many characteristics, it may prove difficult or impossible to identify an appropriate control. For example, suppose that it is decided to match each case for race, sex, age, marital status, number of children, ZIP code of residence, and occupation. If the case is a 48-year-old black woman who is married, has four children, lives in ZIP code 21209, and works in a photo-processing plant, it may prove difficult or impossible to find a control who is similar to the case in all of these characteristics. Therefore the more variables on which we choose to match, the more difficult it will be to find a suitable control. Overmatching also leads to an inability to statistically analyze variables used in matching, as we address next.

Conceptual Problems With Matching.

Perhaps a more important problem is the conceptual one: Once we have matched controls to cases according to a given characteristic, we cannot study that characteristic. For example, suppose we are interested in studying marital status as a risk factor for breast cancer. If we match the cases (breast cancer) and the controls (no breast cancer) for marital status, we can no longer study whether or not marital status is a risk factor for breast cancer. Why not? Because in matching according to marital status, we have artificially established an identical proportion in cases and controls: if 35% of the cases are married, and through matching we create a control group in which 35% are also married, we have artificially ensured that the proportion of married subjects will be identical in both groups. By using matching to impose comparability for a certain factor, we ensure the same prevalence of that factor in the cases and the controls. Clearly we will not be able to ask whether cases differ from controls in the prevalence of that factor. We would therefore not want to match on the variable of marital status in this study. Indeed, we do not want to match on any variable that we may wish to explore in our study.

It is also important to recognize that unplanned matching may inadvertently occur in case-control studies. For example, if we use neighborhood controls, we are in effect matching for socioeconomic status as well as for cultural and other characteristics of a neighborhood. If we use best-friend controls, it is likely that the case and his or her best friend share many lifestyle characteristics, which in effect produces a match for these characteristics. For example, in a study of oral contraceptive use and cervical cancer in which best-friend controls were considered, there was concern that if the case used oral contraceptives it might well be that her best friend would also be likely to be an oral contraceptive user. The result would be an unplanned matching on oral contraceptive use, so that this variable could no longer be investigated in this study. Another and less subtle example would be to match cases and controls on residence when doing a study of the relationship of air pollution to respiratory disease. Unplanned matching on a variable that is strongly related to the exposure being investigated in the study is called overmatching.


In carrying out a case-control study, therefore, we match only on variables that we are convinced are risk factors for the disease, which we are therefore not interested in investigating in this study.

Use of Multiple Controls

Early in this chapter, we noted that the investigator can determine how many controls will be used per case in a case-control study and that multiple controls for each case are frequently used. Matching 2 : 1, 3 : 1 or 4 : 1 will increase the statistical power of our study. Therefore many case-control studies will have more controls than cases. These controls may be either (1) controls of the same type or (2) controls of different types, such as hospital and neighborhood controls or controls with different diseases.

Controls of the Same Type.

Multiple controls of the same type, such as two controls or three controls for each case, are used to increase the power of the study. Practically speaking, a noticeable increase in power is gained only up to a ratio of about 1 case to 4 controls. One might ask, “Why use multiple controls for each case? Why not keep the ratio of controls to cases at 1 : 1 and just increase the number of cases?” The answer is that for many of the relatively infrequent diseases we study (which are best studied using case-control designs), there may be a limit to the number of potential cases available for study. A clinic may see only a certain number of patients with a given cancer or with a certain connective tissue disorder each year. Because the number of cases cannot be increased without either extending the study in time to enroll more cases or developing a collaborative multicenter study, the option of increasing the number of controls per case is often chosen. These controls are of the same type (e.g., neighborhood controls); only the ratio of controls to cases has changed.
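The diminishing return from adding controls can be illustrated with a commonly cited approximation that is not given in the text: with r controls per case, the statistical efficiency relative to an unlimited supply of controls is roughly r / (r + 1) when the association is weak. The sketch below assumes that approximation and is meant only as a rough guide.

```python
def relative_efficiency(controls_per_case):
    """Approximate efficiency of r controls per case relative to unlimited
    controls, using the rule of thumb r / (r + 1)."""
    return controls_per_case / (controls_per_case + 1)


previous = 0.0
for r in range(1, 7):
    efficiency = relative_efficiency(r)
    print(f"{r} control(s) per case: efficiency {efficiency:.3f} "
          f"(gain {efficiency - previous:.3f})")
    previous = efficiency
```

Under this approximation, the gain from each additional control shrinks quickly, and beyond four controls per case each further control adds only a few percentage points.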

Multiple Controls of Different Types.

In contrast, we may choose to use multiple controls of different types. For example, we may be concerned that the exposure of the hospital controls used in our study may not represent the rate of exposure that is “expected” in a population of nondiseased persons—that is, the controls may be a highly selected subset of nondiseased individuals and may have a different exposure experience. We mentioned earlier that hospitalized patients smoke more than people living in the community, and we are concerned because we do not know what the prevalence level of smoking in hospitalized controls represents or how to interpret a comparison of these rates with those of the cases. To address this problem, we may choose to use an additional control group, such as neighborhood controls. The hope is that the results obtained when cases are compared with hospital controls will be similar to the results obtained when cases are compared with neighborhood controls. If the findings differ, the reason for the discrepancy should be sought. In using multiple controls of different types, the investigator should ideally decide which comparison will be considered the “gold standard of truth” before embarking on the actual study.

In 1979, Ellen Gold and coworkers published a case-control study of brain tumors in children. 22 They used two types of controls: children with no cancer (called normal controls) and children with cancers other than brain tumors (called cancer controls; Fig. 7.13). What was the rationale for using these two control groups?

FIG. 7.13 Study groups of Gold et al. for brain tumors in children. (Data from Gold EB, Gordis L, Tonascia J, et al. Risk factors for brain tumors in children. Am J Epidemiol. 1979;109:309–319.)

Let’s consider the question, “Did mothers of children with brain tumors have more prenatal radiation exposure than control mothers?” Some possible results are seen in Fig. 7.14A.

FIG. 7.14 Rationale for using two control groups: (A) Radiation exposure is the same in both brain tumor cases and in other cancer controls, but is higher in both groups than in normal controls: Could this be due to recall bias? (B) Radiation exposure in other cancer controls is the same as in normal controls, but is lower than in brain tumor cases: recall bias is unlikely. (Data from Gold EB, Gordis L, Tonascia J, et al. Risk factors for brain tumors in children. Am J Epidemiol. 1979;109:309–319.)

If the radiation exposure of mothers of children with brain tumors is found to be greater than that of mothers of normal controls, and the radiation exposure of mothers of children with other cancers is also found to be greater than that of mothers of normal children, what are the possible explanations? One conclusion might be that prenatal radiation is a risk factor both for brain tumors and for other cancers—that is, its effect is that of a carcinogen that is not site specific. Another explanation to consider is that the findings could have resulted from recall bias and that mothers of children with any type of cancer recall prenatal radiation exposure better than mothers of normal children.


Consider another possible set of findings, shown in Fig. 7.14B. If mothers of children with brain tumors have a greater radiation exposure history than both mothers of normal controls and mothers of children with other cancers, the findings might suggest that prenatal radiation is a specific carcinogen for the brain. These findings would also reduce the likelihood that recall bias is playing a role, as it would seem implausible that mothers of children with brain tumors would recall prenatal radiation better than mothers of children with other cancers. Thus multiple controls of different types can be valuable for exploring alternate hypotheses and for taking into account possible potential biases, such as recall bias.

Despite the issues raised in this chapter, case-control studies are invaluable in exploring the etiology of disease. Recent reports in the literature demonstrate the utility of the case-control study design in contemporary research.


Kristian Filion and colleagues in Canada addressed the concern that a common class of antidiabetic drugs (incretin-based drugs used in clinical practice) is associated with increased risk of heart failure. 23 Prior reports from clinical trials had been inconsistent. The investigators pooled health care data from four Canadian provinces, the United States, and the United Kingdom and conducted a case-control study in which each patient who was hospitalized for heart failure was matched with 20 controls. Matching criteria included age, sex, time entered into the study, how long diabetes had been treated, and how long patients with diabetes were under observation. Of almost 1.5 million total patients, almost 30,000 were hospitalized for heart failure. Incretin-based medications were not found to increase hospitalization for heart failure when compared with oral antidiabetic drugs.

Another example of the utility of the case-control study is given by Su and colleagues at the University of Michigan, who evaluated the association of occupational and environmental exposures with the risk of developing amyotrophic lateral sclerosis (ALS, commonly known as Lou Gehrig’s disease, a progressive neurological disease that affects the neurons in the brain and spinal cord responsible for controlling voluntary muscle movement). 24 Cases were identified at a tertiary referral center for ALS between 2011 and 2014. Cases consisted of 156 ALS patients; 128 controls were selected from volunteers who responded to online postings. Controls, who were frequency matched to cases by age, gender, and education, self-reported that they were free of neurodegenerative disease and had no first- or second-degree relatives with ALS. A questionnaire ascertained occupational and environmental exposures. Blood concentrations were assessed for 122 common pollutants. Overall, 101 cases and 110 controls had complete demographic and pollutant data. From the occupational history, military service was associated with ALS. Self-reported pesticide exposure was associated with fivefold increased odds of ALS. When controlling for other possible factors that might be associated with ALS, three exposures measured in the blood were identified: occupational exposures to pesticides and to polychlorinated biphenyls (PCBs) in farming and fishing industries. The authors concluded that persistent environmental pollutants as measured in the blood were significantly associated with ALS and suggested that reducing exposure to these agents might reduce the incidence of ALS at the population level.


A final example of the usefulness of the case-control study relates to its use during a disease outbreak. In a study addressing the association of Guillain-Barré syndrome with Zika virus infection in French Polynesia in 2013–2014, Cao-Lormeau and colleagues noted that during the Zika outbreak, there was an increase in reports of Guillain-Barré syndrome suggestive of a possible relationship. Forty-two patients admitted to the main referral hospital in Papeete, Tahiti, meeting the diagnostic criteria for Guillain-Barré were matched to two types of controls: (1) age-, sex-, and residence-matched patients without fever seen at the facility (n = 98), and (2) age-matched patients with acute Zika free of neurologic symptoms (n = 70). Of the 42 patients with Guillain-Barré syndrome, 98% (41/42) had antibodies against the Zika virus, compared with 56% of controls. All patients in control group 2 had positive confirmation for the Zika virus. The authors concluded that their study provides evidence for Zika virus infection “causing” Guillain-Barré syndrome. This claim seems to go a bit beyond the evidence, as we will see in the next section and as is reiterated in subsequent chapters.

When Is a Case-Control Study Warranted?

A case-control study is useful as a first step when searching for a cause of an adverse health outcome, as seen in the examples at the beginning of this chapter and those just presented. At an early stage in our search for an etiology, we may suspect any one of several exposures, but we may not have evidence, and certainly no strong evidence, to suggest an association of any one of the suspect exposures with the disease in question. Using the case-control design, we compare people with the disease (cases) and people without the disease (controls; Fig. 7.15A). We can then explore the possible roles of a variety of exposures or characteristics in causing the disease (see Fig. 7.15B). If the exposure is associated with the disease, we would expect the proportion of cases who have been exposed to be greater than the proportion of controls who have been exposed (see Fig. 7.15C). When such an association is documented in a case-control study, the next step is often to carry out a cohort study to further elucidate the relationship. Because case-control studies are generally less expensive than cohort studies and can be carried out more quickly, they are often the first step in determining whether an exposure is linked to an increased risk of disease.

FIG. 7.15 Design of a case-control study. (A) Start with the cases and the controls. (B) Measure past exposure in both groups. (C) Expected findings if the exposure is associated with the disease.

Case-control studies are also valuable when the disease being investigated is rare. It is often possible to identify cases for study from disease registries, hospital records, or other sources. In contrast, if we conduct a cohort study for a rare disease, an extremely large study population may be needed in order to observe a sufficient number of individuals in the cohort develop the disease in question. In addition, depending on the length of the interval between exposure and development of disease, a cohort design may involve many years of follow-up of the cohort and considerable logistical difficulty and expense in maintaining and following the cohort over the study period.

Case-Crossover Design

The case-crossover design is primarily used for studying the etiology of acute outcomes such as MIs or deaths from acute events in situations where the suspected exposure is transient and its effect occurs over a short time. This type of design has been used in studying exposures such as air pollution characterized by rapid and transient increases in particulate matter. In this type of study, a case is identified (e.g., a person who has suffered an MI) and the level of the environmental exposure, such as level of particulate matter, is ascertained for a short time period preceding the event (the at-risk period). This level is compared with the level of exposure in a control time period that is more remote from the event. Thus each person who is a case serves as his own control, with the period immediately before his adverse outcome being compared with a “control” period at a prior time when no adverse outcome occurred. Importantly, in this type of study, there is inherent matching for variables that do not change (e.g., genetic factors) or variables that only change within a reasonably long period (e.g., height). The question being asked is: Was there any difference in exposure between the time period immediately preceding the outcome and a time period in the more remote past that was not immediately followed by any adverse health effect?

Let’s look at a very small hypothetical 4-month case-crossover study of air pollution and MI (Fig. 7.16A to E).

FIG. 7.16 Design and findings of a hypothetical 4-month case-crossover study of air pollution and myocardial infarction (MI; see discussion in text). (A) Times of development of MI cases. (B) Periods of high air pollution (shown by the colored bands). (C) Defining at-risk periods (red brackets). (D) Defining control periods (blue brackets). (E) Comparisons made of air pollution levels in at-risk and in control periods for each MI case in the study (yellow arrows).

Fig. 7.16A shows that over a 4-month period, January–April, four cases of MI were identified, symbolized by the small red hearts in the diagrams. The vertical dotted lines delineate 2-week intervals during the 4-month period. For the same 4-month period, levels of air pollution were measured. Three periods of high levels of air pollution of different lengths of time were identified and are shown by the pink areas in Fig. 7.16B.


For each person with an MI in this study, an “at-risk” period (also called a “hazard period”) was defined as the 2 weeks immediately prior to the event. These at-risk periods are indicated by the red brackets in Fig. 7.16C. If an exposure has a short-term effect on risk of an MI, we would expect exposure to have occurred during that 2-week at-risk period. The critical element, however, in a case-crossover design is that for each subject in the study, we compare the level of exposure in that at-risk period with a control period (also called a “referent period”) that is unlikely to be relevant to occurrence of the event (the MI) because it is too far removed in time from the occurrence. In this example, the control period selected for each subject is a 2-week period beginning 1 month before the at-risk period, and these control periods are indicated by the blue brackets in Fig. 7.16D. Thus, as shown by the yellow arrows in Fig. 7.16E, for each subject, we are comparing the air pollution level in the at-risk period to the air pollution level in the control period. In order to demonstrate an association of MI with air pollution, we would expect to see greater exposure to high levels of air pollution during the at-risk period than during the control period.


In this example, we see that for subject 1 both the at-risk period and the control period were in low pollution times. For subjects 2 and 3, the at-risk periods were in high pollution times and the control periods in low pollution times. For subject 4, both the at-risk and control periods were in high pollution times.
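The comparison just described can be tallied in a few lines of Python. This is only an illustrative sketch: the exposure values are read from the hypothetical example in Fig. 7.16, and the names `subjects` and `tally_pairs` are invented here for illustration. It shows that only subjects 2 and 3 differ in exposure between their two periods.

```python
# Exposure (high air pollution) in each subject's at-risk and control periods,
# as described for the hypothetical example in Fig. 7.16.
# True = high pollution, False = low pollution.
subjects = {
    1: {"at_risk": False, "control": False},
    2: {"at_risk": True, "control": False},
    3: {"at_risk": True, "control": False},
    4: {"at_risk": True, "control": True},
}


def tally_pairs(data):
    """Count how each subject's at-risk period compares with the control period."""
    counts = {
        "concordant": 0,            # same exposure in both periods
        "exposed_at_risk_only": 0,  # high pollution only in the at-risk period
        "exposed_control_only": 0,  # high pollution only in the control period
    }
    for periods in data.values():
        if periods["at_risk"] == periods["control"]:
            counts["concordant"] += 1
        elif periods["at_risk"]:
            counts["exposed_at_risk_only"] += 1
        else:
            counts["exposed_control_only"] += 1
    return counts


counts = tally_pairs(subjects)
print(counts)
```

Subjects 1 and 4 had the same exposure in both periods, so the contrast between at-risk and control periods rests on subjects 2 and 3.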

Thus, in the case-crossover design, each subject serves as his or her own control. In this sense the case-crossover design is similar to the planned crossover design presented in Chapter 10. In this type of design, we are not concerned about other differences between the characteristics of the cases and those of a separate group of controls. This design also eliminates the additional cost that would be associated with identifying and interviewing a separate control population.


Attractive as this design is, unanswered questions remain. For example, the case-crossover design can be used to study people with heart attacks in regard to whether there was an episode of severe grief or anger during the period immediately preceding the attack. In this study design, the frequency of such emotionally charged events during that time interval would be compared, for example, with the frequency of such events during a period a month earlier, which was not associated with any adverse health event.

Information on such events in both periods is often obtained by interviewing the subject. The question arises, however, whether there could be recall bias, in that a person may recall an emotionally charged episode that occurred shortly before a coronary event, while a comparable episode a month earlier in the absence of any adverse health event may remain forgotten. Thus recall bias may be a problem not only when we compare cases and controls, as discussed earlier in this chapter, but also when we compare the same individual in two different time periods. Further discussion of case-crossover is provided by Maclure and Mittleman. 26



We have now reviewed the most basic observational study designs used in epidemiologic investigations and clinical research. Unfortunately, a variety of different terms are used in the literature to describe different study designs, and it is important to be familiar with them. Table 7.11 is designed to help guide you through the often confusing terminology. The next study design is the “cohort study,” which is presented in Chapter 8, and builds upon what we have learned about the initial observational study designs presented in this chapter. We then follow with two chapters on randomized trials, which are not “strictly” observational studies. In observational studies, the investigator merely follows those who are diseased or not diseased, or exposed and not exposed. In the randomized trial study design, the investigator uses a random allocation schedule to determine which participants are exposed or not. Hence the randomized trial is akin to an experiment and is also known as an “experimental study.” However, it differs from observational studies only in that the exposure is experimentally (randomly) assigned by the study investigator.


TABLE 7.11

Finding Your Way in the Terminology Jungle

Case-control study = Retrospective study
Cohort study = Longitudinal study = Prospective study
Prospective cohort study = Concurrent cohort study = Concurrent prospective study
Retrospective cohort study = Historical cohort study = Nonconcurrent prospective study
Randomized trial = Experimental study
Cross-sectional study = Prevalence survey
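When reading articles, it can help to treat each row of Table 7.11 as a group of interchangeable labels. The sketch below is one possible way to do that in Python. The names `SYNONYM_GROUPS`, `CANONICAL`, and `canonical_design` are invented for illustration, and treating the first term in each row as the preferred label is an arbitrary assumption.

```python
# Each row of Table 7.11 as a group of equivalent terms. The first term in each
# group is used as the preferred label (an arbitrary choice for this sketch).
SYNONYM_GROUPS = [
    ["case-control study", "retrospective study"],
    ["cohort study", "longitudinal study", "prospective study"],
    ["prospective cohort study", "concurrent cohort study",
     "concurrent prospective study"],
    ["retrospective cohort study", "historical cohort study",
     "nonconcurrent prospective study"],
    ["randomized trial", "experimental study"],
    ["cross-sectional study", "prevalence survey"],
]

# Map every term to the first term of its group.
CANONICAL = {term: group[0] for group in SYNONYM_GROUPS for term in group}


def canonical_design(term):
    """Return the preferred label for a study-design term, or None if unknown."""
    return CANONICAL.get(term.strip().lower())


print(canonical_design("Historical cohort study"))
print(canonical_design("Prevalence survey"))
```

A term that does not appear in the table returns `None`, which is a reminder to check how the authors of an article define their design.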

The purpose of all of these types of studies is to identify associations between exposures and diseases. If such associations are found, the next step is to determine whether the associations are likely to be causal. These topics, starting with estimating risk and determining whether exposure to a certain factor is associated with excess risk of the disease, are addressed later.


Review Questions for Chapter 7

1  A case-control study is characterized by all of the following except:

  1. It is relatively inexpensive compared with most other epidemiologic study designs
  2. Patients with the disease (cases) are compared with persons without the disease (controls)
  3. Incidence rates may be computed directly
  4. Assessment of past exposure may be biased
  5. Definition of cases may be difficult

2  Residents of three villages with three different types of water supply were asked to participate in a survey to identify cholera carriers. Because several cholera deaths had occurred recently, virtually everyone present at the time underwent examination. The proportion of residents in each village who were carriers was computed and compared. What is the proper classification for this study?

  1. Cross-sectional study
  2. Case-control study
  3. Prospective cohort study
  4. Retrospective cohort study
  5. Experimental study

3  Which of the following is a case-control study?

  1. Study of past mortality or morbidity trends to permit estimates of the occurrence of disease in the future
  2. Analysis of previous research in different places and under different circumstances to permit the establishment of hypotheses based on cumulative knowledge of all known factors
  3. Obtaining histories and other information from a group of known cases and from a comparison group to determine the relative frequency of a characteristic or exposure under study
  4. Study of the incidence of cancer in men who have quit smoking
  5. Both a and c

4  In a study begun in 1965, a group of 3,000 adults in Baltimore were asked about alcohol consumption. The occurrence of cases of cancer between 1981 and 1995 was studied in this group. This is an example of:

  1. A cross-sectional study
  2. A prospective cohort study
  3. A retrospective cohort study
  4. A clinical trial
  5. A case-control study

5  In a small pilot study, 12 women with endometrial cancer (cancer of the uterus) and 12 women with no apparent disease were contacted and asked whether they had ever used estrogen. Each woman with cancer was matched by age, race, weight, and parity to a woman without disease. What kind of study design is this?

  1. Prospective cohort study
  2. Retrospective cohort study
  3. Case-control study
  4. Cross-sectional study
  5. Experimental study

6  The physical examination records of the entire incoming freshman class of 1935 at the University of Minnesota were examined in 1977 to see if their recorded height and weight at the time of admission to the university was related to the development of coronary heart disease (CHD) by 1986. This is an example of:

  1. A cross-sectional study
  2. A case-control study
  3. A prospective cohort study
  4. A retrospective cohort study
  5. An experimental study

7  In a case-control study, which of the following is true?

  1. The proportion of cases with the exposure is compared with the proportion of controls with the exposure
  2. Disease rates are compared for people with the factor of interest and for people without the factor of interest
  3. The investigator may choose to have multiple comparison groups
  4. Recall bias is a potential problem
  5. a, c, and d

8  In which one of the following types of study designs does a subject serve as his own control?

  1. Prospective cohort study
  2. Retrospective cohort study
  3. Case-cohort study
  4. Case-crossover study
  5. Case-control study

9  Ecologic fallacy refers to:

  1. Assessing exposure in large groups rather than in many small groups
  2. Assessing outcome in large groups rather than in many small groups
  3. Ascribing the characteristics of a group to every individual in that group
  4. Examining correlations of exposure and outcomes rather than time trends
  5. Failure to examine temporal relationships between exposures and outcomes

10  A researcher wants to investigate if tea consumption (assessed by a biomarker for tea metabolism) increases the risk of CHD. He uses a case-control study to answer this question. CHD is rare in younger people. Which two groups are best to enroll and compare for this purpose?

  1. The group of CHD cases and a group of those who do not have CHD individually matched to the cases for tea metabolism biomarker
  2. The group of CHD cases and a group of those who do not have CHD frequency matched to the cases for tea metabolism biomarker
  3. The group of CHD cases and a group of those who do not develop CHD, matched for age
  4. A random sample of those who drink tea and a random sample of those who do not drink tea, matched for age
  5. A random sample of those who drink tea and a random sample of those who do not drink tea, unmatched for age

11  Which of the following is a true conclusion concerning matching?

  1. Once we have matched controls to cases according to a given characteristic, we can only study that characteristic when the prevalence of disease is low
  2. If an attempt is made to match on too many characteristics, it may prove difficult or impossible to adjust for all of the characteristics during data analysis
  3. Matching on many variables may make it difficult to find an appropriate control
  4. Individual matching differs from frequency matching because controls are selected from hospitals instead of from the general population
  5. None of the above


Cohort Studies


Keywords: incidence; concurrent and nonconcurrent (historic/retrospective) cohort study; selection bias; information bias; nested case-control study; case-cohort study


  • To describe the designs of cohort studies and options for the conduct of longitudinal studies.
  • To illustrate the cohort study design with two important historical examples.
  • To discuss some potential biases in cohort studies.

In this chapter, and in the following chapters in Section II, we turn to the uses of epidemiology to elucidate etiologic or causal relationships. The two steps that underlie the study designs are discussed in this chapter and the chapters on clinical trials. Fig. 8.1 schematically represents these two conceptual steps:


  1. First, we determine whether there is an association between a factor or a characteristic and the development of a disease. This can be accomplished by studying the characteristics of groups, by studying the characteristics of individuals, or both.
  2. Second, we derive appropriate inferences regarding a possible causal relationship from the patterns of association that have been found.

FIG. 8.1 If we observe an association between an exposure and a disease or another outcome (1), the question is: Is the association causal (2)?

Previously, we described the study designs used for step 1. In this chapter, cohort studies are discussed; randomized controlled trials (experiments) are presented in Chapters 10 and 11. Cohort studies, along with ecologic, cross-sectional, and case-control studies, in contrast to randomized controlled trials, are collectively referred to as observational studies. That is, there is no experimental manipulation involved; we investigate exposures among study participants (at one point in time or over time) and observe their outcomes either at the same point in time or sometime later on.

Design of a Cohort Study

In a cohort study, the investigator selects a group of exposed individuals and a group of unexposed individuals and follows both groups over time to compare the incidence of disease (or rate of death from disease) in the two groups (Fig. 8.2). The design may include more than two groups (such as no exposure, low exposure, and high exposure levels), although only two groups are shown here for diagrammatic purposes.

If a positive association exists between the exposure and the disease, we would expect that the proportion of the exposed group in whom the disease develops (incidence in the exposed group) would be greater than the proportion of the unexposed group in whom the disease develops (incidence in the unexposed group).


The calculations involved are seen in Table 8.1. We begin with an exposed group and an unexposed group. Of the (a + b) exposed persons, the disease develops in a but not in b. Thus the incidence of the disease among the exposed is a/(a + b). Similarly, in the (c + d) unexposed persons in the study, the disease develops in c but not in d. Thus the incidence of the disease among the unexposed is c/(c + d).

The use of these calculations is seen in a hypothetical example of a cohort study shown in Table 8.2. In this cohort study, the association of smoking with coronary heart disease (CHD) is investigated by selecting for study a group of 3,000 smokers (exposed) and a group of 5,000 nonsmokers (unexposed), all of whom are free of heart disease at baseline. Both groups are followed for the development of CHD, and the incidence of CHD in both groups is compared. CHD develops in 84 of the smokers and in 87 of the nonsmokers. The result is an incidence of CHD of 28.0/1,000 in the smokers and 17.4/1,000 in the nonsmokers.
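The arithmetic in this hypothetical example can be checked with a short Python sketch. The counts come from the example above; the function name `incidence_per_1000` and the dictionary keys are invented here for illustration.

```python
def incidence_per_1000(new_cases, persons_at_risk):
    """Incidence expressed per 1,000 persons followed."""
    return 1000 * new_cases / persons_at_risk


# Counts from the hypothetical smoking and CHD example (Table 8.2).
smokers = {"developed_chd": 84, "total": 3000}     # a and (a + b)
nonsmokers = {"developed_chd": 87, "total": 5000}  # c and (c + d)

inc_exposed = incidence_per_1000(smokers["developed_chd"], smokers["total"])
inc_unexposed = incidence_per_1000(nonsmokers["developed_chd"], nonsmokers["total"])

print(f"Smokers: {inc_exposed:.1f} per 1,000")
print(f"Nonsmokers: {inc_unexposed:.1f} per 1,000")
```

Although more nonsmokers than smokers developed CHD in absolute terms (87 versus 84), the nonsmoking group is larger, so its incidence is lower. This is why the two groups are compared by incidence and not by the number of cases.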

Note that because we are identifying new (incident) cases of disease as they occur, we can determine whether a temporal relationship exists between the exposure and the disease (i.e., whether the exposure preceded the onset of the disease). Clearly, such a temporal relationship must be established if we are to consider the exposure a possible cause of the disease in question.

Selection of Study Populations

The essential characteristic in the design of cohort studies is the comparison of outcomes in an exposed group and in an unexposed group (or a group with a certain characteristic and a group without that characteristic, such as older or younger participants). There are two basic ways to generate such groups:


  1. We can create a study population by selecting groups for inclusion in the study on the basis of whether or not they were exposed (e.g., occupationally exposed cohorts compared with similarly aged community residents who do not work in those occupations) (Fig. 8.3).


  2. We can select a defined population before any of its members become exposed or before their exposures are identified. We could select a population on the basis of some factor not related to exposure (such as community of residence) (Fig. 8.4) and take histories of, or perform blood tests or other assays on, the entire population. Using the results of the histories or the tests, one can separate the population into exposed and unexposed groups (or those who have and those who do not have certain biologic characteristics), such as was done in the Framingham Study, described later in this chapter.

Cohort studies, in which we wait for an outcome to develop in a population, often require a long follow-up period, lasting until enough events (outcomes) have occurred. When the second approach is used—in which a population is identified for study based on some characteristic unrelated to the exposure in question—the exposure of interest may not take place for some time, even for many years after the population has been defined. Consequently, the length of follow-up required is even greater with the second approach than it is with the first. Note that with either approach the cohort study design is fundamentally the same: we compare exposed and unexposed persons. This comparison is the hallmark of the cohort design.

Types of Cohort Studies

A major issue with the cohort design just described is that the study population often must be followed up for a long period to determine whether the outcome of interest develops. Consider as an example a hypothetical study of the relationship of smoking to lung cancer. We identify a population of elementary school students and follow them; 10 years later, when they are teenagers, we identify those who smoke and those who do not. We then follow both groups (smokers and nonsmokers) to see who develops lung cancer and who does not. Let us say that we begin our study in 2012 (Fig. 8.5). Let us suppose that many children who will become smokers will do so within 10 years. Exposure status (smoker or nonsmoker) will therefore be ascertained 10 years later, in the year 2022. For purposes of this example, let us assume that the average latent period from beginning smoking to development of lung cancer is 20 years. Therefore development of lung cancer will be, on average, ascertained 20 years later, in 2042.

This type of study design is called a prospective cohort study (also called by some a concurrent cohort or longitudinal study). It is concurrent (happening or done at the same time) because the investigator identifies the original population at the beginning of the study and, in effect, follows the subjects concurrently through calendar time until the point at which the disease develops or does not develop.


What is the problem with this approach? The difficulty is that, as just described, the study will take at least 30 years to complete. Several problems can result. If one is fortunate enough to obtain a research grant, such funding is generally limited to a maximum of only 5 years. In addition, with a study of this length, there is the risk that the study subjects will outlive the investigator, or at least that the investigator may not survive to the end of the study. Given these issues, the prospective cohort study often proves unattractive to investigators who are contemplating a new research question.

Do these problems mean that the cohort design is not practical? Is there any way to shorten the time period needed to conduct a cohort study? Let us consider an alternate approach using the cohort design (Fig. 8.6). Suppose that we again begin our study in 2012, but now we find that an old roster of elementary schoolchildren from 1982 is available in our community and that they had been surveyed in high school regarding their smoking habits in 1992. Using these data resources in 2012, we can begin to determine who in this population has developed lung cancer and who has not. This is called a retrospective cohort or historical cohort study (also called a nonconcurrent prospective study). However, note that the study design does not differ from that of the prospective cohort design: we are still comparing exposed and unexposed groups. What we have done in the retrospective cohort design is to use historical data so that we can telescope (reduce) the frame of calendar time for the study and obtain our results sooner. It is no longer a prospective design, because we are beginning the study with a preexisting population to reduce the duration of the study. However, as shown in Fig. 8.7, the designs for both the prospective cohort study and the retrospective or historical cohort study are identical: we are comparing exposed and unexposed populations. The only difference between them is calendar time. In a prospective cohort design, exposure and nonexposure are ascertained as they occur during the study; the groups are then followed for several years into the future and incidence is measured. In a retrospective cohort design, exposure is ascertained from past records and the outcome (development or no development of disease) is determined when the study is begun.
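The difference in calendar time between the two designs can be laid out with simple arithmetic. The Python sketch below uses the years from the hypothetical smoking and lung cancer example (Figs. 8.5 and 8.6). The function name `cohort_timeline` and its parameters are invented for illustration, and the 10-year and 20-year intervals are the assumptions stated in the example, not empirical values.

```python
def cohort_timeline(cohort_defined, study_start, exposure_lag=10, latent_period=20):
    """Key calendar years for the hypothetical smoking and lung cancer study.

    exposure_lag: assumed years from defining the cohort to ascertaining smoking.
    latent_period: assumed years from starting smoking to lung cancer.
    """
    exposure_year = cohort_defined + exposure_lag
    outcome_year = exposure_year + latent_period
    return {
        "exposure_ascertained": exposure_year,
        "outcome_ascertained": outcome_year,
        # Years the investigator must wait after the study begins.
        "years_of_follow_up_after_start": max(0, outcome_year - study_start),
    }


# Prospective (concurrent) design: cohort defined when the study begins in 2012.
prospective = cohort_timeline(cohort_defined=2012, study_start=2012)

# Retrospective (historical) design: 1982 school roster, study begins in 2012.
retrospective = cohort_timeline(cohort_defined=1982, study_start=2012)

print("Prospective:", prospective)
print("Retrospective:", retrospective)
```

Both calls run the same sequence of steps (define the cohort, ascertain exposure, ascertain outcome); only the position of the study start on the calendar changes, which is the point made in Fig. 8.7.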

It is also possible to conduct a study that is a combination of prospective cohort and retrospective cohort designs. With this approach, exposure is ascertained from objective records in the past (as in a historical cohort study) and follow-up and measurement of outcome continue into the future.

Examples of Cohort Studies

Example 1: The Framingham Study

One of the first, most important, and best-known cohort studies is the Framingham Study of cardiovascular disease, which was begun in 1948. 1 Framingham is a town in Massachusetts, approximately 20 miles west of Boston. It was thought that the characteristics of its population (just less than 30,000 residents) would be appropriate for such a study and would facilitate follow-up of participants because migration out was considered to be low (i.e., the population was stable).


Residents were considered eligible if they were between 30 and 62 years of age at study initiation. The rationale for using this age range was that people younger than 30 years would generally be unlikely to manifest the cardiovascular end points being studied during the proposed 20-year follow-up period. Many persons older than 62 years would already have established coronary disease, and it would therefore not be rewarding to study persons in this age group for identifying the incidence of coronary disease.

The investigators sought a sample size of 5,000. Table 8.3 shows how the final study population was derived. It consisted of 5,127 men and women who were between 30 and 62 years of age at the time of study entry and were free of cardiovascular disease at that time. In this study, many suggested “exposures” were defined, including age and gender, smoking, weight, blood pressure, cholesterol levels, physical activity, and other factors.



TABLE 8.3

Derivation of the Framingham Study Population

                                                   No. of Men   No. of Women   Total
Random sample                                      3,074        3,433          6,507
Respondents                                        2,024        2,445          4,469
Volunteers                                         312          428            740
Respondents free of CHD                            1,975        2,418          4,393
Volunteers free of CHD                             307          427            734
Total free of CHD: The Framingham Study Group      2,282        2,845          5,127

CHD, Coronary heart disease.


From Dawber TR, Kannel WB, Lyell LP. An approach to longitudinal studies in a community: the Framingham Study. Ann NY Acad Sci. 1963;107:539–556.
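The way the final study group is built up in Table 8.3 can be verified with a few lines of Python. The counts are transcribed from the table; the variable names are invented for illustration, and the response proportion at the end is a derived figure that the table itself does not report.

```python
# Counts transcribed from Table 8.3 as (men, women).
random_sample = (3074, 3433)
respondents = (2024, 2445)
respondents_free_of_chd = (1975, 2418)
volunteers_free_of_chd = (307, 427)

# The study group combines respondents and volunteers who were free of CHD.
men = respondents_free_of_chd[0] + volunteers_free_of_chd[0]
women = respondents_free_of_chd[1] + volunteers_free_of_chd[1]
total = men + women

# Derived here, not reported in the table: share of the random sample
# that responded.
response_proportion = sum(respondents) / sum(random_sample)

print(men, women, total)
print(f"Response proportion: {response_proportion:.3f}")
```

The totals reproduce the final row of the table: 2,282 men and 2,845 women, for a study group of 5,127.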


New coronary events (incidence) were identified by examining the study population every 2 years and by daily surveillance of hospitalizations at the only hospital in Framingham.


The study was designed to test the following hypotheses:


  • The incidence of CHD increases with age. It occurs earlier and more frequently in males.
  • Persons with hypertension develop CHD at a greater rate than those who are normotensive.
  • Elevated blood cholesterol level is associated with an increased risk of CHD.
  • Tobacco smoking and habitual use of alcohol are associated with an increased incidence of CHD.
  • Increased physical activity is associated with a decrease in the development of CHD.
  • An increase in body weight predisposes a person to the development of CHD.
  • An increased rate of development of CHD occurs in patients with diabetes mellitus.

When we examine this list nowadays, we might wonder why such obvious and well-known relationships should have been examined in such an extensive study. The danger of this “hindsight” approach should be kept in mind; it is primarily because of the Framingham Study, a classic cohort study that made fundamental contributions to our understanding of the epidemiology of cardiovascular disease, that these relationships are currently well known.


This study used the second method described earlier in the chapter for selecting a study population for a cohort study: A defined population was selected on the basis of location of residence or other factors not related to the exposure(s) in question. The population was then observed over time to determine which individuals developed or already had the “exposure(s)” of interest and, later on, to determine which study participants developed the cardiovascular outcome(s) of interest. This approach offered an important advantage: It permitted the investigators to study multiple “exposures,” such as hypertension, smoking, obesity, cholesterol levels, and other factors, as well as the complex interactions among the exposures, by using multivariable techniques. Thus, although a cohort study that begins with an exposed and an unexposed group focuses often on only one specific exposure, a cohort study that begins with a defined population can explore the roles of many exposures to the study outcome measure(s).

Example 2: Incidence of Breast Cancer and Progesterone Deficiency

It has long been recognized that breast cancer is more common in women who are older at the time of their first pregnancy. A difficult question is raised by this observation: Is the relationship between late age at first pregnancy and increased risk of breast cancer related to the finding that early first pregnancy protects against breast cancer (and therefore such protection is missing in women who have a later pregnancy or no pregnancy), or are both a delayed first pregnancy and an increased risk of breast cancer the result of some third factor, such as an underlying hormonal abnormality?

It is difficult to tease apart these two interpretations. However, in 1978, Linda Cowan and coworkers 2 carried out a study designed to determine which of these two explanations was likely to be the correct one (Fig. 8.8). The researchers identified a population of women who were patients at the Johns Hopkins Hospital Infertility Clinic in Baltimore, Maryland, from 1945 to 1965. Because they were patients at this clinic, the subjects, by definition, all had a late age at first pregnancy. In the course of their diagnostic evaluations, detailed hormonal profiles were developed for each woman. The researchers were therefore able to separate the women with an underlying hormonal abnormality, including progesterone deficiency (exposed), from those without such a hormonal abnormality (unexposed) who had another cause of infertility, such as a problem with tubal patency or a husband’s low sperm count. Both groups of women were then followed for subsequent development of breast cancer.


How could the results of this study design clarify the relationship between late age at first pregnancy and increased risk of breast cancer? If the explanation for the association of late age at first pregnancy and increased risk of breast cancer is that an early first pregnancy protects against breast cancer, we would not expect any difference in the incidence of breast cancer between the women who have a hormonal abnormality and those who do not (and none of the women would have had an early first pregnancy). However, if the explanation for the increased risk of breast cancer is that the underlying hormonal abnormality predisposes these women to breast cancer, we would expect to find a higher incidence of breast cancer in women with the hormonal abnormality than in those without this abnormality.

The study found that, when the development of breast cancer was considered for the entire group, the incidence was 1.8 times greater in women with hormonal abnormalities than in women without such abnormalities, but the finding was not statistically significant. However, when the occurrence of breast cancer was divided into categories of premenopausal and postmenopausal incidence, women with hormonal abnormalities had a 5.4 times greater risk of premenopausal occurrence of breast cancer (they developed breast cancer earlier); no difference was seen for postmenopausal occurrence of breast cancer. It is not clear whether this lack of a difference in the incidence of postmenopausal breast cancer represents the true absence of a difference or whether it can be attributed to the small number of women in this population who had reached menopause at the time the study was conducted.


What type of study design is this? Clearly, it is a cohort design because it compares exposed and unexposed persons. Furthermore, because the study was carried out in 1978 and the investigator used a roster of patients who had been seen at the Infertility Clinic from 1945 to 1965, it is a retrospective cohort design.

Cohort Studies for Investigating Childhood Health and Disease

A particularly appealing use of the cohort design is for long-term cohort studies of childhood health and disease. In recent years, there has been increasing recognition that experiences and exposures during fetal life may have long-lasting effects, even into adult life. Infections during pregnancy, as well as exposures to environmental toxins, hormonal abnormalities, or the use of drugs (either medications taken during pregnancy or substances abused during pregnancy), may have damaging effects on the fetus and child, and these effects might last even into adult life. David Barker and his colleagues concluded from their studies that adult chronic disease is biologically programmed in intrauterine life or early infancy. 3 The importance of including a life course approach to the epidemiologic study of chronic disease throughout life has been emphasized.

In this chapter, we have discussed two types of cohort studies; both have applicability to the study of childhood health. In the first type of cohort study, we start with exposed and unexposed groups. For example, follow-up studies of fetuses exposed to radiation from atomic bombs in Hiroshima and Nagasaki during World War II have provided much information about cancer and other health problems resulting from intrauterine exposure to radiation. 4 The exposure dose was calibrated for the survivors on the basis of how far the pregnant woman was from the point of the bomb drop and the nature of the barriers between that person and the point of the bomb drop. It was then possible to relate the risk of adverse outcome to the radiation dose that each person received. Another example is the cohort of pregnancies during the Dutch Famine in World War II. 5 Because the Dutch kept excellent records, it was possible to identify cohorts who were exposed to the severe famine at different times in gestation and to compare them with one another and with an unexposed group.

As discussed earlier in this chapter, in the second type of cohort study, we identify a group before any of its members become exposed or before the exposure has been identified. For example, infants born during a single week in 1946 in Great Britain were followed into childhood and later into adult life. The Collaborative Perinatal Study, begun in the United States in the 1950s, was a multicenter cohort study that followed more than 58,000 children from birth to age 7 years. 6


Although the potential knowledge to be gained by such studies is very attractive, several challenging questions arise when such large cohort studies of children are envisioned and when such long-term follow-up is planned. Among the questions are the following:

  1. At what point should the individuals in the cohort first be identified? When a cohort is initiated at birth and then followed (Fig. 8.9), data on prenatal exposures can be obtained only retrospectively by interview and from relevant records. Therefore some cohort studies have begun in the prenatal period, when the pregnancy is first identified. However, even when this is done, preconceptual and periconceptual data that may be needed to answer certain questions may only be obtained retrospectively. Therefore a cohort initiated prior to the time of conception (Fig. 8.10) is desirable for answering many questions because it permits concurrent gathering of data about exposures at the time of or preceding conception and then in the prenatal and perinatal periods. However, this is generally a logistically difficult and very expensive challenge.
  2. Should the cohort be drawn from one center or from a few centers, or should it be a national sample drawn in an attempt to make the cohort representative of a national population? Will the findings of studies based on the cohort be broadly generalizable only if the cohort is drawn from a national sample? The National Children’s Study (NCS) was a planned long-term study of 100,000 children and their parents in the United States, designed to investigate environmental influences on child health and development. The pilot study was initiated in 2009, and only 5,000 children were recruited by 2013 from 40 centers across the United States. Based on the recommendations of an expert panel, the National Institutes of Health (NIH) director closed the NCS in 2014. In 2016 the NIH launched a 7-year study called the Environmental Influences on Child Health Outcomes (ECHO), which enrolls existing cohorts of children (and, in some cases, their parents) that will continue to be followed using harmonized data collection. The resulting “synthetic cohort” (or a cohort of cohorts) should prove far more efficient than the cohort originally proposed for the NCS.
  3. For how long should a cohort be followed? Eaton urged that a cohort should be established at the time of conception and followed into adult life or until death. 7 This approach would help to test Barker’s hypothesis regarding the early origins of many chronic diseases. However, federal funding is generally limited to 5 years, which is an impediment to long-term follow-up.

  4. Which hypotheses and how many hypotheses should be tested in the cohort that will be established? A major problem associated with long-term follow-up of large cohorts is that, by the time the cohort has been established and followed for a number of years, the hypotheses that originally led to the establishment of the cohort may no longer be of sufficient interest or relevance because scientific and health knowledge has changed over time. Furthermore, as new knowledge leads to new hypotheses and to questions that were not originally anticipated when the study was initiated, data on the variables needed to test such new hypotheses and to answer such new questions may not be available in the data originally collected. An example from HIV/AIDS research illustrates these issues. In the early 1980s, clusters of men were identified who had rare malignancies associated with compromised immune function, later to be defined as HIV/AIDS. In response, the NIH launched the Multicenter AIDS Cohort Study in 1983 and enrolled the first participants in four US cities in 1984. The goal was to identify risk factors for this viral disease and to elucidate the natural history of the disease. With the advent of highly active antiretroviral therapy in 1996, virtually all of the study participants who were already infected were then placed on treatment, and their immune systems were reconstituted. How then could the natural history of a treated infection remain relevant? Was there any use in continuing to follow this cohort? Indeed, a vast number of new, relevant questions unfolded, chief among them: What is the impact of long-term antiretroviral therapy on natural aging and on the incidence of chronic diseases (cancer, cardiovascular disease, and diabetes, among others)? 9 Furthermore, new genetic tests have been discovered over the past 15 years that provide new insights into why some participants do better than others on treatment. 10 It must be emphasized that cohort studies whose participants are examined periodically, such as the Atherosclerosis Risk in Communities (ARIC) study, 11 allow evaluation of new hypotheses based on information that is collected in follow-up examinations.


Potential Biases in Cohort Studies

A number of potential biases must be either avoided or taken into account in conducting cohort studies. Discussions of biases in relation to case-control studies were presented earlier; bias in relation to causal inference will be presented later. The definitions used for many types of biases often overlap, and in the interest of clarity, two major categories are commonly used: selection bias and information bias.

Selection Biases

Nonparticipation and nonresponse can introduce major biases that can complicate the interpretation of the study findings. If participants refuse to join a cohort, might their characteristics differ sufficiently from those who consent to enroll, and might these differences lead to misguided inferences regarding the relationship between exposures and outcomes? For example, if those who refuse to join a study are more likely to smoke than those who consent to participate, would our estimate of the effect of smoking on the disease outcome be biased? If smokers who refuse participation are more likely to develop the disease than those who participate, the impact would be to diminish the association toward the null. Similarly, loss to follow-up can be a serious problem: If people with the disease are selectively lost to follow-up, and those lost to follow-up differ from those not lost to follow-up, the incidence rates calculated in the exposed and unexposed groups will clearly be difficult to interpret.
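
The effect of selective loss to follow-up can be illustrated with a small numeric sketch. All counts below are invented for illustration, and the loss pattern (only exposed persons who develop the disease are lost) is a deliberately extreme assumption rather than a description of any real study.

```python
def risk_ratio(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Ratio of cumulative incidence in the exposed to that in the unexposed."""
    return (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)

# Hypothetical full cohort: 1,000 exposed (100 cases), 1,000 unexposed (50 cases).
true_rr = risk_ratio(100, 1000, 50, 1000)

# Assumption: 40 of the 100 exposed cases are lost before their disease is
# recorded, and nobody else is lost to follow-up.
lost_exposed_cases = 40
observed_rr = risk_ratio(100 - lost_exposed_cases, 1000 - lost_exposed_cases, 50, 1000)

print(f"true RR = {true_rr:.2f}, observed RR = {observed_rr:.2f}")
```

Under these assumed numbers the observed ratio is pulled toward 1 (the null), even though the underlying association is unchanged.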

Information Biases

  1. If the quality and extent of information obtained is different for exposed persons than for the unexposed persons, a significant bias can be introduced. This is particularly likely to occur in historical cohort studies, in which information is obtained from past records. As we will discuss next in connection with randomized trials, in any cohort study, it is essential that the quality of the information obtained be comparable in both exposed and unexposed individuals.
  2. If the person who decides whether the disease has developed in each subject also knows whether that subject was exposed, and if that person is aware of the hypothesis being tested, that person’s judgment as to whether the disease developed may be biased by that knowledge. This problem can be addressed by “masking” the person who is making the disease assessment and also by determining whether this person was, in fact, aware of each subject’s exposure status.
  3. As in any study, if the epidemiologists and statisticians who are analyzing the data have strong preconceptions, they may unintentionally introduce their biases into their data analyses and into their interpretation of the study findings.

When Is a Cohort Study Warranted?

Fig. 8.11A to C reviews the basic steps in a cohort study, beginning with identifying an exposed group and an unexposed group (see Fig. 8.11A). We then ascertain the rate of development of disease (incidence) in both the exposed and the unexposed groups (see Fig. 8.11B). If the exposure is associated with disease, we would expect to find a greater incidence rate of disease in the exposed group than in the unexposed group, as shown schematically in Fig. 8.11C.

FIG. 8.11 Design of a cohort study. (A) Starting with exposed and unexposed groups. (B) Measuring the development of disease in both groups. (C) Expected findings if the exposure is associated with disease.

Clearly, to carry out a cohort study, we must have some idea of which exposures are suspected a priori as possible causes of a disease and are therefore worth investigating. Consequently, a cohort study is indicated when good evidence suggests an association of a disease with a certain exposure or exposures (evidence obtained from either clinical observations or case-control or other types of studies). Often, we collect biologic specimens at study baseline (enrollment), allowing these samples to be tested in the future, for example when new test methods are developed or new hypotheses are generated. As an example, George Comstock collected serum specimens at the time of a community assessment in the 1960s in Washington County, Maryland. Decades later, these specimens were tested for “clues” to the development of cancer. Results from the Campaign Against Cancer and Heart Disease (CLUE II) cohort study that Dr. Comstock founded showed that high serum cholesterol increases the risk of high-grade prostate cancer. This finding supports the hypothesis that cholesterol lowering is a potential mechanism by which statins, which are cholesterol-lowering medications, could have anticancer effects. 12

Because cohort studies often involve follow-up of populations over a long period, the cohort approach is particularly attractive when we can minimize attrition (losses to follow-up) of the study population. Consequently, such studies are generally easier to conduct when the interval between the exposure and the development of disease is short. An example of an association in which the interval between exposure and outcome is short is the relationship between rubella infection during pregnancy and the development of congenital malformations in the offspring.

Case-Control Studies Based Within a Defined Cohort

In recent years, considerable attention has focused on whether it is possible to take advantage of the benefits of both case-control and cohort study designs by combining some elements of both into a single study. The resulting combined study is in effect a hybrid design in which a case-control study is initiated within a cohort study. The general design is shown schematically in Fig. 8.12.

In this type of study, a population is identified and followed over time. At the time the population is identified, baseline data are obtained from records or interviews, from blood or urine tests, and in other ways. The population is then followed for a period of years. For most of the diseases that are studied, a small percentage of study participants manifest the disease, whereas most do not. As seen in Fig. 8.12, a case-control study is then carried out using as cases persons in whom the disease developed and using as controls a sample of those in whom the disease did not develop.


Such cohort-based case-control studies can be divided into two types, largely on the basis of the approach used for selecting the controls. These two types of studies are called nested case-control studies and case-cohort studies.

Nested Case-Control Studies

In nested case-control studies the controls are a sample of individuals who are at risk for the disease at the time each case of the disease develops. This is shown schematically in Fig. 8.13A to I.

Fig. 8.13A shows the starting point as a defined cohort of individuals. Some of them develop the disease in question, but most do not. In this hypothetical example, the cohort is observed over a 5-year period. During this time, five cases develop—one case after 1 year, one after 2 years, two after 4 years, and one after 5 years.


Let us follow the sequence of steps over time. Fig. 8.13B to I shows the time sequence in which the cases develop after the start of observations. At the time each case or cases develop, the same number of controls is selected. The solid arrows on the left side of the figure denote the appearance of cases of the disease, and the dotted arrows on the right side denote the selection of controls who are disease free but who are at risk of developing the disease in question at the time the case develops the disease. Fig. 8.13B shows case #1 developing after 1 year, and Fig. 8.13C shows control #1 being selected at that time. Fig. 8.13D shows case #2 developing after 2 years, and Fig. 8.13E shows control #2 being selected at that time. Fig. 8.13F shows cases #3 and #4 developing after 4 years, and Fig. 8.13G shows controls #3 and #4 being selected at that time. Finally, Fig. 8.13H shows the final case (#5) developing after 5 years, and Fig. 8.13I shows control #5 being selected at this point.

Fig. 8.13I is also a summary of the design and the final study populations used in the nested case-control study. At the end of 5 years, five cases have appeared, and at the times the cases appeared a total of five controls were selected for study. In this way, the cases and controls are, in effect, matched on calendar time and length of follow-up. Because a control is selected each time a case develops, a control who is selected early in the study could later develop the disease and become a case in the same study.
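
The control-selection steps described above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not a prescribed procedure: the cohort, the subject ids, and the function name `sample_risk_set_controls` are hypothetical, follow-up time is measured in whole years, and losses to follow-up are ignored.

```python
import random

def sample_risk_set_controls(event_time, controls_per_case=1, seed=0):
    """Pick controls for each case from members still at risk when the case occurs.

    event_time maps a subject id to the year the disease developed,
    or to None if the subject stayed disease free during follow-up.
    """
    rng = random.Random(seed)
    cases = sorted((t, sid) for sid, t in event_time.items() if t is not None)
    matched = []
    for t, case_id in cases:
        # At risk at time t: never diseased, or diseased only after time t.
        at_risk = [sid for sid, et in event_time.items()
                   if sid != case_id and (et is None or et > t)]
        matched.append((case_id, t, rng.sample(at_risk, controls_per_case)))
    return matched

# Hypothetical cohort of 20 people with the text's timing of the five cases:
# one case after 1 year, one after 2 years, two after 4 years, one after 5 years.
event_time = {sid: None for sid in range(20)}
event_time.update({3: 1, 7: 2, 11: 4, 12: 4, 16: 5})

matched = sample_risk_set_controls(event_time)
for case_id, year, controls in matched:
    print(f"year {year}: case {case_id}, control(s) {controls}")
```

Because eligibility is judged at the time each case develops, a person who becomes a case in year 4 or 5 can legitimately be drawn as a control for a case in year 1 or 2, which is the feature of the design noted in the text.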


Case-Cohort Studies

The second type of cohort-based case-control study is the case-cohort design seen in Fig. 8.14. In the hypothetical case-cohort study seen here, cases develop at the same times that were seen in the nested case-control design just discussed, but the controls are randomly chosen from the defined cohort with which the study began. This subset of the full cohort is called the subcohort. An advantage of this design is that because controls are not individually matched to each case, it is possible to study different diseases (different sets of cases) in the same case-cohort study using the same cohort for controls. In this design, in contrast to the nested case-control design, cases and controls are not matched on calendar time and length of follow-up; instead, exposure is characterized for the subcohort. This difference in study design needs to be taken into account in analyzing the study results.



FIG. 8.14 Design of a hypothetical case-cohort study: steps in selecting cases and controls.
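
As a rough sketch of the case-cohort idea, the subcohort can be drawn once, at random, from the cohort as it existed at baseline and then reused as the comparison group for more than one disease. The cohort size, the case lists, and the name `draw_subcohort` are hypothetical and chosen only for illustration.

```python
import random

def draw_subcohort(cohort_ids, size, seed=0):
    """Randomly sample the subcohort from the full cohort defined at baseline."""
    return set(random.Random(seed).sample(sorted(cohort_ids), size))

cohort = range(1, 1001)                   # hypothetical cohort of 1,000 people
subcohort = draw_subcohort(cohort, 100)   # one comparison group for all outcomes

# Hypothetical case lists for two different diseases arising in the same cohort.
cases_disease_a = {5, 40, 250, 777}
cases_disease_b = {12, 40, 613}

# Each analysis uses its own set of cases plus the same subcohort.
study_a = cases_disease_a | subcohort
study_b = cases_disease_b | subcohort
```

Because the subcohort is chosen without regard to later disease status or timing, nothing here matches controls to cases on follow-up time, which is why the analysis must account for the design.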

Advantages of Embedding a Case-Control Study in a Defined Cohort

What are the advantages of conducting a case-control study in a defined cohort? First, because interviews are completed or certain blood or urine specimens are obtained at the beginning of the study (at baseline), the data are obtained before any disease has developed. Consequently, the problem of possible recall bias discussed earlier in this chapter is eliminated. Second, if abnormalities in biologic characteristics such as laboratory values are found, because the specimens were obtained years before the development of clinical disease, it is more likely that these findings represent risk factors or other premorbid characteristics than a manifestation of early, subclinical disease. When such abnormalities are found in the traditional case-control study, we do not know whether they preceded the disease or were a result of the disease, particularly when the disease has a long subclinical (asymptomatic) phase, such as prostate cancer and chronic lymphocytic leukemia. Third, such a study is often more economical to conduct. One might ask, why perform a case-control study within a defined cohort? Why not perform a regular prospective cohort study? The answer is that in a cohort study of, say, 10,000 people, laboratory analyses of all the specimens obtained would have to be carried out, often at great cost, to define exposed and unexposed groups. However, in a case-control study within the same cohort, the specimens obtained initially are frozen or otherwise stored. Only after the disease has developed in some subjects is a case-control study begun and the specimens from the relatively small number of people who are included in the case-control study are thawed and tested. Laboratory tests would not need to be performed on all 10,000 people in the original cohort. Thus the laboratory burden and costs are dramatically reduced.


Finally, in both nested case-control and case-cohort designs, cases and controls are derived from the same original cohort, so there is likely to be greater comparability between the cases and the controls than one might ordinarily find in a traditional case-control study. For all of these reasons, the cohort-based case-control study is an extremely valuable type of study design.
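
The laboratory savings described above come down to simple arithmetic. In the sketch below, only the cohort size of 10,000 comes from the text; the number of cases and the number of controls per case are assumed values.

```python
cohort_size = 10_000       # cohort size used in the text's example
cases = 200                # hypothetical number of cases that develop
controls_per_case = 1      # hypothetical sampling ratio

assays_full_cohort = cohort_size                  # test every stored specimen
assays_nested = cases * (1 + controls_per_case)   # test cases and controls only
saving = 1 - assays_nested / assays_full_cohort

print(f"{assays_nested} assays instead of {assays_full_cohort} ({saving:.0%} fewer)")
```

The saving shrinks as the disease becomes more common or as more controls are selected per case.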



Several considerations can make the cohort design impractical. Often, strong evidence does not exist to justify mounting a large and expensive study for in-depth investigation of the role of a specific risk factor in the etiology of a disease. Even when such evidence is available, a cohort of exposed and unexposed persons often cannot be identified easily. In general, we do not have access to appropriate past records or other sources of data that enable us to conduct a retrospective cohort study; as a result, a long study is required because of the need for extended follow-up of the population after exposure. Furthermore, many of the diseases that are of interest today occur at very low rates. Consequently, very large cohorts must be enrolled in a study to ensure that enough cases develop by the end of the study period to permit valid analyses and conclusions.
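
To see why rare diseases demand very large cohorts, one can estimate roughly how many people must be enrolled to expect a given number of cases. This is a back-of-the-envelope sketch, not a formal sample size calculation: it assumes a constant incidence rate, ignores attrition and competing causes of death, and uses invented numbers and a hypothetical function name.

```python
import math

def cohort_size_needed(target_cases, cases_per_100k_per_year, years):
    """Smallest cohort expected to yield target_cases over the follow-up period."""
    expected_cases_per_100k = cases_per_100k_per_year * years
    return math.ceil(target_cases * 100_000 / expected_cases_per_100k)

# Hypothetical: incidence of 10 per 100,000 per year, 10 years of follow-up,
# and 100 cases wanted for analysis.
n = cohort_size_needed(target_cases=100, cases_per_100k_per_year=10, years=10)
print(n)
```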


In view of these considerations, an approach other than a cohort design is often needed—one that will surmount many of these difficulties. As we previously presented, such study designs—the case-control study and cross-sectional study designs—are being increasingly used. Later, we discuss the use of these study designs in estimating increased risk associated with an exposure.




Review Questions for Chapter 8

1  In cohort studies of the role of a suspected factor in the etiology of a disease, it is essential that:

  1. There be equal numbers of persons in both study groups
  2. At the beginning of the study, those with the disease and those without the disease have equal risks of having the factor
  3. The study group with the factor and the study group without the factor be representative of the general population
  4. The exposed and unexposed groups under study be as similar as possible with regard to possible confounding factors
  5. Both 2 and 3

2  Which of the following is not an advantage of a prospective cohort study?

  1. It usually costs less than a case-control study
  2. Precise measurement of exposure is possible
  3. Incidence rates can be calculated
  4. Recall bias is minimized compared with a case-control study
  5. Many disease outcomes can be studied simultaneously

3  Retrospective cohort studies are characterized by all of the following except:

  1. The study groups are exposed and unexposed
  2. Incidence rates may be computed
  3. The required sample size is smaller than that needed for a prospective cohort study
  4. The required sample size is similar to that needed for a prospective cohort study
  5. They are useful for rare exposures

4  A major problem resulting from the lack of randomization in a cohort study is:

  1. The possibility that a factor that led to the exposure, rather than the exposure itself, might have caused the disease
  2. The possibility that a greater proportion of people in the study may have been exposed
  3. The possibility that a smaller proportion of people in the study may have been exposed
  4. That, without randomization, the study may take longer to carry out
  5. Planned crossover is more likely

5  In a cohort study, the advantage of starting by selecting a defined population for study before any of its members become exposed, rather than starting by selecting exposed and unexposed individuals, is that:

  1. The study can be completed more rapidly
  2. A number of outcomes can be studied simultaneously
  3. A number of exposures can be studied simultaneously
  4. The study will cost less to carry out
  5. Both 1 and 4

6  In 2010, investigators were interested in studying early-adult obesity as a risk factor for cancer mortality. The investigators obtained physician health reports on students who attended the University of Glasgow between 1948 and 1968. These reports included records of the students’ heights and weights at the time they attended the university. The students were then followed through 2010. Mortality information was obtained using death certificates. This study can best be described as a:

  1. Nested case-control
  2. Cross-sectional
  3. Prospective cohort
  4. Retrospective cohort
  5. Population-based case-control

7  From 1983 to 1988, blood samples were obtained from 3,450 HIV-negative men in the Multicenter AIDS Cohort Study (MACS) and stored in a national repository. In 2010 a researcher was interested in examining the association between levels of inflammation and HIV infection. Of the 3,450 men, 660 men were identified as HIV-infected cases. The researcher investigated the association between C-reactive protein (CRP) and HIV infection among these 660 cases and 660 controls, matched to the cases by age and ethnicity, who did not become infected with HIV. The researcher used the stored blood samples to measure the serum level of CRP, a marker of systemic inflammation. The study initiated in 2010 is an example of a:

  1. Nested case-cohort study
  2. Nested case-control study
  3. Retrospective cohort study
  4. Cross-sectional study
  5. Randomized clinical trial



Comparing Cohort and Case-Control Studies


comparison of exposed and unexposed; comparison of diseased (cases) and nondiseased (controls); comparison of cohort and case-control studies; temporality.

At this point in our discussion, we will review some of the material that has been covered to this point in Section II. Because the presentation proceeds in a stepwise manner, it is important to understand what has been discussed thus far.


First, let’s compare the designs of cohort and case-control studies, as seen in Fig. 9.1. The important point that distinguishes between these two types of study designs is that, in a cohort study, exposed and unexposed persons are compared and, in a case-control study, persons with the disease (cases) and without the disease (controls) are compared (Fig. 9.2A). In cohort studies, we compare the incidence of disease in exposed and in unexposed individuals, and in case-control studies, we compare the proportions who have the exposure of interest in people with the disease and in people without the disease (see Fig. 9.2B).

Table 9.1 presents a detailed comparison of prospective cohort, retrospective (historical) cohort, and case-control study designs. If the reader has followed the discussion in Section II to this point, the entries in the table should be easy to understand.



TABLE 9.1 Comparison of Cohort and Case-Control Studies

  1. Study group
     Prospective cohort: Exposed persons (a + b)
     Retrospective cohort: Exposed persons (a + b)
     Case-control: Persons with the disease (cases) (a + c)

  2. Comparison group
     Prospective cohort: Nonexposed persons (c + d)
     Retrospective cohort: Nonexposed persons (c + d)
     Case-control: Persons without disease (controls) (b + d)

  3. Outcome measurements
     Prospective cohort: Incidence in the exposed and incidence in the nonexposed
     Retrospective cohort: Incidence in the exposed and incidence in the nonexposed
     Case-control: Proportion of cases exposed and proportion of controls exposed

  4. Measures of risk
     Prospective cohort: Absolute risk; relative risk; odds ratio; attributable risk
     Retrospective cohort: Absolute risk; relative risk; odds ratio; attributable risk
     Case-control: Odds ratio; attributable risk [a]

  5. Temporal relationship between exposure and disease
     Prospective cohort: Easy to establish
     Retrospective cohort: Sometimes hard to establish
     Case-control: Sometimes hard to establish

  6. Multiple associations
     Prospective cohort: Possible to study associations of an exposure with several diseases [b]
     Retrospective cohort: Possible to study associations of an exposure with several diseases [b]
     Case-control: Possible to study associations of a disease with several exposures or factors

  7. Time required for the study
     Prospective cohort: Generally long because of need to follow the subjects
     Retrospective cohort: May be short
     Case-control: Relatively short

  8. Cost of study
     Prospective cohort: Expensive
     Retrospective cohort: Generally less expensive than a prospective study
     Case-control: Relatively inexpensive

  9. Population size needed
     Prospective cohort: Relatively large
     Retrospective cohort: Relatively large
     Case-control: Relatively small

  10. Potential bias
     Prospective cohort: Assessment of outcome
     Retrospective cohort: Susceptible to bias both in assessment of exposure and assessment of outcome
     Case-control: Assessment of exposure

  11. Best when
     Prospective cohort: Exposure is rare; disease is frequent among exposed
     Retrospective cohort: Exposure is rare; disease is frequent among exposed
     Case-control: Disease is rare; exposure is frequent among persons with disease

  12. Problems
     Prospective cohort: Selection of nonexposed comparison group often difficult; changes over time in criteria and methods
     Retrospective cohort: Selection of nonexposed comparison group often difficult; changes over time in criteria and methods
     Case-control: Selection of appropriate controls often difficult; incomplete information on exposure

[a] Additional information must be available.

[b] It is also possible to study multiple exposures when the study population is selected on the basis of a factor unrelated to the exposure.
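
The cell notation used in the table (exposed = a + b, nonexposed = c + d, cases = a + c, controls = b + d) can be turned into a short worked example. The sketch below uses invented counts and hypothetical function names, and it takes attributable risk to mean the simple difference in incidence between the exposed and the nonexposed.

```python
def cohort_measures(a, b, c, d):
    """Measures available when exposed (a + b) and nonexposed (c + d) are followed."""
    incidence_exposed = a / (a + b)
    incidence_nonexposed = c / (c + d)
    return {
        "incidence_exposed": incidence_exposed,
        "incidence_nonexposed": incidence_nonexposed,
        "relative_risk": incidence_exposed / incidence_nonexposed,
        "attributable_risk": incidence_exposed - incidence_nonexposed,
        "odds_ratio": (a * d) / (b * c),
    }

def case_control_measures(a, b, c, d):
    """Measures available when cases (a + c) and controls (b + d) are compared."""
    return {
        "proportion_cases_exposed": a / (a + c),
        "proportion_controls_exposed": b / (b + d),
        "odds_ratio": (a * d) / (b * c),
    }

# Invented counts: a = exposed with disease, b = exposed without disease,
# c = nonexposed with disease, d = nonexposed without disease.
cohort = cohort_measures(a=50, b=50, c=25, d=75)
case_control = case_control_measures(a=50, b=50, c=25, d=75)
```

Note that the case-control function returns no incidence or relative risk: without supplemental information, the row totals a + b and c + d do not reflect the sizes of the exposed and nonexposed populations.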


When we begin a cohort study with exposed and unexposed groups, we can study only the specific exposure that distinguishes one group from the other. However, as shown in Fig. 9.3, we can study multiple outcomes or diseases in relation to the exposure of interest. Most cohort studies start with exposed and unexposed individuals. Less common is the situation where we start with a defined population in which the study population is selected on the basis of a factor not related to exposure, such as place of residence, and some members of the cohort become exposed and others are not exposed over time (Fig. 9.4). In a cohort study that starts with a defined population, it is possible to study multiple exposures. Thus, for example, in the Framingham Study, it was possible to study many exposures, including weight, blood pressure, cholesterol level, smoking, and physical activity among the participating individuals residing in Framingham, Massachusetts.

FIG. 9.4 In a cohort study that starts with a defined population, we can study both multiple exposures and multiple outcomes.

In cohort studies, incidence in both exposed and unexposed groups can be calculated, and we can therefore directly calculate the relative risk. Prospective cohort studies minimize the potential for recall and other bias in assessing the exposure and have greater validity of the exposure assessments. However, in retrospective cohort studies, which require data from the past, these problems may be significant. Cohort studies are desirable when the exposure of interest is rare. In a case-control design, we are unlikely to identify a sufficient number of exposed persons when we are dealing with a rare exposure. In prospective cohort studies in particular, we are likely to have better data on the temporal relationship between exposure and outcome (i.e., did the exposure precede the outcome?). Among the disadvantages of cohort studies is that they usually require large populations, and, in general, prospective cohort studies are especially expensive to carry out because follow-up of a large population over time is required. The potential for bias in assessing the outcome is greater in cohort studies than in case-control studies. Finally, cohort studies often become impractical when the disease under study is rare.

As seen in Table 9.1, case-control studies have a number of advantages. They are relatively inexpensive and require a relatively small number of subjects for study. They are desirable when the disease occurrence is rare, because if a cohort study were performed in such a circumstance, a tremendous number of people would have to be followed to generate enough people with the disease for study. As seen in Fig. 9.5, in a case-control study, because we begin with cases and controls, we are able to study more than one possible etiologic factor and to explore interactions among the factors.

Because case-control studies often require data about past events or exposures, they are often encumbered by the difficulties encountered in using such data (including a potential for recall bias). Furthermore, as has been discussed in some detail, selection of an appropriate control group is one of the most difficult methodologic problems encountered in epidemiology. In addition, in most case-control studies, we cannot calculate disease incidence in either the total population or the exposed and unexposed groups without some supplemental information.


The nested case-control design combines elements of both cohort and case-control studies and offers a number of advantages. The possibility of recall bias is eliminated because the data on exposure are obtained before the disease develops. Exposure data are more likely to represent the pre-illness state because they are obtained years before clinical illness is diagnosed. Finally, the costs are lower than with a cohort study because laboratory tests need to be done only on specimens from subjects who are later chosen as cases or controls; that is, we need to selectively do laboratory tests only on a subset of the overall cohort, thereby yielding considerable cost savings.

In addition to the cohort and case-control study designs, we have discussed the cross-sectional study design, in which data on both exposure and disease outcomes are collected simultaneously from each subject. The data from a cross-sectional study can be analyzed by comparing the prevalence of disease in exposed individuals with that in unexposed individuals or by comparing the prevalence of exposure in persons with the disease with that of persons without the disease. Although cross-sectional data are often obtained from representative surveys and can be very useful, they usually do not permit the investigator to determine the temporal relationship between exposure and the development of disease. As a result, their value for deriving causal inferences is somewhat limited. However, they can provide important directions for further research using cohort, case-control, and nested case-control designs.
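
The two ways of analyzing cross-sectional data described above can be sketched from a single set of survey records. The records, field names, and the helper `prevalence` are hypothetical; each record simply notes whether a respondent was exposed and whether the respondent had the disease at the time of the survey.

```python
def prevalence(records, of, given, value):
    """Prevalence of the field `of` among records where field `given` equals `value`."""
    subset = [r for r in records if r[given] == value]
    return sum(r[of] for r in subset) / len(subset)

# Invented survey of 200 people, each measured once for exposure and disease.
records = (
    [{"exposed": True, "diseased": True}] * 40
    + [{"exposed": True, "diseased": False}] * 60
    + [{"exposed": False, "diseased": True}] * 20
    + [{"exposed": False, "diseased": False}] * 80
)

# Analysis 1: prevalence of disease in exposed versus unexposed respondents.
disease_in_exposed = prevalence(records, of="diseased", given="exposed", value=True)
disease_in_unexposed = prevalence(records, of="diseased", given="exposed", value=False)

# Analysis 2: prevalence of exposure in respondents with versus without disease.
exposure_in_diseased = prevalence(records, of="exposed", given="diseased", value=True)
exposure_in_nondiseased = prevalence(records, of="exposed", given="diseased", value=False)
```

Neither comparison says anything about which came first, the exposure or the disease, because both were recorded at the same moment.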