Main

The sequencing of the human genome and increased investigation of its function are providing powerful research tools for identifying genetic variants that contribute to common diseases1,2,3. Recognition is growing, however, that genetic variants alone cannot account for most cases of chronic disease4. It is far more likely that environmental and behavioural changes, in interaction with a genetic predisposition, have produced most of the recent increases in chronic disease, and might therefore be the key to reversing this trend5.

For these reasons, the search for gene–environment interactions — differences in the association of a genetic variant with disease in the presence of a particular environmental exposure, or vice versa — is gaining increased emphasis6. These interactions are important because they can mask the detection of a genetic (or environmental) effect if they are not identified and controlled for, and can also lead to inconsistencies in disease associations when populations are subject to different environmental exposures that modify the effect of a given genetic variant (or the reverse)7,8,9,10,11 (Fig. 1). However, the most important implication of gene–environment interactions is that they can suggest approaches for modifying the effects of deleterious genes by avoiding the deleterious environmental exposure, as both the genetic variant and the exposure must be present to produce disease.

Figure 1: The importance of gene–environment interactions — an example.

Predicted values of high-density lipoprotein cholesterol (HDL-C) are shown for different hepatic lipase (LIPC) genotypes at different total levels of dietary fat intake (data from Ref. 7). Low fat intake (band A) combined with the TT genotype results in the highest HDL-C level. For a moderate fat intake (band B), there is no relationship between genotype and HDL-C level. For a high fat intake (band C), the TT genotype has the lowest HDL-C level. Gene–environment interactions are therefore important in identifying genetic and environmental determinants of medically relevant phenotypes such as HDL-C levels; depending on the dietary fat intake, one could conclude that the TT genotype produces high (band A) or low (band C) HDL-C levels, or that it is not associated with HDL-C levels at all (band B).
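The crossover pattern described in this legend can be sketched as a linear model with a genotype-by-fat interaction term. The coefficients, the fat-intake values assigned to bands A to C, and the function names below are hypothetical, chosen only to reproduce the qualitative pattern; they are not estimates from Ref. 7.

```python
# Hypothetical coefficients in arbitrary units; NOT estimates from Ref. 7.
BASELINE_HDL = 50.0         # predicted HDL-C for non-TT genotypes (assumed)
FAT_SLOPE = 0.0             # change in HDL-C per unit of fat intake, non-TT (assumed)
TT_MAIN_EFFECT = 9.0        # TT minus non-TT difference at zero fat intake (assumed)
TT_FAT_INTERACTION = -0.3   # change in the TT effect per unit of fat intake (assumed)


def predicted_hdl(is_tt: bool, fat_intake: float) -> float:
    """Linear model with a genotype-by-fat interaction term."""
    g = 1.0 if is_tt else 0.0
    return (BASELINE_HDL
            + FAT_SLOPE * fat_intake
            + TT_MAIN_EFFECT * g
            + TT_FAT_INTERACTION * g * fat_intake)


def tt_effect(fat_intake: float) -> float:
    """Difference in predicted HDL-C between TT and non-TT at a given intake."""
    return predicted_hdl(True, fat_intake) - predicted_hdl(False, fat_intake)


# Illustrative intakes standing in for bands A, B and C (assumed values).
for band, fat in [("A (low)", 20.0), ("B (moderate)", 30.0), ("C (high)", 40.0)]:
    print(f"band {band}: TT effect on HDL-C = {tt_effect(fat):+.1f}")
```

With these assumed values the TT effect is positive at low intake, zero at moderate intake and negative at high intake, so a study confined to one band would reach a different conclusion about the genotype than a study confined to another.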

The most widely used method for investigating the genetic and environmental basis of complex disease is the case–control study. Case–control studies involve an investigation of all cases of disease, or a representative sample of cases, compared with a representative sample of disease-free controls. Cases and controls are typically investigated retrospectively for evidence of genetic and other risk factors along with environmental exposures that existed before disease onset, and so probably contributed to disease development. However, because case–control studies typically begin with disease cases that have already occurred, they are subject to significant sources of bias, as described below.

By contrast, prospective cohort studies involve the investigation of a representative sample of the population before disease onset. This sample is then followed until the occurrence of specified endpoints (see Figure 2 for a comparison of this design with a case–control study12). The purpose of this design is to identify risk factors that predispose an individual to disease, or biomarkers for predicting disease development, in the population as a whole, not only among those individuals who come to medical attention. Prospective cohort studies are particularly valuable for detecting risk factors and risk markers that might be affected by disease, treatment or lifestyle changes13, or that are subject to imperfect or biased recall, and for identifying risk factors that might have early pathogenic effects14. Several large-scale prospective cohort studies of genes and environment are underway or in planning throughout the world, including the UK Biobank15 and a proposed large-scale US cohort study5. However, the need for this design in genetic research has been questioned16,17. The high costs, large sample sizes and long durations that are typical of prospective cohort studies have been contrasted with the potentially more efficient case–control design18.

Figure 2: The case–control and prospective cohort study designs.

Case–control studies identify individuals with and without disease, determine the differences between them in past exposures or biological characteristics, and then examine those differences for potentially causative factors. Prospective cohort studies identify individuals with and without a given exposure, follow them through time to determine who develops disease, and then examine differences in the preceding exposures for potentially causative factors. Modified with permission from Ref. 12 © (2003) Massachusetts Medical Society.

Here we present the advantages of the prospective cohort design, which avoids or significantly reduces the important weaknesses of the case–control design, particularly with respect to identifying gene–environment interactions. We begin by discussing how bias can be introduced into studies of risk factors for disease, followed by an analysis of the extent to which each design is affected by such biases and other weaknesses, and the advantages that prospective cohort studies provide. We then outline the instances in which we believe that prospective cohort studies have important advantages, with a feasibility analysis that includes the sample sizes needed to identify genetic and environmental risk factors and their interactions, and the challenges faced. On this basis, we argue that prospective cohort studies provide a valuable, feasible and, indeed, indispensable means of exploring the genetic basis of complex human diseases. We also put forward the case for carrying out new, large-scale studies of this type to determine the roles of genes and environment in diseases of major public health importance.

Potential sources of bias

The validity of the evidence from observational studies of the genetic and environmental influences on disease relies on the avoidance of bias, which is defined as: “Any process at any stage of inference which tends to produce results or conclusions that differ systematically from the truth”.19 Reduction of bias is the principal reason for preferring the prospective cohort design to the case–control design.

At least 35 types of bias have been described19, but 8 are crucial in assessing the strengths and weaknesses of case–control and prospective cohort studies (Box 1). Particularly important are biases in subject selection20, especially prevalence–incidence bias, which occurs when a study of currently evident (prevalent) cases (which are often identified through medical records) overlooks fatal cases or other short episodes21. This is a particular problem if a sizeable subset of cases suffers a rapid and fatal course (as in coronary disease or some cancers), so that the 'aetiological' factors that are subsequently identified among the subset of survivors are actually more related to survival or a benign prognosis than to disease causation22. Another potentially important form of respondent bias in genetic studies is the tendency for people with a positive family history to be more likely to participate23,24. A critically important bias in the estimation of self-reported environmental exposures is recall bias. This type of bias occurs when disease status influences the reporting of exposures, for example, when questions about exposure to a putative cause might be asked many times of known cases (or they might repeatedly search their memories) but only once of those without disease.

Any of these forms of bias can severely affect the validity and generalizability of any observational study of disease aetiology. Although concerns about recall bias tend to be dismissed in genetic studies because determination of the key exposure (a genetic variant) does not rely on recall and the temporal nature of the genetic association is clear, the potential for bias in the selection of cases and controls and in the assessment of other exposures remains25.
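The prevalence–incidence bias described above can be illustrated with a small worked calculation. In this sketch an exposure has no effect on disease occurrence but improves survival after onset, so that a study restricted to surviving (prevalent) cases finds a spurious association. All frequencies and survival probabilities are hypothetical, as are the function names.

```python
def odds(p: float) -> float:
    """Convert a proportion to odds."""
    return p / (1.0 - p)


def exposure_odds_ratio(p_exposed_cases: float, p_exposed_controls: float) -> float:
    """Odds ratio comparing exposure frequency in cases and controls."""
    return odds(p_exposed_cases) / odds(p_exposed_controls)


def exposed_fraction_among_survivors(p_exposed: float,
                                     survival_exposed: float,
                                     survival_unexposed: float) -> float:
    """Exposure frequency among cases who survive long enough to be sampled."""
    exposed = p_exposed * survival_exposed
    unexposed = (1.0 - p_exposed) * survival_unexposed
    return exposed / (exposed + unexposed)


# Hypothetical scenario: 20% of incident cases and 20% of controls are
# exposed, so the exposure is unrelated to disease occurrence.
P_EXPOSED = 0.20
# Assumed survival to the time of sampling: 80% if exposed, 40% if not.
p_exposed_prevalent = exposed_fraction_among_survivors(P_EXPOSED, 0.80, 0.40)

or_incident = exposure_odds_ratio(P_EXPOSED, P_EXPOSED)
or_prevalent = exposure_odds_ratio(p_exposed_prevalent, P_EXPOSED)

print(f"odds ratio using all incident cases: {or_incident:.2f}")
print(f"odds ratio using surviving cases:    {or_prevalent:.2f}")
```

Under these assumptions the odds ratio is 1.0 when all incident cases are studied but 2.0 when only survivors are sampled, which is the sense in which an apparent 'aetiological' factor can reflect survival and not causation.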

Case–control studies

The advantages of the case–control design are compared with those of the prospective cohort approach in Table 1. Although the case–control design is often preferred during initial efforts to identify putative risk factors for common diseases because of ease and cost, it actually has particularly important advantages in the study of rare diseases. This is because it starts with diagnosed cases of disease, often from specialized referral centres, making identification and recruitment relatively easy. By contrast, the prospective cohort design requires the follow-up of large numbers of people who will never develop a rare disease, in order to identify the few cases who do14. The case–control design also allows the assessment of multiple exposures in relation to disease outcome, provided that those exposures can be measured retrospectively, or after disease has occurred. It can also allow a more detailed assessment of a particular exposure (such as in occupational or recreational settings) if that exposure is known to be especially relevant to the disease under study.

Table 1 Comparison of case–control and prospective cohort studies

Despite these advantages, case–control studies are prone to several of the sources of bias outlined in Box 1. A key requirement for a bias-free case–control study is that cases be representative of all those who develop the disease that is being studied. However, because cases are often identified in the clinical setting, mild cases or those that cause early mortality are likely to be missed, leading to prevalence–incidence bias. Another requirement is that the controls be representative of all those at risk of developing the disease26. In this respect, the potential threats to the representativeness of cases are also relevant to controls, particularly non-response bias. Differential response rates that are related to an individual's genetic background are possible in cases and controls owing to sample stratification by ancestry or a positive family history of disease24. Findings from a biased group of cases or controls might not be generalizable to the population at large, and might actually be invalid. Selection of controls is one of the most difficult and most heavily criticized aspects of case–control studies; indeed, it has been suggested that the ideal control group probably does not exist27.

A third requirement for a bias-free case–control study is that the collection of risk-factor and exposure information should be the same for cases and controls20. This can be difficult to ensure, particularly for information that has been collected in the course of clinical care, as invasive diagnostic approaches cannot be justified in healthy controls. Data collection methods must therefore be developed that can be applied equally to both groups. However, even this cannot control for the potential recall bias among the cases. Limiting the collection of risk-factor or biomarker information to the period before disease onset, if the time of onset can be clearly defined, will reduce biases in risk-factor ascertainment that are related to clinical care or awareness of disease status. Such use of pre-morbid risk-factor information will also strengthen inferences about the temporal nature of risk relationships, a key element in determining causality28. Unless extensive records exist before disease diagnosis, however, many key exposures, such as dietary patterns or medication use, cannot be collected retrospectively, and so pre-morbid risk factor information is often unavailable.

Another requirement for a valid case–control study is that the ancestral geographical origins and predominant environmental exposures of cases must not differ dramatically from those of controls. Fortunately, the collection of ancestry informative markers and information on potential environmental confounders allows adjustment for differences in genetic background and environmental exposures, as long as there is some commonality between cases and controls29,30. These must be applied carefully, however, to avoid over-adjusting for variants or exposures that might actually be causal31.
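One way to see why differences in ancestry between cases and controls matter is a stratified analysis. The sketch below uses the Mantel–Haenszel pooled odds ratio, a standard stratified estimator that the text does not name, so it should be read as one possible adjustment and not as the method of Refs 29,30. The two ancestry strata and all counts are invented for illustration.

```python
from typing import NamedTuple, Sequence


class Stratum(NamedTuple):
    """2x2 table within one ancestry group (hypothetical counts)."""
    carrier_cases: int         # a
    carrier_controls: int      # b
    noncarrier_cases: int      # c
    noncarrier_controls: int   # d


def crude_odds_ratio(strata: Sequence[Stratum]) -> float:
    """Odds ratio after collapsing all strata into a single table."""
    a = sum(s.carrier_cases for s in strata)
    b = sum(s.carrier_controls for s in strata)
    c = sum(s.noncarrier_cases for s in strata)
    d = sum(s.noncarrier_controls for s in strata)
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(strata: Sequence[Stratum]) -> float:
    """Mantel-Haenszel estimate: sum(a*d/n) / sum(b*c/n) over strata."""
    numerator = 0.0
    denominator = 0.0
    for s in strata:
        n = sum(s)
        numerator += s.carrier_cases * s.noncarrier_controls / n
        denominator += s.carrier_controls * s.noncarrier_cases / n
    return numerator / denominator


# Hypothetical data: the variant is common in group 1 and rare in group 2,
# and cases are drawn disproportionately from group 1. Within each group
# the variant is unrelated to disease (stratum-specific odds ratio of 1).
strata = [
    Stratum(carrier_cases=80, carrier_controls=40,
            noncarrier_cases=20, noncarrier_controls=10),
    Stratum(carrier_cases=10, carrier_controls=40,
            noncarrier_cases=40, noncarrier_controls=160),
]

crude = crude_odds_ratio(strata)
adjusted = mantel_haenszel_odds_ratio(strata)
print(f"crude odds ratio:    {crude:.2f}")
print(f"adjusted odds ratio: {adjusted:.2f}")
```

With these invented counts the crude odds ratio is about 3.2 although the variant has no effect within either group, and stratifying by ancestry returns the estimate to 1.0. The adjustment works only because both groups contain cases and controls, which corresponds to the commonality that the text requires.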

Finally, case–control studies allow the investigation of only one primary outcome: the condition by which cases are defined. Because complex diseases rarely occur in isolation and often share risk factors, the ability to examine genetic and environmental risk factors for a number of conditions after costly genomic assays have been done is one of the main advantages of cohort studies.

Prospective cohort studies

An important advantage of the prospective cohort design is that it allows standardized and detailed collection of pre-morbid exposure information, tailored to meet the goals of the study. The assessment of environmental risk factors, and therefore gene–environment interactions, is typically more extensive and less prone to bias in prospective cohort studies than in case–control studies, making the prospective cohort design much more suitable for studying environmental influences on disease risk. Recall bias in particular is avoided by collecting information before disease onset.

Another key aspect of the prospective cohort design is that all participants are followed in a systematic way, so that all cases of disease have an equal likelihood of being detected. This feature is important as it minimizes biases in case identification — particularly prevalence–incidence bias — that are typically encountered in clinical series. The time of disease onset can also be defined more clearly in prospective cohort studies than in case–control studies, and multiple disease outcomes can be studied.

The requirements for a generalizable prospective cohort study are that people recruited into the cohort have similar genetic and environmental exposures, and disease risk, to those who are not recruited, and that cohort members who are 'lost' to follow-up have similar exposures and disease risk to those remaining. A third requirement is that the likelihood of detection of disease is independent of the exposure of interest and potentially confounding factors such as age, other exposures and access to medical care. This ensures similarity of data collection (and avoidance of bias) between exposed and unexposed people.

Ascertainment methods and outcome definitions should be the same in all cohort members and should not differ in relation to the participants' genetic or environmental exposures. Changes in exposure history should be assessed by repeated collection of exposure information and analysed by appropriate longitudinal techniques32. Cohort studies that rely on outcomes that have been identified in the course of clinical care are prone to many of the biases discussed for case–control studies, so most prospective cohort studies implement a regular schedule of follow-up in which all participants are systematically investigated for the occurrence of disease and changes in exposure. The need for such ongoing follow-up has been one of the main criticisms of prospective cohort studies, as it is time-intensive and costly.

Other important limitations of the prospective cohort design include the large sample size needed to produce sufficient numbers of incident disease cases, which we discuss in more detail below, and the typically long duration needed for these cases to accrue. In addition, the need to identify and collect information on risk factors of interest before disease cases have accrued adds to the complexity and cost of prospective cohort studies, but is often the only way to obtain valid exposure information for the prediction of disease.

When should cohort studies be used?

Given the strengths and weaknesses of the two study designs, what are the areas of aetiological research for which the prospective cohort design is preferable? One such situation is the study of diseases for which case–control studies might miss the full range of disease manifestations, including those with a high mortality at onset, a short duration or a long preclinical phase. Such conditions include complex diseases that represent an important burden on health in the developed world, such as type 2 diabetes and pancreatic cancer (Table 2).

Table 2 Situations for which prospective cohort studies are likely to be superior to case–control studies

The prospective cohort design also allows the identification of predictive biomarkers that appear well before a disease is diagnosed clinically, and risk factors with a relationship to disease that is not constant over time, such as those that have a long latent period or a suggested early pathogenic effect. Prospective cohort studies are better suited to identifying risk factors that change after the onset of disease, such as those affected by disease, treatment or lifestyle changes, or those subject to imperfect or biased recall.

In addition, the prospective cohort design is preferable for studies of common diseases that seem to be genetically complex, that is, due to many genes of small effect rather than a single major gene. As discussed above, this is because the breadth and reliability of the environmental exposure data that can be obtained prospectively allows the examination of key gene–environment interactions and, consequently, greater validity in estimates of genetic effects.

Prospective cohort studies are also particularly well suited to studying multiple disease outcomes, especially those that might share risk factors, such as cancer, heart disease and diabetes. This potential of prospective cohort studies is infrequently realized, with many studies still being designed to assess only one major disease or group of diseases33,34. However, several notable studies do include multiple endpoints35,36,37. Given that the lifetime risk of heart disease is estimated to be one in three men and one in four women38, that of breast cancer is estimated to be one in eight women (as described in the SEER Cancer Statistics Review, 1975–2002), and that of prostate cancer is estimated to be one in six men39, the assessment of multiple outcomes would dramatically increase the efficiency of these studies. Existing cohort studies might also be supplemented to expand their ascertainment methods to other disease endpoints40,41, although this could require considerable additional funding, expertise and consent.

Last, prospective cohort studies are valuable for critically examining the potential risk factors that are initially identified through other approaches, including case–control studies. Many of the irremediable biases of case–control studies can be addressed only by confirming their findings in prospective cohort designs, so that a detailed and reliable estimation of environmental exposures can be included at the outset. Unfortunately, as important as such confirmatory studies are (for examples, see Refs 42–44), they also cause prospective cohort studies to be viewed as lacking original hypotheses and innovation45,46,47. Despite the negative way in which prospective cohort studies are sometimes viewed, however, their impact on public health is undeniable. This importance is highlighted by the fact that many clinical misperceptions, such as the ideas that isolated systolic hypertension is normal with ageing, that silent myocardial infarction does not carry an increased risk of mortality and that the risk of hypertension has a threshold rather than a continuous effect, have been dispelled by cohort studies43,48.

The need for new studies

Although many prospective cohort studies are already in place35,47,49, none is comprehensive enough to cover the main causes of morbidity and mortality that are relevant during an entire human lifetime, nor to provide sufficient diversity, in terms of racial, ethnic or socioeconomic groups, to be applicable to the general population in countries such as the United States. Although individual studies can address particular population segments, combining these existing studies into a single cohort carries the risk of significant between-study biases within the resulting large cohort. This issue was highlighted in responses to a Request for Information issued by the US National Human Genome Research Institute (NHGRI) in 2004. In addition, the need for comparable and broad-based data collection in all cohort members would necessitate the collection of new exposure information, disease outcomes and informed consent, and would therefore be unlikely to produce appreciable cost savings.

These considerations led an NHGRI Expert Panel to conclude that although existing studies could provide valuable experience, previously obtained data and large numbers of potentially interested study participants, combining those data in a way that allows meaningful cross-study analyses would be almost impossible. It would also risk limiting the study to the lowest common denominator of exposure information collected. Far preferable, although more costly, would be to design a prospective cohort study with state-of-the-art measures of multiple exposures and diseases right from the start, which could recruit some of its participants from existing studies if desired.

In light of these considerations, the NHGRI Expert Panel has recommended establishing a new cohort that is broadly representative of the US population. The participants would be selected to represent the entire human lifespan at the time of their entry into the cohort, and would undergo periodic re-examinations and annual follow-up for major disease outcomes. Similar plans are proposed for the UK Biobank, although that study has a more limited age range and periodic re-examinations of the entire cohort are not anticipated. Improved methods for exposure assessment have been highlighted as being crucial for such research to move forward5, and are being actively pursued, for example by the US National Institute of Environmental Health Sciences50 and the proposed Genes and Environment Initiative.

Feasibility of prospective cohort studies

Sample sizes and affordability. To examine the feasibility of carrying out successful large-scale prospective cohort studies, we estimated the sample sizes that would be needed to detect genetic and environmental effects, and gene–gene or gene–environment interactions. This was achieved by using incidence estimates from a common source (the Incidence and Prevalence Database, Timely Data Resources, Capitola, California) for a range of diseases to determine the number of cases that would accrue over a 5-year period of follow-up in samples of varying sizes that would reflect the general US population. The samples that we used are representative of the full age (from birth), sex and ethnicity distributions of the 2000 US Census. The estimated numbers of cases that are expected to arise are shown in Table 3. These numbers were then used to determine the minimum odds ratios that could be detected for environmental, genetic, gene–environment and gene–gene effects. The QUANTO program51 was used to calculate the minimum number of cases needed (assuming there are two matched controls for each case) for different frequencies of the risk allele, marginal genetic effect (odds ratio associated with the genetic variant alone), environmental exposure frequency and marginal environmental effect (odds ratio associated with the exposure alone) (Fig. 3).

Table 3 Estimated disease incidence rates in prospective cohort studies
Figure 3: Sample-size requirements in prospective cohort studies.

The estimated minimum detectable odds ratios after 5 years of follow-up for various cohort sizes and disease incidences are shown, assuming: 10% allele frequency for a dominant risk allele, 10% environmental exposure frequency, no prevalent cases in the cohort at the start of the study, 3% annual loss to follow-up, 80% power, and a type I error rate of 0.0001. Minimum odds ratios are shown for: an environmental exposure effect (a); a genetic effect (for a dominant variant) (b); a gene–environment interaction, assuming genetic and environmental marginal effects of 1.5 (c); a gene–gene interaction, assuming genetic and environmental marginal effects of 1.5 (d). Asterisks indicate minimum detectable odds ratios in excess of 10.

According to our estimates, a prospective cohort study of 1,000,000 subjects (Fig. 3a) would have sufficient power to detect an environmental exposure odds ratio of ≥1.5 for diseases of ≥0.05% incidence per year, such as colorectal cancer, whereas a study of 200,000 people could only detect an environmental odds ratio of ≥2.3 for diseases with this incidence. The minimum detectable odds ratios for genetic factors were slightly lower (indicating the power of the study was higher), mainly because a single individual has two 'chances' of carrying a dominant risk allele (Fig. 3b). For interactions, however, the minimum detectable odds ratios were much higher (that is, the power was lower), as would be expected from the much smaller number of participants exposed to both genetic and environmental risk factors. Whereas a prospective cohort study of 1,000,000 had sufficient power to detect a gene–environment interaction odds ratio of ≥1.4 for diseases of ≥0.5% incidence a year, a study of 200,000 could only detect this gene–environment interaction odds ratio for diseases of ≥3% incidence (Fig. 3c). For a disease of 0.05% incidence, the minimum detectable odds ratio was about 2.4 in the 1,000,000-person study, and as much as 7.0 in the 200,000-person study. Minimum detectable gene–gene odds ratios were slightly lower than gene–environment odds ratios (Fig. 3d).
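Two of the quantities behind these estimates can be approximated with simple arithmetic. The sketch below is a simplified illustration and does not reproduce the QUANTO calculations or the figures in Table 3. It assumes a constant annual incidence, applies the 3% loss to follow-up at the end of each year, ignores the removal of incident cases from the population at risk, and assumes Hardy–Weinberg proportions for the carrier frequency. The function names are hypothetical.

```python
def expected_incident_cases(cohort_size: int,
                            annual_incidence: float,
                            years: int = 5,
                            annual_loss: float = 0.03) -> float:
    """Approximate number of incident cases accrued during follow-up.

    Simplifications (assumptions of this sketch): constant incidence,
    losses applied at the end of each year, and no removal of cases
    from the population at risk.
    """
    under_follow_up = float(cohort_size)
    total_cases = 0.0
    for _ in range(years):
        total_cases += under_follow_up * annual_incidence
        under_follow_up *= 1.0 - annual_loss
    return total_cases


def carrier_frequency(risk_allele_frequency: float) -> float:
    """Fraction of people carrying at least one copy of the risk allele,
    assuming Hardy-Weinberg proportions."""
    return 1.0 - (1.0 - risk_allele_frequency) ** 2


# A disease with 0.05% annual incidence in cohorts of the two sizes discussed.
for size in (200_000, 1_000_000):
    cases = expected_incident_cases(size, 0.0005)
    print(f"cohort of {size:,}: about {cases:,.0f} cases in 5 years")

# A 10% risk-allele frequency versus a 10% exposure frequency.
print(f"carriers of a dominant risk allele: {carrier_frequency(0.10):.0%}")
```

Under these assumptions a dominant risk allele with a 10% frequency is carried by about 19% of participants, compared with the 10% who have the environmental exposure. This is the 'two chances' effect that makes the minimum detectable genetic odds ratios slightly lower.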

Genetic and environmental marginal odds ratios and interaction odds ratios of at least 1.5 are likely to be important to detect, as this is the magnitude of risk associated with genetic variants that is known to be important in complex diseases such as diabetes52,53. A cohort of 200,000 will provide adequate power within 5 years for only the most common diseases, such as cataracts and hypertension, and will miss these effects for important diseases such as myocardial infarction, diabetes and all cancers. By contrast, a cohort size of 500,000 — the number recommended by the NHGRI Expert Panel for a US cohort — will capture many more of these effects. For rarer diseases such as Parkinson disease or schizophrenia, gene–environment interactions would probably not be detectable within 5 years, even with 1,000,000 participants, but might be approached by continued follow-up and accrual of additional cases (or pooling with other cohort studies) over time. Conversely, gene–environment interactions for more common diseases, such as hypertension, could be examined early in follow-up and could be assessed for consistency in key subgroups. Of course, consideration of higher-order interactions (gene-by-gene-by-gene, or multiple interacting genetic and environmental factors) will require larger sample sizes and might not be approachable within a single study, even for the most common outcomes.

The recruitment of such large numbers of subjects will of course require substantial investment. The costs of the ongoing Women's Health Initiative Observational Study of 116,000 women, for example, have been estimated at US$128 per participant per year, with approximately $400 per participant for initial recruitment, or roughly $120 million for a 5-year study (J. Rossouw, personal communication).
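The arithmetic behind this estimate can be checked directly. The sketch below assumes that the total is simply the one-off recruitment cost plus the annual cost for each year of follow-up, with no inflation or attrition; the function name is hypothetical.

```python
def cohort_cost_usd(participants: int,
                    recruitment_per_participant: int,
                    annual_per_participant: int,
                    years: int) -> int:
    """Total cost: one-off recruitment plus annual follow-up, per participant."""
    per_participant = recruitment_per_participant + annual_per_participant * years
    return participants * per_participant


# Figures quoted in the text for the Women's Health Initiative
# Observational Study: 116,000 women, $400 recruitment, $128 per year.
total = cohort_cost_usd(116_000, 400, 128, 5)
print(f"${total:,}")  # prints $120,640,000
```

The result, roughly $120 million, matches the figure quoted for a 5-year study.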

Other factors that affect feasibility. Other challenges in conducting prospective cohort studies are well known, and include the difficulties in enrolling a generalizable population and maintaining high follow-up rates, assessing incident morbid events and classifying causes of death, and collecting detailed exposure information for the large number of exposures that are potentially relevant to multiple diseases. Monitoring incident diseases can also be difficult in settings that have no universal access to health care or electronic medical records. For example, this is the case in much of the United States, although electronic records do currently exist in large-scale health-maintenance organizations and military and veterans' health-care systems. Indeed, an electronic medical record for all US citizens is a high priority in the proposed National Health Infrastructure Initiative.

Although the size and complexity of a study addressing multiple diseases might seem daunting, complex diseases have many key risk factors in common. Data collection can therefore be prioritized to focus on the exposures with the greatest potential relevance to multiple diseases of public health importance, as described by the NHGRI Expert Panel and the Request for Information cited above. Challenges related to participant confidentiality and informed consent in large-scale genetic studies, and other difficult issues such as the return of genetic results, the costs of additional testing and clinical care, and the risks to insurance or employment status from research participation, are encountered in case–control as well as cohort studies and are being actively addressed in programmes such as the NHGRI Ethical, Legal and Social Issues programme54 and the Ethics and Governance Framework of the UK Biobank. A dynamic consent process and the ongoing follow-up that is a feature of prospective cohort studies might make these studies uniquely suited to addressing the ethical issues and participant concerns that are emerging in relation to evolving scientific opportunities. This could help to ensure continued high rates of participation through frequent participant contact and updated consent.

Although the case–control design avoids some of these logistical challenges, the generalizability of the resulting information is limited considerably, as described above. More importantly, the difficulties in conducting good cohort studies are far from insurmountable, as demonstrated by the many successful studies of this type. As discussed, added efficiency can be gained by expanding the number of disease outcomes ascertained, and by collecting expensive exposure measures on an informative subset using the 'case–control within a cohort' or 'nested case–control' design55,56. This design avoids many of the potential pitfalls of classic case–control studies by selecting incident cases and a sample of disease-free controls from within a prospective cohort study that was established earlier. The validity of the nested case–control design critically depends, however, on the ability to measure existing exposures before disease onset once cases have developed, as with biological samples collected and stored at study entry. Such an approach could also be used for limiting intensive assessment of outcomes to participants with a particular exposure, such as an environmental toxin, in a modification of the nested design.
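The selection step of a nested case–control study can be sketched as follows. This example uses incidence-density (risk-set) sampling, which is one common way of choosing controls and is an assumption here, since the text does not specify a sampling scheme. The `Participant` record, the function name and the toy cohort are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Participant:
    pid: int
    exit_time: float   # years from entry to diagnosis, loss or end of study
    is_case: bool      # True if exit_time marks an incident diagnosis


def sample_nested_case_control(cohort: Sequence[Participant],
                               controls_per_case: int,
                               rng: random.Random) -> Dict[int, List[int]]:
    """For each incident case, draw controls from cohort members who were
    still under follow-up and disease-free when that case was diagnosed."""
    matched: Dict[int, List[int]] = {}
    for case in cohort:
        if not case.is_case:
            continue
        # Risk set: everyone else whose follow-up extends beyond this
        # diagnosis. Later cases remain eligible, as they were still
        # disease-free at this time.
        risk_set = [p for p in cohort
                    if p.pid != case.pid and p.exit_time > case.exit_time]
        k = min(controls_per_case, len(risk_set))
        matched[case.pid] = [p.pid for p in rng.sample(risk_set, k)]
    return matched


# Toy cohort of six participants followed for up to 5 years (invented data).
cohort = [
    Participant(1, 2.0, True),
    Participant(2, 5.0, False),
    Participant(3, 4.0, True),
    Participant(4, 1.0, False),   # lost to follow-up before any diagnosis
    Participant(5, 5.0, False),
    Participant(6, 3.0, False),
]

matched = sample_nested_case_control(cohort, controls_per_case=2,
                                     rng=random.Random(0))
for case_id, control_ids in matched.items():
    print(f"case {case_id}: controls {sorted(control_ids)}")
```

Expensive assays would then be run only on stored baseline samples from the selected cases and controls, which is why the design depends on material collected at study entry.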

Conclusion

As noted by Langholz et al. “...once the cohort study resource is established and a sufficient number of cases has occurred, a study of genetic factors can proceed much more quickly and efficiently than a population-based study.”13 Of course, the existence of such studies depends on researchers having the prescience, persistence and resources to establish the population-based cohort in the first place.

Despite the near universal preference for quick returns, complex diseases develop over decades and the reliable identification of their aetiological factors requires detailed examination and long-term follow-up of disease-free individuals in prospective cohort studies. Such studies are a necessary complement to case–control studies and other epidemiological designs. We might not need many in place, if they are comprehensive enough and provide wide access to data and samples57 (with appropriate protections for participant confidentiality) and if they include the potential for adding new exposure or outcome assessments as science progresses. All of these characteristics have been recommended for the design of a possible large-scale US prospective cohort study5, and are included to varying degrees in other similar efforts such as UK Biobank, Biobank Japan58, and the Swedish National Biobank Program. The time to proceed with such studies is upon us.