1000 GENOMES: A World of Variation
Capturing roughly 95% of the common genetic variations that can be detected among the Earth's 6.9 billion individuals1 is no small task. But in a report of pilot data from the 1000 Genomes Project published last fall,2 researchers attempted to do just that and generated the most comprehensive map ever compiled of human genetic variation.
Launched in 2008, the 1000 Genomes Project has set its sights on describing more than 95% of variants that occur in as few as 1%, or even fewer, of the world's people. Five major population groups are targeted—West African, European, American, and East and South Asian.
“To understand a disease fully, we need to do much more extensive studies than any one such analysis should do,” says Aravinda Chakravarti, PhD, director of the Center for Complex Disease Genomics at Johns Hopkins' McKusick-Nathans Institute of Genetic Medicine. With this detailed baseline of genetic differences among living adults in hand, “it will be up to other scientists to use this … for any of the kinds of research they would want to do.”
The clinical impact stemming from a deeper understanding of gene dysregulation in human disease will unfold gradually, Chakravarti adds. “But when it comes, it will be with a deep and … pervasive benefit in treatment.”
Since the first draft of the human genome sequence was published a decade ago, sequencing technology and related tools have rocketed forward, making it feasible to systematically capture the genomes of many individuals and build a comprehensive catalog. In its Oct 28, 2010 issue, the same issue in which the 1000 Genomes pilot data were published, Nature estimated that by the end of 2011, North American laboratories alone will have sequenced approximately 9000 genomes.3 Scarcity of high-throughput sequencing equipment in most other parts of the world has led to underrepresentation of genomes sequenced in nonwhite and non-Asian populations, the journal noted.
The 1000 Genomes Project involves researchers from more than 75 institutions and companies4 in the United States, United Kingdom, China, and Germany.5 The estimated cost of the 5-year project, scheduled to end next year, is a whopping $120 million.6
The pilot, proof-of-principle studies published last fall used three strategies: low-coverage whole-genome sequencing of 179 people, combining partial data across the samples; sequencing that targeted 8140 exons, or protein-coding regions, in 697 people; and deep sequencing of six people (two mother-father-daughter triads).2 Comparing the low-coverage and exon-only data with the highly accurate and detailed genomic data allowed the researchers to show that both of the less-thorough strategies can provide valid results and have different, complementary strengths.
“The three pilots of the 1000 Genomes Project were mostly designed to achieve a careful and comprehensive evaluation of the new technology and the potential biases and artifacts that could affect the use and interpretation of the data,” notes genomics researcher Nicole Soranzo, PhD, of the Wellcome Trust Sanger Institute near Cambridge, England.
The focus on protein-coding sequences allows scientists to at least begin to understand why any variants that are found might be causing disease, says Jonathan Seidman, PhD, genetics professor at Harvard University. “If you find a variant in a sequence that does not code proteins, you have no idea what it does.”
In all, the pilot research found 15 million single-nucleotide polymorphisms (SNPs), or instances in which one base in DNA was replaced by another. The research also detected 1 million short insertions and deletions of DNA and 20 000 large, structural variants.2 Populations of African ancestry contributed the largest number of variants to the data, the researchers noted, including the biggest portion of novel variants.2
More than half of all the genetic variants that were found were previously unknown. Individuals in the research, who lacked any particular phenotype, had an average of about 250 to 300 genes—or approximately 1% of all their genes—with loss-of-function variants. Fifty to 100 of the variants had previously been associated with an inherited disease, the pilot data showed.2
Deep sequencing of the two nuclear families allowed researchers to estimate, from variants present in the daughters but absent from both parents, that each person carries about 60 new variants.6
Meanwhile, a companion article published concurrently by Science7 described other significant technological advances8—namely, a technique for deciphering the number of copies of various genes in murky stretches of the genome in which sequence is heavily duplicated and highly identical. These copy-number variants, which can influence quantities of proteins produced, are also believed to influence disease risk.9 Using data from the 1000 Genomes Project, the team reporting in Science found substantial differences among three geographically defined populations in number of copies of certain genes. Also, for approximately 70% of these genes, the researchers cataloged subtle differences in sequence that arise.9
In the 1000 Genomes Project, the pilot studies alone generated 4.9 terabases of DNA sequence. At the same time, data have already been collected to attain the scale of the project touted in its title: 1000 genomes. Chakravarti, a member of several of the project's committees, expects analyses on at least that many genomes to be published by the end of the year, although bursts of data are released every three months and are immediately available for research use.
Seidman, who is not involved with the project, notes that even 1000 genomes are just a start. “We won't explain all human genetic variation based on these 1000 genomes,” he says, “but it will certainly advance the field.” Ultimately, the 1000 Genomes Project aims to achieve 2.5 times that—sequencing of 2500 genomes, from 27 populations worldwide.6
A significant outgrowth of the pilot studies is the development of more reliable methods to study genetic variation, including the demonstration that imputation methods, which statistically infer variants that were not directly genotyped by comparison with reference genomes, can enhance genome-wide association studies (GWAS).4,9 The more genomes studied, the better imputation works. “Imputation is what allows many more people's work to become comparable,” says Chakravarti. Even though researchers may use different technologies, ultimately all of their data can be brought to the same platform.
Researchers including Chakravarti, on the hunt for causes of specific ailments, are already putting the project's early data to work daily to understand complex diseases, such as hypertension, that cannot be ascribed to a single etiology. “What these resources allow us to try to do is [to] probe those questions much deeper than we could have done before,” he says. “We are using a microscope of increasing power already, long before the project has ended.”
In her work studying the genetics of cardiovascular disease, Soranzo, who is not a participant in the 1000 Genomes Project, expects to use its data much as other scientists will. She plans to impute rare variants from GWAS data, for more detailed association studies; to identify and catalog variants for functional follow-up; and to explore fine-scale structure of linkage disequilibrium (correlation patterns between nearby variants) in multiple populations.
Background data from the project will be crucial in allowing researchers to distinguish nondisease-causing from disease-causing variants for conditions such as congenital heart disease, cardiac arrhythmias, and cardiomyopathies, says Seidman. In a condition like dilated cardiomyopathy, for instance, the question when someone finds a new variant is, “Has it been found in 1000 other people?”
Soranzo notes that in recent years, GWAS has served as a standard statistical framework for finding genes associated with disease or with related traits, such as serum cholesterol or body mass index. She says that commercial genome-wide arrays that genotype between 300 000 and 1.2 million SNPs at the same time are used, but limitations of those arrays make them most suited to surveying common variants. “Other types of variations (including SNPs in the rare- to low-frequency range, short insertion/deletion polymorphisms, and other more complex structural rearrangements in the human genome) are not represented on these arrays.” Yet, the emerging line of thought, she says, is that genetic vulnerability to disease arises from a combination of many common variants, each exerting a weak effect, and potentially fewer variants with a stronger effect.
Every variant put on GWAS arrays is present in at least 5% of the population, notes Seidman. “Clearly that approach is not going to detect variants that occur in fewer than that, in 1% or 0.1% of the population.” In diabetes, he points out, no one has found a single variant that could explain most cases of the disease, and related variants that were found could not explain the condition in a large fraction of the diabetic population. “Assuming heritable genetic variation is responsible … it has to be that it is a combination of multiple rarer events,” Seidman says.
By cataloging variations down to lower frequencies, Soranzo says, the 1000 Genomes Project will allow the design of denser genome-wide arrays to be used in association studies and, thus, a deeper survey of genetic variants in those frequency ranges. Project leaders point out that the current data set is already beginning to support the development of advanced genotyping products and is helping to filter out variants that might cloud the search for rare disease mutations.6
At the same time, sequencing technology is marching on. Current techniques, Soranzo says, are beginning to be replaced by studies that exhaustively sequence genomes or the protein coding regions from phenotyped individuals, allowing direct association studies.
Lee Hood, MD, PhD, president of the Institute for Systems Biology in Seattle, notes that the cost of sequencing an entire genome at high coverage, about $10 000 now, is expected to drop soon to around $1000. “Full sequencing is going to become so inexpensive [that] it will be a waste to do low-coverage sequencing,” he says. In the 1000 Genomes Project pilot studies, low-coverage analysis entailed reading the genomes two to six times, compared with 42 times for the high-coverage study examining the families' genomes.
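To make the coverage figures concrete, the short Python sketch below computes average coverage as the total number of bases read divided by the size of the genome. The read counts, the 100-base read length, and the round genome size of 3 billion bases are illustrative assumptions chosen to land on 4-fold and 42-fold coverage; they are not figures reported by the project.

```python
def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base in the genome is read."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


# Assumed round figure for the human genome, in bases.
GENOME_SIZE = 3_000_000_000

# Hypothetical sequencing runs with 100-base reads.
low_coverage = mean_coverage(120_000_000, 100, GENOME_SIZE)
high_coverage = mean_coverage(1_260_000_000, 100, GENOME_SIZE)

print(f"low-coverage run:  {low_coverage:.0f}x")
print(f"high-coverage run: {high_coverage:.0f}x")
```

At low coverage, any given base may be read only a few times or not at all, which is why the pilot combined partial data across many samples.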
Hood lists some key accomplishments of the 1000 Genomes Project: helping scientists identify more and more relatively common SNPs, improving understanding of the architecture of genomes from diverse individuals, and providing a framework for posing deeper questions about evolution or the origins of disease. But he also cites two limitations: the project's low coverage, with the accompanying problem of distinguishing which findings are signal and which are noise, and the fact that the project's subjects, for the most part, are genetically unrelated.
By contrast, using complete genome sequencing of families has two enormous advantages: Scientists can use principles of Mendelian genetics to correct more than 70% of DNA sequencing errors, and the approach can identify extremely rare SNPs. Hood says that studies on families “integrate genetics with genomics in ways that have not been done before.” He believes that, ultimately, to pinpoint which genes or gene combinations underlie conditions such as complex cardiovascular diseases, classic genetics or systems biology approaches will prove invaluable.
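The idea behind using Mendelian principles as an error check can be illustrated with a small Python sketch. It is only an illustration, not the procedure Hood or the project used: it assumes a single site on a non-sex chromosome, genotypes written as two-letter strings such as "AG", and hypothetical function names. A child's genotype that cannot be formed by taking one allele from each parent is flagged as either a likely sequencing error or a candidate new variant.

```python
from itertools import product


def possible_child_genotypes(mother: str, father: str) -> set[str]:
    """Genotypes a child can inherit: one allele from each parent."""
    # Sort the two alleles so that "GA" and "AG" count as the same genotype.
    return {"".join(sorted(m + f)) for m, f in product(mother, father)}


def is_mendelian_consistent(mother: str, father: str, child: str) -> bool:
    """True if the child's genotype can be explained by the parents' genotypes."""
    return "".join(sorted(child)) in possible_child_genotypes(mother, father)


# Hypothetical genotype calls at one site in a mother-father-daughter trio.
print(is_mendelian_consistent("AA", "AG", "AG"))  # daughter's call fits the parents
print(is_mendelian_consistent("AA", "AG", "GG"))  # daughter's call needs a closer look
```

A flagged site does not say which of the three calls is wrong, so in practice such sites would need to be rechecked against the underlying reads.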
“I think the real key … is being able to translate the gene activity into the operation of biological networks,” Hood says. “What can be useful is to look at the genes that are present in the 1000 Genomes Project, the nature of the variation, and map them into key biological networks in cardiovascular disease, neurodegenerative disease, whatever you are interested in and see if there are candidates that stand out. Are there variants that might lead to interesting behaviors of those biological networks?”
In the end, Seidman believes, a key function of the 1000 Genomes Project will be to establish the precise contribution of inherited variation to disease. He points out that epigenetics—extra biological instructions layered on top of the DNA code—could be crucial to the puzzle, as well. “If [the 1000 Genomes Project] determines genetic variation is accountable for much of the inheritable variation, that will be one answer,” he says. “But if it doesn't, that will change the field.
“It's a very exciting time in human genetics, with the notion that in the next 5 to 10 years, we'll understand how inherited variation causes human disease.”
1000 Genomes Project data are available through www.1000genomes.org.
The opinions expressed in News & Views are not necessarily those of the editors or of the American Heart Association.
* News & Views are edited by Aruni Bhatnagar, Ali J. Marian, and Houman Ashrafian.
© 2011 American Heart Association, Inc.
U.S. Census Bureau. U.S. & world population checks. www.census.gov/main/www/popclock.html. Accessed January 3, 2011.
National Human Genome Research Institute. 1000 Genome Project. Available at www.genome.gov/pfv.cfm?ageID27528684. Accessed December 2, 2010.
1000 Genomes Project publishes analysis of completed pilot phase. Available at www.nih.gov/news/health/oct2010/nhgri-27.htm. Accessed December 2, 2010.
Sudmant PH, Kitzman JO, Antonacci F, Alkan C, Malig M, Tsalenko A, Sampas N, Bruhn L, Shendure J, 1000 Genomes Project, Eichler EE.
Katnelson A.
Pennisi E.