The Simons Genome Diversity Project: 300 genomes from 142 diverse populations

Abstract

Here we report the Simons Genome Diversity Project data set: high quality genomes from 300 individuals from 142 diverse populations. These genomes include at least 5.8 million base pairs that are not present in the human reference genome. Our analysis reveals key features of the landscape of human genome variation, including that the rate of accumulation of mutations has accelerated by about 5% in non-Africans compared to Africans since divergence. We show that the ancestors of some pairs of present-day human populations were substantially separated by 100,000 years ago, well before the archaeologically attested onset of behavioural modernity. We also demonstrate that indigenous Australians, New Guineans and Andamanese do not derive substantial ancestry from an early dispersal of modern humans; instead, their modern human ancestry is consistent with coming from the same source as that of other non-Africans.

Figure 1: Genetic variation in the SGDP.
Figure 2: Cross-coalescence rates and effective population sizes for selected population pairs.
Figure 3: Present-day populations have negligible ancestry from an early dispersal of modern humans out of Africa.

Accession codes

Primary accessions

European Nucleotide Archive

Data deposits

Raw data for 279 genomes for which the informed consent documentation is consistent with fully public data release are available through the EBI European Nucleotide Archive under accession numbers PRJEB9586 and ERP010710. For the remaining 21 genomes (designated by code ‘Y’ in the seventh column of Supplementary Data Table 1), data are deposited at the European Genome-phenome Archive (EGA), which is hosted by the EBI and the CRG, under accession number EGAS00001001959. Data for these 21 genomes can be obtained by submitting to the EGA Data Access Committee a signed letter containing the following text: “(a) I will not distribute the data outside my collaboration; (b) I will not post the data publicly; (c) I will make no attempt to connect the genetic data to personal identifiers for the samples; and (d) I will not use the data for any commercial purposes.” Compact versions of the SGDP dataset and software for accessing it are available at http://genetics.med.harvard.edu/reichlab/Reich_Lab/Datasets.html. The short tandem repeat (STR) genotypes are available through dbVar (http://www.ncbi.nlm.nih.gov/dbvar) under accession number nstd128.

References

  1. Abecasis, G. R. et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012)

  2. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010)

  3. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010)

  4. Li, H. FermiKit: assembly-based variant calling for Illumina resequencing data. Preprint at http://arxiv.org/abs/1504.06574 (2015)

  5. Sudmant, P. H. et al. Global diversity, population stratification, and selection of human copy-number variation. Science 349, aab3761 (2015)

  6. Gymrek, M. & Erlich, Y. Profiling short tandem repeats from short reads. Methods Mol. Biol. 1038, 113–135 (2013)

  7. Gymrek, M., Golan, D., Rosset, S. & Erlich, Y. lobSTR: A short tandem repeat profiler for personal genomes. Genome Res. 22, 1154–1162 (2012)

  8. Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009)

  9. Patterson, N., Price, A. L. & Reich, D. Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006)

  10. Keinan, A., Mullikin, J. C., Patterson, N. & Reich, D. Accelerated genetic drift on chromosome X during the human dispersal out of Africa. Nat. Genet. 41, 66–70 (2009)

  11. Keinan, A. & Reich, D. Can a sex-biased human demography account for the reduced effective population size of chromosome X in non-Africans? Mol. Biol. Evol. 27, 2312–2321 (2010)

  12. Verdu, P. et al. Sociocultural behavior, sex-biased admixture, and effective population sizes in Central African Pygmies and non-Pygmies. Mol. Biol. Evol. 30, 918–937 (2013)

  13. Joiris, D. V. The framework of central African hunter-gatherers and neighbouring societies. African Study Monographs Suppl. 28, 57–79 (2003)

  14. Green, R. E. et al. A draft sequence of the Neandertal genome. Science 328, 710–722 (2010)

  15. Meyer, M. et al. A high-coverage genome sequence from an archaic Denisovan individual. Science 338, 222–226 (2012)

  16. Wall, J. D. et al. Higher levels of Neanderthal ancestry in East Asians than in Europeans. Genetics 194, 199–209 (2013)

  17. Reich, D. et al. Genetic history of an archaic hominin group from Denisova Cave in Siberia. Nature 468, 1053–1060 (2010)

  18. Prüfer, K. et al. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature 505, 43–49 (2014)

  19. Skoglund, P. & Jakobsson, M. Archaic human ancestry in East Asia. Proc. Natl Acad. Sci. USA 108, 18301–18306 (2011)

  20. Li, H. & Durbin, R. Inference of human population history from individual whole-genome sequences. Nature 475, 493–496 (2011)

  21. Schiffels, S. & Durbin, R. Inferring human population size and separation history from multiple genome sequences. Nat. Genet. 46, 919–925 (2014)

  22. Gronau, I., Hubisz, M. J., Gulko, B., Danko, C. G. & Siepel, A. Bayesian inference of ancient human demography from individual genome sequences. Nat. Genet. 43, 1031–1034 (2011)

  23. Schlebusch, C. M. et al. Genomic variation in seven Khoe-San groups reveals adaptation and complex African history. Science 338, 374–379 (2012)

  24. Veeramah, K. R. et al. An early divergence of KhoeSan ancestors from those of other modern humans is supported by an ABC-based analysis of autosomal resequencing data. Mol. Biol. Evol. 29, 617–630 (2012)

  25. Labuda, D., Zietkiewicz, E. & Yotova, V. Archaic lineages in the history of modern humans. Genetics 156, 799–808 (2000)

  26. Pickrell, J. K. et al. The genetic prehistory of southern Africa. Nat. Commun. 3, 1143 (2012)

  27. Patin, E. et al. Inferring the demographic history of African farmers and pygmy hunter-gatherers using a multilocus resequencing data set. PLoS Genet. 5, e1000448 (2009)

  28. Fu, Q. et al. Genome sequence of a 45,000-year-old modern human from western Siberia. Nature 514, 445–449 (2014)

  29. Groucutt, H. S. et al. Rethinking the dispersal of Homo sapiens out of Africa. Evol. Anthropol. 24, 149–164 (2015)

  30. Reyes-Centeno, H., Hubbe, M., Hanihara, T., Stringer, C. & Harvati, K. Testing modern human out-of-Africa dispersal models and implications for modern human origins. J. Hum. Evol. 87, 95–106 (2015)

  31. Rasmussen, M. et al. An Aboriginal Australian genome reveals separate human dispersals into Asia. Science 334, 94–98 (2011)

  32. Patterson, N. et al. Ancient admixture in human history. Genetics 192, 1065–1093 (2012)

  33. Liu, W. et al. The earliest unequivocally modern humans in southern China. Nature 526, 696–699 (2015)

  34. Fu, Q. et al. An early modern human from Romania with a recent Neanderthal ancestor. Nature 524, 216–219 (2015)

  35. Do, R. et al. No evidence that selection has been less effective at removing deleterious mutations in Europeans than in Africans. Nat. Genet. 47, 126–131 (2015)

  36. Harris, K. Evidence for recent, population-specific evolution of the human mutation rate. Proc. Natl Acad. Sci. USA 112, 3439–3444 (2015)

  37. Ségurel, L., Wyman, M. J. & Przeworski, M. Determinants of mutation rate variation in the human germline. Annu. Rev. Genomics Hum. Genet. 15, 47–70 (2014)

  38. Klein, R. G. & Edgar, B. The dawn of human culture. (Wiley, 2002)

  39. Racimo, F. Testing for ancient selection using cross-population allele frequency differentiation. Genetics 202, 733–750 (2015)

  40. Turchin, M. C. et al. Evidence of widespread selection on standing variation in Europe at height-associated SNPs. Nat. Genet. 44, 1015–1019 (2012)

  41. McBrearty, S. & Brooks, A. S. The revolution that wasn’t: a new interpretation of the origin of modern human behavior. J. Hum. Evol. 39, 453–563 (2000)

  42. Renfrew, C. Prehistory: the Making of the Human Mind. (Modern Library, 2009)

  43. Alexander, D. H. & Lange, K. Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinformatics 12, 246 (2011)

  44. Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015)

  45. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007)

Acknowledgements

We thank the volunteers who donated samples. We thank H. Blanche, N. Boivin, H. Cann (deceased), E. Eichler, H. Greely, M. Petraglia, K. Prüfer, A. Rogers, M. Steinrücken, U. Stenzel and P. Sudmant for comments, critiques, discussions, or advice on assembling samples. We thank S. Fan for uploading 21 genomes to the European Genome-phenome Archive. The sequencing was funded by the Simons Foundation (SFARI 280376) and the US National Science Foundation (BCS-1032255). I.M. was supported by a Long Term Fellowship grant LT001095/2014 from the Human Frontier Science Program. P.S. was supported by the Wenner-Gren Foundation and the Swedish Research Council (VR grant 2014-453). T.W. and M.G. were supported by an NIJ grant 2014-DN-BX-K089. Y.E. was supported by a Career Award at the Scientific Interface from the Burroughs Wellcome Fund and by NIJ grant 2014-DN-BX-K089. D.L. was supported by the Natural Sciences and Engineering Research Council of Canada. T.K. was supported by ERC Starting Investigator grant FP7 - 261213. R.S. received support from the Russian Foundation for Basic Research (#15-04-02543). S.D. received support from the Russian Foundation for Basic Research (#16-34-00599). R.K., E.K. and S.L. were supported by the Russian Foundation for Basic Research (11-04-00725-a). E.B. was supported by the Russian Foundation for Basic Research (16-06-00303). O.B. was supported by the Russian Scientific Fund (14-04-00827) and by the Russian Foundation for Basic Research (16-04-00890). D.M.B., H.S., E.M., R.V. and M.M. were supported by Institutional Research Funding from the Estonian Research Council IUT24-1 and by the European Regional Development Fund (European Union) through the Centre of Excellence in Genomics to Estonian Biocentre and University of Tartu. D.C. was supported by the Spanish MINECO grant CGL-44351-P. L.B.J. and W.S.W. were supported by NIH grant GM59290. S.A.T. was supported by NIH grants 5DP1ES022577 05, 1R01DK104339-01, and 1R01GM113657-01. C.T.-S. and Y.X. were supported by The Wellcome Trust grant 098051. C.M.B. was supported by NSF grants 0924726 and 1153911. K.T. was supported by CSIR Network Project grant (GENESIS: BSC0121). J.P.S. and Y.S.S. were supported in part by an NIH grant R01-GM094402, and a Packard Fellowship for Science and Engineering. G.R., J.K. and S.P. were funded by the Max Planck Society. N.P. and D.R. were supported by NIH grant GM100233 and D.R. is a Howard Hughes Medical Institute investigator.

Author information

Contributions

S.M., Y.E., Y.S.S., S.P., J.K., N.P. and D.R. supervised the study. S.N., N.R., C.G., G.P., F.B., G.D., I.G.R., A.R.J., P.D., D.M.B., C.M.B., C.C., T.H., A.M.-E., O.L.P., E.B., O.B., S.K.-Y., H.S., D.T., L.Y., C.T.-S., Y.X., M.S.A., A.R.-L., C.B., A.D.R., C.J., E.B.S., E.M., J.P., R.V., B.M.H., U.H., R.W.M., A.S., G.S., J.T.S.W., R.K., E.K., S.L., G.A., D.C., M.H., T.K., W.K., C.A.W., D.L., M.B., L.B.J., S.A.T., W.S.W., M.M., S.D., R.S., L.S., K.T. and D.R. assembled samples. S.M., H.L., M.L., I.M., M.G., F.R., J.P.S., M.Z., N.C., A.T., P.S., I.L., S.S., Q.F., G.R., Y.S., N.P. and D.R. performed analyses. S.M., H.L., M.L., I.M., M.G., F.R., M.Z., N.P. and D.R. wrote the manuscript with help from all co-authors.

Corresponding authors

Correspondence to Swapan Mallick or David Reich.

Ethics declarations

Competing interests

U.H. is employed by NextBio, a division of Illumina Ltd.

Additional information

Reviewer Information Nature thanks P. Bellwood and S. Ramachandran and the other anonymous reviewer(s) for their contribution to the peer review of this work.

Extended data figures and tables

Extended Data Figure 1 Heat map of fraction of heterozygous sites missed in the 1000 Genomes project.

For each sample, we examine all heterozygous sites passing filter level 1, and compute the fraction included as known polymorphisms in the 1000 Genomes project.
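As a rough illustration of the per-sample statistic in this caption, the sketch below computes the fraction of a sample's heterozygous sites that are absent from a set of known polymorphisms (the quantity in the figure title) and its complement (the fraction included). The function name, the (chromosome, position) site keys and the toy data are hypothetical; the pipeline actually used for the figure is not described in this section.

```python
def fraction_missed(het_sites, known_sites):
    """Return the fraction of heterozygous sites absent from known_sites.

    het_sites: iterable of (chromosome, position) tuples for one sample
    known_sites: iterable of (chromosome, position) tuples for known polymorphisms
    """
    het_sites = set(het_sites)
    if not het_sites:
        raise ValueError("sample has no heterozygous sites")
    missed = het_sites - set(known_sites)
    return len(missed) / len(het_sites)


# Toy example (hypothetical data): 4 heterozygous sites, 3 already known.
sample_hets = [("1", 1000), ("1", 2000), ("2", 500), ("X", 42)]
known = {("1", 1000), ("1", 2000), ("2", 500)}
missed = fraction_missed(sample_hets, known)
included = 1.0 - missed
```

Repeating this for every sample would give one value per genome, which is the kind of quantity a heat map across populations can display.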

Extended Data Figure 2 Worldwide variation in human short tandem repeats.

a, Mean STR length is reported as the average of the length difference (in base pairs) from the GRCh37 reference for each genotype. Bubble area scales with the number of calls compared at each point. b, c, The first two principal components after performing principal component analysis on tetranucleotide and homopolymer genotypes, respectively. Colours represent the region of origin of each sample. d, Pairwise FST values between populations computed using only SNPs versus using combined SNP + STR loci. e, Block jackknife standard errors for the SNP versus SNP + STR FST analysis. The red dashed lines give the best-fit line, described by the formula in red. The black dashed line denotes the diagonal.
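Panel e reports block jackknife standard errors. The following is a minimal sketch of a delete-one-block jackknife for a ratio-of-sums statistic such as FST, assuming equally weighted blocks; the analysis in the paper may use a weighted variant, and the function name and input layout are hypothetical.

```python
import math


def block_jackknife_se(num_by_block, den_by_block):
    """Delete-one-block jackknife for a ratio-of-sums statistic.

    num_by_block, den_by_block: per-block numerator and denominator totals
    (for example, per-chromosome-block sums of an FST estimator).
    Returns (estimate from all blocks, jackknife standard error).
    Assumes equally weighted blocks (an unweighted jackknife).
    """
    n = len(num_by_block)
    if n < 2 or n != len(den_by_block):
        raise ValueError("need at least two blocks of matching length")
    total_num, total_den = sum(num_by_block), sum(den_by_block)
    full = total_num / total_den
    # Recompute the statistic with each block left out in turn.
    leave_one_out = [
        (total_num - num_by_block[i]) / (total_den - den_by_block[i])
        for i in range(n)
    ]
    mean_loo = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((x - mean_loo) ** 2 for x in leave_one_out)
    return full, math.sqrt(variance)


# Toy example (hypothetical numbers): three blocks.
estimate, std_err = block_jackknife_se([1.0, 2.0, 3.0], [10.0, 10.0, 10.0])
```

Blocks are left out whole so that linkage between neighbouring sites within a block does not make the standard error look smaller than it is.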

Extended Data Figure 3 ADMIXTURE analysis.

We carried out unsupervised ADMIXTURE 1.23 (refs 8, 43) analysis over the 300 SGDP individuals in 20 replicates with randomly chosen initial seeds, varying the number of ancestral populations between K = 2 and K = 12 and using default fivefold cross-validation (--cv flag). We used genotypes of at least filter level 1, and restricted analysis to sites where at least two individuals carried the variant allele (as singleton variants are non-informative for population clustering). After further restricting to sites with at least 99% completeness and performing linkage-disequilibrium-based pruning in PLINK 1.9 (refs 44, 45) with parameters (--indep-pairwise 1000 100 0.2), a total of 482,515 single nucleotide polymorphisms remained. This figure shows the highest likelihood replicate for each value of K. We found that log likelihood monotonically increases with K, while the value K = 5 minimizes cross-validation error (not shown). The solution at K = 5 corresponds to major continental groups (Sub-Saharan Africans, Oceanians, East Asians, Native Americans, and West Eurasians), but we show the full range of K here as they illustrate finer-scale population structure that may be useful to users of the data.
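The site filters and the choice of replicate described in this caption can be sketched in plain Python. This is an illustration only: it does not call ADMIXTURE or PLINK, the in-memory genotype layout and the run-record keys ("K", "loglik", "cv_error") are hypothetical, and comparing values of K by the cross-validation error of the highest-likelihood replicate is an assumption.

```python
def filter_sites(genotypes, min_carriers=2, min_completeness=0.99):
    """Return indices of sites passing the carrier and completeness filters.

    genotypes: list of sites; each site is a non-empty list with one entry
    per individual giving the count of variant alleles (0, 1 or 2), or None
    when the genotype is missing.
    """
    kept = []
    for index, site in enumerate(genotypes):
        called = [g for g in site if g is not None]
        completeness = len(called) / len(site)
        carriers = sum(1 for g in called if g > 0)
        if carriers >= min_carriers and completeness >= min_completeness:
            kept.append(index)
    return kept


def summarize_runs(runs):
    """Pick the highest-likelihood replicate per K, then the K whose
    selected replicate has the lowest cross-validation error."""
    best = {}
    for run in runs:
        k = run["K"]
        if k not in best or run["loglik"] > best[k]["loglik"]:
            best[k] = run
    best_k = min(best, key=lambda k: best[k]["cv_error"])
    return best, best_k


# Toy genotypes (hypothetical): 4 individuals, 4 sites.
toy_sites = [
    [0, 1, 1, 0],     # two carriers, complete: kept
    [0, 0, 1, 0],     # singleton: dropped
    [1, None, 2, 0],  # 75% complete: dropped
    [2, 2, 0, 0],     # two carriers, complete: kept
]
kept_sites = filter_sites(toy_sites)

# Toy replicate records (hypothetical values).
toy_runs = [
    {"K": 2, "loglik": -100.0, "cv_error": 0.60},
    {"K": 2, "loglik": -90.0, "cv_error": 0.58},
    {"K": 3, "loglik": -80.0, "cv_error": 0.55},
    {"K": 3, "loglik": -85.0, "cv_error": 0.50},
    {"K": 4, "loglik": -70.0, "cv_error": 0.57},
]
best_runs, best_k = summarize_runs(toy_runs)
```

In the toy records the log likelihood rises with K while the cross-validation error is lowest at an intermediate K, which mirrors the pattern the caption reports.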

Extended Data Figure 4 Principal component analysis and neighbour joining tree.

a, Principal component analysis. b, Neighbour-joining tree based on FST values for all populations with at least two samples.

Extended Data Figure 5 Fewer accumulated mutations in Africans than in non-Africans confirmed by mapping to chimpanzee.

We compute a statistic D(Population A, Population B, Chimp), which measures the difference in the rate of matching to chimpanzee between Population A and Population B. Evidence of differential mismatching to chimpanzee is seen when we restrict to the male X chromosome, which eliminates possible effects of differences in heterozygosity across populations, and map to the chimpanzee genome, which is phylogenetically symmetrically related to all present-day humans. In 78 randomly chosen pairs of males (Population A = African, Population B = non-African), transversion substitutions show no consistent skew from zero, but transition substitutions do.
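One way to make the statistic concrete is sketched below for a single pair of haploid sequences (such as male X chromosomes) aligned to chimpanzee. The normalization, D = (nA - nB) / (nA + nB) with nA the number of sites where only A matches chimpanzee and nB the number where only B does, and the sign convention are assumptions for illustration; the exact definition used in the paper is given in its Supplementary Information and may differ. All names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(x, y):
    """True for A<->G or C<->T changes; other changes are transversions."""
    return {x, y} <= PURINES or {x, y} <= PYRIMIDINES


def d_statistic(hap_a, hap_b, chimp):
    """Return D separately for transitions and transversions.

    hap_a, hap_b, chimp: aligned sequences of equal length over A, C, G, T,
    with N marking missing data. Positive D means A matches chimpanzee more
    often than B does (assumed sign convention).
    """
    counts = {"transition": [0, 0], "transversion": [0, 0]}
    for a, b, c in zip(hap_a, hap_b, chimp):
        if a == b or "N" in (a, b, c):
            continue  # uninformative or missing site
        if a == c:
            kind = "transition" if is_transition(b, c) else "transversion"
            counts[kind][0] += 1  # only A matches chimpanzee
        elif b == c:
            kind = "transition" if is_transition(a, c) else "transversion"
            counts[kind][1] += 1  # only B matches chimpanzee
        # Sites where neither haplotype matches chimpanzee are ignored.
    result = {}
    for kind, (a_match, b_match) in counts.items():
        total = a_match + b_match
        result[kind] = (a_match - b_match) / total if total else float("nan")
    return result


# Toy alignment (hypothetical): six sites.
d_values = d_statistic("AAGCGT", "GGAACT", "AAACCT")
```

Splitting the counts by substitution class is what allows a skew in transitions to be distinguished from the absence of one in transversions.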

Extended Data Figure 6 3P-CLR scan for positive selection.

The red line denotes the 99.9% quantile cut-off. The genes in the top five regions are labelled. a, Scan for selection on the San terminal branch. b, Scan for selection on the non-San terminal branch. c, Scan for selection on the ancestral modern human branch.
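A quantile cut-off of this kind can be sketched as follows. The caption does not state which quantile convention was used, so the nearest-rank definition here is an assumption, and the function names and toy scores are hypothetical.

```python
import math


def quantile_cutoff(scores, q=0.999):
    """Nearest-rank empirical quantile: the smallest score such that at
    least a fraction q of all scores are less than or equal to it."""
    if not scores:
        raise ValueError("no scores given")
    ordered = sorted(scores)
    # Round before taking the ceiling to absorb floating-point noise in q * n.
    rank = max(1, math.ceil(round(q * len(ordered), 9)))
    return ordered[rank - 1]


def windows_above_cutoff(scores, q=0.999):
    """Indices of windows whose score lies strictly above the q-quantile."""
    cutoff = quantile_cutoff(scores, q)
    return [i for i, score in enumerate(scores) if score > cutoff]


# Toy scan (hypothetical): 1,000 windows with scores 0.0 to 999.0.
toy_scores = [float(i) for i in range(1000)]
cutoff = quantile_cutoff(toy_scores)
candidates = windows_above_cutoff(toy_scores)
```

With 1,000 windows, a 99.9% cut-off leaves roughly one window above the line, so the number of candidate regions scales with the number of windows scanned.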

Extended Data Figure 7 Scan for genomic locations where the great majority of present-day humans share a recent common ancestor.

We carried out PSMC analysis on 40 pairs of haploid genomes chosen to sample some of the most deeply divergent present-day human lineages. We recorded the time since the most recent common ancestor (TMRCA) at each position, and rescaled to obtain an estimate of absolute time (Supplementary Information section 12). a, Distribution across the genome of the fraction of TMRCAs below specified date cut-offs. For the 100 kya cut-off, the maximum fraction observed anywhere in the genome is 68%. b, Distribution across the genome of the date T at which specified fractions of sample pairs are inferred to have a TMRCA less than T. c, Percentile points of the cumulative distribution function of B.
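The quantity in panel a can be illustrated with a short sketch: for each genomic position, the fraction of sample pairs whose TMRCA falls below a cut-off, and the maximum of that fraction across positions. The input layout (one list of TMRCA values per pair, aligned by position) and all names are hypothetical, and the toy values are not from the paper.

```python
def fraction_below(tmrca_by_pair, cutoff):
    """Per-position fraction of sample pairs with TMRCA below cutoff.

    tmrca_by_pair: list with one list per sample pair; each inner list gives
    that pair's TMRCA (in years) at the same ordered set of positions.
    """
    n_pairs = len(tmrca_by_pair)
    n_positions = len(tmrca_by_pair[0])
    return [
        sum(1 for pair in tmrca_by_pair if pair[pos] < cutoff) / n_pairs
        for pos in range(n_positions)
    ]


# Toy data (hypothetical): 4 pairs, 3 positions, TMRCA in years.
toy_tmrca = [
    [50_000, 200_000, 90_000],
    [80_000, 150_000, 300_000],
    [120_000, 90_000, 400_000],
    [60_000, 500_000, 250_000],
]
fractions = fraction_below(toy_tmrca, cutoff=100_000)
max_fraction = max(fractions)
```

A position where this fraction reached 1.0 would be one where all sampled pairs share a common ancestor more recently than the cut-off; the caption reports that for the 100 kya cut-off the maximum observed is 68%.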

Extended Data Table 1 Fewer accumulated mutations in Africans than in non-Africans

Supplementary information

Supplementary Information

This file contains Supplementary Text and Data, Supplementary Tables, Supplementary Figures and additional references (see Contents for details). (PDF 8661 kb)

Supplementary Table 1

This file shows the data by each sample studied. (XLSX 124 kb)

Supplementary Table 2

This table shows the top hits for 3P-CLR run. (XLSX 71 kb)

About this article

Cite this article

Mallick, S., Li, H., Lipson, M. et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016). https://doi.org/10.1038/nature18964

  • DOI: https://doi.org/10.1038/nature18964
