The idea and shape of modern India was an invention of its twentieth-century political leaders, who crafted citizenship defined by civic and universalist, rather than ethnic or religious, criteria precisely because that citizenry is so diverse1. As Jawaharlal Nehru, the nation's first prime minister, wrote2: “[India] is four hundred million separate individual men and women, each differing from the other ... a bundle of contradictions held together by strong but invisible threads.” Who are these diverse peoples separated by caste, customs and language? Where did they come from, and when? What are the “invisible threads”, beyond claims on the state, that bind them? Studies of biological kinship, which search for the stories of ancestry marked indelibly in a person's genome, help to provide answers to these questions because they illuminate that unwritten past3. The latest addition to our attempts to understand India through genes comes from Reich, Singh and their colleagues (page 489 of this issue)4, who arrive at some bold conclusions about its past population history from genome-variation studies.

The earliest occupation of the subcontinent was by Austro-Asiatic people about 60,000 years ago. They were dispersed and driven into smaller enclaves with the arrival of the Dravidian speakers around 3000 BCE (Before the Common Era, the Common Era marking the same divide as BC and AD). The latter people were themselves driven south with the arrival of the Indo-European speakers in about 1500 BCE. These early events shaped the growth of an indigenous civilization, with much later conquests by Persians (543 BCE), Alexander III of Macedon (325 BCE), numerous colonial Europeans starting with the Portuguese (1510 CE), and the Mughals (1526 CE). They all came and they were all absorbed — their cultures and their genes — to create the current stew. Although there has been a preoccupation, by both native and foreign scholars, with understanding caste in India and the genetic differences it engenders, there is great diversity at every level: geography, language, caste and customs.

Studies of human variation in India started with the seminal anthropometric surveys of P. C. Mahalanobis in 1941. Subsequently, numerous investigators used various genetic markers (blood groups, serum proteins, enzymes and, later, DNA) to make sense of the vast diversity within the subcontinent. In the genomic era, the Indian Genome Variation Consortium5 published a study of 420 single nucleotide polymorphisms (SNPs — base-pair variations in DNA) in 75 genes in 1,871 individuals. The consortium's sample was drawn from 55 groups representing all four language families (Austro-Asiatic, Dravidian, Indo-European, Tibeto-Burman), geography (north, south, east, west), social levels (caste, tribe, religion) and group abundance (small, large), to document the great genomic diversity and the clustering of variation by ethnicity and language (but see ref. 6). The results implied that genetic studies of disease in 'Indians' are hopelessly inadequate unless they account for their specific ancestry. This feature is genetic proof of a population structure first described by the social anthropologist Irawati Karve7 as a “patchwork quilt where bits of material of the same colour and shape may be used in a pattern, but where each bit may be of an origin different in place and time”.

Reich, Singh and colleagues4 instead examine entire genomes' worth of 560,000 SNPs in 132 individuals from 25 groups representing the breadth of social, language and geographic variation in India (see Table 1 and Fig. 1 of their paper on page 490). They sample, in addition, two small groups (the Onge and Great Andamanese) from the Andaman Islands in the Bay of Bengal.

First, the authors show that Indian populations bear the genetic imprint of European, Asian and even, though rarely, African genomes5. Second, they find that diversity within India is three to four times greater than that observed within Europe, from which they conclude that many Indian populations, although currently large, were founded by small numbers of individuals with subsequent limited migration7. These founder events are dated by the genomic data to between 750 and 2,500 years ago, and therefore occurred well after the arrival of the putative Indo-European speakers.

Third, and most importantly, the authors clearly demonstrate that most of the Indian populations they sampled are mixtures of two groups that they term ANI (Ancestral North Indians) and ASI (Ancestral South Indians). The proportion of ANI ancestry in this mixture varies between 39% and 71% across India, and the mixture is evident in all caste and even tribal groups, and in both extant Indo-European and Dravidian speakers. However, greater ANI ancestry is significantly associated with Indo-European speakers and with traditionally 'higher' caste membership, even after controlling for language. This provides a model of how diversity within India came about. As such, its details are imperfect and will surely be contested, revised and improved; but its implications are significant.

Genetically, the ANI are closest to current-day Europeans whereas the ASI are closest to the disappearing Onge, but neither of these shared ancestries is recent. Reich, Singh et al. speculate that the ancestor to both Europeans and ANI spoke a proto-Indo-European language ancestral to both Sanskrit and European languages; the Onge–ASI ancestry is even more remote, and it is unclear whether the ASI were Dravidian speakers. Thus, Indian populations seem to share a distinctive pattern of ancestry: each descends from the same two major peoples, but differs in its ancestry proportions and in the specific genomic content it inherited, much like the many hands that can be dealt from a deck of cards. These interpretations are now possible because the authors4 have developed new statistical methods to assess specific hypotheses regarding population relationships and ancestry, and also because comparable genomic-variation data on many additional worldwide samples are now available8,9,10.
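The deck-of-cards picture can be made concrete with a toy calculation. The sketch below is not the authors' statistical method: it assumes, purely for illustration, that allele frequencies in the two ancestral groups are known, whereas in reality neither ANI nor ASI can be sampled directly. Under that assumption, the expected frequency of a variant in a mixed group is a weighted average of the two ancestral frequencies, and the ANI proportion can be recovered by least squares. All function names and frequency values are hypothetical.

```python
def admixed_frequency(alpha, p_ani, p_asi):
    """Expected allele frequency in a group with ANI ancestry fraction alpha."""
    return alpha * p_ani + (1.0 - alpha) * p_asi


def estimate_alpha(p_mix, p_ani, p_asi):
    """Least-squares estimate of the ANI fraction across several SNPs.

    Minimises the sum over SNPs of ((p_mix - p_asi) - alpha * (p_ani - p_asi))^2.
    """
    num = sum((m - s) * (n - s) for m, n, s in zip(p_mix, p_ani, p_asi))
    den = sum((n - s) ** 2 for n, s in zip(p_ani, p_asi))
    if den == 0:
        raise ValueError("ancestral frequencies are identical; alpha is not identifiable")
    return num / den


# Hypothetical allele frequencies at five SNPs (illustrative values only).
p_ani = [0.10, 0.80, 0.45, 0.30, 0.65]
p_asi = [0.50, 0.20, 0.40, 0.90, 0.05]

# Two hypothetical groups with the same ancestors but different proportions.
group_a = [admixed_frequency(0.70, n, s) for n, s in zip(p_ani, p_asi)]
group_b = [admixed_frequency(0.40, n, s) for n, s in zip(p_ani, p_asi)]

alpha_a = estimate_alpha(group_a, p_ani, p_asi)
alpha_b = estimate_alpha(group_b, p_ani, p_asi)
print(f"estimated ANI fraction, group A: {alpha_a:.2f}")
print(f"estimated ANI fraction, group B: {alpha_b:.2f}")
```

The two groups are built from the same ancestral 'deck' yet end up with different frequencies at every SNP, which is the sense in which populations can be alike in descent but distinct in content. Real data add sampling noise and drift since the mixture, which this sketch ignores.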

The suggestion that each Indian population had small numbers of founders implies strong 'random genetic drift', whereby current frequencies of gene variants depart strongly from their ancestral frequencies simply by chance, thereby increasing genome similarities between members of the same group. This drift effect is largely a result of the early demographic history being shaped by limited numbers of founders, and creates an 'inbreeding' effect whereby genetic variation is lost. This aspect is independent of the additional loss of variation from consanguinity that is found in many parts of India. The cumulative effect is that gene variants may have quite distinct frequencies in India compared with those expected in many other 'related' populations, and that Indians bear the imprint of this very recent local shared ancestry.
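How strongly the number of founders governs drift can be seen in a minimal simulation. The sketch below uses the textbook Wright-Fisher model with a constant population size, which is a deliberate simplification: real founder events involve a bottleneck followed by growth. The population sizes, starting frequency and number of generations are hypothetical values chosen only to show the contrast.

```python
import random


def drift(p0, n_individuals, generations, rng):
    """Follow one variant under Wright-Fisher drift in a diploid population
    of constant size n_individuals; return its final frequency."""
    p = p0
    copies = 2 * n_individuals  # two gene copies per diploid individual
    for _ in range(generations):
        # Each copy in the next generation is drawn at random from the current pool.
        count = sum(1 for _ in range(copies) if rng.random() < p)
        p = count / copies
    return p


def mean_abs_shift(p0, n_individuals, generations, n_loci, seed):
    """Average absolute departure from the starting frequency over many loci."""
    rng = random.Random(seed)
    shifts = [abs(drift(p0, n_individuals, generations, rng) - p0)
              for _ in range(n_loci)]
    return sum(shifts) / n_loci


# Hypothetical settings: same starting frequency, same time span, different sizes.
small = mean_abs_shift(0.3, n_individuals=20, generations=20, n_loci=100, seed=1)
large = mean_abs_shift(0.3, n_individuals=250, generations=20, n_loci=100, seed=1)
print(f"mean frequency shift, 20 individuals:  {small:.3f}")
print(f"mean frequency shift, 250 individuals: {large:.3f}")
```

In the smaller population, variant frequencies wander much further from their starting value over the same number of generations, and some variants are lost or fixed altogether, which is the loss of variation described above.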

This drift and differentiation has four implications. First, studies of relatively few individuals from any Indian population can characterize their common genomic variation adequately. Second, one predicts a high burden of genetically recessive disorders in India, many unique to each population, a burden estimated to owe more to local shared ancestry than to consanguinity. Third, some diseases will have elevated frequencies in many regions of India owing to shared ANI or ASI ancestry5. Fourth, without accounting for local ancestry, genetic association studies can suffer from numerous false positives arising from systematic differences in ancestry between cases and controls. Indeed, language and caste membership may not be adequate control factors.
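The fourth implication follows from simple arithmetic, sketched below under stated assumptions. Two hypothetical groups carry a variant at different frequencies, and the variant has no effect on disease in either. If cases happen to be drawn mostly from one group and controls mostly from the other, the variant still appears to differ between cases and controls. The frequencies and sampling shares are invented for illustration.

```python
def pooled_frequency(share_from_a, freq_a, freq_b):
    """Variant frequency in a sample that draws share_from_a of its members
    from group A and the remainder from group B."""
    return share_from_a * freq_a + (1.0 - share_from_a) * freq_b


# Hypothetical variant frequencies; the variant has NO effect on disease.
freq_a, freq_b = 0.20, 0.60

# Hypothetical sampling: cases mostly from group A, controls mostly from group B.
cases_unmatched = pooled_frequency(0.70, freq_a, freq_b)
controls_unmatched = pooled_frequency(0.30, freq_a, freq_b)

# Matching the ancestry mix of cases and controls removes the difference.
cases_matched = pooled_frequency(0.50, freq_a, freq_b)
controls_matched = pooled_frequency(0.50, freq_a, freq_b)

spurious_gap = cases_unmatched - controls_unmatched
matched_gap = cases_matched - controls_matched
print(f"case-control gap without ancestry matching: {spurious_gap:+.2f}")
print(f"case-control gap with ancestry matching:    {matched_gap:+.2f}")
```

The gap in the unmatched design reflects only who was sampled, not biology. Because strong drift makes frequencies differ between even closely related local groups, many variants across the genome would show such gaps at once.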

To a cynic, the existence of the ANI or ASI, their unique and remote ancestry within India, or their suggestive identities as Indo-European and Dravidian speakers, are already common knowledge. But the precise definition of their ancestral genomic content, their mixture throughout India and the importance of genetic drift are new and have serious implications for both human biology and medicine — and Indian society as well.

Nevertheless, the current analysis4 is only a beginning. The next stage will require samples from a much wider array of populations, including a better sampling of tribal populations and Tibeto-Burman speakers to understand their specific contributions. Indeed, sampling Indians, in the face of their diversity, is a challenge similar to that faced in Africa10. There is a strong impression that endogamy in India, the practice of preferring marriage within a group, has maintained genetic diversity. For it to do so, however, endogamy must act locally where diverse populations interact7: its role can be assessed only by sampling humans locally, not by comparing populations distant from one another.

A more comprehensive analysis will require sampling Indians across a grid, assessing both their cultural and genetic diversities, for a deeper understanding of local population structure and the genetic effects of endogamy. Caste and custom may be strong barriers between groups, perhaps even today. But the common shared ancestry and rampant ANI:ASI mixture may be the strong, invisible thread that binds all Indians.