Of the estimated 7 billion humans, more than 1.2 billion reside in India. This mass of humanity comprises many diverse ethnic, linguistic and religious groups. There is evidence that modern humans inhabited the Indian subcontinent soon after their departure from Africa ∼50 000–60 000 years before present, and the land has witnessed several subsequent human migrations and invasions that have shaped its unique cultural, social and genetic structure.

The Y chromosome has a well-established phylogeny, and the variants located in its single-copy region have been used extensively to give a male-specific evolutionary perspective. Continuing this trend, in the previous issue of this journal Debnath et al.1 examine the Y chromosomal variation in 375 male individuals belonging to 10 sub-Himalayan Indian populations and compare it with existing data sets.

The populations were sampled from the Terai-Duar savannah and grasslands in the present-day state of West Bengal (recently renamed ‘Paschim Banga’). This ecoregion lies between Nepal and Bhutan, and comprises the districts of Darjeeling, Jalpaiguri and Cooch Behar. Owing to the climatic conditions in this part of the world, ancient human fossils are scarce, but a number of Mesolithic settlements found west of this region in the Ganges plains indicate the presence of hunter-gatherers from 10 000 years before present onwards.2

The populations analysed included representatives of all four linguistic groups found in the region: Indo-European, Dravidian, Tibeto-Burman and Austro-Asiatic speakers. They comprised three Mundari-speaking Austro-Asiatic tribes, four Tibeto-Burman tribes, one Dravidian tribe and two Indo-European castes. On the basis of 76 bi-allelic Y markers, the authors demonstrate extensive genetic admixture among the linguistic groups and male gene flow from neighbouring populations, and from North and Southeast Asia. They conclude that geographical proximity rather than linguistic affinity is the better predictor of genetic relationships in this region, a finding that agrees with previously published work on populations from South Asia.3

The current study found an extremely diverse sub-Himalayan male gene pool. Three major haplogroups (O, H1a* and R1a1*) were shared across all four language families and accounted for 79% of the sampled population. The Indo-European castes were the most diverse, but shared many haplogroups with the Tibeto-Burman speakers. The latter retained a high frequency of O3a3c1 (M117-derived) Y chromosomes, which distinguishes them from their Austro-Asiatic neighbours, who lack this haplogroup and are predominantly O2a (derived for P31 and M95), as reported previously.4
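As a brief illustration of how a statement such as "the most diverse" can be quantified, the sketch below computes Nei's unbiased gene diversity from haplogroup counts. It is only a sketch: the commentary does not state which diversity measure was used, and the function name `haplogroup_diversity` and the counts shown are hypothetical, not data from the study.

```python
from collections import Counter

def haplogroup_diversity(assignments):
    """Nei's unbiased gene diversity: h = n/(n-1) * (1 - sum(p_i^2)).

    `assignments` is a list with one haplogroup label per sampled male.
    """
    n = len(assignments)
    if n < 2:
        raise ValueError("at least two individuals are needed")
    counts = Counter(assignments)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)

# Hypothetical sample of 10 males (illustrative labels and counts only).
toy_sample = ["O"] * 5 + ["H1a"] * 3 + ["R1a1"] * 2
toy_diversity = haplogroup_diversity(toy_sample)
```

A value near 0 means that almost all males carry the same haplogroup, and a value near 1 means that haplogroups are many and evenly represented.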

The study also provides additional support for the revised Y haplogroup H phylogenetic tree5 and reports an additional single-nucleotide polymorphism on this branch (Figure 1). Thirteen mutations are now associated with 10 H haplogroups. The H1 lineage is thought to have an indigenous origin in peninsular India and decreases in frequency outside the subcontinent.6 The haplogroup frequency is ∼22% in India, 12% in Nepal, 3.9% in Pakistan and <1% in Iran, Turkey, the Middle East, Central and Southeast Asia. The high frequency of this haplogroup in European Roma Gypsies (∼17%) is taken as evidence of their Indian ancestry. Within India, the haplogroup frequency differs widely among linguistic groups, being highest in Dravidian castes and tribes and lowest among Tibeto-Burman speakers, and it is tempting to speculate that it may be a signature associated with native Dravidians. However, one must remember that these linguistic differences evolved much more recently than the observed Y haplogroup distributions, and it is highly unlikely that they would be linked to any particular Y haplogroup. A major shortcoming of the study is that no Y short tandem repeats were genotyped, so we can observe only the patterns established by the distribution of slowly evolving bi-allelic polymorphisms, which date back to the Palaeolithic period and may have reached high frequency in a given population owing to genetic drift. Issues related to more recent and interesting events, such as the displacement of Dravidians within the subcontinent, cannot be addressed with such markers. Y short tandem repeat variance within certain lineages, such as H1 (M52-derived) Y chromosomes, could have been used to estimate coalescence times and provide a phylogeographic approach to the origins of this and other lineages. Similarly, O3a3c1 haplotypes could have been used to address lingering questions about the arrival of Austro-Asiatic and Tibeto-Burman speakers in this sub-Himalayan region, although dating these events is still fraught with uncertainty.
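To make the short tandem repeat argument concrete, the following is a minimal sketch of one widely used variance-based dating approach. Under a stepwise mutation model, the average squared difference (ASD) in repeat count between sampled haplotypes and the founder haplotype grows roughly linearly with time, so age in generations is approximately ASD divided by the per-locus mutation rate. Everything specific here is an assumption for illustration: the function names, the use of the per-locus median as the founder, the toy haplotypes, the mutation rate and the generation time. None of these come from the study under discussion.

```python
from statistics import median

def founder_haplotype(haplotypes):
    """Approximate the founder as the per-locus median repeat count."""
    return [median(locus) for locus in zip(*haplotypes)]

def asd_to_founder(haplotypes, founder):
    """Average squared difference from the founder, over individuals and loci."""
    total = 0.0
    for hap in haplotypes:
        total += sum((allele - f) ** 2 for allele, f in zip(hap, founder))
    return total / (len(haplotypes) * len(founder))

def coalescence_time_years(haplotypes, mu_per_generation, generation_years):
    """Age estimate: generations = ASD / mu, then scaled to years."""
    founder = founder_haplotype(haplotypes)
    asd = asd_to_founder(haplotypes, founder)
    return asd / mu_per_generation * generation_years

# Toy repeat counts for 4 males at 3 loci within one lineage (hypothetical).
toy_haplotypes = [
    [14, 12, 10],
    [14, 13, 10],
    [15, 12, 10],
    [14, 12, 11],
]
toy_asd = asd_to_founder(toy_haplotypes, founder_haplotype(toy_haplotypes))
# Placeholder rate and generation time; real analyses must justify both.
toy_age = coalescence_time_years(toy_haplotypes, mu_per_generation=0.001,
                                 generation_years=25)
```

The estimate scales inversely with the assumed mutation rate, which is one reason such dates remain uncertain: plausible rates differ severalfold, and the resulting ages differ by the same factor.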

Figure 1: Revised Y chromosome haplogroup H parsimony tree based on 13 bi-allelic polymorphisms. The name of each polymorphism is shown along the branches, and haplogroup names are given at the tips of each branch.

Gaining insights into male diversity by resequencing complete Y chromosomes is now technologically and economically feasible,7 and it is hoped that such indigenous Indian samples will be made available to the wider genomic community to complement the data sets being generated by the 1000 Genomes Project,8 which lacks such collections but includes five populations with South Asian ancestry (Bengalis from Bangladesh, Punjabis from Pakistan, and expatriate Sri Lankan Tamil, Indian Telugu and HapMap3 Gujarati in Houston). Genotyping the thousands of Y single-nucleotide polymorphisms generated by such studies will improve our resolution of the spatial and temporal distribution of the Y phylogeny. Eventually, the availability of whole-genome sequences from indigenous populations in this region will also enable us to better model the demographic history of South Asia and help unravel the distinctions between linguistic groups, castes and tribes in this part of the world.