Search Results for: reich

Reich et al Duplicates

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have to make sure there are no duplicates, relatives, or other oddities that could cause issues with my analysis.

So I went back to the Reich et al Indian dataset.

The dataset itself doesn't have any duplicate or likely related samples. However, two Kharia samples are the same as samples in the Austroasiatic dataset. Since the Austroasiatic dataset has more SNPs in common with 23andMe, I removed these two samples from Reich et al.

The IBS/IBD analysis and the sample IDs are in a spreadsheet as usual.
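
To illustrate how duplicates stand out in this kind of check, here is a minimal sketch of pairwise identity-by-state (IBS) similarity. It assumes genotypes coded as minor-allele counts (0, 1, 2) with None for missing calls. The function name and the toy genotypes are hypothetical, and a dedicated tool such as PLINK would normally do this across all sample pairs.

```python
def ibs_similarity(g1, g2):
    """Mean proportion of alleles shared identical-by-state by two samples.

    g1, g2: sequences of minor-allele counts (0, 1 or 2); None marks a
    missing call. This coding is an assumption made for the sketch.
    """
    shared = 0
    compared = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue  # skip SNPs missing in either sample
        shared += 2 - abs(a - b)  # 2, 1 or 0 alleles shared at this SNP
        compared += 2
    if compared == 0:
        raise ValueError("no overlapping genotype calls")
    return shared / compared


# Toy genotypes (made up): sample_b duplicates sample_a, sample_c does not.
sample_a = [0, 1, 2, 1, 0, 2, None, 1]
sample_b = [0, 1, 2, 1, 0, 2, 1, 1]
sample_c = [2, 1, 0, 0, 1, 2, 1, 0]

print(ibs_similarity(sample_a, sample_b))  # 1.0 for the duplicate pair
print(ibs_similarity(sample_a, sample_c))  # lower for unrelated samples
```

A pair scoring at or very near 1.0 across hundreds of thousands of SNPs is a duplicate candidate; relatives score lower than that but above the typical value for unrelated pairs from the same population.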

Reich et al and Pan-Asian Datasets

I got access to the Reich et al (Nature 2009) dataset used in their paper "Reconstructing Indian population history".

It has the following populations:

Aonaga Aus Bhil
Chenchu Great_Andamanese Hallaki
Kamsali Kashmiri_Pandit Kharia
Kurumba Lodi Madiga
Mala Meghawal Naidu
Nysha Onge Sahariya
Santhal Satnami Siddi
Somali Srivastava Tharu
Vaish Velama Vysya

There are 141 individuals with 587,753 SNPs in their dataset, which is conveniently in PED format.
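
As a quick sanity check on counts like these, here is a rough sketch that reads the dimensions of a PED file. It assumes the standard whitespace-delimited PLINK layout of six leading fields followed by two allele columns per SNP. The function name and the toy file are made up for illustration.

```python
import os
import tempfile


def ped_dimensions(path):
    """Return (number of individuals, number of SNPs) for a PED file.

    Assumes six leading fields per line (family ID, individual ID, father,
    mother, sex, phenotype) followed by two allele columns per SNP.
    """
    n_individuals = 0
    n_snps = None
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # ignore blank lines
            allele_columns = len(fields) - 6
            if allele_columns < 0 or allele_columns % 2:
                raise ValueError("malformed PED line")
            snps_here = allele_columns // 2
            if n_snps is None:
                n_snps = snps_here
            elif snps_here != n_snps:
                raise ValueError("inconsistent SNP count across lines")
            n_individuals += 1
    return n_individuals, n_snps or 0


# Toy file with 2 individuals and 3 SNPs (made-up data).
with tempfile.NamedTemporaryFile("w", suffix=".ped", delete=False) as tmp:
    tmp.write("F1 I1 0 0 1 -9 A A G T C C\n")
    tmp.write("F2 I2 0 0 2 -9 A G T T 0 0\n")
    toy_path = tmp.name

print(ped_dimensions(toy_path))  # (2, 3)
os.remove(toy_path)
```

Run on the Reich et al file, this should report 141 individuals and 587,753 SNPs if the file matches the description above.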

Also, Blaise pointed me to the Pan-Asian SNP data used in the Dec 2009 Science paper "Mapping Human Genetic Diversity in Asia".

It includes the following 71 populations:

Maya Auca Quechua Karitiana Pima
Ami Atayal Melanesians Zhuang Han_Cantonese
Hmong Jiamao Jinuo Han_Shanghai Uyghur
Wa Alorese Dayak Javanese Batak_Karo
Lamaholot Lembata Malay Mentawai Manggarai
Kambera Sunda Batak_Toba Toraja Andhra_Pradesh
Karnataka Bengali-Assamese Rajasthan Uttaranchal Uttar Pradesh
Haryana Spiti Bhili Marathi Japanese
Ryukyuan Korean Bidayuh Jehai Kelantan
Kensiu Temuan Ayta Agta Ati
Iraya Minanubu Mamanwa Filipino Singapore_Chinese
Singapore_Indian Singapore_Malay Hmong (Miao) Karen Lawa
Mlabri Mon Paluang Plang Tai_Khuen
Tai_Lue H'tin Tai_Yuan Tai_Yong Yao
Hakka Minnan

It has 1,719 individuals with 54,794 SNPs. I wish it had more SNPs considering the wealth of populations.

Also, the Pan-Asian data is in the form of minor allele counts, so I need to convert that back to A/C/G/T. Since there are some HapMap populations included in the dataset, that shouldn't be too hard.
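
The conversion itself is simple once the two alleles of each SNP are known, for example by matching the HapMap samples in the data against their published genotypes. The sketch below assumes that lookup has already been built. The SNP ID, the allele table and the function name are all hypothetical.

```python
def count_to_genotype(minor_count, major, minor):
    """Turn a minor-allele count (0, 1 or 2) into a two-letter genotype."""
    if minor_count is None:
        return "00"  # PLINK-style missing genotype (an assumed convention)
    if minor_count not in (0, 1, 2):
        raise ValueError(f"unexpected minor-allele count: {minor_count}")
    # Major allele fills the copies that are not minor.
    return major * (2 - minor_count) + minor * minor_count


# Hypothetical lookup: SNP ID -> (major allele, minor allele).
alleles = {"rs0000001": ("A", "G")}

major, minor = alleles["rs0000001"]
print([count_to_genotype(c, major, minor) for c in (0, 1, 2, None)])
# ['AA', 'AG', 'GG', '00']
```

The harder part in practice is the lookup: which allele counts as minor depends on the sample the frequencies were computed in, and strand has to be reconciled before merging with other datasets.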

I am going to add both of these datasets to my big reference set.

Genetic Evidence for Recent Population Mixture in India

Finally, the paper I had been waiting for, ever since the conference presentations on ANI-ASI admixture dating by Moorjani et al at Reich Lab, is out:

Moorjani et al., Genetic Evidence for Recent Population Mixture in India, The American Journal of Human Genetics (2013),

Here's the abstract:

Most Indian groups descend from a mixture of two genetically divergent populations: Ancestral North Indians (ANI) related to Central Asians, Middle Easterners, Caucasians, and Europeans; and Ancestral South Indians (ASI) not closely related to groups outside the subcontinent. The date of mixture is unknown but has implications for understanding Indian history. We report genome-wide data from 73 groups from the Indian subcontinent and analyze linkage disequilibrium to estimate ANI-ASI mixture dates ranging from about 1,900 to 4,200 years ago. In a subset of groups, 100% of the mixture is consistent with having occurred during this period. These results show that India experienced a demographic transformation several thousand years ago, from a region in which major population mixture was common to one in which mixture even between closely related groups became rare because of a shift to endogamy.

In this paper, Moorjani et al calculate ANI (Ancestral North Indian) percentage as:

Compared with Reich et al, they changed the outgroup from Papuan to Yoruba and the ANI clade group from CEU (Utahn Whites) to Georgians. I think both are much better choices. Looking at the D-statistics in Table S2, Georgians are definitely an appropriate choice for forming a clade with ANI.


Another important result from the paper is the difference in the date of admixture for Dravidians (108 generations, or 3,132 years) and Indo-Europeans (72 generations, or 2,088 years).
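
Both year figures follow from the generation counts at 29 years per generation. That generation time is inferred here from the arithmetic (3,132 / 108 and 2,088 / 72), not quoted from the post, so treat it as an assumption in the sketch below.

```python
YEARS_PER_GENERATION = 29  # assumption: the value that reproduces both figures


def generations_to_years(generations, years_per_generation=YEARS_PER_GENERATION):
    """Convert an admixture date in generations to years."""
    return generations * years_per_generation


for group, generations in [("Dravidian", 108), ("Indo-European", 72)]:
    years = generations_to_years(generations)
    print(f"{group}: {generations} generations = {years:,} years")
# Dravidian: 108 generations = 3,132 years
# Indo-European: 72 generations = 2,088 years
```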

When testing for multiple waves of admixture, they find that multiple waves are more likely in upper-caste and middle-caste Indo-Europeans, and that the admixture history of many Indian groups is more complex.

UPDATE: Razib and Dienekes comment.

HarappaWorld HRP0289-HRP0297

I have added the HarappaWorld Admixture results for HRP0289-HRP0297 to the individual spreadsheet.

Do note that the admixture components do not necessarily represent real ancestral populations. Also, the names I have chosen for the components should be thought of as mnemonics to ease discussion. I chose them based on which populations in my data these components peaked in. They do not tell anything directly about ancestral populations. The best way to look at these admixture results is by comparing individuals and populations. Finally, the standard error estimates on these results can be about 1%. Therefore, it is entirely possible that your 1% exotic admixture result is just noise.

I also updated the results for HRP0274 using FTDNA Family Finder data instead of the Genographic 2.0 data that was originally submitted. As the Geno2 data has only 14,000 SNPs in common with my HarappaWorld calculator, it's interesting to see how HRP0274's admixture results change:

Component Geno2 FTDNA
South Indian 48.68% 46.00%
Baloch 34.22% 32.99%
Caucasian 4.33% 5.02%
Northeast Euro 3.89% 3.57%
Southeast Asian 2.75% 1.06%
Siberian 1.25% 1.87%
Northeast Asian 1.16% 1.69%
Papuan 1.14% 1.85%
American 0.87% 1.23%
Beringian 0.01% 1.23%
Mediterranean 0.00% 0.39%
Southwest Asian 1.69% 3.10%
San 0.00% 0.00%
East African 0.00% 0.00%
Pygmy 0.00% 0.00%
West African 0.00% 0.00%

The only differences greater than 1% are South Indian (2.68%), Southeast Asian (1.69%), Southwest Asian (1.41%), Baloch (1.23%), and Beringian (1.22%). It's remarkable that only 14,000 SNPs could give us a decent result.
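
The differences quoted here can be reproduced from the table above with a few lines. This is just the arithmetic, using the percentages as listed; the function name is made up.

```python
# Percentages copied from the table above: component -> (Geno2, FTDNA)
results = {
    "South Indian": (48.68, 46.00),
    "Baloch": (34.22, 32.99),
    "Caucasian": (4.33, 5.02),
    "Northeast Euro": (3.89, 3.57),
    "Southeast Asian": (2.75, 1.06),
    "Siberian": (1.25, 1.87),
    "Northeast Asian": (1.16, 1.69),
    "Papuan": (1.14, 1.85),
    "American": (0.87, 1.23),
    "Beringian": (0.01, 1.23),
    "Mediterranean": (0.00, 0.39),
    "Southwest Asian": (1.69, 3.10),
    "San": (0.00, 0.00),
    "East African": (0.00, 0.00),
    "Pygmy": (0.00, 0.00),
    "West African": (0.00, 0.00),
}


def large_differences(table, threshold=1.0):
    """Absolute differences above the threshold, largest first."""
    diffs = {name: round(abs(a - b), 2) for name, (a, b) in table.items()}
    big = {name: d for name, d in diffs.items() if d > threshold}
    return sorted(big.items(), key=lambda item: item[1], reverse=True)


for name, diff in large_differences(results):
    print(f"{name}: {diff:.2f}%")
```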

We have two new Gujarati participants. HRP0292, a Gujarati Jain, seems to be more similar to somewhat more southern populations. HRP0294, a Gujarati Sunni Vohra, has results somewhat similar to HRP0265 (Gujarati Patel Muslim) and is more north-oriented. Therefore, I have created a separate ethnic category for Gujarati Muslims in my ethnic spreadsheet. I'll have averages the next time I compute them.

We have two Indian adoptee participants as well. HRP0297 has results which match well with the Bengalis (other than the Brahmins) in this project. HRP0290's results are somewhat harder to figure out. The closest groups, though not very close, are probably Tharu from Uttarakhand and Satnami from Chhattisgarh (Reich et al dataset). A ChromoPainter analysis would be more useful here.

ANI-ASI Admixture Dating

Similar to an earlier conference poster, Reich Lab's Priya Moorjani et al have another poster at SMBE. Here's the abstract:

Estimating a date of mixture of ancestral South Asian populations
Linguistic and genetic studies have demonstrated that almost all groups in South Asia today descend from a mixture of two highly divergent populations: Ancestral North Indians (ANI) related to Central Asians, Middle Easterners and Europeans, and Ancestral South Indians (ASI) not related to any populations outside the Indian subcontinent. ANI and ASI have been estimated to have diverged from a common ancestor as much as 60,000 years ago, but the date of the ANI-ASI mixture is unknown. Here we analyze data from about 60 South Asian groups to estimate that major ANI-ASI mixture occurred 1,200-4,000 years ago. Some mixture may also be older--beyond the time we can query using admixture linkage disequilibrium--since it is universal throughout the subcontinent: present in every group speaking Indo-European or Dravidian languages, in all caste levels, and in primitive tribes. After the ANI-ASI mixture that occurred within the last four thousand years, a cultural shift led to widespread endogamy, decreasing the rate of additional mixture.

I bolded the portion which seems new compared to the previous abstract.

HarappaWorld Ancestral South Indian

Using the same method as I used for reference 3 admixture, I decided to guesstimate the Ancestral South Indian proportions, as given by Reich et al, for my HarappaWorld admixture run.

Basically, I used 92 of the 96 samples that Reich et al used to find population averages for the South Indian component. Then, I ran a linear regression of Reich et al's estimate of Ancestral South Indian (ASI) ancestry on the South Indian component average. Since Reich et al actually list Ancestral North Indian percentages in their paper but their model is a two-ancestry ANI+ASI one, I simply calculated the ASI percentages as 100% minus ANI.

The correlation between Reich et al ASI and my HarappaWorld South Indian component for the relevant populations turns out to be 0.99277086.

And the linear regression fit for the data is:

ASI = 2.5218942 + 0.8104836 * S_INDIAN

where both ASI (Reich et al) and S_INDIAN (HarappaWorld) are given in percentages.

Of the individuals in HarappaWorld, I kept only those who had a South Indian component of at least 20% for computing the ASI proportions.
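
Put together, the steps above can be sketched as follows. The coefficients and the 20% cutoff are the ones quoted in this post; the function names and the example inputs are made up for illustration.

```python
INTERCEPT = 2.5218942   # regression intercept quoted above
SLOPE = 0.8104836       # regression slope quoted above
MIN_S_INDIAN = 20.0     # percent; individuals below this are skipped


def asi_from_ani(ani_percent):
    """Two-ancestry model: ASI is whatever is not ANI."""
    return 100.0 - ani_percent


def estimate_asi(s_indian_percent):
    """Estimated ASI percent, or None if below the South Indian cutoff."""
    if s_indian_percent < MIN_S_INDIAN:
        return None
    return INTERCEPT + SLOPE * s_indian_percent


# Example South Indian component values (made up).
for s_indian in (10.0, 20.0, 50.0):
    print(s_indian, estimate_asi(s_indian))
```

Note that the intercept is above zero and the slope below one, so the ASI estimate is higher than the South Indian component for low values and lower for high ones; the two are equal at roughly 13%, which is under the cutoff anyway.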

The resulting ASI percentages can be seen in a spreadsheet.

Please note that in the Group sheet, the averages are based on the samples which met the 20% South Indian component threshold. Thus, the 20% ASI in the Romanians is the average of the two Romanians who met the threshold out of a total of 16 Romanian samples.

The individual results are available in the Individual sheet. These results are a little different from the estimates using reference 3. Thus, I would point out that these should be taken only as a rough estimate.

HarappaWorld Tweaks

First of all, I wanted to draw your attention to the fact that I am using weighted means for population averages for HarappaWorld instead of just averaging all samples' results. The weighting gives less importance to outliers. I find this to be a better solution than a simple average or median. A median removes all outliers but it also rejects a lot of information.
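
The exact weighting is not spelled out here, so the following is only a sketch of one possible scheme, not necessarily the one used for HarappaWorld. It assumes weights that shrink with distance from the median, scaled by the median absolute deviation (MAD). The function name and the sample values are hypothetical.

```python
from statistics import median


def robust_weighted_mean(values):
    """Weighted mean that down-weights values far from the median.

    Assumed scheme (not taken from the post):
    weight = 1 / (1 + (d / scale)^2), where d is the distance from the
    median and scale is the median absolute deviation (MAD).
    """
    center = median(values)
    deviations = [abs(v - center) for v in values]
    scale = median(deviations)
    if scale == 0:
        # MAD is zero when at least half the values sit on the median,
        # so fall back to the median itself.
        return center
    weights = [1.0 / (1.0 + (d / scale) ** 2) for d in deviations]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Made-up percentages: most samples near 3%, two much higher.
samples = [2.5, 3.0, 3.2, 2.8, 3.1, 2.9, 20.0, 25.0]
plain_mean = sum(samples) / len(samples)
print(plain_mean, robust_weighted_mean(samples))
```

With these toy numbers the plain mean is pulled up to about 7.8%, while the weighted version stays near 3%. Unlike a median, every sample still contributes something to the result.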

An example of the weighted mean effect can be seen in the Behar et al Armenian samples. Four of the samples have higher NE European percentages than the rest. As you can see in the table below, the weighting makes their impact on the population results low.

               Mean                   Weighted Mean
Ethnicity      armenian   armenian    armenian   armenian
Dataset        behar      yunusbayev  behar      yunusbayev
N              19         16          19         16
S Indian       0.37%      0.52%       0.41%      0.52%
Baloch         16.57%     17.73%      17.07%     17.65%
Caucasian      54.35%     56.43%      57.29%     56.61%
NE Euro        8.96%      2.98%       5.35%      2.95%
SE Asian       0.10%      0.12%       0.10%      0.13%
Siberian       0.49%      0.09%       0.29%      0.09%
NE Asian       0.14%      0.08%       0.16%      0.09%
Papuan         0.28%      0.27%       0.26%      0.27%
American       0.19%      0.18%       0.22%      0.18%
Beringian      0.26%      0.19%       0.23%      0.20%
Mediterranean  8.46%      8.37%       8.21%      8.40%
SW Asian       9.81%      13.03%      10.40%     12.91%
San            0.00%      0.00%       0.00%      0.00%
E African      0.02%      0.00%       0.01%      0.00%
Pygmy          0.00%      0.00%       0.00%      0.00%
W African      0.00%      0.00%       0.00%      0.00%

Another example is the Somali samples in Reich et al data. There is one sample (out of 6) who seems to be eastern Bantu. Let's compare the unweighted mean and weighted mean for Somalis in Reich et al and Harappa participants.

               Mean                Weighted Mean
Ethnicity      somali    somali    somali    somali
Dataset        harappa   reich     harappa   reich
N              2         6         2         6
S Indian       0.00%     1.62%     0.00%     1.49%
Baloch         0.00%     0.00%     0.00%     0.00%
Caucasian      2.76%     0.00%     2.76%     0.00%
NE Euro        0.00%     0.11%     0.00%     0.04%
SE Asian       0.27%     0.05%     0.27%     0.06%
Siberian       0.00%     0.04%     0.00%     0.05%
NE Asian       0.00%     0.41%     0.00%     0.46%
Papuan         0.26%     0.10%     0.26%     0.11%
American       0.14%     0.17%     0.14%     0.19%
Beringian      0.23%     0.33%     0.23%     0.38%
Mediterranean  2.12%     3.25%     2.12%     3.65%
SW Asian       31.73%    24.48%    31.73%    27.33%
San            1.96%     1.48%     1.96%     1.37%
E African      60.37%    56.75%    60.37%    60.13%
Pygmy          0.15%     1.78%     0.15%     1.23%
W African      0.00%     9.43%     0.00%     3.51%

Also, I have divided Singapore Indians into 4 groups (actually 3 groups and 1 outlier) since they are so heterogeneous. Here are the weighted mean admixture proportions for all Singapore Indians and the four subgroups.

Ethnicity singapore-indian singapore-indian-1 singapore-indian-2 singapore-indian-3 singapore-indian-4
Dataset sgvp sgvp sgvp sgvp sgvp
N 83 31 41 10 1
S Indian 53.57% 61.95% 50.39% 33.68% 27.81%
Baloch 33.97% 30.24% 36.00% 40.72% 14.27%
Caucasian 3.55% 1.92% 4.03% 9.32% 4.53%
NE Euro 2.93% 0.08% 3.89% 9.84% 35.38%
SE Asian 1.31% 1.30% 1.23% 0.63% 1.20%
Siberian 0.45% 0.47% 0.44% 0.43% 1.19%
NE Asian 0.92% 0.91% 0.80% 1.19% 3.26%
Papuan 0.72% 1.09% 0.50% 0.35% 0.62%
American 0.42% 0.35% 0.44% 0.69% 1.29%
Beringian 0.56% 0.38% 0.65% 0.76% 0.00%
Mediterranean 0.67% 0.40% 0.72% 1.33% 10.38%
SW Asian 0.90% 0.86% 0.87% 1.05% 0.06%
San 0.01% 0.00% 0.01% 0.00% 0.00%
E African 0.03% 0.02% 0.04% 0.00% 0.00%
Pygmy 0.00% 0.00% 0.00% 0.00% 0.00%
W African 0.01% 0.01% 0.00% 0.00% 0.00%

I have updated the spreadsheet as well as HarappaWorld Oracle.

HarappaWorld Admixture

Here is a new admixture calculator. This uses populations from all over the world, and I got the best results (i.e., lowest cross-validation error) at K=16.
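
For anyone repeating this kind of comparison, here is a small sketch of picking K from cross-validation errors. It assumes log lines of the form "CV error (K=16): 0.50098", as ADMIXTURE reports when run with its cross-validation option. The error values below are made up purely to show the shape of the input.

```python
import re

CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def best_k(log_text):
    """Return (K, error) for the lowest cross-validation error in the text."""
    errors = {int(k): float(err) for k, err in CV_LINE.findall(log_text)}
    if not errors:
        raise ValueError("no CV error lines found")
    k = min(errors, key=errors.get)
    return k, errors[k]


# Made-up values, not the actual errors from this run.
example_log = """\
CV error (K=15): 0.50120
CV error (K=16): 0.50098
CV error (K=17): 0.50111
"""
print(best_k(example_log))  # (16, 0.50098)
```

The differences between neighbouring values of K are often small, so the lowest error is a guide rather than proof that a particular K is right.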

You can see the admixture results for different ethnic groups as well as results for individual (founder-only) project participants.

UPDATE: The population results have been calculated using weighted means.

The group results are also shown in the usual interactive bar chart below. You can click on the component labels to sort by that ancestral component.

Do note that the admixture components do not necessarily represent real ancestral populations. Also, the names I have chosen for the components should be thought of as mnemonics to ease discussion. I chose them based on which populations in my data these components peaked in. They do not tell anything directly about ancestral populations. The best way to look at these admixture results is by comparing individuals and populations.

I used 188,173 SNPs for this run. However, the results reported above for the Henn2011 (181,223 SNPs for Hadza, Sandawe and San, 26,494 SNPs for other groups), Henn2012 (26,494 SNPs), Reich (48,967 SNPs) and Xing (18,986 SNPs) datasets were calculated using a lower number of common SNPs. Hence, caution should be exercised in interpreting those results.

You can also see the Fst distances between the ancestral components.

I should have HarappaWorldOracle and DIYHarappaWorld calculators out in the next few days.

Also, I am working on another calculator which will focus more closely on South Asia.

Pan-Asian Admixture Results

I ran Reference 3 based supervised ADMIXTURE on the HUGO Pan-Asian dataset. While it used only 5,400 SNPs, it did get me curious about any relationship between the Onge and the Jehai and Kensiu. Unfortunately, the Pan-Asian data doesn't have a good overlap even with Reich et al. So as a first exercise, I decided to run unsupervised ADMIXTURE on the Pan-Asian dataset by itself.

Here are the bar charts for the admixture results. K=12 ancestral components had the lowest cross-validation error.

You can see these results in a spreadsheet too.

Pan-Asian Ref3 K=11 Admixture

The HUGO Pan-Asian dataset covers South and East Asia with the following South Asian populations:

  • 23 Andhra Pradesh & Karnataka
  • 10 Bengali
  • 23 Bhil (Rajasthan)
  • 20 Haryana
  • 23 Kashmir Spiti
  • 12 Marathi
  • 12 Rajasthani
  • 30 Singapore Indian
  • 20 Uttaranchal
  • 13 Uttar Pradesh

Unfortunately, they do not specify ethnic or caste background for most Indian groups. Instead, their focus is on Mongoloid/Caucasoid/Australoid etc.

Also, the SNP overlap with other datasets is really small. Therefore, this reference 3 admixture run was done using only 5,400 SNPs. I recommend a big bucket of salt when interpreting these results.

Here is the spreadsheet with the Pan-Asian group averages for reference 3 admixture at K=11 ancestral components.