Here is a new admixture calculator. It uses populations from all over the world, and I got the best results (i.e., the lowest cross-validation error) at K=16.
You can see the admixture results for different ethnic groups as well as results for individual (founder-only) project participants.
UPDATE: The population results have been calculated using weighted means.
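A minimal sketch of what a weighted mean of admixture proportions can look like. The post does not say which weights were used, so the weights here (sample sizes of sub-groups) and the function name `weighted_mean_admixture` are assumptions for illustration only.

```python
def weighted_mean_admixture(samples):
    """Weighted mean of admixture proportions.

    samples: list of (weight, proportions) pairs, where proportions is a
    list of K component fractions. The choice of weight is an assumption.
    """
    total_weight = sum(w for w, _ in samples)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    k = len(samples[0][1])
    return [
        sum(w * props[i] for w, props in samples) / total_weight
        for i in range(k)
    ]


# Hypothetical example: two sub-groups of one population at K=3,
# weighted by their sample sizes (10 and 30 individuals).
groups = [
    (10, [0.50, 0.30, 0.20]),
    (30, [0.40, 0.40, 0.20]),
]
print(weighted_mean_admixture(groups))
```

With these made-up numbers the larger sub-group pulls the average toward its own proportions, which is the point of weighting instead of taking a plain mean of group means.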
The group results are also shown in the usual interactive bar chart below. You can click on the component labels to sort by that ancestral component.
Do note that the admixture components do not necessarily represent real ancestral populations. Also, the names I have chosen for the components should be thought of as mnemonics to ease discussion. I chose them based on which populations in my data these components peaked in. They do not tell us anything directly about ancestral populations. The best way to look at these admixture results is by comparing individuals and populations.
I used 188,173 SNPs for this run. However, the results reported above for the Henn2011 (181,223 SNPs for Hadza, Sandawe and San; 26,494 SNPs for other groups), Henn2012 (26,494 SNPs), Reich (48,967 SNPs) and Xing (18,986 SNPs) datasets were calculated using a smaller number of common SNPs. Hence caution should be exercised in interpreting those results.
You can also see the Fst distances between the ancestral components.
I should have HarappaWorldOracle and DIYHarappaWorld calculators out in the next few days.
Also, I am working on another calculator which will focus more closely on South Asia.
I ran Reference 3 based supervised ADMIXTURE on the HUGO Pan-Asian dataset. While it used only 5,400 SNPs, it did get me curious about any relationship between the Onge and the Jehai and Kensiu. Unfortunately, the Pan-Asian data doesn't have a good overlap even with Reich et al. So as a first exercise, I decided to run unsupervised ADMIXTURE on the Pan-Asian dataset by itself.
Here are the bar charts for the admixture results. K=12 ancestral components had the lowest cross-validation error.
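Picking K by lowest cross-validation error can be sketched as below. This assumes ADMIXTURE was run with its `--cv` option for each K and that the log lines of the form `CV error (K=12): 0.52040` were collected. The error values in the example are invented, and `best_k` is a hypothetical helper name.

```python
import re

# ADMIXTURE run with --cv prints one line like "CV error (K=12): 0.52040".
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def best_k(log_lines):
    """Return (K with the lowest CV error, dict of K -> error)."""
    errors = {}
    for line in log_lines:
        match = CV_LINE.search(line)
        if match:
            errors[int(match.group(1))] = float(match.group(2))
    if not errors:
        raise ValueError("no CV error lines found")
    return min(errors, key=errors.get), errors


# Invented values, for illustration only.
log = [
    "CV error (K=11): 0.52100",
    "CV error (K=12): 0.52040",
    "CV error (K=13): 0.52090",
]
k, errors = best_k(log)
print(k)  # prints 12
```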
You can see these results in a spreadsheet too.
I have posted the reference 3 K=11 admixture results for all populations and datasets. Here are the relevant links:
So let's try a dendrogram of all these populations' average admixture results. Instead of using regular Euclidean distance, I used some weighting based on Fst distances between admixture components, very similar to what Palisto did.
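The exact weighting is not spelled out here, so the following is only one plausible way to fold Fst into the distance, not necessarily the one used for the dendrograms. It takes the dissimilarity between two admixture vectors p and q to be -0.5 * (p - q)^T F (p - q), where F is the matrix of Fst distances between components. The function name and the Fst values are made up for illustration.

```python
def fst_weighted_dissimilarity(p, q, fst):
    """Dissimilarity between admixture vectors p and q, weighted by Fst.

    fst is a symmetric K x K matrix of Fst distances between components
    with zeros on the diagonal. This particular formula is an assumption.
    """
    delta = [a - b for a, b in zip(p, q)]
    k = len(delta)
    quad = sum(
        delta[i] * delta[j] * fst[i][j] for i in range(k) for j in range(k)
    )
    # Guard against tiny negative values from rounding.
    return max(0.0, -0.5 * quad)


# Hypothetical Fst matrix for K=3 components.
FST = [
    [0.00, 0.10, 0.20],
    [0.10, 0.00, 0.15],
    [0.20, 0.15, 0.00],
]

pure_a = [1.0, 0.0, 0.0]
pure_b = [0.0, 1.0, 0.0]
print(fst_weighted_dissimilarity(pure_a, pure_b, FST))
```

A useful property of this choice is that two populations that are each 100% one component end up exactly as far apart as the Fst between those components, while plain Euclidean distance would treat every pair of components as equally distant.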
Here's a dendrogram of all datasets using complete linkage.
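For readers unfamiliar with complete linkage, here is a small pure-Python sketch of the idea: at each step, merge the two clusters whose most distant members are closest. It works on any matrix of pairwise distances; the labels and distances below are invented, and a real run would normally use a statistics package instead of this toy loop.

```python
def complete_linkage(labels, dist):
    """Agglomerative clustering with complete linkage.

    labels: list of names; dist: symmetric matrix (list of lists).
    Returns the merge history as (members_a, members_b, height) tuples.
    """
    clusters = {i: (i,) for i in range(len(labels))}  # id -> member indices
    merges = []
    next_id = len(labels)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                # Complete linkage: cluster distance is the largest
                # distance between any member of a and any member of b.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((
            tuple(labels[i] for i in clusters[a]),
            tuple(labels[i] for i in clusters[b]),
            d,
        ))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


# Invented distances between three hypothetical populations.
names = ["A", "B", "C"]
distances = [
    [0.0, 0.1, 0.5],
    [0.1, 0.0, 0.4],
    [0.5, 0.4, 0.0],
]
for step in complete_linkage(names, distances):
    print(step)
```

The heights in the merge history are what a dendrogram draws on its vertical axis.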
Since the Pan-Asian dataset had only 5,400 SNPs in common with reference 3, we need to be careful interpreting the tree above. Just to make sure, here's the dendrogram excluding Pan-Asian populations.
There have been two Henn et al papers since I started this project.
- Hunter-gatherer genomic diversity suggests a southern African origin for modern humans by Brenna M. Henn, Christopher R. Gignoux, Matthew Jobin, Julie M. Granka, J. M. Macpherson, Jeffrey M. Kidd, Laura Rodríguez-Botigué, Sohini Ramachandran, Lawrence Hon, Abra Brisbin, Alice A. Lin, Peter A. Underhill, David Comas, Kenneth K. Kidd, Paul J. Norman, Peter Parham, Carlos D. Bustamante, Joanna L. Mountain, and Marcus W. Feldman
- Genomic Ancestry of North Africans Supports Back-to-Africa Migrations by Brenna M. Henn, Laura R. Botigué, Simon Gravel, Wei Wang, Abra Brisbin, Jake K. Byrnes, Karima Fadhlaoui-Zid, Pierre A. Zalloua, Andres Moreno-Estrada, Jaume Bertranpetit, Carlos D. Bustamante, David Comas
The data for both is available online:
I ran reference 3 K=11 admixture on these datasets using about 48,000 SNPs.
Here is the spreadsheet with the Henn group averages for reference 3 admixture at K=11 ancestral components.
Note that the Sandawe, Hadza and San from Henn2011 were already included in Reference 3 and are not listed here.
The HUGO Pan-Asian dataset covers South and East Asia with the following South Asian populations:
- 23 Andhra Pradesh & Karnataka
- 10 Bengali
- 23 Bhil (Rajasthan)
- 20 Haryana
- 23 Kashmir Spiti
- 12 Marathi
- 12 Rajasthani
- 30 Singapore Indian
- 20 Uttaranchal
- 13 Uttar Pradesh
Unfortunately, they do not specify ethnic or caste background for most Indian groups. Instead, their focus is on Mongoloid/Caucasoid/Australoid etc.
Also, the SNP overlap with other datasets is really small. Therefore, this reference 3 admixture run was done using only 5,400 SNPs. I recommend a big bucket of salt when interpreting these results.
Here is the spreadsheet with the Pan-Asian group averages for reference 3 admixture at K=11 ancestral components.
Recently, I discovered that the paper Genetic Evidence for High-Altitude Adaptation in Tibet by Tatum S. Simonson, Yingzhong Yang, Chad D. Huff, Haixia Yun, Ga Qin, David J. Witherspoon, Zhenzhong Bai, Felipe R. Lorenzo, Jinchuan Xing, Lynn B. Jorde, Josef T. Prchal, RiLi Ge has its genotyping data online.
It contains 31 Tibetans from Maduo county in Qinghai province. The chip is Affymetrix and there are 868,146 SNPs, which means it has a good overlap with Reich et al and Xing et al and also with my reference 3.
I ran reference 3 K=11 admixture on this dataset. Here are the individual results:
The average is as follows:
The Xing et al dataset is interesting because it has a number of South Asian populations:
- 25 Andhra Pradesh Brahmin
- 10 Andhra Pradesh Madiga
- 11 Andhra Pradesh Mala
- 22 Irula
- 25 Nepalese
- 25 Punjabi Arain
- 14 Tamil Nadu Brahmin
- 12 Tamil Nadu Dalit
Unfortunately, the dataset does not have a lot of SNPs in common with 23andMe, FTDNA and the other data I am using.
However, I did run a reference 3 admixture on Xing data using about 30,000 SNPs. Since this is a lot fewer than the usual 118,000 SNPs, the noise levels are much higher.
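Counting the SNPs shared between datasets can be sketched as follows, assuming each dataset is in PLINK format with a `.bim` file whose second whitespace-separated column is the SNP ID. The helper names and the sample lines are hypothetical. Matching on ID alone ignores strand and allele mismatches, which a real merge would also need to check.

```python
def snp_ids(bim_lines):
    """Collect SNP IDs (second column) from the lines of a PLINK .bim file."""
    ids = set()
    for line in bim_lines:
        fields = line.split()
        if len(fields) >= 2:
            ids.add(fields[1])
    return ids


def common_snps(*datasets):
    """SNP IDs present in every dataset (each given as .bim lines)."""
    sets = [snp_ids(d) for d in datasets]
    return set.intersection(*sets) if sets else set()


# Made-up .bim lines: chromosome, SNP ID, cM, base-pair position, alleles.
dataset_a = [
    "1 rs0000001 0 10000 A G",
    "1 rs0000002 0 20000 C T",
    "2 rs0000003 0 30000 A C",
]
dataset_b = [
    "1 rs0000002 0 20000 C T",
    "2 rs0000003 0 30000 A C",
    "3 rs0000004 0 40000 G T",
]
shared = common_snps(dataset_a, dataset_b)
print(len(shared), sorted(shared))
```

With real files, the lines would come from something like `open("xing.bim")`, where the file name is again only a placeholder.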
Here is the spreadsheet with the Xing group averages for reference 3 admixture at K=11 ancestral components.
Dr. Mahley was nice enough to share his Turkish and Kyrgyz dataset from the paper Turkish Population Structure and Genetic Ancestry Reveal Relatedness among Eurasian Populations by Uğur Hodoğlugil and Robert W. Mahley.
- 16 Kyrgyz from Bishkek
- 20 Turks from Aydin
- 20 Turks from Istanbul
- 23 Turks from Kayseri
Here are the group averages for the reference 3 K=11 admixture analysis.
And here are the individual results.
Recently, there was a paper Identification of Close Relatives in the HUGO Pan-Asian SNP Database by Xiong Yang, Shuhua Xu, and the HUGO Pan-Asian SNP Consortium.
"three individuals involved in MZ pairs were excluded from the whole dataset to construct standardized subset PASNP1716; seventy-six individuals involved in first-degree relationships were excluded from PASNP1716 to construct standardized subset PASNP1640; and 57 individuals involved in second-degree relationships were excluded from PASNP1640 to construct standardized subset PASNP1583. The individuals excluded were summarized in Table S6, S7, S8."
Let me engage in some blog triumphalism by saying I wrote about the duplicates and relatives in the Pan-Asian dataset in April 2011.
Here are my blog posts about relatedness in datasets:
Early on, I was removing only first-degree relatives from the reference datasets. Nowadays, I try to remove all second-degree relatives too. I leave the third-degree relatives in the data since it's sometimes hard to figure out how real the low IBD values are in PLINK. There are a lot of third-degree relatives if PLINK is to be believed, but I am a little skeptical.
Since PLINK's IBD analysis requires homogeneous samples, I am now using KING (paper) for the purpose. I am also looking at kcoeff (paper).
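As a rough sketch of the pruning step, suppose the pairwise kinship coefficients from KING have already been loaded as `(id1, id2, kinship)` tuples. KING's documentation puts second-degree relatives at a kinship of about 0.0884 to 0.177 and first-degree relatives at 0.177 to 0.354, so removing everything at or above 0.0884 drops first- and second-degree pairs while leaving third-degree ones. The greedy strategy and the name `prune_relatives` are assumptions for illustration, not a description of the actual pipeline.

```python
SECOND_DEGREE_MIN = 0.0884  # KING's lower kinship bound for 2nd-degree pairs


def prune_relatives(pairs, threshold=SECOND_DEGREE_MIN):
    """Greedily pick individuals to remove so no related pair remains.

    pairs: iterable of (id1, id2, kinship). Returns the set of removed IDs.
    """
    related = [(a, b) for a, b, k in pairs if k >= threshold]
    removed = set()
    while related:
        counts = {}
        for a, b in related:
            counts[a] = counts.get(a, 0) + 1
            counts[b] = counts.get(b, 0) + 1
        # Drop whoever appears in the most related pairs; ties go to the
        # alphabetically first ID so the result is deterministic.
        worst = max(sorted(counts), key=counts.get)
        removed.add(worst)
        related = [(a, b) for a, b in related if worst not in (a, b)]
    return removed


# Hypothetical sample IDs and kinship values.
pairs = [
    ("S1", "S2", 0.25),  # first-degree range
    ("S2", "S3", 0.12),  # second-degree range
    ("S4", "S5", 0.05),  # third-degree range, kept
]
print(prune_relatives(pairs))  # prints {'S2'}
```

Removing the most-connected individual first tends to keep more samples than dropping one member of every related pair independently.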