Tag Archives: reference

HarappaWorld Admixture

Posted by Zack on May 4, 2012 18 comments

Here is a new admixture calculator. It uses populations from all over the world, and I got the best results (i.e., lowest cross-validation error) at K=16.

You can see the admixture results for different ethnic groups as well as results for individual (founder-only) project participants.

UPDATE: The population results have been calculated using weighted means.
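For readers unfamiliar with the term, a weighted mean of admixture proportions can be sketched as below. The post does not say which weights were used, so the `weights` argument, the function name and the example numbers are all hypothetical; this is only an illustration of the arithmetic.

```python
def weighted_mean_admixture(individuals, weights):
    """Return the weighted mean of per-individual admixture proportions.

    individuals: list of rows, one row of component fractions per person.
    weights: one non-negative weight per person (hypothetical; the post
             does not say which weights were used).
    """
    if len(individuals) != len(weights):
        raise ValueError("need exactly one weight per individual")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    n_components = len(individuals[0])
    return [
        sum(w * row[k] for row, w in zip(individuals, weights)) / total
        for k in range(n_components)
    ]


# Made-up example: three individuals, two components.
group = [[0.80, 0.20], [0.60, 0.40], [0.50, 0.50]]
weights = [2.0, 1.0, 1.0]
means = weighted_mean_admixture(group, weights)
```

Because every individual's proportions sum to 1, the weighted group means also sum to 1.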

The group results are also shown in the usual interactive bar chart below. You can click on the component labels to sort by that ancestral component.

Do note that the admixture components do not necessarily represent real ancestral populations. Also, the names I have chosen for the components should be thought of as mnemonics to ease discussion. I chose them based on which populations in my data these components peaked in. They do not tell anything directly about ancestral populations. The best way to look at these admixture results is by comparing individuals and populations.

I used 188,173 SNPs for this run. However, the results reported above for the Henn2011 (181,223 SNPs for Hadza, Sandawe and San, 26,494 SNPs for other groups), Henn2012 (26,494 SNPs), Reich (48,967 SNPs) and Xing (18,986 SNPs) datasets were calculated using a lower number of common SNPs. Hence caution should be exercised in interpreting those results.

You can also see the Fst distances between the ancestral components.

I should have HarappaWorldOracle and DIYHarappaWorld calculators out in the next few days.

Also, I am working on another calculator which will focus more closely on South Asia.

Pan-Asian Admixture Results

Posted by Zack on April 3, 2012 1 comment

I ran Reference 3-based supervised ADMIXTURE on the HUGO Pan-Asian dataset. While it used only 5,400 SNPs, it did make me curious about any relationship between the Onge and the Jehai and Kensiu. Unfortunately, the Pan-Asian data doesn't have a good overlap even with Reich et al. So as a first exercise, I decided to run unsupervised ADMIXTURE on the Pan-Asian dataset by itself.

Here are the bar charts for the admixture results. K=12 ancestral components had the lowest cross-validation error.
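Picking K by lowest cross-validation error can be sketched as follows. This assumes ADMIXTURE was run with its `--cv` option for each K and that the logs contain lines of the form `CV error (K=12): 0.56088`. The error values in the sample log are made up for illustration and are not the values from this run.

```python
import re

# Matches ADMIXTURE's cross-validation summary line, e.g. "CV error (K=12): 0.56088".
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def best_k(log_text):
    """Return (K with the lowest CV error, dict mapping K to its CV error)."""
    errors = {int(k): float(err) for k, err in CV_LINE.findall(log_text)}
    if not errors:
        raise ValueError("no CV error lines found")
    return min(errors, key=errors.get), errors


# Made-up log excerpt, as if the logs for several K had been concatenated.
sample_log = """\
CV error (K=10): 0.56210
CV error (K=11): 0.56105
CV error (K=12): 0.56088
CV error (K=13): 0.56131
"""
k, errors = best_k(sample_log)
```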

You can see these results in a spreadsheet too.

Ref3 Admixture Dendrograms

Posted by Zack on March 19, 2012 5 comments

I have posted the reference 3 K=11 admixture results for all populations and datasets. Here are the relevant links:

So let's try a dendrogram of all these populations' average admixture results. Instead of using regular Euclidean distance, I used some weighting based on Fst distances between admixture components, very similar to what Palisto did.

Here's a dendrogram of all datasets using complete linkage.
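For anyone who wants to try something similar, here is a minimal sketch of complete-linkage clustering on average admixture profiles. It uses plain Euclidean distance and does not reproduce the Fst-based weighting described above; the `distance` parameter is where a weighted distance could be plugged in. The population names and profiles are made up.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two admixture profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage(profiles, distance=euclidean):
    """Agglomerative clustering; returns merges as (members_a, members_b, height)."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: the distance between two clusters is the
                # largest distance between any pair of their members.
                d = max(
                    distance(profiles[a], profiles[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


# Made-up average admixture profiles (three components each).
profiles = {
    "PopA": [0.70, 0.20, 0.10],
    "PopB": [0.65, 0.25, 0.10],
    "PopC": [0.10, 0.10, 0.80],
    "PopD": [0.20, 0.05, 0.75],
}
merges = complete_linkage(profiles)
```

The list of merges, read in order, is the dendrogram: each entry joins two clusters at the given height.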

Since the Pan-Asian dataset had only 5,400 SNPs common with reference 3, we need to be careful interpreting the tree above. Just to make sure, here's the dendrogram excluding Pan-Asian populations.

Henn Ref3 K=11 Admixture

Posted by Zack on March 16, 2012 5 comments

There have been two Henn et al papers since I started this project.

Hunter-gatherer genomic diversity suggests a southern African origin for modern humans by Brenna M. Henn, Christopher R. Gignoux, Matthew Jobin, Julie M. Granka, J. M. Macpherson, Jeffrey M. Kidd, Laura Rodríguez-Botigué, Sohini Ramachandran, Lawrence Hon, Abra Brisbin, Alice A. Lin, Peter A. Underhill, David Comas, Kenneth K. Kidd, Paul J. Norman, Peter Parham, Carlos D. Bustamante, Joanna L. Mountain, and Marcus W. Feldman
Genomic Ancestry of North Africans Supports Back-to-Africa Migrations by Brenna M. Henn, Laura R. Botigué, Simon Gravel, Wei Wang, Abra Brisbin, Jake K. Byrnes, Karima Fadhlaoui-Zid, Pierre A. Zalloua, Andres Moreno-Estrada, Jaume Bertranpetit, Carlos D. Bustamante, David Comas

The data for both is available online:

I ran reference 3 K=11 admixture on these datasets using about 48,000 SNPs.

Here is the spreadsheet with the Henn group averages for reference 3 admixture at K=11 ancestral components.

Note that the Sandawe, Hadza and San from Henn2011 were already included in Reference 3 and are not listed here.

Pan-Asian Ref3 K=11 Admixture

Posted by Zack on March 13, 2012 9 comments

The HUGO Pan-Asian dataset covers South and East Asia with the following South Asian populations:

23 Andhra Pradesh & Karnataka
10 Bengali
23 Bhil (Rajasthan)
20 Haryana
23 Kashmir Spiti
12 Marathi
12 Rajasthani
30 Singapore Indian
20 Uttaranchal
13 Uttar Pradesh

Unfortunately, they do not specify ethnic or caste background for most Indian groups. Instead, their focus is on Mongoloid/Caucasoid/Australoid etc.

Also, the SNP overlap with other datasets is really small. Therefore, this reference 3 admixture run was done using only 5,400 SNPs. I recommend a big bucket of salt when interpreting these results.

Here is the spreadsheet with the Pan-Asian group averages for reference 3 admixture at K=11 ancestral components.

Xing Ref3 Admixture South Asians

Posted by Zack on March 10, 2012 44 comments

As per AV's comment, here are the individual results for Xing et al South Asians.

Simonson Tibet Dataset

Posted by Zack on March 7, 2012 6 comments

Recently, I discovered that the paper Genetic Evidence for High-Altitude Adaptation in Tibet by Tatum S. Simonson, Yingzhong Yang, Chad D. Huff, Haixia Yun, Ga Qin, David J. Witherspoon, Zhenzhong Bai, Felipe R. Lorenzo, Jinchuan Xing, Lynn B. Jorde, Josef T. Prchal, RiLi Ge has its genotyping data online.

It contains 31 Tibetans from Madou county in Qinghai province. The chip is Affymetrix and there are 868,146 SNPs, which means it has a good overlap with Reich et al and Xing et al and also with my reference 3.

I ran reference 3 K=11 admixture on this dataset. Here are the individual results:

The average is as follows:

S Asian	E Asian	Siberian
1%	84%	14%

Xing Ref3 K=11 Admixture

Posted by Zack on February 27, 2012 11 comments

The Xing et al dataset is interesting because it has a number of South Asian populations:

25 Andhra Pradesh Brahmin
10 Andhra Pradesh Madiga
11 Andhra Pradesh Mala
22 Irula
25 Nepalese
25 Punjabi Arain
14 Tamil Nadu Brahmin
12 Tamil Nadu Dalit

Unfortunately, the dataset does not have a lot of common SNPs with 23andMe, FTDNA and the other data I am using.

However, I did run a reference 3 admixture on the Xing data using about 30,000 SNPs. Since this is far fewer than the usual 118,000 SNPs, the noise levels are much larger.

Here is the spreadsheet with the Xing group averages for reference 3 admixture at K=11 ancestral components.

Hodoglugil Dataset

Posted by Zack on February 24, 2012 9 comments

Dr. Mahley was nice enough to share his Turkish and Kyrgyz dataset from the paper Turkish Population Structure and Genetic Ancestry Reveal Relatedness among Eurasian Populations by Uğur Hodoğlugil and Robert W. Mahley.

It has:

16 Kyrgyz from Bishkek
20 Turks from Aydin
20 Turks from Istanbul
23 Turks from Kayseri

Here are the group averages for the reference 3 K=11 admixture analysis.

And here are the individual results.

Relatives in Datasets

Posted by Zack on February 6, 2012 Comments Off

Recently, there was a paper Identification of Close Relatives in the HUGO Pan-Asian SNP Database by Xiong Yang, Shuhua Xu, and the HUGO Pan-Asian SNP Consortium.

three individuals involved in MZ pairs were excluded from the whole dataset to construct standardized subset PASNP1716; seventy-six individuals involved in first-degree relationships were excluded from PASNP1716 to construct standardized subset PASNP1640; and 57 individuals involved in second-degree relationships were excluded from PASNP1640 to construct standardized subset PASNP1583. The individuals excluded were summarized in Table S6, S7, S8.

Let me engage in some blog triumphalism by saying I wrote about the duplicates and relatives in the Pan-Asian dataset in April 2011.

Here are my blog posts about relatedness in datasets:

Early on, I was removing only first-degree relatives from the reference datasets. Nowadays, I try to remove all second-degree relatives too. I leave the third-degree relatives in the data since it's sometimes hard to figure out how real the low IBD values are in Plink. There are a lot of third-degree relatives if Plink is to be believed, but I am a little skeptical.

Since Plink's IBD analysis requires homogeneous samples, I am now using KING (paper) for the purpose. I am also looking at kcoeff (paper).
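As a rough sketch of how such filtering could work with KING-style output, the code below classifies pairs using the kinship-coefficient ranges given in the KING documentation (above 0.354 for duplicates or MZ twins, 0.177 to 0.354 for first degree, 0.0884 to 0.177 for second degree, 0.0442 to 0.0884 for third degree). It then drops individuals until no pair of second degree or closer remains. The greedy removal strategy, the function names and the sample IDs are assumptions for illustration, not a description of the exact procedure used here.

```python
def degree(kinship):
    """Map a KING kinship coefficient to a relationship degree (0 = duplicate/MZ)."""
    if kinship > 0.354:
        return 0
    if kinship > 0.177:
        return 1
    if kinship > 0.0884:
        return 2
    if kinship > 0.0442:
        return 3
    return None  # treated as unrelated


def individuals_to_drop(pairs, max_degree=2):
    """Greedy heuristic: repeatedly drop whoever is in the most remaining close pairs."""
    close = [
        (a, b)
        for a, b, kinship in pairs
        if degree(kinship) is not None and degree(kinship) <= max_degree
    ]
    dropped = set()
    while close:
        counts = {}
        for a, b in close:
            counts[a] = counts.get(a, 0) + 1
            counts[b] = counts.get(b, 0) + 1
        # Sorting first makes ties resolve alphabetically, so the result is deterministic.
        worst = max(sorted(counts), key=counts.get)
        dropped.add(worst)
        close = [(a, b) for a, b in close if worst not in (a, b)]
    return dropped


# Made-up pairwise kinship estimates: (sample 1, sample 2, kinship coefficient).
pairs = [
    ("S1", "S2", 0.25),  # first degree
    ("S1", "S3", 0.12),  # second degree
    ("S4", "S5", 0.06),  # third degree, left in the data
    ("S6", "S7", 0.49),  # duplicate or MZ twin
]
drop = individuals_to_drop(pairs)
```

Dropping the individual with the most close relatives first tends to keep more samples than removing one member of every pair independently.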

Harappa Ancestry Project

Genetics and South Asia
