Harappa Ancestry Project

ADMIXTURE Seed and Cross-Validation

Posted by Zack on April 17, 2012 9 comments

I have been running some ADMIXTURE experiments recently. This is with my world dataset and about 180,000 SNPs.

I ran ADMIXTURE at K=15 ancestral components with different random seeds. Let's take a look at the final log likelihood and cross-validation errors I got for 11 runs.

As you can see, as the log likelihood increases, the cross-validation error decreases, though there is a fair bit of variation there.

For different runs, I got fairly different ancestral components. Remember that the only difference between the different runs was the random seed used to initialize the algorithm. Some ancestral components stayed very similar across the runs but others appeared and disappeared or switched subtly between different populations in a broad region.

The cross-validation (CV) error is important in my opinion since it gives you an idea of which run has results that generalize better. Basically, it is calculated by masking a portion of the genotypes in the dataset, fitting the model to the rest, and measuring how well the masked genotypes are predicted.

At K=15, the minimum CV error I got was 0.52200 and the median was 0.52206. The maximum CV error was 0.52241, which is pretty large for this data. Let's superimpose this maximum CV value on a graph showing how CV error varies for different values of K (number of ancestral components).

The set of runs in this graph (other than the red line for the maximum CV error at K=15) used the default random seed for ADMIXTURE.

What this shows is that running ADMIXTURE only once, whether with the default random seed or any other, is fraught with problems. A better approach is to run it multiple times with different seeds and compare the final log likelihoods and CV errors, so you can be more confident that you have arrived at a solution close to the computational optimum.
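A minimal Python sketch of that multi-seed workflow follows. It is not the script used for these runs. It assumes ADMIXTURE accepts `--cv` and `--seed=N` on the command line and that each log contains a line starting with `Loglikelihood:` and a line of the form `CV error (K=15): 0.52200`; check both against the ADMIXTURE manual for your version. The seeds, the file name and the log likelihood values in the example are invented for illustration; only the two CV errors are taken from the runs described above.

```python
import re

# Summary lines assumed to appear at the start of a line in an ADMIXTURE log.
LOGLIK_RE = re.compile(r"^Loglikelihood:\s*(-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)", re.M)
CV_RE = re.compile(r"^CV error \(K=(\d+)\):\s*(\d+(?:\.\d+)?)", re.M)


def admixture_command(bed_file, k, seed):
    """Build (but do not run) a command line for one seeded run with CV."""
    return ["admixture", "--cv", f"--seed={seed}", bed_file, str(k)]


def parse_admixture_log(text):
    """Return (final log likelihood, K, CV error) from one run's log text."""
    logliks = LOGLIK_RE.findall(text)
    cv = CV_RE.search(text)
    if not logliks or cv is None:
        raise ValueError("log has no Loglikelihood or CV error line")
    # If several line-initial Loglikelihood lines exist, keep the last one.
    return float(logliks[-1]), int(cv.group(1)), float(cv.group(2))


def best_run(logs_by_seed):
    """Pick the seed whose run reached the highest final log likelihood."""
    parsed = {seed: parse_admixture_log(text) for seed, text in logs_by_seed.items()}
    best_seed = max(parsed, key=lambda seed: parsed[seed][0])
    return best_seed, parsed[best_seed]


# Hypothetical seeds and log likelihoods; CV errors are the max and min quoted above.
example_logs = {
    101: "Loglikelihood: -1000.5\nCV error (K=15): 0.52241\n",
    202: "Loglikelihood: -990.25\nCV error (K=15): 0.52200\n",
}
example_command = admixture_command("world.bed", 15, 202)
example_best = best_run(example_logs)
```

Selecting on log likelihood and then reading off the CV error of the chosen run keeps the two quantities separate, so you can still see whether they agree.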

Pan-Asian Admixture Results

Posted by Zack on April 3, 2012 1 comment

I ran Reference 3 based supervised ADMIXTURE on the HUGO Pan-Asian dataset. While it used only 5,400 SNPs, it did get me curious about any relationship between Onge and Jehai and Kensiu. Unfortunately, Pan-Asian data doesn't have a good overlap even with Reich et al. So as a first exercise, I decided to run unsupervised ADMIXTURE on the Pan-Asian dataset by itself.

Here are the bar charts for the admixture results. K=12 ancestral components had the lowest cross-validation error.

You can see these results in a spreadsheet too.

South Asian fineStructure Ref3 Admixture

Posted by Zack on March 27, 2012 1 comment

I was wondering what the admixture patterns of the clusters fineSTRUCTURE computed were for my South Asian run. So I computed the average admixture for each cluster (total: 89) using reference 3 admixture results.
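The averaging step can be sketched in Python as below. This is an illustration under assumed data structures (a dict of per-sample admixture proportions and a dict mapping each sample to its fineSTRUCTURE cluster), not the code used for the post. The sample IDs, cluster names and proportions are made up.

```python
from collections import defaultdict


def cluster_means(admixture, cluster_of):
    """Average the admixture proportions over the members of each cluster."""
    members = defaultdict(list)
    for sample_id, proportions in admixture.items():
        members[cluster_of[sample_id]].append(proportions)
    means = {}
    for cluster, rows in members.items():
        # zip(*rows) walks the rows column by column, one ancestral component at a time.
        means[cluster] = [sum(column) / len(rows) for column in zip(*rows)]
    return means


# Hypothetical samples with three components each (percentages).
toy_admixture = {
    "s1": [60.0, 30.0, 10.0],
    "s2": [40.0, 50.0, 10.0],
    "s3": [10.0, 10.0, 80.0],
}
toy_clusters = {"s1": "c1", "s2": "c1", "s3": "c2"}
toy_means = cluster_means(toy_admixture, toy_clusters)
```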

The default order of the clusters is to keep the closer clusters together.

Harappa Oracle

Posted by Zack on March 23, 2012 15 comments

Based on the Dodecad Oracle, here is Harappa Oracle using reference 3 admixture results.

I am using Dienekes' code with a couple of changes. One of them is using weighted distance based on Fst divergences between ancestral components. Because of that it is several times slower than DodecadOracle. I plan to offer an option soon to switch between Euclidean distance and Fst-weighted distance.
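To give an idea of what an Fst-weighted distance can look like, here is one possible construction sketched in Python. It is an assumption for illustration, not necessarily the formula inside Harappa Oracle: the squared distance between two admixture vectors x and y is taken as -1/2 times the quadratic form of (x - y) with the Fst matrix. When all pairwise Fst values are equal this reduces to a scaled Euclidean distance, and otherwise shifting ancestry between closely related components costs less than shifting it between divergent ones. The Fst values below are invented.

```python
import math


def fst_weighted_distance(x, y, fst):
    """Distance between two admixture vectors, weighted by component Fst.

    `fst` is a symmetric matrix with zeros on the diagonal.
    """
    delta = [a - b for a, b in zip(x, y)]
    n = len(delta)
    quad = sum(fst[i][j] * delta[i] * delta[j] for i in range(n) for j in range(n))
    # Guard against tiny negative values from rounding or a non-Euclidean Fst matrix.
    return math.sqrt(max(0.0, -0.5 * quad))


# Hypothetical Fst matrix: components 0 and 1 are close, component 2 is far from both.
toy_fst = [
    [0.00, 0.02, 0.15],
    [0.02, 0.00, 0.15],
    [0.15, 0.15, 0.00],
]
d_close = fst_weighted_distance([1, 0, 0], [0, 1, 0], toy_fst)
d_far = fst_weighted_distance([1, 0, 0], [0, 0, 1], toy_fst)
```

The double loop over components is also a plausible reason why a weighted distance runs slower than a plain Euclidean one.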

You need to install R to use it. Then unzip the Oracle zip file. Double-click on the file or use the following in R:

load('HarappaOracleR3fst.RData')

In R, you can look at the 385 populations included by typing:

X[,1]

To use it to find your closest populations, you need your Harappa Reference 3 admixture results. Use them separated by commas like this (for me):

HarappaOracle(c(44,12,0,24,14,1,2,0,0,1,2))

You will get a result, with the first column showing the closest populations and the 2nd column their distance to you.

      [,1]            [,2]
 [1,] "balochi"       "8.0242"
 [2,] "bene-israel"   "9.2843"
 [3,] "brahui"        "9.5158"
 [4,] "pathan"        "9.7034"
 [5,] "makrani"       "10.1014"
 [6,] "sindhi"        "10.9236"
 [7,] "Bhatia"        "11.8441"
 [8,] "Sindhi"        "12.1704"
 [9,] "Kashmiri"      "13.4229"
[10,] "punjabi-arain" "13.9192"

You can also find out the closest populations to one of the reference populations:

HarappaOracle("punjabi-arain")

By default, the Oracle shows the 10 closest populations. You can change that:

HarappaOracle("punjabi-arain",k=20)

Also, by default, the Oracle excludes the Pan-Asian dataset since the overlap is only 5,400 SNPs. You can include Pan-Asian populations:

HarappaOracle("punjabi-arain",panasian=T)

There is also a mixed mode where the individual (or mean reference population) is compared against all pairs of populations as ancestors.

HarappaOracle("Haryana Jatt",mixedmode=T)

which has the following output:

 [1,] "Haryana Jatt"                                 "0"
 [2,] "15.4% lithuanians + 84.6% Punjabi Brahmin"    "1.9553"
 [3,] "10.6% russian + 89.4% Rajasthani Brahmin"     "2.0626"
 [4,] "14.7% finnish + 85.3% Punjabi Brahmin"        "2.0863"
 [5,] "9.2% finnish + 90.8% Rajasthani Brahmin"      "2.1142"
 [6,] "89.4% Rajasthani Brahmin + 10.6% mordovians"  "2.1727"
 [7,] "9.6% lithuanians + 90.4% Rajasthani Brahmin"  "2.1989"
 [8,] "10.1% belorussian + 89.9% Rajasthani Brahmin" "2.2938"
 [9,] "16.8% russian + 83.2% Punjabi Brahmin"        "2.3015"
[10,] "16.2% belorussian + 83.8% Punjabi Brahmin"    "2.3656"

You can of course combine any or all of the options.
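For readers curious how a two-population mixed mode can work in principle, here is a rough Python sketch. It is not the Oracle's R code. It assumes plain Euclidean distance and, for every pair of populations, solves for the mixing fraction that brings the mixture closest to the target (a least-squares projection clipped to the 0 to 1 range). The population names and values are placeholders.

```python
from itertools import combinations


def best_mix(target, a, b):
    """Return (fraction of `a`, distance) for the closest mixture of a and b."""
    diff = [ai - bi for ai, bi in zip(a, b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        alpha = 0.5  # identical sources: every split gives the same mixture
    else:
        alpha = sum((t - bi) * d for t, bi, d in zip(target, b, diff)) / denom
        alpha = min(1.0, max(0.0, alpha))  # a mixing fraction must lie in [0, 1]
    mix = [alpha * ai + (1 - alpha) * bi for ai, bi in zip(a, b)]
    dist = sum((t - m) ** 2 for t, m in zip(target, mix)) ** 0.5
    return alpha, dist


def mixed_mode(target, populations, k=10):
    """Rank all population pairs by how closely their best mixture fits."""
    results = []
    for (name_a, a), (name_b, b) in combinations(populations.items(), 2):
        alpha, dist = best_mix(target, a, b)
        label = f"{100 * alpha:.1f}% {name_a} + {100 * (1 - alpha):.1f}% {name_b}"
        results.append((dist, label))
    return sorted(results)[:k]


# Hypothetical reference populations with three components (percentages).
toy_populations = {
    "pop_a": [100.0, 0.0, 0.0],
    "pop_b": [0.0, 100.0, 0.0],
    "pop_c": [0.0, 0.0, 100.0],
}
toy_results = mixed_mode([25.0, 75.0, 0.0], toy_populations)
```

With 385 populations there are about 74,000 pairs to score, which is why a mode like this takes noticeably longer than the single-population search.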

Think of Harappa Oracle as a tool to help you interpret your admixture results by comparing who you are closest to. Do not think of it as giving you your real ancestry.

Ref3 Admixture Dendrograms

Posted by Zack on March 19, 2012 5 comments

I have posted the reference 3 K=11 admixture results for all populations and datasets. Here are the relevant links:

So let's try a dendrogram of all these populations' average admixture results. Instead of using regular Euclidean distance, I used some weighting based on Fst distances between admixture components, very similar to what Palisto did.

Here's a dendrogram of all datasets using complete linkage.
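As a reminder of what complete linkage does, here is a small pure-Python sketch of the merging procedure. It is illustrative only. It takes any distance function between labels, so an Fst-weighted distance could be plugged in, and the four labels and their positions are toy values rather than real populations.

```python
def complete_linkage(labels, dist):
    """Agglomerate clusters bottom-up and return the merge history.

    Each history entry is (cluster_1, cluster_2, distance at which they merged).
    """
    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is the farthest member pair.
                d = max(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


# Toy one-dimensional positions standing in for admixture vectors.
toy_positions = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 7.0}
toy_merges = complete_linkage(
    list(toy_positions),
    lambda p, q: abs(toy_positions[p] - toy_positions[q]),
)
```

The merge history is what a dendrogram draws: each entry is one junction, with its height given by the merge distance.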

Since the Pan-Asian dataset had only 5,400 SNPs common with reference 3, we need to be careful interpreting the tree above. Just to make sure, here's the dendrogram excluding Pan-Asian populations.

Henn Ref3 K=11 Admixture

Posted by Zack on March 16, 2012 5 comments

There have been two Henn et al papers since I started this project.

Hunter-gatherer genomic diversity suggests a southern African origin for modern humans by Brenna M. Henn, Christopher R. Gignoux, Matthew Jobin, Julie M. Granka, J. M. Macpherson, Jeffrey M. Kidd, Laura Rodríguez-Botigué, Sohini Ramachandran, Lawrence Hon, Abra Brisbin, Alice A. Lin, Peter A. Underhill, David Comas, Kenneth K. Kidd, Paul J. Norman, Peter Parham, Carlos D. Bustamante, Joanna L. Mountain, and Marcus W. Feldman
Genomic Ancestry of North Africans Supports Back-to-Africa Migrations by Brenna M. Henn, Laura R. Botigué, Simon Gravel, Wei Wang, Abra Brisbin, Jake K. Byrnes, Karima Fadhlaoui-Zid, Pierre A. Zalloua, Andres Moreno-Estrada, Jaume Bertranpetit, Carlos D. Bustamante, David Comas

The data for both is available online:

I ran reference 3 K=11 admixture on these datasets using about 48,000 SNPs.

Here is the spreadsheet with the Henn group averages for reference 3 admixture at K=11 ancestral components.

Note that the Sandawe, Hadza and San from Henn2011 were already included in Reference 3 and are not listed here.

Pan-Asian Ref3 K=11 Admixture

Posted by Zack on March 13, 2012 9 comments

The HUGO Pan-Asian dataset covers South and East Asia with the following South Asian populations:

23 Andhra Pradesh & Karnataka
10 Bengali
23 Bhil (Rajasthan)
20 Haryana
23 Kashmir Spiti
12 Marathi
12 Rajasthani
30 Singapore Indian
20 Uttaranchal
13 Uttar Pradesh

Unfortunately, they do not specify ethnic or caste background for most Indian groups. Instead, their focus is on Mongoloid/Caucasoid/Australoid etc.

Also, the SNP overlap with other datasets is really small. Therefore, this reference 3 admixture run was done using only 5,400 SNPs. I recommend a big bucket of salt when interpreting these results.

Here is the spreadsheet with the Pan-Asian group averages for reference 3 admixture at K=11 ancestral components.

Xing Ref3 Admixture South Asians

Posted by Zack on March 10, 2012 44 comments

As per AV's comment, here are the individual results for Xing et al South Asians.

Simonson Tibet Dataset

Posted by Zack on March 7, 2012 6 comments

Recently, I discovered that the paper Genetic Evidence for High-Altitude Adaptation in Tibet by Tatum S. Simonson, Yingzhong Yang, Chad D. Huff, Haixia Yun, Ga Qin, David J. Witherspoon, Zhenzhong Bai, Felipe R. Lorenzo, Jinchuan Xing, Lynn B. Jorde, Josef T. Prchal, RiLi Ge has its genotyping data online.

It contains 31 Tibetans from Madou county in Qinghai province. The chip is Affymetrix and there are 868,146 SNPs, which means it has a good overlap with Reich et al and Xing et al and also with my reference 3.

I ran reference 3 K=11 admixture on this dataset. Here are the individual results:

The average is as follows:

S Asian	E Asian	Siberian
1%	84%	14%

Dodecad South Asian ChromoPainter

Posted by Zack on March 4, 2012 17 comments

Dienekes ran ChromoPainter/fineSTRUCTURE analysis of South Asians along with some West Eurasian populations, something I had neglected to do in my own South Asian run.

Using Dienekes' data, I was trying to figure out which South Asian populations had more DNA chunks in common with other groups when I ran into something strange. Looking at the chunkcount spreadsheet, if we focus on a recipient population (i.e., one row), we can see which populations contributed more "chunks". For most populations, the results are as expected: the top donor is either the same population or some close population. For example, let's look at the top 5 matches for Velamas_M:

	Velamas_M	Pulliyar_M	North_Kannadi	Chamar_M	Piramalai_Kallars_M
Velamas_M	1265.77	1259.38	1256.06	1255.6	1254.74

However, when we do the same for Pathans, Sindhis, Uttar Pradesh Brahmins, Kshatriyas and Muslims, we get strange results.

	Chamar_M	Velamas_M	UP_Scheduled_Caste_M	Piramalai_Kallars_M	Muslim_M
Pathan	1229.91	1229.56	1229.53	1229.32	1229.27

Do Pathans match Chamar the best? Pathans don't show up as a donor till #11.

	Chamar_M	Piramalai_Kallars_M	Pulliyar_M	Velamas_M	North_Kannadi
Sindhi	1234.09	1234.08	1233.85	1233.6	1233.55

Again, Sindhis as donors are #12.

	Pulliyar_M	Chamar_M	North_Kannadi	Kol_M	Piramalai_Kallars_M
Brahmins_UP_M	1244.6	1244.53	1243.44	1242.88	1241.94

The same Brahmins_UP_M are #13 as donors.

	Pulliyar_M	Chamar_M	North_Kannadi	Kol_M	Piramalai_Kallars_M
Kshatriya_M	1247.72	1247.36	1246.42	1244.98	1244.56

And Kshatriya_M are #12 as donors.

	Pulliyar_M	Chamar_M	North_Kannadi	Kol_M	Piramalai_Kallars_M
Muslim_M	1255.96	1255.36	1253.96	1251.74	1250.86

Muslim_M are #8 as donors.

There is a pattern here among the top donors for these populations. The same populations show up time and again.
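The row-by-row check described above can be sketched in Python as follows. It assumes one row of the chunkcount spreadsheet has been read into a dict mapping donor population to chunk count. The function names are made up for this example, and the row holds only the five Velamas_M values quoted earlier, not the full set of donors.

```python
def top_donors(row, n=5):
    """Return the n donors contributing the most chunks to one recipient."""
    return sorted(row.items(), key=lambda item: item[1], reverse=True)[:n]


def donor_rank(row, donor):
    """Return the 1-based rank of `donor` in the row, or None if it is absent."""
    ranked = [name for name, _ in top_donors(row, n=len(row))]
    return ranked.index(donor) + 1 if donor in ranked else None


# The five chunk counts quoted above for the recipient Velamas_M (partial row).
velamas_row = {
    "Chamar_M": 1255.6,
    "North_Kannadi": 1256.06,
    "Velamas_M": 1265.77,
    "Piramalai_Kallars_M": 1254.74,
    "Pulliyar_M": 1259.38,
}
velamas_top = top_donors(velamas_row, n=3)
velamas_self_rank = donor_rank(velamas_row, "Velamas_M")
```

Running `donor_rank` on the full row with the recipient's own name as the donor gives the "#11", "#12" style positions discussed above.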

Now compare this with my results (from a larger South Asian dataset). The top 10 matches for Pathans are:

pathan
punjabi-jatt
bhatia
haryana-jatt
rajasthani-brahmin
punjabi
balochi
kashmiri
punjabi-brahmin
sindhi

For Sindhis,

sindhi
bhatia
balochi
makrani
brahui
punjabi-jatt
haryana-jatt
meghawal
pathan
punjabi

For Brahmins from Uttar Pradesh,

bihari-brahmin
haryana-jatt
brahmin-uttar-pradesh
punjabi-jatt
kurmi
sourastrian
bengali-brahmin
bihari-kayastha
bhatia
up-brahmin

For Kshatriyas,

bihari-brahmin
kurmi
meena
kshatriya
rajasthani-brahmin
haryana-jatt
punjabi-jatt
bengali-brahmin
kerala-muslim
sourastrian

For Muslims,

muslim
chamar
kol
oriya
uttar-pradesh-scheduled-caste
bihari-muslim
sourastrian
brahmin-uttaranchal
dusadh
bihari-brahmin

If Dienekes can post a chunkcount file for the clusters computed by fineSTRUCTURE, maybe we can try to figure out what happened.
