Monthly Archives: March 2012

South Asian fineSTRUCTURE Ref3 Admixture

I wanted to see the admixture patterns of the clusters that fineSTRUCTURE computed for my South Asian run, so I averaged the reference 3 admixture results within each of the 89 clusters.
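The per-cluster averaging can be sketched in a few lines of Python. This is only an illustration: the function name `cluster_means`, the input layout and the toy numbers in the usage note are assumptions, not the script or data behind the actual spreadsheet.

```python
from collections import defaultdict


def cluster_means(rows):
    """Average admixture proportions per cluster.

    rows: iterable of (cluster_id, [proportion per ancestral component]).
    Returns {cluster_id: [mean proportion per component]}.
    """
    sums = {}
    counts = defaultdict(int)
    for cluster, props in rows:
        if cluster not in sums:
            sums[cluster] = [0.0] * len(props)
        for i, p in enumerate(props):
            sums[cluster][i] += p
        counts[cluster] += 1
    return {c: [s / counts[c] for s in total] for c, total in sums.items()}
```

For example, two hypothetical individuals in cluster "c1" with results [60, 40] and [80, 20] average to [70, 30].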

By default, the clusters are ordered so that closely related clusters sit next to each other.

Harappa Oracle

Based on the Dodecad Oracle, here is Harappa Oracle using reference 3 admixture results.

I am using Dienekes' code with a couple of changes. One of them is a weighted distance based on the Fst divergences between ancestral components, which makes it several times slower than DodecadOracle. I plan to offer an option soon to switch between Euclidean distance and Fst-weighted distance.
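To make the difference between the two distances concrete, here is a small Python sketch (the Oracle itself is R). The weighting formula is an assumption, since the exact formula used in Harappa Oracle is not given here. It treats the Fst between two components as a squared distance between them, so two individuals who are each 100% one component end up sqrt(Fst) apart. The function names and the toy Fst values are hypothetical, and proportions are taken as fractions summing to 1.

```python
import math


def euclidean(x, y):
    """Plain Euclidean distance between two admixture vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def fst_weighted(x, y, fst):
    """One possible Fst-weighted distance (an assumed form, not the Oracle's).

    fst[i][j] is the Fst between components i and j (0 on the diagonal).
    For differences that sum to zero, -1/2 * sum_ij d_i d_j fst[i][j]
    is the squared distance between the two mixtures when Fst is read
    as a squared distance between components.
    """
    delta = [a - b for a, b in zip(x, y)]
    n = len(delta)
    q = sum(delta[i] * delta[j] * fst[i][j] for i in range(n) for j in range(n))
    # Real Fst matrices need not embed perfectly, so clamp at zero.
    return math.sqrt(max(0.0, -0.5 * q))
```

With a toy Fst matrix where components 0 and 1 are close and component 2 is far from both, an individual who is all component 0 is equally far from "all 1" and "all 2" by Euclidean distance, but much closer to "all 1" under the weighted distance. The double loop over components also hints at why the weighted version is slower.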

You need to install R to use it. Then unzip the Oracle zip file. Double-click on the file or use the following in R:


In R, you can look at the 385 populations included by typing:


To use it to find your closest populations, you need your Harappa Reference 3 admixture results. Use them separated by commas like this (for me):


The result has two columns: the closest populations in the first and their distances to you in the second.

[,1] [,2]
[1,] "balochi" "8.0242"
[2,] "bene-israel" "9.2843"
[3,] "brahui" "9.5158"
[4,] "pathan" "9.7034"
[5,] "makrani" "10.1014"
[6,] "sindhi" "10.9236"
[7,] "Bhatia" "11.8441"
[8,] "Sindhi" "12.1704"
[9,] "Kashmiri" "13.4229"
[10,] "punjabi-arain" "13.9192"
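The lookup behind a table like this is a nearest-neighbour ranking over the population averages. Below is a minimal Python sketch of that idea, not the Oracle's R code. It uses plain Euclidean distance, so it would not reproduce the Fst-weighted numbers above, and the function name `closest`, its `exclude` parameter and the example populations are all hypothetical.

```python
import math


def closest(query, refs, n=10, exclude=()):
    """Rank reference populations by distance to an admixture vector.

    query:   list of admixture percentages for one individual.
    refs:    {population name: mean admixture percentages}.
    exclude: names to skip (for instance a low-overlap dataset).
    Returns the n nearest populations as (name, distance) pairs.
    """
    ranked = sorted(
        (math.dist(query, vec), name)
        for name, vec in refs.items()
        if name not in exclude
    )
    return [(name, round(d, 4)) for d, name in ranked[:n]]
```

Passing the mean vector of a reference population as `query` gives the populations closest to that group instead of to an individual.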

You can also find out the closest populations to one of the reference populations:


By default, the Oracle shows the 10 closest populations. You can change that:


Also, by default, the Oracle excludes the Pan-Asian dataset since the overlap is only 5,400 SNPs. You can include Pan-Asian populations:


There is also a mixed mode where the individual (or mean reference population) is compared against all pairs of populations as ancestors.

HarappaOracle("Haryana Jatt",mixedmode=T)

which has the following output:

[1,] "Haryana Jatt" "0"
[2,] "15.4% lithuanians + 84.6% Punjabi Brahmin" "1.9553"
[3,] "10.6% russian + 89.4% Rajasthani Brahmin" "2.0626"
[4,] "14.7% finnish + 85.3% Punjabi Brahmin" "2.0863"
[5,] "9.2% finnish + 90.8% Rajasthani Brahmin" "2.1142"
[6,] "89.4% Rajasthani Brahmin + 10.6% mordovians" "2.1727"
[7,] "9.6% lithuanians + 90.4% Rajasthani Brahmin" "2.1989"
[8,] "10.1% belorussian + 89.9% Rajasthani Brahmin" "2.2938"
[9,] "16.8% russian + 83.2% Punjabi Brahmin" "2.3015"
[10,] "16.2% belorussian + 83.8% Punjabi Brahmin" "2.3656"
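The idea behind mixed mode can be sketched as follows: for every pair of reference populations, find the mixing proportion that brings the blend closest to the target, then rank the pairs by that distance. The Python below is an illustration under assumptions. It uses Euclidean distance (where the best proportion has a closed form) rather than the Fst-weighted distance, and the names `best_mix` and `mixed_mode` are hypothetical, not taken from the Oracle's code.

```python
import math
from itertools import combinations


def best_mix(q, a, b):
    """Proportion alpha of a (and 1 - alpha of b) closest to q.

    Projects q onto the segment between a and b and clamps alpha
    to [0, 1]. Returns (alpha, distance from q to the mixture).
    """
    d = [ai - bi for ai, bi in zip(a, b)]
    denom = sum(di * di for di in d)
    if denom == 0:
        alpha = 1.0  # a and b are identical; any split gives the same point
    else:
        alpha = sum((qi - bi) * di for qi, bi, di in zip(q, b, d)) / denom
        alpha = min(1.0, max(0.0, alpha))
    mix = [alpha * ai + (1 - alpha) * bi for ai, bi in zip(a, b)]
    return alpha, math.dist(q, mix)


def mixed_mode(q, refs, n=10):
    """Rank all two-population mixtures by their distance to q."""
    results = []
    for (name_a, a), (name_b, b) in combinations(refs.items(), 2):
        alpha, dist = best_mix(q, a, b)
        label = f"{100 * alpha:.1f}% {name_a} + {100 * (1 - alpha):.1f}% {name_b}"
        results.append((dist, label))
    results.sort()
    return [(label, round(dist, 4)) for dist, label in results[:n]]
```

With 385 populations there are 73,920 pairs to try, which is part of why mixed mode takes longer than the single-population lookup.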

You can of course combine any or all of the options.

Think of Harappa Oracle as a tool to help you interpret your admixture results by comparing who you are closest to. Do not think of it as giving you your real ancestry.

Ref3 Admixture Dendrograms

I have posted the reference 3 K=11 admixture results for all populations and datasets. Here are the relevant links:

So let's try a dendrogram of all these populations' average admixture results. Instead of using regular Euclidean distance, I used some weighting based on Fst distances between admixture components, very similar to what Palisto did.

Here's a dendrogram of all datasets using complete linkage.

Since the Pan-Asian dataset had only 5,400 SNPs common with reference 3, we need to be careful interpreting the tree above. Just to make sure, here's the dendrogram excluding Pan-Asian populations.
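For readers unfamiliar with complete linkage: at each step the two clusters whose farthest members are closest get merged, and that farthest-pair distance becomes the height of the join in the tree. Here is a brute-force Python sketch of the merging rule. It is an illustration only. It uses Euclidean distance instead of the Fst-based weighting used for the trees here, and `complete_linkage` is a hypothetical name, not the code that produced these dendrograms.

```python
import math
from itertools import combinations


def complete_linkage(points):
    """Agglomerative clustering with complete linkage.

    points: {population name: mean admixture vector}.
    Returns the merges in order as (members_a, members_b, height),
    where height is the largest distance between any member of one
    cluster and any member of the other.
    """
    clusters = [(name,) for name in points]
    merges = []

    def linkage(c1, c2):
        return max(math.dist(points[a], points[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest complete-linkage distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((c1, c2, linkage(c1, c2)))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
    return merges
```

The list of merges and heights is exactly the information a dendrogram draws. A noisy population, such as one measured on only a few thousand SNPs, can raise the height of every cluster it joins, which is one reason to check the tree with and without such data.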

Henn Ref3 K=11 Admixture

There have been two Henn et al. papers since I started this project.

  1. Hunter-gatherer genomic diversity suggests a southern African origin for modern humans by Brenna M. Henn, Christopher R. Gignoux, Matthew Jobin, Julie M. Granka, J. M. Macpherson, Jeffrey M. Kidd, Laura Rodríguez-Botigué, Sohini Ramachandran, Lawrence Hon, Abra Brisbin, Alice A. Lin, Peter A. Underhill, David Comas, Kenneth K. Kidd, Paul J. Norman, Peter Parham, Carlos D. Bustamante, Joanna L. Mountain, and Marcus W. Feldman
  2. Genomic Ancestry of North Africans Supports Back-to-Africa Migrations by Brenna M. Henn, Laura R. Botigué, Simon Gravel, Wei Wang, Abra Brisbin, Jake K. Byrnes, Karima Fadhlaoui-Zid, Pierre A. Zalloua, Andres Moreno-Estrada, Jaume Bertranpetit, Carlos D. Bustamante, David Comas

The data for both is available online:

I ran reference 3 K=11 admixture on these datasets using about 48,000 SNPs.

Here is the spreadsheet with the Henn group averages for reference 3 admixture at K=11 ancestral components.

Note that the Sandawe, Hadza and San from Henn2011 were already included in Reference 3 and are not listed here.

Pan-Asian Ref3 K=11 Admixture

The HUGO Pan-Asian dataset covers South and East Asia with the following South Asian populations:

  • 23 Andhra Pradesh & Karnataka
  • 10 Bengali
  • 23 Bhil (Rajasthan)
  • 20 Haryana
  • 23 Kashmir Spiti
  • 12 Marathi
  • 12 Rajasthani
  • 30 Singapore Indian
  • 20 Uttaranchal
  • 13 Uttar Pradesh

Unfortunately, they do not specify ethnic or caste background for most Indian groups. Instead, their focus is on broad categories such as Mongoloid, Caucasoid and Australoid.

Also, the SNP overlap with other datasets is really small. Therefore, this reference 3 admixture run was done using only 5,400 SNPs. I recommend a big bucket of salt when interpreting these results.

Here is the spreadsheet with the Pan-Asian group averages for reference 3 admixture at K=11 ancestral components.

Xing Ref3 Admixture South Asians

As per AV's comment, here are the individual results for Xing et al South Asians.

Simonson Tibet Dataset

Recently, I discovered that the paper Genetic Evidence for High-Altitude Adaptation in Tibet by Tatum S. Simonson, Yingzhong Yang, Chad D. Huff, Haixia Yun, Ga Qin, David J. Witherspoon, Zhenzhong Bai, Felipe R. Lorenzo, Jinchuan Xing, Lynn B. Jorde, Josef T. Prchal, RiLi Ge has its genotyping data online.

It contains 31 Tibetans from Madou county in Qinghai province. The chip is Affymetrix and there are 868,146 SNPs, which means it has a good overlap with Reich et al and Xing et al and also with my reference 3.

I ran reference 3 K=11 admixture on this dataset. Here are the individual results:

The average is as follows:

S Asian  E Asian  Siberian
1%       84%      14%

Dodecad South Asian ChromoPainter

Dienekes ran ChromoPainter/fineSTRUCTURE analysis of South Asians along with some West Eurasian populations, something I had neglected to do in my own South Asian run.

Using Dienekes' data, I was trying to figure out which South Asian populations had more DNA chunks in common with other groups when I ran into something strange. Looking at the chunkcount spreadsheet, if we focus on a recipient population (i.e., one row), we can see which populations contributed the most "chunks". For most populations, the results are as expected: the top donor is either the same population or a closely related one. For example, here are the top 5 matches for Velamas_M:

Recipient  Velamas_M  Pulliyar_M  North_Kannadi  Chamar_M  Piramalai_Kallars_M
Velamas_M  1265.77    1259.38     1256.06        1255.6    1254.74

However, when we do the same for Pathans, Sindhis, Uttar Pradesh Brahmins, Kshatriyas and Muslims, we get strange results.

Recipient  Chamar_M  Velamas_M  UP_Scheduled_Caste_M  Piramalai_Kallars_M  Muslim_M
Pathan     1229.91   1229.56    1229.53               1229.32              1229.27

Do Pathans really match Chamar best? Pathans themselves don't show up as a donor until #11.

Recipient  Chamar_M  Piramalai_Kallars_M  Pulliyar_M  Velamas_M  North_Kannadi
Sindhi     1234.09   1234.08              1233.85     1233.6     1233.55

Again, Sindhis as donors are #12.

Recipient      Pulliyar_M  Chamar_M  North_Kannadi  Kol_M    Piramalai_Kallars_M
Brahmins_UP_M  1244.6      1244.53   1243.44        1242.88  1241.94

The same Brahmins_UP_M are #13 as donors.

Recipient    Pulliyar_M  Chamar_M  North_Kannadi  Kol_M    Piramalai_Kallars_M
Kshatriya_M  1247.72     1247.36   1246.42        1244.98  1244.56

And Kshatriya_M are #12 as donors.

Recipient  Pulliyar_M  Chamar_M  North_Kannadi  Kol_M    Piramalai_Kallars_M
Muslim_M   1255.96     1255.36   1253.96        1251.74  1250.86

Muslim_M are #8 as donors.

There is a pattern here among the top donors for these populations. The same populations show up time and again.
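The check I did by eye on each row can be written as two small helpers: list the top donors for a recipient, and find where the recipient's own population ranks among its donors. This is a Python sketch with hypothetical function names. The example row in the usage note holds only the five Velamas_M values quoted above, not the full chunkcount row.

```python
def top_donors(row, n=5):
    """Return the n donors with the highest chunk counts.

    row: {donor population: chunk count} for a single recipient.
    """
    return sorted(row.items(), key=lambda kv: kv[1], reverse=True)[:n]


def self_rank(recipient, row):
    """1-based rank of the recipient's own population among its donors."""
    ranked = sorted(row, key=row.get, reverse=True)
    return ranked.index(recipient) + 1


# Top-5 values for Velamas_M as quoted above (a partial row).
velamas_row = {
    "Velamas_M": 1265.77,
    "Pulliyar_M": 1259.38,
    "North_Kannadi": 1256.06,
    "Chamar_M": 1255.6,
    "Piramalai_Kallars_M": 1254.74,
}
```

Here `self_rank("Velamas_M", velamas_row)` is 1, the expected pattern. Run on full rows, the same check would flag the recipients above whose own population ranks only 8th to 13th.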

Now compare with my results, which use a larger South Asian dataset. The top 10 matches for Pathans are:

  1. pathan
  2. punjabi-jatt
  3. bhatia
  4. haryana-jatt
  5. rajasthani-brahmin
  6. punjabi
  7. balochi
  8. kashmiri
  9. punjabi-brahmin
  10. sindhi

For Sindhis,

  1. sindhi
  2. bhatia
  3. balochi
  4. makrani
  5. brahui
  6. punjabi-jatt
  7. haryana-jatt
  8. meghawal
  9. pathan
  10. punjabi

For Brahmins from Uttar Pradesh,

  1. bihari-brahmin
  2. haryana-jatt
  3. brahmin-uttar-pradesh
  4. punjabi-jatt
  5. kurmi
  6. sourastrian
  7. bengali-brahmin
  8. bihari-kayastha
  9. bhatia
  10. up-brahmin

For Kshatriyas,

  1. bihari-brahmin
  2. kurmi
  3. meena
  4. kshatriya
  5. rajasthani-brahmin
  6. haryana-jatt
  7. punjabi-jatt
  8. bengali-brahmin
  9. kerala-muslim
  10. sourastrian

For Muslims,

  1. muslim
  2. chamar
  3. kol
  4. oriya
  5. uttar-pradesh-scheduled-caste
  6. bihari-muslim
  7. sourastrian
  8. brahmin-uttaranchal
  9. dusadh
  10. bihari-brahmin

If Dienekes can post a chunkcount file for the clusters computed by fineSTRUCTURE, maybe we can try to figure out what happened.

Genetic Affinities of the Central Indian Tribal Populations

Genetic Affinities of the Central Indian Tribal Populations by Gunjan Sharma, Rakesh Tamang, Ruchira Chaudhary, Vipin Kumar Singh, Anish M. Shah, Sharath Anugula, Deepa Selvi Rani, Alla G. Reddy, Muthukrishnan Eaaswarkhanth, Gyaneshwer Chaubey, Lalji Singh, Kumarasamy Thangaraj:

The central Indian state Madhya Pradesh is often called as ‘heart of India’ and has always been an important region functioning as a trinexus belt for three major language families (Indo-European, Dravidian and Austroasiatic). There are less detailed genetic studies on the populations inhabited in this region. Therefore, this study is an attempt for extensive characterization of genetic ancestries of three tribal populations, namely; Bharia, Bhil and Sahariya, inhabiting this region using haploid and diploid DNA markers.

Methodology/Principal Findings
Mitochondrial DNA analysis showed high diversity, including some of the older sublineages of M haplogroup and prominent R lineages in all the three tribes. Y-chromosomal biallelic markers revealed high frequency of Austroasiatic-specific M95-O2a haplogroup in Bharia and Sahariya, M82-H1a in Bhil and M17-R1a in Bhil and Sahariya. The results obtained by haploid as well as diploid genetic markers revealed strong genetic affinity of Bharia (a Dravidian speaking tribe) with the Austroasiatic (Munda) group. The gene flow from Austroasiatic group is further confirmed by their Y-STRs haplotype sharing analysis, where we determined their founder haplotype from the North Munda speaking tribe, while, autosomal analysis was largely in concordant with the haploid DNA results.

Bhil exhibited largely Indo-European specific ancestry, while Sahariya and Bharia showed admixed genetic package of Indo-European and Austroasiatic populations. Hence, in a landscape like India, linguistic label doesn't unequivocally follow the genetic footprints.

Did they seriously use only 48 AIMs (ancestry informative markers) for their autosomal analysis?

UPDATE: Here is their autosomal analysis using STRUCTURE on 48 AIMs.

Can't say I am impressed. It is very noisy. They have the African component varying from 6.2% to 13.2% in populations that should have none. They also have Bhil at 10.8% East Asian (I got 0%), Sahariya at 15.8% (I got 12%), and Gond at 9.2% (I got 7%).

In short, using 48 AIMs instead of 118,000 SNPs leads to really noisy results.