ADMIXTURE Seed and Cross-Validation

I have been running some ADMIXTURE experiments recently. This is with my world dataset and about 180,000 SNPs.

I ran ADMIXTURE at K=15 ancestral components with different random seeds. Let's take a look at the final log likelihood and cross-validation errors I got for 11 runs.

As you can see, as the log likelihood increases, the cross-validation error decreases, though there is a fair bit of variation there.

For different runs, I got fairly different ancestral components. Remember that the only difference between the different runs was the random seed used to initialize the algorithm. Some ancestral components stayed very similar across the runs but others appeared and disappeared or switched subtly between different populations in a broad region.

The cross-validation (CV) error is important in my opinion since it gives you an idea of which run has results that generalize better. Basically, ADMIXTURE calculates it by masking a portion of the genotypes in the dataset, fitting the model on the rest, and checking how well the masked genotypes are predicted.

At K=15, the minimum CV error I got was 0.52200 and the median was 0.52206. The maximum CV error was 0.52241, which is pretty large for this data given that the minimum and the median differ by only 0.00006. Let's superimpose this maximum CV value on a graph showing how CV error varies for different values of K (number of ancestral components).

The set of runs in this graph (other than the red line for the maximum CV error at K=15) used the default random seed for ADMIXTURE.

What this shows is that running ADMIXTURE only once, whether with the default random seed or any other single seed, is fraught with problems. A better approach is to run it multiple times with different seeds and compare the log likelihoods and CV errors across runs, so you can be more confident that you have arrived at a computationally optimum solution.
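
Here is a minimal sketch of that multi-seed workflow in Python. It assumes ADMIXTURE's `--cv` and `-s` command-line options, and it assumes each run's screen output was saved to a log that ends with a `Loglikelihood:` line and a `CV error (K=...):` line. The file name `world.bed` and the helper names are hypothetical, and choosing the run by lowest CV error is just one reasonable rule.

```python
import re

def admixture_command(bed_file, k, seed):
    """Build one ADMIXTURE command line with cross-validation and a fixed seed."""
    return ["admixture", "--cv", "-s", str(seed), bed_file, str(k)]

def parse_log(text):
    """Return (final log likelihood, CV error) found in one run's log text."""
    loglik = None
    cv_error = None
    for line in text.splitlines():
        line = line.strip()
        # Assumed log format: the last line starting with "Loglikelihood:" is the final value.
        if line.startswith("Loglikelihood:"):
            loglik = float(line.split(":", 1)[1])
        match = re.match(r"CV error \(K=\d+\):\s*([0-9.]+)", line)
        if match:
            cv_error = float(match.group(1))
    return loglik, cv_error

def best_run(logs_by_seed):
    """Pick the seed whose run has the lowest CV error."""
    parsed = {seed: parse_log(text) for seed, text in logs_by_seed.items()}
    best_seed = min(parsed, key=lambda seed: parsed[seed][1])
    return best_seed, parsed

# One command per seed; "world.bed" is a hypothetical input file.
commands = [admixture_command("world.bed", 15, seed) for seed in range(1, 12)]
```

Each command can be run with `subprocess.run` and its output saved per seed; `best_run` then takes a dictionary mapping seed to log text.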

Pan-Asian Admixture Results

I ran Reference 3-based supervised ADMIXTURE on the HUGO Pan-Asian dataset. While it used only 5,400 SNPs, it did get me curious about any relationship between the Onge and the Jehai and Kensiu. Unfortunately, the Pan-Asian data doesn't have a good overlap even with Reich et al. So as a first exercise, I decided to run unsupervised ADMIXTURE on the Pan-Asian dataset by itself.

Here are the bar charts for the admixture results. K=12 ancestral components had the lowest cross-validation error.

You can see these results in a spreadsheet too.

South Asian fineSTRUCTURE Ref3 Admixture

I was wondering what the admixture patterns of the clusters fineSTRUCTURE computed were for my South Asian run. So I computed the average admixture for each cluster (total: 89) using reference 3 admixture results.
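
The averaging step can be sketched like this in Python. The input layout (one list of admixture proportions per individual, plus a cluster label per individual) and the function name are assumptions for illustration, not the actual file formats used.

```python
from collections import defaultdict

def cluster_averages(admixture, cluster_of):
    """Average each admixture component over the individuals in each cluster.

    admixture:  {individual_id: [proportion for each component]}
    cluster_of: {individual_id: cluster label}
    """
    members = defaultdict(list)
    for individual, proportions in admixture.items():
        members[cluster_of[individual]].append(proportions)
    averages = {}
    for cluster, rows in members.items():
        n_components = len(rows[0])
        averages[cluster] = [
            sum(row[i] for row in rows) / len(rows) for i in range(n_components)
        ]
    return averages
```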

The default order of the clusters is to keep the closer clusters together.

Harappa Oracle

Based on the Dodecad Oracle, here is Harappa Oracle using reference 3 admixture results.

I am using Dienekes' code with a couple of changes. One of them is using weighted distance based on Fst divergences between ancestral components. Because of that it is several times slower than DodecadOracle. I plan to offer an option soon to switch between Euclidean distance and Fst-weighted distance.
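
The exact weighting is not spelled out here, so the following Python sketch shows only one possible way to weight by Fst, next to plain Euclidean distance. The assumption, which is mine, is that each Fst value is treated as a squared distance between two ancestral components, so the result is the distance between the two mixture centroids. Both vectors must sum to the same total for this to make sense.

```python
import math

def fst_weighted_distance(a, b, fst):
    """Distance between two admixture vectors, weighted by component Fst.

    ASSUMPTION: fst[i][j] is treated as a squared distance between components
    i and j. This is one possible weighting, not necessarily the Oracle's.
    """
    delta = [x - y for x, y in zip(a, b)]
    total = 0.0
    for i, di in enumerate(delta):
        for j, dj in enumerate(delta):
            total += di * dj * fst[i][j]
    # Clamp at zero: an Fst matrix need not embed perfectly in Euclidean space.
    return math.sqrt(max(0.0, -0.5 * total))

def euclidean_distance(a, b):
    """Unweighted distance: every component counts the same."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

With a weighting like this, a difference spread across two closely related components (small Fst) counts for less than the same difference between two very divergent ones, which Euclidean distance cannot capture.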

You need to install R to use it. Then unzip the Oracle zip file. Double-click on the file or use the following in R:

load('HarappaOracleR3fst.RData')

In R, you can look at the 385 populations included by typing:

X[,1]

To use it to find your closest populations, you need your Harappa Reference 3 admixture results. Enter them separated by commas, like this (these are my results):

HarappaOracle(c(44,12,0,24,14,1,2,0,0,1,2))

You will get a result, with the first column showing the closest populations and the 2nd column their distance to you.

[,1] [,2]
[1,] "balochi" "8.0242"
[2,] "bene-israel" "9.2843"
[3,] "brahui" "9.5158"
[4,] "pathan" "9.7034"
[5,] "makrani" "10.1014"
[6,] "sindhi" "10.9236"
[7,] "Bhatia" "11.8441"
[8,] "Sindhi" "12.1704"
[9,] "Kashmiri" "13.4229"
[10,] "punjabi-arain" "13.9192"
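
To show the idea behind this ranking, here is a small Python sketch. It swaps in plain Euclidean distance instead of the Fst-weighted distance the Oracle uses, and the function name and input layout are hypothetical.

```python
import math

def closest_populations(target, references, k=10):
    """Rank reference populations by Euclidean distance to a target vector.

    target:     admixture percentages for one individual
    references: {population name: mean admixture percentages}
    """
    distances = []
    for name, mean in references.items():
        d = math.sqrt(sum((t - m) ** 2 for t, m in zip(target, mean)))
        distances.append((name, round(d, 4)))
    # Smallest distance first, then keep the k closest.
    distances.sort(key=lambda pair: pair[1])
    return distances[:k]
```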

You can also find out the closest populations to one of the reference populations:

HarappaOracle("punjabi-arain")

By default, the Oracle shows the 10 closest populations. You can change that:

HarappaOracle("punjabi-arain",k=20)

Also, by default, the Oracle excludes the Pan-Asian dataset since the overlap is only 5,400 SNPs. You can include Pan-Asian populations:

HarappaOracle("punjabi-arain",panasian=TRUE)

There is also a mixed mode where the individual (or mean reference population) is compared against all pairs of populations as ancestors.

HarappaOracle("Haryana Jatt",mixedmode=TRUE)

which has the following output:

[1,] "Haryana Jatt" "0"
[2,] "15.4% lithuanians + 84.6% Punjabi Brahmin" "1.9553"
[3,] "10.6% russian + 89.4% Rajasthani Brahmin" "2.0626"
[4,] "14.7% finnish + 85.3% Punjabi Brahmin" "2.0863"
[5,] "9.2% finnish + 90.8% Rajasthani Brahmin" "2.1142"
[6,] "89.4% Rajasthani Brahmin + 10.6% mordovians" "2.1727"
[7,] "9.6% lithuanians + 90.4% Rajasthani Brahmin" "2.1989"
[8,] "10.1% belorussian + 89.9% Rajasthani Brahmin" "2.2938"
[9,] "16.8% russian + 83.2% Punjabi Brahmin" "2.3015"
[10,] "16.2% belorussian + 83.8% Punjabi Brahmin" "2.3656"
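
A two-population search of this kind could be sketched as follows in Python. The Oracle's own search procedure is not described here, so this is an assumption: for each pair, it takes the least-squares mixing fraction under plain Euclidean distance, clipped to between 0 and 1. All names are hypothetical.

```python
import math
from itertools import combinations

def best_two_way_mixes(target, references, k=10):
    """For every pair of populations, find the mix closest to the target.

    Returns (distance, fraction of first population, first name, second name).
    """
    results = []
    for (name_p, p), (name_q, q) in combinations(references.items(), 2):
        diff = [pi - qi for pi, qi in zip(p, q)]
        norm = sum(d * d for d in diff)
        if norm == 0:
            continue  # identical populations: the mixing fraction is undefined
        # Least-squares fraction of P, clipped to the valid range [0, 1].
        alpha = sum((t - qi) * d for t, qi, d in zip(target, q, diff)) / norm
        alpha = min(1.0, max(0.0, alpha))
        mix = [alpha * pi + (1 - alpha) * qi for pi, qi in zip(p, q)]
        dist = math.sqrt(sum((t - m) ** 2 for t, m in zip(target, mix)))
        results.append((dist, alpha, name_p, name_q))
    results.sort(key=lambda r: r[0])
    return results[:k]
```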

You can of course combine any or all of the options.

Think of Harappa Oracle as a tool to help you interpret your admixture results by comparing who you are closest to. Do not think of it as giving you your real ancestry.

Ref3 Admixture Dendrograms

I have posted the reference 3 K=11 admixture results for all populations and datasets. Here are the relevant links:

So let's try a dendrogram of all these populations' average admixture results. Instead of using regular Euclidean distance, I used some weighting based on Fst distances between admixture components, very similar to what Palisto did.

Here's a dendrogram of all datasets using complete linkage.
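
For readers unfamiliar with complete linkage, here is a small Python sketch of it. It takes any distance function between populations, so an Fst-weighted distance could be plugged in. This is a naive illustration with hypothetical names, not the code used to draw the dendrogram.

```python
def complete_linkage(names, distance):
    """Agglomerative clustering with complete linkage.

    names:    list of population names
    distance: function(name_a, name_b) -> distance between two populations
    Returns the merges in order as (cluster_a, cluster_b, height) tuples,
    where each cluster is a tuple of population names.
    """
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                height = max(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or height < best[0]:
                    best = (height, i, j)
        height, i, j = best
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges
```

The merge heights are what the dendrogram plots on its distance axis: two groups join at the distance of their most dissimilar pair of populations.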

Since the Pan-Asian dataset had only 5,400 SNPs in common with reference 3, we need to be careful interpreting the tree above. Just to make sure, here's the dendrogram excluding Pan-Asian populations.

Henn Ref3 K=11 Admixture

There have been two Henn et al papers since I started this project.

  1. Hunter-gatherer genomic diversity suggests a southern African origin for modern humans by Brenna M. Henn, Christopher R. Gignoux, Matthew Jobin, Julie M. Granka, J. M. Macpherson, Jeffrey M. Kidd, Laura Rodríguez-Botigué, Sohini Ramachandran, Lawrence Hon, Abra Brisbin, Alice A. Lin, Peter A. Underhill, David Comas, Kenneth K. Kidd, Paul J. Norman, Peter Parham, Carlos D. Bustamante, Joanna L. Mountain, and Marcus W. Feldman
  2. Genomic Ancestry of North Africans Supports Back-to-Africa Migrations by Brenna M. Henn, Laura R. Botigué, Simon Gravel, Wei Wang, Abra Brisbin, Jake K. Byrnes, Karima Fadhlaoui-Zid, Pierre A. Zalloua, Andres Moreno-Estrada, Jaume Bertranpetit, Carlos D. Bustamante, David Comas

The data for both is available online:

I ran reference 3 K=11 admixture on these datasets using about 48,000 SNPs.

Here is the spreadsheet with the Henn group averages for reference 3 admixture at K=11 ancestral components.

Note that the Sandawe, Hadza and San from Henn2011 were already included in Reference 3 and are not listed here.

Pan-Asian Ref3 K=11 Admixture

The HUGO Pan-Asian dataset covers South and East Asia with the following South Asian populations:

  • 23 Andhra Pradesh & Karnataka
  • 10 Bengali
  • 23 Bhil (Rajasthan)
  • 20 Haryana
  • 23 Kashmir Spiti
  • 12 Marathi
  • 12 Rajasthani
  • 30 Singapore Indian
  • 20 Uttaranchal
  • 13 Uttar Pradesh

Unfortunately, they do not specify ethnic or caste background for most Indian groups. Instead, their focus is on Mongoloid/Caucasoid/Australoid etc.

Also, the SNP overlap with other datasets is really small. Therefore, this reference 3 admixture run was done using only 5,400 SNPs. I recommend a big bucket of salt when interpreting these results.

Here is the spreadsheet with the Pan-Asian group averages for reference 3 admixture at K=11 ancestral components.

Xing Ref3 Admixture South Asians

As per AV's comment, here are the individual results for Xing et al South Asians.

Simonson Tibet Dataset

Recently, I discovered that the paper Genetic Evidence for High-Altitude Adaptation in Tibet by Tatum S. Simonson, Yingzhong Yang, Chad D. Huff, Haixia Yun, Ga Qin, David J. Witherspoon, Zhenzhong Bai, Felipe R. Lorenzo, Jinchuan Xing, Lynn B. Jorde, Josef T. Prchal, RiLi Ge has its genotyping data online.

It contains 31 Tibetans from Madou county in Qinghai province. The chip is Affymetrix and there are 868,146 SNPs, which means it has a good overlap with Reich et al and Xing et al and also with my reference 3.

I ran reference 3 K=11 admixture on this dataset. Here are the individual results:

The average is as follows:

S Asian   E Asian   Siberian
1%        84%       14%

Dodecad South Asian ChromoPainter

Dienekes ran ChromoPainter/fineSTRUCTURE analysis of South Asians along with some West Eurasian populations, something I had neglected to do in my own South Asian run.

Using Dienekes' data, I was trying to figure out which South Asian populations had more DNA chunks in common with other groups when I ran into something strange. Looking at the chunkcount spreadsheet, if we focus on a recipient population (i.e., one row), we can see which populations contributed more "chunks". For most populations, the results are as expected: the top donor is either the same population or some close population. For example, let's look at the top 5 matches for Velamas_M:

Recipient      Velamas_M  Pulliyar_M  North_Kannadi  Chamar_M  Piramalai_Kallars_M
Velamas_M      1265.77    1259.38     1256.06        1255.6    1254.74

However, when we do the same for Pathans, Sindhis, Uttar Pradesh Brahmins, Kshatriyas and Muslims, we get strange results.

Recipient      Chamar_M  Velamas_M  UP_Scheduled_Caste_M  Piramalai_Kallars_M  Muslim_M
Pathan         1229.91   1229.56    1229.53               1229.32              1229.27

Do Pathans match Chamar the best? Pathans don't show up as a donor till #11.

Recipient      Chamar_M  Piramalai_Kallars_M  Pulliyar_M  Velamas_M  North_Kannadi
Sindhi         1234.09   1234.08              1233.85     1233.6     1233.55

Again, Sindhis as donors are #12.

Recipient      Pulliyar_M  Chamar_M  North_Kannadi  Kol_M    Piramalai_Kallars_M
Brahmins_UP_M  1244.6      1244.53   1243.44        1242.88  1241.94

The same Brahmins_UP_M are #13 as donors.

Recipient      Pulliyar_M  Chamar_M  North_Kannadi  Kol_M    Piramalai_Kallars_M
Kshatriya_M    1247.72     1247.36   1246.42        1244.98  1244.56

And Kshatriya_M are #12 as donors.

Recipient      Pulliyar_M  Chamar_M  North_Kannadi  Kol_M    Piramalai_Kallars_M
Muslim_M       1255.96     1255.36   1253.96        1251.74  1250.86

Muslim_M are #8 as donors.

There is a pattern here among the top donors for these populations. The same populations show up time and again.
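
The check I am doing for each row can be sketched in Python like this. The helper names are hypothetical; the numbers are the top-5 chunk counts quoted above, so a population missing from its own top 5 simply comes back as `None` here.

```python
def rank_donors(row):
    """Sort donors for one recipient row from most to fewest chunks."""
    return sorted(row.items(), key=lambda item: item[1], reverse=True)

def self_rank(recipient, row):
    """1-based rank of the recipient among its own donors, or None if absent."""
    for position, (donor, _count) in enumerate(rank_donors(row), start=1):
        if donor == recipient:
            return position
    return None

# Top-5 chunk counts for two recipients, copied from the rows above.
velamas_row = {
    "Velamas_M": 1265.77,
    "Pulliyar_M": 1259.38,
    "North_Kannadi": 1256.06,
    "Chamar_M": 1255.6,
    "Piramalai_Kallars_M": 1254.74,
}
pathan_row = {
    "Chamar_M": 1229.91,
    "Velamas_M": 1229.56,
    "UP_Scheduled_Caste_M": 1229.53,
    "Piramalai_Kallars_M": 1229.32,
    "Muslim_M": 1229.27,
}
print(self_rank("Velamas_M", velamas_row))  # 1
print(self_rank("Pathan", pathan_row))      # None: Pathan is not in its own top 5
```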

Now compare this to my results (with a larger South Asian dataset). The top 10 matches for Pathans are:

  1. pathan
  2. punjabi-jatt
  3. bhatia
  4. haryana-jatt
  5. rajasthani-brahmin
  6. punjabi
  7. balochi
  8. kashmiri
  9. punjabi-brahmin
  10. sindhi

For Sindhis,

  1. sindhi
  2. bhatia
  3. balochi
  4. makrani
  5. brahui
  6. punjabi-jatt
  7. haryana-jatt
  8. meghawal
  9. pathan
  10. punjabi

For Brahmins from Uttar Pradesh,

  1. bihari-brahmin
  2. haryana-jatt
  3. brahmin-uttar-pradesh
  4. punjabi-jatt
  5. kurmi
  6. sourastrian
  7. bengali-brahmin
  8. bihari-kayastha
  9. bhatia
  10. up-brahmin

For Kshatriyas,

  1. bihari-brahmin
  2. kurmi
  3. meena
  4. kshatriya
  5. rajasthani-brahmin
  6. haryana-jatt
  7. punjabi-jatt
  8. bengali-brahmin
  9. kerala-muslim
  10. sourastrian

For Muslims,

  1. muslim
  2. chamar
  3. kol
  4. oriya
  5. uttar-pradesh-scheduled-caste
  6. bihari-muslim
  7. sourastrian
  8. brahmin-uttaranchal
  9. dusadh
  10. bihari-brahmin

If Dienekes can post a chunkcount file for the clusters computed by fineSTRUCTURE, maybe we can try to figure out what happened.