Tag Archives: reference

HarappaWorld Admixture

Here is a new admixture calculator. It uses populations from all over the world, and I got the best results (i.e., the lowest cross-validation error) at K=16.

You can see the admixture results for different ethnic groups as well as results for individual (founder-only) project participants.

UPDATE: The population results have been calculated using weighted means.

The group results are also shown in the usual interactive bar chart below. You can click on the component labels to sort by that ancestral component.

Do note that the admixture components do not necessarily represent real ancestral populations. Also, the names I have chosen for the components should be thought of as mnemonics to ease discussion. I chose them based on which populations in my data these components peaked in. They do not tell us anything directly about ancestral populations. The best way to look at these admixture results is by comparing individuals and populations.

I used 188,173 SNPs for this run. However, the results reported above for the Henn2011 (181,223 SNPs for Hadza, Sandawe and San, 26,494 SNPs for other groups), Henn2012 (26,494 SNPs), Reich (48,967 SNPs) and Xing (18,986 SNPs) datasets were calculated using a lower number of common SNPs. Hence caution should be exercised in interpreting those results.

You can also see the Fst distances between the ancestral components.

I should have HarappaWorldOracle and DIYHarappaWorld calculators out in the next few days.

Also, I am working on another calculator which will focus more closely on South Asia.

Pan-Asian Admixture Results

I ran Reference 3-based supervised ADMIXTURE on the HUGO Pan-Asian dataset. While it used only 5,400 SNPs, it did get me curious about any relationship between Onge and Jehai and Kensiu. Unfortunately, the Pan-Asian data doesn't have a good overlap even with Reich et al. So as a first exercise, I decided to run unsupervised ADMIXTURE on the Pan-Asian dataset by itself.

Here are the bar charts for the admixture results. K=12 ancestral components had the lowest cross-validation error.
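For anyone wanting to reproduce the choice of K, here is a minimal sketch of picking the run with the lowest cross-validation error. It assumes the logs of the ADMIXTURE runs (made with the `--cv` option) have been concatenated into one string, and that each run printed a line of the form `CV error (K=12): 0.5301`. The function name `best_k` and the numbers in `example_log` are made up for illustration; they are not the values from this run.

```python
import re

# Matches lines such as "CV error (K=12): 0.5301"
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def best_k(log_text):
    """Return (K, error) for the run with the lowest cross-validation error."""
    errors = {int(k): float(err) for k, err in CV_LINE.findall(log_text)}
    if not errors:
        raise ValueError("no CV error lines found")
    k = min(errors, key=errors.get)
    return k, errors[k]


# Hypothetical log excerpt; the error values are invented.
example_log = """\
CV error (K=10): 0.5312
CV error (K=11): 0.5305
CV error (K=12): 0.5301
CV error (K=13): 0.5304
"""

print(best_k(example_log))
```

The differences in CV error between neighbouring values of K are often small, so it is worth looking at the whole curve rather than only the minimum.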

You can see these results in a spreadsheet too.

Ref3 Admixture Dendrograms

I have posted the reference 3 K=11 admixture results for all populations and datasets. Here are the relevant links:

So let's try a dendrogram of all these populations' average admixture results. Instead of using regular Euclidean distance, I used some weighting based on Fst distances between admixture components, very similar to what Palisto did.

Here's a dendrogram of all datasets using complete linkage.

Since the Pan-Asian dataset had only 5,400 SNPs common with reference 3, we need to be careful interpreting the tree above. Just to make sure, here's the dendrogram excluding Pan-Asian populations.
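The exact weighting used for these trees is not spelled out here, so the following is only a sketch of one way an Fst-weighted distance and complete-linkage clustering could be combined. It assumes the squared distance between two admixture profiles p and q is -1/2 (p-q)' F (p-q), where F is the matrix of Fst values between components; with this choice two "pure" populations end up sqrt(Fst) apart. The function names, population names and all numbers are hypothetical.

```python
import math


def fst_weighted_distance(p, q, fst):
    """Distance between two admixture profiles, weighted by component Fst.

    Assumed form: d^2 = -1/2 * (p - q)' F (p - q), clamped at zero.
    """
    diff = [a - b for a, b in zip(p, q)]
    n = len(diff)
    quad = sum(diff[i] * diff[j] * fst[i][j] for i in range(n) for j in range(n))
    return math.sqrt(max(0.0, -0.5 * quad))


def complete_linkage(names, profiles, fst):
    """Naive agglomerative clustering; returns the merges in order.

    Each merge is (cluster_a, cluster_b, distance), where clusters are
    tuples of population names and the distance between clusters is the
    largest pairwise distance between their members (complete linkage).
    """
    index = {name: i for i, name in enumerate(names)}
    clusters = [(name,) for name in names]

    def dist(a, b):
        return max(
            fst_weighted_distance(profiles[index[x]], profiles[index[y]], fst)
            for x in a
            for y in b
        )

    merges = []
    while len(clusters) > 1:
        d, i, j = min(
            (dist(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Toy data: three components and three populations, all values invented.
fst = [
    [0.00, 0.10, 0.12],
    [0.10, 0.00, 0.03],
    [0.12, 0.03, 0.00],
]
names = ["PopA", "PopB", "PopC"]
profiles = [
    [0.9, 0.1, 0.0],
    [0.8, 0.1, 0.1],
    [0.0, 0.2, 0.8],
]

for a, b, d in complete_linkage(names, profiles, fst):
    print(a, b, round(d, 4))
```

Compared with plain Euclidean distance, this kind of weighting makes a shift between two closely related components count for less than the same shift between two distant ones.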

Henn Ref3 K=11 Admixture

There have been two Henn et al papers since I started this project.

  1. Hunter-gatherer genomic diversity suggests a southern African origin for modern humans by Brenna M. Henn, Christopher R. Gignoux, Matthew Jobin, Julie M. Granka, J. M. Macpherson, Jeffrey M. Kidd, Laura Rodríguez-Botigué, Sohini Ramachandran, Lawrence Hon, Abra Brisbin, Alice A. Lin, Peter A. Underhill, David Comas, Kenneth K. Kidd, Paul J. Norman, Peter Parham, Carlos D. Bustamante, Joanna L. Mountain, and Marcus W. Feldman
  2. Genomic Ancestry of North Africans Supports Back-to-Africa Migrations by Brenna M. Henn, Laura R. Botigué, Simon Gravel, Wei Wang, Abra Brisbin, Jake K. Byrnes, Karima Fadhlaoui-Zid, Pierre A. Zalloua, Andres Moreno-Estrada, Jaume Bertranpetit, Carlos D. Bustamante, David Comas

The data for both is available online:

I ran reference 3 K=11 admixture on these datasets using about 48,000 SNPs.

Here is the spreadsheet with the Henn group averages for reference 3 admixture at K=11 ancestral components.

Note that the Sandawe, Hadza and San from Henn2011 were already included in Reference 3 and are not listed here.

Pan-Asian Ref3 K=11 Admixture

The HUGO Pan-Asian dataset covers South and East Asia with the following South Asian populations:

  • 23 Andhra Pradesh & Karnataka
  • 10 Bengali
  • 23 Bhil (Rajasthan)
  • 20 Haryana
  • 23 Kashmir Spiti
  • 12 Marathi
  • 12 Rajasthani
  • 30 Singapore Indian
  • 20 Uttaranchal
  • 13 Uttar Pradesh

Unfortunately, they do not specify ethnic or caste background for most Indian groups. Instead, their focus is on Mongoloid/Caucasoid/Australoid etc.

Also, the SNP overlap with other datasets is really small. Therefore, this reference 3 admixture run was done using only 5,400 SNPs. I recommend a big bucket of salt when interpreting these results.

Here is the spreadsheet with the Pan-Asian group averages for reference 3 admixture at K=11 ancestral components.

Xing Ref3 Admixture South Asians

As per AV's comment, here are the individual results for Xing et al South Asians.

Simonson Tibet Dataset

Recently, I discovered that the paper Genetic Evidence for High-Altitude Adaptation in Tibet by Tatum S. Simonson, Yingzhong Yang, Chad D. Huff, Haixia Yun, Ga Qin, David J. Witherspoon, Zhenzhong Bai, Felipe R. Lorenzo, Jinchuan Xing, Lynn B. Jorde, Josef T. Prchal, RiLi Ge has its genotyping data online.

It contains 31 Tibetans from Madou county in Qinghai province. The chip is Affymetrix and there are 868,146 SNPs, which means it has a good overlap with Reich et al and Xing et al and also with my reference 3.

I ran reference 3 K=11 admixture on this dataset. Here are the individual results:

The average is as follows:

S Asian   E Asian   Siberian
1%        84%       14%

Xing Ref3 K=11 Admixture

The Xing et al dataset is interesting because it has a number of South Asian populations:

  • 25 Andhra Pradesh Brahmin
  • 10 Andhra Pradesh Madiga
  • 11 Andhra Pradesh Mala
  • 22 Irula
  • 25 Nepalese
  • 25 Punjabi Arain
  • 14 Tamil Nadu Brahmin
  • 12 Tamil Nadu Dalit

Unfortunately, the dataset does not have a lot of SNPs in common with 23andMe, FTDNA and the other data I am using.

However, I did run a reference 3 admixture on Xing data using about 30,000 SNPs. Since this is far fewer than the usual 118,000 SNPs, the noise levels are much higher.

Here is the spreadsheet with the Xing group averages for reference 3 admixture at K=11 ancestral components.

Hodoglugil Dataset

Dr. Mahley was nice enough to share his Turkish and Kyrgyz dataset from the paper Turkish Population Structure and Genetic Ancestry Reveal Relatedness among Eurasian Populations by Uğur Hodoğlugil and Robert W. Mahley.

It has:

  • 16 Kyrgyz from Bishkek
  • 20 Turks from Aydin
  • 20 Turks from Istanbul
  • 23 Turks from Kayseri

Here are the group averages for the reference 3 K=11 admixture analysis.

And here are the individual results.

Relatives in Datasets

Recently, there was a paper Identification of Close Relatives in the HUGO Pan-Asian SNP Database by Xiong Yang, Shuhua Xu, and the HUGO Pan-Asian SNP Consortium.

"three individuals involved in MZ pairs were excluded from the whole dataset to construct standardized subset PASNP1716; seventy-six individuals involved in first-degree relationships were excluded from PASNP1716 to construct standardized subset PASNP1640; and 57 individuals involved in second-degree relationships were excluded from PASNP1640 to construct standardized subset PASNP1583. The individuals excluded were summarized in Table S6, S7, S8."

Let me engage in some blog triumphalism by saying I wrote about the duplicates and relatives in the Pan-Asian dataset in April 2011.

Here are my blog posts about relatedness in datasets:

Early on, I was removing only first-degree relatives from the reference datasets. Nowadays, I try to remove all second-degree relatives too. I leave the third-degree relatives in the data since it's sometimes hard to figure out how real the low IBD values are in Plink. There are a lot of third-degree relatives if Plink is to be believed, but I am a little skeptical.

Since Plink's IBD analysis requires homogeneous samples, I am now using KING (paper) for the purpose. I am also looking at kcoeff (paper).
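As a rough illustration of the filtering described above, here is a sketch that classifies pairs by kinship coefficient and picks samples to drop so that no pair up to second degree remains. The cutoffs (0.354, 0.177, 0.0884, 0.0442) are the inference ranges given in the KING documentation. The greedy removal rule, the function names and the sample pairs are assumptions made up for this example, not the procedure actually used for the reference datasets.

```python
from collections import Counter


def relationship_degree(kinship):
    """Map a kinship coefficient to a degree using KING's inference ranges.

    Returns 0 for duplicates/MZ twins, 1, 2 or 3 for first- to third-degree
    relatives, and None for pairs treated as unrelated.
    """
    if kinship > 0.354:
        return 0
    if kinship > 0.177:
        return 1
    if kinship > 0.0884:
        return 2
    if kinship > 0.0442:
        return 3
    return None


def samples_to_drop(pairs, max_degree=2):
    """Greedily choose samples to remove until no flagged pair is left.

    pairs is an iterable of (sample_a, sample_b, kinship). A pair is
    flagged when its inferred degree is at most max_degree. At each step
    the sample involved in the most flagged pairs is dropped.
    """
    flagged = set()
    for a, b, kinship in pairs:
        degree = relationship_degree(kinship)
        if degree is not None and degree <= max_degree:
            flagged.add((a, b))

    dropped = set()
    while flagged:
        counts = Counter()
        for a, b in flagged:
            counts[a] += 1
            counts[b] += 1
        # Ties are broken alphabetically so the result is deterministic.
        worst = max(sorted(counts), key=counts.get)
        dropped.add(worst)
        flagged = {pair for pair in flagged if worst not in pair}
    return dropped


# Hypothetical pairs; sample IDs and kinship values are invented.
example_pairs = [
    ("S1", "S2", 0.25),  # first degree
    ("S2", "S3", 0.12),  # second degree
    ("S4", "S5", 0.06),  # third degree, left in by default
]

print(samples_to_drop(example_pairs))
```

With `max_degree=2` the third-degree pair stays in, which matches the policy of leaving third-degree relatives in the data.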
