Monthly Archives: January 2012

ChromoPainter/fineSTRUCTURE South Asians

You have probably heard of ChromoPainter/fineSTRUCTURE by now (Eurogenes, Dienekes, MDLP and Razib).

So I decided to run the South Asian sample data, on which I had earlier done a PCA/MClust analysis, through ChromoPainter and fineSTRUCTURE.

Here is the coancestry matrix among the 715 participants visualized as a heat map.

UPDATE: Here's a huge image showing the same.

fineSTRUCTURE can use this coancestry matrix to classify individuals into clusters, 52 in this case (compared to 38 using PCA and MClust). You can check the cluster assignments in a spreadsheet.

Note that I have named the clusters. That's just a shorthand so we don't have to refer to them by cluster number. I labeled each cluster with the population that has the largest number of individuals in it.
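The labeling rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `label_clusters` and the toy data are made up, and how ties were actually broken in the spreadsheet is not stated, so the tie-breaking here (first population seen wins) is an assumption.

```python
from collections import Counter

def label_clusters(assignments):
    """Map each cluster id to the most common population among its members.

    `assignments` is an iterable of (cluster_id, population) pairs,
    one pair per individual. Ties go to the population seen first
    (an assumption, not necessarily what was done for the spreadsheet).
    """
    counts = {}
    for cluster_id, population in assignments:
        counts.setdefault(cluster_id, Counter())[population] += 1
    return {cid: c.most_common(1)[0][0] for cid, c in counts.items()}

# Hypothetical toy data, not the project's actual cluster assignments.
example = [
    (1, "Population A"), (1, "Population A"), (1, "Population B"),
    (2, "Population C"), (2, "Population B"), (2, "Population C"),
]
labels = label_clusters(example)
```

A label chosen this way says nothing about the other members of the cluster, so a cluster named after one population can still contain individuals from several others.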

Here's the cluster-level coancestry heat map.

And the pairwise coincidence:

And finally PCA plots for the first 10 dimensions from fineSTRUCTURE.

UPDATE (Feb 9, 2012): New PCA plots with better markers for the clusters.


Project Anniversary

It has been one year for the Harappa Ancestry Project. I announced it on January 17, 2011 and then moved it to its own domain on January 19.

It started out fast and furious, with participants sending their data every day and me blogging about it multiple times a day. Now it has slowed down quite a bit, with only one unrelated South Asian participant (not counting any Romany) in the last two months.

Speaking of participants, there have been a couple of complaints.

The decision to include non-South Asian participants from countries that do not neighbour the Subcontinent contradicts the Harappa Project's original inclusion criterion.

I concur with DMXX. While I'm certainly not anyone to tell Zack how to run his own show, I think the project is losing its original focus. Accepting folks from West-Asia was also fine, given that South-Asians derive a lot of ancient ancestry from the area and thus may be deemed a secondary focus area. The same could also be said for those with partial Roma Gypsy ancestry. But, some of the runs seem to be almost entirely dominated by non-South Asian participants, who have absolutely no connection with the subcontinent. I can't help but ask on what basis these Brazilian, Belizean, Mexican, Hispanic, Somali, African-American and European participants were accepted into the project.

My approach has been to make it clear to any potential participants that my focus is on South Asia. Thus, any Admixture components I have computed have used a heavily South Asian dataset. Also, a number of the PCA and clustering analyses that I do are at times limited to South Asians. On the other hand, I have accepted any participant who has asked to be included. I do a basic Admixture run for everyone, which is a fairly automated process, and sometimes include them in other analyses.

Let me illustrate with an example. I am working on a new Admixture calculator. For computing the components, I am using all the South Asian project participants in addition to reference datasets. I am going to select the data in such a way that we get Admixture components which give us a better idea of South Asian genetic ancestry.

While participation in the project has slowed down, the research on South Asian genetics has picked up. We had the Metspalu et al. paper and dataset. Also, the 1000 Genomes Project is expected to release 400 South Asian samples (100 Lahori Punjabis, 100 Bangladeshis, 100 Sri Lankan Tamils and 100 Indian Telugus) over the summer.


Admixture (Ref3 K=11) HRP0201-HRP0210

Here are the admixture results using Reference 3 for Harappa participants HRP0201 to HRP0210.

You can see the participant results in a spreadsheet as well as their ethnic breakdowns and the reference population results.

Here's our bar chart and table. Remember you can click on the legend or the table headers to sort.

If the above interactive charts are not working, here's a static bar graph.

Do note that small percentages for your results can be noise.


My Phased Genome

I released my 23andme raw data last year. Since I had my parents also genotyped, I am now releasing my phased genome.

I haven't done anything with it yet. So if you have any ideas of what use I can put it to, chime in.
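For readers wondering how parental genotypes make phasing possible, here is a minimal sketch of Mendelian trio phasing at a single SNP. It is not necessarily the method or tool used to produce the released file; the function name `phase_site` and the two-character genotype strings are assumptions for illustration, and no-calls are not handled.

```python
def phase_site(child, father, mother):
    """Return (paternal_allele, maternal_allele) for one SNP, or None.

    Each argument is a 2-character unphased genotype such as "AG".
    None means the site cannot be phased from the trio alone: either it
    is ambiguous (for example, all three are the same heterozygote) or
    the genotypes are inconsistent with Mendelian inheritance.
    """
    a, b = child
    # Try both ways of assigning the child's alleles to the two parents
    # and keep the assignments each parent could actually have passed on.
    candidates = {(p, m) for p, m in ((a, b), (b, a))
                  if p in father and m in mother}
    if len(candidates) == 1:
        return candidates.pop()
    return None

# Hypothetical genotypes for illustration only.
resolved = phase_site("AG", "AA", "AG")   # father can only give A
ambiguous = phase_site("AG", "AG", "AG")  # both assignments are possible
```

Sites where the child and both parents are heterozygous for the same alleles stay unresolved under this rule, which is why trio-phased data is usually complete at most sites but not all of them.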


South Asian PCA 3D Plot

Here's a 3-D plot of my South Asian PCA run, showing the first three principal components.

The principal components have been scaled according to their respective eigenvalues. The plot is rotating about the vertical 1st eigenvector.
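As a rough sketch of what scaling by eigenvalues could look like, the snippet below multiplies each principal-component column by its eigenvalue relative to the first. Whether the plot used the eigenvalues themselves or their square roots is not stated, so `use_sqrt` is offered as an option; the function name and the numbers are hypothetical.

```python
import math

def scale_components(coords, eigenvalues, use_sqrt=False):
    """Scale each principal-component column relative to the first one.

    `coords` is a list of rows, one per individual, with one value per
    component. With use_sqrt=True the square roots of the eigenvalues
    are used as weights instead of the eigenvalues themselves.
    """
    weights = [math.sqrt(ev) if use_sqrt else ev for ev in eigenvalues]
    relative = [w / weights[0] for w in weights]
    return [[x * r for x, r in zip(row, relative)] for row in coords]

# Hypothetical coordinates and eigenvalues, not the project's values.
coords = [[0.10, 0.20, -0.30], [-0.05, 0.40, 0.10]]
eigenvalues = [4.0, 2.0, 1.0]
scaled = scale_components(coords, eigenvalues)
```

Either way, the effect is that distances along the later, weaker components shrink relative to the first, so the picture is not dominated by axes that explain little variation.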

You can find out your position on the plot by using the dropdown below the plot and selecting your Harappa ID.


South Asian PCA Plots

I did a South Asian PCA + Mclust analysis last month. Here are the PCA plots from that analysis.

First, the eigenvectors are not scaled to the eigenvalues in the plots. So here's a table showing how much of the variation each eigenvector explains.

Eigenvector Percentage variation explained
1 1.134%
2 0.452%
3 0.351%
4 0.263%
5 0.254%
6 0.236%
7 0.228%
8 0.224%
9 0.215%
10 0.209%
11 0.207%
12 0.205%
13 0.203%
14 0.201%
15 0.198%
16 0.194%
17 0.191%
18 0.189%
19 0.189%
20 0.188%
21 0.188%
22 0.187%
23 0.186%
24 0.185%
25 0.184%
26 0.184%
27 0.183%
28 0.182%
29 0.180%
30 0.180%
31 0.179%
32 0.179%

Eigenvector 1 looks like the Indian cline, but it's actually a West-East Eurasian cline. It's quite similar to Reich et al.'s Indian cline for their subset of populations (the correlation between PC1 and ASI is 0.998869). However, since East Asian ancestry is not separated out here due to the lack of any East Asian samples, we get a mix of East Asian and Ancestral South Indian ancestry towards the right of the plot.

Eigenvector 2 separates Kalash from everyone else.
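A correlation such as the one quoted for eigenvector 1 can be computed along these lines. This sketch assumes a plain Pearson correlation between per-population PC1 values and ASI proportions; the type of correlation is not stated in the post, and the numbers below are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 long")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-population values, not the actual PC1 or ASI figures.
pc1 = [-0.04, -0.01, 0.02, 0.05]
asi = [0.30, 0.42, 0.55, 0.66]
r = pearson(pc1, asi)
```

A value close to 1 only says the two quantities rank and space the populations almost identically; it does not by itself show that they measure the same ancestry.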


Metspalu Ref3 Admixture Individual Results

I ran supervised admixture on the Metspalu et al dataset using my reference 3 data. AV asked for individual results, so here they are.

Here's the spreadsheet for Metspalu individual admixture results. You can compare with the reference 3 results.

Here's our bar chart. Remember you can click on the legend or the table headers to sort.


Metspalu Dataset Update

Dr. Metspalu, who has been very good about sharing data and information, has informed me about a couple of cases of mislabeling in the Metspalu et al dataset.

Our sample labelled D238 and reported as Tharu is in fact a Brahmin sample from Uttar Pradesh.

Following the publication we have identified that sample evo_32 was erroneously labelled as Kanjar before any genetic analyses. We hereby re-label the sample as belonging to Kol population.

Thus, I have updated the Metspalu admixture results and clustering results.


Happy New Year

Hope y'all have a great 2012! And that we see another fruitful year for personal genetics.

I have been away over the holidays driving around the country and am just getting back to try to catch up with my email. I owe several people IDs which I'll send out by tomorrow.
