Participation Changes

Now that DIY HarappaWorld is out, I am changing the participation requirements a little: South Asians now have somewhat different requirements from people from other regions.

If you have any real ancestry from a South Asian origin, you are eligible to participate. Partial South Asian ancestry is okay. The countries of origin I count as South Asian are as follows:

  • Afghanistan
  • Bangladesh
  • Bhutan
  • India
  • Maldives
  • Nepal
  • Pakistan
  • Sri Lanka

Note that a 2-3% South Asian result from Dr. McDonald's BGA or the Dodecad Project does not count as South Asian ancestry.

If you have all four of your grandparents from one of the following countries or regions, you can also send me your data.

  • Burma
  • Tibet
  • Uyghur from Xinjiang, China
  • Tajikistan
  • Kyrgyzstan
  • Kazakhstan
  • Uzbekistan
  • Turkmenistan
  • Iran
  • Turkey
  • Azerbaijan
  • Armenia
  • Georgia
  • North Caucasian Federal District, Russia
  • Iraq
  • Syria
  • Lebanon
  • Jordan

Relatives will only be accepted when they are a better replacement for current participants. For example, replacing a participant with his or her parents, or with a maternal uncle and a paternal aunt, gets us two unrelated participants (assuming, of course, that the two sides of the family are not related by blood). Another example is a participant of partial South Asian ancestry being replaced by a relative who has more South Asian ancestry.

Everyone else can use DIY HarappaWorld. It's fairly easy to use on both Windows and Linux. The only hard part right now is that you have to install R to standardize your genome file. I might look into creating an executable for that to make it easier.

Finally, please be honest.

South Asian fineStructure Ref3 Admixture

I was wondering what the admixture patterns of the clusters fineSTRUCTURE computed were for my South Asian run. So I computed the average admixture for each cluster (total: 89) using reference 3 admixture results.
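
A per-cluster average like this can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample IDs, cluster labels and two-component proportions below are hypothetical toy data, and the function name `mean_admixture_by_cluster` is made up for this example, not taken from the actual analysis.

```python
from collections import defaultdict


def mean_admixture_by_cluster(cluster_of, admixture):
    """Average each admixture component over the members of each cluster.

    cluster_of: dict mapping sample ID -> cluster label
    admixture:  dict mapping sample ID -> list of component proportions
    """
    sums = {}
    counts = defaultdict(int)
    for sample, cluster in cluster_of.items():
        props = admixture[sample]
        if cluster not in sums:
            sums[cluster] = [0.0] * len(props)
        for i, p in enumerate(props):
            sums[cluster][i] += p
        counts[cluster] += 1
    # Divide the per-component sums by the cluster size.
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


# Hypothetical toy data: two clusters, two admixture components.
cluster_of = {"S1": "A", "S2": "A", "S3": "B"}
admixture = {"S1": [0.6, 0.4], "S2": [0.8, 0.2], "S3": [0.1, 0.9]}
means = mean_admixture_by_cluster(cluster_of, admixture)
```

Here `means["A"]` holds the average of S1 and S2, and `means["B"]` is just S3's own proportions because that cluster has a single member.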

The default order of the clusters is to keep the closer clusters together.

Dense South Asian ChromoPainter

I had run ChromoPainter/fineSTRUCTURE for 715 South Asians using only about 90,000 SNPs. I thought it would be a useful exercise to use more SNPs, which meant dropping the Reich et al dataset. That left me with 615 individuals and 418,854 SNPs.

The "chunkcounts" file has the donors in columns and recipients in rows. Here's a heat map of the same.

fineSTRUCTURE classified these 615 individuals into 89 clusters. I have named these clusters for convenience; however, the names do not imply that everyone in, say, the Punjab cluster is Punjabi.

I put the cluster tree at the top of the spreadsheet, but here's how the clusters are related.

The most interesting thing is how Gujarati A (likely Patels) are an out-group to everyone else. Another major grouping is that of the Baloch, Brahui and Makrani, along with 4 Sindhis (might they be from one of the Baloch tribes of Sindh?).

The Punjabis, Sindhis and Pathan get better classification here than they did last time.

The Punjab cluster includes 3 Gujarati B, 4 Pathans, 2 Singapore Indians, Punjabis, Haryanvis, Kashmiris, and a Rajasthani Brahmin. Even using this method, HRP0036, who is half-Sri Lankan and half-German/Polish, was classified in the same cluster.

The Dharkar and Kanjar could not be separated at all here. According to Metspalu:

There are three second degree relatives groups in our sample: ..snip.. [Kanjar evo_37 and Dharkar HA023]. Again the last pair needs further explanation. The Dharkar and Kanjar practice a nomadic lifestyle and were living side by side at the time of sampling. As the ethnic border between the two is permeable we cannot rule out neither our error during sample collection and/or subsequent labelling nor shifted self-identity.

The inter-cluster heat map:

And you can see the chunkcounts donated from each cluster to recipient individuals in a spreadsheet.
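
For anyone who wants to build a similar cluster-to-individual table from a chunkcounts matrix, here is a rough Python sketch. It assumes the matrix has already been loaded into a dict of rows (recipients) aligned with a list of donor IDs, and that the per-cluster value is a plain sum over donors; both the data layout and the choice of summing rather than averaging are assumptions for illustration, and the IDs are hypothetical.

```python
def chunkcounts_by_donor_cluster(chunkcounts, donor_ids, cluster_of):
    """Sum the chunk counts each recipient receives from each donor cluster.

    chunkcounts: dict mapping recipient ID -> list of counts, one per donor,
                 in the same order as donor_ids (donors in columns)
    donor_ids:   list of donor IDs labelling the columns
    cluster_of:  dict mapping donor ID -> cluster label
    """
    result = {}
    for recipient, row in chunkcounts.items():
        totals = {}
        for donor, count in zip(donor_ids, row):
            cluster = cluster_of[donor]
            totals[cluster] = totals.get(cluster, 0.0) + count
        result[recipient] = totals
    return result


# Hypothetical toy data: three donors in two clusters, one recipient.
donor_ids = ["D1", "D2", "D3"]
cluster_of = {"D1": "X", "D2": "X", "D3": "Y"}
chunkcounts = {"R1": [1.0, 2.0, 4.0]}
per_cluster = chunkcounts_by_donor_cluster(chunkcounts, donor_ids, cluster_of)
```

In this toy case recipient R1 receives 3.0 chunks from cluster X and 4.0 from cluster Y.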

The pairwise coincidence:

And the PCA plots:

ChromoPainter/fineStructure South Asians

You have probably heard of ChromoPainter/fineSTRUCTURE by now (Eurogenes, Dienekes, MDLP and Razib).

So I decided to run the South Asian samples, on which I had earlier done PCA/MClust, through ChromoPainter and fineSTRUCTURE.

Here is the coancestry matrix among the 715 participants visualized as a heat map.

UPDATE: Here's a huge image showing the same.

fineSTRUCTURE can use this coancestry matrix to classify individuals into clusters, 52 in this case (compared to 38 using PCA and MClust). You can check the cluster assignments in a spreadsheet.

Note that I have named the clusters. That's just a shorthand so we don't have to refer to them by cluster number. Instead I used the population with the largest number of individuals in a cluster to label that cluster.
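
The labelling rule can be sketched roughly as follows in Python. The sample IDs and the cluster and population assignments are hypothetical, and ties are broken arbitrarily here (by whichever population is seen first), which may differ from how the actual labels were chosen.

```python
from collections import Counter


def label_clusters(cluster_of, population_of):
    """Label each cluster with its most frequent population.

    cluster_of:    dict mapping sample ID -> cluster number or label
    population_of: dict mapping sample ID -> population name
    """
    members = {}
    for sample, cluster in cluster_of.items():
        members.setdefault(cluster, Counter())[population_of[sample]] += 1
    # most_common(1) returns the population with the highest count;
    # on a tie it returns the one that was counted first.
    return {c: counts.most_common(1)[0][0] for c, counts in members.items()}


# Hypothetical toy data.
cluster_of = {"S1": "CL1", "S2": "CL1", "S3": "CL1", "S4": "CL2"}
population_of = {"S1": "Punjabi", "S2": "Punjabi", "S3": "Pathan", "S4": "Brahui"}
labels = label_clusters(cluster_of, population_of)
```

With this toy input CL1 is labelled "Punjabi" even though it also contains a Pathan, which is exactly why the label is only a shorthand.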

Here's the cluster-level coancestry heat map.

And the pairwise coincidence:

And finally PCA plots for the first 10 dimensions from fineSTRUCTURE.

UPDATE (Feb 9, 2012): New PCA plots with better markers for the clusters.

South Asian PCA 3D Plot

Here's a 3-D plot of my South Asian PCA run, showing the first three principal components.

The principal components have been scaled according to their respective eigenvalues. The plot is rotating about the vertical 1st eigenvector.

You can find out your position on the plot by using the dropdown below the plot and selecting your Harappa ID.

South Asian PCA Plots

I did a South Asian PCA + Mclust analysis last month. Here are the PCA plots from that analysis.

First, the eigenvectors are not scaled to the eigenvalues in the plots. So here's a table explaining how much each eigenvector is worth.

Eigenvector Percentage variation explained
1 1.134%
2 0.452%
3 0.351%
4 0.263%
5 0.254%
6 0.236%
7 0.228%
8 0.224%
9 0.215%
10 0.209%
11 0.207%
12 0.205%
13 0.203%
14 0.201%
15 0.198%
16 0.194%
17 0.191%
18 0.189%
19 0.189%
20 0.188%
21 0.188%
22 0.187%
23 0.186%
24 0.185%
25 0.184%
26 0.184%
27 0.183%
28 0.182%
29 0.180%
30 0.180%
31 0.179%
32 0.179%
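
Percentages like those in the table are usually obtained by dividing each eigenvalue by the sum of all eigenvalues. The Python sketch below shows that calculation on made-up eigenvalues; it assumes the denominator is the sum over every eigenvalue from the PCA (not just the ones listed), which is a common convention but not confirmed as the one used for the table.

```python
def percent_variation_explained(eigenvalues):
    """Return each eigenvalue as a percentage of the total of all eigenvalues."""
    total = sum(eigenvalues)
    return [100.0 * ev / total for ev in eigenvalues]


# Hypothetical eigenvalues, chosen only to make the arithmetic easy to follow.
toy_eigenvalues = [4.0, 3.0, 2.0, 1.0]
percentages = percent_variation_explained(toy_eigenvalues)
```

For the toy values the total is 10, so the four eigenvectors explain 40%, 30%, 20% and 10% of the variation.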

Eigenvector 1 looks like the Indian cline, but it's actually a West-East Eurasian cline. It's quite similar to Reich et al's Indian cline for their subset of populations (the correlation between pc1 and ASI is 0.998869). However, since East Asian is not separated out here due to the lack of any East Asian samples, we get a mix of East Asian and Ancestral South Indian towards the right of the plot.

Eigenvector 2 separates Kalash from everyone else.

South Asian PCA + Mclust

I combined reference 3 with Metspalu et al data and Harappa Ancestry Project participants (up to HRP0200). Then I kept only those individuals whose combined proportion of South Asian and Onge components on my reference 3 admixture results was more than 50%.
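
The selection step can be sketched in Python like this. The component names used as dictionary keys, the sample IDs and the proportions are all hypothetical stand-ins; the only thing taken from the description above is the rule that the South Asian and Onge proportions together must exceed 50%.

```python
def select_south_asian(admixture, components=("South Asian", "Onge"), threshold=0.5):
    """Keep samples whose combined proportion of the given components exceeds threshold.

    admixture: dict mapping sample ID -> dict of component name -> proportion
    """
    return [
        sample
        for sample, props in admixture.items()
        if sum(props.get(c, 0.0) for c in components) > threshold
    ]


# Hypothetical toy admixture results.
toy_admixture = {
    "S1": {"South Asian": 0.45, "Onge": 0.10, "European": 0.45},
    "S2": {"South Asian": 0.30, "Onge": 0.05, "European": 0.65},
    "S3": {"South Asian": 0.50, "Onge": 0.00, "European": 0.50},
}
kept = select_south_asian(toy_admixture)
```

Only S1 passes in this toy example: S2 falls well short, and S3 sits exactly at 50%, which is not "more than 50%".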

I ran PCA on these South Asian samples and kept 31 dimensions. Running Mclust on the PCA results gave me 37 clusters.

The clustering results are in a spreadsheet.

For an individual, the value under a specific cluster shows the probability of that person belonging to that cluster. For example, HRP0152 has a 58% probability of belonging to cluster CL8 and 42% probability of being in cluster CL14.

For the populations in the first sheet, I added up the probabilities of all the samples in that population to get the expected number of individuals of that ethnicity belonging to a specific cluster.
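
As a small worked example of that summing, here is a Python sketch. HRP0152's probabilities are the ones quoted above; the second sample "HRPX" and the population name "PopA" are hypothetical and exist only to show the addition.

```python
def expected_cluster_counts(membership, population_of):
    """Sum cluster membership probabilities over the samples of each population.

    membership:    dict mapping sample ID -> dict of cluster -> probability
    population_of: dict mapping sample ID -> population name
    """
    expected = {}
    for sample, probs in membership.items():
        pop_totals = expected.setdefault(population_of[sample], {})
        for cluster, p in probs.items():
            pop_totals[cluster] = pop_totals.get(cluster, 0.0) + p
    return expected


membership = {
    "HRP0152": {"CL8": 0.58, "CL14": 0.42},  # probabilities quoted in the post
    "HRPX": {"CL8": 1.0},                    # hypothetical second sample
}
population_of = {"HRP0152": "PopA", "HRPX": "PopA"}  # hypothetical population
expected = expected_cluster_counts(membership, population_of)
```

For the toy population this gives an expected 1.58 individuals in CL8 and 0.42 in CL14, which still add up to the 2 samples in the population.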

In the second sheet, I have listed all the individual samples' clustering results.

There are some outliers who didn't belong in any cluster: HRP0001 (me, of course), 7 (out of 18) Makranis, 4 (out of 23) Sindhis, 3 (all) Great Andamanese, 1 (out of 20) Balochi, 1 (out of 4) Madiga, and 1 (only) Onge.

Shared and Unique Components of Human Population Structure and Genome-Wide Signals of Positive Selection in South Asia

Metspalu et al have a new paper in American Journal of Human Genetics about South Asian genetics. Here's the abstract:

South Asia harbors one of the highest levels genetic diversity in Eurasia, which could be interpreted as a result of its long-term large effective population size and of admixture during its complex demographic history. In contrast to Pakistani populations, populations of Indian origin have been underrepresented in previous genomic scans of positive selection and population structure. Here we report data for more than 600,000 SNP markers genotyped in 142 samples from 30 ethnic groups in India. Combining our results with other available genome-wide data, we show that Indian populations are characterized by two major ancestry components, one of which is spread at comparable frequency and haplotype diversity in populations of South and West Asia and the Caucasus. The second component is more restricted to South Asia and accounts for more than 50% of the ancestry in Indian populations. Haplotype diversity associated with these South Asian ancestry components is significantly higher than that of the components dominating the West Eurasian ancestry palette. Modeling of the observed haplotype diversities suggests that both Indian ancestry components are older than the purported Indo-Aryan invasion 3,500 YBP. Consistent with the results of pairwise genetic distances among world regions, Indians share more ancestry signals with West than with East Eurasians. However, compared to Pakistani populations, a higher proportion of their genes show regionally specific signals of high haplotype homozygosity. Among such candidates of positive selection in India are MSTN and DOK5, both of which have potential implications in lipid metabolism and the etiology of type 2 diabetes.

I'll have some comments later today.

Indian Cline III

I have been working on creating 100% ASI (Ancestral South Indian) samples recently. So it was really interesting that Dienekes did similar experiments.

I am going about creating the "pure" allele frequencies somewhat differently, so that would be a useful exercise.

Anyway, I thought you guys would be itching for some new results. So here's a PCA plot:

This used the same Principal Component Analysis as the one here using the 96 Indian Cline samples, Utahn Whites and Onge. However, I projected three extra "populations" on this plot.

These three populations consist of simulated genetic data for 25 individuals, generated using the allele frequencies from the Reference 3 Admixture results.

  1. Onge11 is generated from the Onge (C2) component from K=11 admixture for Reference 3.
  2. SA11 is generated from the South Asian (C1) component from the same K=11 admixture.
  3. SA12 is generated from the South Asian (C1) component from the K=12 admixture.
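
One simple way to generate individuals from a component's allele frequencies is to draw two alleles per SNP independently, which is the same as assuming Hardy-Weinberg proportions and no linkage between SNPs. The Python sketch below does that; it is a guess at the general approach, not the actual procedure used for Onge11, SA11 and SA12, and the frequencies are made up.

```python
import random


def simulate_population(allele_freqs, n_individuals=25, seed=0):
    """Simulate genotypes (0, 1 or 2 copies of an allele) for each individual.

    allele_freqs: list of allele frequencies, one per SNP, each in [0, 1]
    Each SNP is drawn independently as two Bernoulli trials, i.e. this
    assumes Hardy-Weinberg proportions and ignores linkage between SNPs.
    """
    rng = random.Random(seed)
    population = []
    for _ in range(n_individuals):
        # Each comparison is True with probability p; adding two gives 0, 1 or 2.
        genotype = [(rng.random() < p) + (rng.random() < p) for p in allele_freqs]
        population.append(genotype)
    return population


# Hypothetical allele frequencies for five SNPs.
toy_freqs = [0.0, 0.2, 0.5, 0.8, 1.0]
simulated = simulate_population(toy_freqs)
```

Because SNPs are drawn independently, the simulated individuals carry no haplotype structure, so a sketch like this suits allele-frequency methods such as PCA or ADMIXTURE but not haplotype-based ones.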

As you can see, the SA12 population lies between 100% ASI and the Indian Cline samples.

The Onge11 generated samples are a bit beyond 100% ASI on the first principal component, but they are also shifted towards the real Onge on pc2.

Misuse of Correlation

I have been misusing correlation in computing Ancestral South Indian percentages from PCA/ADMIXTURE and Reich et al population-level averages.

I have tried to make it clear that just looking at the correlation is not enough: an admixture component is not similar to ASI just because it correlates well with Reich et al's ASI averages for the 18 Indian cline populations, even when the correlation is higher than 0.99. To illustrate what I mean, let's look at the Ref4C admixture runs.

I calculated the mean for each admixture component from the K=2 to K=12 runs for the 18 Indian cline populations and then computed the correlation between that and the Reich et al results. Let's take a look:

K    Component         Correlation
2    C1 Euro-Afro      -0.9941887
3    C2 East Asian      0.9955347
4    C3 European       -0.993933
5    C3 European       -0.993277
6    C1 South Asian     0.9675099
7    C1 South Asian     0.993081
8    C1 South Asian     0.9932762
9    C1 South Asian     0.9914145
10   C1 South Asian     0.9918095
11   C1 South Asian     0.9919097
12   C1 South Asian     0.9918594

Where do you see the highest correlation? At K=3 ancestral populations, the East Asian component is very highly correlated with ASI for the Indian cline populations. Does that mean that we could use that to compute ASI? No, not at all. While it is expected that at K=3, ASI would be a little closer to East Asian than to European, East Asian is not a good proxy for ASI at all since we cannot extrapolate to other individuals and populations.
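
Part of the reason a high correlation proves so little is that Pearson correlation ignores scale and offset: any component that rises and falls linearly with ASI correlates perfectly with it, even if its actual values are nowhere near the ASI percentages. The Python sketch below shows this with made-up numbers; the ASI values are hypothetical, not Reich et al's estimates.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ASI percentages for five populations.
asi = [40.0, 45.0, 50.0, 55.0, 60.0]
# Hypothetical component that tracks ASI linearly but on a very different scale.
component = [0.1 * a + 2.0 for a in asi]
r = pearson(component, asi)
```

Here `r` comes out at essentially 1, yet the component values run from 6 to 8 while the ASI values run from 40 to 60, so reading the component as an ASI percentage would be badly wrong.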
