Tag Archives: south asia - Page 2

South Asian PCA + Mclust

I combined reference 3 with Metspalu et al data and Harappa Ancestry Project participants (up to HRP0200). Then I kept only those individuals whose combined proportion of South Asian and Onge components on my reference 3 admixture results was more than 50%.
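As a minimal sketch of that filter (the sample IDs, component names and numbers below are made up for illustration, and proportions are assumed to be stored as fractions):

```python
# Hypothetical admixture proportions per individual (fractions summing to 1).
# IDs and values are invented; only the 50% threshold comes from the post.
admixture = {
    "sample_A": {"South Asian": 0.48, "Onge": 0.07, "European": 0.45},
    "sample_B": {"South Asian": 0.30, "Onge": 0.02, "European": 0.68},
    "sample_C": {"South Asian": 0.62, "Onge": 0.15, "European": 0.23},
}


def is_south_asian(props, threshold=0.5):
    """Keep a sample if its South Asian + Onge share exceeds the threshold."""
    return props.get("South Asian", 0.0) + props.get("Onge", 0.0) > threshold


kept = sorted(s for s, p in admixture.items() if is_south_asian(p))
print(kept)
```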

I ran PCA on these South Asian samples and kept 31 dimensions. Running Mclust on the PCA results gave me 37 clusters.

The clustering results are in a spreadsheet.

For an individual, the value under a specific cluster shows the probability of that person belonging to that cluster. For example, HRP0152 has a 58% probability of belonging to cluster CL8 and 42% probability of being in cluster CL14.

For the populations in the first sheet, I added up the probabilities of all the samples in that population to get the expected number of individuals of that ethnicity belonging to a specific cluster.

In the second sheet, I have listed all the individual samples' clustering results.
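Here is a rough sketch of how the population sheet can be derived from the individual sheet. Only HRP0152's 58%/42% split comes from the results; the population labels and the other samples are hypothetical.

```python
from collections import defaultdict

# (sample, population, {cluster: membership probability}); each row sums to 1.
# "PopX" and "PopY" are placeholder labels, and s2/s3 are invented samples.
memberships = [
    ("HRP0152", "PopX", {"CL8": 0.58, "CL14": 0.42}),
    ("s2", "PopX", {"CL8": 1.0}),
    ("s3", "PopY", {"CL14": 0.9, "CL8": 0.1}),
]


def expected_counts(rows):
    """Sum membership probabilities per population and cluster."""
    totals = defaultdict(lambda: defaultdict(float))
    for _sample, population, probs in rows:
        for cluster, p in probs.items():
            totals[population][cluster] += p
    return {pop: dict(clusters) for pop, clusters in totals.items()}


counts = expected_counts(memberships)
print(counts)
```

Each population's row adds up to its number of samples, which is why the values read as expected head counts.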

There are some outliers who didn't belong in any cluster: HRP0001 (me, of course), 7 (out of 18) Makranis, 4 (out of 23) Sindhis, 3 (all) Great Andamanese, 1 (out of 20) Balochi, 1 (out of 4) Madiga, and 1 (only) Onge.

Shared and Unique Components of Human Population Structure and Genome-Wide Signals of Positive Selection in South Asia

Metspalu et al have a new paper in American Journal of Human Genetics about South Asian genetics. Here's the abstract:

South Asia harbors one of the highest levels of genetic diversity in Eurasia, which could be interpreted as a result of its long-term large effective population size and of admixture during its complex demographic history. In contrast to Pakistani populations, populations of Indian origin have been underrepresented in previous genomic scans of positive selection and population structure. Here we report data for more than 600,000 SNP markers genotyped in 142 samples from 30 ethnic groups in India. Combining our results with other available genome-wide data, we show that Indian populations are characterized by two major ancestry components, one of which is spread at comparable frequency and haplotype diversity in populations of South and West Asia and the Caucasus. The second component is more restricted to South Asia and accounts for more than 50% of the ancestry in Indian populations. Haplotype diversity associated with these South Asian ancestry components is significantly higher than that of the components dominating the West Eurasian ancestry palette. Modeling of the observed haplotype diversities suggests that both Indian ancestry components are older than the purported Indo-Aryan invasion 3,500 YBP. Consistent with the results of pairwise genetic distances among world regions, Indians share more ancestry signals with West than with East Eurasians. However, compared to Pakistani populations, a higher proportion of their genes show regionally specific signals of high haplotype homozygosity. Among such candidates of positive selection in India are MSTN and DOK5, both of which have potential implications in lipid metabolism and the etiology of type 2 diabetes.

I'll have some comments later today.

Indian Cline III

I have been working on creating 100% ASI (Ancestral South Indian) samples recently. So it was really interesting that Dienekes did similar experiments:

I am going about creating the "pure" allele frequencies somewhat differently, so that would be a useful exercise.

Anyway, I thought you guys would be itching for some new results. So here's a PCA plot:

This used the same Principal Component Analysis as the one here using the 96 Indian Cline samples, Utahn Whites and Onge. However, I projected three extra "populations" on this plot.

These three populations are simulated genetic data of 25 individuals using the allele frequencies from Reference 3 Admixture results.

  1. Onge11 is generated from the Onge (C2) component from K=11 admixture for Reference 3.
  2. SA11 is generated from the South Asian (C1) component from the same K=11 admixture.
  3. SA12 is generated from the South Asian (C1) component from the K=12 admixture.
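One simple way to generate such individuals is sketched below. It assumes each SNP is drawn independently (no linkage) with two alleles per person sampled at the component's allele frequency; this is an illustration under those assumptions, not necessarily the exact procedure used for the plot. The frequencies are placeholders.

```python
import random


def simulate_population(allele_freqs, n_individuals=25, seed=0):
    """Simulate diploid genotypes (0, 1 or 2 allele copies) for each SNP."""
    rng = random.Random(seed)
    population = []
    for _ in range(n_individuals):
        # Two independent allele draws per SNP at frequency p.
        genotype = [
            int(rng.random() < p) + int(rng.random() < p) for p in allele_freqs
        ]
        population.append(genotype)
    return population


# Placeholder allele frequencies standing in for one admixture component.
component_freqs = [0.0, 0.1, 0.5, 0.9, 1.0]
simulated = simulate_population(component_freqs)
print(len(simulated), simulated[0])
```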

As you can see, the SA12 population lies between 100% ASI and the Indian Cline samples.

The Onge11 generated samples are a bit beyond 100% ASI on the first principal component, but they are also shifted towards the real Onge on pc2.

Misuse of Correlation

I have been misusing correlation in computing Ancestral South Indian percentages from PCA/ADMIXTURE and Reich et al population-level averages.

I have tried to make it clear that just looking at the correlation is not enough: an admixture component is not similar to ASI just because it correlates well with Reich et al's ASI averages for the 18 Indian cline populations, even when the correlation is higher than 0.99. To illustrate what I mean, let's look at the Ref4C admixture runs.

I calculated the mean for each admixture component from the K=2 to K=12 runs for the 18 Indian cline populations and then computed the correlation between that and the Reich et al results. Let's take a look:

K    Component         Correlation
2    C1 Euro-Afro      -0.9941887
3    C2 East Asian      0.9955347
4    C3 European       -0.993933
5    C3 European       -0.993277
6    C1 South Asian     0.9675099
7    C1 South Asian     0.993081
8    C1 South Asian     0.9932762
9    C1 South Asian     0.9914145
10   C1 South Asian     0.9918095
11   C1 South Asian     0.9919097
12   C1 South Asian     0.9918594

Where do you see the highest correlation? At K=3 ancestral populations, the East Asian component is very highly correlated with ASI for the Indian cline populations. Does that mean that we could use that to compute ASI? No, not at all. While it is expected that at K=3, ASI would be a little closer to East Asian than to European, East Asian is not a good proxy for ASI at all since we cannot extrapolate to other individuals and populations.
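A toy example makes the point concrete. Pearson correlation ignores scale and offset, so a component can correlate perfectly with ASI while being nowhere near it in value. All numbers here are invented and are not Reich et al's estimates.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented population-level ASI fractions.
asi = [0.25, 0.35, 0.45, 0.55, 0.60]
# Hypothetical component: an exact linear function of ASI on a different scale.
component = [0.02 + 0.1 * a for a in asi]

r = pearson(component, asi)
max_gap = max(abs(c - a) for c, a in zip(component, asi))
print(r, max_gap)
```

The correlation is essentially 1, yet the component understates ASI by more than 50 percentage points for the last population, so it cannot be read as an ASI estimate without a calibration that may not hold off the cline.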

Indian Cline II

One thing I forgot in the post yesterday about the Indian cline was to try to extrapolate from the PCA results to 100% ANI (Ancestral North Indian) and 100% ASI (Ancestral South Indian).

This is a simple linear extrapolation which should be okay since PCA is linear.

The "N" denotes the extrapolated position of ANI and "S" denotes the ASI. The points to the left of "N" are all Utahn Whites while the Onge are on the bottom right of the graph.

As you can see, the ASI is about the same as Onge in terms of eigenvector 1 (which represents the Indian cline approximately), but ASI is far from Onge on the 2nd eigenvector. That is expected since the Onge have been separated from the mainland populations for a long time.

The more interesting thing is that the extrapolated position of ANI is a little to the right of all the Utahn Whites.

We'll need a similar analysis of the Indian cline with more populations to see which one the ANI is closest to.

PS. I should point out that I am using correlation between a limited number of population statistics to find a relationship between the 1st principal component and Reich et al's ASI estimate. This has a number of drawbacks. It would be much better to compute ASI directly.
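For anyone who wants to try the extrapolation, here is a rough sketch. It fits each principal component as a straight line against the population ASI averages and reads the line off at 0% ASI (i.e. 100% ANI) and 100% ASI. The numbers are invented, and regressing each PC on ASI (rather than ASI on the PC) is an assumption made for this sketch.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def extrapolate(asi, pcs, target):
    """Predicted coordinate on each PC at a given ASI fraction."""
    coords = []
    for pc in pcs:
        slope, intercept = fit_line(asi, pc)
        coords.append(intercept + slope * target)
    return tuple(coords)


# Invented population means: ASI fraction, then PC1 and PC2 coordinates.
asi = [0.30, 0.40, 0.50, 0.60]
pc1 = [-0.02, 0.00, 0.02, 0.04]
pc2 = [0.010, 0.012, 0.014, 0.016]

ani_position = extrapolate(asi, [pc1, pc2], target=0.0)  # 100% ANI
asi_position = extrapolate(asi, [pc1, pc2], target=1.0)  # 100% ASI
print(ani_position, asi_position)
```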

Indian Cline

I had used linear regression to estimate the Ancestral South Indian (ASI) component from the Reference 3 K=11 admixture run. Now here are a couple more exercises along the same lines but much simpler.

Just using the 96 Indian cline samples from Reich et al to compute PCA or admixture doesn't work as the Chenchu separate out in both analyses from the rest. So I added the Utahn White (CEU) samples from HapMap and the Onge from Reich et al.

First, I ran supervised admixture with two ancestral components, Utahn Whites and Onge. Here's the Onge component plotted against Reich et al's ASI estimate along with a linear regression estimate. The correlation between the two is 0.9908.

Second, I ran Principal Component Analysis (PCA) on the Indian cline samples plus Utahn Whites and Onge. Here are the first two PCA dimensions plotted. The first eigenvector explains 4.04% of the total variation and the 2nd explains 1.94%.

The first principal component is mostly along the Indian cline while the second one basically separates the Onge from everyone else.

Using the 1st principal component to estimate ASI, here's the plot with Reich et al's ASI estimate along with a regression line. The correlation between pc1 and ASI is 0.9929.

Note that both these methods work only if the samples are on the Indian cline, i.e., they don't have any other admixture.

And now for comparison, here's the linear regression for the Reference 3 K=11 admixture Onge component and ASI. The correlation here is 0.9949. Note that this is a little different than my previous analysis since I calculated the population averages using only the 96 samples recommended by Reich et al.

Here's a spreadsheet containing the data for these three runs.

I have a couple more tricks to try for figuring out some things regarding Ancestral South Indian admixture. Let's hope they provide us some insight.

East Asian Admixture

Let's look at the East Asian admixture among South Asians and other surrounding populations from a previous admixture run (K=12).

I have listed the different kinds of East Asian admixture components among selected populations. The three relevant components are:

  1. Southeast Asian: Highest among the Dai, Cambodians, Lahu and Malay, this is the most common East Asian component among South Asians.
  2. Northeast Asian: Highest among the Naga, Nysha, Japanese and north Han.
  3. Siberian: Highest among the Nganassans and Evenkis, this is the lowest among South Asians overall. While this is not quite Turkic, it is the one most related to them.

Let's look at the total East Asian percentage among South Asians.

As expected, the eastern part of South Asia is where we see most of the East Asian admixture.

Now instead of looking at the absolute percentages of Southeast Asian admixture, let's look at the Southeast Asian component as a percentage of total East Asian component.

In South and East India, the East Asian admixture seems to be mostly Southeast Asian.

Now the same map for Northeast Asian as a proportion of total East Asian:

The Northeast Asian component dominates along the northern border of South Asia.

Finally the Siberian:

Compared to the other two, the Siberian component is fairly low among South Asians, so it's difficult to separate the noise from real admixture here. Most of the peaks you see are among populations that have low East Asian admixture.
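The proportions behind these maps are simple ratios, and a small sketch shows why low totals produce noisy peaks. The two populations below are invented; only the three component names come from the post.

```python
EAST_ASIAN = ("Southeast Asian", "Northeast Asian", "Siberian")


def east_asian_shares(props):
    """Return total East Asian admixture and each component's share of it."""
    total = sum(props.get(c, 0.0) for c in EAST_ASIAN)
    if total == 0:
        return total, {}
    return total, {c: props.get(c, 0.0) / total for c in EAST_ASIAN}


# Hypothetical population with substantial East Asian admixture (10%).
high = {"Southeast Asian": 0.06, "Northeast Asian": 0.03, "Siberian": 0.01}
# Hypothetical population with very little East Asian admixture (0.5%).
low = {"Southeast Asian": 0.002, "Northeast Asian": 0.001, "Siberian": 0.002}

high_total, high_shares = east_asian_shares(high)
low_total, low_shares = east_asian_shares(low)
print(high_shares["Siberian"], low_shares["Siberian"])
```

In the second population a Siberian component of only 0.2% becomes 40% of the East Asian total, which is the kind of peak that is hard to tell apart from noise.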

Reference 3 PCA Clustering for South Asians

Using the first 32 dimensions of the Reference 3 PCA, I tried to classify the 51 South Asian populations. I did not try a full clustering on all populations because that took too long and there seemed to be more than 150 clusters.

You can see the South Asians on 3-D PCA plots of the first four principal components.

The clustering results from Mclust are in a spreadsheet.

PS. I used 32 eigenvectors as that's what gave me the maximum number of clusters with a small number of outliers.

Reference 3 South Asians PCA

Let's zoom into the PCA plots of Reference 3 (more here) and look at how the different South Asian populations line up.

First the 3-D plot of eigenvectors 1, 2 & 3 with principal component 1 being vertical (and axis of rotation).

And now principal components 2, 3 & 4 (with the vertical axis of rotation being 2):

Note that I performed PCA on the whole set of reference 3, so you are looking at the axes of variation of all populations, not just South Asians.

Reich et al Duplicates

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have and making sure there are no duplicates, relatives or any other strange things that could cause issues with my analysis.

So I went back to the Reich et al Indian dataset.

The dataset doesn't have any duplicate or likely relative samples itself. However, there are two Kharia samples that are the same as samples in the Austroasiatic dataset. Since the Austroasiatic dataset has more SNPs in common with 23andMe, I removed these two samples from Reich et al.

The IBS/IBD analysis and the sample IDs are in a spreadsheet as usual.
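For a rough idea of how cross-dataset duplicates show up, here is a toy identity-by-state check. Genotypes are coded as 0/1/2 allele counts with None for missing calls; the sample names, genotypes and the 0.99 cutoff are all assumptions for illustration, not the values from the actual analysis.

```python
from itertools import combinations


def ibs_similarity(g1, g2):
    """Mean proportion of alleles shared identical-by-state over called SNPs."""
    shared = [
        1 - abs(a - b) / 2
        for a, b in zip(g1, g2)
        if a is not None and b is not None
    ]
    return sum(shared) / len(shared) if shared else 0.0


def likely_duplicates(genotypes, threshold=0.99):
    """Return sample pairs whose IBS similarity reaches the threshold."""
    return [
        (s1, s2)
        for (s1, g1), (s2, g2) in combinations(sorted(genotypes.items()), 2)
        if ibs_similarity(g1, g2) >= threshold
    ]


# Invented genotypes; the first two mimic one person typed in two datasets.
genotypes = {
    "reich_kharia_1": [0, 1, 2, 2, 1, 0, None, 2],
    "austro_kharia_1": [0, 1, 2, 2, 1, 0, 1, 2],
    "other_sample": [2, 1, 0, 1, 1, 2, 0, 0],
}

duplicates = likely_duplicates(genotypes)
print(duplicates)
```

With real data the comparison would run over hundreds of thousands of shared SNPs, and close relatives would appear as pairs with elevated but not near-perfect similarity.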