Monthly Archives: May 2011

Indian Cline III

I have been working on creating 100% ASI (Ancestral South Indian) samples recently, so it was really interesting to see that Dienekes has done similar experiments.

I am going about creating the "pure" allele frequencies somewhat differently, so comparing the two approaches should be a useful exercise.

Anyway, I thought you guys would be itching for some new results. So here's a PCA plot:

This used the same Principal Component Analysis as the one here using the 96 Indian Cline samples, Utahn Whites and Onge. However, I projected three extra "populations" on this plot.

These three populations are simulated genetic data of 25 individuals using the allele frequencies from Reference 3 Admixture results.

  1. Onge11 is generated from the Onge (C2) component from K=11 admixture for Reference 3.
  2. SA11 is generated from the South Asian (C1) component from the same K=11 admixture.
  3. SA12 is generated from the South Asian (C1) component from the K=12 admixture.
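
A minimal sketch of how simulated individuals like these could be generated, assuming each genotype is drawn independently per SNP as two draws at the component's allele frequency (Hardy-Weinberg proportions, no linkage between SNPs). The frequencies and the function name below are invented for illustration and this is not necessarily the exact procedure used for the plot; the real frequencies would come from the admixture run's allele-frequency output for the chosen component.

```python
import random

def simulate_individuals(allele_freqs, n_individuals, seed=0):
    """Draw diploid genotypes (0, 1 or 2 copies of the allele) for each SNP."""
    rng = random.Random(seed)
    individuals = []
    for _ in range(n_individuals):
        # Two independent draws per SNP, one for each chromosome copy.
        genotype = [
            int(rng.random() < p) + int(rng.random() < p)
            for p in allele_freqs
        ]
        individuals.append(genotype)
    return individuals

# Hypothetical allele frequencies of one admixture component at five SNPs.
component_freqs = [0.10, 0.50, 0.90, 0.00, 1.00]

# 25 simulated individuals, matching the population size used in the post.
simulated = simulate_individuals(component_freqs, n_individuals=25)
```

Each simulated individual is a list of allele counts that could then be written out in whatever genotype format the PCA software expects.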

As you can see, the SA12 population lies between 100% ASI and the Indian Cline samples.

The Onge11 generated samples are a bit beyond 100% ASI on the first principal component, but they are also shifted towards the real Onge on pc2.

Misuse of Correlation

I have been misusing correlation in computing Ancestral South Indian percentages from PCA/ADMIXTURE and Reich et al population-level averages.

I have tried to make it clear that just looking at the correlation is not enough: an admixture component is not similar to ASI just because it correlates well with Reich et al's ASI averages for the 18 Indian cline populations, even when the correlation is higher than 0.99. To illustrate what I mean, let's look at the Ref4C admixture runs.

I calculated the mean of each admixture component for the 18 Indian cline populations in the K=2 to K=12 runs and then computed the correlation between those means and the Reich et al results. Let's take a look:

K    Component        Correlation
2    C1 Euro-Afro     -0.9941887
3    C2 East Asian     0.9955347
4    C3 European      -0.993933
5    C3 European      -0.993277
6    C1 South Asian    0.9675099
7    C1 South Asian    0.993081
8    C1 South Asian    0.9932762
9    C1 South Asian    0.9914145
10   C1 South Asian    0.9918095
11   C1 South Asian    0.9919097
12   C1 South Asian    0.9918594

Where do you see the highest correlation? At K=3 ancestral populations, the East Asian component is very highly correlated with ASI for the Indian cline populations. Does that mean that we could use it to compute ASI? No, not at all. While it is expected that at K=3, ASI would be a little closer to East Asian than to European, East Asian is not a good proxy for ASI at all, since the relationship cannot be extrapolated to other individuals and populations.
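
A small illustration of one reason a high correlation alone says little: Pearson correlation is unchanged by a positive linear rescaling, so a component can track ASI almost perfectly across a set of populations without being anywhere near ASI in value. The numbers below are made up for illustration and are not taken from the Ref4C runs.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical ASI fractions for five populations.
asi = [0.30, 0.40, 0.50, 0.60, 0.70]

# A hypothetical component that is just a shifted, shrunken copy of ASI.
component = [0.05 + 0.2 * a for a in asi]

r = pearson(component, asi)
# r is essentially 1 even though every component value is far below the ASI value.
```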

Indian Cline II

One thing I forgot in the post yesterday about the Indian cline was to try to extrapolate from the PCA results to 100% ANI (Ancestral North Indian) and 100% ASI (Ancestral South Indian).

This is a simple linear extrapolation which should be okay since PCA is linear.

The "N" denotes the extrapolated position of ANI and "S" denotes the ASI. The points to the left of "N" are all Utahn Whites while the Onge are on the bottom right of the graph.

As you can see, the ASI is about the same as Onge in terms of eigenvector 1 (which represents the Indian cline approximately), but ASI is far from Onge on the 2nd eigenvector. That is expected since the Onge have been separated from the mainland populations for a long time.

The more interesting thing is that the extrapolated position of ANI is a little to the right of all the Utahn Whites.

We'll need a similar analysis of the Indian cline with more populations to see which one the ANI is closest to.

PS. I should point out that I am using correlation between a limited number of population statistics to find a relationship between the 1st principal component and Reich et al's ASI estimate. This has a number of drawbacks. It would be much better to compute ASI directly.
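
The extrapolation can be sketched roughly as follows, assuming a least-squares line is fitted for each principal component as a function of the population ASI estimate and then evaluated at ASI = 0 (100% ANI) and ASI = 1 (100% ASI). The population values and function names are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def extrapolate(asi_values, pc_values, target_asi):
    """Predict a PC coordinate at a given ASI fraction."""
    intercept, slope = fit_line(asi_values, pc_values)
    return intercept + slope * target_asi

# Hypothetical population averages: ASI fraction and the first two PC coordinates.
asi = [0.3, 0.4, 0.5, 0.6]
pc1 = [-0.02, 0.00, 0.02, 0.04]
pc2 = [0.01, 0.01, 0.01, 0.01]

# Extrapolated positions of "N" (ASI = 0) and "S" (ASI = 1) on the plot.
ani_point = (extrapolate(asi, pc1, 0.0), extrapolate(asi, pc2, 0.0))
asi_point = (extrapolate(asi, pc1, 1.0), extrapolate(asi, pc2, 1.0))
```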

Indian Cline

I had used linear regression to estimate the Ancestral South Indian (ASI) component from the Reference 3 K=11 admixture run. Here are a couple more exercises along the same lines, but much simpler.

Just using the 96 Indian cline samples from Reich et al to compute PCA or admixture doesn't work as the Chenchu separate out in both analyses from the rest. So I added the Utahn White (CEU) samples from HapMap and the Onge from Reich et al.

First, I ran supervised admixture with two ancestral components, Utahn Whites and Onge. Here's the Onge component plotted against Reich et al's ASI estimate along with a linear regression estimate. The correlation between the two is 0.9908.

Second, I ran Principal Component Analysis (PCA) on the Indian cline samples plus Utahn Whites and Onge. Here are the first two PCA dimensions plotted. The first eigenvector explains 4.04% of the total variation and the 2nd explains 1.94%.

The first principal component is mostly along the Indian cline while the second one basically separates the Onge from everyone else.

Using the 1st principal component to estimate ASI, here's the plot with Reich et al's ASI estimate along with a regression line. The correlation between pc1 and ASI is 0.9929.

Note that both these methods work only if the samples are on the Indian cline, i.e., they don't have any other admixture.
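
A minimal sketch of the regression step, assuming ASI is modelled as a straight-line function of the first principal component fitted on population averages. All numbers and names here are hypothetical, and the estimate is only meaningful for a sample that lies on the Indian cline.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical population means of pc1 and the matching published ASI estimates.
pop_pc1 = [-0.03, -0.01, 0.01, 0.03]
pop_asi = [0.35, 0.45, 0.55, 0.65]

intercept, slope = least_squares(pop_pc1, pop_asi)

def estimate_asi(pc1_value):
    """Estimate the ASI fraction of a sample from its pc1 coordinate."""
    return intercept + slope * pc1_value

# ASI estimate for a hypothetical individual with pc1 = 0.02.
sample_asi = estimate_asi(0.02)
```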

And now for comparison, here's the linear regression for the Reference 3 K=11 admixture Onge component and ASI. The correlation here is 0.9949. Note that this is a little different than my previous analysis since I calculated the population averages using only the 96 samples recommended by Reich et al.

Here's a spreadsheet containing the data for these three runs.

I have a couple more tricks to try in order to figure out some things regarding Ancestral South Indian admixture. Let's hope they provide us with some insight.

East Asian Admixture

Let's look at the East Asian admixture among South Asians and other surrounding populations from a previous admixture run (K=12).

I have listed the different kinds of East Asian admixture components among selected populations. The three relevant components are:

  1. Southeast Asian: Highest among the Dai, Cambodians, Lahu and Malay, this is the most common East Asian component among South Asians.
  2. Northeast Asian: Highest among the Naga, Nysha, Japanese and north Han.
  3. Siberian: Highest among the Nganasans and Evenkis, this is the lowest among South Asians overall. While this is not quite Turkic, it is the component most closely related to Turkic populations.

Let's look at the total East Asian percentage among South Asians.

As expected, the eastern part of South Asia is where we see most of the East Asian admixture.

Now instead of looking at the absolute percentages of Southeast Asian admixture, let's look at the Southeast Asian component as a percentage of total East Asian component.

In South and East India, the East Asian admixture seems to be mostly Southeast Asian.

Now the same map for Northeast Asian as a proportion of total East Asian:

The Northeast Asian component dominates along the northern border of South Asia.

Finally the Siberian:

Compared to the other two, the Siberian component is fairly low among South Asians, so it's difficult to separate the noise from real admixture here. Most of the peaks you see are among populations that have low East Asian admixture.
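
The proportions behind these maps can be sketched as below: each East Asian component is divided by the total East Asian percentage for that population. The population names and values are made up, and the `min_total` cutoff is an assumption added here to show how populations with very little East Asian admixture could be flagged rather than reported as noisy peaks.

```python
def east_asian_shares(southeast, northeast, siberian, min_total=0.02):
    """Express each East Asian component as a share of total East Asian admixture."""
    total = southeast + northeast + siberian
    if total < min_total:
        # Too little East Asian admixture for the split to be meaningful.
        return None
    return {
        "total": total,
        "southeast": southeast / total,
        "northeast": northeast / total,
        "siberian": siberian / total,
    }

# Hypothetical admixture fractions (Southeast Asian, Northeast Asian, Siberian).
populations = {
    "PopA": (0.06, 0.03, 0.01),
    "PopB": (0.004, 0.003, 0.003),
}

shares = {name: east_asian_shares(*parts) for name, parts in populations.items()}
```

With these made-up inputs, PopA gets a Southeast Asian share of about 60%, while PopB falls below the cutoff and is left out.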

Ref4C Admixture

I removed the Gujarati-A samples from the previous set of runs and ran admixture on the resulting dataset.

Nothing new pops out except that the Siberian component splits into Turkic/Tungusic and Nganasan components at K=12.

The admixture results are in a spreadsheet as usual.

K=11 and K=12 have the lowest cross-validation errors.

At K=13, the Chenchu split off as their own cluster.

Participants' Help Needed

A journalist from the Times of India has contacted me to talk about the Harappa Ancestry Project. She is interested in talking to some Indian participants about it.

If any of you are interested, let me know in the comments or by email and I'll forward your contact information to the journalist.

UPDATE: I have sent the contact info of the six people who volunteered.

Taking Suggestions

What would you want from this project? What sort of analyses would you like me to do?

I know several of you want regional admixture/PCA analyses and those are coming starting next week.

In addition to that, is there something specific you would like to be investigated?

For example, is there some specific supervised admixture you would like me to run? A specific PCA/MDS analysis?

Or do you want me to try to synthesize all the results we have gotten into some sort of coherent theory instead of throwing out the numbers like I have been doing?

Harappa Participants Map

Davidski created a Google map of Eurogenes participants and Reiver suggested something similar would be cool for the Harappa Ancestry Project too.

So I have copied the idea and created a Google map for the Harappa Ancestry Project.


Now participants need to go and add themselves to the map at their ancestral location. Here are the instructions:

  1. Log in to your Google account. (If you don't have one, you'll have to create one.)
  2. Click the Edit button. Do not edit the title or description on the left of the map.
  3. Click on Add Placement marker and drag the marker to the desired location. Choose the most appropriate location for your ancestry.
  4. Put your project ID in the title for the placemark and your ancestry in the description.
  5. Click the Save button.

That's it!

More Reference Admixture Runs

In addition to the removals and changes in the previous set of runs, I removed the Onge, Great Andamanese and Kalash for this set.

The admixture results of this dataset are in a spreadsheet as usual and the bar chart is below.

K=10, 11, 12 are the ones with the lowest cross-validation error.
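
A small sketch of how the K values could be compared programmatically, assuming the runs were made with ADMIXTURE's cross-validation option, which reports one line of the form `CV error (K=...): ...` per run. The error values below are invented and are not the results of these runs.

```python
import re

# Pattern for the cross-validation line reported for each run.
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")

def parse_cv_errors(log_text):
    """Map each K to its cross-validation error from concatenated log output."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}

# Invented log excerpt for illustration only.
log_text = """\
CV error (K=9): 0.5512
CV error (K=10): 0.5503
CV error (K=11): 0.5501
CV error (K=12): 0.5502
CV error (K=13): 0.5510
"""

cv_errors = parse_cv_errors(log_text)

# The K with the smallest cross-validation error.
best_k = min(cv_errors, key=cv_errors.get)
```

When several K values have nearly identical errors, as in this made-up excerpt, the minimum alone is a weak guide and the choice still comes down to which clusters are most useful.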

I wonder if anyone is going to mind my calling C2 at K=9 Pakistani instead of Balochistan/Caucasus? ;-)

I like K=12 here and K=12 or 13 in the previous run. So the question is: which of all these runs, across the two different datasets, should I use to replace the old Reference I K=12 admixture runs?
