I had used linear regression to estimate the Ancestral South Indian (ASI) component from the Reference 3 K=11 admixture run. Here are a couple more exercises along the same lines, but much simpler.

Using just the 96 Indian cline samples from Reich et al to compute PCA or admixture doesn't work, because the Chenchu separate out from the rest in both analyses. So I added the Utahn White (CEU) samples from HapMap and the Onge from Reich et al.

First, I ran supervised admixture with two ancestral components, Utahn Whites and Onge. Here's the Onge component plotted against Reich et al's ASI estimate along with a linear regression estimate. The correlation between the two is 0.9908.
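For anyone who wants to reproduce this kind of fit, here is a minimal sketch of a least-squares line and Pearson correlation between a per-population admixture component and an ASI estimate. The function name `fit_asi_line` and the demo numbers are made up for illustration; they are not the values from this run.

```python
import numpy as np

def fit_asi_line(onge_component, asi_estimate):
    """Fit asi ~ slope * onge + intercept and return (slope, intercept, r)."""
    x = np.asarray(onge_component, dtype=float)
    y = np.asarray(asi_estimate, dtype=float)
    # Degree-1 polynomial fit is an ordinary least-squares line.
    slope, intercept = np.polyfit(x, y, 1)
    # Pearson correlation between the two series.
    r = np.corrcoef(x, y)[0, 1]
    return slope, intercept, r

# Hypothetical per-population values, NOT the data from this post.
onge_demo = [0.10, 0.20, 0.30, 0.40]
asi_demo = [0.25, 0.35, 0.45, 0.55]
slope, intercept, r = fit_asi_line(onge_demo, asi_demo)
```

Once fitted, `slope * x + intercept` gives an ASI estimate for a new population's Onge component `x`, subject to the same caveat that the population sits on the Indian cline.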

Second, I ran Principal Component Analysis (PCA) on the Indian cline samples plus Utahn Whites and Onge. Here are the first two PCA dimensions plotted. The first eigenvector explains 4.04% of the total variation and the 2nd explains 1.94%.
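As a rough illustration of what "explains x% of the total variation" means, here is a sketch of a plain SVD-based PCA on a samples-by-SNPs genotype matrix. This is only an assumption about the general technique, not necessarily the software or SNP normalization used for the run above, and the random genotypes and the name `pca_explained` are hypothetical.

```python
import numpy as np

def pca_explained(genotypes, n_components=2):
    """Return (scores, explained) for the leading principal components."""
    g = np.asarray(genotypes, dtype=float)
    # Center each SNP column on its mean across samples.
    centered = g - g.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    # Squared singular values are proportional to the eigenvalues.
    eigenvalues = s ** 2
    explained = eigenvalues / eigenvalues.sum()
    # Sample coordinates along each retained component.
    scores = u[:, :n_components] * s[:n_components]
    return scores, explained[:n_components]

# Hypothetical data: 20 samples, 200 SNPs coded as 0/1/2 allele counts.
rng = np.random.default_rng(0)
demo_genotypes = rng.integers(0, 3, size=(20, 200))
scores, explained = pca_explained(demo_genotypes)
```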

The first principal component is mostly along the Indian cline while the second one basically separates the Onge from everyone else.

Using the 1st principal component to estimate ASI, here's the plot with Reich et al's ASI estimate along with a regression line. The correlation between *pc1* and ASI is 0.9929.

Note that both these methods work only if the samples are on the Indian cline, i.e., they don't have any other admixture.

And now for comparison, here's the linear regression for the Reference 3 K=11 admixture Onge component and ASI. The correlation here is 0.9949. Note that this is a little different from my previous analysis, since this time I calculated the population averages using only the 96 samples recommended by Reich et al.
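The population averaging step can be sketched as follows: keep only the samples in a chosen subset, then average the component per population. The sample IDs, population labels, values and the helper name `population_means` are all hypothetical.

```python
from collections import defaultdict

def population_means(samples, keep_ids):
    """Average a per-sample value by population, using only samples in keep_ids.

    samples: iterable of (sample_id, population, value) tuples.
    """
    totals = defaultdict(lambda: [0.0, 0])  # population -> [sum, count]
    for sample_id, population, value in samples:
        if sample_id in keep_ids:
            totals[population][0] += value
            totals[population][1] += 1
    return {pop: total / count for pop, (total, count) in totals.items()}

# Hypothetical per-sample Onge components, NOT the data from this post.
demo_samples = [
    ("s1", "PopA", 0.30),
    ("s2", "PopA", 0.40),
    ("s3", "PopB", 0.60),
    ("s4", "PopB", 0.90),  # excluded below, so it does not affect PopB
]
means = population_means(demo_samples, keep_ids={"s1", "s2", "s3"})
```

Dropping or including a few samples changes the per-population averages, which is why the correlation can shift slightly between analyses.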

Here's a spreadsheet containing the data for these three runs.

I have a couple more tricks to try for figuring out some things regarding Ancestral South Indian admixture. Let's hope they provide us some insight.

I would appreciate it if you could also explain the regression analysis method (how to do it with whole-genome data) in a separate thread.

thanks

I didn't do regression analysis over genome data. The regression fit was for going from admixture or PCA output to ASI.

Also, Reff4C, k = 8 correlation between ASI and South Asian is 0.992986

Actually, it's 0.9932762.

Thanks for running these as well as the standard error estimates. I suppose you tried running Reich's samples without Chenchu at K=2 and this did not work as some other population separated out instead!

Since all the Indian cline samples are fairly mixed between two ancestral populations, ADMIXTURE cannot separate out even a weaker version of the ancestral components unless some purer ancestral samples are included.

You can try the same experiment with African Americans and Mexicans instead of the Indian cline and run into similar problems.