Similar to an earlier conference poster, Reich Lab's Priya Moorjani et al have another poster at SMBE. Here's the abstract:
Estimating a date of mixture of ancestral South Asian populations
Linguistic and genetic studies have demonstrated that almost all groups in South Asia today descend from a mixture of two highly divergent populations: Ancestral North Indians (ANI) related to Central Asians, Middle Easterners and Europeans, and Ancestral South Indians (ASI) not related to any populations outside the Indian subcontinent. ANI and ASI have been estimated to have diverged from a common ancestor as much as 60,000 years ago, but the date of the ANI-ASI mixture is unknown. Here we analyze data from about 60 South Asian groups to estimate that major ANI-ASI mixture occurred 1,200-4,000 years ago. Some mixture may also be older, beyond the time we can query using admixture linkage disequilibrium, since it is universal throughout the subcontinent: present in every group speaking Indo-European or Dravidian languages, in all caste levels, and in primitive tribes. After the ANI-ASI mixture that occurred within the last four thousand years, a cultural shift led to widespread endogamy, decreasing the rate of additional mixture.
I bolded the portion which seems new compared to the previous abstract.
Using the same method as I used for reference 3 admixture, I decided to guesstimate the Ancestral South Indian proportions, as given by Reich et al, for my HarappaWorld admixture run.
Basically, I used 92 of the 96 samples Reich et al used to find population averages for the South Indian component. Then, I ran a linear regression between the South Indian component average and Reich et al's estimate of Ancestral South Indian (ASI) ancestry. Since Reich et al actually list Ancestral North Indian percentages in their paper, but their model is a two-ancestry ANI+ASI one, I simply calculated the ASI percentages as 100% minus ANI.
The correlation between Reich et al ASI and my HarappaWorld South Indian component for the relevant populations turns out to be 0.99277086.
And the linear regression fit for the data is:
ASI = 2.5218942 + 0.8104836 * S_INDIAN
where both ASI (Reich et al) and S_INDIAN (HarappaWorld) are given in percentages.
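For anyone who wants to try the same kind of fit, here is a minimal sketch of the regression step, assuming numpy is available. The function name and the four population averages at the bottom are made up purely for illustration; they are not Reich et al's numbers or the HarappaWorld averages.

```python
import numpy as np

def fit_asi_regression(component_pct, ani_pct):
    """Fit ASI = intercept + slope * component from population averages.

    component_pct: average admixture component per population (percent)
    ani_pct: published ANI estimate per population (percent)
    """
    component = np.asarray(component_pct, dtype=float)
    # Two-ancestry model: whatever is not ANI is ASI.
    asi = 100.0 - np.asarray(ani_pct, dtype=float)
    slope, intercept = np.polyfit(component, asi, 1)
    r = np.corrcoef(component, asi)[0, 1]
    return intercept, slope, r

# Illustrative values only (hypothetical, not real data).
s_indian = [30.0, 45.0, 55.0, 70.0]
ani = [72.0, 61.0, 53.0, 40.0]
intercept, slope, r = fit_asi_regression(s_indian, ani)
print(f"ASI = {intercept:.4f} + {slope:.4f} * S_INDIAN, r = {r:.4f}")
```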
Of the individuals in HarappaWorld, I kept only those who had a South Indian component of at least 20% for computing the ASI proportions.
The resulting ASI percentages can be seen in a spreadsheet.
Please note that in the Group sheet, the averages are based on the samples which met the 20% South Indian component threshold. Thus, the 20% ASI in the Romanians is the average of the two Romanians who met the threshold out of a total of 16 Romanian samples.
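To make the thresholding and group averaging concrete, here is a rough sketch that applies the fit above. The coefficients and the 20% cutoff come from this post; the function names and the sample tuples in the usage are hypothetical.

```python
from collections import defaultdict

INTERCEPT = 2.5218942   # from the fit quoted in this post
SLOPE = 0.8104836       # from the fit quoted in this post
MIN_S_INDIAN = 20.0     # threshold (percent) used in this post

def estimate_asi(s_indian_pct):
    """Return estimated ASI percent, or None below the threshold."""
    if s_indian_pct < MIN_S_INDIAN:
        return None
    return INTERCEPT + SLOPE * s_indian_pct

def group_asi_means(samples):
    """Average ASI per group over the samples that meet the threshold.

    samples: iterable of (group_name, s_indian_pct) pairs.
    Groups with no qualifying sample are left out entirely.
    """
    by_group = defaultdict(list)
    for group, s_indian_pct in samples:
        asi = estimate_asi(s_indian_pct)
        if asi is not None:
            by_group[group].append(asi)
    return {g: sum(v) / len(v) for g, v in by_group.items()}

# Hypothetical samples: only one of the two "A" samples qualifies.
print(group_asi_means([("A", 50.0), ("A", 10.0), ("B", 5.0)]))
```

Note that a group average computed this way describes only the qualifying samples, which is why a group with mostly low South Indian scores can still show a sizeable ASI figure.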
The individual results are available in the Individual sheet. These results are a little different from the estimates using reference 3. Thus, I would point out that these should be taken only as a rough estimate.
Via Razib, here's an interesting abstract from the International Congress of Human Genetics by David Reich's group:
Estimating a date of mixture of ancestral South Asian populations.
Linguistic and genetic studies have shown that most Indian groups have ancestry from two genetically divergent populations, Ancestral North Indians (ANI) and Ancestral South Indians (ASI). However, the date of mixture still remains unknown. We analyze genome-wide data from about 60 South Asian groups using a newly developed method that utilizes information related to admixture linkage disequilibrium to estimate mixture dates. Our analyses suggest that major ANI-ASI mixture occurred in the ancestors of both northern and southern Indians 1,200-3,500 years ago, overlapping the time when Indo-European languages first began to be spoken in the subcontinent. These results suggest that this formative period of Indian history was accompanied by mixtures between two highly diverged populations, although our results do not rule out other, older ANI-ASI admixture events. A cultural shift subsequently led to widespread endogamy, which decreased the rate of additional population mixtures.
I would be very interested in reading that paper. Also, I wonder how many new samples they genotyped beyond the ones in Reich et al's Reconstructing Indian Population History and whether I could get my hands on the new data.
I have a feeling that ANI (Ancestral North Indian) captures a bunch of different migrations and conquests etc, so I am not sure if it can be equated to Indo-European language movement.
I wonder if I can use HAPMIX or StepPCO to get similar admixture dating.
One of my pet peeves is the confusion which some ethno-linguistic terms can cause. For example, the fact that there were Iranian language speakers on the plains of Ukraine ~2,000 years ago naturally indicates to people that Scythian nomads issued out of Iran northwards. Similarly, the existence of Indo-Aryan Mitanni in what is today Syria also suggests to people that there was a migration of Indians which traversed much of West Asia in a drive toward the Mediterranean from the Indus. Of course our perception of the center of gravity of these ethno-linguistic groups today is a function of historical contingency. If we didn't know much more about Antique and Medieval European history we might posit that the Celtic Galatians of ancient Anatolia were originally from Ireland, based on the contemporary distribution of Celtic languages!
This issue is now cropping up in South Asian archaeogenetics. In my opinion the paper Reconstructing Indian population history is probably the most important contribution to the field in a generation. The authors explain technically why a "South Asian" ancestral component falls out of the ancestry inference algorithms at the heart of ADMIXTURE, STRUCTURE, or frappe. In short, when you have a population which is a hybrid, but where the hybridization event is very distant in the past, recombination breaks up the signatures of that event (a decay of the linkage disequilibrium between two putative ancestral populations). Additionally, in the Indian case there doesn't seem to be a "pure" population of one of the two ancestral groups, what they termed "Ancestral South Indians" (ASI). The closest reference they found was the Onge Andaman Islanders, whose last common ancestor with ASI lived on the order of tens of thousands of years in the past. They do have excellent proxies for the other population, "Ancestral North Indians" (ANI). Compared to ASI, all West Eurasians can be used as reasonable proxies for ANI.
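The decay mentioned above is usually modeled as exponential: after n generations, the admixture linkage disequilibrium between two sites separated by d Morgans is expected to shrink by roughly a factor of exp(-n * d). Here is a small sketch of that textbook relationship. The function names are mine, and this is only the basic idea behind LD-based dating, not the actual method used in the paper or the abstracts.

```python
import math

def admixture_ld_fraction(generations, distance_morgans):
    """Expected fraction of the original admixture LD that remains."""
    return math.exp(-generations * distance_morgans)

def generations_from_decay(fraction, distance_morgans):
    """Invert the decay: generations implied by a remaining LD fraction."""
    return -math.log(fraction) / distance_morgans

# LD between markers 1 cM (0.01 Morgans) apart after 100 generations.
print(admixture_ld_fraction(100, 0.01))
```

The older the mixture, the shorter the distance over which any signal survives, which is why very old events fall outside what this kind of analysis can query.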
I have been working on creating 100% ASI (Ancestral South Indian) samples recently. So it was really interesting that Dienekes did similar experiments:
I am going about creating the "pure" allele frequencies somewhat differently, so that would be a useful exercise.
Anyway, I thought you guys would be itching for some new results. So here's a PCA plot:
This used the same Principal Component Analysis as the one here using the 96 Indian Cline samples, Utahn Whites and Onge. However, I projected three extra "populations" on this plot.
These three populations are simulated genetic data of 25 individuals using the allele frequencies from Reference 3 Admixture results.
- Onge11 is generated from the Onge (C2) component from K=11 admixture for Reference 3.
- SA11 is generated from the South Asian (C1) component from the same K=11 admixture.
- SA12 is generated from the South Asian (C1) component from the K=12 admixture.
As you can see, the SA12 population lies between 100% ASI and the Indian Cline samples.
The Onge11 generated samples are a bit beyond 100% ASI on the first principal component, but they are also shifted towards the real Onge on pc2.
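For those curious how individuals can be generated from a component's allele frequencies, here is one simple way to do it, assuming numpy. It draws each genotype as two independent allele draws per SNP, which assumes Hardy-Weinberg proportions and no linkage between SNPs. The function name is hypothetical and this may differ in detail from what I actually ran.

```python
import numpy as np

def simulate_genotypes(allele_freqs, n_individuals=25, seed=0):
    """Simulate diploid genotypes (0, 1 or 2 copies of an allele).

    allele_freqs: per-SNP allele frequency of one ancestral component,
                  e.g. taken from an admixture run's frequency output.
    Returns an integer array of shape (n_individuals, n_snps).
    """
    rng = np.random.default_rng(seed)
    p = np.clip(np.asarray(allele_freqs, dtype=float), 0.0, 1.0)
    # Two independent allele draws per SNP per individual.
    return rng.binomial(2, p, size=(n_individuals, p.size))

# Tiny illustrative frequency vector (made up).
genotypes = simulate_genotypes([0.0, 0.3, 1.0])
print(genotypes.shape)
```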
I have been misusing correlation in computing Ancestral South Indian percentages from PCA/ADMIXTURE and Reich et al population-level averages.
I have tried to make it clear that just looking at the correlation is not enough, that an admixture component is not similar to ASI just because it correlates well with Reich et al's ASI averages for the 18 Indian cline populations. Even when the correlation is higher than 0.99. To illustrate what I mean, let's look at the Ref4C admixture runs.
I calculated the mean for each admixture component from the K=2 to K=12 runs for the 18 Indian cline populations and then computed the correlation between that and the Reich et al results. Let's take a look:
[Table: correlation with Reich et al's ASI by admixture run and component. The K values and correlations did not survive; the remaining component labels are C2 East Asian (one row) and C1 South Asian (seven rows).]
Where do you see the highest correlation? At K=3 ancestral populations, the East Asian component is very highly correlated with ASI for the Indian cline populations. Does that mean that we could use that to compute ASI? No, not at all. While it is expected that at K=3, ASI would be a little closer to East Asian than to European, East Asian is not a good proxy for ASI at all since we cannot extrapolate to other individuals and populations.
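Here is a toy illustration of why a high correlation alone proves so little, assuming numpy. Correlation is unchanged by any linear rescaling, so a component that merely co-varies with ASI along the cline scores just as well as one that tracks it one-to-one. All numbers below are invented.

```python
import numpy as np

# Hypothetical ASI averages for a few cline populations (percent).
asi = np.array([40.0, 50.0, 60.0, 70.0])

# A component that tracks ASI one-to-one (hypothetical).
proxy_a = asi.copy()
# A minor component that only co-varies with ASI (hypothetical).
proxy_b = 2.0 + 0.05 * asi

r_a = np.corrcoef(asi, proxy_a)[0, 1]
r_b = np.corrcoef(asi, proxy_b)[0, 1]
print(r_a, r_b)  # both correlations are essentially perfect
```

The second proxy never exceeds a few percent, yet its correlation is identical. Whether a component means the same thing outside the cline populations is a separate question that the correlation cannot answer.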
One thing I forgot in the post yesterday about the Indian cline was to try to extrapolate from the PCA results to 100% ANI (Ancestral North Indian) and 100% ASI (Ancestral South Indian).
This is a simple linear extrapolation which should be okay since PCA is linear.
The "N" denotes the extrapolated position of ANI and "S" denotes the ASI. The points to the left of "N" are all Utahn Whites while the Onge are on the bottom right of the graph.
As you can see, the ASI is about the same as Onge in terms of eigenvector 1 (which represents the Indian cline approximately), but ASI is far from Onge on the 2nd eigenvector. That is expected since the Onge have been separated from the mainland populations for a long time.
The more interesting thing is that the extrapolated position of ANI is a little to the right of all the Utahn Whites.
We'll need a similar analysis of the Indian cline with more populations to see which one the ANI is closest to.
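The extrapolation described above can be sketched like this, assuming numpy: regress each principal component on the ASI percentage across populations, then evaluate the fitted line at 0% ASI (the "N" point) and 100% ASI (the "S" point). The function name and the numbers in the usage are made up for illustration.

```python
import numpy as np

def extrapolate_endpoints(asi_pct, pc_coords):
    """Extrapolate PCA positions for 100% ANI and 100% ASI.

    asi_pct: ASI estimate per population (percent)
    pc_coords: array of shape (n_populations, n_pcs) of mean PC scores
    Returns (ani_position, asi_position), each of length n_pcs.
    """
    asi = np.asarray(asi_pct, dtype=float)
    coords = np.asarray(pc_coords, dtype=float)
    design = np.column_stack([np.ones_like(asi), asi])
    # Least-squares line per PC: coords ~ coef[0] + coef[1] * asi
    coef, *_ = np.linalg.lstsq(design, coords, rcond=None)
    ani_position = coef[0]                    # ASI = 0%
    asi_position = coef[0] + 100.0 * coef[1]  # ASI = 100%
    return ani_position, asi_position

# Made-up population means on pc1 and pc2.
n_pos, s_pos = extrapolate_endpoints(
    [30.0, 50.0, 70.0],
    [[-0.02, 0.01], [0.00, 0.01], [0.02, 0.01]],
)
print(n_pos, s_pos)
```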
PS. I should point out that I am using correlation between a limited number of population statistics to find a relationship between the 1st principal component and Reich et al's ASI estimate. This has a number of drawbacks. It would be much better to compute ASI directly.
I had used linear regression to estimate Ancestral South Indian (ASI) component from Reference 3 K=11 admixture run. Now here are a couple more exercises along the same lines but much simpler.
Just using the 96 Indian cline samples from Reich et al to compute PCA or admixture doesn't work as the Chenchu separate out in both analyses from the rest. So I added the Utahn White (CEU) samples from HapMap and the Onge from Reich et al.
First, I ran supervised admixture with two ancestral components, Utahn Whites and Onge. Here's the Onge component plotted against Reich et al's ASI estimate along with a linear regression estimate. The correlation between the two is 0.9908.
Second, I ran Principal Component Analysis (PCA) on the Indian cline samples plus Utahn Whites and Onge. Here are the first two PCA dimensions plotted. The first eigenvector explains 4.04% of the total variation and the 2nd explains 1.94%.
The first principal component is mostly along the Indian cline while the second one basically separates the Onge from everyone else.
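For reference, the "percent of variation explained" figures come straight out of the PCA. Here is a bare-bones sketch using numpy's SVD. It only mean-centers each column, whereas dedicated genetics tools typically also normalize each SNP, so treat it as an illustration of the idea rather than what my pipeline does. The function name is hypothetical.

```python
import numpy as np

def pca_explained_variance(genotypes, n_components=2):
    """PCA via SVD on a mean-centered samples-by-SNPs matrix.

    Returns (scores, ratio): sample coordinates on the leading
    components and the fraction of total variance each one explains.
    """
    x = np.asarray(genotypes, dtype=float)
    x = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    variance = s ** 2
    ratio = variance / variance.sum()
    scores = u[:, :n_components] * s[:n_components]
    return scores, ratio[:n_components]

# Tiny made-up matrix with equal spread along both columns.
toy = [[0, 0], [1, 0], [2, 0], [1, 1], [1, -1]]
scores, ratio = pca_explained_variance(toy)
print(ratio)
```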
Using the 1st principal component to estimate ASI, here's the plot with Reich et al's ASI estimate along with a regression line. The correlation between pc1 and ASI is 0.9929.
Note that both these methods work only if the samples are on the Indian cline, i.e., they don't have any other admixture.
And now for comparison, here's the linear regression for the Reference 3 K=11 admixture Onge component and ASI. The correlation here is 0.9949. Note that this is a little different than my previous analysis since I calculated the population averages using only the 96 samples recommended by Reich et al.
Here's a spreadsheet containing the data for these three runs.
I have a couple more tricks to try for figuring out some things about Ancestral South Indian admixture. Let's hope they provide us some insight.
Here's my first admixture run using Reference 3 for Harappa participants. Since K=11 was the run with the Onge-ASI connection, I ran admixture at K=11 with all the 90 Harappa participants.
You can see the participant results in a spreadsheet as well as their ethnic breakdowns and the reference population results.
Here's our bar chart and table. Remember you can click on the legend or the table headers to sort.
Using the comparison between the Onge component here and Reich et al's Ancestral South Indian one, I get the following linear regression.
The correlation is 0.9949 which is probably as high as it can get. So let's calculate the ASI percentage for all the Harappa participants.
Note that I didn't calculate the ASI percentage for those who had a really low Onge component since the linear regression above would not be valid outside the range we have in our original data.
You can see the percentages in a spreadsheet too.
Let's compare with the Dodecad ANI-ASI results. I have 22.5% ASI here while it was 20.6% in the Dodecad analysis. Overall, it seems like my technique results in about 2% more ASI than Dodecad's, with a few exceptions, like Razib, who jumps from 34.3% to 43.3% (averaging his parents, who are very close).
Since the Onge component on my K=11 admixture run was very strongly correlated with Reich et al's Ancestral South Indian (r2 [...]
Simranjit has been kind enough to let me share his map of the Onge component in South Asia.
He also has maps of the K=12 admixture run.