Misuse of Correlation

I have been using correlation in computing Ancestral South Indian percentages from PCA/ADMIXTURE and Reich et al's population-level averages.

I have tried to make it clear that just looking at the correlation is not enough: an admixture component is not similar to ASI just because it correlates well with Reich et al's ASI averages for the 18 Indian cline populations, even when the correlation is higher than 0.99. To illustrate what I mean, let's look at the Ref4C admixture runs.

For each of the K=2 to K=12 runs, I calculated the mean of each admixture component for the 18 Indian cline populations and then computed the correlation between those means and the Reich et al results. Let's take a look:

K    Component         Correlation
2    C1 Euro-Afro      -0.9941887
3    C2 East Asian      0.9955347
4    C3 European       -0.993933
5    C3 European       -0.993277
6    C1 South Asian     0.9675099
7    C1 South Asian     0.993081
8    C1 South Asian     0.9932762
9    C1 South Asian     0.9914145
10   C1 South Asian     0.9918095
11   C1 South Asian     0.9919097
12   C1 South Asian     0.9918594

Where do you see the highest correlation? At K=3 ancestral populations, the East Asian component is very highly correlated with ASI for the Indian cline populations. Does that mean that we could use that to compute ASI? No, not at all. While it is expected that at K=3, ASI would be a little closer to East Asian than to European, East Asian is not a good proxy for ASI at all since we cannot extrapolate to other individuals and populations.
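To make the point concrete, here is a minimal sketch of the calculation, assuming Python with NumPy. The function names and all the numbers are hypothetical toy values for illustration, not the Ref4C or Reich et al data. It shows that a component can correlate perfectly with the reference ASI values across populations while its actual percentages are nowhere near them, and that a component running in the opposite direction gives a correlation near -1.

```python
import numpy as np

def population_means(fractions, labels):
    """Average one admixture component over the individuals of each population."""
    fractions = np.asarray(fractions, dtype=float)
    labels = np.asarray(labels)
    pops = sorted(set(labels.tolist()))
    means = np.array([fractions[labels == p].mean() for p in pops])
    return pops, means

def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical per-individual fractions for two made-up populations.
pops, means = population_means([0.2, 0.4, 0.6], ["A", "A", "B"])

# Hypothetical reference ASI averages for five made-up cline populations.
reference_asi = np.array([0.30, 0.40, 0.50, 0.60, 0.70])

# A component that is only a shifted, rescaled copy of the reference.
proxy = 0.05 + 0.2 * reference_asi

# A component that decreases as the reference increases.
inverse = 1.0 - reference_asi

r_proxy = pearson(proxy, reference_asi)
r_inverse = pearson(inverse, reference_asi)

# Average gap between the proxy's values and the reference values.
mean_gap = float(np.abs(proxy - reference_asi).mean())

print(r_proxy, r_inverse, mean_gap)
```

Correlation only measures how well one set of population averages lines up with another along a straight line. It says nothing about whether the component has the right magnitude, or how it behaves in populations outside the ones used to compute it.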


  1. Besides the correlation, there is the character of the component itself. An ASI-like component should be generally absent from East Asia and West Asia except in the case of historic admixture. I would say that C1 for K >= 7 is ASI-like.

    • Ibra,
      Can you explain your reasoning as to why an ASI-like component should be absent from East and West Asia?

      • Basically because Indian YC and mtDNA haplogroups associated with ASI are mainly restricted to South Asia and there seems to be little gene flow to other regions after 35kya.

        • Putting a 35kybp limit on the stoppage of gene flow looks too early, as there were very few humans anywhere in that period except in Australia and Africa.
          In addition, the re-population of northern Eurasia, including northern India, after the last glaciation has to be taken into account.
          We also have the issue of which Y and mtDNA lines are ASI. Is Y-R or R2 an ASI line? Reich's blind assigner seemed to think so ("Haplogroups were designated as typical of Ancient South Indians (ASI) or Ancient North Indians (ANI) based on the judgement of an expert on mtDNA and Y chromosome variation (KT) who was blinded to ancestry estimates from the autosomes."). mtDNA U2 - ASI too.

          • Parasar, I have to point you to this, if you haven't already seen it.


            As for Reich's blind assigner, he seems blind to many things, including the fact that O2a is not to be identified with ASI and that no M haplogroup should be identified with ANI. However, I used that data on the cline populations to see whether YC frequencies correlated with ANI. I found that J2 correlated strongly and positively with ANI, while R2 and R1a1 correlated mildly.

            Your question about U2 is a good one. Where does U2 fit, ANI or ASI? It's an Upper Paleolithic haplogroup that might have entered South Asia 30 kya and is also known to be the mtDNA of the Kostenki specimen in Russia.

          • Ibra,

            Yes, I had seen the paper. Part of what I had mentioned about post 35kybp genetic exchange was based on this comment in the paper: "Conditions became worse after 34 ka ... potential of the Iranian region to support larger game herds and human groups declined, and ... South Asia can be expected to have seen immigration as these conditions set in, making genetic and cultural diversity higher within the subcontinent."

            Petraglia is quite convinced about a pre-Upper Paleolithic AMH presence in South Asia. His finds at Jwalapuram indicated some tool continuity, but no AMH remains that early have yet been found in South Asia. I do associate mtDNA M with early South Asians, but I do not think mtDNA M originated in South Asia, so to me what Petraglia finds puzzling is not puzzling enough to call for alternative explanations. "The puzzling fact remains, however, that the coalescent time of haplogroup M in India using both functional and nonfunctional coding region sites appears to be ≈30% younger (ca. 45 ka) than the respective age estimate from the global sample (65 ka), which does not make sense given South Asia's central position on Out-of-Africa dispersal routes ... Relative to the rest of the world, the Indian M lineages show a 20% younger age estimate."

      • It's fairly obvious from looking at the PCA plots used to infer ASI that it is distant from West Asian and European.

        With that definition, the absence of ASI in West Asia seems pretty clear.

    • The Southeast Asians and people of eastern India and Bangladesh are an issue though and the correlation method is not useful there.

      • Zack, I was wondering if admixture would behave any differently if there were relatively pure ASI representatives. Even though pure ASI populations don't exist in nature, is it possible to simulate them from data?

        • I can use allele frequencies of ancestral components from Admixture to create pure ASI. Then that could be projected on PCA plots to see if it makes sense.

  2. I think that having pure ASI populations as a bound on the South Asian pole would act as a better gauge for measuring ASI admixture. For example, I imagine that the South Asian component would shrink in magnitude towards Reich's ASI. This would be good for the other components, which at the moment look a bit squeezed.

  3. Could you use Dienekes' zombie method (http://dodecad.blogspot.com/2011/05/how-to-create-zombies-from-admixture.html) to construct pure ASIs? Also ANI?

    As Ibra suggests, the "real" ASI is lower than the "South Asian" component, as that may reflect the agriculture-fueled expansion of an already admixed West Asian/ASI group.

    • Interesting how the Kalash are mainly West Asian-South Asian here. So the 7% or so East-Eurasian admixture Zack was able to extract for the Kalash during his K=11 exercises must in reality be ASI admixture, which is part of the South Asian in Dienekes' run?

    • While I have been using the Admixture allele frequencies as well, ASI is not as simple as the ancestral components Admixture computes. I am working on the allele frequencies of Onge/South Asian components and trying to get them closer to something like ASI.

      As for the South Asian component being higher than ASI, that's simply an artifact of how Admixture works.
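The idea raised in the comments above, creating "pure" individuals from the allele frequencies of an ancestral component, can be sketched roughly as follows. This is only an illustration under strong assumptions: each SNP is drawn independently (no linkage disequilibrium), genotypes follow Hardy-Weinberg proportions, and the frequencies in the example are made-up numbers, not output from any actual Admixture run. The function name simulate_genotypes is hypothetical, and as noted above, ASI is not as simple as a single Admixture component, so the frequencies fed in would need more work than this.

```python
import numpy as np

def simulate_genotypes(allele_freqs, n_individuals, seed=0):
    """Draw diploid genotypes (0, 1 or 2 copies of an allele) per SNP.

    allele_freqs: frequency of the counted allele at each SNP for one
    ancestral component. Assumes independent SNPs and Hardy-Weinberg
    proportions.
    """
    p = np.asarray(allele_freqs, dtype=float)
    if np.any((p < 0.0) | (p > 1.0)):
        raise ValueError("allele frequencies must lie in [0, 1]")
    rng = np.random.default_rng(seed)
    # Two independent allele draws per SNP give a Binomial(2, p) genotype.
    return rng.binomial(2, p, size=(n_individuals, p.size))

# Hypothetical allele frequencies for six SNPs of one component.
toy_freqs = [0.0, 0.1, 0.35, 0.5, 0.9, 1.0]

# Ten simulated "pure" individuals, one row each.
genotypes = simulate_genotypes(toy_freqs, n_individuals=10)
print(genotypes)
```

The simulated individuals could then be projected onto the PCA plots or added to an Admixture run alongside the real samples to see whether they land where a pure ancestral population would be expected.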