Reference Admixture Analysis K=2-5

Let's do admixture analysis on my reference population.

Since I wasn't sure what value of K would be appropriate, I ran admixture with different values of K, which defines the number of ancestral populations.
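As a rough sketch of what "running admixture with different values of K" can look like, the snippet below only builds the command lines for K=2 to 5 and names the ancestry-fraction file each run would write. The input name `reference.bed` is a placeholder I made up, not the actual file used here, and nothing is executed.

```python
from pathlib import Path


def admixture_commands(bed_file, k_values):
    """Build one ADMIXTURE command per K, plus the .Q file it would produce.

    ADMIXTURE is invoked as `admixture <file.bed> <K>` and writes the
    per-sample ancestry fractions to `<basename>.<K>.Q`.
    """
    stem = Path(bed_file).stem
    jobs = []
    for k in k_values:
        command = ["admixture", bed_file, str(k)]
        q_file = f"{stem}.{k}.Q"
        jobs.append((command, q_file))
    return jobs


# "reference.bed" is a hypothetical file name used only for illustration.
for command, q_file in admixture_commands("reference.bed", range(2, 6)):
    print(" ".join(command), "->", q_file)
```

Each command list could be handed to `subprocess.run` if ADMIXTURE is installed and on the path.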

The proportion of ancestral populations for each ethnic group is given in this spreadsheet. These are the mean values for that group, calculated by averaging the ancestral proportion across all the samples belonging to that group. I have also calculated the standard deviation across each ethnic group and that's included in the spreadsheet. The higher values of standard deviation are highlighted in blue (>1%) and red (>5%). Those population groups have samples that have somewhat different ancestries.
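The per-group summary described above can be sketched roughly as follows. The group names and numbers are invented for illustration, and I am assuming the sample standard deviation (dividing by n-1); the spreadsheet may well use the population form instead.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_groups(samples):
    """samples: list of (group, [percent per ancestral component]).

    Returns {group: (means, standard_deviations)}, one value per component.
    """
    by_group = defaultdict(list)
    for group, proportions in samples:
        by_group[group].append(proportions)

    summary = {}
    for group, rows in by_group.items():
        columns = list(zip(*rows))  # one tuple of values per component
        means = [mean(col) for col in columns]
        # stdev needs at least two samples; report 0.0 for singletons
        sds = [stdev(col) if len(col) > 1 else 0.0 for col in columns]
        summary[group] = (means, sds)
    return summary


def flag(sd, blue=1.0, red=5.0):
    """Mirror the spreadsheet highlighting: blue above 1%, red above 5%."""
    if sd > red:
        return "red"
    if sd > blue:
        return "blue"
    return ""


# Hypothetical data: two made-up groups with two components each.
demo = [
    ("GroupA", [90.0, 10.0]),
    ("GroupA", [80.0, 20.0]),
    ("GroupB", [50.0, 50.0]),
    ("GroupB", [52.0, 48.0]),
]
for group, (means, sds) in summarize_groups(demo).items():
    print(group, means, [flag(sd) for sd in sds])
```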

Let's start with two ancestral populations, i.e. K = 2.

Admixture: Reference populations K=2

The second ancestral component, C2 (cyan), seems to be African, and the first, C1 (red), is highest among East Asians. Since all populations are constrained to be mixtures of these two ancestral components, Europeans, Middle Easterners and South Asians all come out with about half the African component (C2) and the rest East Asian (C1). This is what I expected, given the classification of humanity into African and non-African.

The Fst divergences between estimated ancestral populations are as follows:

     C1
C2   0.157

At K=3, the ancestral components can be roughly described as European, East Asian and African.

Admixture: Reference populations K=3

The component C1 (red) is highest among Europeans and is the major ancestry component for Middle Easterners, Central Asians and South Asians. Ancestral component C2 (green) is East Asian. South Asians also have a significant fraction of C2. African populations are represented by C3 (blue). Yemenis, Mozabites and Ethiopian Jews also have appreciable proportions of this African ancestral component.

Looking at the standard deviations of ancestral components for our sample groups, we see that while the Bedouin, Jordanians, Makrani, Moroccans, Mozabites, Saudis and Yemenis are mostly West Eurasian, their proportion of African ancestry varies quite a bit. The large standard deviation in Paniya is due to one sample (C1=55%, C2=42%, C3=3%) being very different (i.e. much more West Eurasian) from the other three (C1=11%, C2=85%, C3=4%).
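To see how a single outlier drives the Paniya spread, here is a small worked example. It assumes the three similar samples are exactly identical at the rounded values quoted above, which is a simplification, and it uses the sample standard deviation.

```python
from statistics import mean, stdev

# One outlier plus three samples assumed identical (rounded values from the text).
paniya = [
    (55, 42, 3),
    (11, 85, 4),
    (11, 85, 4),
    (11, 85, 4),
]

stats = {}
for name, values in zip(("C1", "C2", "C3"), zip(*paniya)):
    # Store (mean, sample standard deviation) in percentage points.
    stats[name] = (float(mean(values)), stdev(values))

for name, (avg, sd) in stats.items():
    print(f"{name}: mean={avg:.2f}%  sd={sd:.2f}")
```

Under these assumptions C1 and C2 each come out with a standard deviation above 20 percentage points, while C3 stays at 0.5, so the group mean alone hides the fact that only one of the four samples is different.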

There are also a couple of Sindhis with some African admixture. These are possibly partly or wholly Siddi.

HGDP Sindhi Samples Admixture K=3

Fst divergences between estimated populations for K=3:

     C1      C2
C2   0.102
C3   0.144   0.182

With four ancestral components (K=4), component C1 (red) is a South Asian ancestral component. It is highest among central and south Indians as well as among Papuans and Melanesians. It could thus possibly be related to the ASI (Ancestral South Indian) component. C4 (violet) is the African component, C3 (cyan) is the East Asian component and C2 (green) is the European component.

Admixture: Reference populations K=4

Fst divergences between estimated populations for K=4:

     C1      C2      C3
C2   0.071
C3   0.083   0.109
C4   0.152   0.152   0.184

When we increase K to 5, we get the following graph:

Admixture: Reference populations K=5

Ancestral component C1 (red) is Austronesian/South Asian. It is maximum among the Papuans at 75% and is higher among South Indians as compared to Pakistanis. It is about the same component as C1 in K=4.

C4 (blue) is Southwest Asian/West Asian. It peaks in Yemeni Jews at 66% and is high among Saudis, Bedouin, Samaritans, Egyptians, and Palestinians. It's 32% among Turks, so the Southwest Asian part is dominating the West Asian part in this component. Notice how Ethiopians and Ethiopian Jews have about half of their ancestry from this component.

C3 (green) is the East Asian component and is the same as C3 in the K=4 analysis.

C5 (magenta) is the African ancestry component and is about the same as C4 in the K=4 analysis.

C2 (yellow) is the European component. In K=4, the European component was high among both southern and northern Europeans. Now in K=5, we have the C4 (Southwest/West Asian) component among southern Europeans, so this European component has taken on more of a north European outlook.

Fst divergences between estimated populations for K=5:

     C1      C2      C3      C4
C2   0.081
C3   0.084   0.114
C4   0.085   0.054   0.129
C5   0.154   0.165   0.186   0.155
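A lower-triangular table like this is easier to query once it is parsed into pairs. The sketch below copies the K=5 values from the table and assumes the columns run C1, C2, ... in order, as in the header row; the function name is my own.

```python
FST_K5 = """\
C2 0.081
C3 0.084 0.114
C4 0.085 0.054 0.129
C5 0.154 0.165 0.186 0.155
"""


def parse_lower_triangle(text):
    """Map each (column, row) component pair to its Fst value."""
    pairs = {}
    for line in text.strip().splitlines():
        row, *values = line.split()
        # The i-th value on a row is the divergence from component C<i>.
        for col_index, value in enumerate(values, start=1):
            pairs[(f"C{col_index}", row)] = float(value)
    return pairs


pairs = parse_lower_triangle(FST_K5)
closest = min(pairs, key=pairs.get)
farthest = max(pairs, key=pairs.get)
print("closest pair:", closest, pairs[closest])
print("farthest pair:", farthest, pairs[farthest])
```

With the labels used above, the closest pair is the European (C2) and Southwest/West Asian (C4) components, and the most divergent pair is East Asian (C3) and African (C5).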

Let's continue this admixture analysis for higher values of K.

37 Comments.

  1. As you have begun interpreting the reference results, let me make a friendly warning: you have to keep in mind that most of the reference populations of ethnic groups are extremely limited in sample size (with only between 2 and 25 individuals) and from very obscure sources, and you should keep away from drawing conclusions about millions of people based on such limited number of individuals.

  2. and you should keep away from drawing conclusions about millions of people based on such limited number of individuals.

    it depends on the granularity of the conclusions. instead of making blanket characterizations, we just need to keep in mind the parameters of power.

  3. it depends on the granularity of the conclusions. instead of making blanket characterizations, we just need to keep in mind the parameters of power.

    Many populations are very heterogeneous, so I am afraid that such limited sample sizes may have negative effects on seeing the real picture.

    • Many populations are very heterogeneous, so I am afraid that such limited sample sizes may have negative effects on seeing the real picture.

      what does this mean? which populations? what are the characteristics of the "real picture." what sample size would suit you?

  4. may have negative effects on seeing the real picture.

    what does that even mean? concretely, what false inferences do you think will be made?

  5. We have a C1 (red) South Asian component which might be related to ASI, but is there an ANI too? Is it possible to detect an ANI?

    • wait at a different K. also, remember that ASI was inferred using a different method than ADMIXTURE. there is no pure reference ASI. ANI is pretty much west eurasian though. the argument in reich et al. seems to be that the "south asian" cluster is really an artifact, a cluster of hybridization of ANI-ASI which is really old.

  6. Klaus: I think we can make some careful conclusions. There is some heterogeneity in populations but there's some similarity as well.

    svetozar: I would say that at K=4-5, we can't separate out ASI from ANI. The C1 (red) component likely represents South Asian ancestry. I was just thinking aloud about ASI since it is fairly high among Papuans and Melanesians. However, since this component is present in West Asians, Caucasians and some Europeans, it's more likely to be a combination of ANI and ASI.

  7. what does that even mean? concretely, what false inferences do you think will be made?

    It is a known fact that many ethnic groups are genetically very heterogeneous, and these are especially the ones living in a broad territory and having a relatively high population (e.g., more than 10 million).

  8. By broad territory, I mean a territory at least as large as Bulgaria.

  9. For most ethnic groups we need at least several hundred individuals and a sampling based on the population sizes of sub-regions and sub-ethnic groups for a good representativeness. Many haplogroup studies have such big sample sizes for many ethnic groups, but autosomal studies generally lack such big sample sizes for ethnic groups.

    • For most ethnic groups we need at least several hundred individuals and a sampling based on the population sizes of sub-regions and sub-ethnic groups for a good representativeness.

      this depends on the question you're asking. what questions do you think zack is asking?

    • Where are those studies with hundreds of samples per ethnic group? Reich et al, for example, had only 132 Indian samples from 25 groups. While it would be great to have more samples, even a smaller number does provide us with some information.

      Representativeness is more of an issue and that's why Razib and I have been calling for South Asians, especially those underrepresented in these datasets, to participate.

  10. what questions do you think zack is asking?

    I don't know. You should ask this question to him. His ambiguous statement "I think we can make some careful conclusions" is far from satisfying me.

  11. good to know you aren’t satisfied!

    Since when is constructive criticism a reason for making sarcastic remarks?

  12. While it would be great to have more samples, even a smaller number does provide us with some information.

    I don't deny that even a small number provides some information, but we should be extremely careful and cautious (in a reticent degree I'd say) in making inferences based on such small numbers.

  13. David of the BGA Project generally refrains from making broad conclusions about ethnic groups when presenting his test results. Dienekes of the Dodecad Project is less careful in this respect and makes some mistakes in his inferences.

  14. How about making specific criticisms of any of my conclusions that are not warranted by the data I have?

    My biggest criticism for now is about drawing conclusions based on the percentages of the current Admixture components. In much bigger sample sizes their percentages can be significantly different from the current ones, and this is especially true for minor components.

    • In much bigger sample sizes their percentages can be significantly different from the current ones, and this is especially true for minor components.

      yes. this is a real criticism, and i am happy to acknowledge its validity. that's the sort of thing i'm saying would be constructive.

    • True.

      I am going to do the same admixture analysis with my reference dataset 2 later. And the differences between the results should be instructive.

  15. because i don’t think it’s constructive, i think it’s banal and vague.

    Sorry if I sounded banal and vague in my first comments.

    yes. this is a real criticism, and i am happy to acknowledge its validity. that’s the sort of thing i’m saying would be constructive.

    Thanks for your appreciation. Just know that my criticisms are always intended to be useful for the benefit of this project.
