Admixture Ref3 Dendrogram HRP0001-HRP0160

This dendrogram is based on the admixture results using Reference 3. As usual, I used complete linkage for the hierarchical clustering.

Let's look at the dendrogram using the regular Euclidean distance measure between admixture results.

I also decided to use a chi-squared distance measure to do the clustering.

PS. Any thoughts on the trees based on two different distance measures?


  1. So, it seems West Asia is divided into two major genetic clusters: one comprising Anatolians (Turks, Georgians and Azeris) and all Iranic speakers of West Asia, and the other comprising Semitic speakers. HRP0080 is Georgian, but he/she is genetically unusual for Georgians and doesn't resemble the other Georgian participant (HRP0138) or, more importantly, the reference Georgians, so we can confidently say that he/she doesn't represent Georgians.

    The first major cluster of West Asians is in turn divided into Anatolian and Iranic sub-clusters. It would be interesting to see Armenians in this dendrogram. Based on the Reference 3 dendrogram, Armenian participants should appear in the Anatolian sub-cluster. Kurds, as in the ADMIXTURE analyses on this blog, seem genetically indistinguishable from the rest of West Asian Iranics in the dendrogram. But in one analysis on Dienekes' Dodecad blog, Kurds appeared genetically fairly distinct from the rest of West Asians. I think that is because Dienekes used only the Xing et al. Kurds for that analysis (his only Kurdish Dodecad participant had probably not joined yet) and, as is well known, the SNPs used by Xing et al. are largely different from those used by 23andMe. OTOH, in another of Dienekes' Dodecad analyses, his only Kurdish Dodecad participant (tested with 23andMe) clustered with Iranians (Dienekes didn't use the Xing et al. Kurds for that analysis). So I think Kurds are genetically nothing but West Asian Iranics (like Persians). This is quite compatible with historiography, as the term Kurd (a term used exclusively in West Asia, BTW) initially simply meant an Iranic-speaking pastoral nomad, without any ethnic or sub-linguistic connotations.

    The second major West Asian cluster is harder to interpret. More samples could clarify the situation.

    • These results are only based on Kurds from Iraq and Iran. The lone Kurd from Turkey (HRP0141, DOD731) has mentioned on forums that his ancestors came from Iran.

      50% of Kurds live in Turkey and we do not know how they will cluster. Iranians did not settle as heavily in Turkey as they did in Iran and Iraq. So their results might differ slightly; they might lean towards Turks and Armenians more than the other Kurds.

      • "So their results might differ slightly; they might lean towards Turks and Armenians more than the other Kurds."

        You may be right. We will know these issues much better in the near future with ever increasing genetic studies and samples.


  3. BTW, I suspect that HRP0080 has some recent non-Georgian admixture from the north (possibly Russian), judging by his/her elevated European component and reduced South Asian and SW Asian components compared to the rest of Georgians (including the reference 3 Georgians).

  4. Hi Zack,
    Pardon my ignorance, but I am having a hard time digesting the information. If we look at the Euclidean chart, is it saying that the Gujaratis are closer to the Tamil Nadars and Andhra Reddys than, say, the Sindhis, the Ganchi or the Punjabis?

    • The hierarchical clustering with complete linkage tries to find compact clusters. So if there are two clusters it is trying to combine at some level of the tree, it calculates the distance (according to whatever distance measure you are using: Euclidean in the first tree and chi squared in the 2nd here) between the two furthest members of the two clusters. The two clusters with the minimum such furthest distance are combined at that iteration.

      Because of this furthest neighbor strategy and the hierarchical process, we can't say that person X is closer to person Y or Z from looking at the tree. However, we can say that group A as a whole seems to be closer to group B.

      As for Gujaratis, their admixture (and PCA) results are fairly unique, and they do seem to be quite different from Northwest Indians.
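      The furthest-neighbor merging described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the trees in the post: the function name `complete_linkage` and the four sample rows are made up, and the distance is plain Euclidean.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def complete_linkage(points):
    """Agglomerative clustering with complete (furthest-neighbor) linkage.

    Returns the merges in order as (members_a, members_b, height) tuples,
    where the members are indices into `points`.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Distance between two clusters = distance between their
            # two furthest members.
            height = max(dist(points[i], points[j]) for i in a for j in b)
            if best is None or height < best[0]:
                best = (height, a, b)
        height, a, b = best
        # Combine the pair of clusters with the smallest such distance.
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, height))
    return merges


# Hypothetical admixture proportions (three components per person).
samples = [
    (0.70, 0.20, 0.10),
    (0.68, 0.22, 0.10),
    (0.30, 0.60, 0.10),
    (0.25, 0.60, 0.15),
]
merges = complete_linkage(samples)
for a, b, height in merges:
    print(a, b, round(height, 3))
```

      The height of the last merge is the distance between the two most dissimilar people across the two groups, which is why the tree supports statements about groups as a whole rather than about any particular pair of individuals.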

      • Thanks for the comment! I would think that groupwise closeness would also be a stretch, considering that the majority population (South Asians in this case) seems to skew the results for others, thus leaving a Kazakh and an Afro-Belizean on the same tree, or maybe it is an illusion because of the square plots. Is a tree plot like the ones used at GenBank easy to make? I could try to make a distance-based tree plot for the dendrogram if I knew where to start! Anyway, thanks again for the info.

        • Of course, the tree seems skewed for non-South Asians, but do note that the further to the right the branches join, the more distant the groups/individuals being joined together.

          I am going to see if I can make a javascript-based collapsible tree for all the reference individuals plus the Harappa participants that can be searched so it is presentable.

          For these, I used R's dist, hclust and plot functions.

  5. I'm HRP0080, and indeed, I have recent European admixture - my maternal grandmother is Ukrainian, with some Polish ancestry.
    HRP0138 is my maternal grandfather. Actually, he also has Euro admixture - his maternal grandfather was of unknown European (possibly Polish) origin. But strangely, this doesn't show much in his results.

    • That means you are at least 1/4 non-Georgian. This is an important fraction in genetics (even plain 1/4). If I were Zack, I would label you as Georgian (3/4), Ukrainian (1/4) instead of simply Georgian in the ethnicity column.

      As for your maternal grandfather (HRP0138), is his maternal grandfather European or partially European (for instance, only one of his parents or grandparents is European)? If partial, that may explain the result of HRP0138.

      BTW, do you live in Turkey? I am asking this as a significant fraction of Muslim Georgians live in Turkey today.

        "is his maternal grandfather European or partially European"

        Unfortunately we know almost nothing. But it seems he was fully European.

        No, I live in Georgia. It would be very interesting to study the genetics of those Georgians and also the Lazs. But I think it would be against Turkish laws to single them out as a distinct group for such studies, no?

        • There is no law today, as far as I know, against singling out various ethnic groups as a study population in Turkey. There are even several genetic studies I know of by Turkish (not just foreign) scholars of several minority ethnic groups in Turkey (I am not aware of genetic studies of Lazes and Georgians, but among minority ethnic groups there are genetic studies of Kurds, Arabs and Adygei in Turkey).

            Do you by chance know where I can find the study of Adygei that you mentioned?


  6. BTW, Zack, isn't it against the Harappa Ancestry Project rules to join the project together with a grandparent? A grandparent is a close relative after all.

    • As long as I know about the relatives and it is understood that only one will be included in PCA and some other analyses, it's okay to send relatives.

      However, it makes sense only for mixed individuals since there is a chance you could find something interesting in the results.

  7. "Do you by chance know where I can find the study of Adygei that you mentioned?"

    In that study only a few Adygei individuals were studied and there was no ADMIXTURE-type analysis, so I don't think it would be useful. Kurds are surely better studied in Turkey (after all, Kurds are the only minority ethnic group with meaningful numbers in Turkey).

  8. "Unfortunately we know almost nothing. But it seems he was fully European."

    Then HRP0138 is 3/4 Georgian too. Hmm.

  9. Can anyone comment on SB's question? From the dendrogram, it looks like Gujarati people are closer to people from Andhra Pradesh, Tamil Nadars and Sinhalese (me!) than to their fellow North Indians. I'm interested because when I did the DNA Tribes SNP match, my #1 and #2 matches were with Gujarati and Andhra Pradesh populations respectively.

    • At least since AD 696 (SamastaBhuvanasRaya Vijayaditya), though they had become tributaries even earlier (see e.g. the Aihole inscription of Pulakesin), Gujarat has been ruled by southern empires - Chalukyas, Rastrakutas, Chalukyas, Chaulukyas, Marathas - which may play a role in the genetic similarity we are seeing. To a lesser extent you will see southern influence in Bengal, Bihar, Nepal, and Uttar Pradesh, where southern rule was present but more limited. Even as late as the Surat grant of Trilochanapala (1151 A.D.), the southern powers were the paramount rulers of the north: "Kanyakubje Maharaja Rashtrakutaya kanyakam \ labdhra sukhaya tasi/dm tvam Chaluky-apnuhi santatim"

    • In the Indian Genome Variation Project, the studied population in Gujarat was found to belong to the Dravidian cluster. I believe this is due to the legacy of a relatively densely populated IVC region, where the IVC legacy persisted later than in Harappa and where the impact of IA migration was linguistically significant but genetically minimal. It would be interesting to see the clustering of Gujarati groups by caste (Brahmin vs. non-Brahmin). This caste effect is prominent in the lower Gangetic plain (i.e. Bengali Brahmins cluster with NIs while other Bengalis cluster with SIs). Contra Parasar, I doubt very much that the clustering of Gujaratis with SIs is due to the possible early and late medieval presence of Deccan ruling elites in Gujarat.

      • Estimating a date of mixture of ancestral South Asian populations.

        "Our analyses suggest that major ANI-ASI mixture occurred in the ancestors of both northern and southern Indians 1,200-3,500 years ago [!?], overlapping the time when Indo-European languages first began to be spoken in the subcontinent."

        • "Our analyses suggest that major ANI-ASI mixture occurred in the ancestors of both northern and southern Indians 1,200-3,500 years ago, overlapping the time when Indo-European languages first began to be spoken in the subcontinent."

          That may explain why there is still a significant and consistent genetic difference between the castes of a given region in terms of ANI-ASI ratio in South Asia (especially in the southern and central regions). If the major ANI-ASI mixture had occurred in South Asia thousands of years before the Aryan invasion, ANI-ASI ratio differences would be expected to be less related, or even unrelated, to the caste system, which is almost certainly a legacy of the Aryan invasion.

          • 'ANI' as used by the Reich group at Harvard is an umbrella term that likely captures early Neolithic 'Caucasoid' western Eurasian intrusive elements, later Bronze Age Indo-Aryan elements and, particularly in northwestern India and present-day Pakistan, historic intrusions such as that of Scythians, Kushans and Hephthalites. This implies that there have been at least two ANI-ASI admixture events. This is reflected in the admixture analysis on this site: South Asian = early Neolithic 'Caucasoid' component of ANI; European + SW Asian = Bronze Age and historic components of ANI; Onge = ASI. This would explain the marked North-South and high caste-low caste gradients in the European+SW Asian components and would confirm Onur's statement regarding the introduction of the Indo-Aryan elements being structured by caste hierarchy for the most part. There are exceptions to this, however: the middle caste Panjabi Jatts have a higher European+SW Asian component than the upper caste Panjabi Brahmins. This could be the result of the selective incorporation of later intrusive Central Asian populations, whom brahminic culture deemed to be barbarians, into the middle castes.

          • @RKM

            What database are you referring to when describing the Panjabi Jatt data? I am not saying that Jatts were not part of a later invasion or migration, but I am curious where you get your data for the higher European component. Also, middle caste and Jatt are not synonymous with each other.

          • @JDP - Look at the Harappa K=11 results for the project participants. Sort by "European" in either the spreadsheet or the bar charts. The Northern European modal component peaks among Jatts, North Indian Brahmins and generic northwest Indians. The Haryanvi Jatt in specific has an elevated European component.

            @RKM - With regard to the Jatts, I made a similar comment in the comments section of this ADMIXTURE batch.

          • @Av

            I have just seen the data in European order. I must say the UP/Haryana Jatt's 27% is quite astounding, but I cannot say it is or will be the same for all Jatts in general. This Jatt even has a higher European value than Iranians. According to the list:

            1. UP/Haryana Jatt 27%
            2. Punjabi Jatt 22%
            3. Punjabi Jatt 21%
            4. Punjabi Jatt 20%
            5. Punjabi Jatt 19%
            6. Punjabi Ramgarhia 18%
            7. Punjabi Jatt 18%
            8. Punjabi 16%
            9. Punjabi 16%
            10. Punjabi Brahmin 16%
            11. Punjabi 16%
            12. Punjabi 13%
            13. Punjabi 13%
            14. Punjabi 13%
            15. Punjabi 13%
            16. Punjabi Rajput 12%

            It seems that, out of the 16 Punjabi participants (hopefully I got them all, though I counted only those who were solely Punjabi), 13% is roughly the cut-off and the average is around 17% European. In this case, IMO, the only Jatt that stands out, and rightly so, is the UP/Haryana Jatt at 27%, 10% higher than the average. There are only about 6 Jatts by my count, so I do not think we can so readily conclude from this data set that all Jatts in general score this way, seeing how populous they are. I would love to see more Jatt participants, as well as Ramgarhia ones; this would give a better picture. As of now I can only say that the only stand-outs are the UP Jatt and perhaps the one after him.
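            The arithmetic behind the "around 17%" average and the "10% higher" figure can be checked with a few lines of Python. The values are simply copied from the list above; nothing else is assumed.

```python
from statistics import mean

# European component (%) for the 16 participants, copied from the list above.
european = [27, 22, 21, 20, 19, 18, 18, 16, 16, 16, 16, 13, 13, 13, 13, 12]

average = mean(european)
gap = max(european) - average  # how far the top value sits above the average

print(f"participants: {len(european)}")
print(f"average: {average:.1f}%")
print(f"top value above average: {gap:.1f} points")
```

            This gives an average of about 17.1% and puts the UP/Haryana Jatt about 9.9 percentage points above it, in line with the rounded figures quoted.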

          • @AV I would also like a clarification of what you mean by "generic northwest Indians".

            Thank you for telling me about the European data set as well.

          • JDP, yes, the Haryanvi Jatt does indeed stand out as an outlier among the Jatt participants of the project. Most of the Jatts have a European component that moderately exceeds their SW Asian component. They are far more balanced in terms of their exogenous West Eurasian admixture than the Haryana Jatt, who is heavily biased towards the European component as opposed to the South-West Asian component. In the same comments section of the ADMIXTURE batch I posted in my previous comment, HRP0131 mentioned that his U.P. side claims ancestry from an area in present-day Afghanistan. I am not too sure whether to attribute his high European score to that side of his ancestry, as even the reference Pakistani Pashtuns are not as European as he is, but it is no doubt a possibility. I totally agree with you with regard to the sample size: to draw too robust an inference or conclusion from these samples would be fallacious. I do think it's possible that there is a segment of variation among Jatts that differs from the current Jatt participants, given their populousness and various gots/clans. What I meant by generic North-West Indians were the Sindhis, the Kashmiri and the part Baloch, part Punjabi participant, along with the other Punjabis that find themselves among the participants with the highest European component. Here are the top 20 participants of fully South Asian descent, sorted in descending order by European component. Clearly, we see North-West South Asians and Indo-European-speaking Brahmins on the list, for the most part:

            HRP0131 - UP/Haryana Jatt - 27%
            HRP0093 - Punjabi Jatt (HRP006 using FTDNA data) - 22%
            HRO033 - Rajasthani Brahmin - 21%
            HRP0021 - Kashmiri - 21%
            HRP005 - Punjabi Jatt - 21%
            HRP008 - Punjabi Jatt - 20%
            HRP0129 - U.P Brahmin - 20%
            HRP006 - Punjabi Jatt (23andMe data) - 19%
            HRP0085 - Thathai Bhatia (Sindhi Rajput) - 19%
            HRP0108 - Halai Bhatia* - 18%
            HRP0062 - Sindhi** - 18%
            HRP0136 - Punjabi Ramgarhia - 18%
            HRP0126 - Punjabi Jatt - 18%
            HRP0063 - U.P Brahmin - 18%
            HRP0004 - Punjabi Brahmin - 18%
            HRP0003 - Bihari Brahmin - 17%
            HRP0077 - Bengali Brahmin - 17%
            HRP0125 - Punjabi (unspecified) - 16%
            HRP0132 - Balochi (1/2) Punjabi (1/2) - 16%
            HRP0073 - Punjabi Ramgarhia (Tarkhan) ***

            *Presumably Khatri or Rajput of some sort.
            **A Hindu Sindhi from Shikarpur and of the trader (Vaish) caste, going by my exchanges with the participant on 23andMe a while ago.
            ***Mentioned by HRP0073, here.

          • AV, I do not know if we can clearly conclude this, but it seems that ethnic divisions in the North West of India are not as clear as they perhaps are in other parts of India.

  10. Can anyone explain the difference between how data is represented based on Euclidean distance and Chi Squared?

    While in both charts I see individuals cluster within their region, the difference I have seen is that the Euclidean distance is much more regional, while the chi-squared one seems to go across a band of regions.

    It will be interesting to understand the key data each chart leverages to arrive at these clusters.


    • I don't know how Zack implemented it, but when calculating chi-squared distance, the square of the frequency of occurrence of that particular term is also taken into account. Thus it is a way to normalize the data for the variance, so that outliers do not swamp/skew results.

    • Basically, in this case, the key difference between the two is that the chi squared one uses a weighted Euclidean distance measure. I'll try to post the actual weights I computed for each component tonight. They are the reciprocal of the mean value of that component across all samples (including reference).
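      As a rough sketch of the weighting just described (each component weighted by the reciprocal of its mean across all samples), the following Python compares the two distances on made-up data. The function names and the four sample rows are hypothetical, and it assumes every component has a non-zero mean.

```python
from math import dist, sqrt


def component_weights(rows):
    """Weight per component: reciprocal of its mean value across all rows.

    Assumes every component has a non-zero mean.
    """
    n = len(rows)
    k = len(rows[0])
    return [n / sum(row[j] for row in rows) for j in range(k)]


def chi_squared_distance(x, y, weights):
    """Weighted Euclidean distance: sqrt(sum_j w_j * (x_j - y_j)^2)."""
    return sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


# Hypothetical admixture proportions; the third component is rare overall.
rows = [
    (0.80, 0.18, 0.02),
    (0.70, 0.28, 0.02),
    (0.80, 0.10, 0.10),
    (0.90, 0.08, 0.02),
]
weights = component_weights(rows)

for other in (1, 2):
    print(
        f"row 0 vs row {other}:",
        f"Euclidean {dist(rows[0], rows[other]):.3f},",
        f"chi-squared {chi_squared_distance(rows[0], rows[other], weights):.3f}",
    )
```

      With these numbers, row 2 is nearer to row 0 than row 1 is under plain Euclidean distance, but the order flips under the weighted distance, because row 2 differs in the rare third component, which carries a large weight. That kind of reordering is one way the two trees can end up grouping people differently.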

  11. Hi Zack,

    First of all, I'm a total newbie at using ADMIXTURE. I have my results and my barplots, and I want to create a dendrogram like this one to show how my samples are distributed among the populations that I used in ADMIXTURE, K=3. How can I obtain the dendrogram from the Q and P files? Do I need extra phylogenetic software, or can I use an R or Perl script?


  12. Do I use the Q output file without modifications, or do I first have to calculate the Euclidean distances between the populations? I mean, if I ran ADMIXTURE with K=3 I have 3 columns, so must I calculate the distance between 1 and 2, 2 and 3, and 1 and 3?

    Thanks for the fast answer!
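
    On the rows-versus-columns question: a Q file has one row per individual and K columns, and the distances that feed the clustering are taken between rows (individuals), not between the K columns. That is also what R's `dist` does with a matrix, so the `dist`, `hclust` and `plot` route mentioned earlier in the thread can take the Q table as it is. A minimal Python sketch of the distance step, using a made-up three-row Q table and hypothetical helper names:

```python
from math import dist  # Euclidean distance, Python 3.8+


def read_q(text):
    """Parse Q-file content: one individual per line, K proportions per line."""
    return [
        tuple(float(value) for value in line.split())
        for line in text.splitlines()
        if line.strip()
    ]


def distance_matrix(rows):
    """Pairwise Euclidean distances between individuals (rows)."""
    return [[dist(a, b) for b in rows] for a in rows]


# Hypothetical K=3 output for three individuals.
q_text = """\
0.90 0.05 0.05
0.85 0.10 0.05
0.10 0.80 0.10
"""
rows = read_q(q_text)
d = distance_matrix(rows)

for line in d:
    print(" ".join(f"{value:.3f}" for value in line))
```

    With K=3 and N individuals this gives an N x N matrix, one distance for each pair of individuals. To compare populations rather than individuals, one option would be to average the rows of each population first and run the same step on those averages.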