Harappa and Reference I Dendrograms

Looking at the Harappa dendrogram and the dendrogram for Reference I, I thought I would combine them to see where our project participants fit.

Then I got more curious. I wanted to see a similarity tree of all the samples in Reference I (2,654) plus the 40 Harappa participants I have processed so far. That tree came out so huge that I couldn't save it in a legible form. In the end I compromised by selecting only the South Asian samples from the Reference I dataset and combining them with the Harappa data. Unfortunately, that tree doesn't tell the Iranian and European-admixed participants anything. I'll have to analyze those separately.

Anyway, here's the South Asian Admixture Dendrogram in PDF format. That means you can search for "HRP" to find all the project members, which is why I prefer PDF to an image in this case.

Note how good a stand-in the Singapore Indians are for South Indians.


  1. How come the relative distances between the participants change between the two dendrograms? For instance, in the first dendrogram, HRP0017 is closest to HRP0014, but in the second one (with the individual reference samples added), HRP0017 is closer to HRP0025 and HRP0024 than to HRP0014.

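    • I used complete linkage, which produces compact clusters but is essentially furthest-neighbor clustering: when a sample is joined to a cluster, the distance that counts is the one between the two samples that are furthest apart. Adding the reference samples changes which clusters exist at each step, so the order in which participants join can change even though their pairwise distances stay the same.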

  2. I noticed something. In the Project Participant dendrogram, Zack wrote the following:

    If we make a cut at about 0.3 on this tree, we get 3 South Asian clusters:
    [1] the Northwest of South Asia
    [2] South Indian Brahmins, Bihari Brahmin, UP Brahmin
    [3] South Indian non-Brahmin, Bihari non-Brahmin, Bengalis, Caribbean Indians

    I have numbered the clusters for ease of reference.

    In that analysis, clusters 2 and 3 clustered together against cluster 1. However, in the combined dendrogram (reference + participants), the first one in this post, that changes. Clusters 1 and 2 cluster together. They are then joined by HRP0036 (Sri Lankan + European), and only then joined to cluster 3.

    Finally, looking at the PDF, it seems to me that clusters 2 and 3 are back together and cluster 1 is further away, but I haven't looked in detail to see if that matches clearly. In any case, the Singapore Indians actually seem quite diverse to me.

    • From Zack's PCA it looks like some of the Singapore Indians have Malay admixture (they deviate toward East Asians just like Northeast Indians). I am willing to bet those exact same ones are clustering with the Bengalis (my parents + me).

    • The clusters probably combine differently because one of the reference population averages is further away from the other cluster, and under complete linkage that furthest member sets the distance at which the clusters join.

      Yeah, I misspoke about the Singapore Indians. They seem to range from Sindhis to Kannadi, and a few have quite a bit of Southeast Asian admixture.
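
The complete-linkage explanation in the first comment thread is easier to see with a toy example. Below is a minimal sketch in plain Python, not the code used for the dendrograms in this post. The labels A, B, C and R1, their one-dimensional positions, and the helper names `complete_linkage` and `join_height` are all made up for illustration; real input would be distances between admixture results.

```python
from itertools import combinations
from math import dist


def complete_linkage(points):
    """Agglomerative clustering with complete (furthest-neighbor) linkage.

    points maps a label to a coordinate tuple. Returns the merges in the
    order they happen, as (cluster_a, cluster_b, height) tuples.
    """
    def gap(a, b):
        # Complete linkage: distance between the two furthest members.
        return max(dist(points[x], points[y]) for x in a for y in b)

    clusters = [frozenset([label]) for label in points]
    merges = []
    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest furthest-member gap.
        a, b = min(combinations(clusters, 2), key=lambda pair: gap(*pair))
        merges.append((a, b, gap(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


def join_height(merges, x, y):
    """Height at which x and y first land in the same cluster."""
    for a, b, height in merges:
        if x in (a | b) and y in (a | b):
            return height
    raise ValueError("labels never merged")


# Made-up 1-D positions, NOT real admixture results.
participants = {"A": (0.0,), "B": (1.0,), "C": (2.2,)}
reference = {"R1": (-0.8,)}  # hypothetical reference sample near A

alone = complete_linkage(participants)
combined = complete_linkage({**participants, **reference})

for name, tree in (("participants only", alone), ("with reference", combined)):
    print(name,
          "B-A:", round(join_height(tree, "B", "A"), 2),
          "B-C:", round(join_height(tree, "B", "C"), 2))
```

In the participants-only run, B joins A first. Once R1 is added, A pairs up with R1, and the furthest-neighbor distance from B to that pair (1.8, measured to R1) is larger than the distance from B to C (1.2). So B joins C first, even though B is still nearer to A (1.0) than to C. That is the same kind of reshuffling as HRP0017 moving away from HRP0014 in the combined tree.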