
Distance Measures

Referring to the dendrogram computed from the admixture results of Harappa Project participants, Thorfinn asked a long time ago:

Interesting that South Indian/Cow Belt Brahmins cluster together; while Punjabi Brahmins are closer to Punjabis.

I can understand the first clustering, assuming that Southern Brahmin communities are a spinoff of northern communities and have maintained relative genetic isolation; and the source Northern Brahmin population differed in original origin from other Cow Belt populations.

But how do both Brahmin communities differ equally from Punjabi/Rajasthani Brahmins; and why is that community closer to other Punjabi populations?

For the project participants, that is indeed what the admixture results show. Why this is the case, I have no idea.

However, there is an issue we have to consider here, one that nsriram raised in a comment:

The euclidean distance doesn’t seem to be the appropriate metric to capture the pairwise similarities. Once you make a commitment to the distance measure then the side effects carry-over into the tree construction.

What is a good distance measure for the similarity or dissimilarity of two people's admixture results? Is Euclidean distance a good one in this case? It is certainly the most common and probably the easiest to use, so we usually default to it.
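To make the default concrete, here is a minimal sketch of Euclidean distance between two admixture results, treating each result as a list of component percentages. The three-component vectors are made-up illustration values, not project data, and the function name is my own.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two admixture vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("admixture vectors must have the same number of components")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical admixture percentages over three components (not real data).
person_a = [60.0, 30.0, 10.0]
person_b = [55.0, 35.0, 10.0]  # 5 points moved from component 1 to component 2
person_c = [55.0, 30.0, 15.0]  # 5 points moved from component 1 to component 3

d_ab = euclidean_distance(person_a, person_b)
d_ac = euclidean_distance(person_a, person_c)

# Both distances are identical: the measure only sees the size of each
# difference, not which components the difference falls on.
print(d_ab, d_ac)
```

In this sketch a 5-point shift counts the same no matter which components are involved.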

However, if we look at the Fst divergences between the ancestral components, we see that the components are not equally distant from one another: some pairs are close and others are far apart. So a 5% difference in C1 might not mean the same thing as a 5% difference in C10.

A solution might be to use a weighted distance, but how should we weight it? The Fst numbers give pairwise distances between the different ancestral populations. If we are focused on a specific population (e.g., South Asians), we could try weighting each component by the Fst value between it and the component associated with that population. But I am not sure that's a good solution either.
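As a rough sketch of one literal reading of that idea, the code below weights each squared difference by the Fst between a focal component and the component in question. The Fst values, the choice of focal component, and the function name are all assumptions for illustration, not numbers from the actual admixture run.

```python
import math

def fst_weighted_distance(a, b, weights):
    """Weighted Euclidean distance: each squared difference is scaled by a weight."""
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("vectors and weights must all have the same length")
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

# Hypothetical Fst values between a focal component (component 1) and
# components 1..3. The focal component has Fst 0 with itself.
fst_from_focal = [0.0, 0.02, 0.15]

# Hypothetical admixture percentages (not real data).
person_a = [60.0, 30.0, 10.0]
person_b = [55.0, 35.0, 10.0]  # 5 points moved to the close component 2
person_c = [55.0, 30.0, 15.0]  # 5 points moved to the distant component 3

d_ab = fst_weighted_distance(person_a, person_b, fst_from_focal)
d_ac = fst_weighted_distance(person_a, person_c, fst_from_focal)

# Unweighted, these two pairs would be equally far apart; with the weights,
# the shift toward the more divergent component counts for more.
print(d_ab, d_ac)
```

One side effect of this reading is that the focal component gets a weight of zero, so a difference in it is only counted indirectly through the other components. That may be one more reason to be cautious about the approach.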

In the end, a Euclidean distance measure gives us a rough idea of the differences between admixture results, but it should not be used to explain minor differences or to infer phylogenies.

Harappa and Reference I Dendrograms

After looking at the Harappa dendrogram and the Reference I dendrogram, I thought I would combine them to see where our project participants fit.

Then I got more curious. I wanted to see a similarity tree of all the samples in Reference I (2,654) plus the 40 Harappa participants I have processed so far. That tree came out so huge that it was impossible to save it in a legible form. In the end I compromised by selecting only the South Asian samples from the Reference I dataset and combining them with the Harappa data. Unfortunately, that tree is not informative for the Iranian and European-admixed participants. I'll have to analyze those separately.
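For anyone who wants to try the same kind of thing, here is a minimal sketch of the two steps: keep only the South Asian reference samples, add the project participants, and cluster on Euclidean distance. It assumes average linkage, which may not be the method behind the actual dendrogram, and all sample names, region labels, and admixture values are made up.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two admixture vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(samples, left, right):
    """Average pairwise distance between the members of two clusters."""
    total = sum(euclidean(samples[p], samples[q]) for p in left for q in right)
    return total / (len(left) * len(right))

def average_linkage(samples):
    """Agglomerative clustering; returns the merges as (left, right, distance)."""
    clusters = [(name,) for name in samples]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        left, right = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(samples, pair[0], pair[1]),
        )
        merges.append((left, right, cluster_distance(samples, left, right)))
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left + right)
    return merges

# Hypothetical reference samples: name -> (region, admixture percentages).
reference = {
    "REF_SouthIndian_1": ("South Asian", [70.0, 25.0, 5.0]),
    "REF_Punjabi_1": ("South Asian", [50.0, 45.0, 5.0]),
    "REF_French_1": ("European", [2.0, 38.0, 60.0]),
}
# Hypothetical project participants (the ID format is illustrative).
participants = {"HRP0001": [68.0, 27.0, 5.0], "HRP0002": [52.0, 42.0, 6.0]}

# Keep only the South Asian reference samples, then add the participants.
combined = {
    name: vec for name, (region, vec) in reference.items() if region == "South Asian"
}
combined.update(participants)

merges = average_linkage(combined)
for left, right, dist in merges:
    print(f"{dist:6.2f}  {left} + {right}")
```

The list of merges is what a dendrogram draws: each line joins two branches at the height given by the distance.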

Anyway, here's the South Asian Admixture Dendrogram in PDF format. That means you can search for "HRP" to find all the project members, which is why I prefer PDF to an image in this case.

Note that Singapore Indians are a very good stand-in for South Indians.