Reference 3 + Yunusbayev + HAP PCA and Mclust

I ran Principal Component Analysis (PCA) on Reference 3 along with the Yunusbayev et al. Caucasus dataset and the Harappa Ancestry Project participants (up to HRP0200).

Then I ran mclust on the first 70 PCA dimensions. The resulting 156 clusters can be seen in a spreadsheet.
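
To illustrate the general idea of clustering on PCA coordinates, here is a minimal Python sketch. It is not the code used for this post: mclust is an R package that fits Gaussian mixture models and selects among covariance structures and cluster counts, whereas this sketch assumes a fixed number of clusters and diagonal covariances. The names `toy_data`, `pca_scores` and `membership_probabilities` are hypothetical, and the six toy rows are made-up numbers standing in for real genotype data.

```python
import math
import random


def pca_scores(rows, k, iters=200, seed=0):
    """Project rows onto their first k principal axes (power iteration)."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(x[a] * x[b] for x in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    axes = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(c * c for c in w))
            v = [c / norm for c in w]
        lam = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
        # Deflate so the next pass finds the next-largest axis.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)]
               for a in range(d)]
        axes.append(v)
    return [[sum(x[j] * ax[j] for j in range(d)) for ax in axes]
            for x in centered]


def membership_probabilities(points, n_clusters, iters=50, floor=1e-6):
    """Fit a diagonal-covariance Gaussian mixture by EM; return posteriors."""
    n, d = len(points), len(points[0])
    # Deterministic farthest-point initialisation of the cluster means.
    means = [list(points[0])]
    while len(means) < n_clusters:
        means.append(list(max(points, key=lambda p: min(
            sum((p[j] - m[j]) ** 2 for j in range(d)) for m in means))))
    variances = [[1.0] * d for _ in range(n_clusters)]
    weights = [1.0 / n_clusters] * n_clusters
    resp = []
    for _ in range(iters):
        # E-step: posterior probability of each cluster for each point.
        resp = []
        for p in points:
            logs = [math.log(weights[c]) + sum(
                -0.5 * math.log(2 * math.pi * variances[c][j])
                - (p[j] - means[c][j]) ** 2 / (2 * variances[c][j])
                for j in range(d)) for c in range(n_clusters)]
            top = max(logs)
            expd = [math.exp(v - top) for v in logs]
            total = sum(expd)
            resp.append([e / total for e in expd])
        # M-step: re-estimate weights, means and variances.
        for c in range(n_clusters):
            nc = max(sum(r[c] for r in resp), 1e-12)
            weights[c] = nc / n
            means[c] = [sum(r[c] * p[j] for r, p in zip(resp, points)) / nc
                        for j in range(d)]
            variances[c] = [max(sum(r[c] * (p[j] - means[c][j]) ** 2
                                    for r, p in zip(resp, points)) / nc, floor)
                            for j in range(d)]
    return resp


# Made-up toy data: two well-separated groups of three individuals each.
toy_data = [
    [0.0, 0.1, 0.2], [0.2, 0.0, 0.1], [0.1, 0.2, 0.0],
    [5.0, 5.1, 5.2], [5.2, 5.0, 5.1], [5.1, 5.2, 5.0],
]
scores = pca_scores(toy_data, k=2)          # keep the first 2 dimensions
probs = membership_probabilities(scores, n_clusters=2)
labels = [row.index(max(row)) for row in probs]
```

Each row of `probs` plays the role of one row of cluster columns in the spreadsheet: one probability per cluster, summing to 1.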

For Harappa Ancestry Project participants, the value in each cluster column is that person's probability of belonging to that cluster. For example, a 1 in CL15 means that person has a 100% probability of being in cluster CL15.

For the reference population groups, I have added up the probabilities of all the individuals belonging to that group, so a group's value in a column can be greater than 1.
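
As a small illustration of how the per-group numbers come about, here is a Python sketch that sums individual membership probabilities by group. The IDs, group names and probabilities below are made up for the example, and `group_totals` is a hypothetical helper, not something taken from the actual spreadsheet.

```python
from collections import defaultdict

# Made-up membership rows: (individual ID, group, {cluster: probability}).
rows = [
    ("REF_1", "group-x", {"CL10": 1.0, "CL11": 0.0}),
    ("REF_2", "group-x", {"CL10": 0.25, "CL11": 0.75}),
    ("REF_3", "group-y", {"CL10": 0.0, "CL11": 1.0}),
]


def group_totals(rows):
    """Add up each cluster's probabilities over all individuals in a group."""
    totals = defaultdict(lambda: defaultdict(float))
    for _individual, group, probs in rows:
        for cluster, p in probs.items():
            totals[group][cluster] += p
    return {group: dict(clusters) for group, clusters in totals.items()}


totals = group_totals(rows)
```

Here `totals["group-x"]["CL10"]` is 1.25, which shows why a reference group's value can exceed 1 even though every individual's probabilities sum to 1.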


  1. Hi Zack, I am HRP142, and my CL11=1.
    In layman's terms, I cluster with others who have CL11=1? (Seems to be the obvious conclusion, I guess.)
    Thank you for running this.

    • Yes. 1 or more.

      I have added a couple of sentences to the post explaining it.

      • Thanks! Are neighbouring CLs closely related to each other? I.e., is CL11 closer to CL10 than to, say, CL100? Another obvious question, I guess. I ask because I do not cluster with anyone else from my social group, who all seem to be in CL10. (I am South Indian and cluster with North Indians... bug?)

  2. Here is an Excel file that shows which participants cluster together under which cluster:

    All probabilities have been rounded to the nearest whole number, so if you have mixed clustering, it will put you with the group you cluster with more.

  3. What does each cluster represent? What does cluster 1 stand for, and so on?

    • It seems you cannot generalize what they are, only that they group people with similar genes together based on the source data. For example, the Gujarati-b samples seem to be spread over different clusters, while others are confined to specific ones.

    • The clusters are computed from the PCA results for all the individuals.