Reference 3 + Yunusbayev + HAP PCA and Mclust

Posted by Zack on December 19, 2011

I ran Principal Component Analysis (PCA) on Reference 3 together with the Yunusbayev et al. Caucasus dataset and the Harappa Ancestry Project participants (up to HRP0200).

Then I ran mclust on the first 70 dimensions. The resulting 156 clusters can be seen in a spreadsheet.

For individuals belonging to the Harappa Ancestry Project, the value in a column shows that person's probability of being in that cluster. So if there is a 1 in CL15, for example, then that person has a 100% probability of being in cluster CL15.
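As a rough illustration of where such a probability comes from: mclust fits a Gaussian mixture model, and each value is the posterior probability of a cluster given the individual's coordinates. The sketch below is not the code used for this post. It assumes a made-up two-cluster mixture on a single principal component, and the labels, weights, means and variances are all hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def membership_probabilities(x, components):
    """Posterior probability of each cluster for one observation x.

    components maps a cluster label to (weight, mean, variance).
    """
    weighted = {
        label: weight * gaussian_pdf(x, mean, var)
        for label, (weight, mean, var) in components.items()
    }
    total = sum(weighted.values())
    return {label: value / total for label, value in weighted.items()}

# Hypothetical mixture: two equally weighted clusters on one component.
components = {"CL1": (0.5, 0.0, 1.0), "CL2": (0.5, 10.0, 1.0)}

probs_clear = membership_probabilities(0.2, components)  # deep inside CL1
probs_mixed = membership_probabilities(5.0, components)  # halfway between the two
```

An individual close to one cluster centre gets a value near 1 for that cluster, while someone sitting between two clusters has the probability split across both columns.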

For the reference population groups, I have added up the probabilities for all the individuals belonging to that group.
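The summing step for the reference groups can be sketched like this. It is only an illustration with invented sample names, group names and probabilities, not the script behind the spreadsheet.

```python
from collections import defaultdict

def sum_by_group(memberships, groups):
    """Add up per-individual cluster probabilities within each group."""
    totals = defaultdict(lambda: defaultdict(float))
    for individual, probs in memberships.items():
        group = groups[individual]
        for cluster, p in probs.items():
            totals[group][cluster] += p
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {group: dict(clusters) for group, clusters in totals.items()}

# Hypothetical per-individual cluster probabilities.
memberships = {
    "sample_a": {"CL1": 1.0},
    "sample_b": {"CL1": 0.25, "CL2": 0.75},
    "sample_c": {"CL2": 1.0},
}
# Hypothetical mapping from individual to reference group.
groups = {"sample_a": "GroupX", "sample_b": "GroupX", "sample_c": "GroupY"}

group_totals = sum_by_group(memberships, groups)
```

Because the values are sums, a group's entry for a cluster can exceed 1, and the entries in a group's row add up to the number of individuals in that group.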



8 Comments.

SB December 20, 2011 at 7:18 pm

Hi Zack, I am HRP142, and my CL11=1.
In layman's terms, I cluster with others who have CL11=1? (Seems to be the obvious conclusion, I guess.)
Thank you for running this.
- Zack December 20, 2011 at 7:25 pm
  
  Yes. 1 or more.
  
  I have added a couple sentences in the post explaining it.
  - SB December 20, 2011 at 7:32 pm
    
    Thanks! Are neighbouring CLs closely related to each other? I.e., is CL11 closer to CL10 than, say, CL100? Another obvious question, I guess. I ask because I do not cluster with anyone else from my social group, who all seem to be in CL10. (I am South Indian and cluster with North Indians... bug?)
    - Zack December 20, 2011 at 10:19 pm
      
      The order of the clusters in the spreadsheet is random.
SB December 20, 2011 at 8:59 pm

Here is an Excel file that shows which participants cluster together under which cluster: http://tinyurl.com/77vojhv

All probabilities have been rounded to the nearest whole number, so if you have mixed clustering, it will put you with the group you cluster with more.
JDP December 21, 2011 at 10:53 am

what does each cluster represent? What does cluster 1 stand for and so on?
- SB December 21, 2011 at 1:35 pm
  
  It seems like you cannot generalize what they are, only that they group people with similar genes together based on the source data. For example, the Gujarati-b samples seem to be spread over different clusters, while others are confined to specific ones.
- Zack December 21, 2011 at 5:19 pm
  
  The clusters are computed from the PCA results for all the individuals.
