Reference II PCA

I ran PCA on the Reference II dataset, which includes 3,161 samples from various populations but only 23,000 SNPs in common.

Here are the top ten eigenvalues:

  • 219.225396
  • 146.835968
  • 20.719760
  • 9.721733
  • 7.552482
  • 6.216977
  • 3.991663
  • 3.484690
  • 3.106919
  • 2.805874

The first two eigenvalues are much bigger than the rest: the first explains 7.12% of the variation and the second 4.77%. Even so, the Tracy-Widom stats show that about 54 eigenvectors are significant.
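As a rough check on those percentages, here is a minimal Python sketch. It assumes that the share of variation is simply each eigenvalue divided by the sum of all eigenvalues. The full eigenvalue list is not given here, so the total is back-calculated from the reported 7.12% and 4.77%; that total is an inference, not a reported number.

```python
# Top ten eigenvalues as listed above.
eigenvalues = [
    219.225396, 146.835968, 20.719760, 9.721733, 7.552482,
    6.216977, 3.991663, 3.484690, 3.106919, 2.805874,
]

# Reported shares of total variation for the first two components.
reported_share = {0: 0.0712, 1: 0.0477}

# Assumption: share = eigenvalue / total, so total = eigenvalue / share.
implied_totals = {i: eigenvalues[i] / s for i, s in reported_share.items()}

# Ratio between the first two eigenvalues.
ratio_1_to_2 = eigenvalues[0] / eigenvalues[1]

# Share of each listed eigenvalue, using the total implied by the first
# component (an inferred value, since the full list is not shown).
total = implied_totals[0]
shares = [ev / total for ev in eigenvalues]

for i, (ev, sh) in enumerate(zip(eigenvalues, shares), start=1):
    print(f"PC{i}: eigenvalue {ev:.6f}, share {sh:.2%}")
print(f"implied totals: {implied_totals[0]:.1f} and {implied_totals[1]:.1f}")
print(f"eigenvalue 1 / eigenvalue 2 = {ratio_1_to_2:.2f}")
```

The two implied totals agree to within rounding of the reported percentages, and under the same assumption the third component already drops below 1% of the variation.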

Here are the plots for the first 10 principal components. Remember that the 1st eigenvalue is about 1.5 times the 2nd.

Here is a 3-D PCA plot (hat tip: Doug McDonald) showing the first three eigenvectors. The plot is rotating about the 1st eigenvector, which is vertical. Also, I have stretched each principal component according to its corresponding eigenvalue.
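One way to do that kind of stretching is sketched below in Python. The exact scaling used for the plot is not stated, so this assumes each axis is multiplied by the square root of its eigenvalue, which makes axis lengths proportional to the standard deviation along each component. The sample names and coordinates are hypothetical placeholders, not values from the dataset.

```python
import math

# First three eigenvalues from the list above.
eigenvalues = [219.225396, 146.835968, 20.719760]

# Hypothetical unscaled eigenvector coordinates for three samples.
samples = {
    "sample_a": [0.021, -0.004, 0.013],
    "sample_b": [-0.015, 0.030, -0.002],
    "sample_c": [0.002, -0.018, 0.040],
}


def stretch(coords, eigenvalues):
    """Scale each coordinate by the square root of its eigenvalue (assumed rule)."""
    return [c * math.sqrt(ev) for c, ev in zip(coords, eigenvalues)]


stretched = {name: stretch(c, eigenvalues) for name, c in samples.items()}

# Length of each axis relative to the first one under this scaling.
relative_axis = [math.sqrt(ev / eigenvalues[0]) for ev in eigenvalues]

for name, coords in stretched.items():
    print(name, [round(c, 4) for c in coords])
print("relative axis lengths:", [round(r, 3) for r in relative_axis])
```

Under this assumption the second axis is about 0.82 times as long as the first and the third about 0.31 times, so the third dimension looks much flatter than it would in an unscaled plot.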

I also ran MClust on the PCA data and got 17 clusters. The results are in a spreadsheet. I am sure with more principal components than the 10 I used, I would be able to deduce finer population structure.

Do take a look at the clusters assigned to the South Asian populations from Xing et al.


  1. my experience is that fine-grained distinctions in PCA are more robust to sub-100K SNP marker sets, so i think ref ii is more useful for PCA than for ADMIXTURE. more explicitly, the gap in robustness between ref i and ref ii is far greater in ADMIXTURE than in PCA.

  2. humayun luwi libi

    Why does the Europe component form the apex of the thing?