South Asian PCA

I used Eigensoft to create a PCA plot of the South Asians in our Reference I dataset (a total of 398 samples) along with the first batch of South Asian Harappa Project participants (HRP0001 to HRP0009).

The PCA software removed 2 Makranis, 1 Sindhi, 1 Balochi and 1 Brahui as outliers, leaving 402 samples (398 reference + 9 participants - 5 outliers) for the PCA.
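For readers curious about what this kind of outlier removal involves, here is a minimal Python sketch of the general idea: run a PCA, drop any sample that sits more than a set number of standard deviations from the mean along one of the top components, and repeat. This is not Eigensoft's code. The function names are hypothetical, the threshold values (`n_evec`, `sigma`, `max_iter`) are illustrative assumptions rather than the settings used for this post, and the sketch only centers the data instead of applying a genotype-specific normalization.

```python
import numpy as np

def pca_scores(X, k):
    """Return (scores on the top-k components, all eigenvalues) for a samples x markers matrix."""
    Z = X - X.mean(axis=0)                       # center each marker (column)
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    k = min(k, len(S))
    eigvals = S ** 2 / (X.shape[0] - 1)          # eigenvalues of the sample covariance
    return U[:, :k] * S[:k], eigvals

def pca_with_outlier_removal(X, n_evec=10, sigma=6.0, max_iter=5):
    """Iteratively drop samples lying more than `sigma` SDs out on any top component."""
    keep = np.arange(X.shape[0])                 # indices of samples still in the analysis
    for _ in range(max_iter):
        scores, _ = pca_scores(X[keep], n_evec)
        sd = scores.std(axis=0)
        sd[sd == 0] = 1.0                        # guard against degenerate components
        z = np.abs(scores - scores.mean(axis=0)) / sd
        outlier = (z > sigma).any(axis=1)
        if not outlier.any():
            break
        keep = keep[~outlier]
    scores, eigvals = pca_scores(X[keep], n_evec)  # final PCA on the retained samples
    return keep, scores, eigvals
```

Calling `pca_with_outlier_removal(X)` on a numeric samples-by-markers array returns the indices of the retained samples along with their scores and the eigenvalues, so the removed samples are simply those whose indices are missing from `keep`.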

Here are the plots for the first four eigenvectors.

South Asian PCA eig1 vs eig2

South Asian PCA eig1 vs eig3

South Asian PCA eig2 vs eig3

South Asian PCA eig1 vs eig4

South Asian PCA eig2 vs eig4

South Asian PCA eig3 vs eig4

If you have seen the South Asian plot at 23andMe, the first plot here isn't very different, except that it seems rotated.

UPDATE: Eigenvectors 1 through 4 explain 1.12%, 0.77%, 0.71% and 0.44% of the total variance.
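As a quick illustration of where such percentages come from: the share of variance explained by an eigenvector is its eigenvalue divided by the sum of all eigenvalues. Here is a small Python sketch; the function name is hypothetical and the eigenvalues in the comment are made up for illustration, not taken from this analysis.

```python
def variance_explained(eigenvalues):
    """Fraction of total variance explained by each eigenvector.

    `eigenvalues` must contain ALL eigenvalues from the PCA, not just the
    top few, otherwise the denominator (total variance) is too small.
    """
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]

# Made-up example: three eigenvalues 2.0, 1.0 and 1.0
# give fractions 0.5, 0.25 and 0.25 of the total variance.
fractions = variance_explained([2.0, 1.0, 1.0])
```

The same division shows how much the top components capture together: the four percentages quoted above add up to 3.04% of the total variance.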

8 Comments.

  1. Interesting, eigenvector 4 seems to capture variation between one group of Gujaratis and everyone else. Could you tell us the size of the eigenvalues?

    By the way, ran some new PCAs with the HapMap Gujaratis and without the Amerindian populations. (I also went back to using R. gnuplot is too hard.) Here are the full PCA and a zoom into the South Asian cluster. Pretty much what you'd expect, I think. First dimension captures about 3.3 times more variation than the second one.

    I also did a couple of ADMIXTURE runs at K=10 and higher. (Spreadsheet and barplot.) There's some weird/interesting stuff going on at higher K's, like a splitting-off of a Kurdish component from the generalized Caucasian/Kurdish component (both of which are present in the South Asian populations), but I'm reluctant to put it up without examining the most likely possibility -- that I'm screwing something up.

  2. Who are those Houston Gujus? | Gene Expression | Discover Magazine - pingback on February 14, 2011 at 6:38 pm
  3. Who are those Houston Gujus? | Biology News by Biologged - pingback on February 14, 2011 at 8:31 pm
  4. Any idea how Dr. Doug McDonald generates this? http://www.scs.illinois.edu/~mcdonald/PCA84pops.html

  5. Singapore Indians | Harappa Ancestry Project - pingback on March 11, 2011 at 10:14 am