Ref 1 South Asians + Harappa PCA

I ran PCA on the South Asian populations included in the Reference I dataset (excluding Kalash and Hazara) as well as 38 South Asian participants of the Harappa Project. I excluded the Kalash and Hazara because they are so distinct that they usually dominate a South Asian PCA plot.

The reference populations included are: Balochi, Bnei Menashe Jews, Brahui, Burusho, Cochin Jews, Gujaratis (divided into two groups), Makrani, Malayan, North Kannadi, Paniya, Pathan, Sakilli, Sindhi, and Singapore Indians.

Here's the spreadsheet showing the eigenvalues and the first 15 principal components for each sample.

I computed the PCA using Eigensoft, which removed 26 samples as outliers. The Tracy-Widom statistics show that about 30 eigenvectors are significant.
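For readers wondering what "removed as outliers" means in practice, here is a minimal sketch of the general idea: flag any sample whose score on some component lies many standard deviations from the mean. The post does not say which settings were used. The 6 standard deviation threshold below is an assumption based on what is commonly described as Eigensoft's default, and the function name and toy data are hypothetical. Eigensoft also repeats the PCA after each removal round, which this sketch does not do.

```python
from statistics import mean, pstdev


def flag_outliers(scores, sigma=6.0):
    """Return sorted indices of samples that lie more than `sigma`
    standard deviations from the mean on any component.

    `scores` is a list of rows, one per sample, each row holding that
    sample's coordinates on the components being checked.
    The default sigma=6.0 is an assumption, not a setting from the post.
    """
    flagged = set()
    for k in range(len(scores[0])):
        column = [row[k] for row in scores]
        mu, sd = mean(column), pstdev(column)
        if sd == 0:
            continue  # no spread on this component, nothing to flag
        for i, value in enumerate(column):
            if abs(value - mu) > sigma * sd:
                flagged.add(i)
    return sorted(flagged)


# Toy data (made up): 60 tightly clustered samples plus one far away.
toy_scores = [[0.01 * (i % 5 - 2)] for i in range(60)] + [[5.0]]
outliers = flag_outliers(toy_scores)
print(outliers)
```

With a threshold this strict, a sample can only be flagged when there are enough other samples to pin down the spread, which is why the toy example uses 61 points rather than a handful.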

Here are the first 15 eigenvalues:

1 3.874124
2 1.819077
3 1.663232
4 1.335721
5 1.293500
6 1.242984
7 1.230921
8 1.225775
9 1.222177
10 1.214539
11 1.212808
12 1.204000
13 1.198930
14 1.195450
15 1.192848

Here is a 3-D PCA plot (hat tip: Doug McDonald) showing the first three eigenvectors. The plot rotates about the 1st eigenvector, which is vertical. Also, I have stretched the principal components based on the corresponding eigenvalues.

Now here are plots of the first 14 eigenvectors. In this case, I have not stretched the principal components, so keep in mind that the first eigenvector explains 3.874124/1.819077 = 2.13 times as much variation as the 2nd eigenvector.
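The same comparison can be made for every eigenvector, not just the second. The sketch below uses the 15 eigenvalues listed above to compute how many times more variation the first eigenvector explains than each of the others. It also shows one way of "stretching" a sample's coordinates. The post does not say exactly how the stretching was done, so scaling each coordinate by the square root of its eigenvalue is an assumption, and the function names and sample coordinates are hypothetical.

```python
# The first 15 eigenvalues as listed in the post.
eigenvalues = [
    3.874124, 1.819077, 1.663232, 1.335721, 1.293500,
    1.242984, 1.230921, 1.225775, 1.222177, 1.214539,
    1.212808, 1.204000, 1.198930, 1.195450, 1.192848,
]


def ratios_to_first(values):
    """How many times more variation the 1st eigenvector explains
    than each eigenvector in `values`."""
    return [values[0] / v for v in values]


def stretch(coords, values):
    """Scale each coordinate by the square root of its eigenvalue.

    ASSUMPTION: the square-root scaling is one common convention;
    the post only says the components were stretched 'based on'
    the eigenvalues.
    """
    return [c * v ** 0.5 for c, v in zip(coords, values)]


ratios = ratios_to_first(eigenvalues)
for rank, ratio in enumerate(ratios, start=1):
    print(f"eigenvector {rank}: {ratio:.2f}")

# Hypothetical coordinates for one sample on the first three components.
stretched = stretch([0.05, -0.02, 0.01], eigenvalues)
```

Because the eigenvalues from the 5th onward are all close to 1.2, those later components each explain a similar, small share of the variation, so differences between samples on them deserve less weight than differences on the first few.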

UPDATE: At the bottom of the 3-D plot, you can see a dropdown. Just select one of the project participants from there and that participant's dot in the plot will become bigger so it is easy to spot.


  1. So, where did you get that Reich, et al. dataset? Did you get it from a disreputable genome dealer in a seedy pub in Zanzibar after a tip from Blaise Li? Did you defeat David Reich in man-to-man unarmed combat? Tell us the story!

  2. I'm sorry but I'm so confused about these results. I'm new to all this, so, as you can expect, I don't understand these results at all. Can somebody please explain to me what all these numbers mean? Also, some of my results differed considerably from other people's results, especially for PC9 and PC13 (I'm hrp0044 by the way :P). Sorry for being such a pain (lol) but can somebody please explain what this all means? Thanks

    • I am sorry I just assumed everyone was very familiar with PCA plots. For years, PCA and its variations played an important part in my life, so I am prone to think they are common knowledge.

      I need to write a couple of posts explaining PCA, Admixture results, etc.

  3. I love the 3d PCA plots, but is there any way to tell where exactly I am on that one?

    • On the 3-D plot? It would be a little hard right now. You could look up your coords from the spreadsheet, figure out where you are on the 2-D plots and then try it on the 3-D one.

      I am thinking of providing a dropdown of project participant IDs which would highlight their location on the 3-D plot, but that would require some reprogramming of the javascript.

  4. Harappa Participants on 3-D PCA | Harappa Ancestry Project - pingback on March 17, 2011 at 1:09 pm
  5. Hey Zack, when you select a User ID from the drop-down list, all that's supposed to happen is that the user's spot on the map is supposed to become bigger, right? It's sort of confusing to discern where exactly they cluster for some of the users, as you seem to have labeled both the reference populations and the project participants according to their state/region of origin. E.g., Punjabi 12 is clustering with the Pathans and Burusho, and so is the Rajasthani Brahmin, yes? Am I reading this right?

  6. Austroasiatic Dataset | Harappa Ancestry Project - pingback on March 20, 2011 at 8:08 am
  7. It's funny to see Dravidian speakers on both extreme ends of the spectrum.

  8. The Burusho and Gujaratis are the de-facto "outliers" here as they sit laterally to the Malayan-Brahui continuum. I am not familiar with the history of Gujarat to speculate why they are distal; isolation explains the Burusho position.

    Any chance of creating a similar PCA for Iran and the Caucasus, Zack? Excellent work.

    • The Gujaratis that are the outliers are those who form a tight cluster in all sorts of analyses. Razib's guess (and mine too) is that they belong to a single clan or something. I am not sure whether they are the typical ones, or the others (gujaratis-b) who are on the main axis of variation. Or both.

      I have a PCA of all the populations done. I just need to generate the appropriate plots.

      I am also thinking about which populations to include for a (sort of) Near Eastern PCA. I was thinking Iran, Turks, Caucasus, Pakistan (excluding Kalash and Burusho), Indian Punjab. Not sure whether to include the local Jewish populations. What do you think? There's also the Levant and the Arabian peninsula.

      • Gujarati-a cluster together, implying closeness, and they are off the main South Asian axis, implying an off-axis component, right? If so, can we resolve that component? In the Burusho the off-axis component appears to be an East Asian one.

      • I agree, that is probably the cause for this unusual Gujarati "outlier" status.

        At the very least, (most of) Pakistan, Indian Punjab, the Caucasus, Turkey, Iran and the Levant should be included in the analysis. I am not certain how informative including the Arabian Peninsula and local Jewish populations would be.

  9. Would a homozygosity test on those particular Gujaratis that are outliers be able to determine whether they are inbred and possibly from an extended family group? The cluster seems very tight even for people from a similar tribe.

  10. Ref1 South Asian + Harappa Admixture | Harappa Ancestry Project - pingback on March 23, 2011 at 1:03 am