Harappa Gene Similarity

I was looking at Simranjit's DNA Tribes results and I thought I could provide you guys a list of how similar (part of) your genome is to different reference populations in a somewhat similar way to DNA Tribes results.

Basically, I computed an IBS (identity by state) matrix for all Harappa participants from HRP0001 to HRP0080 and my Reference II samples (info). These similarities are the same as those shown by the Genome-wide comparison feature at 23andMe.
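As a rough illustration of what an IBS matrix is, here is a minimal sketch. It is not the code used for this analysis, and the post does not say which tool was used. It assumes genotypes are coded as 0/1/2 copies of one allele per SNP, with -1 for a missing call, and that a pair of samples shares `2 - |g1 - g2|` alleles at each SNP. The function name `ibs_similarity` is hypothetical.

```python
import numpy as np

def ibs_similarity(genotypes):
    """Pairwise IBS similarity for an (n_samples, n_snps) array.

    Assumed coding: 0, 1 or 2 copies of one allele; -1 marks a missing call.
    """
    g = np.asarray(genotypes, dtype=float)
    n = g.shape[0]
    sim = np.ones((n, n))  # a sample is identical to itself
    for i in range(n):
        for j in range(i + 1, n):
            # Only compare SNPs that are called in both samples.
            both = (g[i] >= 0) & (g[j] >= 0)
            n_called = both.sum()
            if n_called == 0:
                sim[i, j] = sim[j, i] = np.nan
                continue
            # Alleles shared at each SNP: 2, 1 or 0.
            shared = 2.0 - np.abs(g[i, both] - g[j, both])
            sim[i, j] = sim[j, i] = shared.sum() / (2.0 * n_called)
    return sim
```

Multiplying the resulting values by 100 gives similarity percentages like the ones in the spreadsheet.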

Then I took the median similarity percentage between you and a reference population group. I found that median worked better here than mean as the mean was affected a lot by some outlier samples in the reference data.
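To show why the median holds up better than the mean here, the following sketch summarizes one participant's similarities per reference population. The function name `population_summary`, the population labels and the numbers in the usage note are all made up for illustration; they are not values from the spreadsheet.

```python
from collections import defaultdict
import numpy as np

def population_summary(similarities, populations, stat=np.median):
    """Summarize one participant's similarities by reference population.

    similarities: similarity of the participant to each reference sample.
    populations:  population label of each reference sample (same order).
    stat:         summary function, e.g. np.median or np.mean.
    """
    groups = defaultdict(list)
    for value, label in zip(similarities, populations):
        groups[label].append(value)
    return {label: float(stat(values)) for label, values in groups.items()}
```

For example, with made-up similarities `[0.73, 0.74, 0.735, 0.60]` all labelled `"pop_a"`, where the last sample is an outlier, the median is 0.7325 while the mean is dragged down to about 0.701.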

Of course, since I am giving you a big discount compared to DNA Tribes, I am not doing a nice individual report. Instead, all you get is one spreadsheet including everyone. Click on your ID in the column headers to sort by your similarity to the different reference populations.

We see four outliers among the project participants who don't match any reference population very well. One is HRP0074, a Brazilian, which is expected since I don't have any Native American populations. Then there are me (HRP0001) and my sister (HRP0035), which was already well known. Finally, there is HRP0044, a Kashmiri.

Do note that this analysis was done using about 20,000 SNPs.


  1. HRP0019, who is a Punjabi Brahmin, scores highest with TN Brahmins, whereas the Tamil Brahmins (HRP0014, 16, 17, 41, 48, 72, 75 and 79) score highest with AP Madiga, TN Dalit, AP Madiga, AP Madiga, Singapore Indians, AP Madiga, TN Dalit and AP Madiga respectively.


    • Indeed, looking at the 3 Punjabi Jatts, the values are odd as well; something is not right. Some of the high matches simply don't make sense.

      • It may be a function of the ~20k SNPs that the Xing et al. reference populations have in common with the Illumina-tested SNPs.

      • Oh, also, IBS is computed over all SNPs, and not just the ones that reflect the main dimensions of variation between populations. So you wouldn't expect it to match up with population structure the way a PCA plot would.

        (That's also why the "Compare Genes" feature on 23andMe doesn't correspond as well to population structure.)

        • More than the small number of SNPs, this is the main issue with this analysis. It's not intended to bring out population structure. It's a simple one-to-one comparison.

          • Well, in light of that, 'Compare Genes' is very arbitrary as far as South Asians are concerned. For example, my second highest match is one of Razib Khan's relatives. I mean, why should a Bengali score better with me than fellow Tamil Brahmins? Lame.

          • Hmmm .... this triggered a thought for me, though I approach this with little knowledge of how these things are actually calculated. If similarities are calculated based on "chunks" matching where "subchunks" are actually what are inherited, that could explain how someone who is an Indo-Egyptian mix could falsely appear to be further from both than Indians and Egyptians are from each other. Consider 2 original populations mixing to produce a third population (NW Subcontinentals). Consider that population then mixing with a third population to produce Indo-Egyptians.

            orig1 - aaaa bbbb cccc dddd eeee

            orig2 - ssss bbbb jjjj dddd mmmm

            orig3 - ssss wwww cccc dddd eeee

            mix1 (o1o2) - ssaa bbbb jjcc dddd mmee

            mix2 (m1o3) - sssa bbww jccc dddd meee

            o1o2 likeness count - 2
            o1o3 likeness count - 3
            o2o3 likeness count - 2
            m1o1 likeness count - 2
            m1o2 likeness count - 2
            m1o3 likeness count - 1
            m2o1 likeness count - 1
            m2o2 likeness count - 1
            m2o3 likeness count - 1
            m1m2 likeness count - 1

            Could this be what's happening with you Zack? Maybe I am completely off.

          • IBS similarity is measured SNP by SNP, allele by allele.

  2. HRP005, HRP006 and HRP008 are most similar to Gujaratis, TN Brahmin and Gujaratis-b. Super odd indeed. I'd expect the Punjabis to be similar to the Pathan, Sindhi or Arain reference populations, and the TN Brahmins to the TN Brahmin samples, of course!

  3. Will have an interesting feature up for the para-Caucasus participants (Assyrian, Iranians, Georgian) in a bit. Will post when complete.

  4. Thanks Zack! Very interesting. 🙂 I wanted to ask who are the Utahns and "Armenians-b"?

    • Utahns are White (European) Americans from the state of Utah, taken from the HapMap dataset.

      Armenians-b are those reference samples of Armenians which clustered far from the main Armenian reference populations. These have much higher Northern European admixture than the others for some reason.

  5. kiss my dalit ass you privileged brahmanists! or whatever i'm supposed to say 🙂 zack, i expect some special treatment now 🙂

  6. 0044 (me) might be an outlier as I think I might belong to a genetic isolate.

    • Are you aware of any non-South Asian ancestry, as your highest sharing is just 72.76% (Andhra Pradesh)?

      • No, not that I'm aware of. From what I know of my ancestors, they were all Kashmiris. I was surprised when I saw that I scored highest for Andhra Pradesh as it's so far from Kashmir.

  7. The Harappa Gene Similarity analysis, for my top matches, closely resembles the median ASD matrix value order generated by David for our Assyrian-Jewish study. The only significant difference, among the first few, compared to the Assyrian population median value order, is the Uzbek Jewish rank. Good work, Zack. Thanks!

    1 uzbekistan jews 0.73848
    2 armenians 0.73840
    3 georgians 0.73833
    4 cypriots 0.73808
    5 georgia jews 0.73802
    6 iraq jews 0.73768
    7 azerbaijan jews 0.73746
    8 tuscan 0.73730
    9 adygei 0.73720
    10 lezgins 0.73708
    11 ashkenazy jews 0.73698
    12 turks 0.73695

    David's Assyrians (23andMe samples)
    1 ASY 0.24430
    2 AM 0.24472
    3 GE 0.24509
    4 CY 0.24525
    5 IQJ 0.24587
    6 GJ 0.24597
    7 Tuscan 0.24625
    8 TR 0.24626
    9 SJ 0.24635
    10 UZJ 0.24637
    11 Adygei 0.24651
    12 AJ 0.24663

  8. can anyone please email me with a suggestion for a good company that offers the SNP microarray test
