I have added the HarappaWorld Admixture results for HRP0328-HRP0351 to the individual spreadsheet.

Do note that the admixture components do not necessarily represent real ancestral populations. Also, the names I have chosen for the components should be thought of as mnemonics to ease discussion. I chose them based on which populations in my data these components peaked in. They do not tell you anything directly about ancestral populations. The best way to look at these admixture results is by comparing individuals and populations. Finally, the standard error estimates on these results can be about 1%. Therefore, it is entirely possible that your 1% exotic admixture result is just noise.

I have also updated the group averages.

Since 10 of the 24 new participants are Punjabis (with 8 being Punjabi Jatts), I now have 33 Punjabis in HAP. Therefore, I will try to write about the Punjabi samples and their results next week.


  1. In your write-up, can you provide some estimates of admixtures of various Punjabi clan/caste groups? Did the Jat admixture really occur 1900 YBP?

    If you do so, can you explain what is the basis of time calculations of admixtures? I have never been able to understand that. Specifically, is it the first or last evidence of admixture?

  2. I meant timing of admixture events.

  3. Are all of the sample results based on the same reference data sets or algorithm? It seems to me that the newer samples have more of the East Eurasian-like components (not counting South Indian as East Eurasian) compared to the older ones. What I mean is, do the older official results match the DIY results better? I know the difference between DIY Harappa and the official results is negligible. Just curious.

    • Except for the original participants (HRP0239 and earlier) when I created HarappaWorld more than a year ago, all other participants have been analyzed using the exact same algorithm and reference data. There has been no change at all.

      • Zack, I think Curious is alluding to the original participants (HRP0239 and earlier). Participants after that (using the newer algorithm and reference data) seem to have noisier results and more East Eurasian. The first 5 Punjabi Jatts have a clear difference in their overall East Eurasian component averages (Siberian, NE Asian, Amerindian, Beringian, etc.) compared to the latter 7. The earlier participants' results also seem to be slightly less noisy. Finally, they seem to have a bit higher Baloch (by about 5-9%) compared to the other 7. For some of the individuals, their higher Caucasian (vs. Baloch) compensates for this, but not always. All of these differences may be negligible but it's just a trend I picked up on as well.

        Not sure if you're willing to consider it, but would you mind re-calculating the original participants' results (HRP0239 and earlier) using the newer algorithm or reference data? At least for the Punjabis? Or, perhaps doing the opposite and using the original algorithm for the newer participants (HRP0240 and later)?

        • I have no idea why you guys are calling it a newer algorithm. First, you create an admixture calculator using all available data. Then you compute the component percentages for the newer samples a few samples at a time so that you get the same components. That's how every genome blogger has done it.

          I can redo the whole admixture analysis including everyone, but that won't be the same as HarappaWorld at all. Also, it is a fairly laborious process where I select samples and SNPs. Then I run ADMIXTURE at various K and also compute cross validation error to figure out which K is the best. Also, I have to run ADMIXTURE at a given K multiple times with different seeds to find stable, likely solutions. Some samples, like the Kalash, can create their own useless component which needs to be gotten rid of. Finally, there's bootstrapping and error estimation.
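
The K-selection step described above can be sketched roughly as follows. This is only a minimal illustration, not the actual HarappaWorld pipeline: it assumes ADMIXTURE was run once per K with its `--cv` option and that each log contains a line of the form `CV error (K=16): 0.53010`. The log excerpt and its numbers are made up for the example, and the helper names `parse_cv_errors` and `best_k` are hypothetical.

```python
import re

# Pattern for the cross-validation line ADMIXTURE is assumed to print with --cv.
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")

def parse_cv_errors(log_text):
    """Return {K: cv_error} for every CV line found in the log text."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}

def best_k(cv_errors):
    """Pick the K with the lowest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)

# Hypothetical concatenated log excerpts (illustrative numbers only).
example_logs = """
CV error (K=14): 0.53120
CV error (K=15): 0.53055
CV error (K=16): 0.53010
CV error (K=17): 0.53042
"""

errors = parse_cv_errors(example_logs)
print(best_k(errors))
```

Picking the minimum is the simplest rule. In practice the repeated runs with different seeds at the chosen K would still be needed to check that the solution is stable.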

          • Sorry for my wording Zack. I was alluding to the "updated calculator" which seems to be giving different results which may or may not be negligible. Also, I'm sure the admixture will not be the same as HarappaWorld but perhaps the older (less updated) calculator did yield more similar results than the updated one? Is that possible? I'm not sure without comparing an older participant's results to their HarappaWorld.

          • Also, I'm not sure how long it would take to redo the whole admixture analysis including everyone and the reference samples (only you would have a clear idea) but I would much appreciate it if you did. I think it would possibly make the data more accurate. You mentioned it is a fairly laborious process though, so I understand if you have to put it off for a while or choose not to do it.

          • "Except for the original participants (HRP0239 and earlier) when I created HarappaWorld more than a year ago, all other participants have been analyzed using the exact same algorithm and reference data."

            Would the DIY calculator's results be equivalent to results done in the pre-HRP0239 algorithm?

          • I see no reason why DIY would be more accurate.

          • I agree with Paul. Look at the Jatt Sikh results for older samples versus newer ones. You will notice that the first three samples average 3% of the East Asian-like components whereas the most recent ones average 6%.
            This is a great site for reference purposes but it looks like the older samples cannot be compared properly to the newer samples because of the noise. Each component can change by 1% in the official results compared to the older ones. This would appear negligible. However, the issue is that it adds up. For example, the East Asian-like components total 3% for the older Jatt Sikh samples versus 6% for the newer ones.
            I understand it is a tedious process but I too think the database needs to be updated or revised so that all samples can be properly compared. Maybe create a new database and leave this one as is. The K at 16 should be fine.
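
One rough way to see how per-component noise "adds up" is to combine the standard errors. The sketch below is a back-of-the-envelope calculation only. It takes the roughly 1% standard error from the post, ASSUMES the errors of the components are independent (they are not strictly so, since the components of one individual sum to 100%), and uses hypothetical counts: five East Asian-like components, three older samples and five newer ones. It also ignores real variation between individuals.

```python
import math

def se_of_sum(standard_errors):
    """Standard error of a sum of estimates, ASSUMING independent errors."""
    return math.sqrt(sum(se ** 2 for se in standard_errors))

def se_of_mean_difference(se_a, n_a, se_b, n_b):
    """Standard error of the difference between two group means (independence assumed)."""
    return math.sqrt(se_a ** 2 / n_a + se_b ** 2 / n_b)

# Five East Asian-like components, each with roughly 1% standard error (hypothetical count).
per_sample_se = se_of_sum([1.0] * 5)

# Three older samples vs. five newer ones (the group of five is a hypothetical choice).
diff_se = se_of_mean_difference(per_sample_se, 3, per_sample_se, 5)

print(round(per_sample_se, 2), round(diff_se, 2))
```

Under these assumptions, the observed gap between the two group averages can be divided by `diff_se` to get a rough sense of how many standard errors it spans.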

          • If you guys want me to create a new admixture spreadsheet every time, that will be a once every 6 months process. Also, no calculators would really be comparable to each other.

          • Well if you create a new database using a new calculator the results in the new database cannot be compared to this current one, but if you run admixture on all samples and data sets I think it will be more accurate than this one. So we will be able to compare samples better with the new database.
            I understand it is a very tedious process. I tried using the Admixture software following Razib's instructions and I did find out that it is a very slow process especially for me who has never used the software before.
            You have a lot of data so I am sure it will be a big pain. If you ever have some spare time I think you should definitely create a new calculator. This is a great database for South Asians.

          • I'll try to do a somewhat more systematic analysis of noise first.

          • Do you take donations that would at least compensate for the amount of time you put into this? Definitely something worth investing in for South Asians.

          • Thanks, but since this is a hobby, the major constraint is time.

          • Thank you Zack. Yeah, definitely, noise in the samples should definitely be analyzed first before creating any new database/calculator.

  4. For most Punjabi groups the average South Asian component is 30-35%, which makes sense; even Jatts are at 29% now.

  5. I don't think we can assume the difference between the DIY and official results is negligible until we know more about what Curious asked. We need to see a three-table comparison of the entire set using the older method, newer method, and DIY method. Until then, the rest of us who are not Zack can't really use this data.

  6. I've already shared this with Zack and at Anthrogenica but I'll repost here in case any of the other followers of this blog might be able to shed some light. I'm HRP0349 and kind of confused as to my origin based on my results.

    I have the lowest Baloch (33%) and highest S-Indian (35%) out of any of the Jatts in the spreadsheet who have made their ethnic background public. Nearing the limits for the other Punjabi groups too. My DIY results were even higher (37% S-Indian, 35% Baloch, 11% Caucasian, 10% NE-Euro). The online spreadsheet results dropped a few points in all but Caucasian and added them to the exotic parts (even pygmy!).

    There are some Jatts with low Baloch numbers (only 2-3% higher than mine) but their S-Indian numbers are also typical (under 30%) and thus way off mine (combination of high S-Indian, lower Baloch but ~10% average of Caucasian/NE-Euro).

    Since an unidentified user (HRP0351) submitted similar results and their Caucasian/NE-Euro ratio seems to indicate a Punjabi of Pakistani background, I wonder if we might represent more southeastern heritage with history in Rajasthan, Gujarat, Sindh, etc.

    The reason for this is that I've noticed the Rajputs have higher S-Indian percentages. Several of the Jatt clans from the south, including the Gills which is my paternal lineage, claim descent from Rajput clans of Rajasthan/etc. I don't know if that means some of them literally arose from the Rajputs or just have partial ancestry from them through marriage. A lot of the Jats who didn't convert to Islam were pushed into Sindh/Gujarat/Rajasthan after the Arabs took Sindh and I assume that's when they took up with the Rajputs who appeared shortly before this, allegedly from the Hephthalite invasion. I think Rajput origins are murkier than Jatts. Whether these ruling Rajputs of this area were Hephthalite or not, the people (likely the Jatts) started calling them Rajputs around this time (probably to distinguish them from the Jatts). For all we know they could have been there for a lot longer than the Jatts.

    The Baloch numbers for Rajputs are all over the place with the lowest coming from North India Pahari Rajput (I'm assuming at the base of the mountainous regions, similar to where the Pahari-speaking Punjabis of Pakistan live... so up near the border with Kashmir or Nepal in India) and West Bengali Rajput (the West Bengali Rajput still had 30%). So these people are actually physically as far from Balochistan as is possible. The southwestern Rajputs, however, like the Rajasthani and Punjabi Pahari had the typical high Baloch numbers for Punjabis which still leaves me confused.

    I suppose we'd need more data from Jatts and Rajputs of Rajasthan to paint a better picture. If lower Baloch numbers (low-mid 30s) are more common there, that solves the mystery. If they have higher Baloch numbers, that just makes it tougher to figure out. Everything else is explainable: higher S-Indian points to Rajasthan/Gujarat/Sindh, the 11% Caucasian points to the Muslim/Pakistani Punjabi side in recent history, the NE-Euro being equal to or slightly higher than other Pakistani Punjabi populations (9.51% in DIY, 8.34% in official) and slightly lower than the Sikhs over the border. That all fits if the Baloch were at a more typical Punjabi number in the 40s or at least high 30s. But at its present number it just indicates something farther away from the region.

    I also had a bit of Siberian/Beringian/Amerindian in there and these went up as the NE-Euro went down from my DIY results to the online spreadsheet, so I wonder if that means some of my NE-Euro is just very ancient or from a more northern and less western (i.e., Iranian) origin, but that's complete speculation.

    My family going back to my great-great-great-great grandparents were nearly all from Punjab (some of them were even Sikh at around this point in time). I know the names of my great grandparents (all but one), and most of my great-great grandparents. All Jatt clans. At the level of my grandparents I'm 25% Gill, 50% Pansota, 25% Balham. At the level of my great grandparents I'm 1/8 Gill, 2/8 Pansota, 2/8 Balham, 1/8 Bhatti, and 2/8 Unknown but still Jatt. At the level of my great-great grandparents it's a lot more unknown (nearly 62% unknown) but I do know for certain at least half of them were Jatts and with "near" certainty that almost 70% were Jatts from Punjab (i.e., not Rajasthan or Sindh or whatever). Even the Balham Jatts in my family whose origin is supposed to be from Multan have been in central Punjab for some time. The Pansota were from Hoshiarpur (according to both history and my grandparents) but have been in the Pakistani Punjab area since the 19th century. My Gill paternal grandfather came from Ludhiana in the first half of the 20th century and I know the least about his side of the family but from him I got the J2b2* paternal haplogroup. Mother was M30b (which was first found in a South Indian population but it's an imperfect match, apparently there's several extra markers).

    Which is curious because both M30b and J2b2 are present in West/South Indians (M30b was officially identified in Andhra Pradesh where there's plenty of J2b2). Eupedia says the J2b2 got into India likely with the R1as. The R1a spread is pretty much everywhere but the J2b2 spread is mostly in the West, Sindh/Gujarat, and a small bit in Punjab. I have no reason to believe my maternal mtDNA ancestors were linked to my paternal ancestors, especially since my paternal grandfather was relatively new to our family, but there it is. Then again it may just be due to founder effects which I suppose is highly likely. So the haplogroups unfortunately didn't help answer the question, they just added more. The smaller J2b2 dissemination does indicate that one of the branches of the people that came in with the R1as moved in a specific direction (towards Sindh and Magadha). Unfortunately this describes any one of a dozen possible nations spread over thousands of years.

    The endogamy within the Jatt community could have just preserved some trends from a long while back, so my family's recent 2 century history in Punjab might not mean anything if they didn't marry too far outside of their circle (I have 8 great grandparents but 12 great-great grandparents instead of 16).

    • Did you try the new Harappa Oracle? I just input your results, and this is what I got:
      [1,] "up_harappa_5" "5.4944"
      [2,] "gujarati-muslim_harappa_3" "6.0526"
      [3,] "punjabi_harappa_10" "6.347"
      [4,] "kashmiri-pandit_reich_5" "6.5856"
      [5,] "kashmiri_harappa_2" "6.8144"
      [6,] "punjabi-brahmin_harappa_2" "7.2143"
      [7,] "singapore-indian-c_sgvp_10" "7.6517"
      [8,] "bihari-muslim_harappa_4" "7.9248"
      [9,] "up-brahmin_harappa_3" "7.956"
      [10,] "kashmiri-pahari_harappa_2" "8.2104"
      [11,] "punjabi-ramgarhia_harappa_2" "8.2236"
      [12,] "nepalese-a_xing_12" "8.7201"
      [13,] "bengali-brahmin_harappa_6" "9.4031"
      [14,] "brahmin-uttar-pradesh_metspalu_8" "9.8438"
      [15,] "punjabi-jatt_harappa_8" "10.3879"
      [16,] "punjabi-arain_xing_25" "10.8939"
      [17,] "vaish_reich_4" "11.1423"
      [18,] "cochin-jew_behar_4" "11.489"
      [19,] "gujarati-b_hapmap_34" "11.7782"
      [20,] "up-kshatriya_metspalu_7" "11.8739"

      I think the up_harappa_5 samples are primarily Muslim samples.
      Notice that the punjabi_harappa_10 is the third closest group.
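
For readers wondering what a ranking like the one above measures, here is a minimal sketch of a nearest-population lookup. It ASSUMES the oracle scores each reference group by the Euclidean distance between an individual's component percentages and the group's average percentages; that is an assumption for illustration, not a description of the actual Harappa Oracle code. The individual profile uses four of the DIY figures quoted earlier in this thread, and the group averages are hypothetical, not real HarappaWorld values.

```python
import math

def distance(a, b):
    """Euclidean distance between two admixture profiles (dicts of component -> percent)."""
    components = set(a) | set(b)
    return math.sqrt(sum((a.get(c, 0.0) - b.get(c, 0.0)) ** 2 for c in components))

def rank_populations(individual, populations):
    """Return (name, distance) pairs sorted from closest to farthest."""
    scored = [(name, distance(individual, avg)) for name, avg in populations.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Four of the DIY components quoted in this thread; the rest are omitted for brevity.
individual = {"S-Indian": 37.0, "Baloch": 35.0, "Caucasian": 11.0, "NE-Euro": 10.0}

# Hypothetical group averages, NOT real HarappaWorld values.
populations = {
    "group_a": {"S-Indian": 36.0, "Baloch": 36.0, "Caucasian": 10.0, "NE-Euro": 9.0},
    "group_b": {"S-Indian": 29.0, "Baloch": 40.0, "Caucasian": 9.0, "NE-Euro": 13.0},
}

for name, d in rank_populations(individual, populations):
    print(name, round(d, 4))
```

A smaller distance only means the overall percentages are similar. It says nothing by itself about shared ancestry, and small differences between neighbouring ranks can easily be within the noise.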

  7. Also, re: Imran Khan, the results from myself and #351 probably bumped up the Jatt South Asian numbers.

    • I don't think HRP0351 is listed in the Jatt samples.

      Also, there are now three Jatt groups: Punjabi Jatt Sikhs, Punjabi Jatt Muslim, and Haryana Jat.

    • AD, I believe you're in the Pakistani Punjabi Muslim Jatt group with 3 other participants. The Indian Punjabi Sikh Jatt average South Indian is about 28% even with the 5 new Sikh participants, so it's changed by less than 0.5 even with the new participants and separation of Muslim and Sikh Jatts. Before the new grouping, HRP0283 (Pakistani Punjabi Jatt) was part of the Punjabi Jatt grouping of 8 participants.

  8. I just stumbled onto this person in a comment from an older post on this blog:


    shukla madhyandin brahman....

    40.14% S-Indian (vs 37)
    34.03% Baloch (vs 35)
    7.91% Caucasian (vs 11)
    11.83% NE-Euro (vs 10)
    0.23% SE-Asian
    1.03% Siberian
    0.26% NE-Asian
    1.30% Papuan
    0.50% American
    0.73% Beringian
    0.41% Mediterranean
    1.18% SW-Asian
    0.00% San
    0.46% E-African
    0.00% Pygmy
    0.00% W-African

    I put my DIY numbers in parentheses.

    Those are Deshastha Brahmins from southwest India (Maharashtra).

    There is one Maharashtra Brahmin in the HAP database and their Baloch is higher and NE-Euro nearly half of this. But they were Saraswat Brahmins who aren't as localized as the Deshastha are to Maharashtra.

    If that is the composition of an upper caste from that region, I presume any Jatts down there would look similar, maybe they'd have to be a little further to the north in Gujarat or Western Madhya Pradesh or Northern Andhra Pradesh.

    Comparing my numbers to his, the lower NE-Euro in southern India indicates a slightly lower caste or the Muslim heritage. The lower South Indian and slightly higher Baloch moves it up to the north. The higher Caucasian indicates some Muslim heritage.

    So that's my best theory: if my composition represents a particular group of Punjabis, there must have been some Jatts/Rajputs from all the way out in the middle of India, further than Rajasthan, like Madhya Pradesh, Andhra Pradesh, Maharashtra, or some province like that (maybe even Uttar Pradesh), who were Muslim or became Muslim and moved to Punjab at the latest 200 years ago but possibly much earlier. They moved to the Jhang area (or to another area together and then to the Jhang area). There are no other results right now from that area according to the Google Maps page for the Harappa Project. Short of going there and handing out kits, I guess I'll have to wait and hope some of the other Punjabis getting similar numbers chime in and let us know if they're from near Jhang (not all the Punjabis there would be similar).

    Whatever the case, that's the million dollar question. What would Jatts be doing in the middle of India?

  9. Just noticed another pattern. Other Jatts with low Baloch are in the mid-high 30s, but still have low South Indian. But they make up for it with slightly raised SW Asian/Med and higher Caucasian. It sounds like those Jatts (including some Jatt Sikhs who fit this pattern) have some admixture with populations closer to Muslims.

    My closest matches are the Uttar Pradesh/Pathan crosses (302, 311, 306), and a Kashmiri But (250). The UP/Pathan crosses have higher SW Asian/Med composition and they both note possible Arab ancestry. There's a half-Punjabi, half-UP (302) who has higher South Indian but that's probably because it's a Pakistani Punjabi cross with UP and not Pathan. The normal UP folks seem to have Baloch that's too high to match with mine, but they are close otherwise. Which is why I think UP is a possibility but also further south into the middle of India. I've just never heard of Jatts from UP/MP/Maharashtra/etc. There are some Rajputs out there, but Jatts?

    According to Joshua Project's People Profile:


    Enter India, then Jat, Muslim. There are scattered populations all over India still. Just surprised we haven't seen too many of these Jatts from other parts of India who moved to Punjab on here nor have I heard of any of them. It might have been too far in their past to be remembered since those who moved to Punjab would have done so over 150 years ago. Those who stuck together would have numbers closer to mine.

    Unrelated but interestingly, I think most of the J2 (no subclades) might have come from Middle Eastern Muslims as it's well represented among people with SW Asian/Med admixture, some overtly so, and including in UP Muslims and Syeds, and a 'Bene Israel' (Indian Jew). There was a Punjabi Arain or two who had it that I ran into at 23andMe (the Arain are supposed to be descendants of Arabs who invaded Sindh and the Jatts who had converted to Islam).

    • My dad 318 is J2, but he can't help much with his background.

      • Did you get your mother tested too?

        I've only seen J2 with no subclades in Punjabi Arain but HRP0318's admixture looks like a mix, since there's two Jatt Sikhs (HRP0257 HRP0339) with similar numbers to him but the Caucasian for both him and those Jatt Sikhs are higher than the norm for Jatt Sikhs and the NE-Euro of your dad is higher than the norm for Muslim Punjabis like the Arain. They could both belong to a less common strain of Punjabis from the same area of Punjab.

        If he's a mix, he could be a mixture of Punjabi and Pathan or Kashmiri would be my guess, those are fairly common there actually. His Baloch is slightly lower (along with that other Jatt Sikh), which is what leads me to believe either he's of a similar background to HRP0257 and HRP0339 which is rare among Punjabis surveyed so far or partly descended from Pathan or Kashmiri since they too have the lower Baloch. He has 2.6% NE-Asian which is really rare in Jatts (who are mostly stacked up together in the 0% range... 257 is the highest at 1.3%), but not uncommon in Kashmiris or Pahari or some Pashtun people.

        The other option is that the J2 indicates he's descended from some relatively recent Muslim movement to the area which is also why his Caucasian and NE-Asian are so high. Muslim groups from the north would have isolated instances of J2 (which I strongly suspect spread into this general region with the first Muslims) and also the higher Caucasian and NE-Asian, and NE-Euro. A lot of these Muslims who swept into India came from regions of Afghanistan, Tajikistan, Uzbekistan, etc. So the J2 could have gone up there with the males and south again recently or the J2 could be a holdover from the first Arab invasion of India (like the Arain story) and then there was mixing with recent Muslims from these northwestern regions.

      • @hrp302

        Sending samples of close family relatives, such as one's parents, breaks participation rules as per Zack-

        "Please do not send samples from close relatives. I define close relatives as 2nd cousins or closer. If you have data from yourself and your parents, it might be better to send the samples from your parents (assuming they are not related to each other) and not send your own sample."

  10. AD, you're not in the Punjabi_Harappa group. Rather, it consists of a "generic" group of 10 (now 11) Punjabis who didn't inform Zack of their specific background, chose not to publicly identify or perhaps didn't fit into one of the other Punjabi categories. You're actually part of the newly created Punjabi-Jatt-Muslim group.

  11. Creating a new calculator is a lot of work, and I do honestly appreciate the amount of effort Zack has put into this aspect of the project. But maybe it is time to create a new calculator. I feel like the "calculator effect" is at work here. Now even Punjabi participants have less "Baloch" than expected, more East Eurasian components, and more of specific exotic components ("Papuan", "East African", etc). The same thing is observable for the HAP Pashtuns in relation to the HGDP Pashtuns. With the new data Zack has, the results for a new calculator will be superior to "HarappaWorld".

    • HRP0282: What differences showed up between the HGDP and HarappaProject Pashtuns?

    • Calculator effect is an interesting theory. The majority of the original Punjabis sound like they were from the central/northern part of Punjab.

  12. The results for HAP Pashtuns are much more noisy. And we are the least "Baloch" of all the samples-participants from near the Indus. Kalash, Burusho, the HGDP Pashtuns, and most of the Punjabis, are more "Baloch". Heck, a lot of Indians have more of the "Baloch" component than us. And I find it difficult to take that without a grain of salt. But, perhaps this is a real result. If so, it is a very interesting finding. Also, more East Eurasian, and less "South Indian", makes geographic sense. But African components? Australasian percentages?

    • I saw the slightly lower Baloch numbers too and attributed it to Pashtun remaining relatively more endogamous and being reluctant to mix with West Asians of various Indo-Iranian persuasion. Baloch actually seems to form a continuum with Tajik and they arc *around* the Pashtun. The Pashtun are like a little low-Baloch island in the middle of these heavily Indo-Iranianized groups (groups like the Tajik are from Iran, moved into Central Asia, and some came back down again with the very recent Muslim invasions of the past few centuries). So Afghan Pashtun are close to Tajik but not as close to Pakistanis but Tajik are close to both. There's a little hiccup in the spread. And when Pashtun did start to mix with the Pakistani groups in recent centuries, they mostly stuck with Punjabis or Indian groups from the North.

      Also, Burusho have the lowest Baloch actually. That's why I keep getting put next to Burusho on the maps using HGDP reference populations, farther from Pathan than the other Punjabis. The big difference between the two is that the Pathan have higher Caucasian and slightly more Mediterranean/SW Asian but the Burusho have a sizeable NE Asian chunk (8%!). This also threw off my results because the Caucasian/NE-Euro was similar to mine which I knew couldn't be right. It's because they had the extra 8% in NE-Asian. Kalash are very close to Pashtun by the looks of just the spreadsheet, so I don't know why they aren't closer in the PCA graphs, they're way off in the corner away from Pathan and Burusho. That mystifies me.

      Calculator effect is still a possibility but if it shows up for the Pashtun, who have a sizeable representation in HGDP, that does pose a lot of questions.

    • Forgot to add, the Pashtun from the south would have higher Baloch than the ones from the north. I do see them divided as "North Pashtun" and "South Pashtun" in various websites. It could be both that and a calculator effect since most of HGDP's Pashtuns seem like "South Pashtuns" from the Afghan/Balochistan area. Where are you from? If you're from those southern populations and it's still low Baloch then I think a calculator effect is more likely than anything because there's no explanation for that.

      • AD, the Kalash cluster close to Pashtuns but in an isolated corner because their admixture proportions are similar but they are extremely inbred. Razib Khan mentioned before that the Kalash are almost like an "inbred" version of the HGDP Pashtuns. Also, I'm not sure whether there is any real difference between Northern and Southern Pashtuns in terms of admixture because there isn't any data to base this off. We would need samples from a Northern area like Swat District or Kunar Province and compare it to samples from Kandahar or Quetta. Although, I'd point out the HGDP Pashtuns are actually Pakistani Pashtuns from Kurram Valley in the Federally Administered Tribal Areas.

        Also, I'd like to mention that I've seen the DIY HarappaWorld results of an Afghan Pashtun from Kandahar and to my surprise his Baloch score was almost the same as yours and not nearly as high as the average of the HGDP Pashtuns.

        • Baloch is higher in Pakistani populations in general, do you think this means the Pashtun in Afghanistan have lower Baloch than their cousins in Pakistan?

          I know some Pashtuns from Kandahar (Popalzai), I'll encourage them to try this out and submit results. I'd be interested in knowing their haplogroups too since that's the tribe of the founder of the Durrani Empire.

        • Perhaps the results seem a little jarring because we're ignoring the exotic categories. I know people don't want to rely on them because they might be due to noise, but the noise is in a pretty specific pattern. The Pashtuns with low Baloch have higher SW-Asian/Mediterranean. This could be due to a really old admixture from the first Arab invasions of the 7th century. Quite a few of the Arabs mixed with peoples from Afghanistan (including non-Pashtun). What comes to mind is quite a few famous personalities in Arab civilization who were from "Khorasan", some literally from Afghanistan (Abu Hanifah). The later Indo-Iranian admixture probably didn't hit them directly, but as the Pashtun extended eastward into what is now Pakistan and India, they could have wound up with higher Baloch admixture.

          Also, if the HGDP Pashtun are all from Kurram, they're closer to Waziris and Pakhtuns. Those in Kandahar say Pashtun, those in Peshawar say Pakhtun (i.e., Khyber-Pakhtunkhwa) and those from the south are Waziri to the best of my knowledge and Kurram are closest to Waziri, and they border Balochistan. Because these distinctions have been in place for over 200 years (the border goes back to the British), perhaps the sharp difference in Baloch admixture over a long-standing political boundary makes sense? And after the Durranis, the Kandahari Pashtun seem to have mixed less with others.

    • I looked at the individual Pashtun results and there's a Mohammadzai in there with low Baloch, that seems very off.

      I noticed the HAP Pashtuns have more Mediterranean/SW Asian though, that's where some of the missing Baloch % seems to have gone. Since Baloch is essentially closest to West Asian, I think it's a pattern from the newer algorithm. Instead of an 8% differential between HGDP and HAP Pashtuns it's 3% if you group SW-Asian/Mediterranean and the African with Baloch.

      Whether this is due to greater accuracy (more Mideastern component to Pashtun) or lesser accuracy/more noise is anyone's guess.

      • And any detractions from NE-Euro went to the far northeast (Siberian, Beringian, NE Asian, American).

        I made a copy of the spreadsheet on my Google Drive account and added columns which group the exotic North and NE-Euro into one and Baloch and exotic Southwest Asian/African into one, and the relationships seem a little smoother between the different groups and the older and newer individuals.

        My low Baloch is in both runs (DIY and Zack's), and it's a 10% swing so that's probably not due to noise.
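
The grouped columns described above can be sketched along these lines. Which components belong in which bin is an ASSUMPTION made for illustration (the comment does not list them exactly), the names `GROUPS` and `grouped_totals` are hypothetical, and the sample profile is the 16-component result quoted in an earlier comment on this page.

```python
# Assumed bins: "north" = NE-Euro plus the far-northeastern components,
# "southwest" = Baloch plus SW-Asian, Mediterranean and the African components.
GROUPS = {
    "north": ["NE-Euro", "Siberian", "Beringian", "NE-Asian", "American"],
    "southwest": ["Baloch", "SW-Asian", "Mediterranean",
                  "E-African", "W-African", "San", "Pygmy"],
}

def grouped_totals(profile):
    """Sum the component percentages of one profile into the broader bins."""
    return {
        group: sum(profile.get(component, 0.0) for component in components)
        for group, components in GROUPS.items()
    }

# Profile quoted in an earlier comment (percentages).
profile = {
    "S-Indian": 40.14, "Baloch": 34.03, "Caucasian": 7.91, "NE-Euro": 11.83,
    "SE-Asian": 0.23, "Siberian": 1.03, "NE-Asian": 0.26, "Papuan": 1.30,
    "American": 0.50, "Beringian": 0.73, "Mediterranean": 0.41, "SW-Asian": 1.18,
    "San": 0.00, "E-African": 0.46, "Pygmy": 0.00, "W-African": 0.00,
}

totals = grouped_totals(profile)
print({group: round(value, 2) for group, value in totals.items()})
```

Comparing these binned totals between older and newer samples is one way to check whether a shortfall in one component has simply moved into its neighbours.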

  13. At some point I think you should split off Punjabis with similar admixture to mine (if enough join) into a new group since our ancestry then isn't really fully "Punjabi". Maybe something like "Punjabi SE" or something. Like Jatts or Rajputs from outside of Punjab who moved to Punjab and don't remember where they were before but don't have admixture fully in common with surrounding populations. Or maybe put us in both groups (existing group and also a new one just for us).

    It's possible they're mostly in the Faisalabad/South-of-Faisalabad area but I'm sure some have moved elsewhere and plenty of native Punjabis live in those areas too so it's not something we can generalize onto everyone from there.

    The J2b2 haplogroup at least gives a rough idea of where the spread started but the specifics of recent (past 2000 years) are probably going to be impossible to discover at this point, especially with the complicated recent history after Islam came to India.

  14. Hi AD,

    The split between southern and northern Pashtuns is linguistic. Most northerners (in both Afghanistan and Pakistan) speak the "hard" dialect of Pashto (as in "Pukhtun", "zamung", "kha"), while most southerners (in both Afghanistan and Pakistan) speak the "soft" dialect of Pashto (as in "Pashtun", "zamuz", "sha"). Pashto orthography is based on the soft dialect, which is why the "soft" dialectal form is used for "Pashtun", and "Pashto" (rather than "Pukhtun", and "Pukhtu"). Both forms of the language are virtually identical, and there is little if any difference in lexical content. There is complete mutual intelligibility. But things are more complicated than just a north-south difference. Although you often see this simple, two-part distinction repeated across linguistic literature, it is not really correct or optimal. Most linguists recognize this. The Wazirs and Khattaks (the more southerly members of the Khattak) speak a "soft" form of Pashto. But it is quite clear from linguistic analyses that their Pashto is rather divergent from other "soft" dialects, and is much closer to "hard" Pashto. Also, Pashtuns in Khost, Kurram, and nearby regions, speak the "hard" form of the language. But their Pashto is very divergent from other "hard" dialects. Soft Qandahari Pashto is far more comprehensible and familiar to a Peshawari or Nangarhari Pashtun than the form of Pashto spoken in the Khost-Kurram axis. So the linguistic distinction itself is founded on a minor difference in phonology, which often does not map onto actual linguistic divergences on the ground. Besides the difference in phonology, there is no real split between northern and southern Pashtuns.

  15. I found this article about admixture in South Asia, and I was wondering what you all think about it or if you have different views.


    • This seems to make sense. By 2000 years ago the clans and castes were set up and entrenched.

      I wonder if it's related to population numbers and crowding. More people crowded together means more reasons for classism.

      • It underestimates mixing in recent times though, especially with the advent of Islam and the tumultuous political history. With continued invasions from Central Asia, from Afghanistan, and from Iran, and with Muslim dynasties spread across all of India, replaced by Hindu ones, Sikh ones, then more Muslim dynasties, there were a lot of wars, movement, and even mixing. Especially of the patrilineal sort, since I suppose people had less of an issue bringing foreign women into the fold.

    • Who were the ANI from 8000 years ago? What genetic component is that? The only pattern I've seen is higher NE-Euro in higher castes, and that's probably from ~4000 years ago with the first Indo-Europeans or Proto-Indo-Europeans. We don't know if the Indo-Europeans even existed 8000 years ago, let alone that they were in India.

      • Zack, I wound up at this link from that one:


        Since it's your project, what do you think the Baloch admixture component represents? Razib's theory was:

        "A plausible framework then is that expansion of institutional complexity resulted in an expansion of the agriculture complex ~3,000 B.C., and subsequent admixture with the indigenous hunter-gatherer substrate to the east and south during this period."

        Which I took to mean indigenous peoples of this area (Northwestern India or what is now Pakistan) who were the first mixture of non-Indian populations (in this case, caucasians or cousins of caucasians who diverged from caucasians ~10k years ago) with Indian populations.

        If the Baloch represent kind of an ancient/ancestral substratum that forms a continuum with the South Indian component, and the NE-Euro represents slightly more recent Indo-European waves from Central Asia (possibly from Andronovo culture/horizon and its successors until ~2000 years ago), and the 'Caucasian' represents the most recent mixing of actual recent caucasians in the last 1500 years, then that kind of seems to paint a coherent picture but I'm curious as to your own thoughts and conclusions.

  16. Is there any new Pashtun sample in this set?

  17. Anyone interested, here's what the DNA Tribes matches were for me (HRP0349):



    The Netherlands was weirdly at 0.7038 and so higher than some of my matches to neighboring Indian populations.

    • Do they do a component breakdown like Harappa does (e.g. Baloch, South Indian, NE European, etc)?
      They do seem to have a lot of South Asian data:

      Did you find your results to be accurate?

      • No, their admixture was about as bad as 23andMe.

        Admixture Analysis: 9 Continental Zones:

        92% South Asian / 7% European / 1.2% Sub-Saharan African

        (23andMe: 98% South Asian, 1% European)


        Admixture Analysis: 20 World Regions:

        South India: 47%
        Indus Valley: 46%
        Northwest European: 5%
        West African: 1.3%

        (not much better)


        This one was different:

        Admixture Analysis: Native Populations

        Gujarat India: 26%
        Balochi Pakistan: 11.5%
        Brahmin Andhra Pradesh India: 9.6%
        Sindhi Pakistan: 7.9%
        Pashtun Pakistan: 6.5%
        Chamar India: 5.1%
        Dharkar India: 4.2%
        Brahmin Tamil Nadu India: 4.0%
        Kurdish Sample 2: 3.3%
        Uzbek Central Asia: 2.7%
        Belarus: 2.7%
        Aymara La Paz Bolivia: 2.5%
        North Ossetia: 2.4%

        Including Diasporic populations brought in Indian Singaporean with 9.5%, I have no idea what that is.

        The most useful stuff was the maps I showed you which let you see a bit of a pattern at least.

        • This is interesting. I would expect at least some Middle Eastern admixture for you, since your Y-DNA is more common in the Middle East than in South Asia.

          • Actually, J2 with no subclades is common in the Middle East, but J2b2* is not. It's common in the Balkans, various parts of Europe, Anatolia, and India.

            So whenever I see J2 with no subclades, I assume there's recent (past ~1300 years) history from the Middle East for them. J2b2* in India indicates possibly a Central Asian or Anatolian/Balkan source from ~3300 years ago (that's one interpretation anyway, nobody really knows how it got there).

          • @AD

            Phylogeny tree of South Asian J2 branches discovered in Punjabis, Gujaratis, Tamils, Bengalis, and Sri Lankans:


            So far, HG02690 PJL - Punjabi in Lahore, Pakistan shares the same J2 sub-branch as NA20885 GIH Gujarati in Houston, both belonging to J-Y951Y1001 * Y998 * Y991 * Y989 * Y980, separated from Bengali, Tamils and Sri Lankan J2 STR types at J-Y983Y983.

          • Orange, interesting. There are 3 major hotspots for J2b2 on the heatmap. Punjab (smallest), Gujarat (cutting into Rajasthan and Sindh), and North/East India.

            I figured the Punjabi/Gujarati would be the same but now that I think about it, I suppose that means the Gujarat/Punjabi hotspots originated from the same source (I would guess spread from Punjab to Gujarat). So the initial spread of J2b2 wasn't all the way to Gujarat, they just moved there soon after (long enough for the mutations to develop). This fits closer to the map of the spread of the Andronovo Indo-European culture (where they had settlements spotted across northern Pakistan/India, but nothing down the Indus Valley yet). That also explains the gap between the three hotspots. Two of them moved out from the other one (towards the South and East), leaving a limited population of the original in Punjab which is the smallest grouping of J2b2 and a gap between the three.

            I would be curious if these markers show up in J2b2* people from Uttar Pradesh, northern Pakistan, or Nepal. If we had more we could try to estimate how it spread in India which could give clues as to how and when it came into India to begin with.

        • According to that link Curious posted, they have 148 Jat Sikh samples from India and 48 Jat samples from Uttar Pradesh?


          I'd consider their test if it wasn't so worthless other than for comparing yourself to other populations in their database. I prefer 23andMe and then FTDNA if you're interested in your haplogroups.

          • I'm mystified why the Uttar Pradesh Jats didn't show up at all but Uttar Pradesh Kshatriya did. UP Kshatriya are not on that listing so maybe they grouped UP Jats under UP Kshatriya.

      • They're supposed to have Jats from Uttar Pradesh but I didn't see them on any list in my analysis. Instead I got Muslims from Uttar Pradesh in the Total Similarity map. Out of nearly 300 populations on the Total Similarity map, Jats didn't show up (but Europeans and Africans did and "Kshatriya Uttar Pradesh" did... don't know what that means, Jats of Uttar Pradesh would be Kshatriya too...).

        • Hi AD,

          The Jatts from Uttar Pradesh are used on the STR test (which is worthless). You received the SNP test (which was a good choice on your part). If you are interested, here are my top 15 populations, and "admixture" results:

          1. Pashtun Pakistan 0.7096 South Asia
          2. Serbia and Croatia 0.7096 Europe
          3. Chechen 0.7094 Caucasus
          4. Abkhazian 0.7091 Caucasus
          5. Armenian Sample 1 0.7088 Caucasus
          6. Brahmin Uttar Pradesh India 0.7087 South Asia
          7. Muslim Uttar Pradesh India 0.7086 South Asia
          8. Armenian Sample 2 0.7086 Caucasus
          9. Iran 0.7086 Middle East
          10. Lambadi India 0.7085 South Asia
          11. Burusho Pakistan 0.7085 South Asia
          12. Russia General 0.7083 Europe
          13. Balkar 0.7083 Caucasus
          14. Ukraine General 0.7083 Europe
          15. North Ossetia 0.7083 Caucasus

          Admixture Analysis: 9 Continental Zones:
          South Asian=67.3%
          West Asian=27.5%
          Sub-Saharan African=0.3%

          Admixture Analysis: 20 World Regions:
          Indus Valley=71.4%
          South India=11.0%
          Caucasus Mountains=3.4%

          Based on the "Admixture Analysis", my closest populations are the Brahui, Baloch, Kalash, and Pashtun (in terms of percentages). I found that very unusual. But I think I know the reason for this. They have additional samples in their customer database, which they also use to compute the percentages. That's why the averages for populations are different. My "South Indian" percentage is closest to the Kalash in their analysis, and my "Indus Valley" percentage is closest to the Brahui.

          • This is pretty interesting, we have similar matches near the beginning of my list, or 0.7087. That's further down your list though, you seem to have more matches overall.

          • Hi HRP0282,

            I am curious as to what admixture calculator you used.


  18. Hi HRP282,

    How did you break down admixture?

    • Could this explain the migration of R1a1 out of South Asia and into the region?

      • Not all of R1a1, but R1a1-Z2124 perhaps.

        • Thanks Parasar, how about Y7+, from South Asia into the Middle East? Or was it Vice Versa?

          • From South Asia to Arabia. The reason is the near complete absence of J1 in South Asian groups that have L657-Y7. I think L657 expanded in the historical period, much like M458 in eastern Europe, and it is difficult to conceive of a movement from Arabia to South Asia without any accompanying J1.

  19. That is a pretty cool paper.

    • Re: "individuals deeply deposited in slightly alkaline soil of the Tell Ashara (ancient Terqa)"

      It is in line with the clove findings (1700 BC) at Terqa. Clove has a known SE Asian provenance, and probably went to the Middle East through South Asia. "A small but very important find reflecting the scope of Terqa’s international trade connections was indicated by the contents of a jar in the pantry of Puzurum - a few kernels of cloves." http://www.iimas.org/Terqa.html

      Plus the Indic names that appear in Akkad (~2200bc - http://books.google.com/books?id=6lPzhfNRZ9IC&pg=PA374 ) and elsewhere including the Syrian regions.

  20. Why don't some of these admixture calculators have a component for Turkic-Mongolic ("Tatar")? Considering that's a signature independent component (with an important linguistic link)? Wouldn't that make for more precision and solve some of the noise issues in Asian populations?

  21. My (HRP0349) FTDNA population finder results (from a transfer of my 23andMe raw data) came up:


    South Asia (Indian) - North Indian - 82.36% (Error: 1.23%)
    Middle East - Palestinian, Adygei, Druze, Iranian, Jewish - 8.51% (Error: 2.46%)
    Europe - Finnish, Russian, Orcadian - 9.13% (Error: 1.85%)

    So... wtf @ Middle Eastern? I wonder how FTDNA got that result. Adygei are actually Caucasus so I'm guessing this represents what showed up as Caucasian in my Harappa results (which were 10.7% I believe... just barely within range of FTDNA's error margin). I've always figured the Caucasian admixture in Indians comes from Middle Eastern/Iranian sources post-Islam, this would seem to suggest that as well. The European is in range of my Harappa calculator results (9.51 on DIY, 8.34 on Zack's run).
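    The "within range" comparisons above are simple arithmetic and can be checked with a rough sketch like the one below. It assumes FTDNA's "Error" figure can be read as a symmetric plus/minus interval around the estimate, which the Population Finder output does not actually state; the helper name `within_margin` is made up for illustration.

```python
def within_margin(estimate, error, other):
    """Return True if `other` falls inside estimate +/- error.

    Assumption: the reported error is a symmetric interval.
    """
    return abs(other - estimate) <= error


# Percentages quoted in the comment above.
ftdna_middle_east = (8.51, 2.46)   # (estimate, error)
ftdna_europe = (9.13, 1.85)

harappa_caucasian = 10.7           # quoted from memory in the comment
harappa_european_diy = 9.51
harappa_european_zack = 8.34

print(within_margin(*ftdna_middle_east, harappa_caucasian))
print(within_margin(*ftdna_europe, harappa_european_diy))
print(within_margin(*ftdna_europe, harappa_european_zack))
```

    Under that reading, the Middle East interval is 6.05% to 10.97% and the Europe interval is 7.28% to 10.98%, so all three Harappa figures fall inside, the Caucasian one only narrowly.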

    The "North Indian" does range across all of North India so maybe that's why there's higher confidence in that (there's plenty of high South Indian, East Asian, Tribal/whatever mixture further to the Northeast). The weird stuff got subsumed under there.

    Here's a listing of the reference populations that FTDNA uses:


    Surprised there's no kind of West Asian reference used.

    • Hi AD,

      This is very interesting, what were your results on 23andMe?

      • 98 point something South Asian, 1.1% European in Speculative mode. I think generally everyone's getting over 95% South Asian there since they group Afghanistan and I think Tajikistan with South Asia.

        • Not necessarily. Before I had them compute a split mode using my mother's results, I was around 75% South Asian, 19% Middle Eastern, 5% European, and 1% East Asian (if my memory serves me right).

        • Thanks for sharing, that really doesn't make sense. Makes me wonder if they keep changing their algorithm every year.

          • I think their Ancestry Composition is more similar to their Countries of Ancestry tool: they're looking for full or half matches of chunks of DNA (>7 cM). So about 1% sounds right as a direct match for modern Europeans in 23andMe's database (it's actually higher if they include gedmatch one-to-many results). It doesn't seem like an admixture tool, the way FTDNA's does.

            That's just my theory anyway, it could be a really bad admixture algorithm.

  22. Hi AD,

    I wouldn't take the FTDNA "Population Finder" too seriously. My results:

    Middle East (Bedouin, Bedouin South, Mozabite, Palestinian)=96.79% ±0.60%
    South Asian (Central Asian)=2.83% ±0.52%

    They tried running my data twice, and the result is consistent. This is just how their algorithm seems to construe my data. The only thing I can say is that something is obviously very wrong. Their algorithm is extraordinarily chunky, and seems to be going for "long clines".

  23. Although, I would note that your results actually make good sense. I wish mine were as reasonable.

    • That's really weird, I wonder why your data isn't matching the Pathan reference population from HGDP? Then again the HGDP Pathan numbers seem to be off from the Harappa Pashtun and Pathans, maybe that's why.

      DNA-Tribes seemed to do fine with those same populations on the other hand. You got Pashtun Pakistan as your #1 result.

      • To be honest, the DNA Tribes analysis was also somewhat strange, as my second closest population was Serbian and Croatian (in fact, I'm actually equidistant to the HGDP Pashtuns and the southeast European samples). Also, I thought it was weird that I'm closer to Caucasian and European populations than South Asians. In fact, South Asian populations are very distant based on their analysis, and most West Asians and Europeans are very close. So, their analysis has serious issues as well. But then again, IBS estimates are not really informative of population structure, so I guess this is an inherent issue with the kind of analysis in question.
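        For anyone unfamiliar with the term, a generic identity-by-state (IBS) similarity can be sketched as below. This is only the textbook definition, not DNA Tribes' actual scoring method, which is not described anywhere in this thread. Genotypes are assumed to be coded as minor-allele counts (0, 1, or 2), with `None` for a missing call; the function name and the sample data are invented for illustration.

```python
def ibs_similarity(genotypes_a, genotypes_b):
    """Mean identity-by-state across SNPs.

    Each genotype is a minor-allele count (0, 1, or 2); None marks a
    missing call. At each SNP the two samples share 2 - |a - b| alleles
    by state, out of a possible 2.
    """
    shared = 0
    compared = 0
    for a, b in zip(genotypes_a, genotypes_b):
        if a is None or b is None:
            continue  # skip SNPs not called in both samples
        shared += 2 - abs(a - b)
        compared += 2
    if compared == 0:
        raise ValueError("no overlapping genotyped SNPs")
    return shared / compared


# Made-up genotypes for two samples, purely for illustration.
sample_a = [0, 1, 2, 1, 0, 2]
sample_b = [0, 2, 2, 1, 1, None]
print(ibs_similarity(sample_a, sample_b))  # 8 shared alleles out of 10
```

        Because every marker is averaged into one number, a score like this says how many alleles match overall but nothing about which ancestral components they come from, which may be part of why quite different populations can end up with near-identical scores.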

  24. @ Saki,

    Sorry about that, I didn't see your question until now. Here is the link http://www.dnatribes.com/snp.html.

  25. Wow! I found a very comprehensive PhD thesis on Indian populations. Quite good for getting basic genetic knowledge about South Asia for a newcomer like me.