In the chart below as well as in the spreadsheet, the IDs starting with "b" belong to the Burusho and those starting with "k" belong to the Kalash individuals.
You can check the spreadsheet too.
Having collected 12 datasets, I have gone through them and finally selected the samples and SNPs I want to include in my new dataset, which I'll call Reference 3.
It has 3,889 individuals and 217,957 SNPs. Since this is a South Asia focused blog, there are a total of 558 South Asians in this reference set (compared to 398 in my Reference I).
You can see the number of SNPs of various datasets which are common to 23andme version 2, 23andme version 3 and FTDNA Family Finder (Illumina chip).
The following datasets had more than 280,000 SNPs common with all three platforms and hence were included in Reference 3:
Reich et al had about 100,000 SNPs in common with 23andme (v2 & v3 intersection) and 137,000 with FTDNA, but there was not a great overlap. Only 59,000 Reich et al SNPs were present in all three platforms. Since I really wanted Reich et al data in Reference 3, I included it but the SNPs used for FTDNA comparisons won't be the same as for the 23andme comparisons.
Of the datasets I could not include, I am most disappointed about the Pan-Asian dataset since it has a good coverage of South and Southeast Asia. Unfortunately, it has only 19,000 SNPs in common with 23andme v2 and 23,000 with 23andme v3. I am going to have to do some analyses with the Pan-Asian data but it just can't be included in my Reference 3.
I am also interested in doing some analysis with the Henn et al African data with about 52,000 SNPs for personal reasons.
Xing et al has about 71,000 SNPs in common with 23andme v3, so some good work could be done with that, though I'll have to use only 23andme version 3 participants.
The information about the populations included in Reference 3 is in a spreadsheet as usual.
I am interested in North African populations due to my own heritage, so when Razib alerted me that Henn et al had a paper out about South African origins of humans and their African dataset was publicly available and included populations from all over Africa, I immediately downloaded it.
I have also been considering looking into the East Asian admixture in South Asians and Iranians in some detail to see where it originates from: Southeast Asia, Chinese/Japanese/Koreans, or the Turkic/Mongolian/Siberian populations of interior northeastern Asia. At a quick glance, Razib is correct:
The eastern Asian components are enriched among Bengalis, as youâ€™d expect, but theyâ€™re found in different proportions among many individuals who hail from the northern fringe of South Asia more generally. It seems clear that the further west you go, the more likely the â€œeasternâ€ element is going to be Turk, while the further east (and to some extent south) the more likely it is to be more southernly in provenance.
To do a better job though, it would be better to have more than the Yakut as an examplar of the Siberian component as I have done till now. Therefore, I downloaded the arctic populations dataset from Rasmussen et al.
Combining Henn et al and Rasmussen et al with my previous datasets (HapMap, HGDP, SGVP, Behar et al and Xing et al), I got 3,970 samples with a total of 1,716,031 SNPs represented, though at 99% genotyping rate it gets reduced to about 27,000 SNPs.
I did not remove any populations or individuals except for any duplicates and non-founders.
Now I am on the lookout for more datasets that are public, have enough SNPs in common with this set and can easily be converted into the Plink PED format. So if you know of any, let me know. May be I will have the biggest and most diverse dataset with your help.
The presence of those groups was creating some weird effects in admixture runs at K=8,9. Basically, the ancestral components for Africans I was getting were not stable. Instead they were varying with/without different Harappa participant batches. Also, at K=10,11, there were too many Africa-only ancestral components, forcing me to run even higher values of K.
Since we are not really interested in African diversity in this project and any African admixture among South Asians is most likely to be East, West or North African instead of Pygmy or San, the removal of these groups should not have any implications for the Harappa Ancestry Project.
To make sure that the above assertion is true, I'll re-run admixture analysis for K=2-5 and update later with the results.
The good thing is that this dataset has 544 South Asian samples from 24 ethnic groups. So it'll be useful for some analyses despite the low number of SNPs. I'll try to run parallel analyses on my reference population and this dataset so we can compare the pros and cons of both.
UPDATE: I removed 61 pygmy and San samples.
I got the Stanford University data which has data for 660,918 SNPs from 1,043 samples. It is claimed that the forward strand is given but that turned out not to be true and I had to flip strands and make sure I didn't include any ambiguous A/T or C/G strands in my dataset.
I also excluded the Native American samples because we are not interested in them and they are very closely related either due to recent endogamy or ancient bottlenecks. (yeah I had the nerve to write that.)
Of the total of 876 samples, here are the numbers for our populations of interest:
|Total South Asians||190|
These samples have about 541,560 SNPs in common with 23andme v2.