Metspalu Dataset Update

Dr. Metspalu, who has been very good about sharing data and information, has informed me of two cases of mislabeling in the Metspalu et al dataset:

Our sample labelled D238 and reported as Tharu is in fact a Brahmin sample from Uttar Pradesh.

Following the publication we have identified that sample evo_32 was erroneously labelled as Kanjar before any genetic analyses. We hereby re-label the sample as belonging to Kol population.

Thus, I have updated the Metspalu admixture results and clustering results.

Reference 3 Fixed

I have fixed the problem with Reference 3 but if you notice any strange results, do let me know.

While the Reference 3 admixture results were generally good (and I have some nice surprises on the way, I hope), the Reich et al populations showed some weird behavior. From one K value to the next, their admixture proportions would swing wildly, especially among the minor components.

For example, for Chenchu, the 2nd component after South Asian was Southwest Asian (42%) at K=6, European (45%) at K=7 and American (32%) at K=8. That just didn't make any sense. It was similar for other Reich et al populations, but all the other reference populations seemed pretty stable.

The issue was that when I was creating Reference 3, I had to juggle lists of SNPs to find a way to include Reich et al with a large number (>100,000) of SNPs, since Reich et al doesn't share as many SNPs with the other datasets plus 23andMe (v2 and v3) and FTDNA. In the course of doing lots of SNP set intersections and unions, I messed up and used 217,000 SNPs. While these SNPs were present in all the other datasets, Reich et al had only 102,000 SNPs in common with that set. Ouch! This was a royal mess: the high missing rate of Reich et al caused the instability in its admixture results, even though the rest of the results were mostly stable.
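A quick coverage check would have flagged this kind of mistake. The sketch below is only an illustration, not the script I actually used: the SNP IDs, dataset names and the coverage helper are made up, and each dataset is assumed to be available as a plain set of SNP IDs. The last lines plug in the real counts from above, which work out to Reich et al missing roughly 53% of the chosen SNPs.

```python
def coverage(snp_set, datasets):
    """Return, per dataset, the fraction of snp_set that the dataset genotypes."""
    return {name: len(snp_set & snps) / len(snp_set)
            for name, snps in datasets.items()}

# Toy stand-ins: these SNP IDs and dataset names are hypothetical.
chosen = {"rs1", "rs2", "rs3", "rs4"}
datasets = {
    "dataset_a": {"rs1", "rs2", "rs3", "rs4", "rs5"},
    "dataset_b": {"rs1", "rs2", "rs9"},
}

cov = coverage(chosen, datasets)
for name, frac in cov.items():
    print(f"{name}: {frac:.0%} of chosen SNPs present, {1 - frac:.0%} missing")

# The counts from the post: 102,000 of the 217,000 SNPs were in Reich et al.
reich_missing = 1 - 102_000 / 217_000
print(f"Reich et al missing rate: {reich_missing:.0%}")
```

Running a check like this after every intersection or union step makes a badly covered dataset stand out before any admixture run is started.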

Now, I have pared down Reference 3 to 118,000 SNPs. These have a low missing rate in all the datasets. So I don't expect the same problems.
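As a rough sketch of what "low missing rate in all the datasets" means in practice, the snippet below keeps only SNPs whose missing rate is under a threshold in every dataset. The 5% cutoff, the function name and the input layout (a per-dataset dictionary of SNP ID to missing rate) are assumptions for illustration; the actual cutoff used for Reference 3 is not stated here.

```python
def low_missing_snps(missing_rates, max_missing=0.05):
    """Select SNPs that are well covered in every dataset.

    missing_rates: {dataset: {snp_id: fraction of samples with no call}}.
    A SNP that a dataset does not contain counts as fully missing there.
    max_missing is an assumed threshold, not a value from the post.
    """
    all_snps = set().union(*(rates.keys() for rates in missing_rates.values()))
    keep = set()
    for snp in all_snps:
        if all(rates.get(snp, 1.0) <= max_missing
               for rates in missing_rates.values()):
            keep.add(snp)
    return keep

# Hypothetical toy input: rs2 is absent from dataset "b", rs3 is poorly called in "a".
toy = {
    "a": {"rs1": 0.00, "rs2": 0.01, "rs3": 0.20},
    "b": {"rs1": 0.02, "rs3": 0.00},
}
kept = low_missing_snps(toy)
print(sorted(kept))
```

Treating an absent SNP as 100% missing is the design choice that matters: it is what stops a SNP from slipping through just because one dataset never typed it.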

I am redoing the admixture runs with this new data and will have some of the results up soon.

Reference 3 Admixture

I have withdrawn the Admixture results for Reference 3 for now while I figure out why a few of them were weird and unstable.

I will report back on what I find and will have fixed results soon.

Changes due to San/Pygmy Removal

As mentioned earlier, I removed San and Pygmy groups from my reference datasets.

For the admixture runs on Reference Dataset I, the only major changes are for K=2 ancestral components where most European, Middle Eastern and South/Central Asian groups increase their African component. The changes for K=3,4,5 were minor as shown by these statistics:

K    Median abs. change    Maximum abs. change
3    0.01%                 0.22%
4    0.02%                 0.26%
5    0.02%                 0.71%
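For anyone who wants to run the same kind of comparison, the statistics above can be computed along these lines. This is a minimal sketch, assuming the old and new results for one K are held as dictionaries mapping each group to its list of component percentages; the group names, numbers and the change_stats helper are invented for the example.

```python
from statistics import median

def change_stats(before, after):
    """Median and maximum absolute change across all groups and components.

    before/after: {group: [component percentages]} for a single K, with the
    components in the same order in both runs (an assumption of this sketch).
    """
    diffs = [abs(a - b)
             for group in before
             for b, a in zip(before[group], after[group])]
    return median(diffs), max(diffs)

# Hypothetical toy values, not real results.
before = {"group1": [60.0, 40.0], "group2": [10.0, 90.0]}
after = {"group1": [60.1, 39.9], "group2": [10.3, 89.7]}

med, mx = change_stats(before, after)
print(f"median abs change: {med:.2f}%, maximum abs change: {mx:.2f}%")
```

One caveat: ADMIXTURE does not guarantee that components come out in the same order between runs, so the columns have to be matched up before taking differences.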

I have updated the spreadsheet and the plots in the original post.

Looking at the changes in the admixture results I already posted for Harappa Project participants HRP0001 to HRP0010, there is a major change for K=2. The African component (C1/red) increased by a lot among all project participants. This seems to be because the African component now best represents West Africans instead of Pygmies, as it did before.

For K=3,4,5, the changes are very minor. Let's look at the absolute value of the changes in the percentages of ancestral components for the ten project participants.

K    Median abs. change    Maximum abs. change
3    0.05%                 0.19%
4    0.05%                 0.22%
5    0.09%                 0.60%

I have updated the spreadsheets and the charts in the original post.