Using Dienekes' data, I was trying to figure out which South Asian populations had more DNA chunks in common with other groups when I ran into something strange. Looking at the chunkcount spreadsheet, if we focus on a recipient population (i.e., one row), we can see which populations contributed more "chunks". For most populations, the results are expected. It's either the same population or some close population. For example, let's look at top 5 matches for Velamas_M,
However, when we do the same for Pathans, Sindhis, Uttar Pradesh Brahmins, Kshatriyas and Muslims, we get strange results.
Do Pathans match Chamar the best? Pathans don't show up as a donor till #11.
Again, Sindhis as donors are #12.
The same Brahmins_UP_M are #13 as donors.
Muslim_M are #8 as donors.
There is a pattern here among the top donors for these populations. The same populations show up time and again.
Compare to my results (with a larger South Asian dataset) now. The top 10 matches for Pathans are:
For Brahmins from Uttar Pradesh,
If Dienekes can post a chunkcount file for the clusters computed by fineSTRUCTURE, may be we can try to figure out what happened.
Dienekes has a word of caution about choosing reference populations and admixture results.
Consider a sample of 25 Mexicans from the HapMap and 25 Yoruba from the Hapmap, 25 Iberian Spanish from the 1000 Genomes Project, and 25 Pima from the HGDP as parental populations. We obtain for our Mexican sample:
- 59.7% European
- 36.9% "Native American"
- 3.4% African
Let's run a final experiment with just the Mexicans, Spanish, and Yoruba, i.e., with no Native American samples. At K=3 we obtain:
- 70% "Native American"
- 29.7% European
- 0.4% African
The "Native American" component has increased again! The explanation is simple: as we exclude less admixed Native American groups, Mexicans appear (comparatively) more Native American. The "Native American pole" has shifted, and so has the relative position of populations between them.
In other terms, what is labeled "Native American" in the three experiments is not the same: in the first one it is anchored on the more unadmixed Pima, in the last one in the more admixed Mexicans.
Thus, it seems that unadmixed reference samples are much more useful in getting good results from Admixture.
Then he runs Admixture on the Reich et al dataset for South Asians and tries to estimate the relationship between the Ancestral North Indian percentage computed by Reich et al and his K=2 admixture results on the same data.
Dienekes then included South Asian Dodecad participants in the analysis and ran a K=4 admixture analysis on Reich et al + Dodecad South Asian data, including Yoruba and Beijing Chinese from the HapMap to catch any African or East Asian ancestry.
The R2 correlation between the West Eurasian admixture component and the Reich et al ANI component is 0.98 which is good. His relationship equation comes out to:
ANI = 0.779*WestEurasian + 39.674
Using this relationship, he calculates the ANI and ASI (Ancestral South Indian) components for Dodecad project members. My results (DOD128) are as follows:
|Ancestral North Indian||75.9%|
|Ancestral South Indian||20.6%|
I should point out that due to my recent Egyptian ancestry, my ANI result is wrong since it's collecting all of the non-African Egyptian in there too.
Also, in the case of Razib, I don't think his East Asian 14.4% should be separated out from his ANI-ASI like that. At least some of it should form part of his ASI percentage in my opinion.
Otherwise, this seems like a very good exercise by Dienekes.
We know that some participants in Harappa Ancestry Project had also submitted their data to Dodecad Project. And they were curious how the different ancestry components here lined up with the Dodecad ones.
I selected the 36 populations that are present in both. While some of these are still not comparable because of which samples out of these populations were selected to be included in the reference datasets for Dodecad and Harappa, we are using mean values, so barring any big outliers we can compare them.
I decided to find a solution to linear equations of the form:
C1 = a11*D1 + a12*D2 + a13*D3 + a14*D4 + a15*D5 + a16*D6 + a17*D7 + a18*D8 + a19*D9 + a1A*D10
C2 = a21*D1 + a22*D2 + a23*D3 + a24*D4 + a25*D5 + a26*D6 + a27*D7 + a28*D8 + a29*D9 + a2A*D10
C3 = a31*D1 + a32*D2 + a33*D3 + a34*D4 + a35*D5 + a36*D6 + a37*D7 + a38*D8 + a39*D9 + a3A*D10
C4 = a41*D1 + a42*D2 + a43*D3 + a44*D4 + a45*D5 + a46*D6 + a47*D7 + a48*D8 + a49*D9 + a4A*D10
C5 = a51*D1 + a52*D2 + a53*D3 + a54*D4 + a55*D5 + a56*D6 + a57*D7 + a58*D8 + a59*D9 + a5A*D10
C6 = a61*D1 + a62*D2 + a63*D3 + a64*D4 + a65*D5 + a66*D6 + a67*D7 + a68*D8 + a69*D9 + a6A*D10
C7 = a71*D1 + a72*D2 + a73*D3 + a74*D4 + a75*D5 + a76*D6 + a77*D7 + a78*D8 + a79*D9 + a7A*D10
C8 = a81*D1 + a82*D2 + a83*D3 + a84*D4 + a85*D5 + a86*D6 + a87*D7 + a88*D8 + a89*D9 + a8A*D10
C9 = a91*D1 + a92*D2 + a93*D3 + a94*D4 + a95*D5 + a96*D6 + a97*D7 + a98*D8 + a99*D9 + a9A*D10
For each of the 36 populations, we'll have these 9 equations where C1 through C9 are the ancestral component percentages of that population in Harappa Project and D1 through D10 are the ancestral percentages in Dodecad Project.
The unknowns are the coefficients "a". They are 90 unknowns. Since we have 36 populations, the number of equations is 36*9=324. Therefore, this is an overdetermined system of linear equations and we can find a least squares solution to it.
Here is the solution:
|D1 W Asian||D2 NW African||D3 S Euro||D4 NE Asian||D5 SW Asian||D6 E Asian||D7 N Euro||D8 W African||D9 E African||D10 S Asian|
|C1 S Asian||0||0||0||0||0||0||0||0||0||0.92|
|C3 SW Asian||0.46||0.56||0.44||0||0.9||0||-0.09||0||0.09||-0.07|
|C4 SE Asian||0||0||0||0||0||0.6||0||0||0||0|
|C7 NE Asian||0||0||0||0.85||0||0.4||0||0||0||0|
|C8 W African||0||0.12||0||0||0||0||0||1||0||0|
|C9 E African||0||0.12||0||0||0.05||0||0||0||0.89||0|
Don't take the exact values to heart but this shows the general relationship between the Dodecad and Harappa (K=9) ancestral components.
The South Asian components are about the same in both projects.
The Kalash component is a mix but is primarily Dodecad West Asian.
The Harappa Southwest Asian has contributions from Northwest African, West Asian and South European in addition to the Dodecad West Asian component.
The Southeast Asian component corresponds partially to the Dodecad East Asian component.
The Harappa European component is more Dodecad North European than South European.
If enough Harappa-Dodecad participants are willing to let me know their IDs for both projects, I can do a similar analysis using individual data.