Tag Archives: dodecad

Dodecad South Asian ChromoPainter

Dienekes ran a ChromoPainter/fineSTRUCTURE analysis of South Asians along with some West Eurasian populations, something I had neglected to do in my own South Asian run.

Using Dienekes' data, I was trying to figure out which South Asian populations share more DNA chunks with other groups when I ran into something strange. In the chunkcount spreadsheet, if we focus on a recipient population (i.e., one row), we can see which donor populations contributed more "chunks". For most populations, the results are as expected: the top donor is either the same population or a closely related one. For example, here are the top 5 matches for Velamas_M:

Recipient   Velamas_M   Pulliyar_M   North_Kannadi   Chamar_M   Piramalai_Kallars_M
Velamas_M   1265.77     1259.38      1256.06         1255.6     1254.74
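Reading the top donors off one row is easy to script. The following is a minimal sketch, assuming the chunkcount matrix has already been loaded into a nested dictionary; the `chunkcounts` structure and both helper names are hypothetical, and only the five Velamas_M values quoted here are filled in.

```python
# Hypothetical in-memory slice of a chunkcount matrix: recipient -> {donor: chunks}.
# Only the five Velamas_M values quoted in the post are included; a real
# spreadsheet has one column per donor population.
chunkcounts = {
    "Velamas_M": {
        "Velamas_M": 1265.77,
        "Pulliyar_M": 1259.38,
        "North_Kannadi": 1256.06,
        "Chamar_M": 1255.6,
        "Piramalai_Kallars_M": 1254.74,
    },
}


def top_donors(matrix, recipient, n=5):
    """Return the n (donor, chunk count) pairs with the highest counts for one row."""
    row = matrix[recipient]
    return sorted(row.items(), key=lambda item: item[1], reverse=True)[:n]


def donor_rank(matrix, recipient, donor):
    """Return the 1-based rank of a donor within the recipient's row."""
    ordered = [name for name, _ in top_donors(matrix, recipient, n=len(matrix[recipient]))]
    return ordered.index(donor) + 1


for donor, count in top_donors(chunkcounts, "Velamas_M"):
    print(donor, count)
```

With a full row loaded, `donor_rank(chunkcounts, "Pathan", "Pathan")` would give the self-match rank that the discussion here refers to.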

However, when we do the same for Pathans, Sindhis, Uttar Pradesh Brahmins, Kshatriyas and Muslims, we get strange results.

Recipient   Chamar_M   Velamas_M   UP_Scheduled_Caste_M   Piramalai_Kallars_M   Muslim_M
Pathan      1229.91    1229.56     1229.53                1229.32               1229.27

Do Pathans really match Chamar the best? Pathans themselves don't show up as a donor until #11.

Recipient   Chamar_M   Piramalai_Kallars_M   Pulliyar_M   Velamas_M   North_Kannadi
Sindhi      1234.09    1234.08               1233.85      1233.6      1233.55

Again, Sindhis as donors are #12.

Recipient       Pulliyar_M   Chamar_M   North_Kannadi   Kol_M     Piramalai_Kallars_M
Brahmins_UP_M   1244.6       1244.53    1243.44         1242.88   1241.94

Brahmins_UP_M themselves are only #13 as donors.

Recipient     Pulliyar_M   Chamar_M   North_Kannadi   Kol_M     Piramalai_Kallars_M
Kshatriya_M   1247.72      1247.36    1246.42         1244.98   1244.56

And Kshatriya_M as donors are #12.

Recipient   Pulliyar_M   Chamar_M   North_Kannadi   Kol_M     Piramalai_Kallars_M
Muslim_M    1255.96      1255.36    1253.96         1251.74   1250.86

Muslim_M are #8 as donors.

There is a pattern here among the top donors for these populations. The same populations show up time and again.
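To make the pattern concrete, here is a small sketch that tallies how often each donor appears in the five top-5 lists quoted in this post. The `top5` dictionary is just a hand transcription of those lists, not something read from Dienekes' file.

```python
from collections import Counter

# Top-5 donor lists transcribed from the rows quoted in this post.
top5 = {
    "Pathan": ["Chamar_M", "Velamas_M", "UP_Scheduled_Caste_M",
               "Piramalai_Kallars_M", "Muslim_M"],
    "Sindhi": ["Chamar_M", "Piramalai_Kallars_M", "Pulliyar_M",
               "Velamas_M", "North_Kannadi"],
    "Brahmins_UP_M": ["Pulliyar_M", "Chamar_M", "North_Kannadi",
                      "Kol_M", "Piramalai_Kallars_M"],
    "Kshatriya_M": ["Pulliyar_M", "Chamar_M", "North_Kannadi",
                    "Kol_M", "Piramalai_Kallars_M"],
    "Muslim_M": ["Pulliyar_M", "Chamar_M", "North_Kannadi",
                 "Kol_M", "Piramalai_Kallars_M"],
}

# Count in how many of the recipient lists each donor shows up.
donor_counts = Counter(donor for donors in top5.values() for donor in donors)

for donor, n in donor_counts.most_common():
    print(f"{donor}: in {n} of {len(top5)} top-5 lists")
```

Chamar_M and Piramalai_Kallars_M appear in all five lists, and Pulliyar_M and North_Kannadi in four of them.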

Now compare this to my results (with a larger South Asian dataset). The top 10 matches for Pathans are:

  1. pathan
  2. punjabi-jatt
  3. bhatia
  4. haryana-jatt
  5. rajasthani-brahmin
  6. punjabi
  7. balochi
  8. kashmiri
  9. punjabi-brahmin
  10. sindhi

For Sindhis,

  1. sindhi
  2. bhatia
  3. balochi
  4. makrani
  5. brahui
  6. punjabi-jatt
  7. haryana-jatt
  8. meghawal
  9. pathan
  10. punjabi

For Brahmins from Uttar Pradesh,

  1. bihari-brahmin
  2. haryana-jatt
  3. brahmin-uttar-pradesh
  4. punjabi-jatt
  5. kurmi
  6. sourastrian
  7. bengali-brahmin
  8. bihari-kayastha
  9. bhatia
  10. up-brahmin

For Kshatriyas,

  1. bihari-brahmin
  2. kurmi
  3. meena
  4. kshatriya
  5. rajasthani-brahmin
  6. haryana-jatt
  7. punjabi-jatt
  8. bengali-brahmin
  9. kerala-muslim
  10. sourastrian

For Muslims,

  1. muslim
  2. chamar
  3. kol
  4. oriya
  5. uttar-pradesh-scheduled-caste
  6. bihari-muslim
  7. sourastrian
  8. brahmin-uttaranchal
  9. dusadh
  10. bihari-brahmin

If Dienekes can post a chunkcount file for the clusters computed by fineSTRUCTURE, maybe we can try to figure out what happened.


Dienekes on ANI/ASI

Dienekes has a word of caution about choosing reference populations and admixture results.

Consider a sample of 25 Mexicans from the HapMap and 25 Yoruba from the HapMap, 25 Iberian Spanish from the 1000 Genomes Project, and 25 Pima from the HGDP as parental populations. We obtain for our Mexican sample:

  • 59.7% European
  • 36.9% "Native American"
  • 3.4% African

Let's run a final experiment with just the Mexicans, Spanish, and Yoruba, i.e., with no Native American samples. At K=3 we obtain:

  • 70% "Native American"
  • 29.7% European
  • 0.4% African

The "Native American" component has increased again! The explanation is simple: as we exclude less admixed Native American groups, Mexicans appear (comparatively) more Native American. The "Native American pole" has shifted, and so has the relative position of populations between them.

In other terms, what is labeled "Native American" in the three experiments is not the same: in the first one it is anchored on the more unadmixed Pima, in the last one in the more admixed Mexicans.

Thus, it seems that unadmixed reference samples are much more useful in getting good results from Admixture.

Then he runs Admixture on the Reich et al dataset for South Asians and tries to estimate the relationship between the Ancestral North Indian percentage computed by Reich et al and his K=2 admixture results on the same data.

Dienekes then included South Asian Dodecad participants in the analysis and ran a K=4 admixture analysis on Reich et al + Dodecad South Asian data, including Yoruba and Beijing Chinese from the HapMap to catch any African or East Asian ancestry.

Here are the admixture results for the reference populations:

The R² between the West Eurasian admixture component and the Reich et al. ANI component is 0.98, which is good. His fitted relationship comes out to:

ANI = 0.779*WestEurasian + 39.674
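As a rough illustration of how such a fitted line is applied, here is a minimal sketch. It assumes both quantities are expressed in percent, which the intercept suggests; the function names and the example input are hypothetical and not taken from Dienekes' analysis.

```python
# Coefficients of the fitted line quoted in the post.
SLOPE = 0.779
INTERCEPT = 39.674


def ani_from_west_eurasian(west_eurasian_pct):
    """Estimate ANI (in percent) from a West Eurasian admixture percentage."""
    return SLOPE * west_eurasian_pct + INTERCEPT


def west_eurasian_from_ani(ani_pct):
    """Invert the line: West Eurasian percentage implied by an ANI estimate."""
    return (ani_pct - INTERCEPT) / SLOPE


# Hypothetical input, not a value from the post.
example_west_eurasian = 40.0
print(round(ani_from_west_eurasian(example_west_eurasian), 1))
```

Note that the line was fitted on the reference populations, so applying it to individuals whose ancestry falls outside that range is an extrapolation.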

Using this relationship, he calculates the ANI and ASI (Ancestral South Indian) components for Dodecad project members. My results (DOD128) are as follows:

East Eurasian 0.0%
African 3.5%
Ancestral North Indian 75.9%
Ancestral South Indian 20.6%

I should point out that due to my recent Egyptian ancestry, my ANI result is wrong, since it also absorbs all of my non-African Egyptian ancestry.

Also, in the case of Razib, I don't think his East Asian 14.4% should be separated out from his ANI-ASI like that. At least some of it should form part of his ASI percentage in my opinion.

Otherwise, this seems like a very good exercise by Dienekes.


Dodecad vs Harappa

We know that some participants in the Harappa Ancestry Project have also submitted their data to the Dodecad Project, and they were curious how the ancestry components here line up with the Dodecad ones.

So I decided to compare the two. I took the ancestral component percentages for the reference populations from the Dodecad population spreadsheet (K=10) and the Harappa Reference I spreadsheet (K=9).

I selected the 36 populations that are present in both. Some of these are still not strictly comparable, because Dodecad and Harappa may have selected different samples from the same population for their reference datasets. However, we are using mean values, so barring any big outliers we can compare them.

I decided to find a solution to linear equations of the form:

C1 = a11*D1 + a12*D2 + a13*D3 + a14*D4 + a15*D5 + a16*D6 + a17*D7 + a18*D8 + a19*D9 + a1A*D10
C2 = a21*D1 + a22*D2 + a23*D3 + a24*D4 + a25*D5 + a26*D6 + a27*D7 + a28*D8 + a29*D9 + a2A*D10
C3 = a31*D1 + a32*D2 + a33*D3 + a34*D4 + a35*D5 + a36*D6 + a37*D7 + a38*D8 + a39*D9 + a3A*D10
C4 = a41*D1 + a42*D2 + a43*D3 + a44*D4 + a45*D5 + a46*D6 + a47*D7 + a48*D8 + a49*D9 + a4A*D10
C5 = a51*D1 + a52*D2 + a53*D3 + a54*D4 + a55*D5 + a56*D6 + a57*D7 + a58*D8 + a59*D9 + a5A*D10
C6 = a61*D1 + a62*D2 + a63*D3 + a64*D4 + a65*D5 + a66*D6 + a67*D7 + a68*D8 + a69*D9 + a6A*D10
C7 = a71*D1 + a72*D2 + a73*D3 + a74*D4 + a75*D5 + a76*D6 + a77*D7 + a78*D8 + a79*D9 + a7A*D10
C8 = a81*D1 + a82*D2 + a83*D3 + a84*D4 + a85*D5 + a86*D6 + a87*D7 + a88*D8 + a89*D9 + a8A*D10
C9 = a91*D1 + a92*D2 + a93*D3 + a94*D4 + a95*D5 + a96*D6 + a97*D7 + a98*D8 + a99*D9 + a9A*D10

For each of the 36 populations, we'll have these 9 equations where C1 through C9 are the ancestral component percentages of that population in Harappa Project and D1 through D10 are the ancestral percentages in Dodecad Project.

The unknowns are the coefficients "a". There are 9*10=90 unknowns. Since we have 36 populations, the number of equations is 36*9=324. Therefore, this is an overdetermined system of linear equations and we can find a least squares solution to it.
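For readers who want to try this kind of fit, here is a minimal sketch using NumPy's `numpy.linalg.lstsq`. The data are synthetic stand-ins, not the actual Dodecad or Harappa spreadsheets, and the variable names are hypothetical. Stacking the 36 populations as rows gives a 36 x 10 matrix D and a 36 x 9 matrix C, and the 9 x 10 coefficient matrix is the least squares solution of D * A^T = C.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pops, n_dodecad, n_harappa = 36, 10, 9

# Synthetic stand-in for the Dodecad percentages: each row sums to 100.
D = rng.dirichlet(np.ones(n_dodecad), size=n_pops) * 100      # shape (36, 10)

# A made-up "true" coefficient matrix, used only to generate test data.
A_true = rng.uniform(0.0, 1.0, size=(n_harappa, n_dodecad))   # shape (9, 10)

# Synthetic stand-in for the Harappa percentages implied by A_true.
C = D @ A_true.T                                              # shape (36, 9)

# Least squares solution of D @ X = C; each column of X holds the ten
# coefficients for one Harappa component, so A is the transpose of X.
X, residuals, rank, singular_values = np.linalg.lstsq(D, C, rcond=None)
A = X.T

print(A.shape)                  # 9 components x 10 coefficients = 90 unknowns
print(np.allclose(A, A_true))   # noise-free data, so the fit recovers A_true
```

Because each Harappa component has its own row of coefficients, the 324 equations split into 9 independent regressions of 36 equations and 10 unknowns each. Plain least squares does not constrain the coefficients to be non-negative, so small negative values can appear in a fit to real data.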

Here is the solution:

Harappa       D1 W Asian  D2 NW African  D3 S Euro  D4 NE Asian  D5 SW Asian  D6 E Asian  D7 N Euro  D8 W African  D9 E African  D10 S Asian
C1 S Asian    0           0              0          0            0            0           0          0             0             0.92
C2 Kalash     0.54        0              -0.05      0.12         0.07         0           0.2        0             0             0.1
C3 SW Asian   0.46        0.56           0.44       0            0.9          0           -0.09      0             0.09          -0.07
C4 SE Asian   0           0              0          0            0            0.6         0          0             0             0
C5 Euro       0           0.19           0.6        0.05         -0.05        0           0.88       0             0             0
C6 Papuan     0           0              0          0            0            0           0          0             0             0
C7 NE Asian   0           0              0          0.85         0            0.4         0          0             0             0
C8 W African  0           0.12           0          0            0            0           0          1             0             0
C9 E African  0           0.12           0          0            0.05         0           0          0             0.89          0

Don't take the exact values to heart but this shows the general relationship between the Dodecad and Harappa (K=9) ancestral components.
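As a sketch of how the coefficients could be used, the following applies the matrix to one set of Dodecad percentages to get approximate Harappa percentages. The coefficients are transcribed from the solution in this post; the function name and the example input are hypothetical.

```python
DODECAD = ["W Asian", "NW African", "S Euro", "NE Asian", "SW Asian",
           "E Asian", "N Euro", "W African", "E African", "S Asian"]
HARAPPA = ["S Asian", "Kalash", "SW Asian", "SE Asian", "Euro",
           "Papuan", "NE Asian", "W African", "E African"]

# Rows are Harappa components C1..C9, columns are Dodecad components D1..D10.
A = [
    [0,    0,    0,     0,    0,     0,   0,     0, 0,    0.92],
    [0.54, 0,    -0.05, 0.12, 0.07,  0,   0.2,   0, 0,    0.1],
    [0.46, 0.56, 0.44,  0,    0.9,   0,   -0.09, 0, 0.09, -0.07],
    [0,    0,    0,     0,    0,     0.6, 0,     0, 0,    0],
    [0,    0.19, 0.6,   0.05, -0.05, 0,   0.88,  0, 0,    0],
    [0,    0,    0,     0,    0,     0,   0,     0, 0,    0],
    [0,    0,    0,     0.85, 0,     0.4, 0,     0, 0,    0],
    [0,    0.12, 0,     0,    0,     0,   0,     1, 0,    0],
    [0,    0.12, 0,     0,    0.05,  0,   0,     0, 0.89, 0],
]


def dodecad_to_harappa(dodecad_pcts):
    """Map ten Dodecad percentages to nine approximate Harappa percentages."""
    if len(dodecad_pcts) != len(DODECAD):
        raise ValueError("expected one percentage per Dodecad component")
    return {
        name: sum(a * d for a, d in zip(row, dodecad_pcts))
        for name, row in zip(HARAPPA, A)
    }


# Hypothetical input: 100% Dodecad S Asian, 0% everything else.
example = [0] * 9 + [100]
for name, value in dodecad_to_harappa(example).items():
    print(f"{name}: {value:.1f}")
```

For this input the mapping gives a small negative SW Asian value and a total above 100%, a reminder that the coefficients only describe the general relationship.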

The South Asian components are about the same in both projects.

The Kalash component is a mix but is primarily Dodecad West Asian.

The Harappa Southwest Asian has contributions from Northwest African, West Asian and South European in addition to the Dodecad Southwest Asian component.

The Southeast Asian component corresponds partially to the Dodecad East Asian component.

The Harappa European component is more Dodecad North European than South European.

If enough Harappa-Dodecad participants are willing to let me know their IDs for both projects, I can do a similar analysis using individual data.
