Category Archives: Clusters

Eurasian fineStructure Dendrograms

The dendrogram in the last post about Eurasian ChromoPainter/fineStructure analysis is a little hard to make sense of, so here is the same info in a better format.

First, the upper portion showing the relationship of the five branches:

Now, let's take a look at Branch1 which consists of South Asians:

Branch2 is European.

Branch3 is mostly the Near East and western Asia.

Branch4 is Inner Asia/Siberia.

And Branch5 is East Asian.

Note that the leaf labels consist of ethnicity followed by the number of that group who belong to that particular cluster. However, some of the labels are cut off in the images since they were long.

South Asian fineStructure Ref3 Admixture

I was curious about the admixture patterns of the clusters that fineSTRUCTURE computed for my South Asian run, so I calculated the average admixture for each of the 89 clusters using the reference 3 admixture results.
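
A minimal sketch of that averaging step, assuming the cluster assignments and the per-individual admixture proportions are already loaded into dictionaries. The function name, the individual IDs, and the toy numbers are hypothetical and are not taken from the actual run.

```python
from collections import defaultdict


def average_admixture_by_cluster(cluster_of, admixture):
    """Average each admixture component over the individuals in each cluster."""
    sums = {}
    counts = defaultdict(int)
    for individual, proportions in admixture.items():
        cluster = cluster_of[individual]
        if cluster not in sums:
            sums[cluster] = [0.0] * len(proportions)
        for k, value in enumerate(proportions):
            sums[cluster][k] += value
        counts[cluster] += 1
    return {c: [s / counts[c] for s in totals] for c, totals in sums.items()}


# Hypothetical toy data: three individuals, two clusters, three components.
cluster_of = {"ind1": "A", "ind2": "A", "ind3": "B"}
admixture = {
    "ind1": [0.50, 0.25, 0.25],
    "ind2": [0.25, 0.50, 0.25],
    "ind3": [0.00, 0.25, 0.75],
}
averages = average_admixture_by_cluster(cluster_of, admixture)
print(averages)
```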

The default order of the clusters is to keep the closer clusters together.

Ref3 Admixture Dendrograms

I have posted the reference 3 K=11 admixture results for all populations and datasets. Here are the relevant links:

So let's try a dendrogram of all these populations' average admixture results. Instead of using regular Euclidean distance, I used some weighting based on Fst distances between admixture components, very similar to what Palisto did.

Here's a dendrogram of all datasets using complete linkage.

Since the Pan-Asian dataset had only 5,400 SNPs common with reference 3, we need to be careful interpreting the tree above. Just to make sure, here's the dendrogram excluding Pan-Asian populations.

Dodecad South Asian ChromoPainter

Dienekes ran ChromoPainter/fineSTRUCTURE analysis of South Asians along with some West Eurasian populations, something I had neglected to do in my own South Asian run.

Using Dienekes' data, I was trying to figure out which South Asian populations had more DNA chunks in common with other groups when I ran into something strange. Looking at the chunkcount spreadsheet, if we focus on a recipient population (i.e., one row), we can see which populations contributed more "chunks". For most populations, the results are as expected: the top donor is either the same population or a closely related one. For example, let's look at the top 5 matches for Velamas_M:

Recipient   Velamas_M   Pulliyar_M   North_Kannadi   Chamar_M   Piramalai_Kallars_M
Velamas_M   1265.77     1259.38      1256.06         1255.6     1254.74
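
As a rough sketch of this lookup, assuming the chunkcounts table has been loaded into a nested dictionary keyed by recipient and then by donor: the function names `top_donors` and `donor_rank` are hypothetical, and the only data used is the Velamas_M row quoted above.

```python
def top_donors(chunkcounts, recipient, n=5):
    """Return the n donors with the highest chunk counts for one recipient row."""
    row = chunkcounts[recipient]
    return sorted(row.items(), key=lambda item: item[1], reverse=True)[:n]


def donor_rank(chunkcounts, recipient, donor):
    """Return the 1-based rank of a donor within a recipient's row."""
    row = chunkcounts[recipient]
    ordered = sorted(row, key=row.get, reverse=True)
    return ordered.index(donor) + 1


# One recipient row, using the five values quoted for Velamas_M.
chunkcounts = {
    "Velamas_M": {
        "Velamas_M": 1265.77,
        "Pulliyar_M": 1259.38,
        "North_Kannadi": 1256.06,
        "Chamar_M": 1255.6,
        "Piramalai_Kallars_M": 1254.74,
    }
}

print(top_donors(chunkcounts, "Velamas_M", n=3))
print(donor_rank(chunkcounts, "Velamas_M", "Velamas_M"))
```

Calling `donor_rank` with the recipient's own name as the donor gives the self-rank, which is the number to watch in the cases that follow.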

However, when we do the same for Pathans, Sindhis, Uttar Pradesh Brahmins, Kshatriyas and Muslims, we get strange results.

Recipient   Chamar_M   Velamas_M   UP_Scheduled_Caste_M   Piramalai_Kallars_M   Muslim_M
Pathan      1229.91    1229.56     1229.53                1229.32               1229.27

Do Pathans match Chamar the best? Pathans don't show up as a donor till #11.

Recipient   Chamar_M   Piramalai_Kallars_M   Pulliyar_M   Velamas_M   North_Kannadi
Sindhi      1234.09    1234.08               1233.85      1233.6      1233.55

Again, Sindhis as donors are #12.

Recipient       Pulliyar_M   Chamar_M   North_Kannadi   Kol_M     Piramalai_Kallars_M
Brahmins_UP_M   1244.6       1244.53    1243.44         1242.88   1241.94

The same Brahmins_UP_M are #13 as donors.

Recipient     Pulliyar_M   Chamar_M   North_Kannadi   Kol_M     Piramalai_Kallars_M
Kshatriya_M   1247.72      1247.36    1246.42         1244.98   1244.56

And Kshatriya_M are #12 as donors.

Recipient   Pulliyar_M   Chamar_M   North_Kannadi   Kol_M     Piramalai_Kallars_M
Muslim_M    1255.96      1255.36    1253.96         1251.74   1250.86

Muslim_M are #8 as donors.

There is a pattern here among the top donors for these populations. The same populations show up time and again.

Now compare this to my results (with a larger South Asian dataset). The top 10 matches for Pathans are:

  1. pathan
  2. punjabi-jatt
  3. bhatia
  4. haryana-jatt
  5. rajasthani-brahmin
  6. punjabi
  7. balochi
  8. kashmiri
  9. punjabi-brahmin
  10. sindhi

For Sindhis,

  1. sindhi
  2. bhatia
  3. balochi
  4. makrani
  5. brahui
  6. punjabi-jatt
  7. haryana-jatt
  8. meghawal
  9. pathan
  10. punjabi

For Brahmins from Uttar Pradesh,

  1. bihari-brahmin
  2. haryana-jatt
  3. brahmin-uttar-pradesh
  4. punjabi-jatt
  5. kurmi
  6. sourastrian
  7. bengali-brahmin
  8. bihari-kayastha
  9. bhatia
  10. up-brahmin

For Kshatriyas,

  1. bihari-brahmin
  2. kurmi
  3. meena
  4. kshatriya
  5. rajasthani-brahmin
  6. haryana-jatt
  7. punjabi-jatt
  8. bengali-brahmin
  9. kerala-muslim
  10. sourastrian

For Muslims,

  1. muslim
  2. chamar
  3. kol
  4. oriya
  5. uttar-pradesh-scheduled-caste
  6. bihari-muslim
  7. sourastrian
  8. brahmin-uttaranchal
  9. dusadh
  10. bihari-brahmin

If Dienekes can post a chunkcount file for the clusters computed by fineSTRUCTURE, maybe we can try to figure out what happened.

Dense South Asian ChromoPainter

I had run ChromoPainter/fineSTRUCTURE for 715 South Asians using only about 90,000 SNPs. I thought it would be a useful exercise to use more SNPs, so I had to drop the Reich et al dataset. That left me with 615 individuals and 418,854 SNPs.

The "chunkcounts" file has the donors in columns and recipients in rows. Here's a heat map of the same.

fineSTRUCTURE classified these 615 individuals into 89 clusters. I have named these clusters for convenience; however, the names do not imply that everyone in, say, the Punjab cluster is Punjabi.

While I created the cluster tree at the top of the spreadsheet, here's how the clusters are related.

The most interesting thing is how Gujarati A (likely Patels) are an out-group to everyone else. Another major grouping is that of the Baloch, Brahui and Makrani, along with 4 Sindhis (might they belong to one of the Baloch tribes of Sindh?).

The Punjabis, Sindhis and Pathan get better classification here than they did last time.

The Punjab cluster includes 3 Gujarati B, 4 Pathans, 2 Singapore Indians, Punjabis, Haryanvis, Kashmiris, and a Rajasthani Brahmin. Even using this method, HRP0036, who is half-Sri Lankan and half-German/Polish, was classified in the same cluster.

The Dharkar and Kanjar could not be separated at all here. According to Metspalu:

There are three second degree relatives groups in our sample: ..snip.. [Kanjar evo_37 and Dharkar HA023]. Again the last pair needs further explanation. The Dharkar and Kanjar practice a nomadic lifestyle and were living side by side at the time of sampling. As the ethnic border between the two is permeable we cannot rule out neither our error during sample collection and/or subsequent labelling nor shifted self-identity.

The inter-cluster heat map:

And you can see the chunkcounts donated from each cluster to recipient individuals in a spreadsheet.

The pairwise coincidence:

And the PCA plots:

ChromoPainter/fineStructure South Asians

You have probably heard of ChromoPainter/fineSTRUCTURE by now (Eurogenes, Dienekes, MDLP and Razib).

So I decided to take the South Asian samples on which I had earlier run PCA/Mclust and run them through ChromoPainter and fineSTRUCTURE.

Here is the coancestry matrix among the 715 participants visualized as a heat map.

UPDATE: Here's a huge image showing the same.

fineSTRUCTURE can use this coancestry matrix to classify individuals into clusters, 52 in this case (compared to 38 using PCA and MClust). You can check the cluster assignments in a spreadsheet.

Note that I have named the clusters. That's just a shorthand so we don't have to refer to them by cluster number. Instead I used the population with the largest number of individuals in a cluster to label that cluster.
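
The naming rule can be sketched as follows, assuming the cluster assignments are available as (cluster, population) pairs, one per individual. The function name and the toy assignments are hypothetical; ties go to whichever population was seen first.

```python
from collections import Counter


def label_clusters(assignments):
    """Name each cluster after its most common population.

    assignments: iterable of (cluster_id, population) pairs, one per individual.
    On a tie, the population encountered first in the input wins.
    """
    members = {}
    for cluster_id, population in assignments:
        members.setdefault(cluster_id, Counter())[population] += 1
    return {cid: counts.most_common(1)[0][0] for cid, counts in members.items()}


# Hypothetical toy assignments for two clusters.
assignments = [
    (1, "punjabi"),
    (1, "punjabi"),
    (1, "pathan"),
    (2, "sindhi"),
]
labels = label_clusters(assignments)
print(labels)
```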

Here's the cluster-level coancestry heat map.

And the pairwise coincidence:

And finally PCA plots for the first 10 dimensions from fineSTRUCTURE.

UPDATE (Feb 9, 2012): New PCA plots with better markers for the clusters.

Metspalu Dataset Update

Dr. Metspalu, who has been very good about sharing data and information, has informed me about a couple of cases of mislabeling in the Metspalu et al dataset.

Our sample labelled D238 and reported as Tharu is in fact a Brahmin sample from Uttar Pradesh.

Following the publication we have identified that sample evo_32 was erroneously labelled as Kanjar before any genetic analyses. We hereby re-label the sample as belonging to Kol population.

Thus, I have updated the Metspalu admixture results and clustering results.

South Asian PCA + Mclust

I combined reference 3 with Metspalu et al data and Harappa Ancestry Project participants (up to HRP0200). Then I kept only those individuals whose combined proportion of South Asian and Onge components on my reference 3 admixture results was more than 50%.
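
A small sketch of that filter, assuming each individual's admixture result is a dictionary of component name to proportion. The key names, the function name, and the sample values are assumptions for illustration only.

```python
def is_south_asian(components, threshold=0.5):
    """True when the South Asian and Onge components together exceed the threshold."""
    return components["South Asian"] + components["Onge"] > threshold


# Hypothetical admixture results (proportions, not percentages).
results = {
    "sampleA": {"South Asian": 0.45, "Onge": 0.10, "Other": 0.45},
    "sampleB": {"South Asian": 0.30, "Onge": 0.05, "Other": 0.65},
}
kept = [name for name, comps in results.items() if is_south_asian(comps)]
print(kept)
```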

I ran PCA on these South Asian samples and kept 31 dimensions. Running Mclust on the PCA results gave me 37 clusters.

The clustering results are in a spreadsheet.

For an individual, the value under a specific cluster shows the probability of that person belonging to that cluster. For example, HRP0152 has a 58% probability of belonging to cluster CL8 and 42% probability of being in cluster CL14.

For the populations in the first sheet, I added up the probabilities of all the samples in that population to get the expected number of individuals of that ethnicity belonging to a specific cluster.
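
The population-level sums can be sketched like this, assuming a mapping from each individual to a population and a mapping from each individual to cluster probabilities. The HRP0152 probabilities are the ones quoted above; the population label `pop_x` and the second individual are hypothetical.

```python
from collections import defaultdict


def expected_cluster_counts(population_of, membership):
    """Sum membership probabilities per population and cluster."""
    expected = defaultdict(lambda: defaultdict(float))
    for individual, probabilities in membership.items():
        population = population_of[individual]
        for cluster, probability in probabilities.items():
            expected[population][cluster] += probability
    return {pop: dict(clusters) for pop, clusters in expected.items()}


# "pop_x" and "sample2" are made up for illustration.
population_of = {"HRP0152": "pop_x", "sample2": "pop_x"}
membership = {
    "HRP0152": {"CL8": 0.58, "CL14": 0.42},
    "sample2": {"CL8": 1.0},
}
expected = expected_cluster_counts(population_of, membership)
print(expected)
```

With these inputs the hypothetical population would be expected to contribute about 1.58 individuals to CL8 and 0.42 to CL14.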

In the second sheet, I have listed all the individual samples' clustering results.

There are some outliers who didn't belong in any cluster: HRP0001 (me, of course), 7 (out of 18) Makranis, 4 (out of 23) Sindhis, 3 (all) Great Andamanese, 1 (out of 20) Balochi, 1 (out of 4) Madiga, and 1 (only) Onge.

Reference 3 + Yunusbayev + HAP PCA and Mclust

I ran Principal Component Analysis (PCA) on reference 3 along with the Yunusbayev et al. Caucasus dataset and Harappa Ancestry Project participants (up to HRP0200).

Then I ran mclust on the first 70 dimensions. The resulting 156 clusters can be seen in a spreadsheet.

For individuals belonging to Harappa Ancestry Project, the value in a column shows that person's probability of being in that cluster. So if there is a 1 in CL15 for example, then that person has a 100% probability of being in Cluster CL15.

For the reference population groups, I have added up the probabilities for all the individuals belonging to that group.

Admixture Ref3 Dendrogram HRP0001-HRP0160

I haven't done any admixture dendrograms in a while, so I thought you guys might be interested.

This uses admixture results using Reference 3. As usual, I used complete linkage for the hierarchical clustering.

Let's look at the dendrogram using the regular Euclidean distance measure between admixture results.

I also decided to use a chi-squared distance measure to do the clustering.
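
For readers who want to experiment, here is a rough sketch of complete-linkage clustering under both distance measures. It is a naive pure-Python version, not the code used for the trees here; the chi-squared form shown (squared difference divided by the sum of the two values) is one common variant and is an assumption, as are the function names and the toy admixture vectors.

```python
import math


def euclidean(p, q):
    """Ordinary Euclidean distance between two admixture vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def chi_squared(p, q):
    """Chi-squared distance: sum of (a-b)^2 / (a+b), skipping empty components."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


def complete_linkage(points, distance):
    """Naive agglomerative clustering; returns the merges in order.

    Each merge is (members_1, members_2, linkage_distance), where the linkage
    distance is the largest pairwise distance between the two clusters.
    """
    clusters = [(name,) for name in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(distance(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical admixture proportions for three samples, three components.
points = {
    "s1": [0.8, 0.2, 0.0],
    "s2": [0.7, 0.3, 0.0],
    "s3": [0.1, 0.1, 0.8],
}
merges_euclidean = complete_linkage(points, euclidean)
merges_chi = complete_linkage(points, chi_squared)
print(merges_euclidean)
print(merges_chi)
```

Because this chi-squared form divides each squared difference by the size of the components involved, it gives relatively more weight to differences in small components than Euclidean distance does, which is one reason the two trees can differ.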

PS. Any thoughts on the trees based on two different distance measures?