Eurasian ChromoPainter Chunk Counts

Continuing with the Eurasian ChromoPainter analysis, here is the zip file containing the chunk counts that were donated by an individual in a column to an individual in a row. Please note that this is an all-against-all analysis, so it does not directly show the direction of gene flow. Also, the IDs I used here are based on ethnicity (except for harappann which are mixed Harappa Project participants). If you want to find out your ethnicID, take a look at this spreadsheet which has the appropriate mapping.

Since fineStructure classified these 2,001 individuals into 203 populations, it's easier to look at the chunk counts averaged over these populations.
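As a rough illustration of what averaging over populations means here, the sketch below takes a simple mean over all recipient/donor pairs of individuals for each pair of populations. This is an assumption about the averaging, not necessarily how the figures in this post were computed, and the individual IDs, population labels and counts are made up.

```python
from collections import defaultdict

def average_by_population(chunk_counts, pop_of):
    """Average an individual-level chunk-count table over populations.

    chunk_counts[recipient][donor] is the number of chunks the donor
    individual gave to the recipient individual; pop_of maps each
    individual ID to its population label.
    """
    sums = defaultdict(float)
    pairs = defaultdict(int)
    for recipient, row in chunk_counts.items():
        for donor, count in row.items():
            if donor == recipient:
                continue  # skip any self-to-self entry
            key = (pop_of[recipient], pop_of[donor])
            sums[key] += count
            pairs[key] += 1
    # Mean chunk count per (recipient population, donor population) pair
    return {key: sums[key] / pairs[key] for key in sums}

# Hypothetical toy data: rows are recipients, columns are donors.
chunk_counts = {
    "a1": {"a2": 30.0, "b1": 10.0},
    "a2": {"a1": 28.0, "b1": 12.0},
    "b1": {"a1": 9.0, "a2": 11.0},
}
pop_of = {"a1": "PopA", "a2": "PopA", "b1": "PopB"}

avg = average_by_population(chunk_counts, pop_of)
for (recipient_pop, donor_pop), mean in sorted(avg.items()):
    print(recipient_pop, "<-", donor_pop, mean)
```

A population with a single member has no within-population pair, so it simply gets no diagonal entry in this sketch.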

From top to bottom (recipient) and left to right (donor), the five major branches are South Asian, European, Near Eastern & Western Asian, Inner Asian/Siberian, and East Asian respectively.

This population chunk count data is available in a spreadsheet.

Now, let's look at some specific recipient clusters/populations.

Here are the top 50 populations that have donated chunks to the Kalash (Pop133).

The three bars at the bottom are for the 3 different (closely related) Kalash clusters. The clusters donating the most after that are the Burusho, Sindhi, Pathan, etc. The top non-South Asian donor (Tajik Pop116) is at #21 and the next one is also Tajik (Pop95) at #38.
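Ranking donors like this amounts to sorting one recipient row of the population-level table. A minimal sketch, assuming the table is held as a nested dictionary (the population names and counts below are invented, not taken from the spreadsheet):

```python
def top_donors(pop_chunk_counts, recipient, n=50):
    """Return the n largest (donor, chunk count) pairs for one recipient row."""
    row = pop_chunk_counts[recipient]
    ranked = sorted(row.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

def donor_rank(pop_chunk_counts, recipient, donor):
    """1-based position of a donor in the recipient's full ranking."""
    ranked = top_donors(pop_chunk_counts, recipient, n=len(pop_chunk_counts[recipient]))
    return [name for name, _ in ranked].index(donor) + 1

# Hypothetical toy table: toy[recipient][donor] = average chunk count.
toy = {"PopA": {"PopA": 29.0, "PopB": 11.0, "PopC": 15.0}}

print(top_donors(toy, "PopA", n=2))
print(donor_rank(toy, "PopA", "PopB"))
```

The recipient's own cluster normally comes first, which is why the ranks quoted in these posts start after the recipient's own bars.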

Now here are the top donors for the Pathans (Pop148).

Interestingly, the number of chunks donated to Pathans from Balochi, Brahui and Sindhi seems to be a bit more than from Punjabis. Again, Tajiks are the closest non-South Asian group at #55 and #59, followed by Kurds at #62 and an Iranian/Kurd cluster (Pop172) at #63.

Now let's look at the donors for Pop134 which includes 2 Bhatia, 2 Gujarati-B, 3 Haryana Jatt, 1 Kashmiri, 4 Pathan, 5 Punjabi, 1 Punjabi Brahmin, 5 Punjabi Jatt, 2 Punjabi Ramgarhia, 1 Rajasthani Brahmin, 1 Sindhi and 3 Singapore Indians.

The top donors (other than Punjabis, of course) are Sindhis and Gujarati-B. The top non South Asian donors are Tajiks at #65 & #67, Iranians/Kurds at #69, Turkmen at #70, Kurds at #73 and Lezgin at #75.

Now for Pop181 (2 Baloch and 9 Brahui).

The Baloch/Brahui are more inbred than Punjabis and Pathans. After the top donors from Baloch, Brahui and Makrani, we get Sindhis, Pathans, Velama and Punjabis. The top non-South Asian donor populations are Iranian/Kurd at #28, Turkmen at #33, Turk/Kurd (Pop162) at #35, Iranian Jews at #39, Kurd at #41, a lone Saudi at #42, Iraqi Jews at #43, Tajik at #44, Druze at #47, Armenians at #48, and Samaritan at #50. So it seems like Baloch and Brahui are a lot more West Asian than other groups in Pakistan/NW India.

Let's look at the donors for Pop129 (1 Tamil Nadu Brahmin, 4 Iyengar Brahmin, 8 Iyer Brahmin, and 9 Singapore Indians).

The top donors, after Pop129, are Iyengar Brahmins and a group consisting of other South Indian Brahmins, Kerala Christians and Nairs, and then Velama. The Dusadh are the top North Indian donor, followed by Gujarati-B and Chamar. The top non-South Asian donor is Tajik at #73.

Now for the top donors for Pop188 which includes 33 Singapore Indians, 4 Tamil Vellalar, 3 Andhra Pradesh Reddy, 2 Andhra Pradesh, 2 Dusadh, 2 Karnataka, 2 Sinhalese, 2 Tamil Nadar, 2 Tamil Nadu Scheduled Caste, 1 Chenchu, 1 Kerala Christian, 1 Kerala Muslim, 1 North Kannadi, 1 Tamil Muslim, 1 Tamil Vishwakarma and 1 Velama.

The top donors are Sakilli, Piramalaikallar, and Velama. Their top non South Asian donor is a group of 5 Singapore Malays at #72, followed by Romanian and Serbian Romany at #73.

Finally, let's see which clusters are the top donors for the Paniya (Pop65), who get the most South Indian component in my HarappaWorld Admixture runs.

Their top donors are Paniya, Malayan, Pulliyar and Kurumba. Their top non-South Asian donors are Singapore Malays at #55, Burmese at #59, and Cambodian/Singapore Malay at #64.

Eurasian fineStructure Dendrograms

The dendrogram in the last post about Eurasian ChromoPainter/fineStructure analysis is a little hard to make sense of, so here is the same info in a better format.

First, the upper portion showing the relationship of the five branches:

Now, let's take a look at Branch1 which consists of South Asians:

Branch2 is European.

Branch3 is mostly the Near East and western Asia.

Branch4 is Inner Asia/Siberia.

And Branch5 is East Asian.

Note that each leaf label consists of an ethnicity followed by the number of individuals from that group who belong to that particular cluster. However, some of the longer labels are cut off in the images.

Eurasian ChromoPainter Analysis

Some months ago, I decided to run a big ChromoPainter analysis of the Eurasian samples I have. I removed from my dataset not only all Sub-Saharan Africans, but also North Africans and anyone else with more than 2% African admixture (which unfortunately included me).

Since the number of samples was still too large, I picked 25 random individuals from each non-South-Asian ethnicity while keeping all South Asians. I also tried to remove all close relatives and those with a high missing genotyping rate.
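The subsampling step could look something like the sketch below. It assumes that groups with 25 or fewer individuals are kept whole, which the post does not spell out, and the group names, sample IDs and seed are hypothetical.

```python
import random
from collections import defaultdict

def subsample(samples, south_asian_groups, cap=25, seed=0):
    """Keep all South Asians; keep at most `cap` random samples per other group.

    samples is a list of (sample_id, ethnicity) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_group = defaultdict(list)
    for sample_id, ethnicity in samples:
        by_group[ethnicity].append(sample_id)

    kept = []
    for ethnicity, ids in by_group.items():
        if ethnicity in south_asian_groups or len(ids) <= cap:
            kept.extend(ids)  # assumption: small groups are kept whole
        else:
            kept.extend(rng.sample(ids, cap))
    return kept

# Hypothetical toy input: two non-South Asian groups and one South Asian group.
samples = (
    [(f"x{i}", "GroupX") for i in range(40)]
    + [(f"y{i}", "GroupY") for i in range(10)]
    + [(f"z{i}", "SouthAsianZ") for i in range(30)]
)
kept = subsample(samples, south_asian_groups={"SouthAsianZ"})
print(len(kept))
```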

In the end, I had 254,576 SNPs for 2,001 samples belonging to 197 ethnic groups.

I ran ShapeIT to phase their genomes and then ChromoPainter and fineStructure. The whole process took about 2 months.

Then I got busy and the results sat on my computer for more than a month.

Now let's look at the ChromoPainter/fineStructure analysis. Due to my time constraints, I am going to present the results in several posts.

Today, let's look at the fineStructure clustering run on the chunkcount output of ChromoPainter. It divided the individuals into 203 populations. Here's the spreadsheet containing the group and individual population clustering.

And here is the dendrogram showing the relationship of the clusters/populations computed by fineStructure.

UPDATE: Better dendrograms

Reference I: Eurasian Subsets

Since we have established that none of the Harappa participants so far have African admixture except for HRP0001 (me) and HRP0027 (Caribbean Indian) and African populations are the most diverse, it's best to remove the African populations from our Reference I dataset and do some analysis using the Eurasian subset.

One option is to exclude the 517 samples of sub-Saharan African populations in our dataset:

  • Bantu Kenya: 11
  • Bantu South Africa: 8
  • Ethiopian Jews: 12
  • Ethiopians: 19
  • Kenyan Luhya: 101
  • Maasai: 135
  • Mandenka: 22
  • African Americans: 48
  • Yoruba: 161

However, in addition to the above, I decided to remove anyone from the Reference I dataset who had more than x% African ancestry (the sum of the East African, East African Bantu and West African components) in the K=12 admixture run. I created two Eurasian datasets: Eurasian90 and Eurasian95.
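In code, the filter is just a threshold on the sum of the three African components. The sketch below assumes the admixture results are available as per-sample dictionaries of fractions; the sample IDs, the "Other" label and the values are made up for illustration.

```python
AFRICAN_COMPONENTS = ("East African", "East African Bantu", "West African")

def african_fraction(admixture):
    """Sum of the three African components for one sample (as fractions)."""
    return sum(admixture.get(component, 0.0) for component in AFRICAN_COMPONENTS)

def eurasian_subset(samples, max_african):
    """Keep samples whose African fraction is not more than max_african."""
    return [
        sample_id
        for sample_id, admixture in samples.items()
        if african_fraction(admixture) <= max_african
    ]

# Hypothetical toy results; "Other" stands in for the remaining components.
toy = {
    "s1": {"East African": 0.02, "West African": 0.01, "Other": 0.97},
    "s2": {"East African": 0.04, "East African Bantu": 0.03, "Other": 0.93},
    "s3": {"West African": 0.20, "Other": 0.80},
}

eurasian90 = eurasian_subset(toy, max_african=0.10)
eurasian95 = eurasian_subset(toy, max_african=0.05)
print(eurasian90, eurasian95)
```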

Eurasian90 excludes all samples with more than 10% African admixture. That completely removes the following populations in addition to the above:

  • Egyptians: 12
  • Moroccans: 10
  • Mozabite: 29

Also, some samples from the following populations were removed for Eurasian90:

  • Balochi: 3/24
  • Bedouin: 19/46
  • Brahui: 2/25
  • Iranians: 3/19
  • Jordanians: 6/20
  • Lebanese: 2/7
  • Makrani: 3/25
  • Palestinian: 10/46
  • Saudis: 2/20
  • Sindhi: 2/24
  • Syrians: 2/16
  • Yemenese: 7/8

That's a total of 629 samples in the Reference I dataset that had more than 10% African admixture. Thus Eurasian90 has 2,025 samples. The complete list is here.
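The total can be checked by adding up the three lists above. The counts in this snippet are copied from this post; the Reference I size at the end is only implied by the numbers (629 removed plus 2,025 kept), not stated anywhere.

```python
# Sub-Saharan African populations removed outright
sub_saharan = {
    "Bantu Kenya": 11, "Bantu South Africa": 8, "Ethiopian Jews": 12,
    "Ethiopians": 19, "Kenyan Luhya": 101, "Maasai": 135,
    "Mandenka": 22, "African Americans": 48, "Yoruba": 161,
}
# Populations removed completely by the 10% threshold
fully_removed = {"Egyptians": 12, "Moroccans": 10, "Mozabite": 29}
# Populations that lost only some samples at the 10% threshold
partly_removed = {
    "Balochi": 3, "Bedouin": 19, "Brahui": 2, "Iranians": 3,
    "Jordanians": 6, "Lebanese": 2, "Makrani": 3, "Palestinian": 10,
    "Saudis": 2, "Sindhi": 2, "Syrians": 2, "Yemenese": 7,
}

removed_total = (
    sum(sub_saharan.values())
    + sum(fully_removed.values())
    + sum(partly_removed.values())
)
implied_reference_size = removed_total + 2025  # implied, not stated in the post

print(sum(sub_saharan.values()), removed_total, implied_reference_size)
```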

The other dataset, Eurasian95, excludes everyone with more than 5% African admixture. Thus, in addition to the samples listed above, it excludes the following:

  • Balochi: 1
  • Bedouin: 19
  • Brahui: 1
  • Druze: 1
  • Iranians: 1
  • Jordanians: 14 (completely removed)
  • Makrani: 8
  • Morocco Jews: 2
  • Palestinian: 36 (completely removed)
  • Saudis: 16
  • Sindhi: 2
  • Syrians: 7
  • Yemenese: 1 (completely removed)
  • Yemen Jews: 15 (completely removed)

Eurasian95 is thus left with 1,901 samples, whose breakdown is listed here.
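As a quick check of that figure, the additional removals can be summed and subtracted from the Eurasian90 count. The numbers are copied from the lists in this post; the last lines also confirm that the populations marked as completely removed add up to their full sizes across the two thresholds.

```python
# Extra samples removed when the threshold drops from 10% to 5%
extra_removed = {
    "Balochi": 1, "Bedouin": 19, "Brahui": 1, "Druze": 1, "Iranians": 1,
    "Jordanians": 14, "Makrani": 8, "Morocco Jews": 2, "Palestinian": 36,
    "Saudis": 16, "Sindhi": 2, "Syrians": 7, "Yemenese": 1, "Yemen Jews": 15,
}
eurasian90_size = 2025
eurasian95_size = eurasian90_size - sum(extra_removed.values())

# (removed at 10%, population size) for groups fully gone by the 5% threshold
removed_at_10 = {"Jordanians": (6, 20), "Palestinian": (10, 46), "Yemenese": (7, 8)}
fully_gone = {
    name: removed + extra_removed[name] == size
    for name, (removed, size) in removed_at_10.items()
}

print(sum(extra_removed.values()), eurasian95_size, fully_gone)
```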

I'll be experimenting with both Eurasian90 and Eurasian95.