Category Archives: Dataset

Afghan Dataset

A paper, Afghan Hindu Kush: Where Eurasian Sub-Continent Gene Flows Converge by Julie Di Cristofaro, Erwan Pennarun, Stéphane Mazières, Natalie M. Myres, Alice A. Lin, Shah Aga Temori, Mait Metspalu, Ene Metspalu, Michael Witzel, Roy J. King, Peter A. Underhill, Richard Villems, Jacques Chiaroni was published at PLoS One about the genetics of the people of Afghanistan.

Thanks to Mait Metspalu, the data is available online. It consists of:

  • 5 Hazara
  • 5 Pashtun
  • 5 Tajik
  • 4 Turkmen
  • 5 Uzbek

Here are the HarappaWorld Admixture results for the samples in this dataset.

You can check the spreadsheet too.

Tadjik1_44Af and Pashtun2_6Af seem to be outliers and there's a possibility they are mislabeled. I would like to look into these two samples further before I calculate group averages.

You can compare these Pashtun results to HGDP Pathan and HAP Pashtun results.

Related Reading:

Haber et al Lebanon Data

Haber et al published a paper Genome-Wide Diversity in the Levant Reveals Recent Structuring by Culture in PLoS Genetics. Here's their abstract:

The Levant is a region in the Near East with an impressive record of continuous human existence and major cultural developments since the Paleolithic period. Genetic and archeological studies present solid evidence placing the Middle East and the Arabian Peninsula as the first stepping-stone outside Africa. There is, however, little understanding of demographic changes in the Middle East, particularly the Levant, after the first Out-of-Africa expansion and how the Levantine peoples relate genetically to each other and to their neighbors. In this study we analyze more than 500,000 genome-wide SNPs in 1,341 new samples from the Levant and compare them to samples from 48 populations worldwide. Our results show recent genetic stratifications in the Levant are driven by the religious affiliations of the populations within the region. Cultural changes within the last two millennia appear to have facilitated/maintained admixture between culturally similar populations from the Levant, Arabian Peninsula, and Africa. The same cultural changes seem to have resulted in genetic isolation of other groups by limiting admixture with culturally different neighboring populations. Consequently, Levant populations today fall into two main groups: one sharing more genetic characteristics with modern-day Europeans and Central Asians, and the other with closer genetic affinities to other Middle Easterners and Africans. Finally, we identify a putative Levantine ancestral component that diverged from other Middle Easterners ~23,700–15,500 years ago during the last glacial period, and diverged from Europeans ~15,900–9,100 years ago between the last glacial warming and the start of the Neolithic.

They also released their data consisting of 75 Lebanese from different regions of the country, with 25 samples each for Muslims, Druze and Christians.

Here are the HarappaWorld admixture results for the Lebanese.

You can check the spreadsheet too.

As the authors mention in the summary:

Population stratification caused by nonrandom mating between groups of the same species is often due to geographical distances leading to physical separation followed by genetic drift of allele frequencies in each group. In humans, population structures are also often driven by geographical barriers or distances; however, humans might also be structured by abstract factors such as culture, a consequence of their reasoning and self-awareness. Religion in particular, is one of the unusual conceptual factors that can drive human population structures. This study explores the Levant, a region flanked by the Middle East and Europe, where individual and population relationships are still strongly influenced by religion. We show that religious affiliation had a strong impact on the genomes of the Levantines. In particular, conversion of the region's populations to Islam appears to have introduced major rearrangements in populations' relations through admixture with culturally similar but geographically remote populations, leading to genetic similarities between remarkably distant populations like Jordanians, Moroccans, and Yemenis. Conversely, other populations, like Christians and Druze, became genetically isolated in the new cultural environment. We reconstructed the genetic structure of the Levantines and found that a pre-Islamic expansion Levant was more genetically similar to Europeans than to Middle Easterners.

the Lebanese can be grouped better based on religion than region. That's why I am using group averages by religion.

Related Reading:

Pagani East African Dataset

Pagani et al analyzed Ethiopian genetics in their paper "Ethiopian Genetic Diversity Reveals Linguistic Stratification and Complex Influences on the Ethiopian Gene Pool". Their dataset consisting of Ethiopians and a few other East African populations is available online.

I have analyzed the Pagani dataset with my HarappaWorld admixture calculator and included the results in my regular spreadsheet.

The group (weighted mean) results are also shown in the usual interactive bar chart below. You can click on the component labels to sort by that ancestral component.

Because the East African component as computed in HarappaWorld is maximum among the Maasai and several of the Pagani dataset populations have a higher percentage of that component, we should be a bit careful with interpreting the HarappaWorld results for the Pagani groups. I'll likely include them in my next iteration of the admixture calculator.

Related Reading:

Simonson Tibet Dataset

Recently, I discovered that the paper Genetic Evidence for High-Altitude Adaptation in Tibet by Tatum S. Simonson, Yingzhong Yang, Chad D. Huff, Haixia Yun, Ga Qin, David J. Witherspoon, Zhenzhong Bai, Felipe R. Lorenzo, Jinchuan Xing, Lynn B. Jorde, Josef T. Prchal, RiLi Ge has its genotyping data online.

It contains 31 Tibetans from Madou county in Qinghai province. The chip is Affymetrix and there are 868,146 SNPs, which means it has a good overlap with Reich et al and Xing et al and also with my reference 3.

I ran reference 3 K=11 admixture on this dataset. Here are the individual results:

The average is as follows:

S Asian E Asian Siberian
1% 84% 14%

Related Reading:

Hodoglugil Dataset

Dr. Mahley was nice enough to share his Turkish and Kyrgyz dataset from the paper Turkish Population Structure and Genetic Ancestry Reveal Relatedness among Eurasian Populations by UÄŸur HodoÄŸlugil and Robert W. Mahley.

It has:

  • 16 Kyrgyz from Bishkek
  • 20 Turks from Aydin
  • 20 Turks from Istanbul
  • 23 Turks from Kayseri

Here are the group averages for the reference 3 K=11 admixture analysis.

And here are the individual results.

Related Reading:

Relatives in Datasets

Recently, there was a paper Identification of Close Relatives in the HUGO Pan-Asian SNP Database by Xiong Yang, Shuhua Xu, and the HUGO Pan-Asian SNP Consortium.

three individuals involved in MZ pairs were excluded from the whole dataset to construct standardized subset PASNP1716; seventy-six individuals involved in first-degree relationships were excluded from PASNP1716 to construct standardized subset PASNP1640; and 57 individuals involved in second-degree relationships were excluded from PASNP1640 to construct standardized subset PASNP1583. The individuals excluded were summarized in Table S6, S7, S8.

Let me engage in some blog triumphalism by saying I wrote about the duplicates and relatives in the Pan-Asian dataset in April 2011.

Here are my blog posts about relatedness in datasets:

Early on, I was removing only first degree relatives from the reference datasets. Nowadays, I try to remove all second degree relatives too. I leave the third degree relatives in the data since it's sometimes hard to figure out how real the low IBD values are in Plink. There are a lot of 3rd degree relatives if Plink is to be believed, but I am a little skeptical.

Since Plink's IBD analysis requires homogenous samples, I am now using KING (paper) for the purpose. I am also looking at kcoeff (paper)

Related Reading:

Metspalu Dataset Update

Dr. Metspalu, who has been very good about sharing data and information, has informed me about a couple of cases of mislabeling in the Metspalu et al dataset.

Our sample labelled D238 and reported as Tharu is in fact a Brahmin sample from Uttar Pradesh.

Following the publication we have identified that sample evo_32 was erroneously labelled as Kanjar before any genetic analyses. We hereby re-label the sample as belonging to Kol population.

Thus, I have updated the Metspalu admixture results and clustering results.

Related Reading:

Metspalu et al Data Relatedness

I performed IBD analysis on the Metspalu dataset using plink and found the relatedness of the following samples to be too high.

ID1 Source1 Population1 ID2 Source2 Population2 IBD Estimate
Mawasi1 Metspalu Mawasi Mawasi1 Chaubey Mawasi 100%
VELZ260 Metspalu Velama Velama_184_R2 Reich Velama 99%
VELZ260 Metspalu Velama VELZ265 Metspalu Velama 19%
VELZ265 Metspalu Velama Velama_184_R2 Reich Velama 19%
D254 Metspalu Tharu Tharu_107_R1 Reich Tharu 99%
D260 Metspalu Tharu Tharu_108_R1 Reich Tharu 98%
evo_32 Metspalu Kanjar 321e Metspalu Kol 53%
HA030 Metspalu Dharkar HA039 Metspalu Dharkar 52%
A387 Metspalu Dusadh A388 Metspalu Dusadh 52%
A394 Metspalu Dusadh A395 Metspalu Dusadh 52%
A395 Metspalu Dusadh A393 Metspalu Dusadh 46%
A394 Metspalu Dusadh A393 Metspalu Dusadh 45%
A392 Metspalu Dusadh A393 Metspalu Dusadh 32%
A392 Metspalu Dusadh A395 Metspalu Dusadh 31%
A392 Metspalu Dusadh A394 Metspalu Dusadh 28%
evo_37 Metspalu Kanjar HA023 Metspalu Dharkar 27%
HA039 Metspalu Dharkar HA041 Metspalu Dharkar 24%
HLKP245 Metspalu Hakkipikki Hallaki_137_R2 Reich Hallaki 22%
PULD160 Metspalu Pulliyar PULD162 Metspalu Pulliyar 20%

As you can see, three samples from Reich et al seem to be the same as Metspalu et al. In addition, two Reich samples seem to be related to Metspalu samples.

There are some Metspalu samples who are likely related to one another. A 50% indicates likely a parent-child or sibling-sibling relationship. A 45-46% relatedness is most likely siblings in my opinion. An 18-19% percentage could be a 1st cousin relationship in an endogamous community. it could also just be the background relatedness in a small, bottlenecked and endogamous community.

It looks like about half of the Dusadh in the Metspalu dataset are related.

I am surprised at the close relationship of a Kanjar and a Kol in the dataset, though both are from Uttar Pradesh.

Related Reading:

Dataset in Public

I get requests from time to time about sharing my Reference 3 dataset. I use a few datasets which I am not allowed to redistribute, but most of the others are actually public and the main issue is to convert them to plink format and merge them.

I have released code for the conversion already but to make the task even easier I am letting you guys know that I already released a subset of my dataset a long time ago. Razib wrote about it and added the detailed instructions on using that dataset.

So here's the link to the dataset which contains about 30,000 SNPs and almost 4,000 individuals from HapMap, HGDP, SGVP, Behar et al and Xing et al.

Related Reading:

Changes to 1000 Genomes South Asians

Looks like there have been some changes to the populations in the 1000 Genomes:

At least we'll be able to answer questions about the origin of the Sinhalese soon enough. I'm a little bummed that the Indian populations in Maharashtra and West Bengal disappeared. Did the Permit Raj strike again?

Related Reading: