Category Archives: Dataset

Afghan Dataset

Posted by Zack on January 16, 2014 190 comments

A paper, Afghan Hindu Kush: Where Eurasian Sub-Continent Gene Flows Converge by Julie Di Cristofaro, Erwan Pennarun, Stéphane Mazières, Natalie M. Myres, Alice A. Lin, Shah Aga Temori, Mait Metspalu, Ene Metspalu, Michael Witzel, Roy J. King, Peter A. Underhill, Richard Villems, Jacques Chiaroni, was published in PLoS One about the genetics of the people of Afghanistan.

Thanks to Mait Metspalu, the data is available online. It consists of:

5 Hazara
5 Pashtun
5 Tajik
4 Turkmen
5 Uzbek

Here are the HarappaWorld Admixture results for the samples in this dataset.

You can check the spreadsheet too.

Tadjik1_44Af and Pashtun2_6Af seem to be outliers and there's a possibility they are mislabeled. I would like to look into these two samples further before I calculate group averages.

You can compare these Pashtun results to HGDP Pathan and HAP Pashtun results.

Haber et al Lebanon Data

Posted by Zack on August 3, 2013 45 comments

Haber et al published a paper Genome-Wide Diversity in the Levant Reveals Recent Structuring by Culture in PLoS Genetics. Here's their abstract:

The Levant is a region in the Near East with an impressive record of continuous human existence and major cultural developments since the Paleolithic period. Genetic and archeological studies present solid evidence placing the Middle East and the Arabian Peninsula as the first stepping-stone outside Africa. There is, however, little understanding of demographic changes in the Middle East, particularly the Levant, after the first Out-of-Africa expansion and how the Levantine peoples relate genetically to each other and to their neighbors. In this study we analyze more than 500,000 genome-wide SNPs in 1,341 new samples from the Levant and compare them to samples from 48 populations worldwide. Our results show recent genetic stratifications in the Levant are driven by the religious affiliations of the populations within the region. Cultural changes within the last two millennia appear to have facilitated/maintained admixture between culturally similar populations from the Levant, Arabian Peninsula, and Africa. The same cultural changes seem to have resulted in genetic isolation of other groups by limiting admixture with culturally different neighboring populations. Consequently, Levant populations today fall into two main groups: one sharing more genetic characteristics with modern-day Europeans and Central Asians, and the other with closer genetic affinities to other Middle Easterners and Africans. Finally, we identify a putative Levantine ancestral component that diverged from other Middle Easterners ~23,700–15,500 years ago during the last glacial period, and diverged from Europeans ~15,900–9,100 years ago between the last glacial warming and the start of the Neolithic.

They also released their data consisting of 75 Lebanese from different regions of the country, with 25 samples each for Muslims, Druze and Christians.

Here are the HarappaWorld admixture results for the Lebanese.

You can check the spreadsheet too.

As the authors mention in the summary:

Population stratification caused by nonrandom mating between groups of the same species is often due to geographical distances leading to physical separation followed by genetic drift of allele frequencies in each group. In humans, population structures are also often driven by geographical barriers or distances; however, humans might also be structured by abstract factors such as culture, a consequence of their reasoning and self-awareness. Religion in particular, is one of the unusual conceptual factors that can drive human population structures. This study explores the Levant, a region flanked by the Middle East and Europe, where individual and population relationships are still strongly influenced by religion. We show that religious affiliation had a strong impact on the genomes of the Levantines. In particular, conversion of the region's populations to Islam appears to have introduced major rearrangements in populations' relations through admixture with culturally similar but geographically remote populations, leading to genetic similarities between remarkably distant populations like Jordanians, Moroccans, and Yemenis. Conversely, other populations, like Christians and Druze, became genetically isolated in the new cultural environment. We reconstructed the genetic structure of the Levantines and found that a pre-Islamic expansion Levant was more genetically similar to Europeans than to Middle Easterners.

The Lebanese can be grouped better by religion than by region. That's why I am using group averages by religion.
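Averaging by group is simple to reproduce from the individual results. Here is a minimal Python sketch, assuming each sample is a (group label, admixture percentages) pair and that every sample reports the same components. The function name and the component names in the usage lines are hypothetical, and this is a plain unweighted mean, not necessarily how the spreadsheet figures were produced.

```python
from collections import defaultdict

def group_averages(samples):
    """Average admixture percentages per group.

    samples: list of (group, {component: percent}) pairs.
    Assumes every sample lists every component; a missing
    component would effectively be counted as 0.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for group, components in samples:
        counts[group] += 1
        for name, value in components.items():
            sums[group][name] += value
    return {
        group: {name: total / counts[group] for name, total in comps.items()}
        for group, comps in sums.items()
    }

# Made-up numbers, purely to show the call.
example = [
    ("Druze", {"comp1": 60.0, "comp2": 40.0}),
    ("Druze", {"comp1": 50.0, "comp2": 50.0}),
    ("Christian", {"comp1": 70.0, "comp2": 30.0}),
]
averages = group_averages(example)
```

Grouping on a different label (region instead of religion, say) only requires changing the first element of each pair.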

Pagani East African Dataset

Posted by Zack on July 2, 2012 5 comments

Pagani et al analyzed Ethiopian genetics in their paper "Ethiopian Genetic Diversity Reveals Linguistic Stratification and Complex Influences on the Ethiopian Gene Pool". Their dataset consisting of Ethiopians and a few other East African populations is available online.

I have analyzed the Pagani dataset with my HarappaWorld admixture calculator and included the results in my regular spreadsheet.

The group (weighted mean) results are also shown in the usual interactive bar chart below. You can click on the component labels to sort by that ancestral component.

Because the East African component as computed in HarappaWorld peaks among the Maasai, yet several of the Pagani dataset populations show a higher percentage of that component than the Maasai do, we should be a bit careful when interpreting the HarappaWorld results for the Pagani groups. I'll likely include them in my next iteration of the admixture calculator.

Simonson Tibet Dataset

Posted by Zack on March 7, 2012 6 comments

Recently, I discovered that the paper Genetic Evidence for High-Altitude Adaptation in Tibet by Tatum S. Simonson, Yingzhong Yang, Chad D. Huff, Haixia Yun, Ga Qin, David J. Witherspoon, Zhenzhong Bai, Felipe R. Lorenzo, Jinchuan Xing, Lynn B. Jorde, Josef T. Prchal, RiLi Ge has its genotyping data online.

It contains 31 Tibetans from Madou county in Qinghai province. The chip is Affymetrix and there are 868,146 SNPs, which means it has a good overlap with Reich et al and Xing et al and also with my reference 3.

I ran reference 3 K=11 admixture on this dataset. Here are the individual results:

The average is as follows:

S Asian	E Asian	Siberian
1%	84%	14%

Hodoglugil Dataset

Posted by Zack on February 24, 2012 9 comments

Dr. Mahley was nice enough to share his Turkish and Kyrgyz dataset from the paper Turkish Population Structure and Genetic Ancestry Reveal Relatedness among Eurasian Populations by Uğur Hodoğlugil and Robert W. Mahley.

It has:

16 Kyrgyz from Bishkek
20 Turks from Aydin
20 Turks from Istanbul
23 Turks from Kayseri

Here are the group averages for the reference 3 K=11 admixture analysis.

And here are the individual results.

Relatives in Datasets

Posted by Zack on February 6, 2012 Comments Off

Recently, there was a paper Identification of Close Relatives in the HUGO Pan-Asian SNP Database by Xiong Yang, Shuhua Xu, and the HUGO Pan-Asian SNP Consortium.

three individuals involved in MZ pairs were excluded from the whole dataset to construct standardized subset PASNP1716; seventy-six individuals involved in first-degree relationships were excluded from PASNP1716 to construct standardized subset PASNP1640; and 57 individuals involved in second-degree relationships were excluded from PASNP1640 to construct standardized subset PASNP1583. The individuals excluded were summarized in Table S6, S7, S8.

Let me engage in some blog triumphalism by saying I wrote about the duplicates and relatives in the Pan-Asian dataset in April 2011.

Here are my blog posts about relatedness in datasets:

Early on, I was removing only first degree relatives from the reference datasets. Nowadays, I try to remove all second degree relatives too. I leave the third degree relatives in the data since it's sometimes hard to figure out how real the low IBD values are in Plink. There are a lot of 3rd degree relatives if Plink is to be believed, but I am a little skeptical.
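Deciding which samples to drop once related pairs are known can be sketched as follows. This is a minimal Python sketch under stated assumptions: the pairs come as (id1, id2, IBD proportion) tuples, the 0.18 cut-off is an illustrative stand-in for "second degree or closer", and the greedy rule (drop whichever sample has the most related partners) is one reasonable heuristic, not necessarily the procedure used for the reference datasets. The function name and sample IDs are hypothetical.

```python
from collections import defaultdict

def samples_to_drop(pairs, threshold=0.18):
    """Greedily choose samples to remove so that no remaining pair
    has an IBD estimate at or above the threshold.

    pairs: iterable of (id1, id2, ibd) with ibd as a proportion (0..1).
    """
    related = defaultdict(set)
    for id1, id2, ibd in pairs:
        if id1 != id2 and ibd >= threshold:
            related[id1].add(id2)
            related[id2].add(id1)

    dropped = []
    while True:
        # Samples that still have at least one related partner left.
        candidates = sorted(s for s in related if related[s])
        if not candidates:
            break
        # Drop the sample with the most partners; ties go to the lowest ID.
        worst = max(candidates, key=lambda s: len(related[s]))
        for partner in related[worst]:
            related[partner].discard(worst)
        related[worst].clear()
        dropped.append(worst)
    return dropped

# Made-up IDs: s2 is related to both s1 and s3, so dropping s2 alone suffices.
to_remove = samples_to_drop([("s1", "s2", 0.52), ("s2", "s3", 0.46), ("s4", "s5", 0.10)])
```

Dropping the most-connected sample first tends to keep more individuals than removing one member of every pair independently.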

Since Plink's IBD analysis requires homogeneous samples, I am now using KING (paper) for the purpose. I am also looking at kcoeff (paper).

Metspalu Dataset Update

Posted by Zack on January 4, 2012 4 comments

Dr. Metspalu, who has been very good about sharing data and information, has informed me about a couple of cases of mislabeling in the Metspalu et al dataset.

Our sample labelled D238 and reported as Tharu is in fact a Brahmin sample from Uttar Pradesh.

Following the publication we have identified that sample evo_32 was erroneously labelled as Kanjar before any genetic analyses. We hereby re-label the sample as belonging to Kol population.

Thus, I have updated the Metspalu admixture results and clustering results.

Metspalu et al Data Relatedness

Posted by Zack on December 11, 2011 9 comments

I performed IBD analysis on the Metspalu dataset using plink and found the relatedness of the following samples to be too high.

ID1	Source1	Population1	ID2	Source2	Population2	IBD Estimate
Mawasi1	Metspalu	Mawasi	Mawasi1	Chaubey	Mawasi	100%
VELZ260	Metspalu	Velama	Velama_184_R2	Reich	Velama	99%
VELZ260	Metspalu	Velama	VELZ265	Metspalu	Velama	19%
VELZ265	Metspalu	Velama	Velama_184_R2	Reich	Velama	19%
D254	Metspalu	Tharu	Tharu_107_R1	Reich	Tharu	99%
D260	Metspalu	Tharu	Tharu_108_R1	Reich	Tharu	98%
evo_32	Metspalu	Kanjar	321e	Metspalu	Kol	53%
HA030	Metspalu	Dharkar	HA039	Metspalu	Dharkar	52%
A387	Metspalu	Dusadh	A388	Metspalu	Dusadh	52%
A394	Metspalu	Dusadh	A395	Metspalu	Dusadh	52%
A395	Metspalu	Dusadh	A393	Metspalu	Dusadh	46%
A394	Metspalu	Dusadh	A393	Metspalu	Dusadh	45%
A392	Metspalu	Dusadh	A393	Metspalu	Dusadh	32%
A392	Metspalu	Dusadh	A395	Metspalu	Dusadh	31%
A392	Metspalu	Dusadh	A394	Metspalu	Dusadh	28%
evo_37	Metspalu	Kanjar	HA023	Metspalu	Dharkar	27%
HA039	Metspalu	Dharkar	HA041	Metspalu	Dharkar	24%
HLKP245	Metspalu	Hakkipikki	Hallaki_137_R2	Reich	Hallaki	22%
PULD160	Metspalu	Pulliyar	PULD162	Metspalu	Pulliyar	20%

As you can see, three samples from Reich et al seem to be the same as Metspalu et al. In addition, two Reich samples seem to be related to Metspalu samples.
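For anyone who wants to build a similar table: plink's --genome option writes pairwise IBD estimates to a whitespace-delimited .genome file that includes IID1, IID2 and PI_HAT columns. Below is a minimal Python sketch that picks the high-relatedness pairs out of such text. The function name, the 0.18 cut-off and the rows in the usage lines are illustrative assumptions, not the actual output behind the table above.

```python
def high_ibd_pairs(genome_text, min_pi_hat=0.18):
    """Return (iid1, iid2, pi_hat) for pairs at or above min_pi_hat,
    sorted from most to least related.

    genome_text: contents of a plink .genome file (header line first).
    """
    lines = [ln for ln in genome_text.splitlines() if ln.strip()]
    header = lines[0].split()
    # Look columns up by name rather than by fixed position.
    i1 = header.index("IID1")
    i2 = header.index("IID2")
    ip = header.index("PI_HAT")

    pairs = []
    for ln in lines[1:]:
        fields = ln.split()
        pi_hat = float(fields[ip])
        if pi_hat >= min_pi_hat:
            pairs.append((fields[i1], fields[i2], pi_hat))
    return sorted(pairs, key=lambda p: p[2], reverse=True)

# Made-up rows in the .genome column layout, only to show the call.
sample_text = """
 FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
 F1 X1 F1 X2 UN NA 0.0100 0.9400 0.0500 0.5200 -1 0.85 1.0 10.0
 F1 X1 F2 X3 UN NA 0.9000 0.1000 0.0000 0.0500 -1 0.72 0.5 2.0
"""
related = high_ibd_pairs(sample_text)
```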

There are some Metspalu samples that are likely related to one another. An estimate of about 50% likely indicates a parent-child or sibling-sibling relationship. A 45-46% relatedness is most likely siblings in my opinion. An 18-19% estimate could be a 1st cousin relationship in an endogamous community. It could also just be the background relatedness in a small, bottlenecked and endogamous community.
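The rough reading of IBD percentages above can be written down as a small classifier. This is a minimal Python sketch, assuming the estimate is a proportion between 0 and 1 (plink's PI_HAT). The cut-offs are geometric midpoints between the expected values of 1, 0.5, 0.25 and 0.125, in the same spirit as the inference ranges KING uses for kinship coefficients; they are illustrative assumptions, not the thresholds behind the table, and the function name is hypothetical.

```python
def classify_pi_hat(pi_hat):
    """Map a proportion-IBD estimate (0..1) to a rough relationship class."""
    if not 0.0 <= pi_hat <= 1.0:
        raise ValueError("pi_hat must be between 0 and 1")
    # Expected values: 1.0 duplicate or MZ twin, 0.5 first degree,
    # 0.25 second degree, 0.125 third degree.
    # Assumed cut-offs: geometric midpoints between those expectations.
    if pi_hat > 2 ** -0.5:   # about 0.707
        return "duplicate or MZ twin"
    if pi_hat > 2 ** -1.5:   # about 0.354
        return "first degree"
    if pi_hat > 2 ** -2.5:   # about 0.177
        return "second degree"
    if pi_hat > 2 ** -3.5:   # about 0.088
        return "third degree"
    return "unrelated or background"

labels = [classify_pi_hat(x) for x in (0.99, 0.52, 0.45, 0.19)]
```

With these cut-offs, 52% and 45% both land in the first-degree band, while 19% lands in the second-degree band. In a small endogamous community that band can be reached by more distant relatives, which is why the labels should be read as a rough guide only.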

It looks like about half of the Dusadh in the Metspalu dataset are related.

I am surprised at the close relationship of a Kanjar and a Kol in the dataset, though both are from Uttar Pradesh.

Dataset in Public

Posted by Zack on August 31, 2011 3 comments

I get requests from time to time about sharing my Reference 3 dataset. I use a few datasets which I am not allowed to redistribute, but most of the others are actually public and the main issue is to convert them to plink format and merge them.

I have released code for the conversion already but to make the task even easier I am letting you guys know that I already released a subset of my dataset a long time ago. Razib wrote about it and added the detailed instructions on using that dataset.

So here's the link to the dataset which contains about 30,000 SNPs and almost 4,000 individuals from HapMap, HGDP, SGVP, Behar et al and Xing et al.

Changes to 1000 Genomes South Asians

Posted by Razib Khan on June 27, 2011 21 comments

Looks like there have been some changes to the populations in the 1000 Genomes:

At least we'll be able to answer questions about the origin of the Sinhalese soon enough. I'm a little bummed that the Indian populations in Maharashtra and West Bengal disappeared. Did the Permit Raj strike again?

Harappa Ancestry Project

Genetics and South Asia
