Category Archives: Dataset

Pagani East African Dataset

Pagani et al analyzed Ethiopian genetics in their paper "Ethiopian Genetic Diversity Reveals Linguistic Stratification and Complex Influences on the Ethiopian Gene Pool". Their dataset consisting of Ethiopians and a few other East African populations is available online.

I have analyzed the Pagani dataset with my HarappaWorld admixture calculator and included the results in my regular spreadsheet.

The group (weighted mean) results are also shown in the usual interactive bar chart below. You can click on the component labels to sort by that ancestral component.

Because the East African component as computed in HarappaWorld is maximum among the Maasai and several of the Pagani dataset populations have a higher percentage of that component, we should be a bit careful with interpreting the HarappaWorld results for the Pagani groups. I'll likely include them in my next iteration of the admixture calculator.

Related Reading:

The Somali Doctrine
Six Months in Sudan: A Young Doctor in a War-Torn Village
East African Origin of the Ancient Egyptians
In Ethiopia
Ethiopia 1:2,000,000 Travel Map (International Travel Maps)

Simonson Tibet Dataset

Recently, I discovered that the paper Genetic Evidence for High-Altitude Adaptation in Tibet by Tatum S. Simonson, Yingzhong Yang, Chad D. Huff, Haixia Yun, Ga Qin, David J. Witherspoon, Zhenzhong Bai, Felipe R. Lorenzo, Jinchuan Xing, Lynn B. Jorde, Josef T. Prchal, RiLi Ge has its genotyping data online.

It contains 31 Tibetans from Madou county in Qinghai province. The chip is Affymetrix and there are 868,146 SNPs, which means it has a good overlap with Reich et al and Xing et al and also with my reference 3.

I ran reference 3 K=11 admixture on this dataset. Here are the individual results:

The average is as follows:

S Asian E Asian Siberian
1% 84% 14%

Related Reading:

Tibet, 3rd (Bradt Travel Guide)
Seven Years in Tibet
Superman: The Man of Steel (1991-2003) #1
Tibet: Culture on the Edge
Pocket Ref 4th Edition

Hodoglugil Dataset

Dr. Mahley was nice enough to share his Turkish and Kyrgyz dataset from the paper Turkish Population Structure and Genetic Ancestry Reveal Relatedness among Eurasian Populations by Uğur Hodoğlugil and Robert W. Mahley.

It has:

  • 16 Kyrgyz from Bishkek
  • 20 Turks from Aydin
  • 20 Turks from Istanbul
  • 23 Turks from Kayseri

Here are the group averages for the reference 3 K=11 admixture analysis.

And here are the individual results.

Related Reading:

Reference Guide for Essential Oils Hard Cover 2012
Pocket Ref 4th Edition
Publication Manual of the American Psychological Association, 6th Edition
Reference and Information Services: An Introduction, Third Edition

Relatives in Datasets

Recently, there was a paper Identification of Close Relatives in the HUGO Pan-Asian SNP Database by Xiong Yang, Shuhua Xu, and the HUGO Pan-Asian SNP Consortium.

three individuals involved in MZ pairs were excluded from the whole dataset to construct standardized subset PASNP1716; seventy-six individuals involved in first-degree relationships were excluded from PASNP1716 to construct standardized subset PASNP1640; and 57 individuals involved in second-degree relationships were excluded from PASNP1640 to construct standardized subset PASNP1583. The individuals excluded were summarized in Table S6S7S8.

Let me engage in some blog triumphalism by saying I wrote about the duplicates and relatives in the Pan-Asian dataset in April 2011.

Here are my blog posts about relatedness in datasets:

Early on, I was removing only first degree relatives from the reference datasets. Nowadays, I try to remove all second degree relatives too. I leave the third degree relatives in the data since it's sometimes hard to figure out how real the low IBD values are in Plink. There are a lot of 3rd degree relatives if Plink is to be believed, but I am a little skeptical.

Since Plink's IBD analysis requires homogenous samples, I am now using KING (paper) for the purpose. I am also looking at kcoeff (paper)

Related Reading:

IBD Self-Management: The AGA Guide to Crohn's Disease and Ulcerative Colitis
Weight Watchers New Points Plus Plan The Very Best Asian Recipes Cookbook
Lower Your Blood Pressure in Eight Weeks: A Revolutionary Program for a Longer, Healthier Life
Asian Dumplings: Mastering Gyoza, Spring Rolls, Samosas, and More
Reference Guide for Essential Oils Hard Cover 2012

Metspalu Dataset Update

Dr. Metspalu, who has been very good about sharing data and information, has informed me about a couple of cases of mislabeling in the Metspalu et al dataset.

Our sample labelled D238 and reported as Tharu is in fact a Brahmin sample from Uttar Pradesh.

Following the publication we have identified that sample evo_32 was erroneously labelled as Kanjar before any genetic analyses. We hereby re-label the sample as belonging to Kol population.

Thus, I have updated the Metspalu admixture results and clustering results.

Related Reading:

The Evolution of Human Populations in Arabia: Paleoenvironments, Prehistory and Genetics (Vertebrate Paleobiology and Paleoanthropology)
The Evolution and History of Human Populations in South Asia: Inter-disciplinary Studies in Archaeology, Biological Anthropology, Linguistics and ... Paleobiology and Paleoanthropology)
The New York Times Guide to Essential Knowledge: A Desk Reference for the Curious Mind
Microsoft Excel 2010 Introduction Quick Reference Guide (Cheat Sheet of Instructions, Tips & Shortcuts - Laminated Card)
Egyptian Language: The Mountains of the Moon , Niger-Congo Speakers and the Origin of Egypt

Metspalu et al Data Relatedness

I performed IBD analysis on the Metspalu dataset using plink and found the relatedness of the following samples to be too high.

ID1 Source1 Population1 ID2 Source2 Population2 IBD Estimate
Mawasi1 Metspalu Mawasi Mawasi1 Chaubey Mawasi 100%
VELZ260 Metspalu Velama Velama_184_R2 Reich Velama 99%
VELZ260 Metspalu Velama VELZ265 Metspalu Velama 19%
VELZ265 Metspalu Velama Velama_184_R2 Reich Velama 19%
D254 Metspalu Tharu Tharu_107_R1 Reich Tharu 99%
D260 Metspalu Tharu Tharu_108_R1 Reich Tharu 98%
evo_32 Metspalu Kanjar 321e Metspalu Kol 53%
HA030 Metspalu Dharkar HA039 Metspalu Dharkar 52%
A387 Metspalu Dusadh A388 Metspalu Dusadh 52%
A394 Metspalu Dusadh A395 Metspalu Dusadh 52%
A395 Metspalu Dusadh A393 Metspalu Dusadh 46%
A394 Metspalu Dusadh A393 Metspalu Dusadh 45%
A392 Metspalu Dusadh A393 Metspalu Dusadh 32%
A392 Metspalu Dusadh A395 Metspalu Dusadh 31%
A392 Metspalu Dusadh A394 Metspalu Dusadh 28%
evo_37 Metspalu Kanjar HA023 Metspalu Dharkar 27%
HA039 Metspalu Dharkar HA041 Metspalu Dharkar 24%
HLKP245 Metspalu Hakkipikki Hallaki_137_R2 Reich Hallaki 22%
PULD160 Metspalu Pulliyar PULD162 Metspalu Pulliyar 20%

As you can see, three samples from Reich et al seem to be the same as Metspalu et al. In addition, two Reich samples seem to be related to Metspalu samples.

There are some Metspalu samples who are likely related to one another. A 50% indicates likely a parent-child or sibling-sibling relationship. A 45-46% relatedness is most likely siblings in my opinion. An 18-19% percentage could be a 1st cousin relationship in an endogamous community. it could also just be the background relatedness in a small, bottlenecked and endogamous community.

It looks like about half of the Dusadh in the Metspalu dataset are related.

I am surprised at the close relationship of a Kanjar and a Kol in the dataset, though both are from Uttar Pradesh.

Related Reading:

Reference and Information Services: An Introduction, Third Edition
The IBD Healing Plan and Recipe Book: Using Whole Foods to Relieve Crohn's Disease and Colitis
The Ancient Black Civilizations of Asia
Microsoft Excel 2010 Introduction Quick Reference Guide (Cheat Sheet of Instructions, Tips & Shortcuts - Laminated Card)

Dataset in Public

I get requests from time to time about sharing my Reference 3 dataset. I use a few datasets which I am not allowed to redistribute, but most of the others are actually public and the main issue is to convert them to plink format and merge them.

I have released code for the conversion already but to make the task even easier I am letting you guys know that I already released a subset of my dataset a long time ago. Razib wrote about it and added the detailed instructions on using that dataset.

So here's the link to the dataset which contains about 30,000 SNPs and almost 4,000 individuals from HapMap, HGDP, SGVP, Behar et al and Xing et al.

Related Reading:

Elemental Rising (Paranormal Public Series)
Public Policy: Politics, Analysis, and Alternatives, 4th Edition
Speak Like a Winner: Public Speaking System to Make You Twice the Speaker in Half the Time
Microsoft Excel 2010 Introduction Quick Reference Guide (Cheat Sheet of Instructions, Tips & Shortcuts - Laminated Card)
Windows 8 Quick Reference Guide (Cheat Sheet of Instructions, Tips & Shortcuts - Laminated Guide)

Changes to 1000 Genomes South Asians

Looks like there have been some changes to the populations in the 1000 Genomes:

At least we'll be able to answer questions about the origin of the Sinhalese soon enough. I'm a little bummed that the Indian populations in Maharashtra and West Bengal disappeared. Did the Permit Raj strike again?

Related Reading:

Paleofantasy: What Evolution Really Tells Us about Sex, Diet, and How We Live
Introduction to Physical Anthropology 2011-2012 Edition
Genome Annotation (Chapman & Hall/CRC Mathematical & Computational Biology)
The $1,000 Genome: The Revolution in DNA Sequencing and the New Era of Personalized Medicine
The Creative Destruction of Medicine: How the Digital Revolution Will Create Better Health Care

Behar Bene Israel

As Razib and I were discussing, the four Bnei Menashe Jewish samples from Behar et al didn't look right since Bnei Menashe are from Mizoram in the northeast of India and thus should be expected to have some East Asian admixture.

When I tried to confirm the admixture/PCA results for Bnei Menashe in the Behar et al paper, I didn't find any mention of the group. Instead, the South Asian Jewish group they mentioned was Bene Israel. According to their admixture and PCA results, Bene Israel looked more like Pakistani populations than their Indian host populations. This is consistent with what my admixture runs show.

So I suspected that the four Bene Israel samples mentioned in the Behar et al paper were accidently labeled as Bnei Menashe in the dataset. I sent an email to the authors and they have confirmed that this was the case.

I have corrected all my spreadsheets so you should see Bene Israel instead of Bnei Menashe now. If you spot Bnei Menashe anywhere, please let me know.

PS. Also, it has been confirmed that three Paniya samples were mislabeled when the data was submitted to the GEO database. They are working on fixing it soon.

UPDATE: Mait Metspalu tells me that the database has been updated with the fixed version of the Behar et al dataset.

Related Reading:

Women Writing Culture
The Vulnerable Observer: Anthropology That Breaks Your Heart
It's Not About the Coffee: Lessons on Putting People First from a Life at Starbucks
60 Questions Christians Ask About Jewish Beliefs and Practices

Reference 3 Fixed

I have fixed the problem with Reference 3 but if you notice any strange results, do let me know.

While the Reference 3 admixture results were generally good (and I have some nice surprises on the way I hope), the Reich et al populations had some weird behavior. From one K value to the next, their admixture would swing wildly especially among the minor components.

For example, for Chenchu, the 2nd component after South Asian was Southwest Asian (42%) at K=6, European (45%) at K=7 and American (32%) at K=8. That just didn't make any sense. It was similar for other Reich et al populations, but all the other reference populations seemed pretty stable.

The issue was that when I was creating Reference 3, I had to juggle lists of SNPs to figure out a way to include Reich et al with a large (>100,000) number of SNPs in the dataset since Reich doesn't have as many SNPs in common with the other datasets plus 23andme (v2 and v3) and FTDNA. In that effort where I was doing lots of SNP set intersections and unions I messed up. I used 217,000 SNPs. While these SNPs were present in all the other datasets, Reich et al had only 102,000 SNPs common with that set. Ouch! This was a royal mess as the high missing rate of Reich et al caused weird instability in its admixture results even though the rest of the results were mostly stable.

Now, I have pared down Reference 3 to 118,000 SNPs. These have a low missing rate in all the datasets. So I don't expect the same problems.

I am redoing the admixture runs with this new data and will have some of the results up soon.

Related Reading:

Reference Guide for Essential Oils Hard Cover 2012
Premium Dungeons & Dragons 3.5 Player's Handbook with Errata
Forks Over Knives - The Cookbook: Over 300 Recipes for Plant-Based Eating All Through the Year
Reference and Information Services: An Introduction, Third Edition
Desk Reference to the Diagnostic Criteria from DSM-5(TM)