Category Archives: Dataset

Simonson Tibet Dataset

Recently, I discovered that the paper "Genetic Evidence for High-Altitude Adaptation in Tibet" by Tatum S. Simonson, Yingzhong Yang, Chad D. Huff, Haixia Yun, Ga Qin, David J. Witherspoon, Zhenzhong Bai, Felipe R. Lorenzo, Jinchuan Xing, Lynn B. Jorde, Josef T. Prchal, and RiLi Ge has its genotyping data online.

It contains 31 Tibetans from Madou county in Qinghai province. The chip is Affymetrix with 868,146 SNPs, so it overlaps well with Reich et al, Xing et al, and my Reference 3.

I ran the Reference 3 K=11 admixture on this dataset. Here are the individual results:

The average is as follows:

S Asian   E Asian   Siberian
1%        84%       14%
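
As a rough illustration of how such an average can be computed, here is a minimal Python sketch. It assumes the per-individual proportions are stored one individual per line, whitespace-separated, as in an ADMIXTURE .Q file. The three example rows and the function name are made up for illustration and are not the actual per-individual Tibetan results.

```python
def average_components(q_text, names):
    """Average admixture proportions over individuals.

    q_text: whitespace-separated proportions, one individual per line.
    names:  component labels, one per column.
    """
    rows = [[float(x) for x in line.split()]
            for line in q_text.splitlines() if line.strip()]
    if not rows:
        raise ValueError("no individuals found")
    if any(len(r) != len(names) for r in rows):
        raise ValueError("column count does not match component names")
    n = len(rows)
    return {name: sum(r[i] for r in rows) / n for i, name in enumerate(names)}


# Made-up example rows, NOT the real per-individual results.
example = """\
0.02 0.83 0.15
0.00 0.86 0.14
0.01 0.84 0.15
"""
averages = average_components(example, ["S Asian", "E Asian", "Siberian"])
for name, value in averages.items():
    print(f"{name}: {value:.0%}")
```

Because each value is rounded separately when printed, the displayed percentages need not add up to exactly 100%.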

Hodoglugil Dataset

Dr. Mahley was nice enough to share his Turkish and Kyrgyz dataset from the paper "Turkish Population Structure and Genetic Ancestry Reveal Relatedness among Eurasian Populations" by Uğur Hodoğlugil and Robert W. Mahley.

It has:

  • 16 Kyrgyz from Bishkek
  • 20 Turks from Aydin
  • 20 Turks from Istanbul
  • 23 Turks from Kayseri

Here are the group averages for the Reference 3 K=11 admixture analysis.

And here are the individual results.

Relatives in Datasets

A recent paper, "Identification of Close Relatives in the HUGO Pan-Asian SNP Database" by Xiong Yang, Shuhua Xu, and the HUGO Pan-Asian SNP Consortium, reports:

three individuals involved in MZ pairs were excluded from the whole dataset to construct standardized subset PASNP1716; seventy-six individuals involved in first-degree relationships were excluded from PASNP1716 to construct standardized subset PASNP1640; and 57 individuals involved in second-degree relationships were excluded from PASNP1640 to construct standardized subset PASNP1583. The individuals excluded were summarized in Table S6, S7, S8.

Let me engage in some blog triumphalism by saying I wrote about the duplicates and relatives in the Pan-Asian dataset in April 2011.

Here are my blog posts about relatedness in datasets:

Early on, I was removing only first-degree relatives from the reference datasets. Nowadays, I try to remove all second-degree relatives too. I leave the third-degree relatives in the data, since it's sometimes hard to figure out how real the low IBD values are in Plink. There are a lot of third-degree relatives if Plink is to be believed, but I am a little skeptical.
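
One simple way to choose which samples to drop is a greedy pass: repeatedly remove the individual involved in the most remaining related pairs until no pair is left. The sketch below only illustrates that idea under stated assumptions and is not necessarily the procedure used for the reference datasets. The sample IDs and the function name are hypothetical.

```python
from collections import defaultdict


def samples_to_drop(related_pairs):
    """Greedily pick samples to remove so that no related pair remains.

    related_pairs: iterable of (id1, id2) tuples already filtered to the
    degree of relatedness to be removed (e.g. first and second degree).
    """
    neighbours = defaultdict(set)
    for a, b in related_pairs:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)

    dropped = set()
    while True:
        # Keep only individuals that still have a related partner.
        active = {s: n for s, n in neighbours.items() if n}
        if not active:
            break
        # Most connected first; ties broken by ID for reproducibility.
        worst = max(sorted(active), key=lambda s: len(active[s]))
        dropped.add(worst)
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
    return dropped


# Hypothetical IDs: a trio of mutual relatives plus one separate pair.
pairs = [("S1", "S2"), ("S2", "S3"), ("S1", "S3"), ("S4", "S5")]
dropped = samples_to_drop(pairs)
remaining = [(a, b) for a, b in pairs if a not in dropped and b not in dropped]
print(sorted(dropped), remaining)
```

A greedy pass does not always remove the fewest possible samples, but it is easy to reason about and keeps the result reproducible.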

Since Plink's IBD analysis requires homogeneous samples, I am now using KING (paper) for the purpose. I am also looking at kcoeff (paper).

Metspalu Dataset Update

Dr. Metspalu, who has been very good about sharing data and information, has informed me about a couple of cases of mislabeling in the Metspalu et al dataset.

Our sample labelled D238 and reported as Tharu is in fact a Brahmin sample from Uttar Pradesh.

Following the publication we have identified that sample evo_32 was erroneously labelled as Kanjar before any genetic analyses. We hereby re-label the sample as belonging to Kol population.

Thus, I have updated the Metspalu admixture results and clustering results.

Metspalu et al Data Relatedness

I performed IBD analysis on the Metspalu dataset using Plink and found the relatedness of the following sample pairs to be too high.

ID1      Source1   Population1  ID2             Source2   Population2  IBD Estimate
Mawasi1  Metspalu  Mawasi       Mawasi1         Chaubey   Mawasi       100%
VELZ260  Metspalu  Velama       Velama_184_R2   Reich     Velama       99%
VELZ260  Metspalu  Velama       VELZ265         Metspalu  Velama       19%
VELZ265  Metspalu  Velama       Velama_184_R2   Reich     Velama       19%
D254     Metspalu  Tharu        Tharu_107_R1    Reich     Tharu        99%
D260     Metspalu  Tharu        Tharu_108_R1    Reich     Tharu        98%
evo_32   Metspalu  Kanjar       321e            Metspalu  Kol          53%
HA030    Metspalu  Dharkar      HA039           Metspalu  Dharkar      52%
A387     Metspalu  Dusadh       A388            Metspalu  Dusadh       52%
A394     Metspalu  Dusadh       A395            Metspalu  Dusadh       52%
A395     Metspalu  Dusadh       A393            Metspalu  Dusadh       46%
A394     Metspalu  Dusadh       A393            Metspalu  Dusadh       45%
A392     Metspalu  Dusadh       A393            Metspalu  Dusadh       32%
A392     Metspalu  Dusadh       A395            Metspalu  Dusadh       31%
A392     Metspalu  Dusadh       A394            Metspalu  Dusadh       28%
evo_37   Metspalu  Kanjar       HA023           Metspalu  Dharkar      27%
HA039    Metspalu  Dharkar      HA041           Metspalu  Dharkar      24%
HLKP245  Metspalu  Hakkipikki   Hallaki_137_R2  Reich     Hallaki      22%
PULD160  Metspalu  Pulliyar     PULD162         Metspalu  Pulliyar     20%

As you can see, three samples from Reich et al seem to be the same individuals as samples in Metspalu et al. In addition, two Reich samples seem to be related to Metspalu samples.
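
A table like the one above can be pulled out of Plink's pairwise IBD output. The sketch below assumes a whitespace-delimited `.genome` file as written by `plink --genome`, with its `IID1`, `IID2` and `PI_HAT` columns. The 0.18 cutoff, the function name and the two example rows are illustrative assumptions, not the settings or values behind the table.

```python
import io


def high_ibd_pairs(genome_file, min_pi_hat=0.18):
    """Return (IID1, IID2, PI_HAT) for pairs at or above min_pi_hat.

    genome_file: open text handle on a Plink .genome file (header line
    followed by whitespace-delimited rows).
    """
    header = genome_file.readline().split()
    i1, i2, ip = (header.index(c) for c in ("IID1", "IID2", "PI_HAT"))
    pairs = []
    for line in genome_file:
        fields = line.split()
        if not fields:
            continue
        pi_hat = float(fields[ip])
        if pi_hat >= min_pi_hat:
            pairs.append((fields[i1], fields[i2], pi_hat))
    # Highest estimates first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Tiny made-up file with hypothetical sample IDs and values.
demo = io.StringIO(
    " FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO\n"
    " popA a1 popA a2 UN NA 0.0000 1.0000 0.0000 0.5000 -1 0.85 1.0 NA\n"
    " popA a1 popB b1 UN NA 0.9500 0.0500 0.0000 0.0250 -1 0.72 0.5 2.0\n"
)
related = high_ibd_pairs(demo)
print(related)
```

Looking columns up by header name rather than by position keeps the sketch working if the column order differs between Plink versions.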

Some Metspalu samples are likely related to one another. An estimate of about 50% likely indicates a parent-child or sibling relationship. An estimate of 45-46% is most likely siblings, in my opinion. An estimate of 18-19% could be a first-cousin relationship in an endogamous community. It could also just be the background relatedness in a small, bottlenecked, endogamous community.
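
To make those rules of thumb concrete, here is a small sketch that bins an IBD estimate by the expected sharing for each degree of relationship (100% for duplicates or identical twins, 50% for first degree, 25% for second degree, 12.5% for third degree). The cutoffs are assumptions placed roughly midway between those expected values. A value near a cutoff, such as 18-19%, is genuinely ambiguous, and background relatedness in an endogamous group can push estimates up, so the labels are hints rather than conclusions.

```python
def relationship_guess(ibd):
    """Map an IBD proportion (0..1) to a coarse relationship label.

    Cutoffs are illustrative assumptions, not calibrated thresholds.
    """
    if not 0.0 <= ibd <= 1.0:
        raise ValueError("ibd must be between 0 and 1")
    if ibd >= 0.90:
        return "duplicate or identical twin"
    if ibd >= 0.375:
        return "first degree (parent-child or full sibling)"
    if ibd >= 0.1875:
        return "second degree"
    if ibd >= 0.09:
        return "third degree or background relatedness"
    return "unrelated or distant"


for value in (0.99, 0.52, 0.27, 0.10):
    print(f"{value:.0%}: {relationship_guess(value)}")
```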

It looks like about half of the Dusadh in the Metspalu dataset are related.

I am surprised at the close relationship of a Kanjar and a Kol in the dataset, though both are from Uttar Pradesh.

Dataset in Public

I get requests from time to time about sharing my Reference 3 dataset. I use a few datasets which I am not allowed to redistribute, but most of the others are actually public and the main issue is to convert them to plink format and merge them.

I have already released code for the conversion, but to make the task even easier, note that I released a subset of my dataset a long time ago. Razib wrote about it and added detailed instructions on using that dataset.

So here's the link to the dataset which contains about 30,000 SNPs and almost 4,000 individuals from HapMap, HGDP, SGVP, Behar et al and Xing et al.

Changes to 1000 Genomes South Asians

Looks like there have been some changes to the populations in the 1000 Genomes:

At least we'll be able to answer questions about the origin of the Sinhalese soon enough. I'm a little bummed that the Indian populations in Maharashtra and West Bengal disappeared. Did the Permit Raj strike again?

Behar Bene Israel

As Razib and I were discussing, the four Bnei Menashe Jewish samples from Behar et al didn't look right since Bnei Menashe are from Mizoram in the northeast of India and thus should be expected to have some East Asian admixture.

When I tried to confirm the admixture/PCA results for Bnei Menashe in the Behar et al paper, I didn't find any mention of the group. Instead, the South Asian Jewish group they mentioned was Bene Israel. According to their admixture and PCA results, Bene Israel looked more like Pakistani populations than their Indian host populations. This is consistent with what my admixture runs show.

So I suspected that the four Bene Israel samples mentioned in the Behar et al paper were accidentally labeled as Bnei Menashe in the dataset. I sent an email to the authors and they have confirmed that this was the case.

I have corrected all my spreadsheets so you should see Bene Israel instead of Bnei Menashe now. If you spot Bnei Menashe anywhere, please let me know.

PS. Also, it has been confirmed that three Paniya samples were mislabeled when the data was submitted to the GEO database. They are working on fixing it soon.

UPDATE: Mait Metspalu tells me that the database has been updated with the fixed version of the Behar et al dataset.

Reference 3 Fixed

I have fixed the problem with Reference 3 but if you notice any strange results, do let me know.

While the Reference 3 admixture results were generally good (and I have some nice surprises on the way, I hope), the Reich et al populations had some weird behavior. From one K value to the next, their admixture would swing wildly, especially among the minor components.

For example, for Chenchu, the 2nd component after South Asian was Southwest Asian (42%) at K=6, European (45%) at K=7 and American (32%) at K=8. That just didn't make any sense. It was similar for other Reich et al populations, but all the other reference populations seemed pretty stable.

The issue was that when I was creating Reference 3, I had to juggle lists of SNPs to figure out a way to include Reich et al with a large (>100,000) number of SNPs in the dataset, since Reich doesn't have as many SNPs in common with the other datasets plus 23andMe (v2 and v3) and FTDNA. In that effort, where I was doing lots of SNP set intersections and unions, I messed up. I used 217,000 SNPs. While these SNPs were present in all the other datasets, Reich et al had only 102,000 SNPs in common with that set. Ouch! This was a royal mess, as the high missing rate of Reich et al caused weird instability in its admixture results even though the rest of the results were mostly stable.

Now, I have pared down Reference 3 to 118,000 SNPs. These have a low missing rate in all the datasets. So I don't expect the same problems.
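
A check along these lines can catch this kind of problem early: for each dataset, count how many SNPs of the chosen set it actually genotypes, and flag any dataset that falls below a minimum. In the mistake described above, Reich et al covered only about 102,000 of 217,000 SNPs, roughly 47%, so over half of its genotypes were missing. The sketch below uses hypothetical dataset names, toy SNP IDs and an assumed 95% minimum.

```python
def snp_coverage(target_snps, dataset_snps):
    """Fraction of the target SNP set present in each dataset.

    target_snps:  iterable of SNP IDs chosen for the merged reference.
    dataset_snps: dict mapping dataset name -> iterable of its SNP IDs.
    """
    target = set(target_snps)
    if not target:
        raise ValueError("target SNP set is empty")
    return {name: len(target & set(snps)) / len(target)
            for name, snps in dataset_snps.items()}


def low_coverage(coverage, minimum=0.95):
    """Names of datasets whose coverage is below the assumed minimum."""
    return sorted(name for name, frac in coverage.items() if frac < minimum)


# Toy example with made-up SNP IDs and dataset names.
target = ["rs1", "rs2", "rs3", "rs4"]
datasets = {
    "dataset_a": ["rs1", "rs2", "rs3", "rs4", "rs9"],
    "dataset_b": ["rs1", "rs3"],
}
coverage = snp_coverage(target, datasets)
print(coverage, low_coverage(coverage))

# The numbers from the text: 102,000 of 217,000 SNPs present.
print(f"{102_000 / 217_000:.0%} covered")
```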

I am redoing the admixture runs with this new data and will have some of the results up soon.

Behar Paniya

Behar as in the Behar et al paper/dataset and not the Indian state of Bihar. The Behar dataset contains 4 samples of Paniya, which apparently is a Dravidian language of some Scheduled Tribes in Kerala.

I had always been suspicious of those four samples since one of them had admixture proportions similar to other South Indians but the other three were like Southeast Asians.

When I got the Austroasiatic dataset, I found out that they had the four Paniyas from Behar et al in their data. However, only one of those four was the same as Behar. The other three were different. So I now had 7 Paniya samples.
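
Whether two entries are the same sample can be checked directly by comparing genotype calls at shared SNPs: duplicates agree at nearly every site, while different individuals do not. Here is a minimal sketch, assuming genotypes are coded as 0/1/2 allele counts with None for missing calls. The data and the function name are hypothetical.

```python
def genotype_concordance(g1, g2):
    """Share of SNPs with identical calls, ignoring missing genotypes.

    g1, g2: equal-length sequences of 0/1/2 allele counts, None if missing.
    Returns None when no SNP is called in both samples.
    """
    if len(g1) != len(g2):
        raise ValueError("genotype vectors must cover the same SNPs")
    compared = [(a, b) for a, b in zip(g1, g2)
                if a is not None and b is not None]
    if not compared:
        return None
    return sum(a == b for a, b in compared) / len(compared)


# Made-up genotypes for three hypothetical samples.
sample_x = [0, 1, 2, 1, None, 0, 2, 1]
sample_y = [0, 1, 2, 1, 2, 0, 2, 1]     # same person, one call missing in x
sample_z = [1, 1, 0, 2, 2, 0, 1, 1]     # a different person
print(genotype_concordance(sample_x, sample_y))
print(genotype_concordance(sample_x, sample_z))
```

With real chip data the comparison would run over many thousands of SNPs, and genotyping error means true duplicates land just under 100% rather than exactly on it.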

Let's look at the K=12 admixture results for these Paniyas.

Behar's GSM536916 was the one which was the same as Austroasiatic's D36 and it has regular South Indian results. The other three Behar Paniyas are very Southeast Asian (yellow in the plot) while the three Paniyas from Austroasiatic data are similar to GSM536916/D36.

Since the Austroasiatic Paniya samples originated from Behar et al, I guess the Paniyas got mislabeled at some point before the Behar data was submitted to the GEO database.

I am now excluding the four Paniyas from Behar et al dataset and only using the Paniya samples from Austroasiatic dataset.
