Tag Archives: duplicate

Reich et al Duplicates

As part of my effort to create one big reference dataset for my own use, I have been going over all the datasets I have to make sure there are no duplicates, relatives, or other oddities that could cause problems with my analysis.

So I went back to the Reich et al Indian dataset.

The dataset itself doesn't contain any duplicates or likely relatives. However, two of its Kharia samples are identical to samples in the Austroasiatic dataset. Since the Austroasiatic dataset has more SNPs in common with 23andMe, I removed these two samples from Reich et al.

The IBS/IBD analysis and the sample IDs are in a spreadsheet as usual.

Pan-Asian Dataset Duplicates and Relatives

As part of my effort to create one big reference dataset for my own use, I have been going over all the datasets I have to make sure there are no duplicates, relatives, or other oddities that could cause problems with my analysis.

Looking at the Pan-Asian dataset, I found 3 pairs of duplicate samples and 82 pairs that could be closely related. I have removed 64 samples from the dataset.

You can see the IBD results from plink as well as the list of sample IDs I removed in a spreadsheet.
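
When many related pairs overlap, one sample can account for several pairs, so the number of removals can be smaller than the number of pairs. The sketch below shows one possible way to pick samples to drop: a greedy rule that repeatedly removes the sample involved in the most remaining pairs. This is an assumption for illustration, not necessarily how the removal list in the spreadsheet was chosen, and the sample IDs in the usage note are made up.

```python
from collections import defaultdict


def samples_to_remove(pairs):
    """Greedily choose samples so that no flagged pair stays intact.

    `pairs` is an iterable of (sample_a, sample_b) tuples, for example the
    pairs flagged as duplicates or close relatives. Hypothetical helper.
    """
    neighbours = defaultdict(set)
    for a, b in pairs:
        if a == b:
            continue  # ignore malformed self-pairs
        neighbours[a].add(b)
        neighbours[b].add(a)

    removed = []
    while True:
        # Samples that still take part in at least one unresolved pair.
        live = {s: n for s, n in neighbours.items() if n}
        if not live:
            break
        # Pick the sample in the most remaining pairs; ties go to the
        # alphabetically first ID so the result is deterministic.
        worst = max(sorted(live), key=lambda s: len(live[s]))
        for other in neighbours[worst]:
            neighbours[other].discard(worst)
        neighbours[worst] = set()
        removed.append(worst)
    return removed
```

For example, `samples_to_remove([("A", "B"), ("A", "C"), ("D", "E")])` drops `A` (which resolves two pairs at once) and then `D`. A greedy choice is not guaranteed to be the smallest possible removal set, and in practice you might prefer to keep the sample with the better call rate.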

UPDATE: I found 4 Melanesians in the Pan-Asian dataset who are the same as samples in HGDP. I have removed those as well and added them to the list in the spreadsheet.

Austroasiatic Dataset Duplicates

As part of my effort to create one big reference dataset for my own use, I have been going over all the datasets I have to make sure there are no duplicates, relatives, or other oddities that could cause problems with my analysis.

So I went back to the Chaubey et al Austroasiatic Indians dataset.

The dataset itself doesn't contain any duplicates or likely relatives. Of course, I had to remove the 632 HGDP samples it includes, but that's easy to do since they keep the same IDs (starting with HGDP).
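
Removing samples by ID prefix can be sketched as building a list of family ID / individual ID pairs, which is the format PLINK's `--remove` option reads. This is a minimal sketch: the function names are hypothetical, and it assumes the IDs come from the first two whitespace-separated columns of a PED or FAM file.

```python
def remove_list_by_prefix(fam_lines, prefixes=("HGDP",)):
    """Return (FID, IID) pairs whose individual ID starts with a prefix.

    `fam_lines` is any iterable of PED/FAM lines; only the first two
    columns (family ID, individual ID) are inspected.
    """
    selected = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        fid, iid = fields[0], fields[1]
        if iid.startswith(tuple(prefixes)):
            selected.append((fid, iid))
    return selected


def format_remove_file(selected):
    """Render the pairs as text, one 'FID IID' per line."""
    return "".join(f"{fid} {iid}\n" for fid, iid in selected)
```

The text from `format_remove_file` can be saved to a file and passed to PLINK with `--remove`. The same helper works for HapMap samples by passing `prefixes=("NA",)`.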

As their paper mentions, the dataset also has 19 Dravidian-speaking Indian samples from Behar et al. Since I got the Behar et al data from the GEO site, I had different IDs for them than the ones used in this dataset. So I had to figure out which samples were the same in both. The IBS/IBD results for the duplicates, as well as the list of sample IDs I removed, are given in a spreadsheet.
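
Matching samples that carry different IDs in two datasets comes down to comparing genotypes directly. The sketch below computes a simple identity-by-state (IBS) similarity between two samples and reports cross-dataset pairs above a cutoff. It assumes both datasets have already been reduced to the same SNPs with the same allele coding (0, 1 or 2 copies of a reference allele, `None` for missing). The 0.99 cutoff and the function names are illustrative assumptions, and in practice PLINK does this comparison far more efficiently.

```python
def ibs_similarity(g1, g2):
    """Average proportion of alleles shared IBS over non-missing SNPs."""
    shared = 0
    counted = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue  # skip SNPs missing in either sample
        shared += 2 - abs(a - b)  # 2, 1 or 0 alleles shared at this SNP
        counted += 1
    return shared / (2 * counted) if counted else float("nan")


def match_duplicates(set_a, set_b, threshold=0.99):
    """Pair up samples from two datasets whose IBS similarity is very high.

    `set_a` and `set_b` map sample ID -> list of genotypes.
    """
    matches = []
    for id_a, geno_a in set_a.items():
        for id_b, geno_b in set_b.items():
            sim = ibs_similarity(geno_a, geno_b)
            if sim >= threshold:
                matches.append((id_a, id_b, sim))
    return matches
```

A true duplicate should come out at or very near 1.0, with genotyping error accounting for any small shortfall, so the exact cutoff matters little as long as it sits well above the similarity seen between unrelated samples.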

Checking this out resolved an issue I had with Behar et al. Behar et al has 4 Paniya samples from South India. One of those four has admixture proportions similar to other Indians, but three seem very East Asian. I had always suspected that those three samples were mislabeled. The Austroasiatic dataset is supposed to contain the same four Paniya samples. However, only one of them is identical to a Behar et al sample. The other three are different. I haven't checked yet which of the Behar samples matches the Austroasiatic one, but my guess is that it is the one with the more Indian admixture. So I am keeping the other three Paniya samples from the Austroasiatic dataset and hoping that they are the correct ones.

Henn Duplicates

As part of my effort to create one big reference dataset for my own use, I have been going over all the datasets I have to make sure there are no duplicates, relatives, or other oddities that could cause problems with my analysis.

So I went back to the Henn et al dataset, which you can download from their website.

It has 107 samples in common with HapMap (IDs start with NA) and 131 in common with HGDP (IDs start with HGDP).

Henn et al has two PED files: one for the Khoisan data and one for the all-Africa 55k SNP set. Unfortunately, 31 San samples are duplicated across these two PED files with the same individual IDs but different family IDs (SAN and SAN_SA), so Plink does not merge them automatically. Just remove all the ones with the SAN_SA FID since they have fewer SNPs. All the IBD info is in this spreadsheet.
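
Because a sample is identified by its family ID and individual ID together, the same individual under two family IDs looks like two different people. A quick way to spot such cases is to compare the first two columns of both PED files. This is a minimal sketch with hypothetical function names, and the sample lines in the usage note are invented.

```python
def read_ids(ped_lines):
    """Map individual ID -> family ID using the first two PED columns."""
    ids = {}
    for line in ped_lines:
        # Split off only the first two fields; PED lines can be very long.
        fields = line.split(None, 2)
        if len(fields) >= 2:
            ids[fields[1]] = fields[0]
    return ids


def cross_file_duplicates(ped_a, ped_b):
    """List individuals present in both files under different family IDs.

    Returns sorted (IID, FID in file A, FID in file B) tuples.
    """
    ids_a = read_ids(ped_a)
    ids_b = read_ids(ped_b)
    shared = ids_a.keys() & ids_b.keys()
    return sorted(
        (iid, ids_a[iid], ids_b[iid])
        for iid in shared
        if ids_a[iid] != ids_b[iid]
    )
```

Given one file containing a line starting `SAN SA01 ...` and another containing `SAN_SA SA01 ...`, the function returns `[("SA01", "SAN", "SAN_SA")]`. Matching individual IDs only suggest a duplicate; the IBD results are what confirm it.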

Xing Redo

As part of my effort to create one big reference dataset for my own use, I have been going over all the datasets I have to make sure there are no duplicates, relatives, or other oddities that could cause problems with my analysis.

So I went back to the Xing et al dataset, which you can download from their website.

I found no duplicates within the Xing et al data, but it has 259 samples in common with HapMap. Since they are not assigned any family IDs, they will pass through the PED files without being merged into the HapMap samples. So you need to remove any samples with IDs starting with "NA".

Xing et al also contains 6 duplicates of HGDP samples under completely different IDs, and two Xing samples look to be related to HGDP samples.

There are also three pairs with very high identity-by-descent values, which I calculated using Plink. You can see the samples with PI_HAT greater than 0.5 in this spreadsheet. PI_HAT is the proportion IBD estimated by Plink. Notice that all these pairs also have high IBS similarity (the DSC column), more than 85% similar.
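
The PI_HAT filter can be sketched as a small parser for the pairwise table that Plink writes when run with `--genome`. This is a minimal sketch: the function name is hypothetical, it assumes the standard whitespace-separated `.genome` layout with a header row, and it reads the IBS column under Plink's own header name `DST`.

```python
def high_ibd_pairs(genome_lines, min_pi_hat=0.5):
    """Return (IID1, IID2, PI_HAT, DST) for pairs above the PI_HAT cutoff.

    `genome_lines` is an iterable of lines from a .genome file,
    header first.
    """
    lines = iter(genome_lines)
    header = next(lines).split()
    col = {name: i for i, name in enumerate(header)}  # column name -> index

    pairs = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        pi_hat = float(fields[col["PI_HAT"]])
        if pi_hat > min_pi_hat:
            pairs.append(
                (
                    fields[col["IID1"]],
                    fields[col["IID2"]],
                    pi_hat,
                    float(fields[col["DST"]]),
                )
            )
    return pairs
```

Typical use would be `high_ibd_pairs(open("plink.genome"))`, where `plink.genome` is the default output name. Duplicates show PI_HAT close to 1, while parent-child and full-sibling pairs sit around 0.5, so a 0.5 cutoff mainly keeps duplicates and the closer first-degree pairs.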

The samples I removed as a result (other than the HapMap ones) are listed in this spreadsheet.

Behar Redo

As part of my effort to create one big reference dataset for my own use, I have been going over all the datasets I have to make sure there are no duplicates, relatives, or other oddities that could cause problems with my analysis.

So I went back to the Behar et al dataset, which you can download from the GEO Accession website.

I found three sets of duplicates and two pairs with very high identity-by-descent values, which I calculated using Plink. You can see the samples with PI_HAT greater than 0.5 in this spreadsheet. PI_HAT is the proportion IBD estimated by Plink. Notice that all these pairs also have high IBS similarity (the DSC column), more than 83% similar.

The five samples I removed as a result are listed in this spreadsheet.

HapMap Redo

As part of my effort to create one big reference dataset for my own use, I have been going over all the datasets I have to make sure there are no duplicates, relatives, or other oddities that could cause problems with my analysis.

So I went back to HapMap, which you can download from their website. I am using HapMap 3 public release #3 from May 28, 2010.

I found one set of duplicates: NA21344 is identical to NA21737. I also found a whole bunch of pairs with high identity-by-descent values, which I calculated using Plink. You can see the samples with PI_HAT greater than 0.5 in this spreadsheet. PI_HAT is the proportion IBD estimated by Plink. Notice that all these pairs also have high IBS similarity (the DSC column), more than 85% similar in fact.

All 41 samples I removed as a result are listed in this spreadsheet.
