Behar Redo

As part of my effort to create one big reference dataset for my own use, I have been going over all the datasets I have to make sure there are no duplicates, relatives, or other oddities that could cause issues with my analysis.

So I went back to the Behar et al. dataset, which you can download from the GEO Accession website.

I found three sets of duplicates and two pairs with very high identity-by-descent (IBD) values, which I calculated using Plink. You can see the sample pairs with PI_HAT greater than 0.5 in this spreadsheet. PI_HAT is the proportion of IBD estimated by Plink. Notice that all of these pairs also have high identity-by-state (IBS) similarity (the DST column): more than 83% similar.
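For anyone who wants to do the same kind of filtering without a spreadsheet, here is a minimal sketch in Python. It assumes the pairwise values come from a Plink `.genome` file (the whitespace-delimited output of `plink --genome`), which has a header row with the columns IID1, IID2, PI_HAT and DST among others. The function name `related_pairs` and the sample rows are made up for illustration; they are not from the Behar et al data.

```python
# Sketch: list sample pairs whose PI_HAT exceeds a threshold.
# Assumes input lines in the layout of a Plink .genome file
# (header row first, whitespace-delimited columns).

def related_pairs(lines, threshold=0.5):
    """Return (IID1, IID2, PI_HAT, DST) tuples for pairs above the threshold."""
    rows = iter(lines)
    header = next(rows).split()
    # Look columns up by name so the code does not depend on column order.
    col = {name: i for i, name in enumerate(header)}
    pairs = []
    for line in rows:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        pi_hat = float(fields[col["PI_HAT"]])
        if pi_hat > threshold:
            dst = float(fields[col["DST"]])
            pairs.append((fields[col["IID1"]], fields[col["IID2"]], pi_hat, dst))
    return pairs


# Hypothetical rows for illustration only (not real samples).
SAMPLE = """\
 FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
 F1 S1 F2 S2 UN NA 0.0000 0.0000 1.0000 1.0000 -1 1.000000 1.0000 NA
 F1 S1 F3 S3 UN NA 0.9500 0.0500 0.0000 0.0250 -1 0.720000 0.5000 2.0000
 F4 S4 F5 S5 UN NA 0.2000 0.5000 0.3000 0.5500 -1 0.850000 1.0000 NA
""".splitlines()

for iid1, iid2, pi_hat, dst in related_pairs(SAMPLE):
    print(iid1, iid2, pi_hat, dst)
```

To run it on a real file, pass an open file handle instead of `SAMPLE`, since the function only needs an iterable of lines. The 0.5 cutoff matches the one I used above; a lower value would also catch more distant relatives.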

The five samples I have removed as a result of this are listed in this spreadsheet.
