Tag Archives: ibd

Relatives in Datasets

Posted by Zack on February 6, 2012 Comments Off

Recently, there was a paper, "Identification of Close Relatives in the HUGO Pan-Asian SNP Database", by Xiong Yang, Shuhua Xu, and the HUGO Pan-Asian SNP Consortium.

three individuals involved in MZ pairs were excluded from the whole dataset to construct standardized subset PASNP1716; seventy-six individuals involved in first-degree relationships were excluded from PASNP1716 to construct standardized subset PASNP1640; and 57 individuals involved in second-degree relationships were excluded from PASNP1640 to construct standardized subset PASNP1583. The individuals excluded were summarized in Table S6, S7, S8.

Let me engage in some blog triumphalism by saying I wrote about the duplicates and relatives in the Pan-Asian dataset in April 2011.

Here are my blog posts about relatedness in datasets:

Early on, I was removing only first degree relatives from the reference datasets. Nowadays, I try to remove all second degree relatives too. I leave the third degree relatives in the data since it's sometimes hard to figure out how real the low IBD values are in Plink. There are a lot of 3rd degree relatives if Plink is to be believed, but I am a little skeptical.

Since Plink's IBD analysis requires homogeneous samples, I am now using KING (paper) for the purpose. I am also looking at kcoeff (paper).
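The degrees of relatedness discussed above can be turned into rough PI_HAT bands. The following is a minimal sketch, not the procedure used in these posts: the cutoffs are an assumption, taken from KING's published kinship ranges and doubled to put them on Plink's PI_HAT scale, and the function name `classify_pi_hat` is hypothetical.

```python
# Assumed cutoffs: KING's kinship ranges (0.354, 0.177, 0.0884) doubled to the
# PI_HAT scale. They are illustrative, not the thresholds used in these posts.
CUTOFFS = [
    (0.708, "duplicate or MZ twin"),
    (0.354, "first degree"),
    (0.177, "second degree"),
    (0.0884, "third degree"),
]


def classify_pi_hat(pi_hat: float) -> str:
    """Return a rough relationship label for a PI_HAT value in [0, 1]."""
    for lower_bound, label in CUTOFFS:
        if pi_hat > lower_bound:
            return label
    return "unrelated or distant"


if __name__ == "__main__":
    for value in (1.00, 0.50, 0.25, 0.125, 0.02):
        print(f"{value:.3f} -> {classify_pi_hat(value)}")
```

Fixed bands like these are only a starting point. In small endogamous populations, background relatedness inflates the estimates, which is one reason to treat the low values with skepticism.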

Metspalu et al Data Relatedness

Posted by Zack on December 11, 2011 9 comments

I performed IBD analysis on the Metspalu dataset using plink and found the relatedness of the following samples to be too high.

ID1	Source1	Population1	ID2	Source2	Population2	IBD Estimate
Mawasi1	Metspalu	Mawasi	Mawasi1	Chaubey	Mawasi	100%
VELZ260	Metspalu	Velama	Velama_184_R2	Reich	Velama	99%
VELZ260	Metspalu	Velama	VELZ265	Metspalu	Velama	19%
VELZ265	Metspalu	Velama	Velama_184_R2	Reich	Velama	19%
D254	Metspalu	Tharu	Tharu_107_R1	Reich	Tharu	99%
D260	Metspalu	Tharu	Tharu_108_R1	Reich	Tharu	98%
evo_32	Metspalu	Kanjar	321e	Metspalu	Kol	53%
HA030	Metspalu	Dharkar	HA039	Metspalu	Dharkar	52%
A387	Metspalu	Dusadh	A388	Metspalu	Dusadh	52%
A394	Metspalu	Dusadh	A395	Metspalu	Dusadh	52%
A395	Metspalu	Dusadh	A393	Metspalu	Dusadh	46%
A394	Metspalu	Dusadh	A393	Metspalu	Dusadh	45%
A392	Metspalu	Dusadh	A393	Metspalu	Dusadh	32%
A392	Metspalu	Dusadh	A395	Metspalu	Dusadh	31%
A392	Metspalu	Dusadh	A394	Metspalu	Dusadh	28%
evo_37	Metspalu	Kanjar	HA023	Metspalu	Dharkar	27%
HA039	Metspalu	Dharkar	HA041	Metspalu	Dharkar	24%
HLKP245	Metspalu	Hakkipikki	Hallaki_137_R2	Reich	Hallaki	22%
PULD160	Metspalu	Pulliyar	PULD162	Metspalu	Pulliyar	20%

As you can see, three samples from Reich et al seem to be the same as Metspalu et al. In addition, two Reich samples seem to be related to Metspalu samples.

There are some Metspalu samples who are likely related to one another. A value around 50% likely indicates a parent-child or sibling-sibling relationship. A 45-46% relatedness is most likely siblings in my opinion. An 18-19% value could be a 1st cousin relationship in an endogamous community. It could also just be the background relatedness in a small, bottlenecked and endogamous community.

It looks like about half of the Dusadh in the Metspalu dataset are related.
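Once related pairs are flagged, one has to decide which samples to drop. A minimal sketch of one possible approach, assuming a greedy rule (repeatedly drop the sample with the most remaining related partners); this is not necessarily how the removals in these posts were chosen, and `samples_to_remove` is a hypothetical name. The pairs are the Dusadh rows from the table above.

```python
from collections import defaultdict

# Dusadh pairs copied from the table above.
DUSADH_PAIRS = [
    ("A387", "A388"), ("A394", "A395"), ("A395", "A393"), ("A394", "A393"),
    ("A392", "A393"), ("A392", "A395"), ("A392", "A394"),
]


def samples_to_remove(pairs):
    """Greedily pick samples to drop until no flagged pair remains.

    pairs: iterable of (sample_a, sample_b) tuples flagged as related.
    Returns the dropped sample IDs in the order they were chosen.
    """
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    removed = []
    while any(neighbours.values()):
        # Most remaining partners first; ties are broken by sample ID.
        worst = min(neighbours, key=lambda s: (-len(neighbours[s]), s))
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        removed.append(worst)
    return removed


if __name__ == "__main__":
    print(samples_to_remove(DUSADH_PAIRS))
```

The greedy rule tends to keep more samples than dropping one member of every pair, because a sample that appears in several pairs is removed once and clears all of them.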

I am surprised at the close relationship of a Kanjar and a Kol in the dataset, though both are from Uttar Pradesh.

Reich et al Duplicates

Posted by Zack on April 12, 2011 Comments Off

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have and making sure there are no duplicates, relatives, or any other strange things that could cause issues with my analysis.

So I went back to the Reich et al Indian dataset.

The dataset doesn't have any duplicate or likely relative samples itself. However, there are two Kharia samples that are the same as samples in the Austroasiatic dataset. Since the Austroasiatic dataset has more SNPs in common with 23andme, I removed these two samples from Reich et al.

The IBD/IBS analysis and the sample IDs are in a spreadsheet as usual.

1000genomes

Posted by Zack on April 10, 2011 Comments Off

I got the 1000genomes data a couple of weeks ago. Trying to convert it from VCF to PED format using vcftools was a complete disaster. Then Dienekes sent me a conversion script which was more than a hundred times faster.

1000genomes will have 100 Assamese Ahom, 100 Kayastha from Calcutta, 100 Reddys from Hyderabad, 100 Maratha from Bombay and 100 Lahori Punjabis later this year. Right now, the new populations (other than HapMap) are British, Finns, Han Chinese South, Puerto Ricans, Colombians, and Spaniards.

I removed all the 660 samples which were common with the HapMap data. Also, there were 31 pairs with high IBD values. The list of IBD/IBS values and the samples I removed can be seen in the spreadsheet.

Latino Dataset

Posted by Zack on April 9, 2011 1 comment

Razib mentioned a Latino/Hispanic dataset to me a few days ago.

The relevant paper is "Genome-wide patterns of population structure and admixture among Hispanic/Latino populations" by Katarzyna Bryc, Christopher Velez, Tatiana Karafet, Andres Moreno-Estrada, Andy Reynolds, Adam Auton, Michael Hammer, Carlos D. Bustamante, and Harry Ostrer. And the data is available on the GEO Accession viewer.

The dataset has 100 samples from Colombia, Dominican Republic, Ecuador, and Puerto Rico.

It's in the same format and uses the same chip as Behar et al and Rasmussen et al. So it was really easy to download and convert it to Plink PED format.

Now what does a Hispanic dataset have to do with a South Asian genetics project? Nothing, for now. But I am collecting all genotyping data. I am also hoping that we get more participants of South Asian origin from the Caribbean and other countries of the region where there has been a longer presence of South Asians. In that case, it would be interesting to compare them against other populations of the Americas.

In keeping with my effort to clean the data of any relatives, here are the IBD/IBS analysis results. The 2nd sheet shows the two samples I removed.

Pan-Asian Dataset Duplicates and Relatives

Posted by Zack on April 9, 2011 2 comments

Looking at the Pan-Asian dataset, I found 3 pairs of duplicate samples and 82 pairs that could be closely related. I have removed 64 samples from the dataset.

You can see the IBD results from plink as well as the list of sample IDs I removed in a spreadsheet.

UPDATE: I found 4 Melanesians in the Pan-Asian dataset who were the same as those in HGDP. So I have removed those as well and added them in the list in the spreadsheet.

Austroasiatic Dataset Duplicates

Posted by Zack on April 8, 2011 Comments Off

So I went back to the Chaubey et al Austroasiatic Indians dataset.

The dataset doesn't have any duplicate or likely relative samples itself. Of course, I had to remove the 632 HGDP samples it had, but that's easy to do since they have the same IDs (starting with HGDP).

As their paper mentions, the dataset also has 19 Dravidian speaking Indian samples from Behar et al. Since I got Behar et al data from the GEO site, I had different IDs for them than what they use in this dataset. So I had to figure out which samples were the same in both. The IBS/IBD results of duplicates as well as the list of sample IDs I removed are given in a spreadsheet.

Checking this out resolved an issue I had with Behar et al. Behar et al has 4 Paniya samples from South India. One of those four has admixture proportions similar to Indians but three seem very East Asian. I had always suspected that those three samples were mislabeled. Now the Austroasiatic dataset also has those four Paniya samples. However, only one of them is identical to the Behar et al one. The other three are different. I haven't checked yet which one of the Behar samples matches Austroasiatic, but my guess is that it is the more Indian admixture one. So I am keeping the other three Paniya samples from the Austroasiatic dataset and hoping that they are the correct ones.

Rasmussen Likely Relatives

Posted by Zack on April 7, 2011 Comments Off

So I went back to the Rasmussen et al dataset, which you can download from here.

While there are no duplicates, 9 pairs of samples have high IBS values (85% similar or more) and seem to be related (Plink PI_HAT > 0.5). You can see the IBD results in a spreadsheet, along with the 8 samples I removed.

Henn Duplicates

Posted by Zack on April 6, 2011 2 comments

So I went back to the Henn et al dataset, which you can download from their website.

There are 107 samples common from the HapMap (IDs start with NA) and 131 from HGDP (IDs start with HGDP).

Henn et al has two PED files: one for the Khoisan data and one for the all-Africa 55k SNP set. Unfortunately they have 31 San duplicated in both these PED files with the same individual IDs but different family IDs (SAN and SAN_SA). So they do not get automatically merged per Plink procedures. Just remove all the ones with SAN_SA FID since they have fewer SNPs. All the IBD info etc is in this spreadsheet.
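Dropping every sample with a given family ID can be done by writing a list of FID and IID pairs and passing it to Plink's `--remove` option. A minimal sketch of building that list, assuming the first two whitespace-separated columns of each row are FID and IID (as in Plink PED and FAM files); the function name `build_remove_list` and the example rows are made up for illustration.

```python
def build_remove_list(sample_lines, drop_fid):
    """Return 'FID IID' lines for every sample whose family ID equals drop_fid."""
    remove = []
    for line in sample_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == drop_fid:
            remove.append(f"{fields[0]} {fields[1]}")
    return remove


# Made-up rows in FAM layout: FID, IID, father, mother, sex, phenotype.
EXAMPLE_FAM = [
    "SAN ind001 0 0 1 -9",
    "SAN_SA ind001 0 0 1 -9",
    "SAN_SA ind002 0 0 2 -9",
    "OTHER ind003 0 0 1 -9",
]

if __name__ == "__main__":
    # Each printed line is one FID/IID pair, the layout --remove expects.
    print("\n".join(build_remove_list(EXAMPLE_FAM, "SAN_SA")))
```

Matching on the FID column rather than on the individual ID matters here, because the duplicated San share individual IDs across the two files.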

Xing Redo

Posted by Zack on April 3, 2011 Comments Off

So I went back to the Xing et al dataset, which you can download from their website.

I found no duplicates within the Xing et al data but there are 259 samples common from the HapMap. Since they are not assigned any family IDs they will pass through the ped files without being merged into HapMap samples. So you need to remove any samples with IDs starting with "NA".

Xing et al also contains 6 duplicates from HGDP with completely different IDs and two Xing samples look to be related to HGDP samples.

There are also three pairs with very high identity-by-descent values, which I calculated using Plink. You can see the samples with PI_HAT greater than 0.5 in this spreadsheet. PI_HAT is the proportion IBD estimated by plink. Notice that all these pairs also have high IBS similarity (the DST column), more than 85% similar.
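The filtering step described above can be sketched as follows. It assumes the whitespace-delimited layout of a Plink `.genome` file, where PI_HAT is the estimated proportion IBD and DST is the IBS distance; the rows in `SAMPLE_GENOME` are made up, and `high_ibd_pairs` is a hypothetical helper name.

```python
import io

# Made-up rows laid out like a Plink .genome file.
SAMPLE_GENOME = """\
 FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
 F1 S1 F2 S2 UN NA 0.0000 0.0000 1.0000 1.0000 -1 0.999 1.0000 NA
 F1 S1 F3 S3 UN NA 0.9000 0.1000 0.0000 0.0500 -1 0.720 0.5000 2.0
 F4 S4 F5 S5 UN NA 0.2500 0.4600 0.2900 0.5200 -1 0.870 1.0000 9.5
"""


def high_ibd_pairs(handle, min_pi_hat=0.5):
    """Yield (IID1, IID2, PI_HAT, DST) for pairs with PI_HAT above min_pi_hat."""
    header = handle.readline().split()
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row = dict(zip(header, fields))
        pi_hat = float(row["PI_HAT"])
        if pi_hat > min_pi_hat:
            yield row["IID1"], row["IID2"], pi_hat, float(row["DST"])


if __name__ == "__main__":
    for iid1, iid2, pi_hat, dst in high_ibd_pairs(io.StringIO(SAMPLE_GENOME)):
        print(iid1, iid2, pi_hat, dst)
```

For a real run, open the `.genome` file and pass the file handle in place of the `io.StringIO` object.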

The samples I have removed as a result of this (other than HapMap) are listed in this spreadsheet.

Harappa Ancestry Project

Genetics and South Asia
