Tag Archives: ibd

Relatives in Datasets

Recently, there was a paper, "Identification of Close Relatives in the HUGO Pan-Asian SNP Database", by Xiong Yang, Shuhua Xu, and the HUGO Pan-Asian SNP Consortium. The authors describe their standardized subsets as follows:

three individuals involved in MZ pairs were excluded from the whole dataset to construct standardized subset PASNP1716; seventy-six individuals involved in first-degree relationships were excluded from PASNP1716 to construct standardized subset PASNP1640; and 57 individuals involved in second-degree relationships were excluded from PASNP1640 to construct standardized subset PASNP1583. The individuals excluded were summarized in Table S6, S7, S8.

Let me engage in some blog triumphalism by saying I wrote about the duplicates and relatives in the Pan-Asian dataset in April 2011.

Here are my blog posts about relatedness in datasets:

Early on, I was removing only first degree relatives from the reference datasets. Nowadays, I try to remove all second degree relatives too. I leave the third degree relatives in the data since it's sometimes hard to figure out how real the low IBD values are in Plink. There are a lot of 3rd degree relatives if Plink is to be believed, but I am a little skeptical.

Since Plink's IBD analysis requires homogeneous samples, I am now using KING (paper) for the purpose. I am also looking at kcoeff (paper).

Metspalu et al Data Relatedness

I performed IBD analysis on the Metspalu dataset using plink and found the relatedness of the following samples to be too high.

ID1      Source1   Population1  ID2             Source2   Population2  IBD Estimate
Mawasi1  Metspalu  Mawasi       Mawasi1         Chaubey   Mawasi       100%
VELZ260  Metspalu  Velama       Velama_184_R2   Reich     Velama       99%
VELZ260  Metspalu  Velama       VELZ265         Metspalu  Velama       19%
VELZ265  Metspalu  Velama       Velama_184_R2   Reich     Velama       19%
D254     Metspalu  Tharu        Tharu_107_R1    Reich     Tharu        99%
D260     Metspalu  Tharu        Tharu_108_R1    Reich     Tharu        98%
evo_32   Metspalu  Kanjar       321e            Metspalu  Kol          53%
HA030    Metspalu  Dharkar      HA039           Metspalu  Dharkar      52%
A387     Metspalu  Dusadh       A388            Metspalu  Dusadh       52%
A394     Metspalu  Dusadh       A395            Metspalu  Dusadh       52%
A395     Metspalu  Dusadh       A393            Metspalu  Dusadh       46%
A394     Metspalu  Dusadh       A393            Metspalu  Dusadh       45%
A392     Metspalu  Dusadh       A393            Metspalu  Dusadh       32%
A392     Metspalu  Dusadh       A395            Metspalu  Dusadh       31%
A392     Metspalu  Dusadh       A394            Metspalu  Dusadh       28%
evo_37   Metspalu  Kanjar       HA023           Metspalu  Dharkar      27%
HA039    Metspalu  Dharkar      HA041           Metspalu  Dharkar      24%
HLKP245  Metspalu  Hakkipikki   Hallaki_137_R2  Reich     Hallaki      22%
PULD160  Metspalu  Pulliyar     PULD162         Metspalu  Pulliyar     20%

As you can see, three samples from Reich et al and one from Chaubey et al seem to be the same as samples in Metspalu et al. In addition, two Reich samples seem to be related to Metspalu samples.

Some Metspalu samples are likely related to one another. An IBD estimate of about 50% likely indicates a parent-child or sibling relationship. A 45-46% estimate is most likely siblings, in my opinion. An estimate of 18-19% could be a first-cousin relationship in an endogamous community; it could also just be the background relatedness in a small, bottlenecked, endogamous community.
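To make that reading of the percentages concrete, here is a minimal sketch that bins an IBD proportion into a relationship class. The expected values (1.0 for duplicates or identical twins, 0.5 for first degree, 0.25 for second degree, 0.125 for third degree) are standard; the cutoffs halfway between them on a log scale are an assumption of this sketch, not a Plink setting, and endogamy can push real values above these expectations.

```python
# Cutoffs are the geometric midpoints between the expected IBD proportions
# 1.0, 0.5, 0.25, 0.125 (and 0.0625 below third degree). These cutoffs are an
# assumption of this sketch, not something Plink applies itself.
CUTOFFS = [
    (2 ** -0.5, "duplicate or MZ twin"),  # about 0.707
    (2 ** -1.5, "first degree"),          # about 0.354
    (2 ** -2.5, "second degree"),         # about 0.177
    (2 ** -3.5, "third degree"),          # about 0.088
]


def classify_pi_hat(pi_hat):
    """Return a rough relationship class for an IBD proportion between 0 and 1."""
    for cutoff, label in CUTOFFS:
        if pi_hat >= cutoff:
            return label
    return "unrelated or background"


for value in (0.99, 0.52, 0.28, 0.19, 0.10):
    print(value, classify_pi_hat(value))
```

With these cutoffs a 19% estimate lands just inside the second-degree bin, which is exactly the kind of borderline value that is hard to interpret in an endogamous population.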

It looks like about half of the Dusadh in the Metspalu dataset are related.

I am surprised at the close relationship of a Kanjar and a Kol in the dataset, though both are from Uttar Pradesh.

Reich et al Duplicates

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have to make sure there are no duplicates, relatives, or other strange things that could cause issues with my analysis.

So I went back to the Reich et al Indian dataset.

The dataset doesn't have any duplicate or likely relative samples itself. However, there are two Kharia samples that are the same as samples in the Austroasiatic dataset. Since the Austroasiatic dataset has more SNPs in common with 23andMe, I removed these two samples from Reich et al.

The IBD/IBS analysis and the sample IDs are in a spreadsheet as usual.

1000genomes

I got the 1000genomes data a couple of weeks ago. Trying to convert it from VCF to PED format using vcftools was a complete disaster. Then Dienekes sent me a conversion script which was more than a hundred times faster.

1000genomes will have 100 Assamese Ahom, 100 Kayastha from Calcutta, 100 Reddys from Hyderabad, 100 Maratha from Bombay and 100 Lahori Punjabis later this year. Right now, the new populations (other than HapMap) are British, Finns, Han Chinese South, Puerto Ricans, Colombians, and Spaniards.

I removed all 660 samples that were in common with the HapMap data. Also, there were 31 pairs with high IBD values. The list of IBD/IBS values and the samples I removed can be seen in the spreadsheet.

Latino Dataset

Razib mentioned a Latino/Hispanic dataset to me a few days ago.

The relevant paper is "Genome-wide patterns of population structure and admixture among Hispanic/Latino populations" by Katarzyna Bryc, Christopher Velez, Tatiana Karafet, Andres Moreno-Estrada, Andy Reynolds, Adam Auton, Michael Hammer, Carlos D. Bustamante, and Harry Ostrer. And the data is available on the GEO Accession viewer.

The dataset has 100 samples from Colombia, Dominican Republic, Ecuador, and Puerto Rico.

It's in the same format and uses the same chip as Behar et al and Rasmussen et al. So it was really easy to download and convert it to Plink PED format.

Now what does a Hispanic dataset have to do with a South Asian genetics project? Nothing, for now. But I am collecting all genotyping data. I am also hoping that we get more participants of South Asian origin from the Caribbean and other countries of the region where there has been a longer presence of South Asians. In that case, it would be interesting to compare them against other populations of the Americas.

In keeping with my effort to clean the data of any relatives, here are the IBD/IBS analysis results. The 2nd sheet shows the two samples I removed.

Pan-Asian Dataset Duplicates and Relatives

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have to make sure there are no duplicates, relatives, or other strange things that could cause issues with my analysis.

Looking at the Pan-Asian dataset, I found 3 pairs of duplicate samples and 82 pairs that could be closely related. I have removed 64 samples from the dataset.
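Going from a list of related pairs to a list of samples to drop is a small puzzle in itself, since one sample can appear in several pairs. A minimal sketch of one simple approach, a greedy pass that repeatedly drops the sample involved in the most remaining pairs, follows. The function name and sample IDs are hypothetical, the greedy choice is not guaranteed to give the smallest possible removal list, and it is not necessarily how the removals in the spreadsheet were chosen.

```python
from collections import defaultdict


def samples_to_remove(pairs):
    """Greedily choose samples to drop until no related pair is left intact."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    removed = []
    while any(neighbours.values()):
        # Drop the sample that is in the most remaining pairs;
        # ties are broken by ID so the result is deterministic.
        worst = min(neighbours, key=lambda s: (-len(neighbours[s]), s))
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        removed.append(worst)
    return removed


# Hypothetical IDs: S2 is related to both S1 and S3, and S4 to S5.
pairs = [("S1", "S2"), ("S2", "S3"), ("S4", "S5")]
print(samples_to_remove(pairs))  # prints ['S2', 'S4']
```

Dropping the most-connected sample first is why the number of removed samples can be smaller than the number of related pairs.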

You can see the IBD results from plink as well as the list of sample IDs I removed in a spreadsheet.

UPDATE: I found 4 Melanesians in the Pan-Asian dataset who were the same as those in HGDP. So I have removed those as well and added them to the list in the spreadsheet.

Austroasiatic Dataset Duplicates

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have to make sure there are no duplicates, relatives, or other strange things that could cause issues with my analysis.

So I went back to the Chaubey et al Austroasiatic Indians dataset.

The dataset doesn't have any duplicate or likely relative samples itself. Of course, I had to remove the 632 HGDP samples it had, but that's easy to do since they have the same IDs (starting with HGDP).

As their paper mentions, the dataset also has 19 Dravidian-speaking Indian samples from Behar et al. Since I got the Behar et al data from the GEO site, I had different IDs for them than what they use in this dataset. So I had to figure out which samples were the same in both. The IBS/IBD results for the duplicates as well as the list of sample IDs I removed are given in a spreadsheet.
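When the IDs differ, the genotypes themselves are the only way to match samples: a duplicate shares essentially every allele with its twin in the other dataset. A minimal sketch of an identity-by-state similarity between two samples follows, assuming genotypes are coded as 0, 1, or 2 copies of one allele with None for missing calls. The coding and the function name are assumptions of this sketch. As far as I understand it, this is the same quantity Plink reports as DST, that is, (IBS2 + 0.5 × IBS1) divided by the number of SNPs compared.

```python
def ibs_similarity(geno_a, geno_b):
    """Proportion of alleles shared identical-by-state over SNPs called in both samples."""
    shared = 0
    compared = 0
    for a, b in zip(geno_a, geno_b):
        if a is None or b is None:
            continue  # skip SNPs missing in either sample
        shared += 2 - abs(a - b)  # 2, 1, or 0 alleles shared at this SNP
        compared += 2
    if compared == 0:
        raise ValueError("no SNPs genotyped in both samples")
    return shared / compared


# Hypothetical genotypes: identical calls give 1.0, which points to a duplicate.
print(ibs_similarity([0, 1, 2, 2], [0, 1, 2, 2]))  # prints 1.0
```

In practice a duplicate typed on two chips will score slightly below 1.0 because of genotyping errors, so a threshold close to, but not exactly, 1.0 is the safer test.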

Checking this out resolved an issue I had with Behar et al. Behar et al has 4 Paniya samples from South India. One of those four has admixture proportions similar to Indians but three seem very East Asian. I had always suspected that those three samples were mislabeled. Now the Austroasiatic dataset also has those four Paniya samples. However, only one of them is identical to the Behar et al one. The other three are different. I haven't checked yet which one of the Behar samples matches Austroasiatic, but my guess is that it is the one with the more Indian admixture. So I am keeping the other three Paniya samples from the Austroasiatic dataset and hoping that they are the correct ones.

Rasmussen Likely Relatives

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have to make sure there are no duplicates, relatives, or other strange things that could cause issues with my analysis.

So I went back to the Rasmussen et al dataset, which you can download from here.

While there are no duplicates, 9 pairs of samples have high IBS values (85% similar or more) and seem to be related (Plink PI_HAT > 0.5). You can see the IBD results in a spreadsheet, along with the 8 samples I removed.

Henn Duplicates

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have to make sure there are no duplicates, relatives, or other strange things that could cause issues with my analysis.

So I went back to the Henn et al dataset, which you can download from their website.

There are 107 samples in common with HapMap (IDs start with NA) and 131 in common with HGDP (IDs start with HGDP).

Henn et al has two PED files: one for the Khoisan data and one for the all-Africa 55k SNP set. Unfortunately, 31 San samples are duplicated across these PED files with the same individual IDs but different family IDs (SAN and SAN_SA), so Plink does not merge them automatically. Just remove all the ones with the SAN_SA FID since they have fewer SNPs. All the IBD info etc. is in this spreadsheet.
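This kind of duplicate, the same individual ID under a different family ID, can be spotted without any IBD analysis by comparing the first two columns (FID and IID) of the two PED files. A minimal sketch follows; the function names and the sample lines are hypothetical.

```python
def read_ids(ped_lines):
    """Return (FID, IID) pairs from the first two columns of PED or FAM lines."""
    ids = []
    for line in ped_lines:
        fields = line.split(None, 2)  # only split off the two ID columns
        if len(fields) >= 2:
            ids.append((fields[0], fields[1]))
    return ids


def cross_file_duplicates(ids_a, ids_b):
    """Return entries of ids_b whose IID also appears in ids_a under another FID."""
    fids_by_iid = {}
    for fid, iid in ids_a:
        fids_by_iid.setdefault(iid, set()).add(fid)
    return [(fid, iid) for fid, iid in ids_b
            if iid in fids_by_iid and fid not in fids_by_iid[iid]]


# Hypothetical sample lines; real PED lines carry many genotype columns.
khoisan = ["SAN S001 0 0 1 -9 A A", "SAN S002 0 0 2 -9 G G"]
africa = ["SAN_SA S001 0 0 1 -9 A A", "YRI Y001 0 0 1 -9 A A"]

dupes = cross_file_duplicates(read_ids(khoisan), read_ids(africa))
remove_lines = [f"{fid} {iid}" for fid, iid in dupes]
print(remove_lines)  # prints ['SAN_SA S001']
```

Written one per line to a text file, these FID and IID pairs are in the form Plink's --remove option expects.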

Xing Redo

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have to make sure there are no duplicates, relatives, or other strange things that could cause issues with my analysis.

So I went back to the Xing et al dataset, which you can download from their website.

I found no duplicates within the Xing et al data, but 259 samples are in common with HapMap. Since they are not assigned any family IDs, they will pass through the PED files without being merged into the HapMap samples. So you need to remove any samples with IDs starting with "NA".

Xing et al also contains 6 duplicates from HGDP with completely different IDs and two Xing samples look to be related to HGDP samples.

There are also three pairs with very high identity-by-descent values, which I calculated using Plink. You can see the samples with PI_HAT greater than 0.5 in this spreadsheet. PI_HAT is the proportion IBD estimated by Plink. Notice that all these pairs also have high IBS similarity (the DST column), more than 85% similar.
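Pulling the high-IBD pairs out of Plink's output is a simple filter on the PI_HAT column of the .genome file, which is whitespace-delimited with a header row. A minimal sketch follows; the function name, the sample rows, and their values are made up for illustration, and the 0.5 cutoff simply mirrors the one used above.

```python
def related_pairs(genome_lines, min_pi_hat=0.5):
    """Return (IID1, IID2, PI_HAT, DST) for pairs whose PI_HAT exceeds the cutoff."""
    rows = iter(genome_lines)
    header = next(rows).split()
    col = {name: i for i, name in enumerate(header)}  # column name -> position
    hits = []
    for line in rows:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        pi_hat = float(fields[col["PI_HAT"]])
        if pi_hat > min_pi_hat:
            hits.append((fields[col["IID1"]], fields[col["IID2"]],
                         pi_hat, float(fields[col["DST"]])))
    return hits


# Made-up rows in the .genome column layout.
sample = [
    "FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO",
    "F1 S1 F2 S2 UN NA 0.0000 0.0000 1.0000 1.0000 -1 0.999 1.0000 NA",
    "F1 S1 F3 S3 UN NA 0.9000 0.1000 0.0000 0.0500 -1 0.720 0.6000 2.1",
]
print(related_pairs(sample))  # prints [('S1', 'S2', 1.0, 0.999)]
```

Looking up columns by header name rather than position keeps the filter working if the file is reordered or trimmed before analysis.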

The samples I have removed as a result of this (other than HapMap) are listed in this spreadsheet.