Metspalu et al Data Relatedness

I performed IBD analysis on the Metspalu dataset using PLINK and found the following pairs of samples to be too closely related.

ID1      Source1   Population1  ID2             Source2   Population2  IBD Estimate
Mawasi1  Metspalu  Mawasi       Mawasi1         Chaubey   Mawasi       100%
VELZ260  Metspalu  Velama       Velama_184_R2   Reich     Velama       99%
VELZ260  Metspalu  Velama       VELZ265         Metspalu  Velama       19%
VELZ265  Metspalu  Velama       Velama_184_R2   Reich     Velama       19%
D254     Metspalu  Tharu        Tharu_107_R1    Reich     Tharu        99%
D260     Metspalu  Tharu        Tharu_108_R1    Reich     Tharu        98%
evo_32   Metspalu  Kanjar       321e            Metspalu  Kol          53%
HA030    Metspalu  Dharkar      HA039           Metspalu  Dharkar      52%
A387     Metspalu  Dusadh       A388            Metspalu  Dusadh       52%
A394     Metspalu  Dusadh       A395            Metspalu  Dusadh       52%
A395     Metspalu  Dusadh       A393            Metspalu  Dusadh       46%
A394     Metspalu  Dusadh       A393            Metspalu  Dusadh       45%
A392     Metspalu  Dusadh       A393            Metspalu  Dusadh       32%
A392     Metspalu  Dusadh       A395            Metspalu  Dusadh       31%
A392     Metspalu  Dusadh       A394            Metspalu  Dusadh       28%
evo_37   Metspalu  Kanjar       HA023           Metspalu  Dharkar      27%
HA039    Metspalu  Dharkar      HA041           Metspalu  Dharkar      24%
HLKP245  Metspalu  Hakkipikki   Hallaki_137_R2  Reich     Hallaki      22%
PULD160  Metspalu  Pulliyar     PULD162         Metspalu  Pulliyar     20%

As you can see, three samples from Reich et al appear to be the same individuals as samples in Metspalu et al (98-99% IBD). In addition, two Reich samples appear to be related to Metspalu samples (19% and 22%).

Some Metspalu samples are likely related to one another. An estimate of about 50% likely indicates a parent-child or sibling-sibling relationship. A 45-46% estimate is most likely siblings, in my opinion. An estimate of 18-19% could be a 1st cousin relationship in an endogamous community. It could also just be the background relatedness in a small, bottlenecked and endogamous community.
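
As a rough illustration of the reasoning above, here is a minimal Python sketch that maps an IBD estimate to the nearest textbook relationship. The expected values (100%, 50%, 25%, 12.5%) are the standard ones for an outbred population; the function name and labels are my own assumptions and are not part of PLINK. Endogamy pushes observed values upward, so the labels are only rough guides.

```python
# Expected genome-wide IBD sharing in an outbred population.
# Endogamy inflates observed values, so treat the labels as rough guides.
EXPECTED = [
    (1.0, "duplicate or identical twin"),
    (0.5, "parent-child or full siblings"),
    (0.25, "second degree (half siblings, grandparent, uncle/aunt)"),
    (0.125, "third degree (first cousins)"),
    (0.0, "unrelated"),
]


def nearest_relationship(ibd):
    """Return the label whose expected IBD share is closest to the estimate."""
    return min(EXPECTED, key=lambda entry: abs(entry[0] - ibd))[1]


for ibd in (0.99, 0.52, 0.45, 0.19):
    print(f"{ibd:.2f} -> {nearest_relationship(ibd)}")
```

Note that 19% sits almost exactly between the second-degree and first-cousin expectations, which is why such a pair is hard to call in a bottlenecked group.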

It looks like about half of the Dusadh in the Metspalu dataset are related.
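
To give a sense of what removing relatives would cost, here is a minimal sketch of greedy pruning applied to the Dusadh pairs from the table above. The 20% cutoff and the greedy strategy (repeatedly drop the sample with the most related partners) are assumptions chosen for illustration, not a method taken from the paper or from PLINK.

```python
from collections import defaultdict

# Dusadh pairs and IBD estimates copied from the table above.
PAIRS = [
    ("A387", "A388", 0.52),
    ("A394", "A395", 0.52),
    ("A395", "A393", 0.46),
    ("A394", "A393", 0.45),
    ("A392", "A393", 0.32),
    ("A392", "A395", 0.31),
    ("A392", "A394", 0.28),
]


def greedy_prune(pairs, threshold=0.2):
    """Drop the most-connected sample until no related pair remains.

    The threshold is an assumed cutoff, not a value from the paper.
    """
    neighbours = defaultdict(set)
    for a, b, ibd in pairs:
        if ibd > threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)
    removed = []
    while any(neighbours.values()):
        # Sorting first makes tie-breaking reproducible (lowest ID wins).
        worst = max(sorted(neighbours), key=lambda s: len(neighbours[s]))
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        removed.append(worst)
    return removed


removed = greedy_prune(PAIRS)
related = {sample for a, b, _ in PAIRS for sample in (a, b)}
print("related samples:", sorted(related))
print("removed:", removed)
print("kept:", sorted(related - set(removed)))
```

With this cutoff, four of the six related Dusadh samples have to go before no related pair is left.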

I am surprised at the close relationship of a Kanjar and a Kol in the dataset, though both are from Uttar Pradesh.


  1. Is it so hard to do sampling without incorporating relatives? Most ethnic or regional groups have millions of members.

    • With endogamy, the relatedness within groups is very high. In our relatively insular caste (total population ~7 million), in spite of gotra and clan exogamy, almost everyone is related to everyone else within a few degrees.

    • I think it is hard.

      • Why? You have the option of collecting all samples of an ethnic group from different locales of a country. This way all samples of the ethnic group will likely be non-relatives.

  2. While I can see the difficulty in collecting unrelated samples, what I can't understand is the lack of first-level pruning/cleaning-up of the data, just as you have done above. A careful reading of the Material and Methods doesn't turn up anything specific on this. It's not rocket science and there is no reason to doubt your numbers above. (Perhaps the data you were given was the
    unpruned set? The dataset is password-protected at NCBI/GEO.)

    Assuming your days for the next week had 48h each (:)) it shouldn't be too
    hard to compute whether this data did indeed skew the relevant haplotype frequencies
    and/or Fst values, right? (These two strike me as the most likely to be affected - perhaps I
    am mistaken in this.)

    • The blood/saliva samples are collected by one (or more) groups, the genotyping is done by another lab, and the analysis (that brought us the paper) is done by others still. Usually it's only at the last step that you find out about samples being related. It's not possible to go back into the field then.

      My guess is the results in the paper are not impacted much by those relatives. The Reich et al data was used in only a couple of analyses. The only population whose results are likely to be affected is Dusadh.

      I am going to compute Fst to compare with the Metspalu et al paper.

      • Yes, given the size of the related set above relative to the set used in the analyses, I expect it to have minimal impact on the results, if any. I was a bit surprised there was no mention of this related set in the paper. Agree with your first paragraph.