Behar et al Data

In their paper "The genome-wide structure of the Jewish people", Behar et al. analyzed the genomes of several Jewish groups. More important for us than the Jewish samples (which include two South Asian Jewish groups) are the South Asian, Middle Eastern, and European groups they sampled:

Ethnic group Count
Saudis 20
Jordanians 20
Georgians 20
Turks 19
Iranians 19
Hungarians 19
Ethiopians 19
Armenians 19
Lezgins 18
Chuvash 17
Syrians 16
Romanians 16
Uzbeks 15
Spaniards 12
Egyptians 12
Cypriots 12
Moroccans 10
Lithuanians 10
North Kannadi 9
Belarusians 9
Yemenis 8
Lebanese 7
Sakilli 4
Paniya 4
Cochin Jews 4
Bene Israel 4
Samaritans 2
Russians 2
Malayan 2

Of the 466 samples, I excluded 8 that were either duplicates or genetically too similar to other samples.
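To illustrate what a duplicate or near-duplicate check can look like, here is a minimal sketch. It is not necessarily the method used for this dataset. It assumes genotypes are coded as allele counts (0, 1, 2) with None for missing calls, and the function names and the 0.95 similarity cutoff are hypothetical choices made for the example.

```python
from itertools import combinations

def ibs_similarity(g1, g2):
    """Mean identity-by-state similarity in [0, 1] over SNPs called in both samples."""
    shared = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not shared:
        return 0.0
    # Each SNP can differ by at most 2 allele counts, hence the factor of 2.
    return 1.0 - sum(abs(a - b) for a, b in shared) / (2.0 * len(shared))

def near_duplicates(genotypes, threshold=0.95):
    """Return (id1, id2, similarity) for every pair at or above the threshold.

    The 0.95 default is an assumed cutoff for illustration only.
    """
    pairs = []
    for (id1, g1), (id2, g2) in combinations(sorted(genotypes.items()), 2):
        sim = ibs_similarity(g1, g2)
        if sim >= threshold:
            pairs.append((id1, id2, sim))
    return pairs

# Toy data: s2 is an exact copy of s1, s3 is unrelated and has one missing call.
samples = {
    "s1": [0, 1, 2, 2, 0, 1],
    "s2": [0, 1, 2, 2, 0, 1],
    "s3": [2, 0, 0, 1, 2, None],
}
flagged = near_duplicates(samples)
print(flagged)
```

With real data one sample from each flagged pair would be dropped. On a full SNP chip the all-pairs loop is slow in pure Python, so a dedicated tool is the more practical choice.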

The series matrix files that I downloaded were not in a format Plink can read directly. To convert them to Plink format, I had to look up the platform file for the Illumina genotyping BeadChip they used. Illumina also reports genotypes using A/B alleles and Top/Bot strands instead of the usual ACGT alleles and forward/reverse strands. An Illumina TechNote explains the system, and I found a Perl script to convert between the two.
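The core of the A/B translation can be sketched in a few lines. This is an illustration and not the Perl script mentioned above. It assumes the platform file has a SNP column with entries like "[T/C]", where (as I understand the Illumina convention) the first nucleotide is Allele A and the second is Allele B. The function name and the no-call tokens are assumptions, so check them against the actual files.

```python
import re

# Matches platform-file SNP entries such as "[A/G]" or "[T/C]".
SNP_PATTERN = re.compile(r"^\[([ACGT])/([ACGT])\]$")

# Assumed no-call tokens; verify against the downloaded files.
NO_CALLS = {"NC", "--"}

def ab_to_nucleotides(snp, ab_call):
    """Translate an A/B genotype call (e.g. "AB") into a pair of nucleotides."""
    if ab_call in NO_CALLS:
        return ("0", "0")  # Plink's default code for a missing allele
    match = SNP_PATTERN.match(snp)
    if match is None:
        # Indels and other non-SNP entries are not handled in this sketch.
        raise ValueError(f"unsupported SNP entry: {snp!r}")
    allele_a, allele_b = match.groups()
    lookup = {"A": allele_a, "B": allele_b}
    if len(ab_call) != 2 or any(c not in lookup for c in ab_call):
        raise ValueError(f"unexpected genotype call: {ab_call!r}")
    return (lookup[ab_call[0]], lookup[ab_call[1]])

# A heterozygous call at a [T/C] SNP.
print(ab_to_nucleotides("[T/C]", "AB"))
```

Note that the result is on whichever strand the SNP column is reported on. Getting every marker onto the forward strand additionally requires the strand annotation from the platform file, with alleles complemented where needed.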


  1. I really appreciate this transparency about the datasets you're using; it lets us lowly commenters play along at home. Quick question: When you're pruning for linkage disequilibrium, what R^2 threshold are you using? It would be neat to see your summary statistics or your plink arguments.

    This probably says more about how neurotic I am than anything else, but Behar, et al.'s labeling their South Indian sample "North_Kannadi" always annoyed me. It's one of those fake eastern adjectivalizations, like jihadi. It would've been better to use Kannadiga, or even Canarese.

    • Right now I am using an R^2 of 0.3 for LD pruning. But I plan to try some other values as well to see the effect on admixture analysis.

      I am trying to be transparent and likely boring the heck out of most people. But any questions about my code, methods or data are welcome.

  2. It would’ve been better to use Kannadiga, or even Canarese.

    thanks! i had no idea what your kind were called 🙂

  3. Hi Zack
    I'm having a hard time converting Behar's dataset to Plink format. Can you please help by detailing how you did it? Thanks!

    • I described the process earlier.

      If you want my hacked together script, send me an email and I'll send it to you.

      • First and foremost, you have the least cryptic genome website (especially works well with newbies like myself). I really appreciate the effort.

        Do you mind sending the same script?