In their paper "The genome-wide structure of the Jewish people", Behar et al. analyzed the genomes of several Jewish groups. More important for us than the Jewish samples (which include two South Asian Jewish groups) are the South Asian, Middle Eastern, and European groups they sampled:
Ethnic group | Count |
---|---|
Saudis | 20 |
Jordanians | 20 |
Georgians | 20 |
Turks | 19 |
Iranians | 19 |
Hungarians | 19 |
Ethiopians | 19 |
Armenians | 19 |
Lezgins | 18 |
Chuvashs | 17 |
Syrians | 16 |
Romanians | 16 |
Uzbeks | 15 |
Spaniards | 12 |
Egyptians | 12 |
Cypriots | 12 |
Moroccans | 10 |
Lithuanians | 10 |
North Kannadi | 9 |
Belorussian | 9 |
Yemenese | 8 |
Lebanese | 7 |
Sakilli | 4 |
Paniya | 4 |
Cochin Jews | 4 |
Bene Israel | 4 |
Samaritans | 2 |
Russian | 2 |
Malayan | 2 |
Of the 466 samples, I excluded 8 because they were either duplicates or genetically too similar to other samples.
The series matrix files that I downloaded were not in a format that Plink reads directly. To convert them to Plink format, I had to look up the platform file for the Illumina genotyping BeadChip they used. Also, Illumina uses a system of A/B alleles and TOP/BOT strands instead of the regular ACGT alleles and forward/reverse strands. This Illumina Technote explained it, and I found a Perl script to convert between the two.
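To give a rough idea of the A/B step, here is a minimal Python sketch. It is not the Perl script mentioned above. It assumes that genotype calls are coded as `AA`, `AB`, `BB`, or a no-call value, and that the platform file has a SNP field such as `[T/C]` in which the first allele is Illumina's allele A and the second is allele B. The function names are hypothetical. Deciding which SNPs need to be flipped to the forward strand requires strand information from the platform file and is not shown.

```python
import re

# Matches a manifest-style SNP field such as "[T/C]" (assumed layout).
_SNP_FIELD = re.compile(r"^\[([ACGT])/([ACGT])\]$")

# Complement table for flipping a genotype to the opposite strand.
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "0": "0"}


def ab_to_nucleotides(call, snp_field):
    """Translate an A/B genotype call into a pair of nucleotide alleles.

    Assumes the first allele in snp_field is allele A and the second is
    allele B. Anything other than AA/AB/BB is treated as a no-call and
    returned as ("0", "0"), the missing-genotype code used in Plink PED files.
    """
    match = _SNP_FIELD.match(snp_field)
    if match is None:
        raise ValueError(f"unexpected SNP field: {snp_field!r}")
    lookup = {"A": match.group(1), "B": match.group(2)}
    if call not in ("AA", "AB", "BB"):
        return ("0", "0")
    return (lookup[call[0]], lookup[call[1]])


def flip_strand(alleles):
    """Return the complementary alleles (for example, BOT to TOP)."""
    return tuple(_COMPLEMENT[a] for a in alleles)


print(ab_to_nucleotides("AB", "[T/C]"))
print(flip_strand(ab_to_nucleotides("AB", "[T/C]")))
```

Each converted pair can then be written out as the two allele columns for that SNP in a PED file.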
I really appreciate this transparency about the datasets you're using; it lets us lowly commenters play along at home. Quick question: When you're pruning for linkage disequilibrium, what R^2 threshold are you using? It would be neat to see your summary statistics or your plink arguments.
This probably says more about how neurotic I am than anything else, but Behar, et al.'s labeling their South Indian sample "North_Kannadi" always annoyed me. It's one of those fake eastern adjectivalizations, like jihadi. It would've been better to use Kannadiga, or even Canarese.
Right now I am using an R^2 of 0.3 for LD pruning. But I plan to try some other values as well to see the effect on admixture analysis.
I am trying to be transparent and likely boring the heck out of most people. But any questions about my code, methods or data are welcome.
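For anyone who wants to try the pruning at home, a Plink call with an R^2 threshold of 0.3 could be assembled roughly like this. Only the 0.3 comes from my setup as stated above; the window size of 50 SNPs, the step of 5 SNPs, and the file names are placeholder assumptions, not necessarily what I used.

```python
def ld_prune_command(bfile, out, window_snps=50, step_snps=5, r2=0.3):
    """Build a Plink command line for pairwise LD pruning.

    window_snps and step_snps are illustrative defaults (assumptions);
    r2 is the pruning threshold.
    """
    return [
        "plink",
        "--bfile", bfile,
        "--indep-pairwise", str(window_snps), str(step_snps), str(r2),
        "--out", out,
    ]


# "behar" and "behar_pruned" are hypothetical file prefixes.
cmd = ld_prune_command("behar", "behar_pruned")
print(" ".join(cmd))
# To actually run it (requires Plink on the PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Plink writes the SNPs to keep to a `.prune.in` file, which can be passed to `--extract` in a second run to produce the pruned dataset.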
well, the minority not bored are prolly going to be useful later if you want ppl to double check, etc.
> It would’ve been better to use Kannadiga, or even Canarese.
thanks! i had no idea what your kind were called 🙂
Hi Zack
I'm having a hard time converting Behar's dataset to Plink format. Can you please help by detailing how you did it? Thanks!
I described the process earlier.
If you want my hacked-together script, send me an email and I'll send it to you.
First and foremost, you have the least cryptic genome website (it works especially well for newbies like myself). I really appreciate the effort.
Do you mind sending the same script?