Tag Archives: 1000genomes

Introducing Reference 3

Having collected 12 datasets, I have gone through them and finally selected the samples and SNPs I want to include in my new dataset, which I'll call Reference 3.

It has 3,889 individuals and 217,957 SNPs. Since this is a South Asia-focused blog, I'll point out that there are a total of 558 South Asians in this reference set (compared to 398 in my Reference I).

You can see the number of SNPs in each dataset that are common with 23andme version 2, 23andme version 3 and FTDNA Family Finder (Illumina chip).

The following datasets had more than 280,000 SNPs common with all three platforms and hence were included in Reference 3:

  1. HapMap
  2. HGDP
  3. SGVP
  4. Behar
  5. Henn (Khoisan data)
  6. Rasmussen
  7. Austroasiatic
  8. Latino
  9. 1000genomes

Reich et al had about 100,000 SNPs in common with 23andme (the v2 & v3 intersection) and 137,000 with FTDNA, but the three-way overlap was not great: only 59,000 Reich et al SNPs were present on all three platforms. Since I really wanted the Reich et al data in Reference 3, I included it anyway, but the SNPs used for FTDNA comparisons won't be the same as those used for the 23andme comparisons.
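The bookkeeping behind these numbers is just set intersection on SNP IDs: a dataset qualifies for the merged reference only if enough of its SNPs appear on all three platforms. A minimal sketch, with made-up rsids standing in for the real chip annotation lists:

```python
# Toy sketch of the platform-intersection check. The rsids below are
# placeholders, not real chip content.

def common_snps(*id_sets):
    """SNP IDs present in every one of the given sets."""
    return set.intersection(*map(set, id_sets))

# Stand-ins for the real ID lists (which come from the chip annotation files):
v2 = {"rs1", "rs2", "rs3", "rs4"}       # 23andme v2
v3 = {"rs2", "rs3", "rs4", "rs5"}       # 23andme v3
ftdna = {"rs3", "rs4", "rs5", "rs6"}    # FTDNA Family Finder
dataset = {"rs2", "rs3", "rs4", "rs6"}  # some reference dataset

usable = common_snps(dataset, v2, v3, ftdna)
print(sorted(usable))  # SNPs this dataset contributes on all three platforms
```

With the real lists, `len(usable)` is the number I compared against the 280,000 cutoff.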

Of the datasets I could not include, I am most disappointed about the Pan-Asian dataset since it has a good coverage of South and Southeast Asia. Unfortunately, it has only 19,000 SNPs in common with 23andme v2 and 23,000 with 23andme v3. I am going to have to do some analyses with the Pan-Asian data but it just can't be included in my Reference 3.

I am also interested, for personal reasons, in doing some analysis with the Henn et al African data, which has about 52,000 SNPs in common.

Xing et al has about 71,000 SNPs in common with 23andme v3, so some good work could be done with that, though I'll have to use only 23andme version 3 participants.

The information about the populations included in Reference 3 is in a spreadsheet as usual.


I got the 1000genomes data a couple of weeks ago. Trying to convert it from VCF to PED format using vcftools was a complete disaster. Then Dienekes sent me a conversion script which was more than a hundred times faster.
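For the curious, the core of such a conversion is conceptually simple even when a given implementation is slow. Below is a minimal, illustrative Python sketch (not Dienekes' script, and no real vcftools code) that assumes biallelic SNPs and a plain diploid GT as the first FORMAT field:

```python
# Minimal VCF -> PED/MAP conversion sketch, for illustration only.
# Assumes: biallelic SNPs, diploid GT as the first FORMAT field,
# tab-separated VCF. In PED, "0" marks a missing allele.

def vcf_to_ped(vcf_lines):
    samples, geno, map_rows = [], [], []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            continue  # meta-information lines
        if line.startswith("#CHROM"):
            samples = line.split("\t")[9:]
            geno = [[] for _ in samples]  # geno[i]: allele pairs for sample i
            continue
        f = line.split("\t")
        chrom, pos, snp_id, ref, alt = f[0], f[1], f[2], f[3], f[4]
        map_rows.append("\t".join((chrom, snp_id, "0", pos)))
        alleles = {"0": ref, "1": alt, ".": "0"}
        for i, call in enumerate(f[9:]):
            gt = call.split(":")[0].replace("|", "/").split("/")
            geno[i].append((alleles.get(gt[0], "0"), alleles.get(gt[1], "0")))
    ped_rows = []
    for i, s in enumerate(samples):
        row = [s, s, "0", "0", "0", "-9"]  # FID IID PID MID sex phenotype
        for a, b in geno[i]:
            row += [a, b]
        ped_rows.append(" ".join(row))
    return ped_rows, map_rows
```

Note that PED is sample-major while VCF is SNP-major, so a converter has to hold one growing genotype list per sample (or make multiple passes); how that transposition is handled is where the speed differences come from.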

1000genomes will have 100 Assamese Ahom, 100 Kayadtha from Calcutta, 100 Reddys from Hyderabad, 100 Maratha from Bombay and 100 Lahori Punjabis later this year. Right now, the new populations (other than HapMap) are British, Finns, Han Chinese South, Puerto Ricans, Colombians, and Spaniards.

I removed all 660 samples that were also present in the HapMap data. Also, there were 31 pairs with high IBD values. The list of IBD/IBS values and the samples I removed can be seen in the spreadsheet.

Two Steps Forward, Two Steps Back

I got my daughter a netbook, so now my computer is doing Harappa Project work 24x7.

Also, Simranjit was nice enough to offer me the use of a server. For privacy reasons, I am not going to upload any of the participants' data there, but it is much faster than my machine and hence very useful for running Admixture on the reference data (especially with cross-validation).
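For reference, ADMIXTURE's cross-validation mode (run as `admixture --cv mydata.bed K`, with a hypothetical file name here) prints one "CV error" line per run, and the usual procedure is to pick the K with the lowest error. A small sketch that picks it out of the logs; the log text below is made up:

```python
# Sketch: choose the number of ancestral populations K from ADMIXTURE's
# cross-validation output. The error values below are invented.

import re

def best_k(log_text):
    """Return (K, cv_error) for the lowest cross-validation error."""
    errors = {}
    for m in re.finditer(r"CV error \(K=(\d+)\): ([\d.]+)", log_text):
        errors[int(m.group(1))] = float(m.group(2))
    k = min(errors, key=errors.get)
    return k, errors[k]

logs = """\
CV error (K=2): 0.55213
CV error (K=3): 0.54120
CV error (K=4): 0.53987
CV error (K=5): 0.54355
"""
print(best_k(logs))
```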

As for steps back, I downloaded the current 1000genomes data (1,212 samples, 2.4 million SNPs). It's in VCF format. Using vcftools to convert it to PED format will take about 3 weeks. Yes, you read that right. BTW, the good stuff from a South Asian point of view will come later this year with 100 Assamese Ahom, 100 Kayadtha from Calcutta, 100 Reddys from Hyderabad, 100 Maratha from Bombay and 100 Lahori Punjabis.

Also, I spent most of Sunday evening and night in the ER and got a diagnosis of ureterolithiasis for my efforts. All I can say is: Three cheers for Percocet!!

UPDATE: Dienekes was kind enough to send me his conversion code, which, judging from the source, should run really fast.

I am still astonished that the vcftools conversion code is so slow. Maybe I should look at their source code.