Introducing Reference 3

Having collected 12 datasets, I have gone through them and finally selected the samples and SNPs I want to include in my new dataset, which I'll call Reference 3.

It has 3,889 individuals and 217,957 SNPs. Since this is a South Asia-focused blog, it is worth noting that this reference set includes a total of 558 South Asians (compared to 398 in my Reference I).

You can see the number of SNPs in each dataset that are also present on 23andme version 2, 23andme version 3 and FTDNA Family Finder (Illumina chip).

The following datasets had more than 280,000 SNPs in common with all three platforms and hence were included in Reference 3:

  1. HapMap
  2. HGDP
  3. SGVP
  4. Behar
  5. Henn (Khoisan data)
  6. Rasmussen
  7. Austroasiatic
  8. Latino
  9. 1000genomes
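
A minimal Python sketch of this kind of inclusion check, assuming each dataset and each platform is represented as a set of rsIDs. The function names `common_snp_count` and `select_datasets`, the toy rsIDs and the tiny threshold are all hypothetical and only illustrate the idea; with real SNP lists the threshold would be 280,000.

```python
def common_snp_count(dataset_snps, platform_snp_sets):
    """Count SNPs in dataset_snps that appear on every platform."""
    shared = set(dataset_snps)
    for platform in platform_snp_sets:
        shared &= set(platform)
    return len(shared)


def select_datasets(datasets, platform_snp_sets, threshold):
    """Return names of datasets with more than `threshold` SNPs on all platforms."""
    return [
        name
        for name, snps in datasets.items()
        if common_snp_count(snps, platform_snp_sets) > threshold
    ]


# Toy data (hypothetical rsIDs) standing in for real SNP lists.
platforms = [
    {"rs1", "rs2", "rs3", "rs4"},  # stands in for 23andme v2
    {"rs1", "rs2", "rs3", "rs5"},  # stands in for 23andme v3
    {"rs1", "rs2", "rs3", "rs6"},  # stands in for FTDNA Family Finder
]
datasets = {
    "dense": {"rs1", "rs2", "rs3", "rs9"},
    "sparse": {"rs1", "rs4", "rs5"},
}

# Toy threshold of 2 instead of 280,000.
selected = select_datasets(datasets, platforms, threshold=2)
print(selected)
```

Here only the "dense" toy dataset passes, because three of its SNPs are on every platform while "sparse" shares just one.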

Reich et al had about 100,000 SNPs in common with 23andme (v2 & v3 intersection) and 137,000 with FTDNA, but those two sets did not overlap much: only 59,000 Reich et al SNPs were present on all three platforms. Since I really wanted Reich et al data in Reference 3, I included it, but the SNPs used for FTDNA comparisons won't be the same as those used for the 23andme comparisons.
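
A quick worked example of what this trade-off looks like, using only the approximate figures quoted above. The variable names are made up for illustration, and the results are only as precise as the rounded counts.

```python
# Approximate counts of Reich et al SNPs, as quoted in the text.
reich_23andme = 100_000  # shared with 23andme (v2 & v3 intersection)
reich_ftdna = 137_000    # shared with FTDNA
reich_all = 59_000       # present on all three platforms

# SNPs lost per platform if only the three-way intersection were used.
lost_for_23andme = reich_23andme - reich_all
lost_for_ftdna = reich_ftdna - reich_all

# Reich et al SNPs usable on at least one platform (inclusion-exclusion).
usable_somewhere = reich_23andme + reich_ftdna - reich_all

print(lost_for_23andme, lost_for_ftdna, usable_somewhere)
```

Restricting everyone to the three-way intersection would discard roughly 41,000 SNPs for 23andme comparisons and roughly 78,000 for FTDNA comparisons, which is why separate SNP sets per platform are attractive here.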

Of the datasets I could not include, I am most disappointed about the Pan-Asian dataset, since it has good coverage of South and Southeast Asia. Unfortunately, it has only 19,000 SNPs in common with 23andme v2 and 23,000 with 23andme v3. I am going to have to do some analyses with the Pan-Asian data, but it just can't be included in my Reference 3.

For personal reasons, I am also interested in doing some analysis with the Henn et al African data, which has about 52,000 SNPs.

Xing et al has about 71,000 SNPs in common with 23andme v3, so some good work could be done with that, though I'll have to use only 23andme version 3 participants.

The information about the populations included in Reference 3 is in a spreadsheet as usual.

34 Comments.

  2. Hi Zack, does this mean you're officially accepting participants who've tested with FTDNA?

  20. I need to convert vcf into something usable by Excel. How do I do this?

  21. BTW, I will be praying for the guy who received the bad news from the doctor.

  23. Hi Zack, I ran my 23andMe v3 results with HGDP, 1000Genomes and SGVP. After merging, --geno, and pruning, I ended up with ~250,000 SNPs. I ran admixture K=8 (*very* long wait) and I'm getting weird results. The European Italians from HGDP are in a different population than those from 1000Genomes. The SGVP and HGDP results essentially make sense, but 1000Genomes created another column for its populations! After much research, I think it has something to do with the build: the 1000Genomes data I used is b37, while HGDP & SGVP are b36. Did you come across this issue with your merging and getting results?
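
One way to test the build hypothesis raised in this comment is to compare the recorded positions of SNPs that two datasets share. The Python sketch below assumes both datasets are available as PLINK .bim files (whitespace-separated columns: chromosome, SNP ID, genetic distance, base-pair position, allele 1, allele 2). The function names and the toy coordinates are hypothetical, and `io.StringIO` stands in for real files.

```python
import io


def read_bim_positions(handle):
    """Map SNP ID -> (chromosome, base-pair position) from a PLINK .bim file."""
    positions = {}
    for line in handle:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, snp_id, _cm, bp = fields[:4]
        positions[snp_id] = (chrom, int(bp))
    return positions


def position_mismatch_rate(bim_a, bim_b):
    """Fraction of shared SNP IDs whose chromosome or position differs."""
    shared = bim_a.keys() & bim_b.keys()
    if not shared:
        return 0.0
    differing = sum(1 for snp in shared if bim_a[snp] != bim_b[snp])
    return differing / len(shared)


# Toy .bim contents with made-up coordinates (not real genome positions).
first_bim = io.StringIO("1 rs1 0 1000 A G\n1 rs2 0 2000 C T\n2 rs3 0 3000 A C\n")
second_bim = io.StringIO("1 rs1 0 1010 A G\n1 rs2 0 2000 C T\n2 rs4 0 4000 G T\n")

rate = position_mismatch_rate(
    read_bim_positions(first_bim), read_bim_positions(second_bim)
)
print(rate)
```

A high mismatch rate would suggest the two files use different genome builds. It does not by itself show that the build difference explains the odd admixture columns.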
