Introducing Reference 3

Having collected 12 datasets, I have gone through them and finally selected the samples and SNPs I want to include in my new dataset, which I'll call Reference 3.

It has 3,889 individuals and 217,957 SNPs. Since this is a South Asia-focused blog, there are a total of 558 South Asians in this reference set (compared to 398 in my Reference 1).

You can see how many SNPs each dataset has in common with 23andme version 2, 23andme version 3 and FTDNA Family Finder (Illumina chip).

The following datasets had more than 280,000 SNPs common with all three platforms and hence were included in Reference 3:

  1. HapMap
  2. HGDP
  3. SGVP
  4. Behar
  5. Henn (Khoisan data)
  6. Rasmussen
  7. Austroasiatic
  8. Latino
  9. 1000genomes
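
An overlap check of this kind can be sketched in a few lines of Python. This is only an illustration, assuming each dataset and platform is represented as a set of rsID strings; the function names and the toy SNP lists are hypothetical, and this is not necessarily how the counts for Reference 3 were produced.

```python
def shared_snps(dataset_snps, platform_snps):
    """Return the SNPs of a dataset that appear on every platform."""
    common = set(dataset_snps)
    for snps in platform_snps.values():
        common &= set(snps)
    return common


def passes_threshold(dataset_snps, platform_snps, min_shared):
    """True if the dataset shares more than min_shared SNPs with all platforms."""
    return len(shared_snps(dataset_snps, platform_snps)) > min_shared


# Toy data (hypothetical rsIDs, not real chip content).
platforms = {
    "23andme_v2": {"rs1", "rs2", "rs3", "rs4"},
    "23andme_v3": {"rs1", "rs2", "rs3", "rs5"},
    "ftdna_ff": {"rs1", "rs2", "rs6"},
}
dataset = {"rs1", "rs2", "rs3", "rs7"}

print(sorted(shared_snps(dataset, platforms)))
print(passes_threshold(dataset, platforms, min_shared=1))
```

With real data, min_shared would be set to the cutoff mentioned above (280,000) and the sets would be read from each dataset's SNP list.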

Reich et al had about 100,000 SNPs in common with 23andme (v2 & v3 intersection) and 137,000 with FTDNA, but those two sets did not overlap much: only 59,000 Reich et al SNPs were present on all three platforms. Since I really wanted the Reich et al data in Reference 3, I included it, but the SNPs used for FTDNA comparisons won't be the same as those used for the 23andme comparisons.

Of the datasets I could not include, I am most disappointed about the Pan-Asian dataset, since it has good coverage of South and Southeast Asia. Unfortunately, it has only 19,000 SNPs in common with 23andme v2 and 23,000 with 23andme v3. I am going to have to do some analyses with the Pan-Asian data separately, but it just can't be included in my Reference 3.

For personal reasons, I am also interested in doing some analysis with the Henn et al African data, which has about 52,000 SNPs.

Xing et al has about 71,000 SNPs in common with 23andme v3, so some good work could be done with that, though I'll have to use only 23andme version 3 participants.

The information about the populations included in Reference 3 is in a spreadsheet as usual.

Latino Dataset

Razib mentioned a Latino/Hispanic dataset to me a few days ago.

The relevant paper is "Genome-wide patterns of population structure and admixture among Hispanic/Latino populations" by Katarzyna Bryc, Christopher Velez, Tatiana Karafet, Andres Moreno-Estrada, Andy Reynolds, Adam Auton, Michael Hammer, Carlos D. Bustamante, and Harry Ostrer. The data is available on the GEO Accession viewer.

The dataset has 100 samples from Colombia, Dominican Republic, Ecuador, and Puerto Rico.

It's in the same format and uses the same chip as Behar et al and Rasmussen et al. So it was really easy to download and convert it to Plink PED format.

Now what does a Hispanic dataset have to do with a South Asian genetics project? Nothing, for now. But I am collecting all genotyping data. I am also hoping that we get more participants of South Asian origin from the Caribbean and other countries of the region where South Asians have had a longer presence. In that case, it would be interesting to compare them against other populations of the Americas.

In keeping with my effort to clean the data of any relatives, here are the IBD/IBS analysis results. The 2nd sheet shows the two samples I removed.
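
For readers who want to try this kind of filtering themselves, here is a rough sketch. It assumes the pairwise results come from Plink's --genome output (a whitespace-delimited table with IID1, IID2 and PI_HAT columns). The PI_HAT cutoff, the greedy removal rule, the function names and the toy rows are all assumptions for illustration, not the exact procedure behind the spreadsheet.

```python
from collections import Counter


def related_pairs(genome_lines, min_pi_hat):
    """Return (IID1, IID2) pairs whose PI_HAT is at least min_pi_hat."""
    rows = [line.split() for line in genome_lines if line.strip()]
    header, body = rows[0], rows[1:]
    i1 = header.index("IID1")
    i2 = header.index("IID2")
    ip = header.index("PI_HAT")
    return [(r[i1], r[i2]) for r in body if float(r[ip]) >= min_pi_hat]


def samples_to_remove(pairs):
    """Greedily drop the sample involved in the most remaining related pairs."""
    pairs = list(pairs)
    removed = []
    while pairs:
        counts = Counter(sample for pair in pairs for sample in pair)
        # Ties are broken by sample ID so the result is deterministic.
        worst = min(counts, key=lambda s: (-counts[s], s))
        removed.append(worst)
        pairs = [p for p in pairs if worst not in p]
    return removed


# Toy rows in the .genome column layout (made-up samples and values).
toy = [
    "FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO",
    "F1 A F2 B UN NA 0.50 0.50 0.00 0.25 -1 0.80 1.0 NA",
    "F1 A F3 C UN NA 0.00 1.00 0.00 0.50 -1 0.85 1.0 NA",
    "F2 B F3 C UN NA 1.00 0.00 0.00 0.00 -1 0.70 0.5 NA",
]

pairs = related_pairs(toy, min_pi_hat=0.2)  # cutoff is an assumed value
print(samples_to_remove(pairs))
```

Removing the sample that appears in the most related pairs first keeps as many individuals as possible; other rules (for example, dropping the sample with more missing genotypes) would work just as well.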