Tag Archives: sgvp

Introducing Reference 3

Having collected 12 datasets, I have gone through them and finally selected the samples and SNPs I want to include in my new dataset, which I'll call Reference 3.

It has 3,889 individuals and 217,957 SNPs. Since this is a South Asia focused blog, there are a total of 558 South Asians in this reference set (compared to 398 in my Reference I).

You can see the number of SNPs of various datasets which are common to 23andme version 2, 23andme version 3 and FTDNA Family Finder (Illumina chip).

The following datasets had more than 280,000 SNPs common with all three platforms and hence were included in Reference 3:

  1. HapMap
  2. HGDP
  3. SGVP
  4. Behar
  5. Henn (Khoisan data)
  6. Rasmussen
  7. Austroasiatic
  8. Latino
  9. 1000genomes

Reich et al had about 100,000 SNPs in common with 23andme (v2 & v3 intersection) and 137,000 with FTDNA, but there was not a great overlap. Only 59,000 Reich et al SNPs were present in all three platforms. Since I really wanted Reich et al data in Reference 3, I included it but the SNPs used for FTDNA comparisons won't be the same as for the 23andme comparisons.

Of the datasets I could not include, I am most disappointed about the Pan-Asian dataset since it has a good coverage of South and Southeast Asia. Unfortunately, it has only 19,000 SNPs in common with 23andme v2 and 23,000 with 23andme v3. I am going to have to do some analyses with the Pan-Asian data but it just can't be included in my Reference 3.

I am also interested in doing some analysis with the Henn et al African data with about 52,000 SNPs for personal reasons.

Xing et al has about 71,000 SNPs in common with 23andme v3, so some good work could be done with that, though I'll have to use only 23andme version 3 participants.

The information about the populations included in Reference 3 is in a spreadsheet as usual.

One PED File to Rule Them All

I am interested in North African populations due to my own heritage, so when Razib alerted me that Henn et al had a paper out about South African origins of humans and their African dataset was publicly available and included populations from all over Africa, I immediately downloaded it.

I have also been considering looking into the East Asian admixture in South Asians and Iranians in some detail to see where it originates from: Southeast Asia, Chinese/Japanese/Koreans, or the Turkic/Mongolian/Siberian populations of interior northeastern Asia. At a quick glance, Razib is correct:

The eastern Asian components are enriched among Bengalis, as you’d expect, but they’re found in different proportions among many individuals who hail from the northern fringe of South Asia more generally. It seems clear that the further west you go, the more likely the “eastern” element is going to be Turk, while the further east (and to some extent south) the more likely it is to be more southernly in provenance.

To do a better job though, it would be better to have more than the Yakut as an examplar of the Siberian component as I have done till now. Therefore, I downloaded the arctic populations dataset from Rasmussen et al.

Combining Henn et al and Rasmussen et al with my previous datasets (HapMap, HGDP, SGVP, Behar et al and Xing et al), I got 3,970 samples with a total of 1,716,031 SNPs represented, though at 99% genotyping rate it gets reduced to about 27,000 SNPs.

I did not remove any populations or individuals except for any duplicates and non-founders.

Here's the information on the populations represented in this dataset.

Now I am on the lookout for more datasets that are public, have enough SNPs in common with this set and can easily be converted into the Plink PED format. So if you know of any, let me know. May be I will have the biggest and most diverse dataset with your help.

Singapore Indians

In the South Asian PCA plot, we saw that Singapore Indian samples from the SGVP dataset had a lot of diversity. Let's zoom into that plot so it's not dominated by the distinctiveness of the Kalash.

Eigenvector 1 explains 1.45 times the variation compared to eigenvector 2.

We see that Singapore Indians are spread in the whole region from Sindhis to North Kanaddi.

Now let's look at the individual admixture results (at K=12 ancestral populations) for the Singapore Indians. I have added some South Asian reference population averages so you can place them in context.

You can click on the legend to the right of the bar chart to sort by different ancestral components.

From these results, a majority of the Singapore Indian samples look South Indian but there are definitely a few from the northwest of the subcontinent (Punjabis or Sindhis?) There are also a few who could be from the Hindi belt.

There are 2-3 samples who have a significant amount of Southeast Asian. Could they be originally from Bengal? Or could they have partial Singapore Malay ancestry?

Reference Dataset II

Combining my reference population with Xing et al data gets me 3,222 3,161 samples but with only about 23,000 SNPs after LD-pruning.

The good thing is that this dataset has 544 South Asian samples from 24 ethnic groups. So it'll be useful for some analyses despite the low number of SNPs. I'll try to run parallel analyses on my reference population and this dataset so we can compare the pros and cons of both.

UPDATE: I removed 61 pygmy and San samples.

Admixture: Reference Population

For regular admixture analysis, I am using HapMap, HGDP, SGVP and Behar datasets with some samples removed as I wrote earlier.

For each of these datasets,

  1. I first filtered to keep only the list of SNPs present in 23andme v2 chip.
    plink --bfile data --extract 23andmev2.snplist
  2. I also filtered for founders:
    plink --bfile data --filter-founders
  3. And excluded SNPs with missing rates greater than 1%:
    plink --bfile data --geno 0.01

Then, I merged the datasets one by one. The reason for doing it one by one was that there were conflicts of strand orientation (forward or reverse) between the different datasets. If the merge operation gave an error, I had to flip those strands in one dataset and try the merge again.

plink --bfile data1 --bmerge data2.bed data2.bim data2.fam --make-bed
plink --bfile data2 --flip plink.missnp --make-bed --out data2flip
plink --bfile data1 --bmerge data2flip.bed data2flip.bim data2flip.fam --make-bed

Once all the four datasets were merged, I processed the combined data file:

  1. Removed SNPs with a missing rate of more than 1% in the combined dataset
    plink --bfile data --geno 0.01
  2. Then i performed linkage disequilibrium based pruning using a window size of 50, a step of 5 and r^2 threshold of 0.3:
    plink --bfile data --indep-pairwise 50 5 0.3
    plink --bfile data --extract plink.prune.in --make-bed

This gave me a reference population of 2,693 2,654 individuals with each sample having about 186,000 SNPs. Out of these 2,693 2,654 individuals, we have a total of 398 South Asians belonging to 16 ethnic groups.

Finally, it's time to start having some fun!

UPDATE: I removed 39 Pygmy and San samples because they were causing some trouble with African ancestral components. Since we are not interested in detailed African ancestry and African admixture among South Asians is not likely to be pygmy or San, I decided it would be best to remove them.


SGVP is the Singapore Genome Variation Project. It sampled the following groups:

Ethnicity Sample Count SNP Count
Singapore Chinese 96 1,405,417
Singapore Malay 89 1,402,256
Singapore Indian 83 1,404,699

Singapore Indians are generally likely to be South Indians, especially Tamils.

These 268 samples were easy to convert to Plink format