Tag Archives: south asia - Page 4

Ref1 South Asians + Harappa PCA Clusters

Using the PCA results of the South Asians in Reference I as well as Harappa participants, I ran a couple of clustering algorithms.

First, I scaled the principal components by the respective eigenvalues.

Using Euclidean distance for hierarchical clustering with complete linkage, here's the dendrogram for the Harappa Project participants.

You can compare this to the Admixture-based dendrogram:

The most obvious thing is that I (HRP0001) am an outlier by far.

We inferred three major clusters with the admixture results. Those are intact, though changed a little.

I also ran MClust on the PCA data. The optimum number of clusters was 14. The resulting cluster assignments can be seen in a spreadsheet.

For the Harappa Project participants, the numbers give the probability of assignment to a cluster. For example, for HRP0009 there is a 72% of belonging to cluster 4. For the reference populations, the numbers give the expected number of samples assigned to a cluster.

Ref 1 South Asians + Harappa PCA

I ran PCA on the South Asian populations included in Reference I dataset (excluding Kalash and Hazara) as well as 38 South Asian participants of Harappa Project. I excluded Kalash and Hazara because they usually dominate a South Asian PCA plot being so distinct.

The reference populations included are: Balochi, Bnei Menashe Jews, Brahui, Burusho, Cochin Jews, Gujaratis (divided into two groups), Makrani, Malayan, North Kannadi, Paniya, Pathan, Sakilli, Sindhi, and Singapore Indians.

Here's the spreadsheet showing the eigenvalues and the first 15 principal components for each sample.

I computed the PCA using Eigensoft which removed 26 samples as outliers. The Tracy-Widom statistics show that about 30 eigenvectors are significant.

Here are the first 15 eigenvalues:

1 3.874124
2 1.819077
3 1.663232
4 1.335721
5 1.293500
6 1.242984
7 1.230921
8 1.225775
9 1.222177
10 1.214539
11 1.212808
12 1.204000
13 1.198930
14 1.195450
15 1.192848

Here is a 3-D PCA plot (hat tip: Doug McDonald) showing the first three eigenvectors. The plot is rotating about the 1st eigenvector which is vertical. Also, I have stretched the principal components based on the corresponding eigenvalues.

Now here are plots of the first 14 eigenvectors. In this case, I have not stretched the principal components, so keep in mind that the first eigenvector explains 3.874124/1.819077=2.13 times variation compared to the 2nd eigenvector.

UPDATE: At the bottom of the 3-D plot, you can see a dropdown. Just select one of the project participants from there and that participant's dot in the plot with become bigger so they are easy to spot.

South Asian PCA

I used Eigensoft to create a PCA plot of the South Asians in our Reference I dataset (a total of 398 samples) along with the first batch of South Asian Harappa Project participants (HRP0001 to HRP0009).

The PCA software removed 2 Makranis, 1 Sindhi, 1 Balochi and 1 Brahui as outliers, thus leaving us with 402 samples to perform a PCA on.

Here are the plots for the first four eigenvectors. Click to see bigger images.

South Asian PCA eig1 vs eig2

South Asian PCA eig1 vs eig3

South Asian PCA eig2 vs eig3

South Asian PCA eig1 vs eig4

South Asian PCA eig2 vs eig4

South Asian PCA eig3 vs eig4

If you have seen the South Asian plot at 23andme, the first plot here isn't very different except that it seems rotated.

UPDATE: Eigenvectors 1 through 4 explain 1.12%, 0.77%, 0.71% and 0.44% of the total variance.

Reference Dataset II

Combining my reference population with Xing et al data gets me 3,222 3,161 samples but with only about 23,000 SNPs after LD-pruning.

The good thing is that this dataset has 544 South Asian samples from 24 ethnic groups. So it'll be useful for some analyses despite the low number of SNPs. I'll try to run parallel analyses on my reference population and this dataset so we can compare the pros and cons of both.

UPDATE: I removed 61 pygmy and San samples.

Admixture: Reference Population

For regular admixture analysis, I am using HapMap, HGDP, SGVP and Behar datasets with some samples removed as I wrote earlier.

For each of these datasets,

  1. I first filtered to keep only the list of SNPs present in 23andme v2 chip.
    plink --bfile data --extract 23andmev2.snplist
  2. I also filtered for founders:
    plink --bfile data --filter-founders
  3. And excluded SNPs with missing rates greater than 1%:
    plink --bfile data --geno 0.01

Then, I merged the datasets one by one. The reason for doing it one by one was that there were conflicts of strand orientation (forward or reverse) between the different datasets. If the merge operation gave an error, I had to flip those strands in one dataset and try the merge again.

plink --bfile data1 --bmerge data2.bed data2.bim data2.fam --make-bed
plink --bfile data2 --flip plink.missnp --make-bed --out data2flip
plink --bfile data1 --bmerge data2flip.bed data2flip.bim data2flip.fam --make-bed

Once all the four datasets were merged, I processed the combined data file:

  1. Removed SNPs with a missing rate of more than 1% in the combined dataset
    plink --bfile data --geno 0.01
  2. Then i performed linkage disequilibrium based pruning using a window size of 50, a step of 5 and r^2 threshold of 0.3:
    plink --bfile data --indep-pairwise 50 5 0.3
    plink --bfile data --extract plink.prune.in --make-bed

This gave me a reference population of 2,693 2,654 individuals with each sample having about 186,000 SNPs. Out of these 2,693 2,654 individuals, we have a total of 398 South Asians belonging to 16 ethnic groups.

Finally, it's time to start having some fun!

UPDATE: I removed 39 Pygmy and San samples because they were causing some trouble with African ancestral components. Since we are not interested in detailed African ancestry and African admixture among South Asians is not likely to be pygmy or San, I decided it would be best to remove them.

Participants So Far

While I am analyzing the data, checking for errors and making sure the results I am getting are valid, here is some information about participants till now.

So far I have got 11 participants send me their raw data. Of these eleven, ten have some South Asian ancestry.

The regions/ethnicities they cover are:

  • Punjab
  • Bengal
  • Bihar
  • Tamil Nadu
  • Telegu
  • Anglo-Indian

Of these, Punjabis are the only ones I have multiple samples of. So I definitely need more samples of the other ethnicities. And there are lots of ethnicities/regions I haven't gotten any participants in.

It would be great for this project if we got a few participants from each state/province of India and Pakistan. So if you know someone who is from our target regions and has tested with 23andme, please spread the word.

If you tested with 23andme during their Christmas sale, I am hearing that results are going to start coming in starting today.