Tag Archives: south asia - Page 3

Ref2 South Asian + Harappa Admixture

Using the reference II dataset of 548 South Asians and 38 Harappa Project South Asians that I have been working on, I ran Admixture.

The optimum number of ancestral components was 5-6. So I used K=6. The components are highest among the following groups:

C1: Brahui, Makrani, Balochi
C2: TN Dalit, North Kannadi
C3: Irula
C4: Gujaratis
C5: Hazara
C6: Kalash

I consider the Irulas, a Scheduled tribe from Tamil Nadu, to be problematic in a similar way to the Kalash except that the Irulas are well-scattered in their own space in the PCA plot.

Also, note that all the European, West Asian, etc. ancestry is being represented by C1. Similarly, all the East Asian ancestry is being collected in C5.

The spreadsheet showing the admixture results is here. The first sheet shows the individual results for the project participants.

The 2nd sheet shows the average (and standard deviation) for the reference populations.

The 3rd sheet shows the average and standard deviation for each cluster computed by MClust from PCA.

The 4th sheet shows the average and standard deviation for each cluster computed by MClust from MDS.

Also, take a look at the admixture percentage standard deviations. You'll notice that those are generally lower for the clusters compared to the population groups.
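The per-group averages and standard deviations in those sheets can be reproduced along these lines. This is a minimal sketch: the function name `summarize`, the input layout, and the toy percentages are assumptions made for illustration, not the spreadsheet's actual format or values.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize(samples):
    """Return {group: {component: (mean, sd)}} for admixture percentages."""
    by_group = defaultdict(list)
    for group, proportions in samples:
        by_group[group].append(proportions)

    summary = {}
    for group, rows in by_group.items():
        stats = {}
        for comp in rows[0]:
            values = [row[comp] for row in rows]
            # A single-member group has no sample standard deviation; report 0.
            sd = stdev(values) if len(values) > 1 else 0.0
            stats[comp] = (mean(values), sd)
        summary[group] = stats
    return summary


# Made-up percentages for illustration only; these are not project results.
toy = [
    ("GroupA", {"C1": 40.0, "C2": 60.0}),
    ("GroupA", {"C1": 50.0, "C2": 50.0}),
    ("GroupB", {"C1": 10.0, "C2": 90.0}),
]
result = summarize(toy)
print(result["GroupA"]["C1"])
```

Grouping the same samples by population label or by cluster label only changes the first element of each tuple, which makes it easy to compare the two sets of standard deviations.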

Ref 2 South Asians + Harappa MDS Clusters

Why do MDS clusters when we already did PCA-based clustering for this data?

You guys probably know about Dienekes' Clusters Galore approach. It works like this: for each number of retained MDS dimensions, you compute the number of clusters inferred by Mclust, and then you use the number of MDS dimensions that gives the maximum number of clusters.
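The selection rule itself is simple enough to sketch. In the code below, `best_dimension_count` is a hypothetical helper and `toy_count_clusters` is a crude stand-in for Mclust (it just counts distinct rounded points), so only the loop over dimensions reflects the approach described above; the coordinates are made up.

```python
def best_dimension_count(coords, dim_range, count_clusters):
    """Return (dims, clusters) for the dims value that yields the most clusters."""
    best = None
    for dims in dim_range:
        truncated = [row[:dims] for row in coords]  # keep the first `dims` dimensions
        k = count_clusters(truncated)
        if best is None or k > best[1]:  # ties keep the smaller dimension count
            best = (dims, k)
    return best


def toy_count_clusters(points):
    """Hypothetical stand-in for Mclust: count distinct rounded points."""
    return len({tuple(round(x) for x in p) for p in points})


# Made-up coordinates: four samples, three dimensions.
coords = [
    [0.0, 0.0, 0.1],
    [0.1, 0.0, 5.0],
    [4.0, 0.1, 0.0],
    [4.1, 5.0, 0.2],
]
best = best_dimension_count(coords, range(1, 4), toy_count_clusters)
print(best)
```

Because the rule takes a plain maximum, a single noisy cluster count can decide the outcome.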

This sounded a little unsatisfactory to me. So I ran an experiment. I computed 100 MDS dimensions for the samples in this dataset, which includes South Asians from Reference II as well as 38 Harappa participants. Then I kept 2, 3, 4, ..., 100 dimensions and ran NNclean (to get an initial noise/outlier estimate) and Mclust on them.

This first graph shows the number of outliers NNclean computed from 586 samples.

Things go crazy with NNclean when 64 or more MDS dimensions are retained since it considers most of the samples to be noise then.

Now let's look at the number of outliers identified after Mclust's clustering procedure.

This shows us that probably somewhere between 8 and 65 MDS dimensions might be useful to keep.

Finally, a plot of the number of clusters inferred by Mclust versus the number of MDS dimensions used.

There are two big jumps here to consider. One is around 12 MDS dimensions and the other after 52. So we are looking at an optimum number of MDS dimensions between 12 and 52. However, in that range, the number of clusters computed is fairly noisy between 18 and 26. The only pattern I can discern with some smoothed fitting is that we should likely be looking at somewhere between 20 and 30 MDS dimensions.

But why choose the maximum number of clusters (26 clusters when 24 MDS dimensions are kept)? That could be the result of noise too.

Is there some other way to figure out how many MDS dimensions are significant for population structure? It turns out there is. Patterson, Price and Reich proposed Tracy-Widom statistics for Principal Component Analysis in their paper "Population Structure and Eigenanalysis". We also know that the MDS analysis we are performing is the classical metric MDS, which is in some ways equivalent to a PCA. Looking at the Tracy-Widom stats then, we see that about 25 eigenvalues are significant. Thus, keeping 24 MDS dimensions to maximize the number of clusters seems defensible.
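The link between classical MDS and PCA can be checked on a toy example. The sketch below assumes plain Euclidean distances between a few made-up 2-D points, so it only illustrates the general identity and not the genotype-based distances used for the project data. Classical MDS double-centers the squared-distance matrix, and for Euclidean distances the result equals the Gram matrix of the mean-centered points, whose eigenvectors give the PCA scores. The helper names are hypothetical.

```python
def double_center(sq_dist):
    """Return B = -1/2 * J * D2 * J for a squared-distance matrix D2."""
    n = len(sq_dist)
    row_mean = [sum(r) / n for r in sq_dist]
    col_mean = [sum(sq_dist[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [[-0.5 * (sq_dist[i][j] - row_mean[i] - col_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]


def centered_gram(points):
    """Return the Gram matrix of the mean-centered points."""
    n, d = len(points), len(points[0])
    means = [sum(p[k] for p in points) / n for k in range(d)]
    centered = [[p[k] - means[k] for k in range(d)] for p in points]
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in centered]
            for ci in centered]


# Made-up points for illustration only.
points = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
sq_dist = [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
           for p in points]

B = double_center(sq_dist)
G = centered_gram(points)
max_diff = max(abs(B[i][j] - G[i][j]) for i in range(4) for j in range(4))
print(max_diff < 1e-9)
```

Since the two matrices coincide, they share eigenvalues, which is the reason a significance test designed for PCA eigenvalues can reasonably be borrowed for deciding how many MDS dimensions to keep.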

Finally, here are the clustering results.

Harappa Maps

Here are a couple more maps of the South Asian admixture component from Simranjit incorporating the latest Harappa results.

He's posted more maps at his blog.

Ref 2 South Asians + Harappa PCA Clusters

Using the fifteen principal components shown before, I tried to use MClust to cluster the 573 individuals.

This time, I ran NNclean first to find the outliers. NNclean pointed to the following as outliers:

HGDP00104 HGDP00100 HGDP00119 HGDP00112 HGDP00118 HGDP00279 HGDP00060 HGDP00029 HGDP00076 HGDP00041 HGDP00146 HGDP00163 HGDP00234 HGDP00412 HGDP00090 HGDP00148 HGDP00165 HGDP00068 HGDP00134 HGDP00149 HGDP00052 HGDP00074 HGDP00098 HGDP00153 HGDP00173 HGDP00376 HGDP00143 HGDP00158 HGDP00145 HGDP00161 HGDP00151 HGDP00243 HGDP00139 HGDP00140 HGDP00177 HGDP00224 GSM536497 GSM536806 GSM536807 GSM536808 I16 I3 I5 SS231506 HRP0001

As you can see, I am included in this list.

Then I used this list of outliers to initialize "noise" in the MClust procedure. The final list of outliers is as follows:

HGDP00279 HGDP00029 HGDP00134 HGDP00151 GSM536806 GSM536807 GSM536808

These are 1 Kalash, 1 Brahui, 2 Makranis, and 3 Paniyas.
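To see how much the clustering step trimmed the initial noise estimate, the two lists above can be compared directly. This is only a bookkeeping sketch over the sample IDs quoted in this post; the variable names are made up.

```python
# Outliers flagged by NNclean, copied from the list above.
initial = """
HGDP00104 HGDP00100 HGDP00119 HGDP00112 HGDP00118 HGDP00279 HGDP00060
HGDP00029 HGDP00076 HGDP00041 HGDP00146 HGDP00163 HGDP00234 HGDP00412
HGDP00090 HGDP00148 HGDP00165 HGDP00068 HGDP00134 HGDP00149 HGDP00052
HGDP00074 HGDP00098 HGDP00153 HGDP00173 HGDP00376 HGDP00143 HGDP00158
HGDP00145 HGDP00161 HGDP00151 HGDP00243 HGDP00139 HGDP00140 HGDP00177
HGDP00224 GSM536497 GSM536806 GSM536807 GSM536808 I16 I3 I5 SS231506
HRP0001
""".split()

# Outliers remaining after the clustering procedure, copied from the list above.
final = ("HGDP00279 HGDP00029 HGDP00134 HGDP00151 "
         "GSM536806 GSM536807 GSM536808").split()

initial_set, final_set = set(initial), set(final)
retained = [s for s in final if s in initial_set]      # still treated as noise
released = [s for s in initial if s not in final_set]  # absorbed into clusters

print(len(initial), len(retained), len(released))
```

Every sample in the final list was also in the NNclean list, so the clustering step only moved samples out of the noise group and did not add new ones.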

There are a bunch of interesting things in the results. For example, Pathans and Punjabis were mostly indistinguishable by this technique. But let me leave you with a caution: Some of these clusters are nice, tight ones and others are loose, long ones, so don't overread the results.

Ref 2 South Asians + Harappa PCA

I ran PCA on the South Asian populations included in Reference II dataset as well as 38 South Asian participants of Harappa Project. This is sort of a complementary analysis to the Ref1 South Asian one, as this one includes Kalash, Hazara and the additional South Asian groups in Xing et al.

The reference populations included are: Andhra Brahmin, Andhra Madiga, Andhra Mala, Balochi, Bnei Menashe Jews, Brahui, Burusho, Cochin Jews, Gujaratis, Gujaratis-B, Hazara, Irula, Kalash, Makrani, Malayan, Nepalese, North Kannadi, Paniya, Pathan, Punjabi Arain, Sakilli, Sindhi, Singapore Indians, Tamil Nadu Brahmin, and Tamil Nadu Dalit.

Here's the spreadsheet showing the eigenvalues and the first 15 principal components for each sample.

I computed the PCA using Eigensoft which removed 13 samples as outliers. The Tracy-Widom statistics show that about 25 eigenvectors are significant.

Here are the first 15 eigenvalues:

1 6.374483
2 3.650626
3 3.270121
4 2.999767
5 1.937818
6 1.713315
7 1.538295
8 1.503051
9 1.458331
10 1.448079
11 1.433288
12 1.414678
13 1.408943
14 1.390791
15 1.38101

Here is a 3-D PCA plot (hat tip: Doug McDonald) showing the first three eigenvectors. The plot is rotating about the 1st eigenvector, which is vertical. I have stretched the principal components based on the corresponding eigenvalues. You can also highlight the individual project participants in the plot by using the dropdown list below the plot.

Now here are plots of the first 14 eigenvectors. In this case, I have not stretched the principal components, so keep in mind that the first eigenvector explains about 1.75 times as much variation as the 2nd eigenvector.
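The 1.75 figure comes straight from the ratio of the first two eigenvalues listed above. The sketch below also shows one possible way to stretch components by their eigenvalues. Scaling each axis in proportion to its eigenvalue is an assumption here, since the exact scaling used for the 3-D plot is not stated, and `stretch` is a hypothetical helper.

```python
# First four eigenvalues from the list above.
eigenvalues = [6.374483, 3.650626, 3.270121, 2.999767]

# How much more variation the 1st eigenvector explains than the 2nd.
ratio = eigenvalues[0] / eigenvalues[1]
print(round(ratio, 2))


def stretch(pcs, eigenvalues):
    """Scale each component by its eigenvalue relative to the first one.

    Assumption: stretching is proportional to the eigenvalue itself;
    other conventions (for example the square root) are equally plausible.
    """
    top = eigenvalues[0]
    return [[x * ev / top for x, ev in zip(row, eigenvalues)] for row in pcs]


# A made-up sample with unit coordinates on two axes.
print(stretch([[1.0, 1.0]], [2.0, 1.0]))
```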

Ref1 South Asian + Harappa MDS MClust

Now I am going nuts on this dataset consisting of South Asians (minus Kalash and Hazara) from Reference I and some Harappa participants, but I promise this is the last item on this specific data. I will however do similar analyses some time after integrating all the new South Asian samples I have gotten (via project participation as well as from research data).

I ran MDS on the data in Plink and then, retaining various numbers of MDS dimensions, ran MClust on it. This is what Dienekes calls Clusters Galore.

Here are the plots of the MDS, two dimensions at a time.

The graph of number of MDS dimensions retained versus optimum number of clusters computed by Mclust is as follows:

The maximum number of clusters (28) is inferred with 8 MDS dimensions. So I posted the clustering results for 8 MDS dimensions + 28 clusters.

Some observations on the clusters:

  1. 56 of the 62 Gujaratis are in cluster CL1 and the remaining 6 are in CL5. Both are Gujarati-only clusters. Let's see where the Harappa Gujaratis fall next time I do this analysis.
  2. CL2 has an Andhra Reddy, Caribbean Indians, a Keralan, a few Gujaratis-B, and a third of the Singapore Indians.
  3. Gujaratis-B are a varied lot spread out into CL3, CL7, CL2, CL8, CL4, CL6, and CL15, but half are in CL3.
  4. CL6 has a lot of the South Indian Brahmins.
  5. Burusho are isolated.
  6. Punjabis from the project seem to be divided among CL7, CL8 and CL15.

I also posted the results for 20 MDS dimensions resulting in 21 clusters.

Ref1 South Asians + Harappa PCA Clusters II

Using the PCA data for Reference I South Asians plus project participants, Sriram computed a tree-based clustering called clique optimization. The result for that is a pdf file. Take a look!

Thanks, Sriram!

Ref1 South Asian + Harappa Admixture

Since I was working on this dataset consisting of South Asians (minus Kalash and Hazara) from Reference I and some Harappa participants, I thought I would run Admixture on it.

The optimum value for the number of ancestral populations K is 3 in this case. Roughly the three ancestral components correspond to South India, Balochistan and Gujarat.

The spreadsheet showing the admixture results is here. The first sheet shows the individual results for reference samples as well as project participants.

The 2nd sheet shows the average (and standard deviation) for the reference populations.

The 3rd sheet shows the average and standard deviation for each cluster computed by MClust. I included only the samples which had at least 90% probability of belonging to a cluster.
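The 90% cutoff amounts to a filter on each sample's cluster membership probabilities. A minimal sketch follows, assuming the probabilities are available as a dictionary per sample; the function name `confident_members` and the toy numbers are hypothetical.

```python
def confident_members(memberships, threshold=0.90):
    """Map each cluster to the samples assigned to it with probability >= threshold."""
    kept = {}
    for sample, probs in memberships.items():
        cluster = max(probs, key=probs.get)  # most probable cluster for this sample
        if probs[cluster] >= threshold:
            kept.setdefault(cluster, []).append(sample)
    return kept


# Made-up membership probabilities for illustration only.
toy = {
    "S1": {"CL1": 0.97, "CL2": 0.03},
    "S2": {"CL1": 0.55, "CL2": 0.45},
    "S3": {"CL1": 0.08, "CL2": 0.92},
}
members = confident_members(toy)
print(members)
```

Samples like `S2`, which sit between two clusters, are left out of the per-cluster averages and so do not inflate the standard deviations.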

Note how clusters CL8, CL9 and CL13 have a lot more variation than the others. Of course, I am in CL9 along with some fairly eclectic samples.

More Admixture Maps

Simranjit has sent more maps incorporating the latest admixture results.

C1 South Asian:

C2 Balochistan/Caucasus:

C6 European:

South Asian Map

Simranjit has another map:

I am working on improving the interpolation algorithm to take into account barriers such as oceans and even terrain features like mountain ranges. However, this process takes a long time.

Anyway, in the meantime, this is one I think the participants would be interested in. It has several things in it, including an isopleth layer for C1 - South Asian (12 gradations for more impact). It also has the other components (C1, C2, C3, C4, C5, C6, C8) represented in the form of pie charts. The base map is a topographic one this time.