Category Archives: PCA - Page 2

Reference 3 PCA Clustering for South Asians

Posted by Zack on May 19, 2011 Comments Off

Using the first 32 dimensions of the Reference 3 PCA, I tried to classify the 51 South Asian populations. I did not try a full clustering on all populations because that took too long and there seemed to be more than 150 clusters.

You can see the South Asians on 3-D PCA plots of the first four principal components.

The clustering results from Mclust are in a spreadsheet.

PS. I used 32 eigenvectors as that's what gave me the maximum number of clusters with a small number of outliers.

Reference 3 South Asians PCA

Posted by Zack on May 16, 2011 Comments Off

Let's zoom into the PCA plots of Reference 3 (more here) and look at how the different South Asian populations line up.

First the 3-D plot of eigenvectors 1, 2 & 3 with principal component 1 being vertical (and axis of rotation).

See the animation here: http://www.harappadna.org/wp-content/uploads/2011/05/ref3_sa_pca.html

And now principal components 2, 3 & 4 (with the vertical axis of rotation being 2):

See the animation here: http://www.harappadna.org/wp-content/uploads/2011/05/ref3_sa_pca_2_3_4.html

Note that I performed PCA on the whole set of reference 3, so you are looking at the axes of variation of all populations, not just South Asians.

More Reference 3 PCA 3D Plots

Posted by Zack on May 14, 2011 Comments Off

As per Razib's request, here is the 3-D plot of principal components 1, 2 & 4 for reference 3.

See the animation here: http://www.harappadna.org/wp-content/uploads/2011/05/ref3_pca_1_2_4.html

And here are principal components 2, 3 & 4:

See the animation here: http://www.harappadna.org/wp-content/uploads/2011/05/ref3_pca_2_3_4.html

Reference 3 PCA

Posted by Zack on May 13, 2011 12 comments

Here's the Principal Component Analysis (PCA) of Reference 3 data.

First the 3-D plot of the first three eigenvectors. The plot is rotating about the 1st eigenvector which is vertical. Also, I have stretched the principal components based on the corresponding eigenvalues.

See the animation here: http://www.harappadna.org/wp-content/uploads/2011/05/ref3_pca.html

And now the plots of the first 24 principal components. Please note that the eigenvectors are not scaled by the corresponding eigenvalues in these plots (unlike the 3D plot).

Here are the first 24 eigenvalues (expressed as percentage of the sum of all eigenvalues):

6.417%
4.045%
0.746%
0.624%
0.336%
0.330%
0.296%
0.250%
0.218%
0.166%
0.140%
0.131%
0.119%
0.112%
0.108%
0.105%
0.098%
0.087%
0.086%
0.080%
0.075%
0.073%
0.073%
0.071%

Together, the first 24 eigenvectors explain 14.79% of the variation in the data.
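As a quick check of that total, here is a minimal Python sketch that adds up the percentages listed above. The helper name `cumulative` is just an illustrative choice, not something from Eigensoft.

```python
# Percentages of total variance for the first 24 eigenvalues,
# copied from the list above.
percentages = [
    6.417, 4.045, 0.746, 0.624, 0.336, 0.330, 0.296, 0.250,
    0.218, 0.166, 0.140, 0.131, 0.119, 0.112, 0.108, 0.105,
    0.098, 0.087, 0.086, 0.080, 0.075, 0.073, 0.073, 0.071,
]


def cumulative(values):
    """Return the running total of a list of numbers."""
    totals = []
    running = 0.0
    for value in values:
        running += value
        totals.append(running)
    return totals


cumulative_pct = cumulative(percentages)
print(f"First 2 components:  {cumulative_pct[1]:.2f}%")
print(f"First 24 components: {cumulative_pct[-1]:.2f}%")
```

The running total also shows how quickly the contributions flatten out: the first two components account for most of the 14.79%.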

According to the Tracy-Widom statistics from Eigensoft, the number of significant principal components is 118.

UPDATE: I thought the eigenvectors 2 & 4 looked interesting for South Asians so I plotted them together.

Ref 2 South Asians + Harappa PCA Clusters

Posted by Zack on March 30, 2011 1 comment

Using the fifteen principal components shown before, I tried to use MClust to cluster the 573 individuals.

This time, I ran NNclean first to find the outliers. NNclean flagged the following as outliers:

HGDP00104 HGDP00100 HGDP00119 HGDP00112 HGDP00118 HGDP00279 HGDP00060 HGDP00029 HGDP00076 HGDP00041 HGDP00146 HGDP00163 HGDP00234 HGDP00412 HGDP00090 HGDP00148 HGDP00165 HGDP00068 HGDP00134 HGDP00149 HGDP00052 HGDP00074 HGDP00098 HGDP00153 HGDP00173 HGDP00376 HGDP00143 HGDP00158 HGDP00145 HGDP00161 HGDP00151 HGDP00243 HGDP00139 HGDP00140 HGDP00177 HGDP00224 GSM536497 GSM536806 GSM536807 GSM536808 I16 I3 I5 SS231506 HRP0001

As you can see, I am included in this list.

Then I used this list of outliers to initialize "noise" in the MClust procedure. The final list of outliers is as follows:

HGDP00279 HGDP00029 HGDP00134 HGDP00151 GSM536806 GSM536807 GSM536808

These are 1 Kalash, 1 Brahui, 2 Makranis, and 3 Paniyas.

There are a bunch of interesting things in the results. For example, Pathans and Punjabis were mostly indistinguishable by this technique. But let me leave you with a caution: Some of these clusters are nice, tight ones and others are loose, long ones, so don't overread the results.

Ref 2 South Asians + Harappa PCA

Posted by Zack on March 30, 2011 2 comments

I ran PCA on the South Asian populations included in the Reference II dataset as well as 38 South Asian participants of the Harappa Project. This is sort of a complementary analysis to the Ref1 South Asian one, as this one includes Kalash, Hazara and the additional South Asian groups in Xing et al.

The reference populations included are: Andhra Brahmin, Andhra Madiga, Andhra Mala, Balochi, Bnei Menashe Jews, Brahui, Burusho, Cochin Jews, Gujaratis, Gujaratis-B, Hazara, Irula, Kalash, Makrani, Malayan, Nepalese, North Kannadi, Paniya, Pathan, Punjabi Arain, Sakilli, Sindhi, Singapore Indians, Tamil Nadu Brahmin, and Tamil Nadu Dalit.

Here's the spreadsheet showing the eigenvalues and the first 15 principal components for each sample.

I computed the PCA using Eigensoft which removed 13 samples as outliers. The Tracy-Widom statistics show that about 25 eigenvectors are significant.

Here are the first 15 eigenvalues:

1	6.374483
2	3.650626
3	3.270121
4	2.999767
5	1.937818
6	1.713315
7	1.538295
8	1.503051
9	1.458331
10	1.448079
11	1.433288
12	1.414678
13	1.408943
14	1.390791
15	1.38101

Here is a 3-D PCA plot (hat tip: Doug McDonald) showing the first three eigenvectors. The plot is rotating about the 1st eigenvector, which is vertical, and I have stretched the principal components based on the corresponding eigenvalues. You can highlight the individual project participants in the plot by using the dropdown list below the plot.

See the animation here: http://www.harappadna.org/wp-content/uploads/2011/03/r2_sa_hrp_pca.html

Now here are plots of the first 14 eigenvectors. In this case, I have not stretched the principal components, so keep in mind that the first eigenvector explains about 1.75 times as much variation as the 2nd eigenvector.
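To see where the 1.75 figure comes from, here is a small Python sketch that divides the first eigenvalue by each of the first four eigenvalues from the table above. It only illustrates the arithmetic; it is not part of the Eigensoft output.

```python
# First four eigenvalues, copied from the table above.
eigenvalues = [6.374483, 3.650626, 3.270121, 2.999767]

# How many times more variation the 1st eigenvector explains
# than each of the listed eigenvectors.
relative = [eigenvalues[0] / ev for ev in eigenvalues]

for index, ratio in enumerate(relative, start=1):
    print(f"eigenvalue 1 / eigenvalue {index}: {ratio:.2f}")
```

When reading the unstretched plots, distances along the 2nd eigenvector are visually overstated by roughly this factor relative to distances along the 1st.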

Reference II PCA

Posted by Zack on March 26, 2011 2 comments

I ran PCA on the Reference II dataset which includes 3,161 samples from various populations but with only 23,000 SNPs in common.

Here are the top ten eigenvalues:

219.225396
146.835968
20.719760
9.721733
7.552482
6.216977
3.991663
3.484690
3.106919
2.805874

The first two eigenvalues are much bigger than the rest: the first explains 7.12% of the variation and the second 4.77%. Even so, the Tracy-Widom stats show that about 54 eigenvectors are significant.

Here are the plots for the first 10 principal components. Remember that the 1st eigenvalue is about 1.5 times the 2nd.

See the animation here: http://www.harappadna.org/wp-content/uploads/2011/03/ref2_pca.html

I also ran MClust on the PCA data and got 17 clusters. The results are in a spreadsheet. I am sure with more principal components than the 10 I used, I would be able to deduce finer population structure.

Do take a look at the clusters assigned to the South Asian populations from Xing et al.

Reference I PCA

Posted by Zack on March 24, 2011 5 comments

I ran PCA on the Reference I dataset which includes 2,654 samples from various populations.

Here are the top ten eigenvalues:

178.727040
118.884690
15.014072
9.346602
5.983225
5.140090
3.322723
2.739313
2.559640
2.475389

The first two eigenvalues are much bigger than the rest: the first explains 6.82% of the variation and the second 4.54%. Even so, the Tracy-Widom stats show that 70-something eigenvectors are significant.

Here are the plots for the first 10 principal components. Remember that the 1st eigenvalue is about 1.5 times the 2nd.

See the animation here: http://www.harappadna.org/wp-content/uploads/2011/03/ref1_pca.html

I also ran MClust on the PCA data and got 16 clusters. The results are in a spreadsheet. I am sure with more principal components than the 10 I used, I would be able to deduce finer population structure.

Note that African Americans cluster with East Africans in CL1. That's because African Americans have some European ancestry (20% on average) and that pulls them away from West Africans and towards Europeans. East Africans also lie in that direction, so they cluster together in a PCA. However, that doesn't mean that African Americans have East African ancestry. If you look at the Admixture results for African Americans, you see that their East African ancestry is negligible.
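The geometry behind this can be sketched with a toy example. PCA projection is linear, so an individual with mixed ancestry is expected to land roughly on the line between the two source populations, at a position proportional to the mix. The centroid coordinates below are entirely hypothetical and chosen only to illustrate the effect; they are not taken from the Reference I results.

```python
import math

# Hypothetical population centroids on the first two principal components.
# These numbers are made up for illustration only.
centroids = {
    "West African": (0.00, 0.00),
    "European": (1.00, 0.00),
    "East African": (0.25, 0.05),
}


def mix(a, b, fraction_b):
    """Point on the line from a to b, fraction_b of the way towards b."""
    return tuple((1 - fraction_b) * x + fraction_b * y for x, y in zip(a, b))


def nearest(point, centroids):
    """Name of the centroid closest to point (Euclidean distance)."""
    return min(centroids, key=lambda name: math.dist(point, centroids[name]))


# 80% West African and 20% European ancestry, as in the average quoted above.
admixed = mix(centroids["West African"], centroids["European"], 0.20)
closest = nearest(admixed, centroids)
print(admixed, "is closest to the", closest, "centroid")
```

With these made-up coordinates, the admixed point ends up nearer the East African centroid than the West African one, even though no East African ancestry went into the mix.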

Ref1 South Asians + Harappa PCA Clusters II

Posted by Zack on March 23, 2011 1 comment

Using the PCA data for Reference I South Asians plus project participants, Sriram computed a tree-based clustering called clique optimization. The result is in a PDF file. Take a look!

Thanks, Sriram!

Ref1 South Asians + Harappa PCA Clusters

Posted by Zack on March 19, 2011 4 comments

Using the PCA results of the South Asians in Reference I as well as Harappa participants, I ran a couple of clustering algorithms.

First, I scaled the principal components by the respective eigenvalues.

Using Euclidean distance for hierarchical clustering with complete linkage, here's the dendrogram for the Harappa Project participants.

You can compare this to the Admixture-based dendrogram:

The most obvious thing is that I (HRP0001) am an outlier by far.

We inferred three major clusters with the admixture results. Those are intact, though changed a little.
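For readers who want to see the mechanics of the scaling and complete-linkage steps described above, here is a minimal pure-Python sketch. It is not the code used for the dendrogram. The sample IDs, coordinates and eigenvalues are hypothetical, and it assumes "scaled by the eigenvalues" means multiplying each component by its eigenvalue (multiplying by the square root is another common convention).

```python
import math


def scale_by_eigenvalues(coords, eigenvalues):
    """Multiply each principal component by its eigenvalue (assumed convention)."""
    return [[c * ev for c, ev in zip(row, eigenvalues)] for row in coords]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage(points, labels):
    """Agglomerative clustering; returns merges as (left, right, distance)."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is the LARGEST
                # pairwise distance between their members.
                d = max(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append(
            ([labels[a] for a in clusters[i]], [labels[b] for b in clusters[j]], d)
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical samples with two principal components each.
labels = ["S1", "S2", "S3", "S4"]
coords = [[0.10, 0.02], [0.11, 0.03], [-0.20, 0.05], [-0.19, -0.40]]
eigenvalues = [6.0, 3.0]  # hypothetical eigenvalues

scaled = scale_by_eigenvalues(coords, eigenvalues)
merges = complete_linkage(scaled, labels)
for left, right, dist in merges:
    print(left, "+", right, f"at distance {dist:.3f}")
```

The merge list is what a dendrogram draws: each entry is one join, and the distance is the height of that join. A sample that only joins at the very last merge, at a large distance, would appear as an outlier.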

I also ran MClust on the PCA data. The optimum number of clusters was 14. The resulting cluster assignments can be seen in a spreadsheet.

For the Harappa Project participants, the numbers give the probability of assignment to a cluster. For example, for HRP0009 there is a 72% probability of belonging to cluster 4. For the reference populations, the numbers give the expected number of samples assigned to a cluster.
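One way to read the two kinds of numbers, sketched in Python: if each sample has a probability of belonging to each cluster, then summing those probabilities over the samples of a population gives the expected number of samples per cluster. This interpretation is an assumption on my part, and the sample names, population names and probabilities below are hypothetical rather than taken from the spreadsheet.

```python
# Hypothetical per-sample cluster probabilities (each row sums to 1).
posteriors = {
    "sample_1": [0.72, 0.28, 0.00],
    "sample_2": [0.10, 0.90, 0.00],
    "sample_3": [0.00, 0.05, 0.95],
}

# Hypothetical mapping from sample to population.
population_of = {
    "sample_1": "Pop A",
    "sample_2": "Pop A",
    "sample_3": "Pop B",
}


def expected_counts(posteriors, population_of):
    """Sum per-sample probabilities within each population, cluster by cluster."""
    totals = {}
    for sample, probs in posteriors.items():
        row = totals.setdefault(population_of[sample], [0.0] * len(probs))
        for k, p in enumerate(probs):
            row[k] += p
    return totals


counts = expected_counts(posteriors, population_of)
for population, row in counts.items():
    print(population, [round(value, 2) for value in row])
```

Each population's row adds up to its number of samples, so a value like 1.18 means "a little over one sample's worth" of that population falls in that cluster.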


Harappa Ancestry Project

Genetics and South Asia
