
Admixture K=4, HRP0071-HRP0080

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results.

The interesting samples here are Gujarati (HRP0071), Bengali Brahmin (HRP0077) and Brazilian (HRP0074).

If you can't see the interactive bar chart above, here's a static image.

PS. This was run using Admixture version 1.04.

Ref2 South Asian + Harappa Admixture

Using the reference II dataset of 548 South Asians and 38 Harappa Project South Asians that I have been working on, I ran Admixture.

The optimum number of ancestral components was 5-6. So I used K=6. The components are highest among the following groups:

C1 Brahui, Makrani, Balochi
C2 TN Dalit, North Kannadi
C3 Irula
C4 Gujaratis
C5 Hazara
C6 Kalash

I consider the Irulas, a Scheduled tribe from Tamil Nadu, to be problematic in a similar way to the Kalash except that the Irulas are well-scattered in their own space in the PCA plot.

Also, note that all the European, West Asian, etc. ancestry is being represented by C1. Similarly, all the East Asian ancestry is being collected in C5.

The spreadsheet showing the admixture results is here. The first sheet shows the individual results for the project participants.

The 2nd sheet shows the average (and standard deviation) for the reference populations.

The 3rd sheet shows the average and standard deviation for each cluster computed by MClust from PCA.

The 4th sheet shows the average and standard deviation for each cluster computed by MClust from MDS.

Also, take a look at the admixture percentage standard deviations. You'll notice that those are generally lower for the clusters compared to the population groups.
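For anyone who wants to reproduce the kind of summary in sheets 2 to 4, here is a minimal sketch of per-group averages and standard deviations. The sample IDs, group names and fractions are made up for illustration, and I am assuming the sample standard deviation (n-1 denominator); the spreadsheet may have used the population version instead.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical rows: (sample id, group or cluster label, admixture fractions).
# None of these numbers come from the actual spreadsheet.
rows = [
    ("S1", "GroupA", [0.50, 0.30, 0.20]),
    ("S2", "GroupA", [0.60, 0.20, 0.20]),
    ("S3", "GroupA", [0.40, 0.40, 0.20]),
    ("S4", "GroupB", [0.10, 0.10, 0.80]),
    ("S5", "GroupB", [0.20, 0.10, 0.70]),
]

def summarize(rows):
    """Return {label: (means, standard deviations)} per admixture component."""
    grouped = defaultdict(list)
    for _sample, label, fractions in rows:
        grouped[label].append(fractions)
    summary = {}
    for label, members in grouped.items():
        columns = list(zip(*members))  # one tuple per component
        means = [mean(col) for col in columns]
        # stdev needs at least two values; report 0.0 for singleton groups
        sds = [stdev(col) if len(col) > 1 else 0.0 for col in columns]
        summary[label] = (means, sds)
    return summary

summary = summarize(rows)
for label, (means, sds) in summary.items():
    print(label, [round(m, 3) for m in means], [round(s, 3) for s in sds])
```

Swapping the group label for a cluster label gives the per-cluster version, which is what makes the two sets of standard deviations comparable.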

Supervised Continental Admixture

Ever since version 1.1 of Admixture, with its supervised option, came out almost two months ago, I have been salivating over it.

My original use case for it is not possible (for now). I wanted to be able to assign a few of the K ancestral components to specific reference populations and let the other ancestral components fall where they may. But we can do supervised admixture only by assigning all K ancestral components.

So I decided to test this supervised option by mimicking the three continental percentages 23andme assigns you on their ancestry painting page. Mine are:

Europe 91.22%
Asia 8.69%
Africa 0.09%

You can get the extra precision (and false sense of accuracy) here.

Regarding the reference populations used for ancestry painting, 23andme says:

23andMe takes advantage of publicly available data for four populations studied extensively via the International HapMap project (hapmap.org). That project obtained the genotypes for 60 individuals of western European descent from Utah, 60 western African individuals from Nigeria, and 90 eastern Asian individuals, 45 from each of Japan and China. Because the two eastern Asian populations are geographically near one another and relatively similar at the genetic level, 23andMe combines these to form a single eastern Asian reference population.

So I dug up my reference admixture run at K=3 and found the same number of samples of these HapMap populations by looking for those samples which had the highest percentage in the respective component.

Then I combined these 210 samples from the HapMap with 74 Harappa Project participants (HRP0001 to HRP0079, excluding 5 who are related to others).
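The selection step can be sketched roughly as follows: rank the samples by their share of a given component in the K=3 run and keep the top n. The function name, sample IDs and fractions are hypothetical; only the panel sizes (60, 60, 90) and the participant counts (79 minus 5) come from the text above.

```python
def top_n_for_component(fractions, component, n):
    """Return the n sample IDs with the highest share of one component.

    fractions: {sample_id: [share of component 0, 1, 2, ...]}
    """
    ranked = sorted(fractions, key=lambda s: fractions[s][component], reverse=True)
    return ranked[:n]

# Made-up K=3 results, columns ordered as (Europe, Asia, Africa).
toy = {
    "A": [0.99, 0.01, 0.00],
    "B": [0.95, 0.03, 0.02],
    "C": [0.02, 0.97, 0.01],
    "D": [0.10, 0.85, 0.05],
    "E": [0.01, 0.01, 0.98],
    "F": [0.60, 0.30, 0.10],
}

european_refs = top_n_for_component(toy, 0, 2)
asian_refs = top_n_for_component(toy, 1, 2)
african_refs = top_n_for_component(toy, 2, 1)
print(european_refs, asian_refs, african_refs)

# Sample counts quoted in the post.
reference_total = 60 + 60 + 90   # HapMap-style reference panel
participants = 79 - 5            # HRP0001 to HRP0079, minus 5 relatives
print(reference_total, participants, reference_total + participants)
```

In the real run n would be 60, 60 and 90 for the three components respectively.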

The results of the supervised admixture run are in a spreadsheet and also shown in a bar chart below.

Since I did run an unsupervised K=3 admixture analysis of the first Harappa batch with the whole reference I populations, you can compare these results to those.

Ref 2 South Asians + Harappa MDS Clusters

Why do MDS clusters when we already did PCA-based clustering for this data?

You guys probably know about Dienekes' Clusters Galore approach. It works like this: you vary the number of MDS dimensions retained, compute the number of clusters inferred (using Mclust) for each, and keep the number of MDS dimensions that gives you the maximum number of clusters.
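The selection rule itself is tiny once the cluster counts are in hand. Here is a sketch that assumes the clustering has already been run elsewhere (Mclust is an R package) and that the counts are stored per number of dimensions. The function name, the tie-breaking rule and the numbers are my own illustration, not Dienekes' code or real results.

```python
def pick_dimensions(cluster_counts):
    """Return the number of MDS dimensions that yields the most clusters.

    cluster_counts: {number of dimensions kept: number of clusters inferred}
    Ties are broken in favour of fewer dimensions (an assumption).
    """
    # max() keeps the first maximum it meets, so iterate in ascending order.
    return max(sorted(cluster_counts), key=lambda d: cluster_counts[d])

# Made-up counts for illustration only.
toy_counts = {2: 5, 3: 7, 4: 9, 5: 9, 6: 8}
best = pick_dimensions(toy_counts)
print(best, toy_counts[best])
```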

This sounded a little unsatisfactory to me, so I ran an experiment. I computed 100 MDS dimensions for the samples in this dataset, which includes South Asians from Reference II as well as 38 Harappa participants. Then I kept 2, 3, 4, ..., 100 dimensions and ran NNclean (to get an initial noise/outlier estimate) and Mclust on them.

This first graph shows the number of outliers NNclean computed from 586 samples.

Things go crazy with NNclean when 64 or more MDS dimensions are retained since it considers most of the samples to be noise then.

Now let's look at the number of outliers identified after Mclust's clustering procedure.

This shows us that probably somewhere between 8 and 65 MDS dimensions might be useful to keep.

Finally, a plot of the number of clusters inferred by Mclust versus the number of MDS dimensions used.

There are two big jumps here to consider. One is around 12 MDS dimensions and the other after 52. So we are looking at an optimum number of MDS dimensions between 12 and 52. However, in that range, the number of clusters computed fluctuates noisily between 18 and 26. The only pattern I can discern with some smoothed fitting is that we should likely be looking at somewhere between 20 and 30 MDS dimensions.

But why choose the maximum number of clusters (26 clusters when 24 MDS dimensions are kept)? That could be the result of noise too.

Is there some other way to figure out how many MDS dimensions are significant for population structure? It turns out there is. Patterson, Price and Reich proposed Tracy-Widom statistics for Principal Component Analysis in their paper "Population Structure and Eigenanalysis". We also know that the MDS analysis we are performing is classical metric MDS, which is in some ways equivalent to a PCA. Looking at the Tracy-Widom stats, then, we see that about 25 eigenvalues are significant. Thus, keeping 24 MDS dimensions to maximize the number of clusters seems defensible.
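For reference, here is the standard textbook argument for why classical MDS and PCA line up. It holds exactly only when the distances fed to MDS are Euclidean distances computed from the same (centered, identically scaled) genotype matrix that goes into the PCA. With another genetic distance measure, the correspondence is only approximate. The symbols below are my own notation.

```latex
% X: n x p data matrix with column means removed (n samples, p markers)
% D^{(2)}: n x n matrix of squared Euclidean distances between the rows of X
% k: number of dimensions kept
\begin{aligned}
J &= I_n - \tfrac{1}{n}\,\mathbf{1}\mathbf{1}^{\top}, \\
B &= -\tfrac{1}{2}\, J D^{(2)} J = X X^{\top}, \\
X &= U \Sigma V^{\top} \;\Longrightarrow\; B = U \Sigma^{2} U^{\top}, \\
Y_{\mathrm{MDS}} &= U_k \Sigma_k = X V_k = Y_{\mathrm{PCA}} .
\end{aligned}
```

So the eigenvalues of B are the squared singular values of X, which are the PCA eigenvalues up to a constant factor. Under those assumptions, that is why a significance test on PCA eigenvalues also says something about how many MDS dimensions carry structure.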

Finally, here are the clustering results.

Harappa Maps

Here are a couple more maps of the South Asian admixture component from Simranjit incorporating the latest Harappa results.

He's posted more maps at his blog.

Ref 2 South Asians + Harappa PCA Clusters

Using the fifteen principal components shown before, I tried to use MClust to cluster the 573 individuals.

This time, I ran NNclean first to find the outliers. NNclean pointed to the following as outliers:

HGDP00104 HGDP00100 HGDP00119 HGDP00112 HGDP00118 HGDP00279 HGDP00060 HGDP00029 HGDP00076 HGDP00041 HGDP00146 HGDP00163 HGDP00234 HGDP00412 HGDP00090 HGDP00148 HGDP00165 HGDP00068 HGDP00134 HGDP00149 HGDP00052 HGDP00074 HGDP00098 HGDP00153 HGDP00173 HGDP00376 HGDP00143 HGDP00158 HGDP00145 HGDP00161 HGDP00151 HGDP00243 HGDP00139 HGDP00140 HGDP00177 HGDP00224 GSM536497 GSM536806 GSM536807 GSM536808 I16 I3 I5 SS231506 HRP0001

As you can see, I am included in this list.

Then I used this list of outliers to initialize "noise" in the MClust procedure. The final list of outliers is as follows:

HGDP00279 HGDP00029 HGDP00134 HGDP00151 GSM536806 GSM536807 GSM536808

These are 1 Kalash, 1 Brahui, 2 Makranis, and 3 Paniyas.

There are a bunch of interesting things in the results. For example, Pathans and Punjabis were mostly indistinguishable by this technique. But let me leave you with a caution: Some of these clusters are nice, tight ones and others are loose, long ones, so don't overread the results.

Ref 2 South Asians + Harappa PCA

I ran PCA on the South Asian populations included in Reference II dataset as well as 38 South Asian participants of Harappa Project. This is sort of a complementary analysis to the Ref1 South Asian one, as this one includes Kalash, Hazara and the additional South Asian groups in Xing et al.

The reference populations included are: Andhra Brahmin, Andhra Madiga, Andhra Mala, Balochi, Bnei Menashe Jews, Brahui, Burusho, Cochin Jews, Gujaratis, Gujaratis-B, Hazara, Irula, Kalash, Makrani, Malayan, Nepalese, North Kannadi, Paniya, Pathan, Punjabi Arain, Sakilli, Sindhi, Singapore Indians, Tamil Nadu Brahmin, and Tamil Nadu Dalit.

Here's the spreadsheet showing the eigenvalues and the first 15 principal components for each sample.

I computed the PCA using Eigensoft which removed 13 samples as outliers. The Tracy-Widom statistics show that about 25 eigenvectors are significant.

Here are the first 15 eigenvalues:

1 6.374483
2 3.650626
3 3.270121
4 2.999767
5 1.937818
6 1.713315
7 1.538295
8 1.503051
9 1.458331
10 1.448079
11 1.433288
12 1.414678
13 1.408943
14 1.390791
15 1.38101

Here is a 3-D PCA plot (hat tip: Doug McDonald) showing the first three eigenvectors. The plot rotates about the 1st eigenvector, which is vertical. I have stretched the principal components based on the corresponding eigenvalues. You can highlight the individual project participants in the plot by using the dropdown list below the plot.

Now here are plots of the first 14 eigenvectors. In this case, I have not stretched the principal components, so keep in mind that the first eigenvector explains about 1.75 times as much variation as the 2nd eigenvector.
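The ratio comes straight from the eigenvalues listed earlier. A quick check, using the first five values copied from that list (the variable names are mine):

```python
# First five eigenvalues as listed in this post.
eigenvalues = [6.374483, 3.650626, 3.270121, 2.999767, 1.937818]

# How much more variation the 1st eigenvector explains than the 2nd.
ratio_1_to_2 = eigenvalues[0] / eigenvalues[1]
print(round(ratio_1_to_2, 2))

# The same comparison for every listed eigenvector against the 1st.
relative_to_first = [eigenvalues[0] / ev for ev in eigenvalues]
print([round(r, 2) for r in relative_to_first])
```

The same ratios are worth keeping in mind for the later eigenvectors, where an unstretched plot exaggerates their importance even more.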

Admixture K=12, HRP0061-HRP0070

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results.

If you can't see the interactive bar chart above, here's a static image.

I dare you to generalize!

PS. This was run using Admixture version 1.04.

Admixture K=4, HRP0061-HRP0070

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results.

The interesting samples here are the Gujarati and the Punjabi. HRP0064 is very different from the other Punjabis so far.

If you can't see the interactive bar chart above, here's a static image.

PS. This was run using Admixture version 1.04.

Iranians

Since we have 7 Iranians in the project, it's time to look at them as a group. We also have 19 Iranians from the Behar et al dataset.

Let's look at their admixture results at K=12.

The big difference between Harappa Project Iranians and Behar et al Iranians is African admixture. Only one Harappa Iranian (HRP0046) has 1% African admixture while three Behar Iranians have more than 10%.

Let's do hierarchical clustering with complete linkage using the Euclidean distance between admixture components. First a caveat or two. This is not a phylogeny. Also, the Euclidean distance measure is not a good one for measuring differences in admixture but I am not sure what would be better.
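To make the procedure concrete, here is a minimal pure-Python sketch of agglomerative clustering with complete linkage on Euclidean distances between admixture vectors. It is not the tool I used for the dendrogram. The sample names and fractions are invented, and a statistics package would do this faster.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two admixture vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage(samples):
    """Agglomerative clustering with complete linkage.

    samples: {sample_id: [admixture fractions]}
    Returns the merges in order as (cluster_a, cluster_b, distance), where
    each cluster is a tuple of sample IDs and the distance between two
    clusters is the LARGEST pairwise distance between their members.
    """
    clusters = [(name,) for name in samples]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            d = max(euclidean(samples[i], samples[j]) for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merges.append((a, b, d))
        # Replace the two merged clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(a + b)
    return merges

# Made-up K=3 admixture fractions for three hypothetical samples.
toy = {
    "S1": [0.70, 0.20, 0.10],
    "S2": [0.68, 0.22, 0.10],
    "S3": [0.10, 0.10, 0.80],
}

merges = complete_linkage(toy)
for a, b, d in merges:
    print(a, b, round(d, 4))
```

The merge order and merge distances are what a dendrogram draws, so the caveats above apply to this output too.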

HRP0010, who is an Assyrian, actually clusters better with Caucasian, Iranian and Iraqi Jews than with Iranians.

I'll run an MDS or PCA of the whole region from Punjab/Kashmir to the Levant and Caucasus soon which should be more interesting for clustering.

UPDATE: Since Palisto wondered, I checked and found that his admixture results (he is an Iraqi Kurd) are very similar to those of the Iranians. So I have included him (HRP0059).