Category Archives: Clusters - Page 2

Reference 3 PCA Clustering for South Asians

Using the first 32 dimensions of the Reference 3 PCA, I tried to classify the 51 South Asian populations. I did not try a full clustering on all populations because it took too long and there seemed to be more than 150 clusters.

You can see the South Asians on 3-D PCA plots of the first four principal components.

The clustering results from Mclust are in a spreadsheet.

PS. I used 32 eigenvectors as that's what gave me the maximum number of clusters with a small number of outliers.

Reference 3 Population Concordance

Dienekes had come up with a population concordance ratio which compared the IBS similarity percentages of a trio of individuals to compute the probability that two individuals from population A are more similar to each other than either is to any individual in population B.

Please note that:

If two populations can be perfectly distinguished, then their population concordance ratio is 1. If however we randomly divide a set of individuals into two populations and try to calculate the population concordance ratio, we'll find it to be 0.25. It is possible for this ratio to be as low as zero.

If the concordance ratio between two populations is low, that does not necessarily mean that they are very similar. It's possible that a population does not form a tight cluster and has a lot of variation and thus is not distinguishable from another.

Now, here's the spreadsheet for the concordance ratios. You can focus on the South Asian population pairs here.

Harappa Nearest IBS Neighbors

After a long tease, here is the spreadsheet containing the top 500 nearest neighbors (using IBS similarity percentages) for the Harappa participants from HRP0001 to HRP0089.
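For anyone who wants to see what "nearest IBS neighbors" means mechanically, here is a minimal Python sketch. It is not the code used to build the spreadsheet; the function name `nearest_ibs_neighbors`, the nested-dictionary layout, and the similarity values are all made up for illustration.

```python
def nearest_ibs_neighbors(ibs, target, k=20):
    """Return the k individuals most similar to `target`, most similar first.

    `ibs` is assumed to map each ID to a dict of {other_id: similarity percent}.
    """
    # Drop the self-comparison, then rank everyone else by similarity.
    others = [(other, sim) for other, sim in ibs[target].items() if other != target]
    others.sort(key=lambda pair: pair[1], reverse=True)
    return others[:k]


# Toy, made-up similarity percentages (not real project data).
toy_ibs = {
    "HRP0001": {"HRP0001": 100.0, "REF_A1": 74.2, "REF_B1": 73.1, "REF_A2": 74.5},
}

top2 = nearest_ibs_neighbors(toy_ibs, "HRP0001", k=2)
print(top2)
```

Changing `k` changes how many neighbors are listed, which is the same idea as asking for the top 20 or top 50 neighbors.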

I am also providing an R data object with the same data (except that it contains all 3,975 individuals from Reference 3 and Harappa). To use this data,

  1. Download R
  2. Install R on your computer
  3. When you start R, type

    to load the data

  4. Type

    to find the 20 closest IBS neighbors of HRP0001. You can use any of the Harappa IDs here.

  5. You can set the number of IBS neighbors (50, for example) to show using


Harappa Reference 2 IBS Concordance

Vasishta asked:

would it be possible to repeat the same exercise with the Reference II populations? These results seem to be far more plausible for every participant as compared to the previous ones.

Since it took only a few minutes, I calculated the scores as detailed in a previous post from the IBS measures between Harappa participants (1-80 only) and Reference 2.

The spreadsheet is here.

Harappa Ref3 Admixture Dendrograms

Now that we have the admixture results for project participants using Reference 3, let's take a look at a tree based on Euclidean distance of the admixture proportions for each participant.
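As a rough illustration of the distance behind such a tree, here is a small Python sketch of Euclidean distance between admixture profiles. The IDs and the three-component proportions are hypothetical, and the helper names are mine; the actual dendrogram was not necessarily produced this way.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two admixture-proportion vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def distance_matrix(profiles):
    """Pairwise distances between all profiles, in sorted-ID order."""
    ids = sorted(profiles)
    return ids, [[euclidean(profiles[a], profiles[b]) for b in ids] for a in ids]


# Hypothetical proportions; each row sums to 1.
toy_profiles = {
    "P1": [0.60, 0.30, 0.10],
    "P2": [0.55, 0.35, 0.10],
    "P3": [0.10, 0.20, 0.70],
}

ids, dist = distance_matrix(toy_profiles)
for row_id, row in zip(ids, dist):
    print(row_id, [round(d, 3) for d in row])
```

A hierarchical clustering routine would take a matrix like `dist` as input; P1 and P2 would join first here because their profiles are nearly identical.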

Compare it to the earlier one with reference 1 admixture results.

And here is a dendrogram combining the average reference population results with the Harappa participants.

Harappa Reference Population Similarity

I was not satisfied with the median IBS with reference populations method for checking how similar you are to different populations. So I took inspiration from Dienekes' population concordance ratio to compute another measure.

Let's say we have a Harappa participant h and we want to compare h to a reference population A. We can then divide our reference dataset into the in-group A and the out-group A' (which consists of everyone not in A). Now, for every individual a belonging to group A and every individual a' belonging to group A', we can compare the IBS similarities and score each pair as:

score(h, a, a') = 1 if s(h,a) > s(h,a') and s(h,a) > s(a,a'), otherwise 0

The condition in this equation is true when Harappa participant h is more similar to individual a in population A than to individual a', who is not in population A, and h and a are also more similar to each other than a is to a'.

We can then sum these values over all pairs (a, a') drawn from A and A' and divide by the number of pairs, |A| × |A'|.

This score tells us how similar h is to population A compared to all the reference samples not in population A and varies from 0 (most dissimilar) to 1 (most similar).
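Here is a minimal Python sketch of the score as described in words above. The strict "greater than" comparisons, the nested-dictionary layout, and the toy similarity values are my assumptions for illustration; this is not the code that produced the spreadsheet.

```python
def make_symmetric(pairs):
    """Build a nested dict so that sim[x][y] == sim[y][x]."""
    sim = {}
    for (x, y), value in pairs.items():
        sim.setdefault(x, {})[y] = value
        sim.setdefault(y, {})[x] = value
    return sim


def similarity_score(sim, h, in_group, out_group):
    """Fraction of (a, a') pairs where h is closer to a than to a',
    and h and a are closer to each other than a is to a'."""
    hits = 0
    for a in in_group:
        for a_out in out_group:
            if sim[h][a] > sim[h][a_out] and sim[h][a] > sim[a][a_out]:
                hits += 1
    return hits / (len(in_group) * len(out_group))


# Made-up IBS similarities: a1, a2 are in population A; b1, b2 are not.
toy = make_symmetric({
    ("h", "a1"): 0.80, ("h", "a2"): 0.78,
    ("h", "b1"): 0.70, ("h", "b2"): 0.79,
    ("a1", "b1"): 0.71, ("a1", "b2"): 0.72,
    ("a2", "b1"): 0.69, ("a2", "b2"): 0.75,
})

score = similarity_score(toy, "h", ["a1", "a2"], ["b1", "b2"])
print(score)
```

In this toy case three of the four pairs satisfy the condition; the pair (a2, b2) fails because h is slightly more similar to b2 than to a2.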

Let's see how the Harappa participants HRP0001 to HRP0089 score with the different reference 3 populations.

Go to the spreadsheet and click on your Harappa ID to sort the populations by your similarity score with them (click two times if you want to sort in decreasing order which I like better).

The first sheet Sheet1 has all the populations. In the Filtered 1 sheet, I removed 13 African populations that had really low similarity scores with all participants and recomputed the scores.

In Filtered 2, I further removed 9 populations (East Africa, America, Oceania) with low scores for everyone.

In Filtered 3, another 40 populations with low scores with at least 88 (out of 89) Harappa participants were removed. The reason I removed populations and recomputed is that this made the out-group not as different from the in-group as it was before. So we can check if this algorithm can provide us with some meaningful difference in scores with close populations.

In Filtered 4, another 25 populations were removed making it more South Asian centered.

Finally, I used the 68 unmixed South Asian Harappa participants and did a South Asian specific run (though I cheated a bit and kept myself HRP0001 and my sister HRP0035 in). The most interesting thing here is the really high score the Patel Gujaratis get with the Gujarati-A reference population.

Reference 3 K=11 Admixture Dendrogram

Laredo asked:

Is it possible for you to create an unrooted similarity tree of all the populations in your “Reference 3” dataset?

So here's a dendrogram of the average K=11 admixture results for the reference 3 populations.

Harappa Median IBS with Reference 3

You guys didn't like it the last time I did this and you are not going to like this either, but while I am thinking of solutions for posting closest individual IBS neighbors, here's another go at which reference populations have the best median IBS matches with you.

I used Reference 3 with about 100,000 SNPs for this IBS run.

Go to the spreadsheet and click on your ID in the column headers to sort by your similarity to the different reference populations.

UPDATE: I have added a transposed spreadsheet too, at Onur's request, so that you can sort the Harappa participants by their scores with a specific reference population.

Harappa Genome Similarity MDS/Dendrogram

I computed the IBS similarity matrix for the Harappa participants HRP0001 to HRP0080 over 500,000 SNPs. This is exactly the same thing as the genome-wide gene comparison at 23andme.

Then, I converted the similarity matrix to a dissimilarity/distance matrix with the standard formula:

d_ij = sqrt(2 - 2 * s_ij)

where s_ij is the similarity between individuals i and j and d_ij is the distance/dissimilarity between the two.
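The conversion can be sketched in a few lines of Python. I am assuming here that the similarities are expressed as fractions between 0 and 1 (so a percentage would be divided by 100 first); the function name and the toy matrix are made up for illustration.

```python
import math


def ibs_to_distance(similarity):
    """Convert a square similarity matrix to distances via sqrt(2 - 2 * s)."""
    n = len(similarity)
    return [
        # max(0.0, ...) guards against tiny negative values from rounding.
        [math.sqrt(max(0.0, 2.0 - 2.0 * similarity[i][j])) for j in range(n)]
        for i in range(n)
    ]


# Toy similarity matrix (fractions, not real data); the diagonal is 1.
toy_similarity = [
    [1.00, 0.75, 0.50],
    [0.75, 1.00, 0.68],
    [0.50, 0.68, 1.00],
]

toy_distance = ibs_to_distance(toy_similarity)
for row in toy_distance:
    print([round(d, 3) for d in row])
```

Identical individuals (similarity 1) end up at distance 0, and the distance grows as the similarity drops.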

Using the dissimilarity matrix, I classified all the participants (excluding close relatives) using hierarchical clustering with complete linkage. You can see the dendrogram below.

Then I used the same dissimilarity matrix to calculate 6-dimensional MDS. You can see the MDS plots below. The numbers on the plots are your Harappa IDs.

MDS Dimensions 1 & 2:

MDS Dimensions 3 & 4:

MDS Dimensions 5 & 6:

As you can see, I (HRP0001) and my sister (HRP0035) are far away from the rest in the first four dimensions.

I'll let you guys speculate on what each dimension represents.

Now, why create an MDS this way instead of directly using Plink's MDS functionality? Well, I needed to check if I could do it using only the similarity matrix because that would be really useful for something else. Tune in to my other blog for more later this week.

Harappa Gene Similarity

I was looking at Simranjit's DNA Tribes results and I thought I could provide you guys a list of how similar (part of) your genome is to different reference populations in a somewhat similar way to DNA Tribes results.

Basically, I computed an IBS (identity by state) matrix for all Harappa participants from HRP0001 to HRP0080 and my Reference II samples (info). This is the same as the Genome-wide comparison feature at 23andme.

Then I took the median similarity percentage between you and a reference population group. I found that median worked better here than mean as the mean was affected a lot by some outlier samples in the reference data.
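To show why the median can behave better than the mean here, this is a small Python sketch with made-up numbers. The population labels, sample IDs, and similarity percentages are hypothetical, and the function is only an illustration of the idea, not the code used for the spreadsheet.

```python
from statistics import mean, median


def similarity_by_population(similarities, population_of, summary=median):
    """Summarize one participant's similarities per reference population."""
    grouped = {}
    for sample_id, sim in similarities.items():
        grouped.setdefault(population_of[sample_id], []).append(sim)
    return {pop: summary(values) for pop, values in grouped.items()}


# One participant's similarity (percent) to six reference samples.
# S4 is a deliberate outlier within PopA.
toy_similarities = {
    "S1": 74.0, "S2": 74.2, "S3": 74.1, "S4": 68.0,
    "T1": 73.0, "T2": 73.4,
}
toy_population_of = {
    "S1": "PopA", "S2": "PopA", "S3": "PopA", "S4": "PopA",
    "T1": "PopB", "T2": "PopB",
}

medians = similarity_by_population(toy_similarities, toy_population_of)
means = similarity_by_population(toy_similarities, toy_population_of, summary=mean)
print(medians)
print(means)
```

With the median, PopA still ranks above PopB; with the mean, the single outlier S4 drags PopA below PopB.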

Of course, since I am giving you a big discount compared to DNA Tribes, I am not doing a nice individual report. Instead, all you get is one spreadsheet including everyone. Click on your ID in the column headers to sort by your similarity to the different reference populations.

We see four outliers among the project participants who don't match any reference populations very well. One is HRP0074, a Brazilian, which is expected since I don't have any Native American populations. Then there are my sister (HRP0035) and me (HRP0001), which was already well known. Finally, there is HRP0044, a Kashmiri.

Do note that this analysis was done using about 20,000 SNPs.