Tag Archives: dendrogram

Eurasian fineStructure Dendrograms

The dendrogram in the last post about Eurasian ChromoPainter/fineStructure analysis is a little hard to make sense of, so here is the same info in a better format.

First, the upper portion showing the relationship of the five branches:

Now, let's take a look at Branch1, which consists of South Asians:

Branch2 is European.

Branch3 is mostly the Near East and western Asia.

Branch4 is Inner Asia/Siberia.

And Branch5 is East Asian.

Note that each leaf label consists of an ethnicity followed by the number of individuals from that group who belong to that particular cluster. Some of the longer labels are cut off in the images.

Eurasian ChromoPainter Analysis

Some months ago, I decided to run a big ChromoPainter analysis of the Eurasian samples I have. I removed from my dataset not only all Sub-Saharan Africans, but also North Africans and anyone else with more than 2% African admixture (which unfortunately included me).

Since the number of samples was still too large, I picked 25 random individuals from each non-South-Asian ethnicity while keeping all South Asians. I also tried to remove all close relatives and those with a high missing genotyping rate.
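The subsampling step can be sketched roughly as follows. This is only an illustration, not the script actually used: the function name `subsample`, the fixed seed, and the toy sample IDs are all assumptions.

```python
import random
from collections import defaultdict

def subsample(samples, keep_all, per_group=25, seed=0):
    """Pick at most `per_group` random individuals per ethnicity.

    samples:  list of (sample_id, ethnicity) pairs
    keep_all: set of ethnicities kept in full (here, the South Asian ones)
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_group = defaultdict(list)
    for sample_id, ethnicity in samples:
        by_group[ethnicity].append(sample_id)

    chosen = []
    for ethnicity, ids in by_group.items():
        if ethnicity in keep_all or len(ids) <= per_group:
            chosen.extend(ids)  # keep everyone in this group
        else:
            chosen.extend(rng.sample(ids, per_group))
    return chosen

# Hypothetical toy data: 40 "French" and 30 "Punjabi" samples.
samples = [(f"FR{i}", "French") for i in range(40)]
samples += [(f"PJ{i}", "Punjabi") for i in range(30)]
picked = subsample(samples, keep_all={"Punjabi"})
```

Here all 30 of the toy "Punjabi" samples survive, while the 40 "French" samples are cut down to 25.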

In the end, I had 254,576 SNPs for 2,001 samples belonging to 197 ethnic groups.

I ran ShapeIT to phase their genomes and then ChromoPainter and fineStructure. The whole process took about 2 months.

Then I got busy and the results sat on my computer for more than a month.

Now let's look at the ChromoPainter/fineStructure analysis. Due to my time constraints, I am going to present the results in several posts.

Today, let's look at the fineStructure clustering run on the chunkcount output of ChromoPainter. It divided the individuals into 203 populations. Here's the spreadsheet containing the group and individual population clustering.

And here is the dendrogram showing the relationship of the clusters/populations computed by fineStructure.

UPDATE: Better dendrograms

Ref3 Admixture Dendrograms

I have posted the reference 3 K=11 admixture results for all populations and datasets.

So let's try a dendrogram of all these populations' average admixture results. Instead of using regular Euclidean distance, I used some weighting based on Fst distances between admixture components, very similar to what Palisto did.

Here's a dendrogram of all datasets using complete linkage.

Since the Pan-Asian dataset had only 5,400 SNPs common with reference 3, we need to be careful interpreting the tree above. Just to make sure, here's the dendrogram excluding Pan-Asian populations.

Interactive Tree Generation

Does anyone know of any software that generates an interactive JavaScript (or similar) tree/dendrogram for the web, i.e. one where branches can be expanded and collapsed and nodes can be searched?

I want to use it to generate dendrograms including all Harappa participants and individual reference samples. So we are looking at more than 4,000 nodes on the tree.

Admixture Ref3 Dendrogram HRP0001-HRP0160

I haven't done any admixture dendrograms in a while, so I thought you guys might be interested.

This uses admixture results using Reference 3. As usual, I used complete linkage for the hierarchical clustering.

Let's look at the dendrogram using regular Euclidean distance measure between admixture results.

I also decided to try a chi-squared distance measure for the clustering.
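To make the difference between the two measures concrete, here is a small sketch. Several definitions of "chi-squared distance" are in use, and the post does not say which one was applied, so the symmetric form below is an assumption, as are the three-component toy vectors.

```python
import math

def euclidean(p, q):
    """Plain Euclidean distance between two admixture vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def chi_squared(p, q):
    """Symmetric chi-squared distance: each squared difference is divided
    by the sum of the two proportions (one common definition, assumed here)."""
    total = 0.0
    for a, b in zip(p, q):
        if a + b > 0:  # components absent from both contribute nothing
            total += (a - b) ** 2 / (a + b)
    return math.sqrt(total)

# Hypothetical proportions for three individuals (K=3 for brevity).
x = [0.50, 0.50, 0.00]
y = [0.45, 0.50, 0.05]  # differs from x by 5% in a component x lacks
z = [0.55, 0.45, 0.00]  # differs from x by 5% between two large components

print(euclidean(x, y), euclidean(x, z))
print(chi_squared(x, y), chi_squared(x, z))
```

In this toy case y and z are equally far from x in Euclidean terms, but the chi-squared measure puts y much farther away, because a 5% difference in a component that is otherwise near zero is weighted more heavily than a 5% shift between two large components.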

PS. Any thoughts on the trees based on two different distance measures?

Reference 3 + HAP PCA

I had run PCA on the Reference 3 dataset before. Now I included Harappa participants in it as well.

Here's the dendrogram based on the Euclidean distance between Harappa participants in their PCA results.

Since no one liked the IBS nearest-neighbor lists, I thought making a spreadsheet with every participant's closest 100 neighbors in PCA space might be more fruitful.

Note that for the Harappa participants, the median distance to their nearest neighbor is 0.2064. So if your nearest neighbor is more than, let's say, 0.3 away, then you are not close to anyone.
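A nearest-neighbor list and the median nearest-neighbor distance can be computed along these lines. This is a sketch under assumptions: the function name, the two-dimensional toy coordinates, and the 1.5x-median cutoff are made up for illustration and are not the values or code behind the spreadsheet.

```python
import math
from statistics import median

def nearest_neighbors(coords, k=100):
    """coords maps an ID to its PCA coordinates.
    Returns, for each ID, up to k (distance, other_id) pairs, closest first."""
    result = {}
    for a, pa in coords.items():
        dists = sorted(
            (math.dist(pa, pb), b) for b, pb in coords.items() if b != a
        )
        result[a] = dists[:k]
    return result

# Hypothetical participants in a 2-dimensional PCA space.
coords = {"A": [0.0, 0.0], "B": [0.3, 0.4], "C": [0.3, 0.0], "D": [3.0, 4.0]}
nn = nearest_neighbors(coords, k=2)

# Median distance from each participant to their single nearest neighbor.
median_nn = median(neighbors[0][0] for neighbors in nn.values())

# Flag anyone whose nearest neighbor is well beyond the median
# (the 1.5x factor is an arbitrary choice for this example).
isolated = [pid for pid, neighbors in nn.items()
            if neighbors[0][0] > 1.5 * median_nn]
```

In the toy data, D sits far from everyone else and is the only ID flagged.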

Harappa Ref3 Admixture Dendrograms

Now that we have the admixture results for project participants using Reference 3, let's take a look at a tree based on Euclidean distance of the admixture proportions for each participant.

Compare it to the earlier one with reference 1 admixture results.

And here is a dendrogram combining the average reference population results with the Harappa participants.

Reference 3 K=11 Admixture Dendrogram

Laredo asked:

Is it possible for you to create an unrooted similarity tree of all the populations in your "Reference 3" dataset?

So here's a dendrogram of the average K=11 admixture results for the reference 3 populations.
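Averaging individual admixture results into one vector per population is a simple group-wise mean. A minimal sketch, assuming each individual is given as a (population, proportions) pair; the values below are invented and use K=2 rather than K=11 to keep the example short.

```python
from collections import defaultdict

def population_means(rows):
    """rows: iterable of (population, [admixture proportions]).
    Returns a dict mapping each population to its mean proportions."""
    sums = {}
    counts = defaultdict(int)
    for pop, props in rows:
        if pop not in sums:
            sums[pop] = [0.0] * len(props)
        for k, value in enumerate(props):
            sums[pop][k] += value
        counts[pop] += 1
    return {pop: [s / counts[pop] for s in total]
            for pop, total in sums.items()}

# Hypothetical individual results for two populations.
rows = [
    ("Pathan", [0.6, 0.4]),
    ("Pathan", [0.5, 0.5]),
    ("Sindhi", [0.7, 0.3]),
]
means = population_means(rows)
```

The resulting per-population vectors are what a distance measure and clustering step would then be applied to.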

Harappa Genome Similarity MDS/Dendrogram

I computed the IBS similarity matrix for the Harappa participants HRP0001 to HRP0080 over 500,000 SNPs. This is exactly the same thing as the genome-wide gene comparison at 23andMe.

Then, I converted the similarity matrix to a dissimilarity/distance matrix with the standard formula:

dij = sqrt(2 - 2 * sij)

where sij is the similarity between individuals i and j and dij is the distance/dissimilarity between the two.
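Applied to a whole matrix, the formula above looks like this. The similarity values are made up for illustration, and clamping at zero is an added safeguard against tiny negative values from floating-point rounding, not part of the formula itself.

```python
import math

def similarity_to_distance(sim):
    """Convert a similarity matrix (1.0 on the diagonal) to distances
    using d_ij = sqrt(2 - 2 * s_ij)."""
    n = len(sim)
    return [
        [math.sqrt(max(0.0, 2.0 - 2.0 * sim[i][j])) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical 3x3 similarity matrix.
sim = [
    [1.00, 0.50, 0.82],
    [0.50, 1.00, 0.68],
    [0.82, 0.68, 1.00],
]
dist = similarity_to_distance(sim)
```

A similarity of 1 maps to a distance of 0, and lower similarities map to larger distances; in the toy matrix, a similarity of 0.5 gives a distance of 1.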

Using the dissimilarity matrix, I classified all the participants (excluding close relatives) using hierarchical clustering with complete linkage. You can see the dendrogram below.
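For readers unfamiliar with complete linkage, here is a deliberately naive sketch of the idea: at each step the two closest clusters are merged, where the distance between two clusters is the distance between their farthest members. The function name and the toy matrix are assumptions; for real data a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="complete"` is the practical choice.

```python
def complete_linkage(dist, labels):
    """Agglomerative clustering with complete linkage.
    Returns the merges in order as (members_a, members_b, height)."""
    clusters = [[i] for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance = farthest pair.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((
            [labels[i] for i in clusters[a]],
            [labels[i] for i in clusters[b]],
            d,
        ))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical distance matrix for four participants.
labels = ["H1", "H2", "H3", "H4"]
dist_matrix = [
    [0.0, 0.1, 0.6, 0.7],
    [0.1, 0.0, 0.5, 0.9],
    [0.6, 0.5, 0.0, 0.2],
    [0.7, 0.9, 0.2, 0.0],
]
merges = complete_linkage(dist_matrix, labels)
for left, right, height in merges:
    print(left, right, height)
```

The merge heights are what the vertical axis of a dendrogram shows. In the toy data, H1 and H2 join first, then H3 and H4, and the two pairs join last at the largest pairwise distance between them.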

Then I used the same dissimilarity matrix to calculate 6-dimensional MDS. You can see the MDS plots below. The numbers on the plots are your Harappa IDs.

MDS Dimensions 1 & 2:

MDS Dimensions 3 & 4:

MDS Dimensions 5 & 6:

As you can see, I (HRP0001) and my sister (HRP0035) are far away in the first four dimensions.

I'll let you guys speculate on what each dimension represents.

Now why create an MDS this way instead of directly using Plink's MDS functionality? Well, I needed to check if I could do it using only the similarity matrix because that would be really useful for something else. Tune in on my other blog for more later this week.
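For the curious, computing MDS from nothing but a distance matrix can be sketched as below. This is classical (Torgerson) MDS written in pure Python for illustration, and it is an assumption that this variant matches what was run for the plots above. The eigenvectors are found by power iteration, which is fine for a toy matrix but slow for real data, where a routine such as `numpy.linalg.eigh` would be the usual choice.

```python
import math
import random

def classical_mds(dist, dims=2, iters=500, seed=0):
    """Classical MDS from a symmetric distance matrix.
    Returns one list of `dims` coordinates per row of `dist`."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in sq]
    grand_mean = sum(row_mean) / n
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration for the leading eigenvector of B.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigval = sum(v[i] * bv[i] for i in range(n))
        if eigval <= 1e-12:
            break  # no positive variance left for further dimensions
        for i in range(n):
            coords[i][k] = math.sqrt(eigval) * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords

# Toy check: four corners of a 4 x 2 rectangle.
points = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
dist_matrix = [[math.dist(p, q) for q in points] for p in points]
coords = classical_mds(dist_matrix, dims=2)
```

The recovered coordinates may be rotated or mirrored relative to the original points, but the pairwise distances between them match the input matrix. Setting `dims=6` would give six dimensions as in the plots above, provided the data has that many dimensions with positive variance.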

Harappa Admixture Dendrogram 1-80

It's time to update the Harappa admixture dendrogram since the last time I created one we had only 40 participants.

Note that this is not a phylogeny. It just visualizes the closeness of your admixture results to others.

Also note that I am using Euclidean distance between admixture proportions, which has its problems, together with complete-linkage hierarchical clustering.