The dendrogram in the last post about Eurasian ChromoPainter/fineStructure analysis is a little hard to make sense of, so here is the same info in a better format.
First, the upper portion showing the relationship of the five branches:
Now, let's take a look at Branch1, which consists of South Asians:
Branch2 is European.
Branch3 is mostly the Near East and western Asia.
Branch4 is Inner Asia/Siberia.
And Branch5 is East Asian.
Note that each leaf label consists of an ethnicity followed by the number of individuals from that group who belong to that cluster. However, some of the labels are cut off in the images because they were too long.
Some months ago, I decided to run a big ChromoPainter analysis of the Eurasian samples I have. I removed from my dataset not only all Sub-Saharan Africans, but also North Africans and anyone else with more than 2% African admixture (which unfortunately included me).
Since the number of samples was still too large, I picked 25 random individuals from each non-South-Asian ethnicity while keeping all South Asians. I also tried to remove all close relatives and any samples with a high rate of missing genotypes.
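The subsampling step described above can be sketched roughly as follows. This is only an illustration, not the script actually used: the function name, the (sample_id, ethnicity) input format, and the fixed random seed are all assumptions made for the example.

```python
import random
from collections import defaultdict

def subsample_by_group(samples, keep_all, cap=25, seed=0):
    """Cap each ethnicity at `cap` random individuals.

    samples  : list of (sample_id, ethnicity) pairs (hypothetical format)
    keep_all : set of ethnicities that are kept in full (e.g. South Asian groups)
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_group = defaultdict(list)
    for sample_id, ethnicity in samples:
        by_group[ethnicity].append(sample_id)

    kept = []
    for ethnicity, ids in by_group.items():
        if ethnicity in keep_all or len(ids) <= cap:
            kept.extend(ids)                   # keep the whole group
        else:
            kept.extend(rng.sample(ids, cap))  # random draw without replacement
    return kept

# Made-up example: 40 French, 30 Punjabi, 10 Sardinian samples.
samples = (
    [("FR%02d" % i, "French") for i in range(40)]
    + [("PJ%02d" % i, "Punjabi") for i in range(30)]
    + [("SA%02d" % i, "Sardinian") for i in range(10)]
)
kept = subsample_by_group(samples, keep_all={"Punjabi"})
```

Groups that already have 25 or fewer individuals are left untouched, so small populations are not thinned out any further.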
In the end, I had 254,576 SNPs for 2,001 samples belonging to 197 ethnic groups.
I ran ShapeIT to phase their genomes and then ChromoPainter and fineStructure. The whole process took about 2 months.
Then I got busy and the results sat on my computer for more than a month.
Now let's look at the ChromoPainter/fineStructure analysis. Due to my time constraints, I am going to present the results over several posts.
Today, let's look at the fineStructure clustering run on the chunkcount output of ChromoPainter. It divided the individuals into 203 populations. Here's the spreadsheet containing the group and individual population clustering.
And here is the dendrogram showing the relationship of the clusters/populations computed by fineStructure.
I have posted the reference 3 K=11 admixture results for all populations and datasets. Here are the relevant links:
So let's try a dendrogram of all these populations' average admixture results. Instead of using regular Euclidean distance, I used some weighting based on Fst distances between admixture components, very similar to what Palisto did.
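To give an idea of what an Fst-weighted distance can look like, here is one possible construction. It is an assumption for illustration and not necessarily the weighting used for these trees or by Palisto: it treats the Fst between two admixture components as a squared distance between them, in which case the squared distance between two admixture vectors p and q is -1/2 (p - q)' F (p - q). The function name and the toy Fst matrix are made up.

```python
import numpy as np

def fst_weighted_distance(p, q, fst):
    """Distance between two admixture vectors, weighted by component Fst.

    Assumes fst[k][l] behaves like a squared distance between components
    k and l (zero diagonal, symmetric). Under that assumption the quadratic
    form below is the squared distance between the two mixtures.
    """
    diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    squared = -0.5 * diff @ np.asarray(fst, dtype=float) @ diff
    # Real Fst matrices are not exactly squared distances, so guard
    # against small negative values before taking the square root.
    return float(np.sqrt(max(squared, 0.0)))

# Toy Fst matrix: components 0 and 1 are close, component 2 is far from both.
fst = np.array([
    [0.00, 0.01, 0.20],
    [0.01, 0.00, 0.20],
    [0.20, 0.20, 0.00],
])
d_close = fst_weighted_distance([1, 0, 0], [0, 1, 0], fst)  # differ in close components
d_far = fst_weighted_distance([1, 0, 0], [0, 0, 1], fst)    # differ in distant components
```

With plain Euclidean distance both pairs in the example would be equally far apart. With the weighting, a difference between two closely related components counts for much less than a difference between two distant ones.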
Here's a dendrogram of all datasets using complete linkage.
Since the Pan-Asian dataset had only 5,400 SNPs common with reference 3, we need to be careful interpreting the tree above. Just to make sure, here's the dendrogram excluding Pan-Asian populations.
I want to use it to generate dendrograms including all Harappa participants and individual reference samples. So we are looking at more than 4,000 nodes on the tree.
I haven't done any admixture dendrograms in a while, so I thought you guys might be interested.
This is based on the Reference 3 admixture results. As usual, I used complete linkage for the hierarchical clustering.
Let's look at the dendrogram using the regular Euclidean distance between admixture results.
I also ran the clustering with a chi-squared distance measure.
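For anyone who wants to try the two distance measures, here is a minimal sketch using SciPy. The admixture table is made up, and the chi-squared distance shown is one common definition (there are several variants), so it may not match the exact formula behind the tree above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

def chi_squared_distance(x, y):
    """sqrt(sum_k (x_k - y_k)^2 / (x_k + y_k)), skipping components that are zero in both."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    total = x + y
    mask = total > 0  # avoid dividing by zero
    return float(np.sqrt(np.sum((x[mask] - y[mask]) ** 2 / total[mask])))

# Made-up admixture proportions: 4 individuals, 3 components, rows sum to 1.
admix = np.array([
    [0.50, 0.45, 0.05],
    [0.55, 0.45, 0.00],
    [0.10, 0.20, 0.70],
    [0.05, 0.25, 0.70],
])

# pdist returns the condensed distance vector that linkage expects.
euclid_tree = linkage(pdist(admix, metric="euclidean"), method="complete")
chisq_tree = linkage(pdist(admix, metric=chi_squared_distance), method="complete")
```

The practical difference is that chi-squared distance gives more weight to differences in small components. A 0% versus 5% difference contributes 0.05 to the sum under the square root, while a 50% versus 55% difference contributes only about 0.0024. Under Euclidean distance the two differences count the same.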
PS. Any thoughts on the trees based on two different distance measures?
I had run PCA on the Reference 3 dataset before. Now I included Harappa participants in it as well.
Here's the dendrogram based on the Euclidean distance between Harappa participants in their PCA results.
Since no one liked the IBS nearest-neighbor lists, I thought a spreadsheet with every participant's closest 100 neighbors in PCA space might be more fruitful.
Note that for the Harappa participants, the median distance to their nearest neighbor is 0.2064. So if your nearest neighbor is more than, let's say, 0.3 away, then you are not close to anyone.
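A neighbor list like this can be computed along the following lines. It is a rough sketch, not the code behind the spreadsheet: the function name and the toy coordinates are made up, and a real run would use each sample's PCA coordinates.

```python
import numpy as np

def nearest_neighbors(coords, k):
    """Return (indices, distances) of the k nearest neighbors of every row.

    coords: (n_samples, n_dimensions) array of PCA coordinates.
    Builds the full n x n distance matrix, which is fine for a few
    thousand samples but memory-hungry beyond that.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    order = np.argsort(dist, axis=1)[:, :k]
    return order, np.take_along_axis(dist, order, axis=1)

# Toy 2-D coordinates; the last point sits far from everyone else.
coords = [[0.0, 0.0], [0.0, 1.0], [0.0, 3.0], [5.0, 5.0]]
idx, dist = nearest_neighbors(coords, k=2)

median_nn = float(np.median(dist[:, 0]))  # median distance to the nearest neighbor
isolated = dist[:, 0] > 2 * median_nn     # arbitrary cutoff chosen for this toy data
```

The first column of `dist` is each sample's distance to its nearest neighbor, which is the quantity the 0.2064 median above refers to.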
Now that we have the admixture results for project participants using Reference 3, let's take a look at a tree based on Euclidean distance of the admixture proportions for each participant.
Compare it to the earlier one with reference 1 admixture results.
And here is a dendrogram combining the average reference population results with the Harappa participants.
Is it possible for you to create an unrooted similarity tree of all the populations in your “Reference 3” dataset?
So here's a dendrogram of the average K=11 admixture results for the reference 3 populations.
I computed the IBS similarity matrix for the Harappa participants HRP0001 to HRP0080 over 500,000 SNPs. This is exactly the same thing as the genome-wide gene comparison at 23andMe.
Then, I converted the similarity matrix to a dissimilarity/distance matrix with the standard formula:
d_ij = sqrt(2 - 2 * s_ij)
where s_ij is the similarity between individuals i and j and d_ij is the distance/dissimilarity between the two.
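The formula is the Euclidean distance between two unit-length vectors whose dot product equals the similarity, since |u - v|^2 = |u|^2 + |v|^2 - 2 u.v = 2 - 2 u.v. In code the conversion might look like this (a minimal sketch; the function name and the toy similarity values are made up):

```python
import numpy as np

def similarity_to_distance(sim):
    """Apply d = sqrt(2 - 2 * s) elementwise to a similarity matrix."""
    sim = np.asarray(sim, dtype=float)
    # Clip guards against tiny negative values caused by rounding.
    dist = np.sqrt(np.clip(2.0 - 2.0 * sim, 0.0, None))
    np.fill_diagonal(dist, 0.0)  # every individual is at distance 0 from itself
    return dist

# Toy similarity matrix for three individuals (values are made up).
sim = np.array([
    [1.00, 0.75, 0.50],
    [0.75, 1.00, 0.50],
    [0.50, 0.50, 1.00],
])
dist = similarity_to_distance(sim)
```

A similarity of 1 maps to a distance of 0, and lower similarities map to larger distances, so the ordering of pairs is preserved but reversed.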
Using the dissimilarity matrix, I classified all the participants (excluding close relatives) using hierarchical clustering with complete linkage. You can see the dendrogram below.
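Clustering from an already computed distance matrix can be sketched with SciPy as below. The tool actually used for the dendrogram is not stated here, so treat this as one possible way to do it; the function name and the toy matrix are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def complete_linkage_clusters(dist, n_clusters):
    """Complete-linkage clustering from a square distance matrix."""
    # linkage wants the condensed (upper-triangle) form; squareform also
    # checks that the matrix is symmetric with a zero diagonal.
    condensed = squareform(np.asarray(dist, dtype=float), checks=True)
    tree = linkage(condensed, method="complete")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return tree, labels

# Toy distances: individuals 0 and 1 are close, as are 2 and 3.
dist = np.array([
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.90, 0.20, 0.00],
])
tree, labels = complete_linkage_clusters(dist, n_clusters=2)
```

With complete linkage, the height at which two clusters merge is the largest distance between any member of one and any member of the other. The returned `tree` can be passed to `scipy.cluster.hierarchy.dendrogram` for plotting.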
Then I used the same dissimilarity matrix to calculate 6-dimensional MDS. You can see the MDS plots below. The numbers on the plots are your Harappa IDs.
MDS Dimensions 1 & 2:
MDS Dimensions 3 & 4:
MDS Dimensions 5 & 6:
As you can see, I (HRP0001) and my sister (HRP0035) are far away in the first four dimensions.
I'll let you guys speculate on what each dimension represents.
Now why create an MDS this way instead of directly using Plink's MDS functionality? Well, I needed to check if I could do it using only the similarity matrix because that would be really useful for something else. Tune in to my other blog for more later this week.
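For reference, here is how MDS coordinates can be obtained from nothing but a distance matrix using classical (Torgerson) MDS. Which MDS algorithm was used for the plots above is not stated, so this is only a sketch of one standard approach; the function name and the toy triangle are made up.

```python
import numpy as np

def classical_mds(dist, n_dims):
    """Classical (Torgerson) MDS from a square distance matrix."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-center the squared distances to get a Gram (inner product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)       # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_dims]      # indices of the largest ones
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale                # one row of coordinates per individual

# Toy example: distances between the corners of a 3-4-5 right triangle.
dist = np.array([
    [0.0, 3.0, 4.0],
    [3.0, 0.0, 5.0],
    [4.0, 5.0, 0.0],
])
coords = classical_mds(dist, n_dims=2)
recovered = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1))
```

When the distances are exactly Euclidean, as in the toy triangle, the recovered coordinates reproduce them. With real IBS-based distances the first few dimensions only approximate the full matrix.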
It's time to update the Harappa admixture dendrogram, since we had only 40 participants the last time I created one.
Note that this is not a phylogeny. It just visualizes the closeness of your admixture results to others.
Also note that I am using Euclidean distance between admixture proportions, which has its problems, together with complete-linkage hierarchical clustering.