Monthly Archives: March 2011 - Page 3

Admixture K=4, HRP0051-HRP0060

Posted by Zack on March 21, 2011 4 comments

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the Reference I admixture results.

The interesting samples here are the Gujaratis and the Kurd.

If you can't see the interactive bar chart above, here's a static image.

PS. This was run using Admixture version 1.04.

Austroasiatic Dataset

Posted by Zack on March 20, 2011 Comments Off

About 36 hours ago, Razib pointed me to the paper "Population Genetic Structure in Indian Austroasiatic speakers: The Role of Landscape Barriers and Sex-specific Admixture" by Gyaneshwer Chaubey, Mait Metspalu, Ying Choi, Reedik Mägi, Irene Gallego Romero, Pedro Soares, Mannis van Oven, Doron M. Behar, Siiri Rootsi, Georgi Hudjashov, Chandana Basu Mallick, Monika Karmin, Mari Nelis, Jüri Parik, Alla Goverdhana Reddy, Ene Metspalu, George van Driem, Yali Xue, Chris Tyler-Smith, Kumarasamy Thangaraj, Lalji Singh, Maido Remm, Martin B. Richards, Marta Mirazon Lahr, Manfred Kayser, Richard Villems and Toomas Kivisild. I have their dataset now.

I have been told that the data will hopefully be in the NCBI GEO database soon.

There are a total of 41 samples with 527,319 SNPs in the data. There are Bonda, Savara, Juang and Gadaba from Orissa; Santhal and Asur from Jharkhand; Kharia from Chhattisgarh; Ho from Bihar; Khasi and Garo from Meghalaya; and 15 Burmese.

PS. I have created a separate page for references where I link to the papers which led to the datasets I am using.

South Asian Map

Posted by Zack on March 19, 2011 Comments Off

Simranjit has another map:

I am working on improving the interpolation algorithm to take into account barriers such as oceans and even terrain features like mountain ranges. However, this process takes a long time.

Anyway, in the meantime, this is one I think the participants would be interested in. It has several things in it, including an isopleth layer for C1 - South Asian (12 gradations for more impact). It also has the other components (C1, C2, C3, C4, C5, C6, C8) represented in the form of pie charts. The base map is a topographic one this time.

Ref1 South Asians + Harappa PCA Clusters

Posted by Zack on March 19, 2011 4 comments

Using the PCA results of the South Asians in Reference I as well as Harappa participants, I ran a couple of clustering algorithms.

First, I scaled the principal components by the respective eigenvalues.

Using Euclidean distance for hierarchical clustering with complete linkage, here's the dendrogram for the Harappa Project participants.
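
The scaling and clustering steps above can be sketched roughly as follows. This is not the code used for these results: it is a minimal pure-Python illustration that assumes "scaling" means multiplying each principal component by its eigenvalue, and the sample labels and PC coordinates are hypothetical. Only the two eigenvalues are taken from the Reference I + Harappa PCA post.

```python
import math

def scale_components(pcs, eigenvalues):
    """Multiply each principal-component column by its eigenvalue (assumed scaling)."""
    return [[x * ev for x, ev in zip(row, eigenvalues)] for row in pcs]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage(points, labels):
    """Naive agglomerative clustering with complete linkage.

    Returns the merge history as (name_a, name_b, distance) tuples.
    """
    clusters = {i: [i] for i in range(len(points))}
    names = {i: labels[i] for i in range(len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                # Complete linkage: cluster distance is the distance
                # between the two farthest members.
                d = max(euclidean(points[p], points[q])
                        for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((names[a], names[b], d))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        names[next_id] = "(" + names[a] + "+" + names[b] + ")"
        next_id += 1
    return merges

# Hypothetical samples with made-up coordinates on the first two PCs.
labels = ["S1", "S2", "S3", "S4"]
pcs = [[0.10, 0.02], [0.11, 0.01], [-0.20, 0.05], [-0.19, 0.30]]
eigenvalues = [3.874124, 1.819077]

merges = complete_linkage(scale_components(pcs, eigenvalues), labels)
for name_a, name_b, dist in merges:
    print(f"merge {name_a} with {name_b} at distance {dist:.3f}")
```

The merge history is what a dendrogram draws: each merge is a join, and its distance is the height of that join.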

You can compare this to the Admixture-based dendrogram:

The most obvious thing is that I (HRP0001) am an outlier by far.

We inferred three major clusters with the admixture results. Those are intact, though changed a little.

I also ran MClust on the PCA data. The optimum number of clusters was 14. The resulting cluster assignments can be seen in a spreadsheet.

For the Harappa Project participants, the numbers give the probability of assignment to a cluster. For example, for HRP0009 there is a 72% probability of belonging to cluster 4. For the reference populations, the numbers give the expected number of samples assigned to a cluster.
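
To make the two kinds of numbers concrete, here is a small sketch. It assumes that the expected count for a reference population is simply the sum of its samples' membership probabilities for each cluster, which is the usual way to get an expected count from soft assignments. The probabilities below are invented for illustration and are not taken from the spreadsheet.

```python
def expected_counts(memberships):
    """Sum per-sample cluster probabilities into expected counts per cluster."""
    n_clusters = len(memberships[0])
    return [sum(row[k] for row in memberships) for k in range(n_clusters)]

# Hypothetical population of three samples and two clusters.
# Each row holds one sample's membership probabilities (summing to 1).
population = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.3, 0.7],
]

counts = expected_counts(population)
print([round(c, 2) for c in counts])
```

Under this reading, the expected counts across clusters add up to the number of samples in the population.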

Harappa Participant Admixture Maps

Posted by Zack on March 18, 2011 3 comments

Following maps of the reference populations, Simranjit has gone ahead and included the Harappa Project participants in these maps as well.

Here's what he said:

I'm now incorporating project participants into the maps. I had to drop admixed individuals, however, and I made some choices: I dropped the Bihari Kayastha and the Tamil Nadu non-Brahmin for now, as they differ a fair bit. Take note that we don't have reference samples for some countries, which can sometimes cause the interpolation to be off (e.g., the lack of Central Asian republics other than Uzbekistan is skewing Central Asia to be more South Asian than it really is).

These maps are based on K=12 admixture run.

The gradation is from Dark green (low) to Dark red (high) for most of them.

Basically the percentages for each Component are divided into 32 equal intervals, to create the contour effect. Take note that it represents relative values not absolute.
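
The binning described above could look something like this. It is a sketch under an assumption: "relative values" is taken to mean that the 32 intervals span the observed minimum to maximum of a component rather than 0% to 100%. The function name and the sample percentages are hypothetical.

```python
def contour_levels(values, n_levels=32):
    """Assign each value to one of n_levels equal-width bins between min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls in the lowest level.
        return [0 for _ in values]
    width = (hi - lo) / n_levels
    # The maximum value would land on the upper edge, so clamp it into the top bin.
    return [min(int((v - lo) / width), n_levels - 1) for v in values]

# Hypothetical component percentages at a few map locations.
percentages = [2.0, 15.5, 40.0, 63.0, 78.0]
print(contour_levels(percentages))
```

Level 0 would be drawn in dark green and the top level in dark red, so the same colour can mean different absolute percentages on different maps.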

C1 South Asian component:

C2 Balochistan/Caucasus component:

C6 European component:

Distance Measures

Posted by Zack on March 18, 2011 10 comments

Referring to the dendrogram computed from the admixture results of Harappa Project participants, Thorfinn asked a long time ago:

Interesting that South Indian/Cow Belt Brahmins cluster together; while Punjabi Brahmins are closer to Punjabis.

I can understand the first clustering, assuming that Southern Brahmin communities are a spinoff of northern communities and have maintained relative genetic isolation; and the source Northern Brahmin population differed in original origin from other Cow Belt populations.

But how do both Brahmin communities differ equally from Punjabi/Rajasthani Brahmins; and why is that community closer to other Punjabi populations?

In terms of admixture results, that is correct in the case of the project participants. Why this is the case, I have no idea.

However, there is an issue here that we have to consider and nsriram commented about it:

The euclidean distance doesn't seem to be the appropriate metric to capture the pairwise similarities. Once you make a commitment to the distance measure then the side effects carry-over into the tree construction.

What is a good distance measure to compute the similarity or dissimilarity of the admixture results of two people? Is the Euclidean distance a good one in this case? It certainly is the most common and the easiest to use I guess. So we usually default to it.

However, if we look at the Fst divergences of the ancestral components, we see that some pairs of components are much more divergent from each other than others. So a 5% difference in C1 might not be the same as a 5% difference in C10.

A solution might be to use a weighted distance, but how to weight it? The Fst numbers give pairwise distances for the different ancestral populations. If we are focused on a specific population (e.g., South Asians), we could try weighting by the Fst values between that component and the others. But I am not sure if that's a good solution either.
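
Here is a rough sketch of what such a weighting might look like next to the plain Euclidean distance. Everything in it is illustrative: the three-component admixture vectors and the Fst values are made up, and using the focal component's Fst row as per-component weights is just one possible reading of the idea, not an established method.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fst_weighted(a, b, fst_row):
    """Weight each component's squared difference by its Fst from the focal component."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, fst_row)))

# Hypothetical Fst between the focal component (C1) and C1, C2, C3.
fst_from_c1 = [0.0, 0.05, 0.15]

# Hypothetical admixture proportions for three people.
p = [0.60, 0.30, 0.10]
q = [0.55, 0.35, 0.10]  # 5% moved from C1 to C2 (close to C1)
r = [0.55, 0.30, 0.15]  # 5% moved from C1 to C3 (far from C1)

print(f"Euclidean: p-q {euclidean(p, q):.4f}, p-r {euclidean(p, r):.4f}")
print(f"Weighted:  p-q {fst_weighted(p, q, fst_from_c1):.4f}, "
      f"p-r {fst_weighted(p, r, fst_from_c1):.4f}")
```

The Euclidean distance treats both 5% shifts identically, while the weighted version counts the shift toward the more divergent component as larger. Note one weakness of this particular weighting: the focal component's own weight is zero, so differences in that component alone are ignored, which is part of why it may not be a good solution.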

In the end, a Euclidean distance measure gives us a rough idea of the differences between admixture results, but it should not be used to explain minor differences or to consider phylogenies.

Reich et al and Pan-Asian Datasets

Posted by Zack on March 17, 2011 16 comments

I got access to the Reich et al (Nature 2009) dataset used in their paper "Reconstructing Indian population history".

It has the following populations:

Aonaga	Aus	Bhil
Chenchu	Great_Andamanese	Hallaki
Kamsali	Kashmiri_Pandit	Kharia
Kurumba	Lodi	Madiga
Mala	Meghawal	Naidu
Nysha	Onge	Sahariya
Santhal	Satnami	Siddi
Somali	Srivastava	Tharu
Vaish	Velama	Vysya

There are 141 individuals with 587,753 SNPs in their dataset which conveniently is in PED format.

Also, Blaise pointed me to the Pan-Asian SNP data used in the Dec 2009 Science paper "Mapping Human Genetic Diversity in Asia".

It includes the following 71 populations:

Maya	Auca	Quechua	Karitiana	Pima
Ami	Atayal	Melanesians	Zhuang	Han_Cantonese
Hmong	Jiamao	Jinuo	Han_Shanghai	Uyghur
Wa	Alorese	Dayak	Javanese	Batak_Karo
Lamaholot	Lembata	Malay	Mentawai	Manggarai
Kambera	Sunda	Batak_Toba	Toraja	Andhra_Pradesh
Karnataka	Bengali-Assamese	Rajasthan	Uttaranchal	Uttar Pradesh
Haryana	Spiti	Bhili	Marathi	Japanese
Ryukyuan	Korean	Bidayuh	Jehai	Kelantan
Kensiu	Temuan	Ayta	Agta	Ati
Iraya	Minanubu	Mamanwa	Filipino	Singapore_Chinese
Singapore_Indian	Singapore_Malay	Hmong (Miao)	Karen	Lawa
Mlabri	Mon	Paluang	Plang	Tai_Khuen
Tai_Lue	H'tin	Tai_Yuan	Tai_Yong	Yao
Hakka	Minnan

It has 1,719 individuals with 54,794 SNPs. I wish it had more SNPs considering the wealth of populations.

Also, the Pan-Asian data is in the form of minor allele counts, so I need to convert that back to A/C/G/T. Since there are some HapMap populations included in the dataset, that shouldn't be too hard.
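
The per-genotype step of that conversion is simple once the two alleles of a SNP are known. The sketch below assumes the major and minor alleles have already been worked out for each SNP (for example, from the overlapping HapMap samples); the function name and the example alleles are hypothetical, and missing genotypes are written as "0 0" in the way PED files conventionally encode them.

```python
def count_to_genotype(minor_count, major, minor):
    """Turn a minor allele count (0, 1 or 2) into a pair of allele letters."""
    if minor_count is None:
        return ("0", "0")  # missing genotype, PED-style
    if minor_count not in (0, 1, 2):
        raise ValueError(f"unexpected minor allele count: {minor_count}")
    return (major,) * (2 - minor_count) + (minor,) * minor_count

# Hypothetical SNP whose major allele is A and minor allele is G.
for count in (0, 1, 2, None):
    print(count, " ".join(count_to_genotype(count, "A", "G")))
```

The harder part is the allele lookup itself, including making sure the strand matches the other datasets, which this sketch does not attempt.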

I am going to incorporate both of these datasets into my big reference set.

Harappa Participants on 3-D PCA

Posted by Zack on March 17, 2011 Comments Off

sv wanted to see where he was on the South Asian 3-D PCA plot, so I obliged.

It's a quick and dirty method, but you should see a dropdown select box under the 3-D plot. Just select one of the participant IDs from there and that person's dot on the 3-D plot should increase in size so that it's easier to spot.

Isopleths

Posted by Zack on March 17, 2011 14 comments

Simranjit has done a great job of creating some maps showing the distribution of the various ancestral components at K=16. He has posted them on DNA Forums and sent them to me.

The gradation is from Dark green (low) to Dark red (high) for most of them.

Basically the percentages for each Component are divided into 32 equal intervals, to create the contour effect. Take note that it represents relative values not absolute.

Here is C1 South Asian:

C2 Balochistan/Caucasus:

C5 Southwest Asian:

C6 European:

C12 Siberian:

Great job, Simranjit!

Ref 1 South Asians + Harappa PCA

Posted by Zack on March 16, 2011 28 comments

I ran PCA on the South Asian populations included in the Reference I dataset as well as 38 South Asian participants of Harappa Project. I excluded Kalash and Hazara because, being so distinct, they usually dominate a South Asian PCA plot.

The reference populations included are: Balochi, Bnei Menashe Jews, Brahui, Burusho, Cochin Jews, Gujaratis (divided into two groups), Makrani, Malayan, North Kannadi, Paniya, Pathan, Sakilli, Sindhi, and Singapore Indians.

Here's the spreadsheet showing the eigenvalues and the first 15 principal components for each sample.

I computed the PCA using Eigensoft which removed 26 samples as outliers. The Tracy-Widom statistics show that about 30 eigenvectors are significant.

Here are the first 15 eigenvalues:

1	3.874124
2	1.819077
3	1.663232
4	1.335721
5	1.293500
6	1.242984
7	1.230921
8	1.225775
9	1.222177
10	1.214539
11	1.212808
12	1.204000
13	1.198930
14	1.195450
15	1.192848

Here is a 3-D PCA plot (hat tip: Doug McDonald) showing the first three eigenvectors. The plot is rotating about the 1st eigenvector which is vertical. Also, I have stretched the principal components based on the corresponding eigenvalues.

Your browser does not support frames. Go here to see the animation: http://www.harappadna.org/wp-content/uploads/2011/03/r1_sa_hrp_pca.html

Now here are plots of the first 14 eigenvectors. In this case, I have not stretched the principal components, so keep in mind that the first eigenvector explains 3.874124/1.819077 = 2.13 times as much variation as the 2nd eigenvector.
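
The same comparison can be made for every pair of adjacent eigenvectors. This short sketch uses the 15 eigenvalues listed above; the helper name is just for illustration.

```python
eigenvalues = [
    3.874124, 1.819077, 1.663232, 1.335721, 1.293500,
    1.242984, 1.230921, 1.225775, 1.222177, 1.214539,
    1.212808, 1.204000, 1.198930, 1.195450, 1.192848,
]

def ratio_to_next(values):
    """Ratio of each eigenvalue to the one that follows it."""
    return [a / b for a, b in zip(values, values[1:])]

ratios = ratio_to_next(eigenvalues)
print(f"PC1 vs PC2: {ratios[0]:.2f}")
for i, ratio in enumerate(ratios[1:], start=2):
    print(f"PC{i} vs PC{i + 1}: {ratio:.2f}")
```

After the first few components the ratios sit very close to 1, so the later plots are on nearly equal footing with each other.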

UPDATE: At the bottom of the 3-D plot, you can see a dropdown. Just select one of the project participants from there and that participant's dot in the plot will become bigger so they are easy to spot.

Harappa Ancestry Project

Genetics and South Asia
