Monthly Archives: December 2011

South Asian PCA + Mclust

I combined reference 3 with Metspalu et al data and Harappa Ancestry Project participants (up to HRP0200). Then I kept only those individuals whose combined proportion of South Asian and Onge components on my reference 3 admixture results was more than 50%.
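The filtering step can be sketched roughly as follows. This is only an illustration: the sample IDs and proportions are hypothetical, and it assumes each sample's admixture results are stored as fractions keyed by component name.

```python
# Hypothetical admixture proportions (fractions per sample); not real results.
samples = {
    "sampleA": {"S Asian": 0.45, "Onge": 0.10, "SW Asian": 0.25, "European": 0.20},
    "sampleB": {"S Asian": 0.30, "Onge": 0.05, "SW Asian": 0.40, "European": 0.25},
    "sampleC": {"S Asian": 0.50, "Onge": 0.00, "SW Asian": 0.30, "European": 0.20},
}

THRESHOLD = 0.50  # keep samples with MORE than 50% South Asian + Onge


def is_south_asian(props, threshold=THRESHOLD):
    """Return True if the combined South Asian and Onge share exceeds the threshold."""
    combined = props.get("S Asian", 0.0) + props.get("Onge", 0.0)
    return combined > threshold


kept = sorted(sid for sid, props in samples.items() if is_south_asian(props))
print(kept)
```

Note that a sample at exactly 50% is dropped here, since the criterion is "more than 50%".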

I ran PCA on these South Asian samples and kept 31 dimensions. Running Mclust on the PCA results gave me 37 clusters.

The clustering results are in a spreadsheet.

For an individual, the value under a specific cluster shows the probability of that person belonging to that cluster. For example, HRP0152 has a 58% probability of belonging to cluster CL8 and 42% probability of being in cluster CL14.

For the populations in the first sheet, I added up the probabilities of all the samples in that population to get the expected number of individuals of that ethnicity belonging to a specific cluster.
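Here is a minimal sketch of that summation, assuming the per-sample probabilities are available as dictionaries. The population names and probability values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (population label, {cluster: membership probability}).
memberships = [
    ("PopA", {"CL8": 0.58, "CL14": 0.42}),
    ("PopA", {"CL8": 1.0}),
    ("PopB", {"CL14": 0.25, "CL20": 0.75}),
]


def expected_counts(rows):
    """Sum membership probabilities per population and cluster."""
    totals = defaultdict(lambda: defaultdict(float))
    for population, probs in rows:
        for cluster, p in probs.items():
            totals[population][cluster] += p
    return {pop: dict(clusters) for pop, clusters in totals.items()}


counts = expected_counts(memberships)
for pop, clusters in counts.items():
    print(pop, clusters)
```

Because each sample's probabilities sum to 1, the expected counts for a population add up to the number of clustered samples in that population.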

In the second sheet, I have listed all the individual samples' clustering results.

There are some outliers who didn't belong in any cluster: HRP0001 (me, of course), 7 (out of 18) Makranis, 4 (out of 23) Sindhis, 3 (all) Great Andamanese, 1 (out of 20) Balochi, 1 (out of 4) Madiga, and 1 (only) Onge.

Reference 3 + Yunusbayev + HAP PCA and Mclust

I ran Principal Component Analysis (PCA) on reference 3 along with Yunusbayev et al Caucasus dataset and Harappa Ancestry Project participants (up to HRP0200).

Then I ran mclust on the first 70 dimensions. The resulting 156 clusters can be seen in a spreadsheet.

For individuals belonging to Harappa Ancestry Project, the value in a column shows that person's probability of being in that cluster. So if there is a 1 in CL15 for example, then that person has a 100% probability of being in Cluster CL15.
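To read a row of the spreadsheet as a single "best" cluster, one could pick the column with the highest probability. The sketch below assumes a row is a dictionary of cluster names to probabilities; the rows shown are hypothetical.

```python
def most_likely_cluster(probs):
    """Return (cluster, probability) for the highest-probability cluster in a row."""
    cluster = max(probs, key=probs.get)
    return cluster, probs[cluster]


# Hypothetical rows: one certain assignment, one split between two clusters.
certain = {"CL15": 1.0}
split = {"CL3": 0.7, "CL40": 0.3}

print(most_likely_cluster(certain))
print(most_likely_cluster(split))
```

A hard assignment like this discards the uncertainty, which is why the spreadsheet keeps the full probabilities.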

For the reference population groups, I have added up the probabilities for all the individuals belonging to that group.

Yunusbayev Ref3 Admixture Results

I ran supervised admixture on the Yunusbayev et al dataset from the Caucasus using my reference 3 data to see how the Yunusbayev samples looked in my Ref3 admixture component space.

Here's the spreadsheet for Yunusbayev admixture results. You can compare with the reference 3 results.

Here's our bar chart for Yunusbayev results. Remember you can click on the legend or the table headers to sort.

Another 23andme Sale

23andme is having another sale till December 31: $23 off per kit (from $99 up front). The code to take advantage of the sale price is TPHG6P.

UPDATE: Here is another link for a $23 discount for 23andme.

Metspalu Ref3 Admixture Results

I ran supervised admixture on the Metspalu et al dataset using my reference 3 data. Here's the spreadsheet for Metspalu admixture results. You can compare with the reference 3 results.

Here's our bar chart for Metspalu results. Remember you can click on the legend or the table headers to sort.

These are very different from Dienekes for some reason.

UPDATE (Dec 13 10:04am): I found a major error. I had used the population info file downloaded from the paper instead of my reformatted one, so the population info was not matched to the correct IDs when I merged it with the admixture results. The previously posted results were therefore junk. I have fixed that now and the results are as expected.
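One way to guard against this kind of mismatch is to refuse to merge when any sample ID lacks population info. The sketch below is not the script actually used; the function name, IDs and values are hypothetical.

```python
def merge_by_id(admixture, pop_info):
    """Attach population labels to admixture rows, failing loudly on unmatched IDs."""
    missing = sorted(set(admixture) - set(pop_info))
    if missing:
        raise KeyError(f"no population info for IDs: {missing}")
    return {
        sid: {"population": pop_info[sid], **components}
        for sid, components in admixture.items()
    }


# Hypothetical inputs keyed by sample ID.
admixture = {"id1": {"S Asian": 0.6}, "id2": {"S Asian": 0.4}}
pop_info = {"id1": "PopA", "id2": "PopB"}

merged = merge_by_id(admixture, pop_info)
print(merged["id1"])
```

With a differently formatted info file, the ID sets would not line up and the merge would stop with an error instead of silently producing wrong labels.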

Metspalu et al Data Relatedness

I performed IBD analysis on the Metspalu dataset using plink and found the relatedness of the following samples to be too high.

| ID1 | Source1 | Population1 | ID2 | Source2 | Population2 | IBD Estimate |
|---|---|---|---|---|---|---|
| Mawasi1 | Metspalu | Mawasi | Mawasi1 | Chaubey | Mawasi | 100% |
| VELZ260 | Metspalu | Velama | Velama_184_R2 | Reich | Velama | 99% |
| VELZ260 | Metspalu | Velama | VELZ265 | Metspalu | Velama | 19% |
| VELZ265 | Metspalu | Velama | Velama_184_R2 | Reich | Velama | 19% |
| D254 | Metspalu | Tharu | Tharu_107_R1 | Reich | Tharu | 99% |
| D260 | Metspalu | Tharu | Tharu_108_R1 | Reich | Tharu | 98% |
| evo_32 | Metspalu | Kanjar | 321e | Metspalu | Kol | 53% |
| HA030 | Metspalu | Dharkar | HA039 | Metspalu | Dharkar | 52% |
| A387 | Metspalu | Dusadh | A388 | Metspalu | Dusadh | 52% |
| A394 | Metspalu | Dusadh | A395 | Metspalu | Dusadh | 52% |
| A395 | Metspalu | Dusadh | A393 | Metspalu | Dusadh | 46% |
| A394 | Metspalu | Dusadh | A393 | Metspalu | Dusadh | 45% |
| A392 | Metspalu | Dusadh | A393 | Metspalu | Dusadh | 32% |
| A392 | Metspalu | Dusadh | A395 | Metspalu | Dusadh | 31% |
| A392 | Metspalu | Dusadh | A394 | Metspalu | Dusadh | 28% |
| evo_37 | Metspalu | Kanjar | HA023 | Metspalu | Dharkar | 27% |
| HA039 | Metspalu | Dharkar | HA041 | Metspalu | Dharkar | 24% |
| HLKP245 | Metspalu | Hakkipikki | Hallaki_137_R2 | Reich | Hallaki | 22% |
| PULD160 | Metspalu | Pulliyar | PULD162 | Metspalu | Pulliyar | 20% |

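As you can see, three samples from Reich et al (Velama_184_R2, Tharu_107_R1 and Tharu_108_R1) seem to be the same individuals as Metspalu samples, as does Mawasi1 from Chaubey. In addition, two Reich samples (Velama_184_R2 and Hallaki_137_R2) seem to be related to other Metspalu samples.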

There are also some Metspalu samples that are likely related to one another. A value around 50% likely indicates a parent-child or sibling relationship. A 45-46% relatedness is most likely siblings in my opinion. A value of 18-19% could be a first-cousin relationship in an endogamous community. It could also just be the background relatedness in a small, bottlenecked and endogamous community.
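These rules of thumb can be written down as a small lookup. The cutoffs below (0.95, 0.40, 0.15) are my own illustrative assumptions, not standard thresholds, and the function name is hypothetical. The input is the IBD proportion as a fraction, which plink's --genome output reports as PI_HAT.

```python
def describe_relatedness(pi_hat):
    """Map an IBD proportion (0 to 1) to a rough relationship label.

    The cutoffs are illustrative assumptions, not established thresholds.
    """
    if pi_hat >= 0.95:
        return "same individual (duplicate sample) or identical twin"
    if pi_hat >= 0.40:
        return "parent-child or full sibling"
    if pi_hat >= 0.15:
        return "more distant relative or background relatedness"
    return "below the reporting range used here"


# IBD estimates taken from the table above, expressed as fractions.
for value in (1.00, 0.99, 0.53, 0.45, 0.19):
    print(f"{value:.2f}: {describe_relatedness(value)}")
```

In a bottlenecked, endogamous population the lowest bin is especially ambiguous, so labels there should be treated as guesses.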

It looks like about half of the Dusadh in the Metspalu dataset are related.

I am surprised at the close relationship of a Kanjar and a Kol in the dataset, though both are from Uttar Pradesh.

Shared and Unique Components of Human Population Structure and Genome-Wide Signals of Positive Selection in South Asia

Metspalu et al have a new paper in American Journal of Human Genetics about South Asian genetics. Here's the abstract:

South Asia harbors one of the highest levels genetic diversity in Eurasia, which could be interpreted as a result of its long-term large effective population size and of admixture during its complex demographic history. In contrast to Pakistani populations, populations of Indian origin have been underrepresented in previous genomic scans of positive selection and population structure. Here we report data for more than 600,000 SNP markers genotyped in 142 samples from 30 ethnic groups in India. Combining our results with other available genome-wide data, we show that Indian populations are characterized by two major ancestry components, one of which is spread at comparable frequency and haplotype diversity in populations of South and West Asia and the Caucasus. The second component is more restricted to South Asia and accounts for more than 50% of the ancestry in Indian populations. Haplotype diversity associated with these South Asian ancestry components is significantly higher than that of the components dominating the West Eurasian ancestry palette. Modeling of the observed haplotype diversities suggests that both Indian ancestry components are older than the purported Indo-Aryan invasion 3,500 YBP. Consistent with the results of pairwise genetic distances among world regions, Indians share more ancestry signals with West than with East Eurasians. However, compared to Pakistani populations, a higher proportion of their genes show regionally specific signals of high haplotype homozygosity. Among such candidates of positive selection in India are MSTN and DOK5, both of which have potential implications in lipid metabolism and the etiology of type 2 diabetes.

I'll have some comments later today.

Admixture (Ref3 K=11) HRP0191-HRP0200

Here are the admixture results using Reference 3 for Harappa participants HRP0191 to HRP0200.

You can see the participant results in a spreadsheet as well as their ethnic breakdowns and the reference population results.

Here's our bar chart and table. Remember you can click on the legend or the table headers to sort.

If the above interactive charts are not working, here's a static bar graph.

HRP0193 is Georgian and has very similar results to HRP0138 and HRP0175.

HRP0200 is Kazakh and is closely related to HRP0089. Thus the difference there (American, Onge & Papuan components) is somewhat interesting, though not high enough to be certain that it's not noise.

HRP0197 and HRP0198 are Somali. HRP0197 pointed out to me that 14S_R1, a Somali in the reference set, was an outlier who was more like East African Bantu (e.g., Luhya) than the other reference Somalis. So in the table below, I have excluded 14S_R1 from the reference average.

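| Component | RefAverage | HRP0197 | HRP0198 |
|---|---|---|---|
| S Asian | 0 | 2 | 2 |
| Onge | 4 | 0 | 1 |
| E Asian | 0 | 1 | 2 |
| SW Asian | 28 | 33 | 34 |
| European | 0 | 0 | 0 |
| Siberian | 0 | 2 | 1 |
| W African | 12 | 14 | 13 |
| Papuan | 0 | 0 | 0 |
| American | 0 | 0 | 1 |
| San/Pygmy | 2 | 3 | 2 |
| E African | 52 | 44 | 43 |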

Interestingly, the two project participants are more Asian than the reference average.
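The comparison can be checked by adding up the Asian components from the table above. Which components count as "Asian" is an assumption here: I have taken S Asian, Onge, E Asian, SW Asian and Siberian.

```python
# Component percentages copied from the table above:
# (reference average, HRP0197, HRP0198).
table = {
    "S Asian": (0, 2, 2),
    "Onge": (4, 0, 1),
    "E Asian": (0, 1, 2),
    "SW Asian": (28, 33, 34),
    "European": (0, 0, 0),
    "Siberian": (0, 2, 1),
    "W African": (12, 14, 13),
    "Papuan": (0, 0, 0),
    "American": (0, 0, 1),
    "San/Pygmy": (2, 3, 2),
    "E African": (52, 44, 43),
}
columns = ("RefAverage", "HRP0197", "HRP0198")

# Assumption: these are the components treated as "Asian".
ASIAN = ("S Asian", "Onge", "E Asian", "SW Asian", "Siberian")

asian_total = {
    name: sum(table[component][i] for component in ASIAN)
    for i, name in enumerate(columns)
}
print(asian_total)  # {'RefAverage': 32, 'HRP0197': 38, 'HRP0198': 40}
```

Most of the gap comes from the SW Asian component, with the E African component correspondingly lower in the two participants.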
