
South Asian PCA 3D Plot

Here's a 3-D plot of my South Asian PCA run, showing the first three principal components.

The principal components have been scaled according to their respective eigenvalues. The plot rotates about the 1st eigenvector, which is drawn vertically.

You can find out your position on the plot by using the dropdown below the plot and selecting your Harappa ID.

South Asian PCA Plots

I did a South Asian PCA + Mclust analysis last month. Here are the PCA plots from that analysis.

First, note that the eigenvectors are not scaled to the eigenvalues in the plots. So here's a table showing how much of the total variation each eigenvector explains.

Eigenvector Percentage variation explained
1 1.134%
2 0.452%
3 0.351%
4 0.263%
5 0.254%
6 0.236%
7 0.228%
8 0.224%
9 0.215%
10 0.209%
11 0.207%
12 0.205%
13 0.203%
14 0.201%
15 0.198%
16 0.194%
17 0.191%
18 0.189%
19 0.189%
20 0.188%
21 0.188%
22 0.187%
23 0.186%
24 0.185%
25 0.184%
26 0.184%
27 0.183%
28 0.182%
29 0.180%
30 0.180%
31 0.179%
32 0.179%
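
Since the plots are unscaled, it can help to add up the shares in the table to see how much variation a given set of axes covers. Here's a minimal sketch using the percentages listed above. The variable and function names are just for illustration and are not part of the original analysis.

```python
# Percentage of variation explained by eigenvectors 1-32, copied from the table above.
explained = [
    1.134, 0.452, 0.351, 0.263, 0.254, 0.236, 0.228, 0.224,
    0.215, 0.209, 0.207, 0.205, 0.203, 0.201, 0.198, 0.194,
    0.191, 0.189, 0.189, 0.188, 0.188, 0.187, 0.186, 0.185,
    0.184, 0.184, 0.183, 0.182, 0.180, 0.180, 0.179, 0.179,
]


def cumulative_share(percentages, n):
    """Sum the shares (in percent) of the first n eigenvectors."""
    return sum(percentages[:n])


first_three = cumulative_share(explained, 3)
all_listed = cumulative_share(explained, len(explained))
print(f"First 3 eigenvectors: {first_three:.3f}%")
print(f"All {len(explained)} listed: {all_listed:.3f}%")
```

So the first three eigenvectors together cover just under 2% of the variation, and each eigenvector past the first few adds only about 0.2%.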

Eigenvector 1 looks like the Indian cline, but it's actually a West-East Eurasian cline. It's quite similar to Reich et al's Indian cline for their subset of populations (the correlation between PC1 and ASI is 0.998869). However, since there are no East Asian samples here to separate out East Asian ancestry, we get a mix of East Asian and Ancestral South Indian towards the right of the plot.

Eigenvector 2 separates Kalash from everyone else.

Metspalu Ref3 Admixture Individual Results

I ran supervised admixture on the Metspalu et al dataset using my reference 3 data. AV asked for individual results, so here they are.

Here's the spreadsheet for Metspalu individual admixture results. You can compare with the reference 3 results.

Here's our bar chart. Remember you can click on the legend or the table headers to sort.

Metspalu Dataset Update

Dr. Metspalu, who has been very good about sharing data and information, has informed me about a couple of cases of mislabeling in the Metspalu et al dataset:

"Our sample labelled D238 and reported as Tharu is in fact a Brahmin sample from Uttar Pradesh."

"Following the publication we have identified that sample evo_32 was erroneously labelled as Kanjar before any genetic analyses. We hereby re-label the sample as belonging to Kol population."

Thus, I have updated the Metspalu admixture results and clustering results.

South Asian PCA + Mclust

I combined reference 3 with Metspalu et al data and Harappa Ancestry Project participants (up to HRP0200). Then I kept only those individuals whose combined proportion of South Asian and Onge components on my reference 3 admixture results was more than 50%.
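
The filtering step above can be sketched roughly as follows. This is only an illustration of the rule, not the script I used. The sample IDs, the proportions, and the non-South Asian component names are hypothetical, and the proportions are assumed to be stored as fractions between 0 and 1.

```python
def is_south_asian(components, threshold=0.50):
    """Keep a sample if its South Asian + Onge proportions exceed the threshold."""
    combined = components.get("South Asian", 0.0) + components.get("Onge", 0.0)
    return combined > threshold


# Hypothetical samples with made-up proportions, for illustration only.
samples = {
    "sample_A": {"South Asian": 0.62, "Onge": 0.05, "European": 0.33},
    "sample_B": {"South Asian": 0.30, "Onge": 0.01, "European": 0.69},
    "sample_C": {"South Asian": 0.45, "Onge": 0.10, "SW Asian": 0.45},
}

kept = [sid for sid, comps in samples.items() if is_south_asian(comps)]
print(kept)
```

Note that sample_C is kept only because the two components are added together; neither one is above 50% on its own.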

I ran PCA on these South Asian samples and kept 31 dimensions. Running Mclust on the PCA results gave me 37 clusters.

The clustering results are in a spreadsheet.

For an individual, the value under a specific cluster shows the probability of that person belonging to that cluster. For example, HRP0152 has a 58% probability of belonging to cluster CL8 and 42% probability of being in cluster CL14.

For the populations in the first sheet, I added up the probabilities of all the samples in that population to get the expected number of individuals of that ethnicity belonging to a specific cluster.
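
To make the summing concrete, here's a small sketch of how per-individual probabilities turn into expected counts per population. HRP0152's values are taken from the example above. The other samples, their probabilities, and the population labels are hypothetical.

```python
from collections import defaultdict

# Per-individual cluster probabilities. Only HRP0152's row comes from the
# example above; the other rows and all population labels are made up.
memberships = {
    "HRP0152": {"CL8": 0.58, "CL14": 0.42},
    "sample_X": {"CL8": 1.0},
    "sample_Y": {"CL14": 0.25, "CL3": 0.75},
}
population_of = {"HRP0152": "pop_1", "sample_X": "pop_1", "sample_Y": "pop_2"}


def expected_counts(memberships, population_of):
    """Sum membership probabilities per (population, cluster)."""
    totals = defaultdict(lambda: defaultdict(float))
    for sample, probs in memberships.items():
        for cluster, p in probs.items():
            totals[population_of[sample]][cluster] += p
    return {pop: dict(clusters) for pop, clusters in totals.items()}


counts = expected_counts(memberships, population_of)
print(counts)
```

Because each individual's probabilities add up to 1, the values in a population's row add up to the number of samples in that population.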

In the second sheet, I have listed all the individual samples' clustering results.

There are some outliers who didn't belong in any cluster: HRP0001 (me, of course), 7 (out of 18) Makranis, 4 (out of 23) Sindhis, 3 (all) Great Andamanese, 1 (out of 20) Balochi, 1 (out of 4) Madiga, and 1 (only) Onge.

Reference 3 + Yunusbayev + HAP PCA and Mclust

I ran Principal Component Analysis (PCA) on reference 3 along with the Yunusbayev et al Caucasus dataset and Harappa Ancestry Project participants (up to HRP0200).

Then I ran Mclust on the first 70 dimensions. The resulting 156 clusters can be seen in a spreadsheet.

For individuals belonging to Harappa Ancestry Project, the value in a column shows that person's probability of being in that cluster. So if there is a 1 in CL15 for example, then that person has a 100% probability of being in Cluster CL15.

For the reference population groups, I have added up the probabilities for all the individuals belonging to that group.

Yunusbayev Ref3 Admixture Results

I ran supervised admixture on the Yunusbayev et al dataset from the Caucasus using my reference 3 data to see how the Yunusbayev samples looked in my Ref3 admixture component space.

Here's the spreadsheet for Yunusbayev admixture results. You can compare with the reference 3 results.

Here's our bar chart for Yunusbayev results. Remember you can click on the legend or the table headers to sort.

Metspalu Ref3 Admixture Results

I ran supervised admixture on the Metspalu et al dataset using my reference 3 data. Here's the spreadsheet for Metspalu admixture results. You can compare with the reference 3 results.

Here's our bar chart for Metspalu results. Remember you can click on the legend or the table headers to sort.

These are very different from Dienekes' results for some reason.

UPDATE (Dec 13 10:04am): I found a major error. I had used the population info file I had downloaded from the paper instead of my reformatted one, so the population info was not matched to the correct IDs when I merged it with the admixture results. The previously posted results were therefore junk. I have fixed that now and the results are as expected.
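
A check along the following lines would catch this kind of mistake before anything gets posted. It's a sketch only, not the script I actually used: the function name, IDs, labels, and values are all hypothetical, and it assumes the mismatch shows up as sample IDs that are missing from the population info file.

```python
def merge_population_info(admixture_rows, population_info):
    """Attach a population label to each admixture row by sample ID.

    Raises ValueError if any ID has no match, instead of silently
    producing mislabeled or unlabeled rows.
    """
    missing = [sid for sid in admixture_rows if sid not in population_info]
    if missing:
        raise ValueError(f"IDs missing from population info: {missing}")
    return {
        sid: {"population": population_info[sid], **components}
        for sid, components in admixture_rows.items()
    }


# Hypothetical IDs, labels, and proportions, for illustration only.
admixture_rows = {"id_001": {"South Asian": 0.70}, "id_002": {"South Asian": 0.40}}
reformatted_info = {"id_001": "pop_A", "id_002": "pop_B"}
original_info = {"ID-1": "pop_A", "ID-2": "pop_B"}  # IDs in a different format

merged = merge_population_info(admixture_rows, reformatted_info)
print(merged)

try:
    merge_population_info(admixture_rows, original_info)
except ValueError as err:
    print("caught:", err)
```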

Metspalu et al Data Relatedness

I performed IBD analysis on the Metspalu dataset using plink and found the relatedness of the following samples to be too high.

ID1 Source1 Population1 ID2 Source2 Population2 IBD Estimate
Mawasi1 Metspalu Mawasi Mawasi1 Chaubey Mawasi 100%
VELZ260 Metspalu Velama Velama_184_R2 Reich Velama 99%
VELZ260 Metspalu Velama VELZ265 Metspalu Velama 19%
VELZ265 Metspalu Velama Velama_184_R2 Reich Velama 19%
D254 Metspalu Tharu Tharu_107_R1 Reich Tharu 99%
D260 Metspalu Tharu Tharu_108_R1 Reich Tharu 98%
evo_32 Metspalu Kanjar 321e Metspalu Kol 53%
HA030 Metspalu Dharkar HA039 Metspalu Dharkar 52%
A387 Metspalu Dusadh A388 Metspalu Dusadh 52%
A394 Metspalu Dusadh A395 Metspalu Dusadh 52%
A395 Metspalu Dusadh A393 Metspalu Dusadh 46%
A394 Metspalu Dusadh A393 Metspalu Dusadh 45%
A392 Metspalu Dusadh A393 Metspalu Dusadh 32%
A392 Metspalu Dusadh A395 Metspalu Dusadh 31%
A392 Metspalu Dusadh A394 Metspalu Dusadh 28%
evo_37 Metspalu Kanjar HA023 Metspalu Dharkar 27%
HA039 Metspalu Dharkar HA041 Metspalu Dharkar 24%
HLKP245 Metspalu Hakkipikki Hallaki_137_R2 Reich Hallaki 22%
PULD160 Metspalu Pulliyar PULD162 Metspalu Pulliyar 20%

As you can see, three samples from Reich et al seem to be the same as samples in Metspalu et al. In addition, two Reich samples seem to be related to Metspalu samples.

There are some Metspalu samples that are likely related to one another. A value of about 50% likely indicates a parent-child or sibling-sibling relationship. A 45-46% relatedness is most likely siblings in my opinion. An 18-19% estimate could be a 1st cousin relationship in an endogamous community. It could also just be the background relatedness in a small, bottlenecked and endogamous community.

It looks like about half of the Dusadh in the Metspalu dataset are related.

I am surprised at the close relationship of a Kanjar and a Kol in the dataset, though both are from Uttar Pradesh.
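
For anyone who wants to build a table like the one above, here's a rough sketch. It assumes the estimates come from plink's `--genome` report, where the PI_HAT column holds the estimated proportion of the genome shared IBD. The 0.18 cutoff is only a placeholder picked to sit just under the smallest value in the table, and the IDs and numbers in the example are made up.

```python
def related_pairs(genome_text, min_pi_hat=0.18):
    """Return (IID1, IID2, PI_HAT) for pairs at or above min_pi_hat, highest first."""
    lines = [ln.split() for ln in genome_text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    # Look columns up by name so the code does not depend on column order.
    i1, i2, ip = header.index("IID1"), header.index("IID2"), header.index("PI_HAT")
    pairs = [(r[i1], r[i2], float(r[ip])) for r in rows]
    flagged = [p for p in pairs if p[2] >= min_pi_hat]
    return sorted(flagged, key=lambda p: p[2], reverse=True)


# Hypothetical excerpt in the layout of a plink .genome file; IDs and values are made up.
example = """
 FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
 F1 S1 F2 S2 UN NA 0.0000 0.0000 1.0000 1.0000 -1 1.000000 1.0000 NA
 F1 S1 F3 S3 UN NA 0.9500 0.0500 0.0000 0.0250 -1 0.720000 0.5000 2.0000
 F3 S3 F4 S4 UN NA 0.2500 0.5000 0.2500 0.5000 -1 0.850000 1.0000 9.0000
"""

for iid1, iid2, pi_hat in related_pairs(example):
    print(f"{iid1}\t{iid2}\t{pi_hat:.0%}")
```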

Ref3 + Yunusbayev Caucasus Data Admixture

To my standard reference 3 (list of populations), I added the Yunusbayev et al Caucasus samples which include the following:

  • 20 abhkasians
  • 16 armenians
  • 19 balkars
  • 13 bulgarians
  • 20 chechens
  • 14 kumyks
  • 6 kurds
  • 15 mordovians
  • 16 nogais
  • 15 north-ossetians
  • 15 tajiks
  • 15 turkmens
  • 20 ukranians

These 204 samples increased the total to 4,090.

Then I applied a stricter IBD relationship cutoff than I had used before. Previously my focus was on removing relatives, but now I also wanted to remove samples that seemed highly inbred or belonged to small, highly bottlenecked groups so they would not create their own clusters in Admixture. This process removed the following 164 samples:

  • maasai 30
  • papuan 15
  • karitiana 12
  • pima 12
  • onge 8
  • surui 7
  • luhya 6
  • melanesian 6
  • colombian 5
  • hadza 5
  • koryaks 5
  • sandawe 5
  • san 4
  • turkmens 4
  • african-americans 3
  • east-greenlanders 3
  • great-andamanese 3
  • nganassans 3
  • chenchu 2
  • evenkis 2
  • han-chinese-south 2
  • maya 2
  • mbutipygmy 2
  • mexicans 2
  • utahn-whites 2
  • aus 1
  • bantukenya 1
  • british 1
  • chinese-americans 1
  • gujaratis-b 1
  • iranians 1
  • naxi 1
  • north-kannadi 1
  • samaritians 1
  • she 1
  • tuvinians 1
  • yemenese 1
  • yoruba 1
  • yukaghirs 1

Finally, I added the 165 founders from the Harappa Project participants (up to HRP0180).
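
As a sanity check on the bookkeeping, the counts in the two lists above can be added up like this. The final total is only what the arithmetic implies if no samples were added or dropped other than the ones listed here; it is not a number I have reported separately.

```python
# Counts copied from the two lists above.
added_caucasus = {
    "abhkasians": 20, "armenians": 16, "balkars": 19, "bulgarians": 13,
    "chechens": 20, "kumyks": 14, "kurds": 6, "mordovians": 15,
    "nogais": 16, "north-ossetians": 15, "tajiks": 15, "turkmens": 15,
    "ukranians": 20,
}
removed = {
    "maasai": 30, "papuan": 15, "karitiana": 12, "pima": 12, "onge": 8,
    "surui": 7, "luhya": 6, "melanesian": 6, "colombian": 5, "hadza": 5,
    "koryaks": 5, "sandawe": 5, "san": 4, "turkmens": 4,
    "african-americans": 3, "east-greenlanders": 3, "great-andamanese": 3,
    "nganassans": 3, "chenchu": 2, "evenkis": 2, "han-chinese-south": 2,
    "maya": 2, "mbutipygmy": 2, "mexicans": 2, "utahn-whites": 2,
    "aus": 1, "bantukenya": 1, "british": 1, "chinese-americans": 1,
    "gujaratis-b": 1, "iranians": 1, "naxi": 1, "north-kannadi": 1,
    "samaritians": 1, "she": 1, "tuvinians": 1, "yemenese": 1,
    "yoruba": 1, "yukaghirs": 1,
}
total_with_caucasus = 4090  # stated above
harappa_founders = 165      # stated above

n_added = sum(added_caucasus.values())
n_removed = sum(removed.values())

# Assumption: nothing else was added or dropped between these steps.
final_total = total_with_caucasus - n_removed + harappa_founders
print(n_added, n_removed, final_total)
```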

The cross-validation error for the admixture results with K (number of ancestral components) from 2 to 20 is plotted here.

Zooming in,

The lowest cross-validation errors are for K=17 and K=12.
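
If you want to rank the values of K yourself, here's a minimal sketch. It assumes the runs were done with ADMIXTURE's `--cv` option, which writes a line of the form `CV error (K=...): ...` to each run's log. The log excerpt below uses placeholder numbers, not the errors from my runs.

```python
import re

# Pattern for the cross-validation line ADMIXTURE prints when run with --cv.
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def rank_k_by_cv(log_text):
    """Return (K, cv_error) pairs found in the log text, lowest error first."""
    found = [(int(k), float(err)) for k, err in CV_LINE.findall(log_text)]
    return sorted(found, key=lambda pair: pair[1])


# Hypothetical log lines with placeholder values, not the errors from this run.
example_log = """
CV error (K=2): 0.60
CV error (K=3): 0.55
CV error (K=4): 0.57
"""

ranking = rank_k_by_cv(example_log)
print(ranking)
```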

The admixture results are in a spreadsheet.

In addition to K=17 and K=12, take a look at the results for K=15.

PS. I should point out that the names for the ancestral components are just useful mnemonics based on the current distribution of that component. Also, a component with the same name at one value of K is different from a similarly named component at another K.