
Reference 3 Population Concordance

Dienekes had come up with a population concordance ratio which compared the IBS similarity percentages of a trio of individuals to compute the probability that two individuals from population A are more similar to each other than either is to any individual in population B.

Please note that:

If two populations can be perfectly distinguished, then their population concordance ratio is 1. If however we randomly divide a set of individuals into two populations and try to calculate the population concordance ratio, we'll find it to be 0.25. It is possible for this ratio to be as low as zero.

If the concordance ratio between two populations is low, that does not necessarily mean that they are very similar. It's possible that a population does not form a tight cluster and has a lot of variation and thus is not distinguishable from another.

Now, here's the spreadsheet for the concordance ratios. You can focus on the South Asian population pairs here.

West, Central, South & Southeast Asian Admixture

Another set of admixture runs. This one uses the South Asian, Middle Eastern, Caucasian, Central Asian, Southeast Asian and Oceanian samples from Reference 3.

Basically I consider these to be our target populations. The idea is to build out from here by adding a few samples from other populations to make the results better.

Right now, the absence of African, European, East Asian and Siberian populations makes some of the other populations substitute for them. For example, Siddi works as an African substitute while Aonaga works as an East Asian substitute.

Here are the admixture results. You can choose the number of ancestral components, K, from the dropdown below.

I find K=11 and K=14 to be the most interesting. They have the two lowest cross-validation errors too.

Reference I Admixture Errors

I have been thinking about error estimation for Admixture results for some time, since I have heard a lot of arguments that even a 0.1% result is significant. I was skeptical of that and have rounded off my admixture run results to the nearest percent.

There was a memory leak issue in the bootstrapping code for admixture which crashed it every time I tried running it. I emailed David Alexander and he fixed it in version 1.12.

So I ran the default 200 bootstrap replicates to measure the standard error in our old Reference I K=12 admixture. The spreadsheet with population-level results is here and the participant results are here.
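As a rough illustration of what a bootstrap reports, here is a minimal Python sketch that computes a standard error and a bias estimate for a single admixture proportion from a set of bootstrap replicates. It uses the textbook bootstrap definitions (standard deviation of the replicates, and replicate mean minus the point estimate); whether Admixture computes its figures exactly this way is an assumption. The function name and the input numbers are made up for illustration.

```python
import statistics


def bootstrap_se_and_bias(point_estimate, replicates):
    """Textbook bootstrap summary for one admixture proportion.

    point_estimate: proportion estimated from the full data (0..1).
    replicates: the same proportion re-estimated on each bootstrap resample.
    Returns (standard_error, bias).
    """
    if len(replicates) < 2:
        raise ValueError("need at least two bootstrap replicates")
    # Standard error: sample standard deviation of the replicate estimates.
    se = statistics.stdev(replicates)
    # Bias: how far the replicate average sits from the original estimate.
    bias = statistics.fmean(replicates) - point_estimate
    return se, bias


# Hypothetical numbers, not taken from the Reference I run.
estimate = 0.10
reps = [0.09, 0.11, 0.10, 0.12, 0.08]
se, bias = bootstrap_se_and_bias(estimate, reps)
print(f"standard error = {se:.4f}, bias = {bias:+.4f}")
```

With the default 200 replicates the list would simply be longer; the arithmetic is the same.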

Here are some statistics for the standard error estimates:

Component       Min.    1st Qu.  Median  Mean    3rd Qu.  Max.
C1 S Asian      0.00%   0.02%    0.33%   0.52%   0.96%    1.93%
C2 Blch/Cauc    0.00%   0.00%    1.02%   0.79%   1.45%    2.63%
C3 Kalash       0.00%   0.01%    0.40%   0.50%   0.99%    3.76%
C4 SE Asian     0.00%   0.09%    0.37%   0.60%   1.27%    1.92%
C5 SW Asian     0.00%   0.00%    0.60%   0.66%   1.28%    2.90%
C6 Euro         0.00%   0.00%    0.35%   0.56%   1.12%    1.82%
C7 Papuan       0.00%   0.07%    0.22%   0.23%   0.36%    1.08%
C8 NE Asian     0.00%   0.07%    0.36%   0.67%   1.36%    2.45%
C9 Siberian     0.00%   0.08%    0.37%   0.51%   0.82%    2.29%
C10 E Bantu     0.00%   0.00%    0.00%   0.35%   0.72%    1.93%
C11 W Afr       0.00%   0.00%    0.00%   0.28%   0.50%    1.51%
C12 E Afr       0.00%   0.00%    0.05%   0.31%   0.60%    1.79%

You can see the mean value of the standard errors per population and realize how many are over 1% (marked in red).

And statistics for bias estimates:

Component       Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
C1 S Asian      -1.104%   -0.031%   0.000%    -0.024%   0.075%    1.026%
C2 Blch/Cauc    -0.835%   -0.280%   -0.009%   -0.133%   0.000%    1.049%
C3 Kalash       -1.575%   0.000%    0.020%    0.076%    0.147%    0.615%
C4 SE Asian     -0.629%   -0.021%   0.011%    0.018%    0.087%    0.478%
C5 SW Asian     -0.691%   -0.094%   0.000%    -0.020%   0.035%    0.613%
C6 Euro         -0.572%   -0.086%   0.000%    -0.039%   0.004%    0.468%
C7 Papuan       -0.171%   0.008%    0.059%    0.070%    0.120%    0.312%
C8 NE Asian     -0.739%   0.000%    0.016%    0.034%    0.107%    0.679%
C9 Siberian     -1.044%   0.000%    0.015%    0.035%    0.103%    0.692%
C10 E Bantu     -0.412%   0.000%    0.000%    -0.007%   0.001%    0.370%
C11 W Afr       -0.261%   0.000%    0.000%    0.009%    0.005%    0.304%
C12 E Afr       -0.635%   0.000%    0.000%    -0.017%   0.010%    0.405%

You can also see the average value of the bias in each ancestral component for each population.

Since the bias is lower than the standard error and distributed around zero, if a large number of samples of a population group have some small percentage of an ancestral component, the likelihood of that not being noise is higher.

Reference 3F(iltered) Admixture

I removed all American populations and San and Pygmy (i.e., South and Central African) from Reference 3 for a better focus on our target populations.

Here are the admixture results. You can choose the number of ancestral components, K, from the dropdown below.

K=13, 14, 15 (in that order) have the lowest cross-validation error.
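Ranking values of K by cross-validation error can be sketched as follows. This assumes Admixture was run with its cross-validation option and that each run's log contains a line of the form `CV error (K=13): 0.55`. The function name and the sample log lines, including the error values, are made up for illustration and are not the errors from these runs.

```python
import re

# Assumed log line format: "CV error (K=13): 0.5501"
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")


def rank_k_by_cv_error(log_lines):
    """Return (K, cv_error) pairs sorted from lowest to highest error."""
    errors = {}
    for line in log_lines:
        match = CV_LINE.search(line)
        if match:
            errors[int(match.group(1))] = float(match.group(2))
    return sorted(errors.items(), key=lambda item: item[1])


# Hypothetical log lines; the error values are invented.
sample = [
    "CV error (K=12): 0.5512",
    "CV error (K=13): 0.5501",
    "CV error (K=14): 0.5503",
    "CV error (K=15): 0.5507",
]
ranking = rank_k_by_cv_error(sample)
print([k for k, _ in ranking[:3]])
```

Differences between neighbouring values of K are often small, so the ranking is a guide for which runs to look at rather than a verdict.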

There are a number of interesting results in there. For example, the split into northern and southern European, and the split of Siberian into Siberian and Russian Far East (or Bering Strait). However, the Onge component as a proxy of the ASI does not appear. Also, we don't get as much breakdown of the South Asian populations as we would like.

Harappa Reference Population Similarity

I was not satisfied with the median IBS with reference populations method for checking how similar you are to different populations. So I took inspiration from Dienekes' population concordance ratio to compute another measure.

Let's say we have a Harappa participant h and we want to compare h to a reference population A. We can then divide our reference dataset into the in-group A and the out-group A' (which consists of everyone not in A). Now for every individual a belonging to group A and every individual a' belonging to group A', we can compare the IBS similarities and score them as:

$$ s(h, a, a') = \begin{cases} 1 & \text{if } \mathrm{IBS}(h, a) > \mathrm{IBS}(h, a') \text{ and } \mathrm{IBS}(h, a) > \mathrm{IBS}(a, a') \\ 0 & \text{otherwise} \end{cases} $$

The condition in this equation is true when Harappa participant h is more similar to individual a in population A than he is to individual a', who is not in population A, and h and a are closer to each other than a is to a'.

We can then sum up these values over all individuals a in A and a' in A' and divide by the number of pairs, |A| × |A'|:

$$ S(h, A) = \frac{1}{|A|\,|A'|} \sum_{a \in A} \sum_{a' \in A'} s(h, a, a') $$

This score tells us how similar h is to population A compared to all the reference samples not in population A and varies from 0 (most dissimilar) to 1 (most similar).
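The scoring described above can be sketched in a few lines of Python. This is only a sketch of the verbal definition: a pair (a, a') counts when h is closer to a than to a' and also closer to a than a is to a'. The function name, the sample IDs and the toy IBS values are all made up for illustration.

```python
def similarity_score(h, in_group, out_group, ibs):
    """Fraction of (a, a') pairs where h sides with in-group member a.

    ibs(x, y) must return the IBS similarity between individuals x and y.
    """
    if not in_group or not out_group:
        raise ValueError("both groups must be non-empty")
    hits = 0
    for a in in_group:
        for a_out in out_group:
            # h is closer to a than to a', and h-a is closer than a-a'.
            if ibs(h, a) > ibs(h, a_out) and ibs(h, a) > ibs(a, a_out):
                hits += 1
    return hits / (len(in_group) * len(out_group))


# Toy IBS values (invented); keys are unordered pairs of sample IDs.
toy = {
    frozenset(("h", "a1")): 0.80,
    frozenset(("h", "a2")): 0.74,
    frozenset(("h", "b1")): 0.75,
    frozenset(("h", "b2")): 0.70,
    frozenset(("a1", "b1")): 0.72,
    frozenset(("a1", "b2")): 0.71,
    frozenset(("a2", "b1")): 0.73,
    frozenset(("a2", "b2")): 0.76,
}


def toy_ibs(x, y):
    return toy[frozenset((x, y))]


score = similarity_score("h", ["a1", "a2"], ["b1", "b2"], toy_ibs)
print(score)  # both pairs involving a1 count, neither involving a2 does: 0.5
```

Because every in-group member is paired with every out-group member, the cost grows with |A| × |A'| per participant, which is worth keeping in mind for a large reference set.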

Let's see how the Harappa participants HRP0001 to HRP0089 score with the different reference 3 populations.

Go to the spreadsheet and click on your Harappa ID to sort the populations by your similarity score with them (click two times if you want to sort in decreasing order which I like better).

The first sheet Sheet1 has all the populations. In the Filtered 1 sheet, I removed 13 African populations that had really low similarity scores with all participants and recomputed the scores.

In Filtered 2, I further removed 9 populations (East Africa, America, Oceania) with low scores for everyone.

In Filtered 3, another 40 populations with low scores with at least 88 (out of 89) Harappa participants were removed. The reason I removed populations and recomputed is that this made the out-group not as different from the in-group as it was before. So we can check if this algorithm can provide us with some meaningful difference in scores with close populations.

In Filtered 4, another 25 populations were removed making it more South Asian centered.

Finally, I used the 68 unmixed South Asian Harappa participants and did a South Asian specific run (though I cheated a bit and kept myself HRP0001 and my sister HRP0035 in). The most interesting thing here is the really high score the Patel Gujaratis get with the Gujarati-A reference population.

Reference 3 K=11 Admixture Dendrogram

Laredo asked:

Is it possible for you to create an unrooted similarity tree of all the populations in your “Reference 3” dataset?

So here's a dendrogram of the average K=11 admixture results for the reference 3 populations.

Harappa Median IBS with Reference 3

You guys didn't like it the last time I did this and you are not going to like this either, but while I am thinking of solutions for posting closest individual IBS neighbors, here's another go at which reference populations have the best median IBS matches with you.

I used Reference 3 with about 100,000 SNPs for this IBS run.

Go to the spreadsheet and click on your ID in the column headers to sort by your similarity to the different reference populations.

UPDATE: I have added a transposed spreadsheet too, at Onur's request, so that you can sort which Harappa participant has higher or lower scores with a specific reference population.

Reference 3 Admixture Data

Onur asked:

BTW, Zack, are you planning to publish (in this blog or in a medium like Rapidshare) the ADMIXTURE results of the reference populations on an individual by individual basis like Dienekes?

So, I have uploaded a zip file which contains admixture results from K=2 to K=17 for all individual samples in the Reference 3 dataset. Do note that K=14 had the lowest cross-validation error and I actually prefer even lower values of K.

I have plotted the population averages on the blog but this contains the individual level data. There are two files for each value of K. One is ref3.K.Q which has the admixture proportions for each individual. The other is ref3.K.F which has the allele frequencies for the inferred ancestral components. I haven't been able to look at the allele frequency files at all, so if you find anything interesting there, do let me know.

There is also an info file (ref3_info.csv) in the archive which has the information about the samples in the same order as their results are listed in the admixture output.
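Since the rows of a Q file and the rows of the info file are in the same order, the two can be paired line by line to get population averages. Below is a minimal Python sketch of that pairing. It assumes the Q file is whitespace-separated with one row per individual and one column per ancestral component, and that the info file has a header row; the column name `population` is a guess and is not taken from ref3_info.csv, and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict


def population_averages(q_lines, info_lines, pop_column="population"):
    """Average admixture proportions per population.

    q_lines: iterable of Q-file lines (K whitespace-separated proportions).
    info_lines: iterable of CSV lines with a header, in the same sample order.
    pop_column: name of the population column (hypothetical default).
    """
    info_rows = list(csv.DictReader(info_lines))
    q_rows = [[float(x) for x in line.split()] for line in q_lines if line.strip()]
    if len(q_rows) != len(info_rows):
        raise ValueError("Q file and info file must list the same samples")
    grouped = defaultdict(list)
    for info, proportions in zip(info_rows, q_rows):
        grouped[info[pop_column]].append(proportions)
    # Column-wise mean within each population.
    return {
        pop: [sum(col) / len(rows) for col in zip(*rows)]
        for pop, rows in grouped.items()
    }


# Made-up stand-ins for a K=3 Q file and the info file.
q_text = "0.9 0.1 0.0\n0.7 0.2 0.1\n0.0 0.5 0.5\n"
info_text = "id,population\nS1,PopA\nS2,PopA\nS3,PopB\n"
averages = population_averages(io.StringIO(q_text), io.StringIO(info_text))
for pop, means in averages.items():
    print(pop, [round(m, 3) for m in means])
```

With the real archive you would pass open file handles instead of the `io.StringIO` objects and set `pop_column` to whatever the info file actually calls that column.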

Reference 3 Admixture K=14

Continuing with the admixture analysis with our new reference 3 dataset.

Here's the results spreadsheet for K=14.

You can click on the legend to the right of the bar chart to sort by different ancestral components.

This one I am going to classify as a bad run. The East Asian splits are weird.

Fst divergences between estimated populations for K=14 in the form of an MDS plot.

And the numbers:
      C1     C2     C3     C4     C5     C6     C7     C8     C9     C10    C11    C12    C13
C2    0.109
C3    0.110  0.160
C4    0.239  0.264  0.247
C5    0.107  0.080  0.161  0.267
C6    0.116  0.111  0.176  0.284  0.102
C7    0.132  0.180  0.092  0.265  0.176  0.195
C8    0.189  0.237  0.214  0.335  0.239  0.251  0.237
C9    0.178  0.206  0.154  0.324  0.192  0.229  0.164  0.294
C10   0.217  0.246  0.191  0.373  0.242  0.262  0.229  0.338  0.285
C11   0.209  0.220  0.248  0.350  0.230  0.223  0.272  0.314  0.312  0.344
C12   0.266  0.278  0.307  0.417  0.286  0.281  0.333  0.373  0.374  0.406  0.179
C13   0.143  0.143  0.186  0.287  0.149  0.135  0.209  0.254  0.247  0.278  0.117  0.177
C14   0.364  0.368  0.410  0.528  0.372  0.377  0.437  0.490  0.481  0.514  0.334  0.359  0.283
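The Fst numbers are given as a lower triangle: a header row of component labels, then one row per component listing its divergence from each earlier component. If you want to work with them, here is a small Python sketch that reads that layout into a symmetric lookup and picks out the closest and most divergent pairs. The function name is made up, and the excerpt contains only the first three rows of the K=14 table above.

```python
def parse_lower_triangle(text):
    """Read a lower-triangular table into {frozenset({row, col}): value}."""
    lines = [line.split() for line in text.strip().splitlines()]
    columns = lines[0]  # header row: labels of the columns
    fst = {}
    for row in lines[1:]:
        label, values = row[0], row[1:]
        # The i-th value in a row belongs to the i-th header label.
        for column, value in zip(columns, values):
            fst[frozenset((label, column))] = float(value)
    return fst


# First three rows of the K=14 table.
excerpt = """\
C1 C2 C3
C2 0.109
C3 0.110 0.160
C4 0.239 0.264 0.247
"""
fst = parse_lower_triangle(excerpt)
closest = min(fst, key=fst.get)
farthest = max(fst, key=fst.get)
print(sorted(closest), fst[closest])
print(sorted(farthest), fst[farthest])
```

Keying on an unordered pair means `fst[frozenset(("C4", "C2"))]` and `fst[frozenset(("C2", "C4"))]` return the same value, which is what you want for a symmetric distance.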

This is the last plot I am posting in this series of admixture runs since the cross-validation error is minimized at K=14.

For some reason, Admixture starts acting weird at values of K higher than about 14-15.

Reference 3 Admixture K=13

Continuing with the admixture analysis with our new reference 3 dataset.

Here's the results spreadsheet for K=13.

You can click on the legend to the right of the bar chart to sort by different ancestral components.

The Hadza were expected to split but I thought the San/Pygmy would split first.

Fst divergences between estimated populations for K=13 in the form of an MDS plot.

And the numbers:
      C1     C2     C3     C4     C5     C6     C7     C8     C9     C10    C11    C12
C2    0.093
C3    0.098  0.141
C4    0.179  0.212  0.192
C5    0.100  0.056  0.150  0.224
C6    0.112  0.149  0.075  0.210  0.153
C7    0.109  0.062  0.161  0.234  0.072  0.170
C8    0.181  0.222  0.208  0.279  0.232  0.226  0.239
C9    0.198  0.202  0.239  0.308  0.217  0.254  0.208  0.306
C10   0.164  0.186  0.145  0.276  0.184  0.146  0.217  0.290  0.303
C11   0.320  0.318  0.365  0.443  0.336  0.381  0.325  0.444  0.284  0.437
C12   0.261  0.263  0.302  0.377  0.277  0.318  0.270  0.371  0.153  0.370  0.278
C13   0.137  0.124  0.180  0.248  0.138  0.193  0.121  0.250  0.088  0.241  0.288  0.163