Monthly Archives: May 2011 - Page 3

Reference I Admixture Errors

I have been thinking about error estimation for Admixture results for some time, since I have heard a lot of arguments that even a 0.1% result is significant. I was skeptical of that and have been rounding my admixture results to the nearest percent.

There was a memory leak issue in the bootstrapping code for admixture which crashed it every time I tried running it. I emailed David Alexander and he fixed it in version 1.12.

So I ran the default 200 bootstrap replicates to measure the standard error in our old Reference I K=12 admixture run. A spreadsheet with population-level results is here and participant results are here.
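To illustrate what bootstrap replicates measure, here is a minimal sketch in Python. It is a generic bootstrap on toy data, not ADMIXTURE's own procedure: the function name `bootstrap_se_and_bias`, the toy 0/1 data and the fixed seed are all assumptions made for this example. The standard error is the spread of the estimate across resampled datasets, and the bias is the difference between the average replicate and the original estimate.

```python
import random
import statistics


def bootstrap_se_and_bias(data, estimator, n_replicates=200, seed=0):
    """Generic bootstrap: resample with replacement and re-estimate each time.

    Hypothetical helper for illustration only; not part of ADMIXTURE.
    """
    rng = random.Random(seed)
    point_estimate = estimator(data)
    replicates = []
    for _ in range(n_replicates):
        # Draw a resample of the same size as the original data.
        resample = [rng.choice(data) for _ in data]
        replicates.append(estimator(resample))
    # Standard error: standard deviation of the replicate estimates.
    se = statistics.stdev(replicates)
    # Bias: mean of the replicate estimates minus the original estimate.
    bias = statistics.fmean(replicates) - point_estimate
    return point_estimate, se, bias


# Toy data (made up): 30 ones and 70 zeros, so the true proportion is 0.3.
toy = [1] * 30 + [0] * 70
est, se, bias = bootstrap_se_and_bias(toy, statistics.fmean)
print(f"estimate={est:.3f}  se={se:.3f}  bias={bias:+.3f}")
```

With 200 replicates the standard error itself is only a rough estimate, which is one more reason not to read too much into the last decimal place.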

Here are some statistics for the standard error estimates:

Component      Min.    1st Qu.  Median  Mean    3rd Qu.  Max.
C1 S Asian     0.00%   0.02%    0.33%   0.52%   0.96%    1.93%
C2 Blch/Cauc   0.00%   0.00%    1.02%   0.79%   1.45%    2.63%
C3 Kalash      0.00%   0.01%    0.40%   0.50%   0.99%    3.76%
C4 SE Asian    0.00%   0.09%    0.37%   0.60%   1.27%    1.92%
C5 SW Asian    0.00%   0.00%    0.60%   0.66%   1.28%    2.90%
C6 Euro        0.00%   0.00%    0.35%   0.56%   1.12%    1.82%
C7 Papuan      0.00%   0.07%    0.22%   0.23%   0.36%    1.08%
C8 NE Asian    0.00%   0.07%    0.36%   0.67%   1.36%    2.45%
C9 Siberian    0.00%   0.08%    0.37%   0.51%   0.82%    2.29%
C10 E Bantu    0.00%   0.00%    0.00%   0.35%   0.72%    1.93%
C11 W Afr      0.00%   0.00%    0.00%   0.28%   0.50%    1.51%
C12 E Afr      0.00%   0.00%    0.05%   0.31%   0.60%    1.79%

You can see the mean value of the standard errors per population and realize how many are over 1% (marked in red).

And statistics for bias estimates:

Component      Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
C1 S Asian     -1.104%   -0.031%    0.000%   -0.024%    0.075%    1.026%
C2 Blch/Cauc   -0.835%   -0.280%   -0.009%   -0.133%    0.000%    1.049%
C3 Kalash      -1.575%    0.000%    0.020%    0.076%    0.147%    0.615%
C4 SE Asian    -0.629%   -0.021%    0.011%    0.018%    0.087%    0.478%
C5 SW Asian    -0.691%   -0.094%    0.000%   -0.020%    0.035%    0.613%
C6 Euro        -0.572%   -0.086%    0.000%   -0.039%    0.004%    0.468%
C7 Papuan      -0.171%    0.008%    0.059%    0.070%    0.120%    0.312%
C8 NE Asian    -0.739%    0.000%    0.016%    0.034%    0.107%    0.679%
C9 Siberian    -1.044%    0.000%    0.015%    0.035%    0.103%    0.692%
C10 E Bantu    -0.412%    0.000%    0.000%   -0.007%    0.001%    0.370%
C11 W Afr      -0.261%    0.000%    0.000%    0.009%    0.005%    0.304%
C12 E Afr      -0.635%    0.000%    0.000%   -0.017%    0.010%    0.405%

You can also see the average value of the bias in each ancestral component for each population.

Since the bias is lower than the standard error and distributed around zero, a small percentage of an ancestral component is more likely to be real rather than noise when it shows up in a large number of samples from the same population group.
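The reasoning here can be made concrete with a short Python sketch. It assumes the errors of individual samples are independent and roughly equal in size, which is a simplification (all individuals in one run share the same estimated allele frequencies), and the function names and numbers are hypothetical. Under that assumption, the standard error of a group average shrinks with the square root of the group size.

```python
import math


def standard_error_of_group_mean(per_sample_se, n_samples):
    """Standard error of the mean of n independent estimates (assumed model)."""
    return per_sample_se / math.sqrt(n_samples)


def signal_to_noise(group_mean, per_sample_se, n_samples):
    """How many group-level standard errors the group mean is away from zero."""
    return group_mean / standard_error_of_group_mean(per_sample_se, n_samples)


# Made-up numbers: each individual estimate has a 0.5% standard error.
single = signal_to_noise(0.4, 0.5, 1)    # one sample showing 0.4%
group = signal_to_noise(0.4, 0.5, 25)    # 25 samples averaging 0.4%
print(f"one sample: {single:.1f} standard errors from zero")
print(f"25 samples: {group:.1f} standard errors from zero")
```

A 0.4% result in one individual sits within a single standard error of zero in this toy case, while the same average across 25 individuals is several standard errors away.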

Reference 3F(iltered) Admixture

I removed all American populations and San and Pygmy (i.e., South and Central African) from Reference 3 for a better focus on our target populations.

Here are the admixture results. You can choose the number of ancestral components, K, from the dropdown below.

K=13, 14, 15 (in that order) have the lowest cross-validation error.
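Ranking runs by cross-validation error can be sketched in Python as below. This assumes one log line per run of the form `CV error (K=3): 0.540`, which is my understanding of what ADMIXTURE prints when cross-validation is enabled; the function name `rank_k_by_cv_error` and the error values in the toy log are invented for illustration and are not the results of this run.

```python
import re

# Assumed log line format: "CV error (K=<number>): <error>"
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def rank_k_by_cv_error(log_text):
    """Return (K, cv_error) pairs sorted from lowest to highest error."""
    pairs = [(int(k), float(err)) for k, err in CV_LINE.findall(log_text)]
    return sorted(pairs, key=lambda pair: pair[1])


# Toy log with made-up values for three hypothetical runs.
toy_log = """CV error (K=2): 0.560
CV error (K=3): 0.540
CV error (K=4): 0.550
"""

ranked = rank_k_by_cv_error(toy_log)
for k, err in ranked:
    print(f"K={k}: {err}")
```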

There are a bunch of interesting results in there. For example, the split into northern and southern European, and the split of Siberian into Siberian and Russian Far East (or Bering Strait). However, the Onge component as a proxy of the ASI does not appear. Also, we don't get as much breakdown of the South Asian populations as we would like.

Harappa Nearest IBS Neighbors

After a long tease, here is the spreadsheet containing the top 500 nearest neighbors (using IBS similarity percentages) for the Harappa participants from HRP0001 to HRP0089.
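For readers who want to see how such a ranking works, here is a minimal Python sketch. It assumes the IBS similarities are available as a nested dictionary keyed by sample ID; the IDs and similarity values are toy placeholders, and `nearest_ibs_neighbors` is a hypothetical helper, not the function shipped with the data.

```python
def nearest_ibs_neighbors(ibs, sample_id, n=20):
    """Return the n samples most similar to sample_id, highest IBS first.

    `ibs` is assumed to map each sample ID to a dict of {other ID: similarity}.
    """
    row = ibs[sample_id]
    # Exclude the sample itself, then sort by similarity in descending order.
    others = [(other, sim) for other, sim in row.items() if other != sample_id]
    others.sort(key=lambda pair: pair[1], reverse=True)
    return others[:n]


# Toy similarity table with made-up IDs and values.
toy_ibs = {
    "toy_A": {"toy_A": 1.00, "toy_B": 0.74, "toy_C": 0.76, "toy_D": 0.71},
    "toy_B": {"toy_A": 0.74, "toy_B": 1.00, "toy_C": 0.73, "toy_D": 0.72},
}

top_two = nearest_ibs_neighbors(toy_ibs, "toy_A", n=2)
print(top_two)
```

Changing `n` controls how many neighbors are listed, in the same spirit as asking for the top 20 or top 50.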

I am also providing an R data object with the same data (except that it contains all 3,975 individuals from Reference 3 and Harappa). To use this data,

  1. Download R
  2. Install R on your computer
  3. When you start R, type

    to load the data

  4. Type

    to find the 20 closest IBS neighbors of HRP0001. You can use any of the Harappa IDs here.

  5. You can set the number of IBS neighbors (50, for example) to show using



Yesterday, we got to 100 participants in the Harappa Ancestry Project.

I made the project public on January 17, 2011. So, 100 submissions in 106 days. That's pretty good.

I am surprised at the speed and quantity of submissions. I probably have the largest dataset of South Asians right now.

Keep spreading the word and encouraging everyone to participate.

Accepting FTDNA Family Finder

In addition to 23andme data, I am now accepting the autosomal data from FTDNA Family Finder too.

This is due to the recent switch to Illumina Omni chip by FamilyTreeDNA which has a lot more markers in common with the 23andme data.

Since FTDNA is retesting all its current customers on the new chip, even if you tested with them earlier, you should have autosomal data from the new chip which you can download and email to me at

I am basically looking for participants who have at least some ancestry from the following countries/regions:

  • Afghanistan
  • Bangladesh
  • Bhutan
  • Burma
  • India
  • Iran
  • Maldives
  • Nepal
  • Pakistan
  • Sri Lanka
  • Tibet

But if you have ancestry from West or Central Asia or Caucasus, I am likely to accept your data too.

Details of participation are here.

April Update

I have a total of 97 participants in the project right now who have sent me their raw data. Six of those have relatives participating and thus have to be filtered out for most analyses, other than individual admixture percentages and the like, where I divide participants into small groups.

The following groups are represented:

Let's try to get to a hundred soon.

And yes, I am accepting FTDNA Family Finder (new Illumina chip) now.

Ref3 + Harappa Maps

More maps from The Jatt Gene using the Reference 3 and Harappa participants K=11 admixture results.

C1 South Asian Isopleth

C2 Onge Isopleth

C1 South Asian Choropleth at state/province level

C2 Onge Choropleth

As usual, Simranjit has more maps on his blog.