Admixture: Choice of K

Posted by Zack on March 24, 2011

Admixture lets you choose the number of ancestral populations, K. This number matters a great deal, and in many cases we do not know how many ancestral populations our samples descend from. The Admixture manual advises:

Use ADMIXTURE's cross-validation procedure. A good value of K will exhibit a low cross-validation error compared to other K values. Cross-validation is enabled by simply adding the --cv flag to the ADMIXTURE command line. In this default setting, the cross-validation procedure will do 10 repetitions, each time holding out 10% of the genotypes at random.
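
As a rough sketch of how the advice above could be scripted (this is not the code used for this post), the Python below builds an ADMIXTURE command line with the --cv flag and pulls the cross-validation error out of the log text. It assumes the log contains summary lines of the form "CV error (K=3): 0.52305". The function names and the sample log values are made up for illustration.

```python
import re

# Assumed format of the summary line ADMIXTURE prints when run with --cv,
# e.g. "CV error (K=3): 0.52305".
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.eE+-]+)")


def admixture_cv_command(bed_path, k):
    """Return the argument list for one ADMIXTURE run with cross-validation."""
    return ["admixture", "--cv", bed_path, str(k)]


def parse_cv_errors(log_text):
    """Map each K found in the log text to its reported CV error."""
    errors = {}
    for match in CV_LINE.finditer(log_text):
        errors[int(match.group(1))] = float(match.group(2))
    return errors


def best_k(errors):
    """Return the K with the lowest CV error among the values that were run."""
    return min(errors, key=errors.get)


# Hypothetical log excerpt, not real results.
sample_log = "CV error (K=2): 0.55\nCV error (K=3): 0.52\nCV error (K=4): 0.53\n"
print(best_k(parse_cv_errors(sample_log)))
```

The command list can be passed to subprocess.run with its output captured, so that each K gets its own log to parse.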

I prefer this approach to using the BIC (Bayesian Information Criterion), but I am plotting all of these measures for the various values of K below.

For our Reference I dataset, which I have used for most of the analysis so far, here is the spreadsheet of log likelihood, BIC, AIC (Akaike Information Criterion) and CV (cross-validation) error. The plots follow.
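
For readers who want to reproduce the AIC and BIC columns from a log likelihood, here is a minimal Python sketch of the standard formulas. How the parameters and observations were counted for the spreadsheet is not stated in this post, so admixture_param_count is only one plausible convention (K-1 free ancestry fractions per individual plus K allele frequencies per SNP). It is labeled as an assumption, as are all the function names.

```python
import math


def aic(log_likelihood, num_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * num_params - 2 * log_likelihood


def bic(log_likelihood, num_params, num_observations):
    """Bayesian Information Criterion: k ln(n) - 2 ln L."""
    return num_params * math.log(num_observations) - 2 * log_likelihood


def admixture_param_count(n_individuals, n_snps, k):
    """ASSUMPTION: one way to count free parameters in the admixture model.

    Each individual's K ancestry fractions sum to 1 (K-1 free values),
    and each SNP has one allele frequency per ancestral population.
    """
    return n_individuals * (k - 1) + n_snps * k
```

Lower values are better for both criteria. BIC penalizes extra parameters more heavily than AIC whenever ln(n) is greater than 2.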

Using the cross-validation error, the best value of K so far is 17, which is also the largest I have run. It now takes days to run Admixture with cross-validation, which almost doubles the run time.

For Reference II, here are the spreadsheet and graphs.

The cross-validation error is lowest at K=16, which is the highest I have run, so it is likely to decrease further for higher K.
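
When the lowest error sits at the largest K tried, the true minimum has not been bracketed yet. A small Python check for that situation might look like the sketch below. The function name and the example values are hypothetical, not numbers from the spreadsheets.

```python
def minimum_at_upper_edge(cv_errors):
    """Return True if the lowest CV error occurs at the largest K that was run.

    cv_errors maps each K (int) to its cross-validation error (float).
    """
    best = min(cv_errors, key=cv_errors.get)
    return best == max(cv_errors)


# Hypothetical values: the error is still falling at the last K, so more runs are needed.
still_falling = {14: 0.512, 15: 0.510, 16: 0.509}
print(minimum_at_upper_edge(still_falling))
```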

Admixture, ancestry

2 Comments.

Simranjits March 24, 2011 at 2:30 am

Just out of interest, could you share what kind of hardware you are running this on? Given that it now takes days to do K=17, I'm interested to know.
- Zack March 24, 2011 at 5:46 am
  
  Intel Core 2 Duo 2.13 GHz with 6 GB RAM.
  
  Most of the time the long admixture runs are on only one core, so I can do other stuff.

Harappa Ancestry Project

Genetics and South Asia