Admixture: Reference Population

For regular admixture analysis, I am using the HapMap, HGDP, SGVP and Behar datasets, with some samples removed as I described earlier.

For each of these datasets,

  1. I first filtered to keep only the SNPs present on the 23andMe v2 chip:
    plink --bfile data --extract 23andmev2.snplist
  2. I also filtered for founders:
    plink --bfile data --filter-founders
  3. And excluded SNPs with missing rates greater than 1%:
    plink --bfile data --geno 0.01
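
To make the 1% cutoff concrete, here is a minimal Python sketch of what a per-SNP missingness filter like --geno 0.01 does. It is only an illustration, not PLINK's implementation: the function names and the in-memory genotype layout (one list per sample, None for a missing call) are assumptions made for this example.

```python
def snp_missing_rates(genotypes):
    """Return the fraction of samples with a missing call for each SNP.

    genotypes: list of per-sample lists (hypothetical layout); each entry
    is an allele count (0, 1, 2) or None when the call is missing.
    """
    n_samples = len(genotypes)
    n_snps = len(genotypes[0])
    rates = []
    for j in range(n_snps):
        missing = sum(1 for sample in genotypes if sample[j] is None)
        rates.append(missing / n_samples)
    return rates


def keep_snps(genotypes, max_missing=0.01):
    """Return indices of SNPs whose missing rate does not exceed max_missing."""
    rates = snp_missing_rates(genotypes)
    return [j for j, rate in enumerate(rates) if rate <= max_missing]
```

With four samples and one missing call at a SNP, that SNP's missing rate is 25%, so it is dropped at the 1% threshold. In a real dataset with thousands of samples, the cutoff tolerates a handful of missing calls per SNP.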

Then, I merged the datasets one by one. The reason for doing it one by one was that there were conflicts of strand orientation (forward or reverse) between the different datasets. If the merge operation gave an error, I had to flip those strands in one dataset and try the merge again.

plink --bfile data1 --bmerge data2.bed data2.bim data2.fam --make-bed
plink --bfile data2 --flip plink.missnp --make-bed --out data2flip
plink --bfile data1 --bmerge data2flip.bed data2flip.bim data2flip.fam --make-bed
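
The logic behind the flip step can be sketched in a few lines of Python. This is an illustrative sketch, not what PLINK runs internally: the names classify_snp and flip are made up for this example. It compares the two alleles reported for the same SNP in two datasets and decides whether they agree, agree only after complementing the strand, or cannot be reconciled. Note that A/T and C/G SNPs look identical on both strands, so a strand conflict at those SNPs cannot be detected from the alleles alone.

```python
# Watson-Crick complements used to move alleles to the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip(alleles):
    """Return the allele set as it would be read on the opposite strand."""
    return frozenset(COMPLEMENT[a] for a in alleles)


def classify_snp(alleles_1, alleles_2):
    """Compare one SNP's alleles in two datasets.

    Returns 'match', 'flip', 'ambiguous' (A/T or C/G, strand cannot be
    inferred from the alleles) or 'mismatch' (not fixable by flipping).
    """
    a1, a2 = frozenset(alleles_1), frozenset(alleles_2)
    if a1 == a2:
        # Palindromic SNPs read the same on both strands.
        return "ambiguous" if a1 == flip(a1) else "match"
    if flip(a1) == a2:
        return "flip"
    return "mismatch"
```

For example, classify_snp("AG", "TC") reports a flip, which is the kind of SNP that ends up in the .missnp list after a failed merge and is fixed by --flip.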

Once all four datasets were merged, I processed the combined data file:

  1. Removed SNPs with a missing rate of more than 1% in the combined dataset
    plink --bfile data --geno 0.01
  2. Then I performed linkage disequilibrium based pruning using a window size of 50 SNPs, a step of 5 SNPs and an r^2 threshold of 0.3:
    plink --bfile data --indep-pairwise 50 5 0.3
    plink --bfile data --extract plink.prune.in --make-bed
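
For readers who want to see what the three numbers control, here is a simplified Python sketch of window-based pairwise pruning. It is an assumption-laden illustration, not PLINK's algorithm: r^2 is taken as the squared Pearson correlation of allele counts, the later SNP of a correlated pair is always the one dropped, and the function names and input layout are invented for this example.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two allele-count vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic SNP carries no correlation signal
    return cov * cov / (var_x * var_y)


def prune(snps, window=50, step=5, threshold=0.3):
    """Return indices of SNPs kept after simplified pairwise LD pruning.

    snps: list of per-SNP allele-count vectors in chromosome order
    (hypothetical layout, no missing calls).
    """
    removed = set()
    start = 0
    while start < len(snps):
        end = min(start + window, len(snps))
        idx = [i for i in range(start, end) if i not in removed]
        for pos, i in enumerate(idx):
            if i in removed:
                continue
            for j in idx[pos + 1:]:
                if j not in removed and r_squared(snps[i], snps[j]) > threshold:
                    removed.add(j)  # simplification: always drop the later SNP
        start += step  # slide the window forward by `step` SNPs
    return [i for i in range(len(snps)) if i not in removed]
```

A lower threshold prunes more aggressively, and a larger window catches correlations between SNPs that sit further apart.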

This gave me a reference population of 2,654 individuals (2,693 before the removal described in the update below), with each sample having about 186,000 SNPs. Of these 2,654 individuals, 398 are South Asians belonging to 16 ethnic groups.

Finally, it's time to start having some fun!

UPDATE: I removed 39 Pygmy and San samples because they were causing some trouble with the African ancestral components. Since we are not interested in detailed African ancestry, and African admixture among South Asians is not likely to be Pygmy or San, I decided it would be best to remove them.
