I am using several datasets in the public domain for my reference population samples. HapMap is one of those datasets.

According to its website,

"The goal of the International HapMap Project is to develop a haplotype map of the human genome, the HapMap, which will describe the common patterns of human DNA sequence variation. The HapMap is expected to be a key resource for researchers to use to find genes affecting health, disease, and responses to drugs and environmental factors. The information produced by the Project will be made freely available."

In the first phase, it genotyped

"30 Yoruba adult-and-both-parents trios from Ibadan, Nigeria, 30 trios of U.S. (Utah) residents of northern and western European ancestry, 44 unrelated individuals from Tokyo, Japan and 45 unrelated Han Chinese individuals from Beijing, China."

The HapMap phase 3 release #3 (NCBI build 36, dbSNP b126) contains 1,397 samples with about 1,457,897 SNPs each.

I removed related individuals as well as individuals whose genomes were too similar to one another. This left me with a total of 1,149 samples and about 474,606 SNPs in common with 23andMe's version 2 data.
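The post does not say how the SNP overlap was computed, so the following is only a minimal sketch of one way to do it. It assumes the HapMap markers are read from a PLINK .map file (whitespace-separated: chromosome, SNP ID, genetic distance, position) and the 23andMe markers from a raw-data export (tab-separated, SNP ID in the first column, comment lines starting with `#`). The function names are hypothetical.

```python
def map_snp_ids(map_lines):
    """Collect SNP IDs from PLINK .map lines (the ID is the second column)."""
    ids = set()
    for line in map_lines:
        fields = line.split()
        if len(fields) >= 2:
            ids.add(fields[1])
    return ids


def raw_snp_ids(raw_lines):
    """Collect SNP IDs from 23andMe-style raw data lines (first column)."""
    ids = set()
    for line in raw_lines:
        # Skip header comments and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        ids.add(line.split()[0])
    return ids


def common_snps(map_lines, raw_lines):
    """Return the sorted SNP IDs present in both inputs."""
    return sorted(map_snp_ids(map_lines) & raw_snp_ids(raw_lines))


# Toy data for illustration only, not real HapMap or 23andMe records.
toy_map = ["1\trs1\t0\t100", "1\trs2\t0\t200", "2\trs3\t0\t300"]
toy_raw = ["# rsid\tchromosome\tposition\tgenotype",
           "rs2\t1\t200\tAG", "rs3\t2\t300\tCC", "rs9\t3\t400\tTT"]
print(common_snps(toy_map, toy_raw))  # prints ['rs2', 'rs3']
```

Written one ID per line, a list like this can be passed to PLINK's `--extract` option to keep only the shared SNPs.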

Since we are not interested in Native American ancestry, I also removed 58 Mexican samples, thus leaving me with 1,091 samples.

Here are the samples I am using from the HapMap data:

Ethnicity                     Region       Count
African Americans             Africa          48
European Americans (Utahns)   Europe         111
Han Chinese                   East Asia      137
US Chinese                    East Asia      106
Gujaratis                     South Asia      98
Japanese                      East Asia      113
Kenyan Luhya                  East Africa    101
Maasai                        East Africa    135
Tuscans                       Europe         102
Yoruba                        West Africa    140

The region assignments are my own. They help with the analysis by letting me include or exclude samples by region, or aggregate results by region to look for patterns.
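As an illustration of how such region labels can be used, here is a small sketch that totals the sample counts by region and selects populations by region. The counts are copied from the table above; the function names are hypothetical and this is not necessarily how the analysis itself is organized.

```python
from collections import defaultdict

# (population, region, sample count), copied from the table above.
SAMPLES = [
    ("African Americans", "Africa", 48),
    ("European Americans (Utahns)", "Europe", 111),
    ("Han Chinese", "East Asia", 137),
    ("US Chinese", "East Asia", 106),
    ("Gujaratis", "South Asia", 98),
    ("Japanese", "East Asia", 113),
    ("Kenyan Luhya", "East Africa", 101),
    ("Maasai", "East Africa", 135),
    ("Tuscans", "Europe", 102),
    ("Yoruba", "West Africa", 140),
]


def count_by_region(samples):
    """Sum the sample counts for each region label."""
    totals = defaultdict(int)
    for _population, region, count in samples:
        totals[region] += count
    return dict(totals)


def select_regions(samples, include):
    """Keep only the populations whose region is in `include`."""
    return [row for row in samples if row[1] in include]


region_totals = count_by_region(SAMPLES)
for region, total in sorted(region_totals.items()):
    print(f"{region}: {total}")
```

The per-region totals add up to the 1,091 samples retained, and `select_regions(SAMPLES, {"Europe", "South Asia"})` would return only the Utahn, Tuscan, and Gujarati rows.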

It was easiest to use the HapMap data since it's available for download in PLINK format.

