Tag Archives: data

Reference I: Eurasian Subsets

Since we have established that none of the Harappa participants so far have African admixture except for HRP0001 (me) and HRP0027 (Caribbean Indian) and African populations are the most diverse, it's best to remove the African populations from our Reference I dataset and do some analysis using the Eurasian subset.

One option is to exclude the 517 samples of sub-Saharan African populations in our dataset:

  • Bantu Keyna: 11
  • Bantu South Africa: 8
  • Ethiopian Jews: 12
  • Ethiopians: 19
  • Kenyan Luhya: 101
  • Maasai: 135
  • Mandenka: 22
  • African Americans: 48
  • Yoruba: 161

However, in addition to the above, I decided to remove anyone from the reference I dataset who had more than x% African ancestry (sum of East African, East African Bantu and West African) at K=12 admixture run. I created two Eurasian datasets: Eurasian90 and Eurasian95.

Eurasian90 excludes all samples with more than 10% African admixture. That completely removes the following populations in addition to the above:

  • Egyptians: 12
  • Moroccans: 10
  • Mozabite: 29

Also, some samples from the following populations were removed for Eurasian90:

  • Balochi: 3/24
  • Bedouin: 19/46
  • Brahui: 2/25
  • Iranians: 3/19
  • Jordanians: 6/20
  • Lebanese: 2/7
  • Makrani: 3/25
  • Palestinian: 10/46
  • Saudis: 2/20
  • Sindhi: 2/24
  • Syrians: 2/16
  • Yemense: 7/8

That's a total of 629 samples in Reference I dataset that had at least 10% African admixture. Thus Eurasian90 has 2,025 samples. The complete list is here.

The other dataset, Eurasian95 excludes everyone with more than 5% African admixture. Thus in addition to the samples listed above, it excludes the following:

  • Balochi: 1
  • Bedouin: 19
  • Brahui: 1
  • Druze: 1
  • Iranians: 1
  • Jordanians: 14 (completely removed)
  • Makrani: 8
  • Morocco Jews: 2
  • Palestinian: 36 (completely removed)
  • Saudis: 16
  • Sindhi: 2
  • Syrians: 7
  • Yemenese: 1 (completely removed)
  • Yemen Jews: 15 (completely removed)

Eurasian95 is thus left with 1,901 whose breakdown is listed here.

I'll be experimenting with both Eurasian90 and Eurasian95.

Reference Dataset II

Combining my reference population with Xing et al data gets me 3,222 3,161 samples but with only about 23,000 SNPs after LD-pruning.

The good thing is that this dataset has 544 South Asian samples from 24 ethnic groups. So it'll be useful for some analyses despite the low number of SNPs. I'll try to run parallel analyses on my reference population and this dataset so we can compare the pros and cons of both.

UPDATE: I removed 61 pygmy and San samples.


Human Genome Diversity Project (HGDP) is the best resource for a diverse set of genomic data. It has 1050 individuals from 52 different populations.

I got the Stanford University data which has data for 660,918 SNPs from 1,043 samples. It is claimed that the forward strand is given but that turned out not to be true and I had to flip strands and make sure I didn't include any ambiguous A/T or C/G strands in my dataset.

I followed the recommendations of Rosenberg (spreadsheet) in excluding some atypical samples and relatives, leaving me with 940 samples.

I also excluded the Native American samples because we are not interested in them and they are very closely related either due to recent endogamy or ancient bottlenecks. (yeah I had the nerve to write that.)

Of the total of 876 samples, here are the numbers for our populations of interest:

Balochi 24
Brahui 25
Burusho 25
Hazara 22
Kalash 23
Makrani 25
Pathan 22
Sindhi 24
Total South Asians 190

These samples have about 541,560 SNPs in common with 23andme v2.


I am using several datasets in the public domain for my reference population samples. HapMap is one of those datasets.

According to its website,

The goal of the International HapMap Project is to develop a haplotype map of the human genome, the HapMap, which will describe the common patterns of human DNA sequence variation. The HapMap is expected to be a key resource for researchers to use to find genes affecting health, disease, and responses to drugs and environmental factors. The information produced by the Project will be made freely available.

In the first phase, it genotyped

30 Yoruba adult-and-both-parents trios from Ibadan, Nigeria, 30 trios of U.S. (Utah) residents of northern and western European ancestry, 44 unrelated individuals from Tokyo, Japan and 45 unrelated Han Chinese individuals from Beijing, China.

In their HapMap phase 3 release #3 (NCBI build 36, dbSNP b126), there are 1,397 samples with about 1,457,897 SNPs each.

I removed related individuals as well as individuals whose genomes were too similar. This left me with a total of 1,149 samples with about 474,606 SNPs in common with 23andme's version 2 data.

Since we are not interested in Native American ancestry, I also removed 58 Mexican samples, thus leaving me with 1,091 samples.

Here are the samples I am using from the HapMap data:

Ethnicity Region Count
African Americans Africa 48
European Americans (Utahns) Europe 111
Han Chinese East Asia 137
US Chinese East Asia 106
Gujaratis South Asia 98
Japanese East Asia 113
Kenyan Luhya East Africa 101
Maasai East Africa 135
Tuscans Europe 102
Yoruba West Africa 140

The region assignments are mine to aid me in the analysis, by including/excluding samples by region or by aggregating results by region to find patterns etc.

It was easiest to use the HapMap data since it's available for download in Plink format.