Tag Archives: data

Reference I: Eurasian Subsets

Posted by Zack on February 21, 2011 7 comments

So far, none of the Harappa participants have African admixture except HRP0001 (me) and HRP0027 (Caribbean Indian). Since African populations are also the most diverse, it's best to remove them from our Reference I dataset and do some analysis using the Eurasian subset.

One option is to exclude the 517 samples from sub-Saharan African populations in our dataset:

Bantu Kenya: 11
Bantu South Africa: 8
Ethiopian Jews: 12
Ethiopians: 19
Kenyan Luhya: 101
Maasai: 135
Mandenka: 22
African Americans: 48
Yoruba: 161

However, in addition to the above, I decided to remove anyone from the Reference I dataset who had more than x% African ancestry (the sum of the East African, East African Bantu and West African components) in the K=12 admixture run. I created two Eurasian datasets: Eurasian90 and Eurasian95.

Eurasian90 excludes all samples with more than 10% African admixture. That completely removes the following populations in addition to the above:

Egyptians: 12
Moroccans: 10
Mozabite: 29

Also, some samples from the following populations were removed for Eurasian90:

Balochi: 3/24
Bedouin: 19/46
Brahui: 2/25
Iranians: 3/19
Jordanians: 6/20
Lebanese: 2/7
Makrani: 3/25
Palestinian: 10/46
Saudis: 2/20
Sindhi: 2/24
Syrians: 2/16
Yemenese: 7/8

That's a total of 629 samples in the Reference I dataset that had more than 10% African admixture. Thus Eurasian90 has 2,025 samples. The complete list is here.
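As a sanity check on the tallies above, here is a small Python sketch that adds up the removed samples. The counts are copied from the lists in this post; the variable names are mine, and the implied size of Reference I is just the removed samples plus the 2,025 that remain.

```python
# Populations removed completely (sub-Saharan African list above).
fully_removed_african = {
    "Bantu Kenya": 11, "Bantu South Africa": 8, "Ethiopian Jews": 12,
    "Ethiopians": 19, "Kenyan Luhya": 101, "Maasai": 135,
    "Mandenka": 22, "African Americans": 48, "Yoruba": 161,
}

# Populations removed completely by the 10% cutoff.
fully_removed_by_cutoff = {"Egyptians": 12, "Moroccans": 10, "Mozabite": 29}

# Populations that lost only some samples (number removed, not the total).
partially_removed = {
    "Balochi": 3, "Bedouin": 19, "Brahui": 2, "Iranians": 3,
    "Jordanians": 6, "Lebanese": 2, "Makrani": 3, "Palestinian": 10,
    "Saudis": 2, "Sindhi": 2, "Syrians": 2, "Yemenese": 7,
}

african_total = sum(fully_removed_african.values())
removed_total = (
    african_total
    + sum(fully_removed_by_cutoff.values())
    + sum(partially_removed.values())
)
eurasian90_size = 2025
implied_reference_size = removed_total + eurasian90_size

print(african_total)           # 517
print(removed_total)           # 629
print(implied_reference_size)  # 2654
```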

The other dataset, Eurasian95, excludes everyone with more than 5% African admixture. Thus, in addition to the samples listed above, it excludes the following:

Balochi: 1
Bedouin: 19
Brahui: 1
Druze: 1
Iranians: 1
Jordanians: 14 (completely removed)
Makrani: 8
Morocco Jews: 2
Palestinian: 36 (completely removed)
Saudis: 16
Sindhi: 2
Syrians: 7
Yemenese: 1 (completely removed)
Yemen Jews: 15 (completely removed)

Eurasian95 is thus left with 1,901 samples, whose breakdown is listed here.

I'll be experimenting with both Eurasian90 and Eurasian95.
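The cutoff rule itself is simple enough to sketch in Python. This is only an illustration, not the script I actually ran: it assumes each sample's K=12 admixture proportions are available as a dictionary of fractions, and the sample IDs, the non-African component names and all the values below are made up.

```python
# The three African components summed for the cutoff (names from the post).
AFRICAN_COMPONENTS = ("East African", "East African Bantu", "West African")


def african_fraction(proportions):
    """Sum the African components of one sample's admixture proportions."""
    return sum(proportions.get(name, 0.0) for name in AFRICAN_COMPONENTS)


def eurasian_subset(samples, max_african):
    """Keep sample IDs whose African fraction is not more than max_african."""
    return [
        sample_id
        for sample_id, proportions in samples.items()
        if african_fraction(proportions) <= max_african
    ]


# Hypothetical samples: IDs, extra component names and values are invented.
samples = {
    "sample_a": {"East African": 0.00, "East African Bantu": 0.01,
                 "West African": 0.01, "Other": 0.98},
    "sample_b": {"East African": 0.04, "East African Bantu": 0.01,
                 "West African": 0.02, "Other": 0.93},
    "sample_c": {"East African": 0.08, "East African Bantu": 0.02,
                 "West African": 0.05, "Other": 0.85},
}

eurasian90 = eurasian_subset(samples, 0.10)  # drops more than 10% African
eurasian95 = eurasian_subset(samples, 0.05)  # drops more than 5% African

print(eurasian90)  # ['sample_a', 'sample_b']
print(eurasian95)  # ['sample_a']
```

Because the 5% cutoff is stricter, Eurasian95 is always a subset of Eurasian90.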

Reference Dataset II

Posted by Zack on January 30, 2011 9 comments

Combining my reference population with the Xing et al. data gets me ~~3,222~~ 3,161 samples, but with only about 23,000 SNPs after LD-pruning.

The good thing is that this dataset has 544 South Asian samples from 24 ethnic groups. So it'll be useful for some analyses despite the low number of SNPs. I'll try to run parallel analyses on my reference population and this dataset so we can compare the pros and cons of both.

UPDATE: I removed 61 pygmy and San samples.

HGDP

Posted by Zack on January 25, 2011 3 comments

The Human Genome Diversity Project (HGDP) is the best resource for a diverse set of genomic data. It has 1,050 individuals from 52 different populations.

I got the Stanford University data, which has genotypes for 660,918 SNPs from 1,043 samples. The data is claimed to be on the forward strand, but that turned out not to be true, so I had to flip strands and make sure I didn't include any ambiguous A/T or C/G SNPs in my dataset.
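To show why A/T and C/G SNPs have to go, here is a minimal Python sketch of the strand logic. It is not the tool I used, just the idea: the function names are made up, and it assumes each SNP is a pair of alleles that can be compared against the alleles of a reference on the desired strand.

```python
# Watson-Crick complements, used to flip an allele to the other strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1, allele2):
    """A/T and C/G SNPs read the same on both strands, so a flip is undetectable."""
    return COMPLEMENT[allele1] == allele2


def flip_strand(allele1, allele2):
    """Return the same SNP as read from the opposite strand."""
    return COMPLEMENT[allele1], COMPLEMENT[allele2]


def harmonize(alleles, reference_alleles):
    """Return alleles on the reference strand, or None if the SNP must be dropped."""
    if is_strand_ambiguous(*alleles):
        return None
    if set(alleles) == set(reference_alleles):
        return alleles
    flipped = flip_strand(*alleles)
    if set(flipped) == set(reference_alleles):
        return flipped
    return None  # alleles do not match the reference on either strand


print(harmonize(("A", "G"), ("A", "G")))  # ('A', 'G')  already matches
print(harmonize(("T", "C"), ("A", "G")))  # ('A', 'G')  matches after flipping
print(harmonize(("A", "T"), ("A", "T")))  # None        ambiguous, dropped
```

An A/G SNP on the wrong strand shows up as T/C, which is easy to spot and flip. An A/T SNP on the wrong strand still shows up as A/T, so there is no way to tell from the alleles alone.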

I followed the recommendations of Rosenberg (spreadsheet) in excluding some atypical samples and relatives, leaving me with 940 samples.

I also excluded the Native American samples because we are not interested in them and they are very closely related either due to recent endogamy or ancient bottlenecks. (yeah I had the nerve to write that.)

Of the remaining 876 samples, here are the numbers for our populations of interest:

Total South Asians	190
Balochi	24
Brahui	25
Burusho	25
Hazara	22
Kalash	23
Makrani	25
Pathan	22
Sindhi	24

These samples have about 541,560 SNPs in common with 23andMe v2.
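Counting common SNPs comes down to intersecting two sets of SNP IDs. Here is a rough Python sketch, assuming the reference data is in PLINK format (the variant ID is the second column of the .bim file) and the 23andMe raw data is tab-separated with the rsid in the first column and comment lines starting with "#". The function names and the few example lines are invented for illustration.

```python
def bim_snp_ids(bim_lines):
    """Variant IDs from PLINK .bim lines (second whitespace-separated column)."""
    return {line.split()[1] for line in bim_lines if line.strip()}


def raw_23andme_snp_ids(raw_lines):
    """rsids from 23andMe raw-data lines, skipping blank and '#' comment lines."""
    return {
        line.split("\t")[0]
        for line in raw_lines
        if line.strip() and not line.startswith("#")
    }


# Invented example lines; real files have hundreds of thousands of rows.
bim_lines = [
    "1\trs0000001\t0\t1000\tA\tG",
    "1\trs0000002\t0\t2000\tC\tT",
    "2\trs0000003\t0\t3000\tG\tT",
]
raw_lines = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs0000001\t1\t1000\tAG",
    "rs0000003\t2\t3000\tGG",
    "rs0000004\t2\t4000\tCT",
]

common = bim_snp_ids(bim_lines) & raw_23andme_snp_ids(raw_lines)
print(sorted(common))  # ['rs0000001', 'rs0000003']
```

Matching on IDs alone assumes both files name the same SNP the same way; a stricter version would also compare chromosome and position.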

HapMap

Posted by Zack on January 24, 2011 7 comments

I am using several datasets in the public domain for my reference population samples. HapMap is one of those datasets.

According to its website,

The goal of the International HapMap Project is to develop a haplotype map of the human genome, the HapMap, which will describe the common patterns of human DNA sequence variation. The HapMap is expected to be a key resource for researchers to use to find genes affecting health, disease, and responses to drugs and environmental factors. The information produced by the Project will be made freely available.

In the first phase, it genotyped

30 Yoruba adult-and-both-parents trios from Ibadan, Nigeria, 30 trios of U.S. (Utah) residents of northern and western European ancestry, 44 unrelated individuals from Tokyo, Japan and 45 unrelated Han Chinese individuals from Beijing, China.

In their HapMap phase 3 release #3 (NCBI build 36, dbSNP b126), there are 1,397 samples with about 1,457,897 SNPs each.

I removed related individuals as well as individuals whose genomes were too similar. This left me with a total of 1,149 samples with about 474,606 SNPs in common with 23andme's version 2 data.

Since we are not interested in Native American ancestry, I also removed 58 Mexican samples, thus leaving me with 1,091 samples.

Here are the samples I am using from the HapMap data:

Ethnicity	Region	Count
African Americans	Africa	48
European Americans (Utahns)	Europe	111
Han Chinese	East Asia	137
US Chinese	East Asia	106
Gujaratis	South Asia	98
Japanese	East Asia	113
Kenyan Luhya	East Africa	101
Maasai	East Africa	135
Tuscans	Europe	102
Yoruba	West Africa	140

The region assignments are mine to aid me in the analysis, by including/excluding samples by region or by aggregating results by region to find patterns etc.
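Here is a short Python sketch of what the region labels make possible. The rows are copied from the table above; the helper names are made up for this example.

```python
from collections import defaultdict

# (ethnicity, region, count) rows copied from the HapMap table above.
hapmap_samples = [
    ("African Americans", "Africa", 48),
    ("European Americans (Utahns)", "Europe", 111),
    ("Han Chinese", "East Asia", 137),
    ("US Chinese", "East Asia", 106),
    ("Gujaratis", "South Asia", 98),
    ("Japanese", "East Asia", 113),
    ("Kenyan Luhya", "East Africa", 101),
    ("Maasai", "East Africa", 135),
    ("Tuscans", "Europe", 102),
    ("Yoruba", "West Africa", 140),
]


def counts_by_region(rows):
    """Aggregate sample counts per region."""
    totals = defaultdict(int)
    for _, region, count in rows:
        totals[region] += count
    return dict(totals)


def exclude_regions(rows, excluded):
    """Drop every population whose region is in the excluded set."""
    return [row for row in rows if row[1] not in excluded]


totals = counts_by_region(hapmap_samples)
print(totals["East Asia"])   # 356
print(sum(totals.values()))  # 1091

non_african = exclude_regions(
    hapmap_samples, {"Africa", "East Africa", "West Africa"}
)
print(sum(count for _, _, count in non_african))  # 667
```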

It was easiest to use the HapMap data since it's available for download in Plink format.

Harappa Ancestry Project

Genetics and South Asia
