Tag Archives: pan-asian

Pan-Asian Admixture Results

Posted by Zack on April 3, 2012 1 comment

I ran Reference 3 based supervised ADMIXTURE on the HUGO Pan-Asian dataset. While it used only 5,400 SNPs, it did get me curious about a possible relationship between the Onge and the Jehai and Kensiu. Unfortunately, the Pan-Asian data doesn't have good SNP overlap even with the Reich et al dataset. So as a first exercise, I decided to run unsupervised ADMIXTURE on the Pan-Asian dataset by itself.

Here are the bar charts for the admixture results. K=12 ancestral components had the lowest cross-validation error.
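Picking K by cross-validation error can be scripted. The following is a minimal sketch, assuming each ADMIXTURE run was started with the --cv flag and its console output was saved as text. The "CV error (K=...)" line is what ADMIXTURE prints; the function name best_k and the way the logs are collected are hypothetical.

```python
import re

# Matches lines such as "CV error (K=12): 0.51234" in ADMIXTURE's console output.
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.eE+-]+)")


def best_k(log_texts):
    """Return (K, cv_error) for the lowest CV error found in the given log texts."""
    errors = {}
    for text in log_texts:
        for k, err in CV_LINE.findall(text):
            errors[int(k)] = float(err)
    if not errors:
        raise ValueError("no CV error lines found")
    k = min(errors, key=errors.get)
    return k, errors[k]
```

Feeding it the saved output of one run per K returns the K with the lowest error. The differences between neighboring K values are often small, so the error values themselves are worth a look too.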

You can see these results in a spreadsheet too.

Pan-Asian Ref3 K=11 Admixture

Posted by Zack on March 13, 2012 9 comments

The HUGO Pan-Asian dataset covers South and East Asia with the following South Asian populations:

23 Andhra Pradesh & Karnataka
10 Bengali
23 Bhil (Rajasthan)
20 Haryana
23 Kashmir Spiti
12 Marathi
12 Rajasthani
30 Singapore Indian
20 Uttaranchal
13 Uttar Pradesh

Unfortunately, they do not specify ethnic or caste background for most Indian groups. Instead, their focus is on Mongoloid/Caucasoid/Australoid etc.

Also, the SNP overlap with other datasets is really small. Therefore, this reference 3 admixture run was done using only 5,400 SNPs. I recommend a big bucket of salt when interpreting these results.
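The overlap between two datasets can be counted from their Plink .bim files, where the second whitespace-separated column is the SNP ID. This is only a sketch under that assumption; the function names and file paths are made up for illustration.

```python
def read_bim_snps(path):
    """Return the set of SNP IDs (second column) from a Plink .bim file."""
    snps = set()
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                snps.add(fields[1])
    return snps


def snp_overlap(bim_a, bim_b):
    """Return the SNP IDs present in both .bim files."""
    return read_bim_snps(bim_a) & read_bim_snps(bim_b)
```

Matching on IDs alone ignores strand flips and renamed SNPs, so the usable overlap may be somewhat different from len(snp_overlap(...)).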

Here is the spreadsheet with the Pan-Asian group averages for reference 3 admixture at K=11 ancestral components.
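Group averages like these are just per-population means of the individual admixture proportions. A minimal sketch, assuming the rows of ADMIXTURE's .Q output have already been read into lists of floats and that a population label is available for each individual in the same order (group_averages and its arguments are hypothetical names):

```python
from collections import defaultdict


def group_averages(q_rows, populations):
    """Average admixture proportions per population.

    q_rows: one list of K proportions per individual (a row of a .Q file).
    populations: the population label of each individual, in the same order.
    """
    if len(q_rows) != len(populations):
        raise ValueError("need one population label per individual")
    sums = {}
    counts = defaultdict(int)
    for row, pop in zip(q_rows, populations):
        if pop not in sums:
            sums[pop] = [0.0] * len(row)
        for i, value in enumerate(row):
            sums[pop][i] += value
        counts[pop] += 1
    return {pop: [s / counts[pop] for s in total] for pop, total in sums.items()}
```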

Relatives in Datasets

Posted by Zack on February 6, 2012 Comments Off

Recently, there was a paper "Identification of Close Relatives in the HUGO Pan-Asian SNP Database" by Xiong Yang, Shuhua Xu, and the HUGO Pan-Asian SNP Consortium.

three individuals involved in MZ pairs were excluded from the whole dataset to construct standardized subset PASNP1716; seventy-six individuals involved in first-degree relationships were excluded from PASNP1716 to construct standardized subset PASNP1640; and 57 individuals involved in second-degree relationships were excluded from PASNP1640 to construct standardized subset PASNP1583. The individuals excluded were summarized in Table S6, S7, S8.
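The subset names encode the sample counts left after each exclusion step. A quick check of the arithmetic, assuming the "whole dataset" is the 1,719 individuals in the Pan-Asian data (subset_sizes is a hypothetical helper):

```python
def subset_sizes(total, exclusions):
    """Return the number of individuals left after each successive exclusion."""
    sizes = []
    remaining = total
    for excluded in exclusions:
        remaining -= excluded
        sizes.append(remaining)
    return sizes


# MZ pairs, first-degree relatives, second-degree relatives
sizes = subset_sizes(1719, [3, 76, 57])
print(sizes)  # [1716, 1640, 1583]
```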

Let me engage in some blog triumphalism by saying I wrote about the duplicates and relatives in the Pan-Asian dataset in April 2011.

Here are my blog posts about relatedness in datasets:

Early on, I was removing only first degree relatives from the reference datasets. Nowadays, I try to remove all second degree relatives too. I leave the third degree relatives in the data since it's sometimes hard to figure out how real the low IBD values are in Plink. There are a lot of 3rd degree relatives if Plink is to be believed, but I am a little skeptical.
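As a rough sketch of how pairs could be binned by degree from Plink's --genome output: the column names IID1, IID2 and PI_HAT come from that output, but the cutoffs are an assumption (KING's kinship ranges doubled, since the expected PI_HAT is twice the kinship coefficient), and the function names are hypothetical.

```python
def relatedness_degree(pi_hat):
    """Bin a pair by PI_HAT, Plink's estimate of the proportion of IBD sharing.

    Cutoffs are an assumption: KING's kinship ranges multiplied by two.
    """
    if pi_hat > 0.708:
        return "duplicate/MZ twin"
    if pi_hat > 0.354:
        return "first degree"
    if pi_hat > 0.177:
        return "second degree"
    if pi_hat > 0.0884:
        return "third degree"
    return "unrelated"


def close_pairs(genome_lines,
                keep=("duplicate/MZ twin", "first degree", "second degree")):
    """Return (IID1, IID2, degree) for pairs whose degree is listed in keep."""
    lines = iter(genome_lines)
    header = next(lines).split()
    i1, i2, ph = (header.index(name) for name in ("IID1", "IID2", "PI_HAT"))
    pairs = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        degree = relatedness_degree(float(fields[ph]))
        if degree in keep:
            pairs.append((fields[i1], fields[i2], degree))
    return pairs
```

With the default keep, third degree pairs are left alone, which matches the practice described above. Low PI_HAT values are exactly where the estimates are noisiest, so the lower cutoffs deserve the most suspicion.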

Since Plink's IBD analysis requires homogeneous samples, I am now using KING (paper) for the purpose. I am also looking at kcoeff (paper).

Pan-Asian to PED Conversion

Posted by Zack on April 16, 2011 8 comments

Even though the Pan-Asian dataset is not public, there was a request for my script to convert the data to Plink's PED format.

Here is how I convert the Pan-Asian data to Plink's transposed file format.

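#!/usr/bin/perl
use strict;
use warnings;

my $file = "Genotypes_All.txt";

open(my $in,   "<", $file)           or die "Cannot open $file: $!";
open(my $tfam, ">", "panasian.tfam") or die "Cannot write panasian.tfam: $!";
open(my $tped, ">", "panasian.tped") or die "Cannot write panasian.tped: $!";

# Header line: sample IDs start at the sixth column (index 5).
my $line = <$in>;
defined $line or die "$file is empty";
$line =~ s/\r?\n\z//;   # strip the line ending, including a Windows-style CR
my @first = split /\t/, $line;
foreach my $sample (5..$#first) {
        # TFAM columns: family, individual, father, mother, sex, phenotype
        print $tfam "0 $first[$sample] 0 0 0 -9\n";
}

while (my $row = <$in>) {
        $row =~ s/\r?\n\z//;
        my @fields = split /\t/, $row;
        my ($major, $minor) = split m{/}, $fields[4];
        # TPED columns: chromosome, SNP ID, genetic distance, position
        print $tped "$fields[2] $fields[1] 0 $fields[3]";
        foreach my $snp (5..$#fields) {
                my $alleles;
                # Compare as strings so that a non-numeric missing-data
                # code is not silently read as 0.
                if    ($fields[$snp] eq '0') { $alleles = "$major $major"; }
                elsif ($fields[$snp] eq '1') { $alleles = "$major $minor"; }
                elsif ($fields[$snp] eq '2') { $alleles = "$minor $minor"; }
                else                         { $alleles = "0 0"; }
                print $tped " $alleles";
        }
        print $tped "\n";
}

close($in);
close($tfam) or die "Cannot close panasian.tfam: $!";
close($tped) or die "Cannot close panasian.tped: $!";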

Again, no guarantees! It's Perl though, so it should be more stable across various operating systems.

Pan-Asian Dataset Duplicates and Relatives

Posted by Zack on April 9, 2011 2 comments

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have to make sure there are no duplicates, relatives, or other strange things that could cause issues with my analysis.

Looking at the Pan-Asian dataset, I found 3 pairs of duplicate samples and 82 pairs that could be closely related. I have removed 64 samples from the dataset.

You can see the IBD results from plink as well as the list of sample IDs I removed in a spreadsheet.
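Since one sample can show up in several related pairs, fewer samples than pairs may need to go. One way to choose them is a greedy pass that keeps dropping the sample involved in the most remaining pairs. This is only an illustrative sketch, not necessarily how the removal list in the spreadsheet was made, and samples_to_remove is a hypothetical name.

```python
from collections import defaultdict


def samples_to_remove(pairs):
    """Greedily pick samples to drop until no related pair is left intact."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)
    removed = []
    while True:
        # Samples still involved in at least one unresolved pair.
        candidates = [(len(others), s) for s, others in neighbours.items() if others]
        if not candidates:
            break
        # Most pairs first; ties are broken by sample ID for reproducibility.
        worst = min(candidates, key=lambda t: (-t[0], t[1]))[1]
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        removed.append(worst)
    return removed
```

The greedy choice is a heuristic and does not guarantee the smallest possible removal set. It also ignores other reasons to prefer one sample over another, such as call rate.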

UPDATE: I found 4 Melanesians in the Pan-Asian dataset who were the same as those in HGDP. So I have removed those as well and added them in the list in the spreadsheet.
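Samples shared between two datasets can be spotted by comparing genotypes directly at the SNPs the datasets have in common. The sketch below is one simple way to do that, not necessarily how these four were found. It assumes both samples are given as genotype codes (for example minor allele counts) over the same SNPs in the same order, with None for missing calls; the function name is hypothetical.

```python
def concordance(genotypes_a, genotypes_b, missing=None):
    """Fraction of SNPs, called in both samples, where the genotypes agree."""
    compared = 0
    matched = 0
    for a, b in zip(genotypes_a, genotypes_b):
        if a == missing or b == missing:
            continue
        compared += 1
        if a == b:
            matched += 1
    return matched / compared if compared else float("nan")
```

A value very close to 1 across many SNPs suggests the same individual, while the right cutoff depends on genotyping error rates and on how many SNPs overlap.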

Reich et al and Pan-Asian Datasets

Posted by Zack on March 17, 2011 16 comments

I got access to the Reich et al (Nature 2009) dataset used in their paper "Reconstructing Indian population history".

It has the following populations:

Aonaga	Aus	Bhil
Chenchu	Great_Andamanese	Hallaki
Kamsali	Kashmiri_Pandit	Kharia
Kurumba	Lodi	Madiga
Mala	Meghawal	Naidu
Nysha	Onge	Sahariya
Santhal	Satnami	Siddi
Somali	Srivastava	Tharu
Vaish	Velama	Vysya

There are 141 individuals with 587,753 SNPs in their dataset which conveniently is in PED format.

Also, Blaise pointed me to the Pan-Asian SNP data used in the Dec 2009 Science paper "Mapping Human Genetic Diversity in Asia".

It includes the following 71 populations:

Maya	Auca	Quechua	Karitiana	Pima
Ami	Atayal	Melanesians	Zhuang	Han_Cantonese
Hmong	Jiamao	Jinuo	Han_Shanghai	Uyghur
Wa	Alorese	Dayak	Javanese	Batak_Karo
Lamaholot	Lembata	Malay	Mentawai	Manggarai
Kambera	Sunda	Batak_Toba	Toraja	Andhra_Pradesh
Karnataka	Bengali-Assamese	Rajasthan	Uttaranchal	Uttar Pradesh
Haryana	Spiti	Bhili	Marathi	Japanese
Ryukyuan	Korean	Bidayuh	Jehai	Kelantan
Kensiu	Temuan	Ayta	Agta	Ati
Iraya	Minanubu	Mamanwa	Filipino	Singapore_Chinese
Singapore_Indian	Singapore_Malay	Hmong (Miao)	Karen	Lawa
Mlabri	Mon	Paluang	Plang	Tai_Khuen
Tai_Lue	H'tin	Tai_Yuan	Tai_Yong	Yao
Hakka	Minnan

It has 1,719 individuals with 54,794 SNPs. I wish it had more SNPs considering the wealth of populations.

Also, the Pan-Asian data is in the form of minor allele counts, so I need to convert that back to A/C/G/T. Since there are some HapMap populations included in the dataset, that shouldn't be too hard.

I am going to include both these datasets into my big reference set.

Harappa Ancestry Project

Genetics and South Asia
