Tag Archives: pan-asian

Pan-Asian Admixture Results

I ran Reference 3-based supervised ADMIXTURE on the HUGO Pan-Asian dataset. While it used only 5,400 SNPs, it did get me curious about any relationship between the Onge and the Jehai and Kensiu. Unfortunately, the Pan-Asian data doesn't overlap well even with the Reich et al. dataset. So as a first exercise, I decided to run unsupervised ADMIXTURE on the Pan-Asian dataset by itself.

Here are the bar charts for the admixture results. K=12 ancestral components had the lowest cross-validation error.
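For anyone repeating this, picking K by cross-validation can be sketched in a few lines of Python. This is a minimal sketch, assuming the log lines come from ADMIXTURE runs with the --cv option, which print lines of the form "CV error (K=12): 0.51234". The function name best_k and the example numbers are made up for illustration and are not results from this run.

```python
import re

# ADMIXTURE run with --cv prints one line per run such as:
#   CV error (K=12): 0.51234
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def best_k(log_lines):
    """Return (K, cv_error) for the run with the lowest CV error."""
    errors = {}
    for line in log_lines:
        match = CV_LINE.search(line)
        if match:
            errors[int(match.group(1))] = float(match.group(2))
    if not errors:
        raise ValueError("no CV error lines found")
    k = min(errors, key=errors.get)
    return k, errors[k]


# Made-up values for illustration only, not results from this run.
example = [
    "CV error (K=11): 0.5310",
    "CV error (K=12): 0.5302",
    "CV error (K=13): 0.5307",
]
print(best_k(example))
```

In practice the lines would be collected from the log files of each K run rather than typed in by hand.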

You can see these results in a spreadsheet too.


Pan-Asian Ref3 K=11 Admixture

The HUGO Pan-Asian dataset covers South and East Asia with the following South Asian populations:

  • 23 Andhra Pradesh & Karnataka
  • 10 Bengali
  • 23 Bhil (Rajasthan)
  • 20 Haryana
  • 23 Kashmir Spiti
  • 12 Marathi
  • 12 Rajasthani
  • 30 Singapore Indian
  • 20 Uttaranchal
  • 13 Uttar Pradesh

Unfortunately, they do not specify ethnic or caste background for most Indian groups. Instead, their focus is on Mongoloid/Caucasoid/Australoid etc.

Also, the SNP overlap with other datasets is really small, so this Reference 3 admixture run used only 5,400 SNPs. I recommend a big bucket of salt when interpreting these results.
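To make the overlap problem concrete, here is a rough Python sketch of how the shared SNPs between two datasets could be counted. It assumes both datasets are available as Plink .bim files, where the second column is the SNP ID, and it matches on ID only. The function names and the toy rows are hypothetical. A real merge would also have to check strand and allele coding.

```python
def snp_ids(bim_lines):
    """Collect SNP IDs (second column) from the lines of a Plink .bim file."""
    ids = set()
    for line in bim_lines:
        fields = line.split()
        if len(fields) >= 2:
            ids.add(fields[1])
    return ids


def shared_snps(bim_a, bim_b):
    """Return the set of SNP IDs present in both .bim files."""
    return snp_ids(bim_a) & snp_ids(bim_b)


# Toy rows (chromosome, SNP ID, cM, position, allele 1, allele 2).
panasian = ["1 rs1 0 1000 A G", "1 rs2 0 2000 C T", "2 rs3 0 500 A C"]
reference = ["1 rs2 0 2000 C T", "2 rs3 0 500 A C", "2 rs4 0 900 G T"]
print(sorted(shared_snps(panasian, reference)))
```

With real files, each argument would be an open file handle instead of a list.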

Here is the spreadsheet with the Pan-Asian group averages for reference 3 admixture at K=11 ancestral components.
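Group averages like these can be computed along the following lines. This is only a sketch, assuming one population label per individual in the same order as the rows of ADMIXTURE's Q output (one row of K proportions per individual). The function name group_averages and the toy numbers are mine, not taken from the spreadsheet.

```python
from collections import defaultdict


def group_averages(populations, q_rows):
    """Average each ancestry component over the individuals of a population.

    populations: one label per individual, in the same order as q_rows.
    q_rows: one list of K ancestry proportions per individual.
    """
    if len(populations) != len(q_rows):
        raise ValueError("need exactly one population label per Q row")
    sums = {}
    counts = defaultdict(int)
    for pop, row in zip(populations, q_rows):
        if pop not in sums:
            sums[pop] = [0.0] * len(row)
        for i, value in enumerate(row):
            sums[pop][i] += value
        counts[pop] += 1
    return {pop: [s / counts[pop] for s in total] for pop, total in sums.items()}


# Toy input with K=2, for illustration only.
labels = ["Bengali", "Bengali", "Marathi"]
rows = [[0.25, 0.75], [0.75, 0.25], [0.5, 0.5]]
for pop, averages in group_averages(labels, rows).items():
    print(pop, averages)
```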


Relatives in Datasets

Recently, there was a paper, "Identification of Close Relatives in the HUGO Pan-Asian SNP Database", by Xiong Yang, Shuhua Xu, and the HUGO Pan-Asian SNP Consortium.

"three individuals involved in MZ pairs were excluded from the whole dataset to construct standardized subset PASNP1716; seventy-six individuals involved in first-degree relationships were excluded from PASNP1716 to construct standardized subset PASNP1640; and 57 individuals involved in second-degree relationships were excluded from PASNP1640 to construct standardized subset PASNP1583. The individuals excluded were summarized in Table S6, S7, S8."

Let me engage in some blog triumphalism by saying I wrote about the duplicates and relatives in the Pan-Asian dataset in April 2011.

Here are my blog posts about relatedness in datasets:

Early on, I was removing only first-degree relatives from the reference datasets. Nowadays, I try to remove all second-degree relatives too. I leave the third-degree relatives in the data since it's sometimes hard to figure out how real the low IBD values are in Plink. There are a lot of third-degree relatives if Plink is to be believed, but I am a little skeptical.

Since Plink's IBD analysis requires homogeneous samples, I am now using KING (paper) for this purpose. I am also looking at kcoeff (paper).
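As a rough illustration of the degree cutoffs discussed above, here is a Python sketch that labels pairs from a Plink --genome output by their PI_HAT value. The thresholds are my assumption: they are simply the midpoints between the expected values of 1, 0.5, 0.25, and 0.125, not cutoffs taken from Plink or from the paper. The function names and example rows are made up.

```python
def classify_pi_hat(pi_hat):
    """Label a pair by PI_HAT using assumed midpoint thresholds."""
    if pi_hat >= 0.9:
        return "duplicate or MZ twin"
    if pi_hat >= 0.375:
        return "first degree"
    if pi_hat >= 0.1875:
        return "second degree"
    if pi_hat >= 0.09375:
        return "third degree"
    return "unrelated"


def related_pairs(genome_lines, min_pi_hat=0.1875):
    """Yield (IID1, IID2, PI_HAT, label) for pairs at or above min_pi_hat."""
    header = None
    for line in genome_lines:
        fields = line.split()
        if not fields:
            continue
        if header is None:
            # The first non-empty line holds the column names.
            header = {name: i for i, name in enumerate(fields)}
            continue
        pi_hat = float(fields[header["PI_HAT"]])
        if pi_hat >= min_pi_hat:
            yield (fields[header["IID1"]], fields[header["IID2"]],
                   pi_hat, classify_pi_hat(pi_hat))


# Made-up rows in the column layout of a .genome file.
example = [
    "FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO",
    "0 S1 0 S2 UN NA 0.0 0.0 1.0 1.0 -1 1.0 1.0 NA",
    "0 S1 0 S3 UN NA 0.5 0.5 0.0 0.25 -1 0.8 1.0 4.0",
    "0 S2 0 S4 UN NA 0.9 0.1 0.0 0.05 -1 0.7 0.6 2.1",
]
for pair in related_pairs(example):
    print(pair)
```

The default min_pi_hat keeps second-degree pairs and closer, which matches removing first- and second-degree relatives while leaving third-degree ones in.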


Pan-Asian to PED Conversion

Even though the Pan-Asian dataset is not public, there was a request for my script to convert the data to Plink's PED format.

Here is how I convert the Pan-Asian data to Plink's transposed file format.

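#!/usr/bin/perl
use strict;
use warnings;

# Input: tab-delimited genotype file. As read by this script, column 1 is
# the SNP ID, column 2 the chromosome, column 3 the position, column 4 the
# "major/minor" alleles, and columns 5 onward hold one minor-allele count
# per sample (counting columns from 0).
my $file = "Genotypes_All.txt";

open(my $in,   "<", $file)           or die "Cannot open $file: $!";
open(my $tfam, ">", "panasian.tfam") or die "Cannot open panasian.tfam: $!";
open(my $tped, ">", "panasian.tped") or die "Cannot open panasian.tped: $!";

# The header row carries the sample IDs: write one TFAM line per sample.
my $line = <$in>;
defined $line or die "$file is empty";
$line =~ s/\r?\n\z//;
my @first = split /\t/, $line;
foreach my $sample (5..$#first) {
        print $tfam "0 $first[$sample] 0 0 0 -9\n";
}

# One TPED line per SNP: chromosome, SNP ID, genetic distance, position,
# then two alleles per sample.
while (<$in>) {
        s/\r?\n\z//;
        my @lines = split /\t/, $_;
        my ($major, $minor) = split '/', $lines[4];
        print $tped "$lines[2] $lines[1] 0 $lines[3]";
        foreach my $snp (5..$#lines) {
                my $alleles;
                # Compare as strings so that a non-numeric missing code is
                # not silently treated as 0 (assumes calls are exactly 0, 1 or 2).
                if ($lines[$snp] eq '0') {
                        $alleles = "$major $major";}
                elsif ($lines[$snp] eq '1') {
                        $alleles = "$major $minor";}
                elsif ($lines[$snp] eq '2') {
                        $alleles = "$minor $minor";}
                else {
                        $alleles = "0 0";}
                print $tped " $alleles";
        }
        print $tped "\n";
}

close($in);
close($tfam) or die "Error writing panasian.tfam: $!";
close($tped) or die "Error writing panasian.tped: $!";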

Again, no guarantees! It's Perl though, so it should be more stable across various operating systems.


Pan-Asian Dataset Duplicates and Relatives

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have and making sure there are no duplicates, relatives, or any other strange things that could cause issues with my analysis.

Looking at the Pan-Asian dataset, I found 3 pairs of duplicate samples and 82 pairs that could be closely related. I have removed 64 samples from the dataset.

You can see the IBD results from Plink as well as the list of sample IDs I removed in a spreadsheet.

UPDATE: I found 4 Melanesians in the Pan-Asian dataset who were the same as those in HGDP. So I have removed those as well and added them to the list in the spreadsheet.
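Since one sample can appear in several related pairs, the number of samples to remove can be smaller than the number of pairs. One way to choose them is a greedy pass that repeatedly drops the sample with the most remaining relatives. The sketch below is only an illustration of that idea, not necessarily how I picked the samples above, and the function name and toy sample IDs are hypothetical.

```python
from collections import defaultdict


def samples_to_remove(pairs):
    """Greedily pick samples to drop until no related pair is left."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)
    removed = []
    while True:
        candidates = [(len(rel), s) for s, rel in neighbours.items() if rel]
        if not candidates:
            break
        # Drop the sample with the most remaining relatives;
        # ties are broken by sample ID so the result is reproducible.
        most = max(count for count, _ in candidates)
        sample = min(s for count, s in candidates if count == most)
        for other in neighbours.pop(sample):
            neighbours[other].discard(sample)
        removed.append(sample)
    return removed


# Toy pairs: B is related to both A and C, and D is related to E.
print(samples_to_remove([("A", "B"), ("B", "C"), ("D", "E")]))
```

A greedy pass like this is not guaranteed to give the smallest possible removal list, but it is simple and usually close.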


Reich et al and Pan-Asian Datasets

I got access to the Reich et al (Nature 2009) dataset used in their paper "Reconstructing Indian population history".

It has the following populations:

Aonaga Aus Bhil
Chenchu Great_Andamanese Hallaki
Kamsali Kashmiri_Pandit Kharia
Kurumba Lodi Madiga
Mala Meghawal Naidu
Nysha Onge Sahariya
Santhal Satnami Siddi
Somali Srivastava Tharu
Vaish Velama Vysya

There are 141 individuals with 587,753 SNPs in their dataset, which is conveniently in PED format.

Also, Blaise pointed me to the Pan-Asian SNP data used in the Dec 2009 Science paper "Mapping Human Genetic Diversity in Asia".

It includes the following 71 populations:

Maya Auca Quechua Karitiana Pima
Ami Atayal Melanesians Zhuang Han_Cantonese
Hmong Jiamao Jinuo Han_Shanghai Uyghur
Wa Alorese Dayak Javanese Batak_Karo
Lamaholot Lembata Malay Mentawai Manggarai
Kambera Sunda Batak_Toba Toraja Andhra_Pradesh
Karnataka Bengali-Assamese Rajasthan Uttaranchal Uttar_Pradesh
Haryana Spiti Bhili Marathi Japanese
Ryukyuan Korean Bidayuh Jehai Kelantan
Kensiu Temuan Ayta Agta Ati
Iraya Minanubu Mamanwa Filipino Singapore_Chinese
Singapore_Indian Singapore_Malay Hmong (Miao) Karen Lawa
Mlabri Mon Paluang Plang Tai_Khuen
Tai_Lue H'tin Tai_Yuan Tai_Yong Yao
Hakka Minnan

It has 1,719 individuals with 54,794 SNPs. I wish it had more SNPs considering the wealth of populations.

Also, the Pan-Asian data is in the form of minor allele counts, so I need to convert that back to A/C/G/T. Since there are some HapMap populations included in the dataset, that shouldn't be too hard.

I am going to add both of these datasets to my big reference set.
