Tag Archives: xing

Xing Ref3 Admixture South Asians

As per AV's comment, here are the individual results for Xing et al South Asians.

Xing Ref3 K=11 Admixture

The Xing et al dataset is interesting because it includes a number of South Asian populations:

  • 25 Andhra Pradesh Brahmin
  • 10 Andhra Pradesh Madiga
  • 11 Andhra Pradesh Mala
  • 22 Irula
  • 25 Nepalese
  • 25 Punjabi Arain
  • 14 Tamil Nadu Brahmin
  • 12 Tamil Nadu Dalit

Unfortunately, the dataset does not have a lot of common SNPs with 23andme, FTDNA and the other data I am using.

However, I did run a reference 3 admixture on the Xing data using about 30,000 SNPs. Since this is far fewer than the usual 118,000 SNPs, the results are much noisier.

Here is the spreadsheet with the Xing group averages for reference 3 admixture at K=11 ancestral components.

Xing to PED Conversion

Following mallu's request, here is the code I used to convert Xing et al data to Plink's PED format.

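#!/bin/bash
# Convert Windows line endings so awk, sort and join see clean fields.
dos2unix *.csv
 
# The header row of the genotype file lists the sample IDs: write one TFAM line per sample.
head -n 1 JHS_Genotype.csv > header.txt
awk -F, '{for (i=2;i<=NF;i++) print "0",$i,"0","0","0","0"}' header.txt > xing.tfam
# Drop the header row, then sort both files on the SNP ID (column 1) so join can match them.
sed '1d' JHS_Genotype.csv > genotype.csv
sort -t',' -k 1b,1 genotype.csv > genotype_sorted.csv
sort -t',' -k 1b,1 JHS_SNP.csv > snp_sorted.csv
join -t',' -j 1 snp_sorted.csv genotype_sorted.csv > xing_compound.csv
# Build the TPED: chromosome (column 2 without its first three characters), SNP ID,
# genetic distance 0, position, then each two-character genotype split into two alleles.
awk -F, '{printf("%s %s 0 %s ",substr($2,4),$1,$3);
        for (i=6;i<=NF;i++)
                printf("%s %s ",substr($i,1,1),substr($i,2,1));
        printf("\n")}' xing_compound.csv > xing.tped
 
# Tell Plink that N codes a missing allele in the input and write the binary fileset.
plink --tfile xing --out xing --make-bed --missing-genotype N --output-missing-genotype 0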

I make no guarantees that it will work for you. I used it on my Ubuntu box, but I am sure it'll have trouble on Mac OS.
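
A quick consistency check on the output can catch a bad join before Plink sees the files. What follows is only a minimal sketch, and the function name check_tped is made up for illustration. It relies on the TPED layout, where each line has four leading columns followed by two alleles for every sample listed in the TFAM file.

```shell
#!/bin/bash
# Hypothetical helper: check that a TPED file is consistent with its TFAM file.
# A TPED line has 4 leading columns (chromosome, SNP ID, genetic distance,
# position) followed by two alleles per sample.
check_tped() {
    local tfam="$1" tped="$2"
    local samples expected
    # Number of non-empty lines in the TFAM file = number of samples.
    samples=$(awk 'NF > 0 { n++ } END { print n + 0 }' "$tfam")
    expected=$((4 + 2 * samples))
    # Report every TPED line with the wrong field count; exit non-zero if any.
    awk -v want="$expected" '
        NF != want { printf("line %d: %d fields, expected %d\n", NR, NF, want); bad = 1 }
        END { exit bad + 0 }' "$tped"
}

# Example use with the files written by the conversion script:
#   check_tped xing.tfam xing.tped && echo "TPED and TFAM agree"
```

A mismatch usually means the genotype and SNP files did not line up, or that a line ending or delimiter problem split a field.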

Xing Redo

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have to make sure there are no duplicates, relatives or other oddities that could cause issues with my analysis.

So I went back to the Xing et al dataset, which you can download from their website.

I found no duplicates within the Xing et al data, but 259 of its samples are shared with HapMap. Since they are not assigned any family IDs, they will pass through the PED files without being merged into the HapMap samples. So you need to remove any samples with IDs starting with "NA".
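
One way to build that removal list is sketched below. It is an illustration only: the function name list_hapmap_samples and the file names are assumptions. It prints the family ID and individual ID of every sample whose individual ID starts with "NA", which is the two-column format Plink's --remove option reads.

```shell
#!/bin/bash
# Hypothetical helper: list samples whose individual ID (column 2 of a
# .fam or .tfam file) starts with "NA", as "FID IID" pairs.
list_hapmap_samples() {
    local fam="$1"
    awk '$2 ~ /^NA/ { print $1, $2 }' "$fam"
}

# Example use (file names are assumptions):
#   list_hapmap_samples xing.fam > hapmap_overlap.txt
#   plink --bfile xing --remove hapmap_overlap.txt --make-bed --out xing_nohapmap
```

Note that the pattern matches any ID beginning with "NA", so it is worth eyeballing the list before removing anything.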

Xing et al also contains 6 duplicates from HGDP with completely different IDs and two Xing samples look to be related to HGDP samples.

There are also three pairs with very high identity-by-descent values, which I calculated using Plink. You can see the samples with PI_HAT greater than 0.5 in this spreadsheet. PI_HAT is the proportion of the genome shared IBD, as estimated by Plink. Notice that all these pairs also have high IBS similarity (the DST column), above 85%.
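
The PI_HAT filter can be sketched as follows, assuming the pairwise estimates come from Plink's --genome option, which writes a whitespace-delimited .genome file with a header row. The function name related_pairs and the file names are made up for illustration. The PI_HAT column is looked up by name rather than by position.

```shell
#!/bin/bash
# Hypothetical helper: print the header plus every pair from a Plink
# .genome file whose PI_HAT is strictly greater than a threshold
# (default 0.5). The PI_HAT column is located from the header line.
related_pairs() {
    local genome="$1" threshold="${2:-0.5}"
    awk -v min="$threshold" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "PI_HAT") col = i
            if (!col) { print "no PI_HAT column found" > "/dev/stderr"; exit 1 }
            print; next
        }
        $col + 0 > min + 0' "$genome"
}

# Example use (file names are assumptions):
#   plink --bfile xing --genome --out xing
#   related_pairs xing.genome 0.5
```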

The samples I have removed as a result of this (other than HapMap) are listed in this spreadsheet.

One PED File to Rule Them All

I am interested in North African populations due to my own heritage. So when Razib alerted me that Henn et al had a paper out about a southern African origin of humans, and that their publicly available African dataset included populations from all over Africa, I immediately downloaded it.

I have also been considering looking into the East Asian admixture in South Asians and Iranians in some detail to see where it originates from: Southeast Asia, Chinese/Japanese/Koreans, or the Turkic/Mongolian/Siberian populations of interior northeastern Asia. At a quick glance, Razib is correct:

The eastern Asian components are enriched among Bengalis, as you’d expect, but they’re found in different proportions among many individuals who hail from the northern fringe of South Asia more generally. It seems clear that the further west you go, the more likely the “eastern” element is going to be Turk, while the further east (and to some extent south) the more likely it is to be more southernly in provenance.

To do a better job, though, I need more than the Yakut as an exemplar of the Siberian component, which is all I have used until now. Therefore, I downloaded the Arctic populations dataset from Rasmussen et al.

Combining Henn et al and Rasmussen et al with my previous datasets (HapMap, HGDP, SGVP, Behar et al and Xing et al), I got 3,970 samples with a total of 1,716,031 SNPs represented, though at 99% genotyping rate it gets reduced to about 27,000 SNPs.
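
The steep drop in SNP count follows from the merge itself: a SNP genotyped in only some of the datasets is missing for every sample from the others, so it fails a 99% genotyping-rate cutoff. As a rough sketch, the surviving SNPs can be counted from the per-SNP missingness report that "plink --missing" writes (the .lmiss file, with an F_MISS column). The function name count_snps_by_call_rate and the file names are assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical helper: count SNPs in a Plink .lmiss file whose missing
# rate (F_MISS column, found by name) is at most max_missing.
# A 99% genotyping rate corresponds to max_missing = 0.01.
count_snps_by_call_rate() {
    local lmiss="$1" max_missing="${2:-0.01}"
    awk -v max="$max_missing" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "F_MISS") col = i; next }
        col && $col + 0 <= max + 0 { n++ }
        END { print n + 0 }' "$lmiss"
}

# Example use (file names are assumptions):
#   plink --bfile merged --missing --out merged
#   count_snps_by_call_rate merged.lmiss 0.01
```

Plink can apply the same cutoff directly with its --geno option; the count above is just a way to see how many SNPs would survive at different thresholds.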

I did not remove any populations or individuals except for any duplicates and non-founders.

Here's the information on the populations represented in this dataset.

Now I am on the lookout for more datasets that are public, have enough SNPs in common with this set and can easily be converted into the Plink PED format. So if you know of any, let me know. Maybe I will have the biggest and most diverse dataset with your help.

San and Pygmy

I have removed San and Pygmy groups from my reference datasets. That meant removing 39 samples from Reference Data I and 61 samples from Reference Data II.

The presence of those groups was creating some weird effects in admixture runs at K=8,9. Basically, the ancestral components for Africans I was getting were not stable. Instead they were varying with/without different Harappa participant batches. Also, at K=10,11, there were too many Africa-only ancestral components, forcing me to run even higher values of K.

Since we are not really interested in African diversity in this project and any African admixture among South Asians is most likely to be East, West or North African instead of Pygmy or San, the removal of these groups should not have any implications for the Harappa Ancestry Project.

To make sure that the above assertion is true, I'll re-run admixture analysis for K=2-5 and update later with the results.

Reference Dataset II

Combining my reference population with Xing et al data gets me 3,161 samples (3,222 before the update below), but with only about 23,000 SNPs after LD-pruning.

The good thing is that this dataset has 544 South Asian samples from 24 ethnic groups. So it'll be useful for some analyses despite the low number of SNPs. I'll try to run parallel analyses on my reference population and this dataset so we can compare the pros and cons of both.

UPDATE: I removed 61 Pygmy and San samples.

Xing et al Data

The data for Xing et al's paper "Toward a more uniform sampling of human genetic diversity: a survey of worldwide populations by high-density genotyping" is available online.

This dataset consists of 850 individuals, but 259 of them overlap with the HapMap. Another 15 samples had to be removed because they were too similar to others. I also removed Native American samples. This leaves us with 529 samples.

Ethnic group Count
Slovenian 25
Punjabi Arain 25
N. European 25
Nepalese 25
Kyrgyzstani 25
Iban 25
Buryat 25
Bambaran 25
Andhra Pradesh Brahmin 25
Kurd 24
Dogon 24
Irula 23
Thai 22
Pygmy 22
Urkarah 18
Tamil Nadu Brahmin 14
Hema 14
Tongan 13
Tamil Nadu Dalit 13
Samoan 13
!Kung 13
Japanese 13
Andhra Pradesh Mala 11
Pedi 10
Andhra Pradesh Madiga 10
Alur 10
Nguni 9
Sotho/Tswana 8
Vietnamese 7
Stalskoe 5
Chinese 5
Khmer Cambodian 3

This dataset is valuable because it contains several South Asian, Central Asian, Southeast Asian and Caucasus groups. However, it does not have a good SNP overlap with 23andme and the other datasets. It has only about 29,000 SNPs in common with 23andme v2 data. Combining HapMap, HGDP, SGVP, Behar et al and Xing et al with 23andme data leaves us with 25,000 SNPs. Because of that, I'll be using Xing et al data for only a few analyses.
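
An overlap count like this can be sketched by comparing SNP IDs, assuming both datasets have already been converted to Plink format so that each has a .bim file with the SNP ID in column 2. The function name count_shared_snps and the file names are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: count distinct SNP IDs (column 2 of a Plink .bim
# file) that appear in both of two datasets.
count_shared_snps() {
    local bim_a="$1" bim_b="$2"
    awk '
        NR == FNR { seen[$2] = 1; next }            # first file: remember IDs
        ($2 in seen) && !counted[$2]++ { n++ }      # second file: count each match once
        END { print n + 0 }' "$bim_a" "$bim_b"
}

# Example use (file names are assumptions):
#   count_shared_snps xing.bim 23andme_v2.bim
```

This matches on SNP ID only, so it ignores differences in strand, position or allele coding that a real merge would also have to reconcile.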
