Tag Archives: austroasiatic

Behar Paniya

Posted by Zack on April 16, 2011 6 comments

Behar here refers to the Behar et al paper/dataset, not the Indian state of Bihar. The Behar dataset contains 4 Paniya samples; Paniya is apparently a Dravidian language spoken by some Scheduled Tribes in Kerala.

I had always been suspicious of those four samples since one of them had admixture proportions similar to other South Indians but the other three were like Southeast Asians.

When I got the Austroasiatic dataset, I found that it also included four Paniyas from Behar et al. However, only one of those four matched a sample in my copy of the Behar data; the other three were different. So I now had 7 distinct Paniya samples.

Let's look at the K=12 admixture results for these Paniyas.

Behar's GSM536916 is the sample that matches Austroasiatic's D36, and it has typical South Indian results. The other three Behar Paniyas are very Southeast Asian (yellow in the plot), while the three other Paniyas from the Austroasiatic data are similar to GSM536916/D36.

Since the Austroasiatic Paniya samples originated from Behar et al, I guess the Paniyas got mislabeled at some point before the Behar data was submitted to the GEO database.

I am now excluding the four Paniyas from the Behar et al dataset and only using the Paniya samples from the Austroasiatic dataset.

Introducing Reference 3

Posted by Zack on April 13, 2011 34 comments

Having collected 12 datasets, I have gone through them and finally selected the samples and SNPs I want to include in my new dataset, which I'll call Reference 3.

It has 3,889 individuals and 217,957 SNPs. Since this is a South Asia focused blog, there are a total of 558 South Asians in this reference set (compared to 398 in my Reference I).

You can see the number of SNPs of various datasets which are common to 23andme version 2, 23andme version 3 and FTDNA Family Finder (Illumina chip).

The following datasets had more than 280,000 SNPs common with all three platforms and hence were included in Reference 3:

HapMap
HGDP
SGVP
Behar
Henn (Khoisan data)
Rasmussen
Austroasiatic
Latino
1000genomes

Reich et al had about 100,000 SNPs in common with 23andme (v2 & v3 intersection) and 137,000 with FTDNA, but there was not a great overlap. Only 59,000 Reich et al SNPs were present in all three platforms. Since I really wanted Reich et al data in Reference 3, I included it but the SNPs used for FTDNA comparisons won't be the same as for the 23andme comparisons.
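To illustrate why a dataset can overlap well with each platform separately and still share few SNPs with all three, here is a minimal Python sketch using set intersections. The function name `common_snps`, the platform keys and the rsIDs are all made up for illustration; this is not the script actually used to build Reference 3.

```python
def common_snps(dataset_snps, platform_snps):
    """Return the SNP IDs from dataset_snps that appear on every given platform."""
    shared = set(dataset_snps)
    for snps in platform_snps.values():
        shared &= set(snps)
    return shared


# Toy SNP lists (hypothetical rsIDs, not real chip content).
platforms = {
    "23andme_v2": {"rs1", "rs2", "rs3", "rs4"},
    "23andme_v3": {"rs1", "rs2", "rs3", "rs5"},
    "ftdna": {"rs1", "rs4", "rs5", "rs6"},
}
dataset = {"rs1", "rs2", "rs3", "rs4", "rs5"}

# Overlap with both 23andme versions only.
with_23andme = common_snps(
    dataset, {k: v for k, v in platforms.items() if k.startswith("23andme")}
)
# Overlap with FTDNA only.
with_ftdna = common_snps(dataset, {"ftdna": platforms["ftdna"]})
# Overlap with all three platforms at once.
with_all = common_snps(dataset, platforms)

print(len(with_23andme), len(with_ftdna), len(with_all))
```

In this toy case the dataset shares three SNPs with 23andme and three with FTDNA, but only one with all three platforms, which is the same kind of situation as with Reich et al.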

Of the datasets I could not include, I am most disappointed about the Pan-Asian dataset since it has good coverage of South and Southeast Asia. Unfortunately, it has only 19,000 SNPs in common with 23andme v2 and 23,000 with 23andme v3. I will have to do some separate analyses with the Pan-Asian data, but it just can't be included in my Reference 3.

I am also interested in doing some analysis with the Henn et al African data with about 52,000 SNPs for personal reasons.

Xing et al has about 71,000 SNPs in common with 23andme v3, so some good work could be done with that, though I'll have to use only 23andme version 3 participants.

The information about the populations included in Reference 3 is in a spreadsheet as usual.

Reich et al Duplicates

Posted by Zack on April 12, 2011 Comments Off

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have to make sure there are no duplicates, relatives or any other strange things that could cause issues with my analysis.

So I went back to the Reich et al Indian dataset.

The dataset doesn't have any duplicate or likely relative samples itself. However, there are two Kharia samples that are the same as in the Austroasiatic dataset. Since the Austroasiatic dataset has more SNPs in common with 23andme, I removed these two samples from Reich et al.

The IBS/IBD analysis and the sample IDs are in a spreadsheet as usual.

Austroasiatic Dataset Duplicates

Posted by Zack on April 8, 2011 Comments Off

So I went back to the Chaubey et al Austroasiatic Indians dataset.

The dataset doesn't have any duplicate or likely relative samples itself. Of course, I had to remove the 632 HGDP samples it had, but that's easy to do since they have the same IDs (starting with HGDP).
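Dropping samples by ID prefix can be sketched in a few lines of Python. The helper name `drop_by_prefix` and the sample IDs in the list are hypothetical, chosen only to show the idea.

```python
def drop_by_prefix(sample_ids, prefix="HGDP"):
    """Keep only the sample IDs that do not start with the given prefix."""
    return [s for s in sample_ids if not s.startswith(prefix)]


# Hypothetical mix of HGDP and non-HGDP sample IDs.
samples = ["HGDP00001", "sampleA", "HGDP01234", "sampleB"]
kept = drop_by_prefix(samples)
print(kept)
```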

As their paper mentions, the dataset also has 19 Dravidian-speaking Indian samples from Behar et al. Since I got the Behar et al data from the GEO site, I had different IDs for them than the ones used in this dataset. So I had to figure out which samples were the same in both. The IBS/IBD results for the duplicates, as well as the list of sample IDs I removed, are given in a spreadsheet.
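For readers unfamiliar with how samples carrying different IDs can be matched, here is a rough Python sketch of the idea behind an identity-by-state (IBS) comparison. It assumes genotypes coded as 0/1/2 copies of one allele with `None` for missing calls, a 0.99 similarity cutoff, and made-up sample IDs and genotypes; all of these are assumptions for illustration, not the actual tool or settings used for the spreadsheet.

```python
from itertools import combinations


def ibs_similarity(g1, g2):
    """Average proportion of alleles shared IBS over SNPs called in both samples."""
    called = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not called:
        return float("nan")
    # Each SNP contributes 2, 1 or 0 shared alleles out of 2.
    return sum(2 - abs(a - b) for a, b in called) / (2 * len(called))


def find_duplicates(genotypes, threshold=0.99):
    """Return pairs of sample IDs whose IBS similarity is at or above threshold."""
    pairs = []
    for (id1, g1), (id2, g2) in combinations(genotypes.items(), 2):
        if ibs_similarity(g1, g2) >= threshold:
            pairs.append((id1, id2))
    return pairs


# Toy genotypes with hypothetical IDs: the first two are the same person
# under two different labels, the third is an unrelated sample.
genotypes = {
    "geo_sample_1": [0, 1, 2, 1, 0, 2, 1, 0],
    "aa_sample_1": [0, 1, 2, 1, 0, 2, 1, 0],
    "aa_sample_2": [2, 1, 0, 0, 1, 2, 0, 1],
}
dupes = find_duplicates(genotypes)
print(dupes)
```

With real data there are hundreds of thousands of SNPs and some genotyping error, so duplicates show up as pairs with similarity very close to, but not necessarily exactly, 1.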

Checking this out resolved an issue I had with Behar et al. Behar et al has 4 Paniya samples from South India. One of those four has admixture proportions similar to Indians but three seem very East Asian. I had always suspected that those three samples were mislabeled. Now the Austroasiatic dataset also has those four Paniya samples. However, only one of them is identical to the Behar et al one. The other three are different. I haven't checked yet which one of the Behar samples matches Austroasiatic, but my guess is that it is the more Indian admixture one. So I am keeping the other three Paniya samples from the Austroasiatic dataset and hoping that they are the correct ones.

Austroasiatic Dataset

Posted by Zack on March 20, 2011 Comments Off

36 hours ago, Razib pointed out to me the paper "Population Genetic Structure in Indian Austroasiatic speakers: The Role of Landscape Barriers and Sex-specific Admixture" by Gyaneshwer Chaubey, Mait Metspalu, Ying Choi, Reedik Mägi, Irene Gallego Romero, Pedro Soares, Mannis van Oven, Doron M. Behar, Siiri Rootsi, Georgi Hudjashov, Chandana Basu Mallick, Monika Karmin, Mari Nelis, Jüri Parik, Alla Goverdhana Reddy, Ene Metspalu, George van Driem, Yali Xue, Chris Tyler-Smith, Kumarasamy Thangaraj, Lalji Singh, Maido Remm, Martin B. Richards, Marta Mirazon Lahr, Manfred Kayser, Richard Villems and Toomas Kivisild. And I have their dataset now.

I have been told that the data will hopefully be in the NCBI GEO database soon.

There are a total of 41 samples with 527,319 SNPs in the data. There are Bonda, Savara, Juang and Gadaba from Orissa; Santhal and Asur from Jharkhand; Kharia from Chhattisgarh; Ho from Bihar; Khasi and Garo from Meghalaya; and some (15) Burmese.

PS. I have created a separate page for references where I link to the papers which led to the datasets I am using.

Harappa Ancestry Project

Genetics and South Asia
