reich | Search Results | Harappa Ancestry Project

Admixture Onge Component Map

Posted by Zack on April 22, 2011 Comments Off

Since the Onge component on my K=11 admixture run was very strongly correlated (as measured by r²) with Reich et al's Ancestral South Indian, Simranjit has been kind enough to let me share his map of the Onge component in South Asia.

He also has maps of the K=12 admixture run.

Reference 3 Admixture K=11

Posted by Zack on April 21, 2011 39 comments

Continuing with the admixture analysis with our new reference 3 dataset.

Here's the results spreadsheet for K=11.

You can click on the legend to the right of the bar chart to sort by different ancestral components.

You don't know how excited I am to see the Onge (C2) component. Let's compare the Onge component with Reich et al's ASI (Ancestral South Indian):

	Reich ASI %	Onge Component %
Mala	61.2	39.9
Madiga	59.4	37.9
Chenchu	59.3	38.6
Bhil	57.1	37.5
Satnami	57	36.4
Kurumba	56.8	39.5
Kamsali	55.5	35.5
Vysya	53.8	34.4
Lodi	50.1	31.8
Naidu	49.9	32.1
Tharu	49	32.2
Velama	45.3	28.9
Srivastava	43.6	27.8
Meghawal	39.7	25.4
Vaish	37.4	23.8
Kashmiri-Pandit	29.4	17.6
Sindhi	26.3	13.4
Pathan	23.1	10.6

Let's plot that with a linear regression:

How do you like that?

Now let's take all the reference populations with an Onge component between 10% and 50% and use the equation above to calculate their ASI percentage. The results are in a spreadsheet. There are several populations with an even higher Ancestral South Indian percentage than any of the Reich et al groups, with Paniya being the highest at 67.4%.
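For anyone who wants to redo the fit without the chart, here is a minimal sketch in Python. It runs an ordinary least-squares regression of Reich ASI % on Onge component % using only the 18 rows in the table above. The function names (`fit_line`, `estimate_asi`) are my own for illustration, and the coefficients it produces may differ slightly from the ones in the original plot.

```python
# (population, Reich ASI %, Onge component %) copied from the table above.
ROWS = [
    ("Mala", 61.2, 39.9), ("Madiga", 59.4, 37.9), ("Chenchu", 59.3, 38.6),
    ("Bhil", 57.1, 37.5), ("Satnami", 57.0, 36.4), ("Kurumba", 56.8, 39.5),
    ("Kamsali", 55.5, 35.5), ("Vysya", 53.8, 34.4), ("Lodi", 50.1, 31.8),
    ("Naidu", 49.9, 32.1), ("Tharu", 49.0, 32.2), ("Velama", 45.3, 28.9),
    ("Srivastava", 43.6, 27.8), ("Meghawal", 39.7, 25.4), ("Vaish", 37.4, 23.8),
    ("Kashmiri-Pandit", 29.4, 17.6), ("Sindhi", 26.3, 13.4), ("Pathan", 23.1, 10.6),
]


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept; also returns r squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


onge = [row[2] for row in ROWS]
asi = [row[1] for row in ROWS]
slope, intercept, r_squared = fit_line(onge, asi)


def estimate_asi(onge_pct):
    """Estimate ASI % from an Onge component % using the fitted line."""
    return slope * onge_pct + intercept


print(f"ASI = {slope:.3f} * Onge + {intercept:.3f} (r^2 = {r_squared:.3f})")
```

As in the post, the fitted line should only be applied to populations whose Onge component falls roughly inside the range covered by the table.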

Here are the Fst divergences between the estimated populations for K=11 in the form of an MDS plot.

I guess you might want to see the Fst dendrogram too. Just remember it's not a phylogeny.

And the numbers:

	C1	C2	C3	C4	C5	C6	C7	C8	C9	C10
C2	0.165
C3	0.121	0.122
C4	0.090	0.161	0.152
C5	0.071	0.152	0.137	0.048
C6	0.134	0.144	0.067	0.163	0.143
C7	0.184	0.224	0.216	0.179	0.186	0.232
C8	0.210	0.209	0.205	0.235	0.223	0.228	0.286
C9	0.175	0.207	0.139	0.208	0.178	0.141	0.281	0.290
C10	0.261	0.304	0.294	0.257	0.261	0.311	0.123	0.367	0.364
C11	0.150	0.195	0.187	0.143	0.148	0.203	0.059	0.260	0.252	0.133
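If you want to work with these numbers yourself, here is a small Python sketch that loads the lower-triangular table into a symmetric lookup and reports the closest and most distant pairs of components. The helper names are hypothetical; the values are the ones printed above.

```python
# Lower-triangular Fst values copied from the table above.
# Each row lists the Fst between that component and C1, C2, ... in order.
FST_ROWS = """\
C2 0.165
C3 0.121 0.122
C4 0.090 0.161 0.152
C5 0.071 0.152 0.137 0.048
C6 0.134 0.144 0.067 0.163 0.143
C7 0.184 0.224 0.216 0.179 0.186 0.232
C8 0.210 0.209 0.205 0.235 0.223 0.228 0.286
C9 0.175 0.207 0.139 0.208 0.178 0.141 0.281 0.290
C10 0.261 0.304 0.294 0.257 0.261 0.311 0.123 0.367 0.364
C11 0.150 0.195 0.187 0.143 0.148 0.203 0.059 0.260 0.252 0.133
"""


def parse_lower_triangle(text):
    """Return {(column_label, row_label): fst} for every pair in the table."""
    fst = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        row_label = fields[0]
        for j, value in enumerate(fields[1:], start=1):
            fst[(f"C{j}", row_label)] = float(value)
    return fst


def fst_between(fst, a, b):
    """Symmetric lookup: the order of the two labels does not matter."""
    if a == b:
        return 0.0
    return fst[(a, b)] if (a, b) in fst else fst[(b, a)]


fst = parse_lower_triangle(FST_ROWS)
closest = min(fst, key=fst.get)
farthest = max(fst, key=fst.get)
print("closest pair:", closest, fst[closest])
print("most distant pair:", farthest, fst[farthest])
```

The resulting dictionary can be expanded into a full 11 x 11 matrix for any MDS or clustering routine you prefer.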

Reference 3 Fixed

Posted by Zack on April 17, 2011 2 comments

I have fixed the problem with Reference 3 but if you notice any strange results, do let me know.

While the Reference 3 admixture results were generally good (and I have some nice surprises on the way I hope), the Reich et al populations had some weird behavior. From one K value to the next, their admixture would swing wildly especially among the minor components.

For example, for Chenchu, the 2nd component after South Asian was Southwest Asian (42%) at K=6, European (45%) at K=7 and American (32%) at K=8. That just didn't make any sense. It was similar for other Reich et al populations, but all the other reference populations seemed pretty stable.

The issue arose while I was creating Reference 3. Reich et al doesn't have as many SNPs in common with the other datasets plus 23andme (v2 and v3) and FTDNA, so I had to juggle lists of SNPs to find a way to include it with a large (>100,000) number of SNPs in the dataset. Somewhere among all those SNP set intersections and unions, I messed up. I used 217,000 SNPs. While these SNPs were present in all the other datasets, Reich et al had only 102,000 SNPs in common with that set. Ouch! This was a royal mess: the high missing rate of Reich et al caused weird instability in its admixture results even though the rest of the results were mostly stable.

Now, I have pared down Reference 3 to 118,000 SNPs. These have a low missing rate in all the datasets. So I don't expect the same problems.
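To put a number on how bad the original SNP set was, here is a quick worked calculation in Python using the rounded figures above. The function name is made up for illustration.

```python
def missing_rate(snps_in_reference, snps_present_in_dataset):
    """Fraction of the reference SNP set that a dataset has no genotypes for."""
    return 1.0 - snps_present_in_dataset / snps_in_reference


# Rounded figures from the post: 217,000 SNPs in the first version of
# Reference 3, of which Reich et al had only 102,000.
reich_missing = missing_rate(217_000, 102_000)
print(f"Reich et al missing rate: {reich_missing:.0%}")
```

In other words, a bit more than half of the genotypes for the Reich et al samples were missing in that first version.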

I am redoing the admixture runs with this new data and will have some of the results up soon.

Pan-Asian to PED Conversion

Posted by Zack on April 16, 2011 8 comments

Even though the Pan-Asian dataset is not public, there was a request for my script to convert the data to Plink's PED format.

Here is how I convert the Pan-Asian data to Plink's transposed file format.

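#!/usr/bin/perl
use strict;
use warnings;

# Convert the Pan-Asian genotype table to Plink's transposed format (TPED + TFAM).
my $file = "Genotypes_All.txt";

open(my $in,   "<", $file)           or die "Cannot open $file: $!";
open(my $tfam, ">", "panasian.tfam") or die "Cannot open panasian.tfam: $!";
open(my $tped, ">", "panasian.tped") or die "Cannot open panasian.tped: $!";

# Header row: sample IDs start at the sixth column (index 5).
my $header = <$in>;
$header =~ s/\r?\n\z//;    # strip Unix or Windows line endings
my @first = split(/\t/, $header);
foreach my $sample (5..$#first) {
        # TFAM columns: family ID, sample ID, father, mother, sex, phenotype
        print $tfam "0 $first[$sample] 0 0 0 -9\n";
}

# Each remaining row is one SNP; column index 4 holds the alleles as "major/minor".
while (my $line = <$in>) {
        $line =~ s/\r?\n\z//;
        my @fields = split(/\t/, $line);
        my ($major, $minor) = split('/', $fields[4]);
        # TPED columns: chromosome, SNP ID, genetic distance (0), position
        print $tped "$fields[2] $fields[1] 0 $fields[3]";
        foreach my $i (5..$#fields) {
                my $geno = $fields[$i];
                # String comparison, so a non-numeric missing code is not read as 0.
                my $alleles = "0 0";    # anything other than 0, 1 or 2 is missing
                if    ($geno eq "0") { $alleles = "$major $major"; }
                elsif ($geno eq "1") { $alleles = "$major $minor"; }
                elsif ($geno eq "2") { $alleles = "$minor $minor"; }
                print $tped " $alleles";
        }
        print $tped "\n";
}

close($in);
close($tfam);
close($tped);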

Again, no guarantees! It's Perl though, so it should be more stable across various operating systems.

Introducing Reference 3

Posted by Zack on April 13, 2011 34 comments

Having collected 12 datasets, I have gone through them and finally selected the samples and SNPs I want to include in my new dataset, which I'll call Reference 3.

It has 3,889 individuals and 217,957 SNPs. Since this is a South Asia focused blog, there are a total of 558 South Asians in this reference set (compared to 398 in my Reference I).

You can see the number of SNPs of various datasets which are common to 23andme version 2, 23andme version 3 and FTDNA Family Finder (Illumina chip).

The following datasets had more than 280,000 SNPs common with all three platforms and hence were included in Reference 3:

HapMap
HGDP
SGVP
Behar
Henn (Khoisan data)
Rasmussen
Austroasiatic
Latino
1000genomes

Reich et al had about 100,000 SNPs in common with 23andme (v2 & v3 intersection) and 137,000 with FTDNA, but there was not a great overlap. Only 59,000 Reich et al SNPs were present in all three platforms. Since I really wanted Reich et al data in Reference 3, I included it but the SNPs used for FTDNA comparisons won't be the same as for the 23andme comparisons.
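The overlap problem is easy to see with a toy example. Here is a minimal Python sketch, with made-up rsIDs standing in for the real SNP lists, showing how a dataset can share a decent number of SNPs with each platform separately and still have few SNPs present on all of them at once. The helper name `common_snps` is hypothetical.

```python
def common_snps(*snp_sets):
    """SNP IDs present in every one of the given collections."""
    return set.intersection(*map(set, snp_sets))


# Toy rsIDs, invented purely for illustration.
reich = {"rs1", "rs2", "rs3", "rs4", "rs5", "rs6"}
v2 = {"rs1", "rs2", "rs3", "rs4", "rs9"}      # stands in for 23andme v2
v3 = {"rs1", "rs2", "rs3", "rs4", "rs7"}      # stands in for 23andme v3
ftdna = {"rs3", "rs4", "rs5", "rs6", "rs8"}   # stands in for FTDNA Family Finder

with_23andme = common_snps(reich, v2, v3)
with_ftdna = common_snps(reich, ftdna)
all_platforms = common_snps(reich, v2, v3, ftdna)

print(len(with_23andme), len(with_ftdna), len(all_platforms))
```

Here each comparison on its own keeps four of the six toy SNPs, but only two survive when all platforms are required together.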

Of the datasets I could not include, I am most disappointed about the Pan-Asian dataset since it has a good coverage of South and Southeast Asia. Unfortunately, it has only 19,000 SNPs in common with 23andme v2 and 23,000 with 23andme v3. I am going to have to do some analyses with the Pan-Asian data but it just can't be included in my Reference 3.

For personal reasons, I am also interested in doing some analysis with the Henn et al African data, with about 52,000 SNPs.

Xing et al has about 71,000 SNPs in common with 23andme v3, so some good work could be done with that, though I'll have to use only 23andme version 3 participants.

The information about the populations included in Reference 3 is in a spreadsheet as usual.

Pan-Asian Dataset Duplicates and Relatives

Posted by Zack on April 9, 2011 2 comments

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have and making sure there are no duplicates, relatives or any other strange things that could cause issues with my analysis.

Looking at the Pan-Asian dataset, I found 3 pairs of duplicate samples and 82 pairs that could be closely related. I have removed 64 samples from the dataset.

You can see the IBD results from plink as well as the list of sample IDs I removed in a spreadsheet.
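For those who want to do a similar check, here is a rough Python sketch of how pairs could be sorted into duplicates and possible relatives from plink's `--genome` output, using the `IID1`, `IID2` and `PI_HAT` columns. The thresholds (0.9 and 0.2) and the toy rows are assumptions for illustration only; they are not the cutoffs I used.

```python
def classify_pairs(genome_lines, dup_threshold=0.9, related_threshold=0.2):
    """Split sample pairs into likely duplicates and possible relatives by PI_HAT.

    genome_lines: lines of a plink .genome file, header first.
    The two thresholds are illustrative assumptions.
    """
    header = genome_lines[0].split()
    iid1, iid2, pi_hat = (header.index(c) for c in ("IID1", "IID2", "PI_HAT"))
    duplicates, relatives = [], []
    for line in genome_lines[1:]:
        fields = line.split()
        if not fields:
            continue
        pair = (fields[iid1], fields[iid2])
        value = float(fields[pi_hat])
        if value >= dup_threshold:
            duplicates.append(pair)
        elif value >= related_threshold:
            relatives.append(pair)
    return duplicates, relatives


# Made-up rows in the .genome column layout.
toy_genome = [
    "FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO",
    "F1 S1 F1 S2 UN NA 0.0000 0.0000 1.0000 1.0000 -1 1.000000 1.0000 NA",
    "F1 S1 F2 S3 UN NA 0.2500 0.5000 0.2500 0.5000 -1 0.850000 1.0000 9.5",
    "F2 S3 F3 S4 UN NA 1.0000 0.0000 0.0000 0.0000 -1 0.700000 0.5000 2.0",
]
duplicates, relatives = classify_pairs(toy_genome)
print("duplicates:", duplicates)
print("possible relatives:", relatives)
```

Deciding which sample of each pair to drop is a separate step that this sketch does not attempt.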

UPDATE: I found 4 Melanesians in the Pan-Asian dataset who were the same as those in HGDP. So I have removed those as well and added them in the list in the spreadsheet.

Ref 2 South Asians + Harappa MDS Clusters

Posted by Zack on April 1, 2011 2 comments

Why do MDS clusters when we already did PCA-based clustering for this data?

You guys probably know about Dienekes' Clusters Galore approach. It works like this: for each number of retained MDS dimensions, you compute the number of clusters inferred by Mclust, and then you use the number of MDS dimensions that gives the maximum number of clusters.

This sounded a little unsatisfactory to me. So I ran an experiment. I computed 100 MDS dimensions for the samples in this dataset, which includes South Asians from Reference II as well as 38 Harappa participants. Then I kept 2, 3, 4, ..., 100 dimensions and ran NNclean (to get an initial noise/outlier estimate) and Mclust on them.

This first graph shows the number of outliers NNclean computed from 586 samples.

Things go crazy with NNclean when 64 or more MDS dimensions are retained since it considers most of the samples to be noise then.

Now let's look at the number of outliers identified after Mclust's clustering procedure.

This shows us that probably somewhere between 8 and 65 MDS dimensions might be useful to keep.

Finally, a plot of the number of clusters inferred by Mclust versus the number of MDS dimensions used.

There are two big jumps here to consider. One is around 12 MDS dimensions and the other after 52. So we are looking at an optimum number of MDS dimensions between 12 and 52. However, in that range, the number of clusters computed is fairly noisy between 18 and 26. The only pattern I can discern with some smoothed fitting is that we should likely be looking at somewhere between 20 and 30 MDS dimensions.

But why choose the maximum number of clusters (26 clusters when 24 MDS dimensions are kept)? That could be the result of noise too.
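To make the worry concrete, here is a small Python sketch comparing two ways of picking the number of dimensions: taking the raw maximum of the cluster counts, and taking the maximum after a simple moving average. The cluster counts below are invented for illustration and are not my actual results; the moving average is just one possible smoother, not necessarily the one I used.

```python
def best_dims_by_max(cluster_counts):
    """Pick the number of MDS dimensions that gives the most clusters."""
    return max(cluster_counts, key=cluster_counts.get)


def best_dims_smoothed(cluster_counts, window=3):
    """Same choice after a centered moving average over neighbouring dimensions."""
    dims = sorted(cluster_counts)
    half = window // 2
    smoothed = {}
    for i, d in enumerate(dims):
        neighbours = dims[max(0, i - half): i + half + 1]
        smoothed[d] = sum(cluster_counts[n] for n in neighbours) / len(neighbours)
    return max(smoothed, key=smoothed.get)


# Invented counts: {number of MDS dimensions kept: clusters inferred}.
toy_counts = {20: 21, 21: 22, 22: 18, 23: 19, 24: 26,
              25: 19, 26: 23, 27: 24, 28: 23, 29: 20}

print("raw maximum at", best_dims_by_max(toy_counts), "dimensions")
print("smoothed maximum at", best_dims_smoothed(toy_counts), "dimensions")
```

With a single spike in noisy counts, the two rules can disagree, which is why an independent criterion for the number of dimensions is worth having.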

Is there some other way to figure out how many MDS dimensions are significant for population structure? It turns out there is. Patterson, Price and Reich proposed Tracy-Widom statistics for Principal Component Analysis in their paper "Population Structure and Eigenanalysis". We also know that the MDS analysis we are performing is the classical metric MDS, which is in some ways equivalent to a PCA. Looking at the Tracy-Widom stats, then, we see that about 25 eigenvalues are significant. Thus, keeping 24 MDS dimensions to maximize the number of clusters seems defensible.

Finally, here are the clustering results.

Ref 2 South Asians + Harappa PCA

Posted by Zack on March 30, 2011 2 comments

I ran PCA on the South Asian populations included in the Reference II dataset as well as 38 South Asian participants of the Harappa Project. This is sort of a complementary analysis to the Ref1 South Asian one, as this one includes Kalash, Hazara and the additional South Asian groups in Xing et al.

The reference populations included are: Andhra Brahmin, Andhra Madiga, Andhra Mala, Balochi, Bnei Menashe Jews, Brahui, Burusho, Cochin Jews, Gujaratis, Gujaratis-B, Hazara, Irula, Kalash, Makrani, Malayan, Nepalese, North Kannadi, Paniya, Pathan, Punjabi Arain, Sakilli, Sindhi, Singapore Indians, Tamil Nadu Brahmin, and Tamil Nadu Dalit.

Here's the spreadsheet showing the eigenvalues and the first 15 principal components for each sample.

I computed the PCA using Eigensoft which removed 13 samples as outliers. The Tracy-Widom statistics show that about 25 eigenvectors are significant.

Here are the first 15 eigenvalues:

1	6.374483
2	3.650626
3	3.270121
4	2.999767
5	1.937818
6	1.713315
7	1.538295
8	1.503051
9	1.458331
10	1.448079
11	1.433288
12	1.414678
13	1.408943
14	1.390791
15	1.38101

Here is a 3-D PCA plot (hat tip: Doug McDonald) showing the first three eigenvectors. The plot rotates about the 1st eigenvector, which is vertical, and I have stretched the principal components based on the corresponding eigenvalues. You can highlight the individual project participants in the plot by using the dropdown list below the plot.

The animation is at http://www.harappadna.org/wp-content/uploads/2011/03/r2_sa_hrp_pca.html.

Now here are plots of the first 14 eigenvectors. In this case, I have not stretched the principal components, so keep in mind that the first eigenvector explains 1.75 times as much variation as the 2nd eigenvector.
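The 1.75 figure comes straight from the eigenvalues listed earlier. A short Python check, using only those 15 values (the variable names are mine):

```python
# First 15 eigenvalues from the list above.
eigenvalues = [
    6.374483, 3.650626, 3.270121, 2.999767, 1.937818,
    1.713315, 1.538295, 1.503051, 1.458331, 1.448079,
    1.433288, 1.414678, 1.408943, 1.390791, 1.38101,
]

# How much more variation the 1st eigenvector explains than the 2nd.
ratio_first_to_second = eigenvalues[0] / eigenvalues[1]
print(f"1st / 2nd eigenvalue: {ratio_first_to_second:.2f}")

# Size of every eigenvalue relative to the first one.
relative_to_first = [ev / eigenvalues[0] for ev in eigenvalues]
for rank, rel in enumerate(relative_to_first, start=1):
    print(rank, f"{rel:.2f}")
```

The relative sizes also show how quickly the eigenvalues flatten out after the first four.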

Dienekes on ANI/ASI

Posted by Zack on March 22, 2011 9 comments

Dienekes has a word of caution about choosing reference populations and admixture results.

Consider a sample of 25 Mexicans from the HapMap and 25 Yoruba from the Hapmap, 25 Iberian Spanish from the 1000 Genomes Project, and 25 Pima from the HGDP as parental populations. We obtain for our Mexican sample:

59.7% European

36.9% "Native American"

3.4% African

Let's run a final experiment with just the Mexicans, Spanish, and Yoruba, i.e., with no Native American samples. At K=3 we obtain:

70% "Native American"

29.7% European

0.4% African

The "Native American" component has increased again! The explanation is simple: as we exclude less admixed Native American groups, Mexicans appear (comparatively) more Native American. The "Native American pole" has shifted, and so has the relative position of populations between them.

In other terms, what is labeled "Native American" in the three experiments is not the same: in the first one it is anchored on the more unadmixed Pima, in the last one in the more admixed Mexicans.

Thus, it seems that unadmixed reference samples are much more useful in getting good results from Admixture.

Then he runs Admixture on the Reich et al dataset for South Asians and tries to estimate the relationship between the Ancestral North Indian percentage computed by Reich et al and his K=2 admixture results on the same data.

Dienekes then included South Asian Dodecad participants in the analysis and ran a K=4 admixture analysis on Reich et al + Dodecad South Asian data, including Yoruba and Beijing Chinese from the HapMap to catch any African or East Asian ancestry.

Here are the admixture results for the reference populations:

The R² correlation between the West Eurasian admixture component and the Reich et al ANI component is 0.98 which is good. His relationship equation comes out to:

ANI = 0.779*WestEurasian + 39.674

Using this relationship, he calculates the ANI and ASI (Ancestral South Indian) components for Dodecad project members. My results (DOD128) are as follows:

East Eurasian	0.0%
African	3.5%
Ancestral North Indian	75.9%
Ancestral South Indian	20.6%
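Dienekes' relationship equation quoted above is simple enough to evaluate directly. Here is a minimal Python sketch of it and its inverse; the function names are mine, and I am not assuming anything about how he handled the African and East Eurasian components when deriving the final ANI and ASI percentages.

```python
# Coefficients from the equation quoted above:
# ANI = 0.779 * WestEurasian + 39.674 (both in percent).
SLOPE = 0.779
INTERCEPT = 39.674


def ani_from_west_eurasian(west_eurasian_pct):
    """Estimated ANI % for a given West Eurasian admixture %."""
    return SLOPE * west_eurasian_pct + INTERCEPT


def west_eurasian_from_ani(ani_pct):
    """Inverse of the equation: West Eurasian % implied by an ANI %."""
    return (ani_pct - INTERCEPT) / SLOPE


for west in (0, 25, 50):
    print(west, round(ani_from_west_eurasian(west), 1))
```

Note that even a West Eurasian component of 0% maps to about 39.7% ANI under this equation, since the other South Asian admixture component itself carries ANI ancestry.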

I should point out that due to my recent Egyptian ancestry, my ANI result is wrong since it's collecting all of the non-African Egyptian ancestry in there too.

Also, in the case of Razib, I don't think his East Asian 14.4% should be separated out from his ANI-ASI like that. At least some of it should form part of his ASI percentage in my opinion.

Otherwise, this seems like a very good exercise by Dienekes.

References

Posted by Zack on March 18, 2011 No comments

Datasets

Behar, Doron M., Bayazit Yunusbayev, Mait Metspalu, Ene Metspalu, Saharon Rosset, Juri Parik, Siiri Rootsi, et al. "The genome-wide structure of the Jewish people." Nature 466, no. 7303 (July 8, 2010): 238-242. Paper & Data.
Botigué, Laura R., Brenna M. Henn, Simon Gravel, Brian K. Maples, Christopher R. Gignoux, Erik Corona, Gil Atzmon, et al. "Gene Flow from North Africa Contributes to Differential Human Genetic Diversity in Southern Europe." Proceedings of the National Academy of Sciences (June 3, 2013). doi:10.1073/pnas.1306223110. Paper & Data.
Bryc, K., C. Velez, T. Karafet, A. Moreno-Estrada, A. Reynolds, A. Auton, M. Hammer, C. D. Bustamante, and H. Ostrer. "Colloquium Paper: Genome-wide patterns of population structure and admixture among Hispanic/Latino populations." Proceedings of the National Academy of Sciences 107, no. 2 (2010): 8954-8961. Paper & Data.
Chaubey, Gyaneshwer, Mait Metspalu, Ying Choi, Reedik Mägi, Irene Gallego Romero, Pedro Soares, Mannis van Oven, et al. "Population Genetic Structure in Indian Austroasiatic Speakers: The Role of Landscape Barriers and Sex-Specific Admixture." Molecular Biology and Evolution 28, no. 2 (February 1, 2011): 1013-1024. Paper.
The 1000 Genomes Project Consortium. "A Map of Human Genome Variation from Population-scale Sequencing." Nature 467, no. 7319 (October 27, 2010): 1061-1073. Paper & Data.
Di Cristofaro J, Pennarun E, Mazières S, Myres NM, Lin AA, et al. (2013) Afghan Hindu Kush: Where Eurasian Sub-Continent Gene Flows Converge. PLoS ONE 8(10): e76748. doi:10.1371/journal.pone.0076748. Paper & Data
Haber, Marc, Dominique Gauguier, Sonia Youhanna, Nick Patterson, Priya Moorjani, Laura R. Botigué, Daniel E. Platt, et al. "Genome-Wide Diversity in the Levant Reveals Recent Structuring by Culture." PLoS Genet 9, no. 2 (February 28, 2013): e1003316. doi:10.1371/journal.pgen.1003316. Paper & Data.
Henn, Brenna M., Laura R. Botigué, Simon Gravel, Wei Wang, Abra Brisbin, Jake K. Byrnes, Karima Fadhlaoui-Zid, et al. "Genomic Ancestry of North Africans Supports Back-to-Africa Migrations." PLoS Genet 8, no. 1 (January 12, 2012): e1002397. Paper & Data.
Henn, Brenna M., Christopher R. Gignoux, Matthew Jobin, Julie M. Granka, J. M. Macpherson, Jeffrey M. Kidd, Laura Rodriguez-Botigué, et al. "Hunter-gatherer genomic diversity suggests a southern African origin for modern humans." Proceedings of the National Academy of Sciences 108, no. 13 (March 29, 2011): 5154-5162. Paper & Data.
Hodoğlugil, Uğur, and Robert W Mahley. "Turkish Population Structure and Genetic Ancestry Reveal Relatedness Among Eurasian Populations." Annals of Human Genetics 76, no. 2 (March 1, 2012): 128-141. Paper.
Li, Jun Z., Devin M. Absher, Hua Tang, Audrey M. Southwick, Amanda M. Casto, Sohini Ramachandran, Howard M. Cann, et al. "Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation." Science 319, no. 5866 (February 22, 2008): 1100-1104. Paper & Data.
Metspalu, Mait, Irene Gallego Romero, Bayazit Yunusbayev, Gyaneshwer Chaubey, Chandana Basu Mallick, Georgi Hudjashov, Mari Nelis, et al. "Shared and Unique Components of Human Population Structure and Genome-Wide Signals of Positive Selection in South Asia." The American Journal of Human Genetics 89, no. 6 (December 9, 2011): 731-744. Paper & Data.
Pagani, Luca, Toomas Kivisild, Ayele Tarekegn, Rosemary Ekong, Chris Plaster, Irene Gallego Romero, Qasim Ayub, et al. "Ethiopian Genetic Diversity Reveals Linguistic Stratification and Complex Influences on the Ethiopian Gene Pool." The American Journal of Human Genetics (n.d.). Paper & Data.
Rasmussen, Morten, Yingrui Li, Stinus Lindgreen, Jakob Skou Pedersen, Anders Albrechtsen, Ida Moltke, Mait Metspalu, et al. "Ancient human genome sequence of an extinct Palaeo-Eskimo." Nature 463, no. 7282 (February 11, 2010): 757-762. Paper & Data.
Reich, David, Kumarasamy Thangaraj, Nick Patterson, Alkes L. Price, and Lalji Singh. "Reconstructing Indian population history." Nature 461, no. 7263 (2009): 489-494. Paper.
Schlebusch, Carina M., Pontus Skoglund, Per Sjödin, Lucie M. Gattepaille, Dena Hernandez, Flora Jay, Sen Li, et al. "Genomic Variation in Seven Khoe-San Groups Reveals Adaptation and Complex African History." Science 338, no. 6105 (October 19, 2012): 374-379. doi:10.1126/science.1227721. Paper & Data.
Simonson, Tatum S, Yingzhong Yang, Chad D Huff, Haixia Yun, Ga Qin, David J Witherspoon, Zhenzhong Bai, et al. "Genetic Evidence for High-Altitude Adaptation in Tibet." Science 329, no. 5987 (July 2, 2010): 72-75. Paper & Data.
Teo, Yik-Ying, Xueling Sim, Rick T H Ong, Adrian K S Tan, Jieming Chen, Erwin Tantoso, Kerrin S Small, et al. "Singapore Genome Variation Project: a haplotype map of three Southeast Asian populations." Genome Research 19, no. 11 (November 2009): 2154-2162. Paper & Data.
The HUGO Pan-Asian SNP Consortium. "Mapping Human Genetic Diversity in Asia." Science 326, no. 5959 (December 11, 2009): 1541-1545. Paper.
The International HapMap 3 Consortium. "Integrating common and rare genetic variation in diverse human populations." Nature 467, no. 7311 (2010): 52-58. Paper & Data.
Xing, Jinchuan, W Scott Watkins, Adam Shlien, Erin Walker, Chad D Huff, David J Witherspoon, Yuhua Zhang, et al. "Toward a more uniform sampling of human genetic diversity: a survey of worldwide populations by high-density genotyping." Genomics 96, no. 4 (October 2010): 199-210. Paper & Data.
Yunusbayev, Bayazit, Mait Metspalu, Mari Järve, Ildus Kutuev, Siiri Rootsi, Ene Metspalu, Doron M. Behar, et al. "The Caucasus as an Asymmetric Semipermeable Barrier to Ancient Human Migrations." Molecular Biology and Evolution (2011). Paper.

Analysis

Alexander, David H., John Novembre, and Kenneth Lange. "Fast model-based estimation of ancestry in unrelated individuals." Genome Research 19, no. 9 (2009): 1655-1664. http://genome.cshlp.org/content/19/9/1655.abstract.
Browning, Sharon R., and Brian L. Browning. "Rapid and Accurate Haplotype Phasing and Missing-Data Inference for Whole-Genome Association Studies By Use of Localized Haplotype Clustering." The American Journal of Human Genetics 81, no. 5 (November 1, 2007): 1084-1097. http://www.cell.com/AJHG/retrieve/pii/S0002929707638828.
Delaneau, Olivier, Jonathan Marchini, and Jean-Francois Zagury. "A linear complexity phasing method for thousands of genomes." Nat Meth advance online publication (December 4, 2011). http://dx.doi.org/10.1038/nmeth.1785.
Lawson, Daniel John, Garrett Hellenthal, Simon Myers, and Daniel Falush. "Inference of Population Structure using Dense Haplotype Data." PLoS Genet 8, no. 1 (January 26, 2012): e1002453. http://dx.doi.org/10.1371/journal.pgen.1002453.
Manichaikul, Ani, Josyf C. Mychaleckyj, Stephen S. Rich, Kathy Daly, Michèle Sale, and Wei-Min Chen. "Robust Relationship Inference in Genome-wide Association Studies." Bioinformatics 26, no. 22 (November 15, 2010): 2867-2873. http://bioinformatics.oxfordjournals.org/content/26/22/2867.abstract.
Patterson, Nick, Alkes L Price, and David Reich. "Population Structure and Eigenanalysis." PLoS Genet 2, no. 12 (December 22, 2006): e190. http://dx.plos.org/10.1371/journal.pgen.0020190.
Patterson, Nick, Priya Moorjani, Yontao Luo, Swapan Mallick, Nadin Rohland, Yiping Zhan, Teri Genschoreck, Teresa Webster, and David Reich. "Ancient Admixture in Human History." Genetics 192, no. 3 (November 1, 2012): 1065-1093. doi:10.1534/genetics.112.145037.
Price, Alkes L, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick, and David Reich. "Principal components analysis corrects for stratification in genome-wide association studies." Nat Genet 38, no. 8 (2006): 904-909. http://dx.doi.org/10.1038/ng1847.
Purcell, Shaun, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel A. R. Ferreira, David Bender, Julian Maller, et al. "PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses." American Journal of Human Genetics 81, no. 3 (September 2007): 559-575. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1950838/

Software
