Reich et al and Pan-Asian Datasets

I got access to the Reich et al (Nature 2009) dataset used in their paper "Reconstructing Indian population history".

It has the following populations:

Aonaga Aus Bhil
Chenchu Great_Andamanese Hallaki
Kamsali Kashmiri_Pandit Kharia
Kurumba Lodi Madiga
Mala Meghawal Naidu
Nysha Onge Sahariya
Santhal Satnami Siddi
Somali Srivastava Tharu
Vaish Velama Vysya

Their dataset has 141 individuals with 587,753 SNPs and is conveniently in PED format.

Also, Blaise pointed me to the Pan-Asian SNP data used in the Dec 2009 Science paper "Mapping Human Genetic Diversity in Asia".

It includes the following 71 populations:

Maya Auca Quechua Karitiana Pima
Ami Atayal Melanesians Zhuang Han_Cantonese
Hmong Jiamao Jinuo Han_Shanghai Uyghur
Wa Alorese Dayak Javanese Batak_Karo
Lamaholot Lembata Malay Mentawai Manggarai
Kambera Sunda Batak_Toba Toraja Andhra_Pradesh
Karnataka Bengali-Assamese Rajasthan Uttaranchal Uttar_Pradesh
Haryana Spiti Bhili Marathi Japanese
Ryukyuan Korean Bidayuh Jehai Kelantan
Kensiu Temuan Ayta Agta Ati
Iraya Minanubu Mamanwa Filipino Singapore_Chinese
Singapore_Indian Singapore_Malay Hmong (Miao) Karen Lawa
Mlabri Mon Paluang Plang Tai_Khuen
Tai_Lue H'tin Tai_Yuan Tai_Yong Yao
Hakka Minnan

It has 1,719 individuals with 54,794 SNPs. I wish it had more SNPs considering the wealth of populations.

Also, the Pan-Asian data is in the form of minor allele counts, so I need to convert that back to A/C/G/T. Since some HapMap populations are included in the dataset, I can compare their counts against the known HapMap genotypes to work out which allele is which, so that shouldn't be too hard.
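As a rough illustration of that conversion, here is a minimal Python sketch. It is not the script actually used for this dataset. It assumes each genotype is coded as a minor allele count of 0, 1 or 2 (None for missing), that the two alleles of each SNP are known, and that known genotypes for the shared HapMap individuals are available as two-letter strings such as "AG". The function names count_to_genotype and infer_minor_allele are hypothetical, and strand flips are ignored.

```python
from typing import Optional, Sequence, Tuple

MISSING = ("0", "0")  # PED convention for a missing genotype


def count_to_genotype(count: Optional[int], major: str, minor: str) -> Tuple[str, str]:
    """Turn a minor allele count (0, 1, 2 or None) into a pair of allele letters."""
    if count is None:
        return MISSING
    if count not in (0, 1, 2):
        raise ValueError(f"unexpected minor allele count: {count!r}")
    # 0 -> (major, major), 1 -> (major, minor), 2 -> (minor, minor)
    return (major,) * (2 - count) + (minor,) * count


def infer_minor_allele(
    counts: Sequence[Optional[int]],
    known_genotypes: Sequence[Optional[str]],
    alleles: Tuple[str, str],
) -> Optional[str]:
    """Guess which of the two alleles is the counted one.

    counts and known_genotypes refer to the same individuals in the same
    order (assumed here to be the shared HapMap samples). The allele whose
    copy number agrees with the counts more often wins; a tie returns None.
    """
    def mismatches(candidate: str) -> int:
        return sum(
            1
            for c, g in zip(counts, known_genotypes)
            if c is not None and g is not None and g.count(candidate) != c
        )

    a, b = alleles
    err_a, err_b = mismatches(a), mismatches(b)
    if err_a == err_b:
        return None  # the shared samples cannot tell the alleles apart
    return a if err_a < err_b else b


# Example with one SNP and three shared individuals
minor = infer_minor_allele([0, 1, 2], ["AA", "AG", "GG"], ("A", "G"))
major = "A" if minor == "G" else "G"
print(minor, [count_to_genotype(c, major, minor) for c in (0, 1, 2, None)])
```

A SNP where every shared individual is heterozygous or missing gives a tie, so such SNPs would need another source of allele information or would have to be dropped.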

I am going to include both of these datasets in my big reference set.


  1. Great finds!

  2. I second SV! I cannot believe you got your hands on THE data-set! Very, very awesome!

  3. Wait, this means that an ANI-ASI division is now possible, right?

    The ANI-ASI division was created using a different methodology than what ADMIXTURE does. The whole point of Reich et al. was that the South Asian cluster which always pops out of ADMIXTURE/STRUCTURE/frappe doesn't naturally disaggregate, so they had to use a different technique.

  4. Paul Ó Duḃṫaiġ

    Great find on the Pan-Asian data. If you want, I can send you my OH's 23andMe (v3) data. She's Filipino, though she has some Spanish admixture (1/8th -- great grandfather).


  5. Good job Zack, will there be more cool admixture analyses now, with the Pan-Asian dataset included?

    By the way, will you run K=17 and K=18 analyses too?

    • Yes, I'll have more admixture results after I have incorporated the new data.

      For the current reference 1, K=17 is probably the highest I'll go.

  6. Dienekes on ANI/ASI | Harappa Ancestry Project - pingback on March 22, 2011 at 8:27 am
  7. Zack, will you at some point be able to run these two datasets at K=16, like you did for the Xing dataset a while ago? It'd be great if you could, as the second dataset has Uttar Pradesh, Rajasthan and Maharashtra - some of the most populous states in India.


    • I have incorporated Reich in my new reference which I am testing right now.

      As for the Pan-Asian dataset, I am going to run some Admixture and PCA experiments on it soon. Unfortunately, it has too few SNPs in common with 23andMe to be of much use for Harappa analysis.

  8. Pan-Asian to PED Conversion | Harappa Ancestry Project - pingback on April 16, 2011 at 8:05 am
  9. Pan-Asian Ref3 K=11 Admixture | Harappa Ancestry Project - pingback on March 13, 2012 at 6:27 am
  10. Zack, I'm sooo bad at finding stuff online. Can you post the link to the genotype (PED) file you found?

    I'm mainly interested in the Andamanese and Austroasiatic populations. I'm Filipino and would like to see my admixture with those groups.

    Many thanks and keep on posting!!!
