Project Update

I now have a total of 42 participants in the project who have sent me their raw data. This does not count two people who have relatives participating and thus have to be filtered out of most analyses, other than individual admixture percentages and the like, where I divide participants into small groups.

The following groups are represented:

  • Punjab: 7
  • Iran: 6
  • Tamil: 5
  • Andhra Pradesh: 2
  • Bengal: 2
  • Bihar: 2
  • Karnataka: 2
  • Caribbean Indian: 2
  • Kashmir: 2
  • Anglo-Indian: 1
  • Roma: 1
  • Goa: 1
  • Uttar Pradesh: 1
  • Sri Lankan: 1
  • Rajasthan: 1
  • Kerala: 1
  • Baloch: 1
  • Unknown: 1

The unknown is Manu Sporny, who has put his genetic data in the public domain; I have drafted him into our project.

In addition, out of curiosity, I have accepted data from the following:

  • Iraqi Arab: 2
  • Egyptian/Iraqi Jew: 1
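
For anyone checking the arithmetic, here is a small tally of the two lists above. The counts are copied straight from the lists; the variable names are just for illustration.

```python
# Counts copied from the two lists above.
main_groups = {
    "Punjab": 7, "Iran": 6, "Tamil": 5,
    "Andhra Pradesh": 2, "Bengal": 2, "Bihar": 2,
    "Karnataka": 2, "Caribbean Indian": 2, "Kashmir": 2,
    "Anglo-Indian": 1, "Roma": 1, "Goa": 1, "Uttar Pradesh": 1,
    "Sri Lankan": 1, "Rajasthan": 1, "Kerala": 1, "Baloch": 1,
    "Unknown": 1,
}
extra_groups = {"Iraqi Arab": 2, "Egyptian/Iraqi Jew": 1}

main_total = sum(main_groups.values())
extra_total = sum(extra_groups.values())

print(main_total, extra_total, main_total + extra_total)  # 39 3 42
```

So the 42 includes the three participants accepted out of curiosity.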

I know a bunch of you have done a lot to make this project known and have gotten people to submit their data. But we really do need more participants of every ethnicity and geographic region in and around South Asia. So keep it up!

I am working on K=12 admixture runs for the batches we have already done. In addition, I will run the reference I dataset at even higher values of K (the number of ancestral components) to see where the limit is.
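
A rough sketch of what a sweep over K could look like, assuming the admixture command-line program with its cross-validation option (`--cv`) and thread option (`-j`). The file name `reference1.bed`, the upper end of the K range, and the helper `admixture_commands` are made up for illustration, and comparing cross-validation error is only one possible way to look for the limit. The commands are printed, not run.

```python
def admixture_commands(bed_file, k_values, threads=4):
    """Build one admixture command line per K, with cross-validation on."""
    return [
        ["admixture", "--cv", f"-j{threads}", bed_file, str(k)]
        for k in k_values
    ]

# Hypothetical input file and K range (12 through 15).
commands = admixture_commands("reference1.bed", range(12, 16))

for cmd in commands:
    # Print only; pass each list to subprocess.run() to actually execute it.
    print(" ".join(cmd))
```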

Also, I am looking into doing chromosome-by-chromosome admixture (and other analyses). I have done some experimental runs, and once I have pored over that data, I'll have something to report.

As we have seen, even with the removal of the San and Pygmy samples, the Africans take up 3 ancestral components, and most South Asians (excepting me, of course) do not have any African admixture. So I am working on a reference dataset without any Africans. I have my own take on how to do that, which I'll share in the next few days.
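
This is not the approach I'll be describing, just the most generic way to drop a set of populations: build a list of family ID / individual ID pairs and hand it to plink's `--remove` option. The sample rows, population labels, and file names below are all made up for illustration.

```python
def samples_to_remove(samples, excluded_populations):
    """Return (family_id, individual_id) pairs whose population is excluded."""
    return [
        (fid, iid)
        for fid, iid, population in samples
        if population in excluded_populations
    ]

# Made-up rows: (family ID, individual ID, population label).
samples = [
    ("F1", "S1", "Yoruba"),
    ("F2", "S2", "Sindhi"),
    ("F3", "S3", "Mandenka"),
    ("F4", "S4", "Gujarati"),
]
african = {"Yoruba", "Mandenka"}

# One "FID IID" line per sample, the format plink expects for --remove.
remove_lines = [f"{fid} {iid}" for fid, iid in samples_to_remove(samples, african)]
print("\n".join(remove_lines))

# Hypothetical command once the lines are saved to africans.txt:
plink_cmd = ["plink", "--bfile", "reference1", "--remove", "africans.txt",
             "--make-bed", "--out", "reference1_no_africans"]
print(" ".join(plink_cmd))
```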

In short, my home computer is running admixture, plink, eigensoft, etc. 24/7.


  1. Sporny's half Sri Lankan, half white American of German and Polish ancestry.

  2. Personal genome in the public domain | Biology News by Biologged - pingback on February 21, 2011 at 8:32 am
  3. Hi Zack,

    I am a Bengali Brahmin whose grandparents are all from Barisal district of Bangladesh. I have been trying to send you my 23andme v3 data for a while, but am getting no response. Could you let me know whether you received it and will process it at some point, or whether both my emails ended up in the bit bucket?


    • I am really sorry. Both of your emails ended up in my spam folder.

      I downloaded your data and will include it in batch 6.

  4. We have more Punjabis and Iranians than UPites, Biharis and Marathis!

    • Well, technically Iran is a country, so it's not a fair comparison. Currently the makeup is representative of the actual Indian diaspora to some degree.

      • According to the American Community Survey (from the US Census), 26.3% of Indian Americans speak Hindi at home, 14.1% speak Gujarati, 10.1% speak English, and 10.0% speak Punjabi. (For comparison, 3.4% speak Marathi.) A lot of those Hindi speakers are probably Gujarati or Punjabi, but a fair number are probably from the cow belt. The Indo-Canadian population is predominantly Punjabi- and Tamil-speaking. But when you add in the diaspora population from other countries, the proportion of UPers and Biharis probably goes up.

        So I think people from the Indo-Gangetic plain are underrepresented -- even in terms of their diaspora populations -- but Gujaratis are really underrepresented among the participants (as opposed to the HapMap reference samples). Punjabis and South Indians (South Indian Brahmins in particular) are probably overrepresented.

        • What about Tamil in ACS?

          • 6.7% speak Tamil at home. (Compared to 9.7% who speak Telugu, 6.1% Malayalam, and 1.7% Kannada.)

            Note that these figures reflect the proportion of Indian-born respondents who speak the language at home, so they don't include Pakistani-born Punjabi speakers, Singaporean Tamils, American-born Telugu speakers, etc.

      • Oh, the British Asian population is also largely Punjabi and Gujarati. But I still think people from U.P. and Bihar are underrepresented.

      • Iran has a population of 77 million while the two Punjabs together are 105 million.

  5. Bengal (West Bengal + Bangladesh) has 270 million people - another underrepresented region/linguistic group.

    • Bengalis are still underrepresented but now we have more than Razib's family.

      • I did see one of them above: Tanmoy. There is some excellent material on his website. I remember communicating with him about his match to the Andronovo ancient DNA's STRs.
