HarappaWorld Tweaks

First of all, I want to point out that I am using weighted means for the HarappaWorld population averages instead of simply averaging all samples' results. The weighting gives less importance to outliers. I find this a better solution than a simple average or a median: a median discards the outliers entirely, but it also throws away a lot of information.
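The exact weighting scheme is not spelled out in this post, so the following is only a minimal sketch of one way to down-weight outliers, not necessarily what HarappaWorld does. It assumes each individual gets a single weight that shrinks with their distance from the component-wise median of the population. The function name `weighted_population_mean`, the weight formula, and the toy numbers are all hypothetical.

```python
from statistics import median


def weighted_population_mean(samples):
    """Down-weighted average of admixture vectors (illustrative only).

    samples: list of equal-length lists, one list of proportions per individual.
    """
    n_components = len(samples[0])
    # Component-wise median acts as a robust centre of the population.
    centre = [median(s[k] for s in samples) for k in range(n_components)]
    # Euclidean distance of each individual from that centre.
    dists = [
        sum((s[k] - centre[k]) ** 2 for k in range(n_components)) ** 0.5
        for s in samples
    ]
    # Scale by the median distance so the weights are unit-free;
    # fall back to 1.0 if the median distance is zero.
    scale = median(dists) or 1.0
    # Assumed weight function: close to 1 near the centre, small far away.
    weights = [1.0 / (1.0 + (d / scale) ** 2) for d in dists]
    total = sum(weights)
    return [
        sum(w * s[k] for w, s in zip(weights, samples)) / total
        for k in range(n_components)
    ]


# Made-up toy data: four similar individuals and one clear outlier.
samples = [
    [0.60, 0.30, 0.10],
    [0.62, 0.28, 0.10],
    [0.58, 0.32, 0.10],
    [0.61, 0.29, 0.10],
    [0.20, 0.10, 0.70],
]
plain = [sum(col) / len(samples) for col in zip(*samples)]
weighted = weighted_population_mean(samples)
print("plain mean:   ", [round(x, 4) for x in plain])
print("weighted mean:", [round(x, 4) for x in weighted])
```

Because every component of an individual shares the same weight, the result is still a convex combination of the samples, so the proportions continue to sum to 100%. With only two samples, both sit at the same distance from the centre, so this sketch reduces to the plain mean.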

An example of the weighted mean effect can be seen in the Behar et al. Armenian samples. Four of the samples have higher NE European percentages than the rest. As you can see in the table below, the weighting reduces their impact on the population results.

| | Mean | Mean | Weighted Mean | Weighted Mean |
|---|---|---|---|---|
| Ethnicity | armenian | armenian | armenian | armenian |
| Dataset | behar | yunusbayev | behar | yunusbayev |
| N | 19 | 16 | 19 | 16 |
| S Indian | 0.37% | 0.52% | 0.41% | 0.52% |
| Baloch | 16.57% | 17.73% | 17.07% | 17.65% |
| Caucasian | 54.35% | 56.43% | 57.29% | 56.61% |
| NE Euro | 8.96% | 2.98% | 5.35% | 2.95% |
| SE Asian | 0.10% | 0.12% | 0.10% | 0.13% |
| Siberian | 0.49% | 0.09% | 0.29% | 0.09% |
| NE Asian | 0.14% | 0.08% | 0.16% | 0.09% |
| Papuan | 0.28% | 0.27% | 0.26% | 0.27% |
| American | 0.19% | 0.18% | 0.22% | 0.18% |
| Beringian | 0.26% | 0.19% | 0.23% | 0.20% |
| Mediterranean | 8.46% | 8.37% | 8.21% | 8.40% |
| SW Asian | 9.81% | 13.03% | 10.40% | 12.91% |
| San | 0.00% | 0.00% | 0.00% | 0.00% |
| E African | 0.02% | 0.00% | 0.01% | 0.00% |
| Pygmy | 0.00% | 0.00% | 0.00% | 0.00% |
| W African | 0.00% | 0.00% | 0.00% | 0.00% |

Another example is the Somali samples in the Reich et al. data. One sample (out of 6) seems to be eastern Bantu. Let's compare the unweighted mean and the weighted mean for Somalis among the Reich et al. samples and the Harappa participants.

| | Mean | Mean | Weighted Mean | Weighted Mean |
|---|---|---|---|---|
| Ethnicity | somali | somali | somali | somali |
| Dataset | harappa | reich | harappa | reich |
| N | 2 | 6 | 2 | 6 |
| S Indian | 0.00% | 1.62% | 0.00% | 1.49% |
| Baloch | 0.00% | 0.00% | 0.00% | 0.00% |
| Caucasian | 2.76% | 0.00% | 2.76% | 0.00% |
| NE Euro | 0.00% | 0.11% | 0.00% | 0.04% |
| SE Asian | 0.27% | 0.05% | 0.27% | 0.06% |
| Siberian | 0.00% | 0.04% | 0.00% | 0.05% |
| NE Asian | 0.00% | 0.41% | 0.00% | 0.46% |
| Papuan | 0.26% | 0.10% | 0.26% | 0.11% |
| American | 0.14% | 0.17% | 0.14% | 0.19% |
| Beringian | 0.23% | 0.33% | 0.23% | 0.38% |
| Mediterranean | 2.12% | 3.25% | 2.12% | 3.65% |
| SW Asian | 31.73% | 24.48% | 31.73% | 27.33% |
| San | 1.96% | 1.48% | 1.96% | 1.37% |
| E African | 60.37% | 56.75% | 60.37% | 60.13% |
| Pygmy | 0.15% | 1.78% | 0.15% | 1.23% |
| W African | 0.00% | 9.43% | 0.00% | 3.51% |

Also, I have divided Singapore Indians into 4 groups (actually 3 groups and 1 outlier) since they are so heterogeneous. Here are the weighted mean admixture proportions for all Singapore Indians and the four subgroups.

| Ethnicity | singapore-indian | singapore-indian-1 | singapore-indian-2 | singapore-indian-3 | singapore-indian-4 |
|---|---|---|---|---|---|
| Dataset | sgvp | sgvp | sgvp | sgvp | sgvp |
| N | 83 | 31 | 41 | 10 | 1 |
| S Indian | 53.57% | 61.95% | 50.39% | 33.68% | 27.81% |
| Baloch | 33.97% | 30.24% | 36.00% | 40.72% | 14.27% |
| Caucasian | 3.55% | 1.92% | 4.03% | 9.32% | 4.53% |
| NE Euro | 2.93% | 0.08% | 3.89% | 9.84% | 35.38% |
| SE Asian | 1.31% | 1.30% | 1.23% | 0.63% | 1.20% |
| Siberian | 0.45% | 0.47% | 0.44% | 0.43% | 1.19% |
| NE Asian | 0.92% | 0.91% | 0.80% | 1.19% | 3.26% |
| Papuan | 0.72% | 1.09% | 0.50% | 0.35% | 0.62% |
| American | 0.42% | 0.35% | 0.44% | 0.69% | 1.29% |
| Beringian | 0.56% | 0.38% | 0.65% | 0.76% | 0.00% |
| Mediterranean | 0.67% | 0.40% | 0.72% | 1.33% | 10.38% |
| SW Asian | 0.90% | 0.86% | 0.87% | 1.05% | 0.06% |
| San | 0.01% | 0.00% | 0.01% | 0.00% | 0.00% |
| E African | 0.03% | 0.02% | 0.04% | 0.00% | 0.00% |
| Pygmy | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| W African | 0.01% | 0.01% | 0.00% | 0.00% | 0.00% |

I have updated the spreadsheet as well as HarappaWorld Oracle.


  1. Fine thing (truly) but try to invent something that will give the real ancestry data (or closest to it) for individuals and groups.

    • Good idea. Why didn't I think of that.

      • Is it just me, or do 3/4ths of the content of Nirjhar's comments generally not make sense at all?

        • Well, I wrote it because I thought that is what we should aim for, finding the truth among all kinds of indirect cra+p, and I expected a noble reply, that's all.

  2. Zack, Just wanted to say thanks for all the hard work and effort you put into constantly refining the tools, and adding new ones on a regular basis. On behalf of all of the silent participants, thanks and keep it up.

    • I concur. A heartfelt thank you to Zack for putting time into HAP for the past year. He's assembled a plethora of knowledge and data that serves as excellent reference on the incredibly genetic canvas of South-Asia that academia has yet to match up to.

    • It really is a blessing to have someone of your knowledge and profound interest.

      Thank you Zack for making these tools available not only to your brethren from South Asia, but to the whole world.

  3. Yes, I truly acknowledge Zack's efforts; in truth, I can say he is working his rear off to improve the tools. But personally I will be happy only if a true and direct data tool emerges. Yes, you can mock me or get angry at me, but the fact is that without the direct truth I consider everything worthless.
    "Truth is riddance", that's what I know.

    • Nirjhar,
      Can you achieve what you seek on your own? Do you have the theory, math and tools worked out to see if this is even possible? If so, why not do it? I am sure we would all more than welcome any possible contribution!
      There is no magic here. You can only extract the information that exists in the available data. How can one extract accurate ancestral information from completely admixed data with poor sampling? Think about it.
      Zack is doing this as a hobby, and is going above and beyond what any of us are doing...

  4. thanks zack!

    btw, is the previous commenter just retarded or something?

  5. Zack, how is this for an idea for a future post?
    - Create a non-3D PCA/MDS plot with both Harappa participants and reference samples from South Asia.
    - Assign different colors to different groups' dots on the MDS. So, while different ethnic groups would have different colors, you could also assign different colors to Harappa, Metspalu and Xing Tamil-Brahmins, for instance.
    - In addition to that, a 3D PCA plot to infer individual participants' positions.

    All this only when you have the time, of course!

    • I am working on a fairly compute-intensive analysis right now. Once that's done, I'll likely do SPA as well as PCA.

  6. First of all, I have no intention of doing it, simply because that is not my field, but you don't have to be a scientist or an engineer to say that all of these tools and data are simply indirect ones giving indirect results or, more correctly, false ones!
    About achieving the goal myself: well, your question is the answer itself.
    Yes, Zack can do this as a hobby with good will, but it's not just another admixture fun; it is about the ancestral identity of over 1 billion people. You can say this population is heavily admixed, but I don't see that so much. I think, if I'm not wrong, there are 4 main components for South Asian folks: 1. South Asian specific, 2. West Asian specific, 3. European specific and 4. East Asian specific. If we can see the calculated ages of those components, then that, you could say, would be relatively trustworthy and important data, like e.g. Metspalu et al. But I'm not saying that will be 100% accurate at all. And about poor sampling: well, if someone can verify it, I believe he can also correct for it according to the "error".

    • Nirjhar, no one held a gun to your head to make you use this site. I would like to remind you that you paid no fees to access any of the tools Zack has developed. If you think the results are false, please leave, stop commenting and let the rest of us use these tools in peace. The only benefit of having you remain in this project is that by studying your DNA, one day we may perhaps discover the gene that causes ingratitude and bad manners.

      • He's not even participating in the project. He's clearly just a little upset that some of the genetic data assembled by various genome bloggers and academia does not gel with his preconceptions and cultural biases.

        • Sorry AV, but I'm not upset and not playing with the tools, just asking a simple question: how truthful are they?
          However, the ages of the main components are vital, so I just keep pleading for them to be calculated. Please, Zack, do that; I fully trust you, but not foolishly ;-).

  7. I'm a free bird, Ranil, and I only say what the seeds and bases are.