HarappaWorld Tweaks

Posted by Zack on May 29, 2012

First of all, I want to point out that I am using weighted means for the HarappaWorld population averages instead of simply averaging all samples' results. The weighting gives less importance to outliers. I find this a better solution than a simple mean or a median: a median removes the influence of outliers, but it also discards a lot of information.
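
The post does not spell out the weighting scheme, so the following is only a minimal sketch of one way an outlier-resistant weighted mean could work, not the method actually used for HarappaWorld. It assumes Cauchy-style weights based on each sample's distance from the current population center; the function name `robust_weighted_mean` and the toy data (six hypothetical samples, two components) are made up for illustration.

```python
import numpy as np


def robust_weighted_mean(samples, n_iter=20):
    """Average admixture vectors while down-weighting samples far from the center.

    samples: array-like of shape (n_samples, n_components).
    The weighting rule is an illustrative assumption, not HarappaWorld's.
    """
    x = np.asarray(samples, dtype=float)
    center = np.median(x, axis=0)  # robust starting point
    weights = np.ones(len(x))
    for _ in range(n_iter):
        # Euclidean distance of every sample from the current center
        dist = np.linalg.norm(x - center, axis=1)
        scale = np.median(dist)
        if scale == 0:  # at least half the samples sit exactly on the center
            scale = 1.0
        # Cauchy-style weights: close samples get ~1, distant samples get ~0
        weights = 1.0 / (1.0 + (dist / scale) ** 2)
        center = np.average(x, axis=0, weights=weights)
    return center, weights


# Hypothetical data: five similar samples and one outlier (last row).
toy = np.array([
    [0.60, 0.03],
    [0.61, 0.02],
    [0.59, 0.04],
    [0.60, 0.03],
    [0.62, 0.02],
    [0.40, 0.40],
])

plain = toy.mean(axis=0)
weighted, weights = robust_weighted_mean(toy)
print("plain mean:   ", plain)
print("weighted mean:", weighted)
print("weights:      ", weights)
```

In this toy case the outlier pulls the plain mean of the second component well above what the other five samples show, while the weighted mean stays close to them. Unlike a median, every sample still contributes something.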

An example of the weighted mean effect can be seen in the Behar et al Armenian samples. Four of the samples have higher NE European percentages than the rest. As you can see in the table below, the weighting reduces their impact on the population results: the NE Euro average for the Behar Armenians drops from 8.96% to 5.35%.

	Mean	Mean	Weighted Mean	Weighted Mean
Ethnicity	armenian	armenian	armenian	armenian
Dataset	behar	yunusbayev	behar	yunusbayev
N	19	16	19	16
S Indian	0.37%	0.52%	0.41%	0.52%
Baloch	16.57%	17.73%	17.07%	17.65%
Caucasian	54.35%	56.43%	57.29%	56.61%
NE Euro	8.96%	2.98%	5.35%	2.95%
SE Asian	0.10%	0.12%	0.10%	0.13%
Siberian	0.49%	0.09%	0.29%	0.09%
NE Asian	0.14%	0.08%	0.16%	0.09%
Papuan	0.28%	0.27%	0.26%	0.27%
American	0.19%	0.18%	0.22%	0.18%
Beringian	0.26%	0.19%	0.23%	0.20%
Mediterranean	8.46%	8.37%	8.21%	8.40%
SW Asian	9.81%	13.03%	10.40%	12.91%
San	0.00%	0.00%	0.00%	0.00%
E African	0.02%	0.00%	0.01%	0.00%
Pygmy	0.00%	0.00%	0.00%	0.00%
W African	0.00%	0.00%	0.00%	0.00%

Another example is the Somali samples in the Reich et al data. One sample (out of 6) seems to be eastern Bantu. Let's compare the unweighted mean and the weighted mean for Somalis in Reich et al and among Harappa participants. With weighting, the W African average for the Reich Somalis drops from 9.43% to 3.51%.

	Mean	Mean	Weighted Mean	Weighted Mean
Ethnicity	somali	somali	somali	somali
Dataset	harappa	reich	harappa	reich
N	2	6	2	6
S Indian	0.00%	1.62%	0.00%	1.49%
Baloch	0.00%	0.00%	0.00%	0.00%
Caucasian	2.76%	0.00%	2.76%	0.00%
NE Euro	0.00%	0.11%	0.00%	0.04%
SE Asian	0.27%	0.05%	0.27%	0.06%
Siberian	0.00%	0.04%	0.00%	0.05%
NE Asian	0.00%	0.41%	0.00%	0.46%
Papuan	0.26%	0.10%	0.26%	0.11%
American	0.14%	0.17%	0.14%	0.19%
Beringian	0.23%	0.33%	0.23%	0.38%
Mediterranean	2.12%	3.25%	2.12%	3.65%
SW Asian	31.73%	24.48%	31.73%	27.33%
San	1.96%	1.48%	1.96%	1.37%
E African	60.37%	56.75%	60.37%	60.13%
Pygmy	0.15%	1.78%	0.15%	1.23%
W African	0.00%	9.43%	0.00%	3.51%

Also, I have divided Singapore Indians into 4 groups (actually 3 groups and 1 outlier) since they are so heterogeneous. Here are the weighted mean admixture proportions for all Singapore Indians and the four subgroups.

Ethnicity	singapore-indian	singapore-indian-1	singapore-indian-2	singapore-indian-3	singapore-indian-4
Dataset	sgvp	sgvp	sgvp	sgvp	sgvp
N	83	31	41	10	1
S Indian	53.57%	61.95%	50.39%	33.68%	27.81%
Baloch	33.97%	30.24%	36.00%	40.72%	14.27%
Caucasian	3.55%	1.92%	4.03%	9.32%	4.53%
NE Euro	2.93%	0.08%	3.89%	9.84%	35.38%
SE Asian	1.31%	1.30%	1.23%	0.63%	1.20%
Siberian	0.45%	0.47%	0.44%	0.43%	1.19%
NE Asian	0.92%	0.91%	0.80%	1.19%	3.26%
Papuan	0.72%	1.09%	0.50%	0.35%	0.62%
American	0.42%	0.35%	0.44%	0.69%	1.29%
Beringian	0.56%	0.38%	0.65%	0.76%	0.00%
Mediterranean	0.67%	0.40%	0.72%	1.33%	10.38%
SW Asian	0.90%	0.86%	0.87%	1.05%	0.06%
San	0.01%	0.00%	0.01%	0.00%	0.00%
E African	0.03%	0.02%	0.04%	0.00%	0.00%
Pygmy	0.00%	0.00%	0.00%	0.00%	0.00%
W African	0.01%	0.01%	0.00%	0.00%	0.00%
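
As a rough cross-check on the table above, one can pool the four subgroup values by sample size and compare the result with the overall column. The sketch below does this for S Indian only, using numbers copied from the table; the variable names are just for illustration.

```python
# Subgroup sizes and weighted-mean S Indian percentages, copied from the table above.
n = [31, 41, 10, 1]
s_indian = [61.95, 50.39, 33.68, 27.81]
reported_overall = 53.57  # weighted mean for all 83 samples, from the table

# Pool the subgroup means, weighting each subgroup by its sample count.
pooled = sum(k * p for k, p in zip(n, s_indian)) / sum(n)
print(f"pooled by N: {pooled:.2f}%  reported: {reported_overall:.2f}%")
```

The pooled figure comes out at about 52.42%, a little below the reported 53.57%. The two need not match exactly, since the overall column is a weighted mean over the individual samples rather than a size-weighted pooling of subgroup means; reading the gap as an effect of outlier down-weighting is an assumption, not something the post states.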

I have updated the spreadsheet as well as HarappaWorld Oracle.

18 Comments.

Nirjhar May 29, 2012 at 11:14 pm

Fine thing (truly) but try to invent something that will give the real ancestry data (or closest to it) for individuals and groups.
- Zack May 30, 2012 at 6:41 am
  
  Good idea. Why didn't I think of that.
  - AV May 30, 2012 at 10:52 am
    
    Is it just me or 3/4ths of the content of Nirjhar's comments generally don't make sense at all?
    - Nirjhar June 1, 2012 at 10:12 pm
      
      Well i wrote it cause i thought that is the thing that we should aim to find the truth among all kinds of indirect cra+p and i expected a noble reply thats all.
RanilB June 2, 2012 at 3:31 am

Zack, Just wanted to say thanks for all the hard work and effort you put into constantly refining the tools, and adding new ones on a regular basis. On behalf of all of the silent participants, thanks and keep it up.
- AV June 2, 2012 at 1:36 pm
  
  I concur. A heartfelt thank you to Zack for putting time into HAP for the past year. He's assembled a plethora of knowledge and data that serves as an excellent reference on the incredible genetic canvas of South-Asia that academia has yet to match up to.
- Sakiusad June 3, 2012 at 5:00 pm
  
  It really is a blessing to have someone of your knowledge and profound interest.
  
  Thank you Zack for making these tools available not only to your brethren from South Asia, but to the whole world.
Nirjhar June 2, 2012 at 10:37 pm

Yes i truly acknowledge zacks efforts in truth i can say he is working his rear off to improve the tools but personally i will be happy if a true and direct data tool emerges yes you can mock me or get angry to me but the fact is without the direct truth i consider everything worthless.
"Truth is riddance" thats what i know.
- HRP142 June 4, 2012 at 11:25 am
  
  Nirjhar,
  Can you achieve what you seek on your own? Do you have the theory, math, tools resolved to see if this is even possible? If so why not do it? I am sure we would all more than welcome any possible contribution!
  There is no magic here. You can only extract the information that exists in the available data. How can one extract accurate ancestral information from completely admixed data with poor sampling? Think about it.
  Zack is doing this as a hobby, and is going above and beyond what any of us are doing...
razib June 3, 2012 at 11:56 pm

thanks zack!

btw, is the previous commenter just retarded or something?
AV June 4, 2012 at 1:41 am

Zack, how is this for an idea for a future post?
- Create a non-3D PCA/MDS plot with both Harappa participants and reference samples from South Asia.
- Different colors be assigned to different groups' dots on the MDS. So for instance, while different ethnic groups will have different colors; you could also assign different colors to Harappa, Metspalu and Xing Tamil-Brahmins; for instance.
- In addition to that, a 3D PCA plot to infer individual participants' positions.

All this only when you have the time, of course!
- Zack June 21, 2012 at 3:30 pm
  
  I am working on a fairly compute-intensive analysis right now. Once that's done, I'll likely do SPA as well as PCA.
Nirjhar June 4, 2012 at 10:29 pm

First of all i have no intention to do it simply cause that is not my field but you dont have to be a scientist or an engineer to say all of these tools and data are simply indirect ones giving indirect results or more correctly false!
About achieving the goal myself well your question is the answer itself.
Yes zack can do this as a hobby with good will, but it's not just another admixture fun; it's about the ancestral identity of over 1 billion people. You can say this population is heavily admixed but i don't see that much, as i think, if i'm not wrong, there are 4 main components for south asian folks: 1. South asian specific, 2. West asian specific, 3. European specific and 4. East asian specific. So if we can see the calculated ages of those components, then that will be relatively trustworthy and important data, like e.g. Metspalu et al. But i'm not saying that will be 100% accurate at all, and about poor sampling, well, if someone can verify it i believe he can also correct that according to the "error".
- RanilB June 5, 2012 at 8:10 am
  
  Nirjhar, no one held a gun to your head to make you use this site. I would like to remind you that you paid no fees to access any of the tools Zack has developed. If you think the results are false, please leave, stop commenting and let the rest of us use these tools in peace. The only benefit of having you remain in this project is that by studying your DNA, one day we perhaps may discover the gene that causes ingratitude and bad manners.
  - AV June 5, 2012 at 8:34 am
    
    He's not even participating in the project. He's clearly just a little upset that some of the genetic data assembled by various genome bloggers and academia does not gel in with his preconceptions and cultural biases.
    - Nirjhar June 5, 2012 at 11:08 pm
      
      Sorry AV but i'm not upset and not playing with the tools, just asking a simple question: how truthful are they?
      However, the ages of the main components are vital, so i just keep pleading to calculate them. Please zack, do that; i fully trust you, but not foolishly ;-).
Nirjhar June 5, 2012 at 8:25 am

I'm a free bird Ranil and i only say what are the seeds and bases.
- RKM July 3, 2012 at 6:10 pm
  
  Google Translator, please!

Harappa Ancestry Project

Genetics and South Asia