
HarappaWorld Oracle

Here's the HarappaWorld Oracle to go with the HarappaWorld admixture results and DIYHarappaWorld.

It works much like the old Ref3 Harappa Oracle, with a couple of differences. One, there is no panasian switch, since the Pan-Asian dataset is not included in this calculator.

I have added an optional mincount argument. It picks only those groups where the number of individuals is equal to or more than mincount for the Oracle calculation. By default mincount is 2, so only those groups which have 2 or more samples are used to compute your Oracle results.
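To illustrate what the mincount filter does, here is a minimal R sketch. It assumes that group labels end in the sample count, as in punjabi-arain_xing_25. The function name filter_by_mincount, the label parsing, and the label example_group_3 are hypothetical and are not taken from the Oracle's own code.

```r
# Hypothetical helper: keep only groups whose label ends in a sample
# count (e.g. "punjabi-arain_xing_25") of at least `mincount`.
filter_by_mincount <- function(groups, mincount = 2) {
  # Assumption: the digits after the last underscore are the sample count.
  n <- as.integer(sub(".*_([0-9]+)$", "\\1", groups))
  groups[n >= mincount]
}

# "example_group_3" is an invented label used only to show a group being dropped.
groups <- c("punjabi-arain_xing_25", "punjabi_harappa_7",
            "lebanese_behar_7", "example_group_3")
filter_by_mincount(groups, mincount = 4)
```

With mincount = 4 the invented three-sample group is dropped, while the default of 2 keeps all four labels.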

Let's look at my top 20 Oracle results in mixed mode, excluding population groups with fewer than 4 individuals.

HarappaOracle(c(26.46,36.82,14.22,4.78,0.00,1.32,0.86,0.04,0.19,0.06,3.63,8.07,0.00,2.44,0.43,0.67),k=20,mincount=4,mixedmode=T)

[,1] [,2]
[1,] "18.1% egyptian_behar_12 + 81.9% punjabi-arain_xing_25" "2.3361"
[2,] "18.1% egypt_henn2012_19 + 81.9% punjabi-arain_xing_25" "2.5615"
[3,] "80.7% punjabi-arain_xing_25 + 19.3% yemenese_behar_8" "2.8388"
[4,] "18.4% palestinian_hgdp_46 + 81.6% punjabi-arain_xing_25" "2.9944"
[5,] "84.7% punjabi-arain_xing_25 + 15.3% yemen-jew_behar_15" "3.0923"
[6,] "19.1% jordanian_behar_20 + 80.9% punjabi-arain_xing_25" "3.1877"
[7,] "18% egypt_henn2012_19 + 82% sindhi_hgdp_24" "3.4814"
[8,] "17.9% egyptian_behar_12 + 82.1% sindhi_hgdp_24" "3.5554"
[9,] "20.3% jordanian_behar_20 + 79.7% punjabi_harappa_7" "3.6161"
[10,] "18.9% egyptian_behar_12 + 81.1% punjabi_harappa_7" "3.6587"
[11,] "19.5% palestinian_hgdp_46 + 80.5% punjabi_harappa_7" "3.7079"
[12,] "19% egypt_henn2012_19 + 81% punjabi_harappa_7" "3.8303"
[13,] "18.3% palestinian_hgdp_46 + 81.7% sindhi_hgdp_24" "3.8762"
[14,] "80.4% punjabi-arain_xing_25 + 19.6% syrian_behar_16" "3.8908"
[15,] "19% lebanese_behar_7 + 81% punjabi-arain_xing_25" "4.0494"
[16,] "18.9% jordanian_behar_20 + 81.1% sindhi_hgdp_24" "4.078"
[17,] "79.9% punjabi_harappa_7 + 20.1% yemenese_behar_8" "4.1222"
[18,] "15.1% bedouin_hgdp_46 + 84.9% punjabi-arain_xing_25" "4.1522"
[19,] "85.3% punjabi-arain_xing_25 + 14.7% saudi_behar_20" "4.2014"
[20,] "79.1% punjabi_harappa_7 + 20.9% syrian_behar_16" "4.2191"

These results are closer to my actual reported ancestry than the ones from the Reference 3 Oracle.

Harappa Oracle

Based on the Dodecad Oracle, here is Harappa Oracle using reference 3 admixture results.

I am using Dienekes' code with a couple of changes. One of them is using weighted distance based on Fst divergences between ancestral components. Because of that it is several times slower than DodecadOracle. I plan to offer an option soon to switch between Euclidean distance and Fst-weighted distance.

You need to install R to use it. Then unzip the Oracle zip file. Double-click on the file or use the following in R:

load('HarappaOracleR3fst.RData')

In R, you can look at the 385 populations included by typing:

X[,1]

To use it to find your closest populations, you need your Harappa Reference 3 admixture results. Use them separated by commas like this (for me):

HarappaOracle(c(44,12,0,24,14,1,2,0,0,1,2))

You will get a result, with the first column showing the closest populations and the 2nd column their distance to you.

[,1] [,2]
[1,] "balochi" "8.0242"
[2,] "bene-israel" "9.2843"
[3,] "brahui" "9.5158"
[4,] "pathan" "9.7034"
[5,] "makrani" "10.1014"
[6,] "sindhi" "10.9236"
[7,] "Bhatia" "11.8441"
[8,] "Sindhi" "12.1704"
[9,] "Kashmiri" "13.4229"
[10,] "punjabi-arain" "13.9192"
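The ranking step can be sketched in a few lines of R. This sketch uses plain Euclidean distance, not the Fst-weighted distance the Oracle applies, and the reference matrix ref holds invented numbers for three hypothetical populations (pop_a, pop_b, pop_c), so it will not reproduce the distances above. The function name closest_populations is also an assumption.

```r
# Toy reference table: rows are populations, columns are ancestral
# components in percent. All values are invented for illustration.
ref <- rbind(
  pop_a = c(40, 30, 30),
  pop_b = c(10, 60, 30),
  pop_c = c(45, 25, 30)
)

# Rank populations by Euclidean distance to an individual's results.
# NOTE: the real Oracle uses an Fst-weighted distance instead.
closest_populations <- function(x, ref, k = 10) {
  d <- sqrt(rowSums(sweep(ref, 2, x)^2))  # subtract x from every row
  head(sort(d), k)                        # k smallest distances, named
}

closest_populations(c(44, 26, 30), ref, k = 2)
```

The result is a named vector of distances, smallest first, which mirrors the two-column layout of the Oracle output.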

You can also find out the closest populations to one of the reference populations:

HarappaOracle("punjabi-arain")

By default, the Oracle shows the 10 closest populations. You can change that:

HarappaOracle("punjabi-arain",k=20)

Also, by default, the Oracle excludes the Pan-Asian dataset since the overlap is only 5,400 SNPs. You can include Pan-Asian populations:

HarappaOracle("punjabi-arain",panasian=T)

There is also a mixed mode where the individual (or mean reference population) is compared against all pairs of populations as ancestors.

HarappaOracle("Haryana Jatt",mixedmode=T)

which has the following output:

[1,] "Haryana Jatt" "0"
[2,] "15.4% lithuanians + 84.6% Punjabi Brahmin" "1.9553"
[3,] "10.6% russian + 89.4% Rajasthani Brahmin" "2.0626"
[4,] "14.7% finnish + 85.3% Punjabi Brahmin" "2.0863"
[5,] "9.2% finnish + 90.8% Rajasthani Brahmin" "2.1142"
[6,] "89.4% Rajasthani Brahmin + 10.6% mordovians" "2.1727"
[7,] "9.6% lithuanians + 90.4% Rajasthani Brahmin" "2.1989"
[8,] "10.1% belorussian + 89.9% Rajasthani Brahmin" "2.2938"
[9,] "16.8% russian + 83.2% Punjabi Brahmin" "2.3015"
[10,] "16.2% belorussian + 83.8% Punjabi Brahmin" "2.3656"
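For readers curious how a two-population fit of this kind can be computed, the R sketch below finds, for one pair of populations, the mixing proportion that brings the blend closest to the target. It is an assumption about the general approach, it uses Euclidean distance and not the Oracle's Fst-weighted distance, and the function name best_mix and the toy numbers are hypothetical.

```r
# For populations a and b, find the proportion p in [0, 1] that
# minimises the Euclidean distance between x and p*a + (1-p)*b.
# This is a least-squares projection of x onto the segment from b to a.
# Assumes a and b are not identical (otherwise the division is 0/0).
best_mix <- function(x, a, b) {
  ab <- a - b
  p <- sum((x - b) * ab) / sum(ab^2)
  p <- min(max(p, 0), 1)               # clamp to a valid proportion
  fit <- p * a + (1 - p) * b
  list(p = p, distance = sqrt(sum((x - fit)^2)))
}

# Toy example with invented three-component results (percent).
a <- c(80, 10, 10)
b <- c(20, 70, 10)
x <- c(35, 55, 10)
best_mix(x, a, b)
```

A full mixed mode would repeat this for every pair of reference populations (for example by looping over the columns of combn(nrow(ref), 2) for some reference matrix ref) and then sort the pairs by distance.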

You can of course combine any or all of the options.

Think of Harappa Oracle as a tool to help you interpret your admixture results by comparing who you are closest to. Do not think of it as giving you your real ancestry.

Xing Ref3 K=11 Admixture

The Xing et al. dataset is interesting because it includes a number of South Asian populations:

  • 25 Andhra Pradesh Brahmin
  • 10 Andhra Pradesh Madiga
  • 11 Andhra Pradesh Mala
  • 22 Irula
  • 25 Nepalese
  • 25 Punjabi Arain
  • 14 Tamil Nadu Brahmin
  • 12 Tamil Nadu Dalit

Unfortunately, the dataset does not have a lot of common SNPs with 23andme, FTDNA and the other data I am using.

However, I did run a reference 3 admixture on the Xing data using about 30,000 SNPs. Since this is far fewer than the usual 118,000 SNPs, the noise levels are much higher.

Here is the spreadsheet with the Xing group averages for reference 3 admixture at K=11 ancestral components.

Ref 2 South Asians + Harappa PCA

I ran PCA on the South Asian populations included in Reference II dataset as well as 38 South Asian participants of Harappa Project. This is sort of a complementary analysis to the Ref1 South Asian one, as this one includes Kalash, Hazara and the additional South Asian groups in Xing et al.

The reference populations included are: Andhra Brahmin, Andhra Madiga, Andhra Mala, Balochi, Bnei Menashe Jews, Brahui, Burusho, Cochin Jews, Gujaratis, Gujaratis-B, Hazara, Irula, Kalash, Makrani, Malayan, Nepalese, North Kannadi, Paniya, Pathan, Punjabi Arain, Sakilli, Sindhi, Singapore Indians, Tamil Nadu Brahmin, and Tamil Nadu Dalit.

Here's the spreadsheet showing the eigenvalues and the first 15 principal components for each sample.

I computed the PCA using Eigensoft, which removed 13 samples as outliers. The Tracy-Widom statistics show that about 25 eigenvectors are significant.

Here are the first 15 eigenvalues:

1 6.374483
2 3.650626
3 3.270121
4 2.999767
5 1.937818
6 1.713315
7 1.538295
8 1.503051
9 1.458331
10 1.448079
11 1.433288
12 1.414678
13 1.408943
14 1.390791
15 1.38101

Here is a 3-D PCA plot (hat tip: Doug McDonald) showing the first three eigenvectors. The plot rotates about the 1st eigenvector, which is vertical. I have stretched the principal components based on the corresponding eigenvalues. You can highlight individual project participants in the plot by using the dropdown list below the plot.

Now here are plots of the first 14 eigenvectors. In this case, I have not stretched the principal components, so keep in mind that the first eigenvector explains about 1.75 times as much variation as the 2nd eigenvector.
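The 1.75 figure follows directly from the eigenvalues listed earlier, as this short R check shows (only the first four eigenvalues are used here):

```r
# First four eigenvalues from the list above.
ev <- c(6.374483, 3.650626, 3.270121, 2.999767)

# Variation along each eigenvector relative to the 2nd one.
round(ev / ev[2], 2)
```

The first entry of the result is the ratio of the 1st to the 2nd eigenvalue, which rounds to 1.75.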

Reference II Admixture Analysis K=6-9

Continuing with admixture analysis of Reference II dataset, here's the spreadsheet.

Other than the differences with Reference I analysis, do take a look at the additional ethnic groups included in this dataset, especially the 8 South Asian groups: Tamil Nadu Dalit, Irula, Andhra Pradesh Madiga, Andhra Pradesh Mala, Tamil Nadu Brahmin, Andhra Pradesh Brahmin, Punjabi Arain, Nepali.

Let's start with K=6.

Reference II Admixture K=6

Note the difference between Tamil Nadu Dalits and Brahmins. The Dalits lack the European ancestral component of the Brahmins.

For K=7, the East Asian component splits into Northeast Asian and Southeast Asian.

Reference II Admixture K=7

Punjabi Arain are about the same as Sindhis (excluding those with some African ancestry) in terms of their ancestral components.

Comparing the Andhra Brahmins to the Mala and Madiga, we see the same pattern as in Tamil Nadu: Brahmins have more European and Southwest/West Asian while Mala and Madiga have more Southeast Asian and South Asian.

At K=8, the African component splits into West African and East African.

Reference II Admixture K=8

The Nepalese samples are interesting. They have about 49% South Asian, 19% Northeast Asian, 16% European and 10% Southeast Asian. So they look like a mix of South Asian and East Asian.

Similar to the previous post, here's a comparison of K=8 admixture analysis between Reference I and Reference II datasets.

Here's the average absolute difference between the two datasets for each ancestral component:

Ancestral Component Mean(Abs(Ref1-Ref2))
South Asian (C1) 2.17%
Southwest Asian (C2) 1.32%
European (C3) 1.70%
Southeast Asian (C4) 2.16%
Papuan (C5) 0.33%
Northeast Asian (C6) 1.93%
West African (C7) 0.27%
East African (C8) 0.48%
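A per-component figure like this can be computed as follows in R. The two small matrices are invented stand-ins for the Reference I and Reference II group averages (rows are populations, columns are components), and the names ref1 and ref2 are assumptions. Only the calculation is the point.

```r
# Invented group averages (percent) for three populations and two
# components under each reference dataset. matrix() fills by column.
ref1 <- matrix(c(50, 30, 10,
                 20, 40, 60), ncol = 2,
               dimnames = list(c("pop_a", "pop_b", "pop_c"), c("C1", "C2")))
ref2 <- matrix(c(52, 29, 13,
                 19, 40, 58), ncol = 2,
               dimnames = dimnames(ref1))

# Mean absolute difference per ancestral component across populations.
colMeans(abs(ref1 - ref2))
```

Looking at rowMeans(abs(ref1 - ref2)) instead would show which populations differ most between the two datasets.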

The larger differences are for Balochi, Cambodian, Dai, Han, Kalash, Lahu, Miao, Naxi, She, Singapore Chinese, Tu, Tujia, US Chinese, and Yi. Thus, it's mostly East Asian groups.

For K=9, we see some divergence between the ancestral components inferred from Reference II as compared to Reference I. Instead of the Kalash component in Reference I analysis, we get the Polynesian component here. This is likely due to the inclusion of Tongan and Samoan samples.

Reference II Admixture K=9

Here's a summary of the ancestral components inferred from Reference II dataset:

K=2: Eurasian, African
K=3: European, E Asian, African
K=4: S Asian, European, E Asian, African
K=5: S Asian, European, E Asian, SW Asian, African
K=6: S Asian, European, E Asian, SW Asian, Papuan, African
K=7: S Asian, European, SE Asian, SW Asian, Papuan, NE Asian, African
K=8: S Asian, SW Asian, European, SE Asian, Papuan, NE Asian, W African, E African
K=9: S Asian, European, SW Asian, SE Asian, Papuan, NE Asian, Polynesian, W African, E African

I might do some admixture runs for Reference II with Harappa participants later.

Xing et al Data

The data for Xing et al.'s paper "Toward a more uniform sampling of human genetic diversity: a survey of worldwide populations by high-density genotyping" is available online.

This dataset consists of 850 individuals, but 259 of them overlap with the HapMap. Another 15 samples had to be removed because they were too similar to others. I also removed Native American samples. This leaves us with 529 samples.

Ethnic group Count
Slovenian 25
Punjabi Arain 25
N. European 25
Nepalese 25
Kyrgyzstani 25
Iban 25
Buryat 25
Bambaran 25
Andhra Pradesh Brahmin 25
Kurd 24
Dogon 24
Irula 23
Thai 22
Pygmy 22
Urkarah 18
Tamil Nadu Brahmin 14
Hema 14
Tongan 13
Tamil Nadu Dalit 13
Samoan 13
!Kung 13
Japanese 13
Andhra Pradesh Mala 11
Pedi 10
Andhra Pradesh Madiga 10
Alur 10
Nguni 9
Sotho/Tswana 8
Vietnamese 7
Stalskoe 5
Chinese 5
Khmer Cambodian 3
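As a quick consistency check, the counts in the table add up to the stated total of 529. In R:

```r
# Per-group sample counts, in the order listed in the table above.
counts <- c(25, 25, 25, 25, 25, 25, 25, 25, 25, 24, 24, 23, 22, 22,
            18, 14, 14, 13, 13, 13, 13, 13, 11, 10, 10, 10, 9, 8,
            7, 5, 5, 3)
sum(counts)          # total samples retained across the 32 groups

# Samples left after removing HapMap overlaps and near-duplicates,
# before the Native American samples are dropped.
850 - 259 - 15
```

The second expression gives 576, so if there were no other exclusions, the difference of 47 would be the number of Native American samples removed.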

This dataset is valuable because it contains several South Asian, Central Asian, Southeast Asian and Caucasian groups. However, it does not have a good SNP overlap with 23andme and the other datasets. It has only about 29,000 SNPs in common with 23andme v2 data. Combining HapMap, HGDP, SGVP, Behar et al and Xing et al with 23andme data leaves us with 25,000 SNPs. Due to that, I'll be using Xing et al data for only a few analyses.