Category Archives: Admixture

HarappaWorld Oracle

Here's the HarappaWorld Oracle to go with the HarappaWorld admixture results and DIYHarappaWorld.

It works similarly to the old Ref3 Harappa Oracle, with a couple of differences. One, there is no panasian switch since the Pan-Asian dataset is not included in this calculator.

I have added an optional mincount argument. It picks only those groups where the number of individuals is equal to or more than mincount for the Oracle calculation. By default mincount is 2, so only those groups which have 2 or more samples are used to compute your Oracle results.
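As a rough illustration of what the mincount filter does, here is a minimal sketch in R. The table of groups and the helper filter_by_mincount() are made up for this example and are not the Oracle's actual internals.

```r
# Hypothetical reference table: group name and number of sampled individuals.
groups <- data.frame(
  name = c("group_a", "group_b", "group_c", "group_d"),
  n    = c(1, 2, 4, 25),
  stringsAsFactors = FALSE
)

# Keep only groups with at least `mincount` individuals.
# The default of 2 mirrors the default described in the text.
filter_by_mincount <- function(groups, mincount = 2) {
  groups[groups$n >= mincount, , drop = FALSE]
}

kept_default <- filter_by_mincount(groups)
kept_four    <- filter_by_mincount(groups, mincount = 4)
print(kept_default$name)
print(kept_four$name)
```

With mincount=4, only the groups with 4 or more samples remain as candidates for the distance calculation.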

Let's look at my top 20 Oracle results in mixed mode, excluding population groups with fewer than 4 individuals.

HarappaOracle(c(26.46,36.82,14.22,4.78,0.00,1.32,0.86,0.04,0.19,0.06,3.63,8.07,0.00,2.44,0.43,0.67),k=20,mincount=4,mixedmode=T)

[,1] [,2]
[1,] "18.1% egyptian_behar_12 + 81.9% punjabi-arain_xing_25" "2.3361"
[2,] "18.1% egypt_henn2012_19 + 81.9% punjabi-arain_xing_25" "2.5615"
[3,] "80.7% punjabi-arain_xing_25 + 19.3% yemenese_behar_8" "2.8388"
[4,] "18.4% palestinian_hgdp_46 + 81.6% punjabi-arain_xing_25" "2.9944"
[5,] "84.7% punjabi-arain_xing_25 + 15.3% yemen-jew_behar_15" "3.0923"
[6,] "19.1% jordanian_behar_20 + 80.9% punjabi-arain_xing_25" "3.1877"
[7,] "18% egypt_henn2012_19 + 82% sindhi_hgdp_24" "3.4814"
[8,] "17.9% egyptian_behar_12 + 82.1% sindhi_hgdp_24" "3.5554"
[9,] "20.3% jordanian_behar_20 + 79.7% punjabi_harappa_7" "3.6161"
[10,] "18.9% egyptian_behar_12 + 81.1% punjabi_harappa_7" "3.6587"
[11,] "19.5% palestinian_hgdp_46 + 80.5% punjabi_harappa_7" "3.7079"
[12,] "19% egypt_henn2012_19 + 81% punjabi_harappa_7" "3.8303"
[13,] "18.3% palestinian_hgdp_46 + 81.7% sindhi_hgdp_24" "3.8762"
[14,] "80.4% punjabi-arain_xing_25 + 19.6% syrian_behar_16" "3.8908"
[15,] "19% lebanese_behar_7 + 81% punjabi-arain_xing_25" "4.0494"
[16,] "18.9% jordanian_behar_20 + 81.1% sindhi_hgdp_24" "4.078"
[17,] "79.9% punjabi_harappa_7 + 20.1% yemenese_behar_8" "4.1222"
[18,] "15.1% bedouin_hgdp_46 + 84.9% punjabi-arain_xing_25" "4.1522"
[19,] "85.3% punjabi-arain_xing_25 + 14.7% saudi_behar_20" "4.2014"
[20,] "79.1% punjabi_harappa_7 + 20.9% syrian_behar_16" "4.2191"

These results are closer to my actual reported ancestry than the ones from the Reference 3 Oracle.
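Before passing your own percentages to the Oracle, it can help to check that you have all 16 HarappaWorld components and that they add up to roughly 100. The helper check_admixture() below is a hypothetical convenience function, not part of the Oracle, and the tolerance of 0.5 is an arbitrary allowance for rounding.

```r
# My HarappaWorld percentages, copied from the HarappaOracle call above.
me <- c(26.46, 36.82, 14.22, 4.78, 0.00, 1.32, 0.86, 0.04,
        0.19, 0.06, 3.63, 8.07, 0.00, 2.44, 0.43, 0.67)

# Hypothetical sanity check: right length, no negatives, sum close to 100.
check_admixture <- function(x, k = 16, tol = 0.5) {
  is.numeric(x) && length(x) == k && all(x >= 0) && abs(sum(x) - 100) <= tol
}

print(length(me))
print(sum(me))
print(check_admixture(me))
```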


DIY HarappaWorld

Based on Dienekes' instructions, I have created DIYHarappaWorld for anyone to compute their admixture results for my HarappaWorld calculator.

Here's what you need to do:

  1. Download DIYHarappaWorld files and unzip them.
  2. Download DIYDodecad v2.1 (File->Download).
  3. Unpack DIYDodecad2.1.rar by using 7-zip, WinRAR, or Linux rar/unrar command.
  4. Start R and change the working directory to where you have the DIY files.
  5. Enter the following command in R:
    source('standardize.r')
  6. If you have your 23andMe raw data, run the following command in R:
    standardize('genome_john_doe.txt', company='23andMe')

    where genome_john_doe.txt is the filename for your raw data file.

  7. If you have your FTDNA Family Finder data in a file named johndoe.csv, run the following in R:
    standardize('johndoe.csv', company='ftdna')
  8. From your operating system command prompt, run the appropriate command:
    DIYDodecadWin harappaworld.par
    ./DIYDodecadLinux32 harappaworld.par
    ./DIYDodecadLinux64 harappaworld.par
  9. The program will start computing the admixture percentages. It took about 5-10 minutes on my computer.
  10. The best way to understand your results is to compare them with other populations and individuals. Do not take the component names seriously. They do not represent true ancestral populations.
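If you prefer to stay inside R for step 8, the command line can be assembled there. This is only a sketch: diy_command() is a hypothetical helper, and the only things taken from the instructions above are the three executable names and the harappaworld.par filename.

```r
# Hypothetical helper that picks the executable named in step 8.
diy_command <- function(os = c("windows", "linux"), bits = 64,
                        par = "harappaworld.par") {
  os <- match.arg(os)
  exe <- if (os == "windows") {
    "DIYDodecadWin"
  } else if (bits == 32) {
    "./DIYDodecadLinux32"
  } else {
    "./DIYDodecadLinux64"
  }
  paste(exe, par)
}

cmd <- diy_command("linux", 64)
print(cmd)
# To actually launch the program from R, one could call system(cmd)
# from the directory that holds the DIY files.
```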

You can also change the last line of the harappaworld.par file to one of genomewide, bychr, byseg or target to calculate the admixture percentages for the whole genome, by chromosome, by segment or for a target region respectively. Do note that the last three will be noisier.
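Changing the mode can also be scripted. The sketch below replaces the last line of a par file with one of the four modes. set_par_mode() is a hypothetical helper, and the demo file holds placeholder lines, not the real harappaworld.par layout.

```r
# Hypothetical helper: overwrite the last line of a par file with a mode.
set_par_mode <- function(par_file,
                         mode = c("genomewide", "bychr", "byseg", "target")) {
  mode <- match.arg(mode)              # rejects anything but the four modes
  lines <- readLines(par_file)
  stopifnot(length(lines) >= 1)
  lines[length(lines)] <- mode         # assumes the mode is the final line
  writeLines(lines, par_file)
  invisible(lines)
}

# Demonstration on a throwaway file with placeholder content.
demo <- tempfile(fileext = ".par")
writeLines(c("placeholder line 1", "placeholder line 2", "genomewide"), demo)
set_par_mode(demo, "bychr")
print(readLines(demo))
```

Keep a backup of the original file before trying this on the real harappaworld.par, and check that the file does not end with an extra blank line, since this sketch would then overwrite the blank line instead of the mode.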

UPDATE: I should also point out that this DIY calculator will work better for those individuals whose genetic variation was included in computing the admixture model. Those belonging to a group not included at all in the set of samples I used might get somewhat odd results.


HarappaWorld Admixture

Here is a new admixture calculator. This uses populations from all over the world and I got the best results (i.e., lowest cross-validation error) at K=16.

You can see the admixture results for different ethnic groups as well as results for individual (founder-only) project participants.

UPDATE: The population results have been calculated using weighted means.
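For readers unfamiliar with the difference, here is a tiny illustration of a plain mean versus a weighted mean for one component. The percentages and the weights are invented for this example. The weights actually used for the population results are not described here, so this is not a reconstruction of that calculation.

```r
# Made-up percentages of one component for three individuals in a group.
component <- c(40, 50, 60)

# Made-up weights, purely illustrative.
w <- c(1, 1, 2)

plain    <- mean(component)               # (40 + 50 + 60) / 3
weighted <- weighted.mean(component, w)   # (40*1 + 50*1 + 60*2) / 4
print(plain)
print(weighted)
```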

The group results are also shown in the usual interactive bar chart below. You can click on the component labels to sort by that ancestral component.

Do note that the admixture components do not necessarily represent real ancestral populations. Also, the names I have chosen for the components should be thought of as mnemonics to ease discussion. I chose them based on which populations in my data these components peaked in. They do not tell anything directly about ancestral populations. The best way to look at these admixture results is by comparing individuals and populations.

I used about 188,173 SNPs for this run. However, the results reported above for the Henn2011 (181,223 SNPs for Hadza, Sandawe and San, 26,494 SNPs for other groups), Henn2012 (26,494 SNPs), Reich (48,967 SNPs) and Xing (18,986 SNPs) datasets were calculated using a smaller number of common SNPs. Hence caution should be exercised in interpreting those results.

You can also see the Fst distances between the ancestral components.

I should have HarappaWorldOracle and DIYHarappaWorld calculators out in the next few days.

Also, I am working on another calculator which will focus more closely on South Asia.


ADMIXTURE Seed and Cross-Validation

I have been running some ADMIXTURE experiments recently. This is with my world dataset and about 180,000 SNPs.

I ran ADMIXTURE at K=15 ancestral components with different random seeds. Let's take a look at the final log likelihood and cross-validation errors I got for 11 runs.

As you can see, as the log likelihood increases, the cross-validation error decreases, though there is a fair bit of variation there.

For different runs, I got fairly different ancestral components. Remember that the only difference between the different runs was the random seed used to initialize the algorithm. Some ancestral components stayed very similar across the runs but others appeared and disappeared or switched subtly between different populations in a broad region.
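A set of runs like this can be scripted by generating one command line per seed. The sketch below only builds the command strings. The file name world.bed, the seed values and the log file naming are placeholders, and --cv and --seed= are ADMIXTURE options as I understand them from its documentation.

```r
# Build one ADMIXTURE command per seed; nothing is executed here.
# "world.bed" and the log file names are placeholders for illustration.
admixture_commands <- function(bed, k, seeds) {
  k <- as.integer(k)
  seeds <- as.integer(seeds)
  sprintf("admixture --cv --seed=%d %s %d > log_K%d_seed%d.out",
          seeds, bed, k, k, seeds)
}

cmds <- admixture_commands("world.bed", 15, c(1, 2, 3))
print(cmds)
```

Each run is best done in its own directory, since the output file names do not include the seed as far as I know.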

The cross-validation (CV) error is important in my opinion since it gives you an idea of which run has results that generalize better. Basically, it is calculated by masking a portion of the genotypes in the dataset, fitting the model on the rest, and checking how well the masked genotypes are predicted.

At K=15, the minimum CV error I got was 0.52200 and the median was 0.52206. The maximum CV error was 0.52241, which is 0.00041 above the minimum and pretty large for this data. Let's superimpose this maximum CV value on a graph showing how CV error varies for different values of K (number of ancestral components).

The set of runs in this graph (other than the red line for the maximum CV error at K=15) used the default random seed for ADMIXTURE.

What this shows is that running ADMIXTURE only once using the default random seed (or any other seed) is fraught with problems. A better approach is to run it multiple times with different seeds so you can be sure that you have arrived at a computationally optimum solution.
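Once several runs are done, the CV errors can be pulled out of the logs and compared. Here is a minimal sketch that assumes each log contains a line of the form "CV error (K=15): 0.52200", which is how ADMIXTURE reports it with --cv as far as I know. The three lines below reuse the minimum, median and maximum quoted above as stand-ins, and the run names are made up. They are not the actual 11 runs.

```r
# Stand-in log lines; run names are hypothetical.
log_lines <- c(
  seed_a = "CV error (K=15): 0.52200",
  seed_b = "CV error (K=15): 0.52206",
  seed_c = "CV error (K=15): 0.52241"
)

# Strip the "CV error (K=..):" prefix and convert the rest to a number.
parse_cv <- function(line) {
  as.numeric(sub("^CV error \\(K=[0-9]+\\): *", "", line))
}

cv <- vapply(log_lines, parse_cv, numeric(1))
best_run <- names(cv)[which.min(cv)]   # run with the lowest CV error
print(cv)
print(best_run)
```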


Pan-Asian Admixture Results

I ran Reference 3 based supervised ADMIXTURE on the HUGO Pan-Asian dataset. While it used only 5,400 SNPs, it did get me curious about any relationship between Onge and Jehai and Kensiu. Unfortunately, Pan-Asian data doesn't have a good overlap even with Reich et al. So as a first exercise, I decided to run unsupervised ADMIXTURE on the Pan-Asian dataset by itself.

Here are the bar charts for the admixture results. K=12 ancestral components had the lowest cross-validation error.

You can see these results in a spreadsheet too.


South Asian fineStructure Ref3 Admixture

I was wondering what the admixture patterns of the clusters fineSTRUCTURE computed were for my South Asian run. So I computed the average admixture for each cluster (total: 89) using reference 3 admixture results.
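Averaging admixture results per cluster is a one-liner in R. The sketch below uses a made-up table with two clusters and two components instead of the real 89 clusters and 11 components.

```r
# Made-up individual results: cluster assignment plus two components.
ind <- data.frame(
  cluster = c("c1", "c1", "c2", "c2", "c2"),
  comp1   = c(60, 70, 10, 20, 30),
  comp2   = c(40, 30, 90, 80, 70)
)

# Mean of each component within each cluster.
cluster_means <- aggregate(cbind(comp1, comp2) ~ cluster, data = ind, FUN = mean)
print(cluster_means)
```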

The default order of the clusters is to keep the closer clusters together.


Harappa Oracle

Based on the Dodecad Oracle, here is Harappa Oracle using reference 3 admixture results.

I am using Dienekes' code with a couple of changes. One of them is using weighted distance based on Fst divergences between ancestral components. Because of that it is several times slower than DodecadOracle. I plan to offer an option soon to switch between Euclidean distance and Fst-weighted distance.

You need to install R to use it. Then unzip the Oracle zip file. Double-click on the file or use the following in R:

load('HarappaOracleR3fst.RData')

In R, you can look at the 385 populations included by typing:

X[,1]

To use it to find your closest populations, you need your Harappa Reference 3 admixture results. Use them separated by commas like this (for me):

HarappaOracle(c(44,12,0,24,14,1,2,0,0,1,2))

You will get a result with the first column showing the closest populations and the second column showing their distance to you.

[,1] [,2]
[1,] "balochi" "8.0242"
[2,] "bene-israel" "9.2843"
[3,] "brahui" "9.5158"
[4,] "pathan" "9.7034"
[5,] "makrani" "10.1014"
[6,] "sindhi" "10.9236"
[7,] "Bhatia" "11.8441"
[8,] "Sindhi" "12.1704"
[9,] "Kashmiri" "13.4229"
[10,] "punjabi-arain" "13.9192"
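To give an idea of what the single-population mode is doing, here is a minimal nearest-population sketch. It uses plain Euclidean distance rather than the Fst-weighted distance, a made-up reference table with three components, and a hypothetical function name, so it will not reproduce the numbers above.

```r
# Made-up reference table: rows are populations, columns are components.
ref <- rbind(
  pop_a = c(50, 30, 20),
  pop_b = c(10, 60, 30),
  pop_c = c(45, 35, 20)
)

# Euclidean distance from the target to every population, closest first.
nearest_populations <- function(target, ref, k = 10) {
  d <- sqrt(rowSums(sweep(ref, 2, target)^2))
  head(sort(d), k)
}

res <- nearest_populations(c(48, 32, 20), ref, k = 2)
print(res)
```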

You can also find out the closest populations to one of the reference populations:

HarappaOracle("punjabi-arain")

By default, the Oracle shows the 10 closest populations. You can change that:

HarappaOracle("punjabi-arain",k=20)

Also, by default, the Oracle excludes the Pan-Asian dataset since the overlap is only 5,400 SNPs. You can include Pan-Asian populations:

HarappaOracle("punjabi-arain",panasian=T)

There is also a mixed mode where the individual (or mean reference population) is compared against all pairs of populations as ancestors.

HarappaOracle("Haryana Jatt",mixedmode=T)

which has the following output:

[1,] "Haryana Jatt" "0"
[2,] "15.4% lithuanians + 84.6% Punjabi Brahmin" "1.9553"
[3,] "10.6% russian + 89.4% Rajasthani Brahmin" "2.0626"
[4,] "14.7% finnish + 85.3% Punjabi Brahmin" "2.0863"
[5,] "9.2% finnish + 90.8% Rajasthani Brahmin" "2.1142"
[6,] "89.4% Rajasthani Brahmin + 10.6% mordovians" "2.1727"
[7,] "9.6% lithuanians + 90.4% Rajasthani Brahmin" "2.1989"
[8,] "10.1% belorussian + 89.9% Rajasthani Brahmin" "2.2938"
[9,] "16.8% russian + 83.2% Punjabi Brahmin" "2.3015"
[10,] "16.2% belorussian + 83.8% Punjabi Brahmin" "2.3656"
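The idea behind mixed mode can be sketched as follows: for every pair of populations, find the mixing proportion that brings the mixture closest to the target, then rank the pairs by that distance. This sketch uses plain Euclidean distance, made-up populations and hypothetical function names. It is not the Oracle's code and ignores the Fst weighting.

```r
# Best mix w*a + (1-w)*b for one pair, with w clamped to [0, 1].
# Assumes a and b differ, otherwise the division below is undefined.
best_mix <- function(target, a, b) {
  d <- a - b
  w <- sum((target - b) * d) / sum(d^2)   # least-squares projection
  w <- min(max(w, 0), 1)
  fit <- w * a + (1 - w) * b
  list(w = w, distance = sqrt(sum((target - fit)^2)))
}

# Try all pairs of reference populations and keep the k closest mixtures.
mixed_mode <- function(target, ref, k = 10) {
  pairs <- combn(rownames(ref), 2)
  out <- apply(pairs, 2, function(p) {
    m <- best_mix(target, ref[p[1], ], ref[p[2], ])
    c(w = m$w, distance = m$distance)
  })
  res <- data.frame(pop1 = pairs[1, ], pop2 = pairs[2, ],
                    w1 = out["w", ], distance = out["distance", ],
                    stringsAsFactors = FALSE)
  head(res[order(res$distance), ], k)
}

# Made-up reference populations and a target that is 25% pop_a + 75% pop_b.
ref <- rbind(
  pop_a = c(80, 20, 0),
  pop_b = c(0, 20, 80),
  pop_c = c(30, 60, 10)
)
res <- mixed_mode(c(20, 20, 60), ref)
print(res)
```

Here w1 is the share of pop1 and 1 - w1 the share of pop2, so the top row corresponds to "25% pop_a + 75% pop_b" with a distance of 0.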

You can of course combine any or all of the options.

Think of Harappa Oracle as a tool to help you interpret your admixture results by comparing who you are closest to. Do not think of it as giving you your real ancestry.


Ref3 Admixture Dendrograms

I have posted the reference 3 K=11 admixture results for all populations and datasets. Here are the relevant links:

So let's try a dendrogram of all these populations' average admixture results. Instead of using regular Euclidean distance, I used some weighting based on Fst distances between admixture components, very similar to what Palisto did.

Here's a dendrogram of all datasets using complete linkage.
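For anyone who wants to try something similar, complete-linkage clustering is available in base R through hclust(). The sketch below uses a made-up table of four populations and ordinary Euclidean distance from dist() as a stand-in, since the Fst-based weighting is not reproduced here.

```r
# Made-up average admixture results: rows are populations.
ref <- rbind(
  pop_a = c(50, 30, 20),
  pop_b = c(48, 32, 20),
  pop_c = c(10, 60, 30),
  pop_d = c(12, 58, 30)
)

# Complete-linkage hierarchical clustering on Euclidean distances.
hc <- hclust(dist(ref), method = "complete")

# plot(hc) would draw the dendrogram; cutree() splits it into 2 branches.
branches <- cutree(hc, k = 2)
print(branches)
```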

Since the Pan-Asian dataset had only 5,400 SNPs common with reference 3, we need to be careful interpreting the tree above. Just to make sure, here's the dendrogram excluding Pan-Asian populations.


Henn Ref3 K=11 Admixture

There have been two Henn et al papers since I started this project.

  1. Hunter-gatherer genomic diversity suggests a southern African origin for modern humans by Brenna M. Henn, Christopher R. Gignoux, Matthew Jobin, Julie M. Granka, J. M. Macpherson, Jeffrey M. Kidd, Laura Rodríguez-Botigué, Sohini Ramachandran, Lawrence Hon, Abra Brisbin, Alice A. Lin, Peter A. Underhill, David Comas, Kenneth K. Kidd, Paul J. Norman, Peter Parham, Carlos D. Bustamante, Joanna L. Mountain, and Marcus W. Feldman
  2. Genomic Ancestry of North Africans Supports Back-to-Africa Migrations by Brenna M. Henn, Laura R. Botigué, Simon Gravel, Wei Wang, Abra Brisbin, Jake K. Byrnes, Karima Fadhlaoui-Zid, Pierre A. Zalloua, Andres Moreno-Estrada, Jaume Bertranpetit, Carlos D. Bustamante, David Comas

The data for both is available online:

I ran reference 3 K=11 admixture on these datasets using about 48,000 SNPs.

Here is the spreadsheet with the Henn group averages for reference 3 admixture at K=11 ancestral components.

Note that the Sandawe, Hadza and San from Henn2011 were already included in Reference 3 and are not listed here.


Pan-Asian Ref3 K=11 Admixture

The HUGO Pan-Asian dataset covers South and East Asia with the following South Asian populations:

  • 23 Andhra Pradesh & Karnataka
  • 10 Bengali
  • 23 Bhil (Rajasthan)
  • 20 Haryana
  • 23 Kashmir Spiti
  • 12 Marathi
  • 12 Rajasthani
  • 30 Singapore Indian
  • 20 Uttaranchal
  • 13 Uttar Pradesh

Unfortunately, they do not specify ethnic or caste background for most Indian groups. Instead, their focus is on Mongoloid/Caucasoid/Australoid etc.

Also, the SNP overlap with other datasets is really small. Therefore, this reference 3 admixture run was done using only 5,400 SNPs. I recommend a big bucket of salt when interpreting these results.

Here is the spreadsheet with the Pan-Asian group averages for reference 3 admixture at K=11 ancestral components.
