Eurasian ChromoPainter Analysis

Some months ago, I decided to run a big ChromoPainter analysis of the Eurasian samples I have. I removed from my dataset not only all Sub-Saharan Africans, but also North Africans and anyone else with more than 2% African admixture (which unfortunately included me).

Since the number of samples was still too large, I picked 25 random individuals from each non-South-Asian ethnicity while keeping all South Asians. I also tried to remove all close relatives and any samples with a high rate of missing genotypes.

In the end, I had 254,576 SNPs for 2,001 samples belonging to 197 ethnic groups.

I ran ShapeIT to phase their genomes and then ChromoPainter and fineStructure. The whole process took about 2 months.

Then I got busy and the results sat on my computer for more than a month.

Now let's look at the ChromoPainter/fineStructure analysis. Due to time constraints, I am going to present the results in several posts.

Today, let's look at the fineStructure clustering run on the chunkcount output of ChromoPainter. It divided the individuals into 203 populations. Here's the spreadsheet containing the group and individual population clustering.

And here is the dendrogram showing the relationship of the clusters/populations computed by fineStructure.

UPDATE: Better dendrograms

23andme $50 Off

23andme has a $50 off coupon sale for three days. Here's the email I got from them:

Visiting family this summer? Are they part of 23andMe? Take advantage of our summer discount: $50 OFF each kit you purchase. This offer expires in 3 days (11:59PM PDT, Sunday August 12, 2012).

To use this code, visit our online store and add an order to your cart. Click "I have a discount code" and enter the code below.

$50 off Discount code: VMQ6KG

HarappaWorld HRP0250-HRP0252

I have added the HarappaWorld Admixture results for HRP0250-HRP0252 to the individual spreadsheet.

However, I have not recomputed the weighted averages for the Kashmiris or Bengali Brahmins. Also, I am not sure about Tamil Gounder. Wikipedia says they are Vellalars, but I don't know if I should report separate Gounder results or include them in the Tamil Vellalar average.

Do note that the admixture components do not necessarily represent real ancestral populations. Also, the names I have chosen for the components should be thought of as mnemonics to ease discussion. I chose them based on which populations in my data these components peaked in. They do not tell us anything directly about ancestral populations. The best way to look at these admixture results is by comparing individuals and populations. Finally, the standard error estimates on these results can be about 1%. Therefore, it is entirely possible that your 1% exotic admixture result is just noise.

FTDNA Summer Sale

FTDNA is having a sale on its DNA tests till July 15, 2012.

Their autosomal test, Family Finder, which can be submitted to the Harappa Ancestry Project, is on sale for $199 instead of the regular price of $289.

In addition, their mtDNA and Y-DNA products are also discounted till end of day July 15.

HarappaWorld HRP0245-HRP0249

I have added the HarappaWorld Admixture results for HRP0245-HRP0249 to the individual spreadsheet.

I have also recomputed the weighted averages for Kurds (now based on 10 samples, up from 6).

Do note that the admixture components do not necessarily represent real ancestral populations. Also, the names I have chosen for the components should be thought of as mnemonics to ease discussion. I chose them based on which populations in my data these components peaked in. They do not tell us anything directly about ancestral populations. The best way to look at these admixture results is by comparing individuals and populations. Finally, the standard error estimates on these results can be about 1%. Therefore, it is entirely possible that your 1% exotic admixture result is just noise.

Let's look at the Kurdish results from Yunusbayev (prefix: kurd), Xing (prefix: F) and Harappa (prefix: HRP). Do note that the Xing results were computed with a smaller number of SNPs and thus might be noisy.

Pagani East African Dataset

Pagani et al analyzed Ethiopian genetics in their paper "Ethiopian Genetic Diversity Reveals Linguistic Stratification and Complex Influences on the Ethiopian Gene Pool". Their dataset, consisting of Ethiopians and a few other East African populations, is available online.

I have analyzed the Pagani dataset with my HarappaWorld admixture calculator and included the results in my regular spreadsheet.

The group (weighted mean) results are also shown in the usual interactive bar chart below. You can click on the component labels to sort by that ancestral component.

Because the East African component as computed in HarappaWorld is maximum among the Maasai and several of the Pagani dataset populations have a higher percentage of that component, we should be a bit careful with interpreting the HarappaWorld results for the Pagani groups. I'll likely include them in my next iteration of the admixture calculator.

ANI-ASI Admixture Dating

Similar to an earlier conference poster, Reich Lab's Priya Moorjani et al have another poster at SMBE. Here's the abstract:

Estimating a date of mixture of ancestral South Asian populations
Linguistic and genetic studies have demonstrated that almost all groups in South Asia today descend from a mixture of two highly divergent populations: Ancestral North Indians (ANI) related to Central Asians, Middle Easterners and Europeans, and Ancestral South Indians (ASI) not related to any populations outside the Indian subcontinent. ANI and ASI have been estimated to have diverged from a common ancestor as much as 60,000 years ago, but the date of the ANI-ASI mixture is unknown. Here we analyze data from about 60 South Asian groups to estimate that major ANI-ASI mixture occurred 1,200-4,000 years ago. Some mixture may also be older—beyond the time we can query using admixture linkage disequilibrium—since it is universal throughout the subcontinent: present in every group speaking Indo-European or Dravidian languages, in all caste levels, and in primitive tribes. After the ANI-ASI mixture that occurred within the last four thousand years, a cultural shift led to widespread endogamy, decreasing the rate of additional mixture.

I bolded the portion which seems new compared to the previous abstract.

FTDNA FF to PED Conversion

Someone asked how to convert an FTDNA Family Finder CSV data file to the Plink format. I threw together a very simple Unix script to do that and I am sharing it here:

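#!/bin/bash
# Convert an FTDNA Family Finder CSV file to Plink binary files (.bed/.bim/.fam).
if test -z "$1"
then
        echo "FTDNA raw data filename not supplied as argument." >&2
        exit 1
fi
echo "Family ID: "
read fid
echo "Individual ID: "
read id
echo "Paternal ID: "
read pid
echo "Maternal ID: "
read mid
echo "Sex (m/f/u): "
read sexchr
# Plink sex codes: 1 = male, 2 = female, 0 = unknown
sex=0
if [[ $sexchr == m* ]]
then
        sex=1
elif [[ $sexchr == f* ]]
then
        sex=2
fi
# Phenotype: -9 is Plink's default missing-phenotype code (an assumption for this script)
pheno=-9
echo "$fid $id $pid $mid $sex $pheno" > "$id.tfam"
# Normalize line endings, drop the header line, then write one .tped row per SNP
dos2unix "$1"
sed '1d' "$1" > "$id.nocomment"
awk -F, '{gsub(/"/,""); print $2,$1,"0",$3,substr($4,1,1),substr($4,2,1)}' "$id.nocomment" > "$id.tped"
rm "$id.nocomment"
# Build the binary fileset; no-calls ("-") in the input are written out as 0
plink --tfile "$id" --out "$id" --make-bed --missing-genotype - --output-missing-genotype 0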

This script creates three files: *.bed, *.bim and *.fam, which are the binary format files for Plink. You can then use Plink to merge multiple files, filter SNPs or individuals and do other processing.
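To make the awk step of the script concrete, here is a small sketch that runs the same header removal and field reordering on two made-up rows. The header and the quoted four-column layout (RSID, CHROMOSOME, POSITION, RESULT) are assumptions about the Family Finder export, the rows are illustrative only, and `ff_to_tped` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: turn Family Finder CSV text on stdin into .tped rows on stdout.
# Assumed input layout: a header line, then "rsid","chromosome","position","genotype".
ff_to_tped() {
    sed '1d' | awk -F, '{
        gsub(/"/, "")                  # strip the quotes; awk re-splits the fields
        # .tped columns: chromosome, SNP id, genetic distance (0), position, allele 1, allele 2
        print $2, $1, "0", $3, substr($4, 1, 1), substr($4, 2, 1)
    }'
}

# Two illustrative rows; "--" stands for a no-call.
printf '%s\n' \
    'RSID,CHROMOSOME,POSITION,RESULT' \
    '"rs3094315","1","742429","AA"' \
    '"rs3131972","1","742584","--"' | ff_to_tped
# prints:
# 1 rs3094315 0 742429 A A
# 1 rs3131972 0 742584 - -
```

The no-call row shows why the Plink command needs `--missing-genotype -`: without it, the `-` alleles would not be recognized as missing data.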

HarappaWorld Ancestral South Indian

Using the same method as I used for reference 3 admixture, I decided to guesstimate the Ancestral South Indian proportions, as given by Reich et al, for my HarappaWorld admixture run.

Basically, I used 92 of the 96 samples Reich et al used to find population averages for the South Indian component. Then, I used linear regression between the South Indian component average and Reich et al's estimate of Ancestral South Indian (ASI) ancestry. Since Reich et al actually list Ancestral North Indian percentages in their paper but their model is a two-ancestry ANI+ASI one, I simply calculated the ASI percentages as 100% minus ANI.

The correlation between Reich et al ASI and my HarappaWorld South Indian component for the relevant populations turns out to be 0.99277086.

And the linear regression fit for the data is:

ASI = 2.5218942 + 0.8104836 * S_INDIAN

where both ASI (Reich et al) and S_INDIAN (HarappaWorld) are given in percentages.

Of the individuals in HarappaWorld, I kept only those who had a South Indian component of at least 20% for computing the ASI proportions.
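As a rough sketch of how the fit and the 20% cutoff combine for a single individual, the snippet below applies the regression coefficients quoted above to a South Indian percentage. The helper name `asi_estimate`, the one-decimal rounding and the `NA` output for individuals under the cutoff are assumptions made for illustration, not part of the spreadsheet.

```shell
#!/bin/sh
# Sketch only: estimate ASI (%) from a HarappaWorld South Indian percentage.
# Coefficients are the regression fit quoted in the post; values under 20% are skipped.
asi_estimate() {
    awk -v s="$1" 'BEGIN {
        if (s + 0 < 20) { print "NA"; exit }   # below the 20% South Indian threshold
        printf "%.1f\n", 2.5218942 + 0.8104836 * s
    }'
}

asi_estimate 50      # prints 43.0
asi_estimate 12.5    # prints NA
```

Note that an individual sitting exactly at the 20% cutoff comes out at about 18.7% ASI under this fit, so the estimate is not the same number as the South Indian component itself.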

The resulting ASI percentages can be seen in a spreadsheet.

Please note that in the Group sheet, the averages are based on the samples which met the 20% South Indian component threshold. Thus, the 20% ASI in the Romanians is the average of the two Romanians who met the threshold out of a total of 16 Romanian samples.

The individual results are available in the Individual sheet. These results are a little different from the estimates using reference 3. Thus, I would point out that these should be taken only as a rough estimate.

HarappaOracle Limitations

While HarappaOracle is a great tool, it has its limitations.

First of all, do not think of the mixed mode results as showing which populations you are descended from. Use HarappaOracle to get an idea of which populations are similar to you in their admixture results. This function is especially important since admixture results should be understood in relative terms, as I have been stressing.

For mixed-race people, the Oracle might sometimes provide a correct result, as it does for me. For others, the known ancestral mix might not show up.

There is also the fact that the Oracle calculator is sensitive to your admixture percentages and sometimes small changes can change the Oracle mixed mode results radically.

Let's look at three siblings as an example. (My thanks to them for letting me use their results for this post.) Here are their admixture results:

Component      Sibling 1  Sibling 2  Sibling 3
NE Euro        43.8%      43.1%      43.9%
Mediterranean  27.6%      26.8%      27.0%
Baloch         11.2%      11.1%      12.2%
Caucasian      9.4%       10.8%      8.6%
S Indian       5.3%       6.5%       7.0%
SW Asian       1.5%       1.0%       0.5%
American       0.7%       0.3%       0.7%
NE Asian       0.4%       0.1%       0.1%
Beringian      0.0%       0.3%       0.0%
San            0.0%       0.1%       0.0%

Their admixture results are broadly similar, as expected. Some of you might think that a difference of 1% or so is very significant, but do consider what we know of DNA inheritance and the error margins in ADMIXTURE.

Now let's see their HarappaWorld Oracle results.

Sibling 1            Sibling 2            Sibling 3
romany       3.16    romany       2.85    romany       4.45
hungarian    9.47    hungarian    9.65    utahn-white  10.74
utahn-white  9.88    french       11.05   n-european   10.79
n-european   9.9     slovenian    11.09   hungarian    10.97
french       9.95    n-european   11.38   utahn-white  11.35
utahn-white  10.62   utahn-white  11.54   french       11.54
slovenian    10.95   utahn-white  12.21   british      11.94
british      11.29   british      12.99   slovenian    12.21
orcadian     13.17   orcadian     14.86   orcadian     13.51
ukranian     17.73   romanian     17.14   ukranian     18.24

Again, not unexpected. The top 10 population matches are not too different for the siblings. There are some differences, but nothing extraordinary.
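To give a feel for what a distance between two sets of admixture results looks like, the sketch below compares Sibling 1 and Sibling 3 using their percentages from the table above. It assumes a plain Euclidean distance over the ten components; how HarappaOracle actually scores matches is not spelled out in this post, so treat the metric and the helper name `admix_distance` as assumptions.

```shell
#!/bin/sh
# Sketch only: Euclidean distance between two admixture vectors.
# Each argument is a space-separated list of percentages in the same component order.
admix_distance() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, " "); m = split(b, y, " ")
        if (n != m) { print "component count mismatch" > "/dev/stderr"; exit 1 }
        for (i = 1; i <= n; i++) { d = x[i] - y[i]; ss += d * d }
        printf "%.2f\n", sqrt(ss)
    }'
}

# Order: NE Euro, Mediterranean, Baloch, Caucasian, S Indian,
#        SW Asian, American, NE Asian, Beringian, San
sib1="43.8 27.6 11.2 9.4 5.3 1.5 0.7 0.4 0.0 0.0"
sib3="43.9 27.0 12.2 8.6 7.0 0.5 0.7 0.1 0.0 0.0"
admix_distance "$sib1" "$sib3"    # prints 2.45
```

Under this assumed metric, the two siblings are closer to each other than either is to the best-matching reference population, which fits the point that small differences between relatives are expected.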

Finally let's look at mixed mode Oracle, where we try to find the 10 closest matches (based on admixture results) assuming that these individuals are mixed from two populations.

Sibling 1
91.3% romany + 8.7% lithuanian               1.58
78.4% romany + 21.6% n-european              1.67
79.7% romany + 20.3% utahn-white             1.69
83.3% romany + 16.7% orcadian                1.76
94.2% romany + 5.8% finnish                  1.88
79.8% romany + 20.2% utahn-white             1.99
90.2% romany + 9.8% belorussian              2.00
82.1% romany + 17.9% british                 2.03
91.4% romany + 8.6% russian                  2.06
85.0% n-european + 15.0% bene-israel         2.16

Sibling 2
93.3% romany + 6.7% lithuanian               1.93
95.4% romany + 4.6% finnish                  1.99
91.9% romany + 8.1% belorussian              2.01
92.4% romany + 7.6% russian                  2.02
92.0% romany + 8.0% mordovian                2.06
90.4% romany + 9.6% ukranian                 2.14
88.7% romany + 11.3% slovenian               2.5
89.2% romany + 10.8% n-european              2.52
95.7% romany + 4.3% chuvash                  2.58
90.7% romany + 9.3% utahn-white              2.58

Sibling 3
82.8% utahn-white + 17.2% bene-israel        1.99
83.7% n-european + 16.3% bene-israel         2.37
83.8% utahn-white + 16.2% bene-israel        2.47
84.8% n-european + 15.2% cochin-jew          2.52
86.0% n-european + 14.0% kerala-christian    2.93
85.0% utahn-white + 15.0% cochin-jew         2.98
85.9% n-european + 14.1% ap-hyderabad        2.99
85.1% n-european + 14.9% up                  3.02
85.5% n-european + 14.5% tn-brahmin          3.13
85.5% n-european + 14.5% brahmin-tamil-nadu  3.15

Sibling 1 and Sibling 2 are again not too different from each other: mostly Romany with some European. However, Sibling 3 gets vastly different results. Why? No, Sibling 3 wasn't adopted! The reason is simple. Sibling 3 has a higher South Indian component than the average Romany in our dataset, so Sibling 3 cannot be represented as a mix of Romany and a European ethnicity without a large error. Instead, a mix of mostly northwest European with a little Indian, especially Indian Jewish, comes closest to Sibling 3's results. However, this does not make Sibling 3 Jewish or Indian Jewish; Indian Jews are themselves quite mixed with the local Indian populations.
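The sensitivity of mixed mode is easier to see with a toy version of the calculation. The sketch below finds the share of population A (the rest being population B) that brings the blend closest to a target, using a least-squares fit clamped to the 0-100% range and a Euclidean distance. Both the fitting method and the helper name `best_mix` are assumptions for illustration; the two-component vectors are toy numbers, not HarappaWorld data, and the actual Oracle may work differently.

```shell
#!/bin/sh
# Sketch only: best two-way mix of populations A and B for a target individual.
# Arguments: target, population A, population B (space-separated percentages).
best_mix() {
    awk -v t="$1" -v a="$2" -v b="$3" 'BEGIN {
        n = split(t, T, " "); split(a, A, " "); split(b, B, " ")
        # Least-squares weight w for A in the blend w*A + (1-w)*B
        for (i = 1; i <= n; i++) {
            d = A[i] - B[i]
            num += (T[i] - B[i]) * d
            den += d * d
        }
        w = (den > 0) ? num / den : 0
        if (w < 0) w = 0
        if (w > 1) w = 1
        # Distance between the target and the best blend
        for (i = 1; i <= n; i++) {
            e = T[i] - (w * A[i] + (1 - w) * B[i])
            ss += e * e
        }
        printf "%.1f%% A + %.1f%% B, distance %.2f\n", 100 * w, 100 * (1 - w), sqrt(ss)
    }'
}

# Toy case: the target lies exactly between the two populations.
best_mix "60 40" "100 0" "0 100"    # prints 60.0% A + 40.0% B, distance 0.00
# Toy case: the target has more of component 2 than either population,
# so no blend fits well and the weight is clamped to 0% A.
best_mix "10 90" "50 50" "20 80"    # prints 0.0% A + 100.0% B, distance 14.14
```

The second case mirrors the Sibling 3 situation: once a target has more of one component than a reference population can supply, blends involving that population carry a large error and other pairings rise to the top of the list.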
