Monthly Archives: June 2012

ANI-ASI Admixture Dating

Similar to an earlier conference poster, Reich Lab's Priya Moorjani et al have another poster at SMBE. Here's the abstract:

Estimating a date of mixture of ancestral South Asian populations
Linguistic and genetic studies have demonstrated that almost all groups in South Asia today descend from a mixture of two highly divergent populations: Ancestral North Indians (ANI) related to Central Asians, Middle Easterners and Europeans, and Ancestral South Indians (ASI) not related to any populations outside the Indian subcontinent. ANI and ASI have been estimated to have diverged from a common ancestor as much as 60,000 years ago, but the date of the ANI-ASI mixture is unknown. Here we analyze data from about 60 South Asian groups to estimate that major ANI-ASI mixture occurred 1,200-4,000 years ago. Some mixture may also be older--beyond the time we can query using admixture linkage disequilibrium--since it is universal throughout the subcontinent: present in every group speaking Indo-European or Dravidian languages, in all caste levels, and in primitive tribes. After the ANI-ASI mixture that occurred within the last four thousand years, a cultural shift led to widespread endogamy, decreasing the rate of additional mixture.

I bolded the portion which seems new compared to the previous abstract.

FTDNA FF to PED Conversion

Someone asked about how to convert a FTDNA Family Finder csv data file to the Plink format. I threw together a very simple Unix script to do that and I am sharing it here:

#!/bin/bash
if test -z "$1"
then
        echo "FTDNA raw data filename not supplied as argument."
        exit 0
fi
echo "Family ID: "
read fid
echo "Individual ID: "
read id
echo "Paternal ID: "
read pid
echo "Maternal ID: "
read mid
echo "Sex (m/f/u): "
read sexchr
if [[ $sexchr == m* ]]
then
        sex=1
elif [[ $sexchr == f* ]]
then
        sex=2
else
        sex=0
fi
pheno=0
 
echo "$fid $id $pid $mid $sex $pheno" > $id.tfam
 
dos2unix $1
sed '1d' $1 > $id.nocomment
awk -F, '{gsub(/"/,""); print $2,$1,"0",$3,substr($4,1,1),substr($4,2,1)}' $id.nocomment > $id.tped
rm $id.nocomment
 
plink --tfile $id --out $id --make-bed --missing-genotype - --output-missing-genotype 0

This script creates three files: *.bed, *.bim and *.fam, which are the binary format files for Plink. You can then use Plink to merge multiple files, filter SNPs or individuals and do other processing.

HarappaWorld Ancestral South Indian

Using the same method as I used for reference 3 admixture, I decided to guesstimate the Ancestral South Indian proportions, as given by Reich et al, for my HarappaWorld admixture run.

Basically, I used the 92 (out of the 96 samples Reich et al used) to find population averages for the South Indian component. Then, I used linear regression between the South Indian component average and Reich et al's estimate of Ancestral South Indian (ASI) ancestry. Since Reich et al actually list Ancestral North Indian percentages in their paper but their model is a two-ancestry ANI+ASI one, I simply calculated the ASI percentages as 100% minus ANI.

The correlation between Reich et al ASI and my HarappaWorld South Indian component for the relevant populations turns out to be 0.99277086.

And the linear regression fit for the data is:

ASI = 2.5218942 + 0.8104836 * S_INDIAN

where both ASI (Reich et al) and S_INDIAN (HarappaWorld) are given in percentages.

Of the individuals in HarappaWorld, I kept only those who had a South Indian component of at least 20% for computing the ASI proportions.

The resulting ASI percentages can be seen in a spreadsheet.

Please note that in the Group sheet, the averages are based on the samples which met the 20% South Indian component threshold. Thus, the 20% ASI in the Romanians is the average of the two Romanians who met the threshold out of a total of 16 Romanian samples.

The individual results are available in the Individual sheet. These results are a little different from the estimates using reference 3. Thus, I would point out that these should be taken only as a rough estimate.

HarappaOracle Limitations

While HarappaOracle is a great tool, it has its limitations.

First of all, do not think of the mixed mode results as showing which populations you are descended from. Use HarappaOracle to get an idea of which populations are similar to you in their admixture results. This function is especially important since admixture results should be understood in relative terms, as I have been stressing.

Sometimes, for mixed-race people, the Oracle might sometimes provide a correct result like it does for me. For others, the known ancestral mix might not show up.

There is also the fact that the Oracle calculator is sensitive to your admixture percentages and sometimes small changes can change the Oracle mixed mode results radically.

Let's look at three siblings as an example. (My thanks to them for letting me use their results for this post.) Here are their admixture results:

Sibling 1 Sibling 2 Sibling 3
NE Euro 43.8% 43.1% 43.9%
Mediterranean 27.6% 26.8% 27.0%
Baloch 11.2% 11.1% 12.2%
Caucasian 9.4% 10.8% 8.6%
S Indian 5.3% 6.5% 7.0%
SW Asian 1.5% 1.0% 0.5%
American 0.7% 0.3% 0.7%
NE Asian 0.4% 0.1% 0.1%
Beringian 0.0% 0.3% 0.0%
San 0.0% 0.1% 0.0%

Their admixture results are broadly similar, as expected. Some of you might think that 1% less or difference is very significant, but do consider what we know of DNA inheritance and the error margins in ADMIXTURE.

Now let's see their HarappaWorld Oracle results.

Sibling 1 Sibling 2 Sibling 3
romany 3.16 romany 2.85 romany 4.45
hungarian 9.47 hungarian 9.65 utahn-white 10.74
utahn-white 9.88 french 11.05 n-european 10.79
n-european 9.9 slovenian 11.09 hungarian 10.97
french 9.95 n-european 11.38 utahn-white 11.35
utahn-white 10.62 utahn-white 11.54 french 11.54
slovenian 10.95 utahn-white 12.21 british 11.94
british 11.29 british 12.99 slovenian 12.21
orcadian 13.17 orcadian 14.86 orcadian 13.51
ukranian 17.73 romanian 17.14 ukranian 18.24

Again, not unexpected. The top 10 population matches are not too different for the siblings. There are some differences, but nothing extraordinary.

Finally let's look at mixed mode Oracle, where we try to find the 10 closest matches (based on admixture results) assuming that these individuals are mixed from two populations.

Sibling 1 Sibling 2 Sibling 3
91.3% romany + 8.7% lithuanian 1.58 93.3% romany + 6.7% lithuanian 1.93 82.8% utahn-white + 17.2% bene-israel 1.99
78.4% romany + 21.6% n-european 1.67 95.4% romany + 4.6% finnish 1.99 83.7% n-european + 16.3% bene-israel 2.37
79.7% romany + 20.3% utahn-white 1.69 91.9% romany + 8.1% belorussian 2.01 83.8% utahn-white + 16.2% bene-israel 2.47
83.3% romany + 16.7% orcadian 1.76 92.4% romany + 7.6% russian 2.02 84.8% n-european + 15.2% cochin-jew 2.52
94.2% romany + 5.8% finnish 1.88 92.0% romany + 8.0% mordovian 2.06 86.0% n-european + 14.0% kerala-christian 2.93
79.8% romany + 20.2% utahn-white 1.99 90.4% romany + 9.6% ukranian 2.14 85.0% utahn-white + 15.0% cochin-jew 2.98
90.2% romany + 9.8% belorussian 2.00 88.7% romany + 11.3% slovenian 2.5 85.9% n-european + 14.1% ap-hyderabad 2.99
82.1% romany + 17.9% british 2.03 89.2% romany + 10.8% n-european 2.52 85.1% n-european + 14.9% up 3.02
91.4% romany + 8.6% russian 2.06 95.7% romany + 4.3% chuvash 2.58 85.5% n-european + 14.5% tn-brahmin 3.13
85.0% n-european + 15.0% bene-israel 2.16 90.7% romany + 9.3% utahn-white 2.58 85.5% n-european + 14.5% brahmin-tamil-nadu 3.15

Sibling 1 and Sibling 2 are again not too different from each other: Mostly Romany with some European. However, Sibling 3 is getting vastly different results. Why? No, Sibling 3 wasn't adopted! The reason is simple. Sibling 3 has more South Indian component than the average Romany in our dataset. This means that (s)he cannot be represented as a mix of Romany and a European ethnicity without a large error. Instead mostly northwest European and a little bit of Indian, especially Indian Jewish, seem to be closest to her results. However, this does not make her Jewish or Indian Jewish (who are quite mixed with the local Indian populations).

HarappaWorld HRP0240-HRP0244

From now on, instead of waiting till I have a batch of 10 new participants to compute their Admixture results, I'll run admixture at the start of the month for those who submitted their data during the previous month.

So I have added the HarappaWorld Admixture results for HRP0240-HRP0244 to the individual spreadsheet.

I have also recomputed the weighted averages for Bengalis (from 3 to 5 now), Kerala Muslims (from 1 to 2), and Georgians (from 3 to 4) while adding a new one for our first North Ossetian participant.

Do note that the admixture components do not necessarily represent real ancestral populations. Also, the names I have chosen for the components should be thought of as mnemonics to ease discussion. I chose them based on which populations in my data these components peaked in. They do not tell anything directly about ancestral populations. The best way to look at these admixture results is by comparing individuals and populations. Finally, the standard error estimates on these results can be about 1%. Therefore, it is entirely possible that your 1% exotic admixture result is just noise.