Monthly Archives: February 2011 - Page 3

Reference II Admixture Analysis K=2-5

Our Reference II Dataset has 3,161 samples with 544 South Asians belonging to 24 ethnic groups. Unfortunately, we can do our admixture analysis on about 23,000 SNPs.

The ancestral population averages for each ethnic group from the admixture analysis can be seen in this spreadsheet. I have also calculated the standard deviation of the ancestral components for the samples in each ethnic group.

Here are the results for K=2.

Reference II Admixture K=2

For K=3, we get the ancestral populations: European, E Asian, African.

Reference II Admixture K=3

For K=4, the ancestral populations are South Asian, European, East Asian and African.

Reference II Admixture K=4

Let's compare the results of K=4 admixture analysis of Reference I and Reference II datasets.

While there is some difference in the average percentages of ancestral components computed with the two reference datasets, most of the differences are 1% or less. The mean absolute difference for the four components is as follows:

Ancestral Component Mean(Abs(Ref1-Ref2))
South Asian (C1) 0.92%
European (C2) 0.58%
East Asian (C3) 0.52%
African (C4) 0.32%

I have highlighted the larger differences which affect: Balochi, Kalash, Malayan, Melanesian, Papuan, and Samaritians. Even then the largest change is about 5%.

Let's also look at the Fst divergences. Here's for Reference I admixture results:

C1 C2 C3
C2 0.071
C3 0.083 0.109
C4 0.152 0.152 0.184

And for Reference II:

C1 C2 C3
C2 0.074
C3 0.086 0.118
C4 0.156 0.159 0.194

The Fst numbers for Reference II are somewhat higher.

Considering that Reference II has only one-eighth of the SNPs of Reference I, the results are fairly good.

Here's K=5 admixture analysis for Reference II:

Reference II Admixture K=5

Higher K values to follow.

Admixture K=4,7,9, HRP0011 to HRP0020

We'll go to higher values of K (number of ancestral populations) for batch 1 later, but let's not keep the other batches waiting.

Here's the spreadsheet with their admixture results. And you can check their ethnic backgrounds.

You might also want to refer to the reference dataset I admixture analyses for K=2-5 and K=6-9.

I did not run admixture for all values of K this time. So let's start with K=4. For quick reference,

C1 South Asian
C2 European
C3 East Asian
C4 African

Batch 2 Admixture K=4

Now, for K=7, the ancestral components are:

C1 South Asian
C2 European
C3 Southeast Asian
C4 Southwest Asian
C5 Papuan
C6 Northeast Asian
C7 African

Batch 2 Admixture K=7

And finally, here's K=9.

C1 South Asian
C2 Kalash
C3 Southwest Asian
C4 Southeast Asian
C5 European
C6 Papuan
C7 Northeast Asian
C8 West African
C9 East African

Batch 2 Admixture K=9

What do you guys think?

Higher values of K will be coming when admixture is done taking it sweet time to run. But more analysis and results are coming fast and furious now.

Reference Admixture Analysis K=6-9

Continuing the admixture analysis on my reference dataset I, let's look at K=6 ancestral components.

As before, all the results are listed in a spreadsheet.

For K=6, we get the following plot:

Admixture: Reference populations K=6

C1 (red) is the South Asian ancestral component. However, the Austronesian (Papuan/Melanesian) component has now separated from it as C5 (blue). You can see small proportions of the Papuan component among South Indian and Southeast Asian (Malay and Cambodian) populations.

C3 (green) is exactly the same as C3 in K=5 run and represents East Asian populations. C6 (magenta) is exactly the same component as C5 in the K=5 run and represents African ancestry. C4 (cyan) is the same as C4 in the K=5 run and represents Southwest/West Asia.

C2 (yellow) is the European component (maximum among North Europeans) almost the same as C2 in K=5 analysis. The major difference is that C2 (in K=6) is reduced among South Asians as compared to K=5. This is due to the South Asian component being higher for them.

Fst divergences between estimated populations for K=6:

C1 C2 C3 C4 C5
C2 0.053
C3 0.084 0.114
C4 0.068 0.052 0.130
C5 0.178 0.205 0.184 0.218
C6 0.148 0.165 0.186 0.157 0.260

When we increase the ancestral components to K=7,

Admixture: Reference populations K=7

The South Asian component (C1/red) is the same. Note that there is a significant drop from about 51% to 29% from Makranis to Iranians (ignore the Paniya as there are only 4 samples with one being very different). Looking at the 19 individual Iranian samples from our reference dataset, their South Asian ancestral component values vary from 17% to 33%.

The Southwest/West Asian component (C2/yellow) is now higher among West Asians and lower among East Africans compared to K=6 run. C3/green is the European component which now almost disappears from the Southwest Asian populations.

The East Asian component (C4/bluish green) is the same as before as is the Papuan C5/light blue component.

The African ancestry breaks into West African (C6/blue) and East African (C7/magenta).

Note that the split here is different from the batch 1 run where the East Asian split into two for K=7 and the African split happened at K=8, the opposite of what happened here.

Fst divergences between estimated populations for K=7:

C1 C2 C3 C4 C5 C6
C2 0.058
C3 0.052 0.034
C4 0.082 0.122 0.113
C5 0.176 0.210 0.204 0.184
C6 0.152 0.159 0.167 0.190 0.264
C7 0.112 0.113 0.122 0.153 0.229 0.037

At K=8, the East Asian components forks into two: A Southeast Asian (C4/bright green) one that is highest among the Dai, Malay, Cambodians and Lahu; and a Northeast Asian one (C6/blue) that is maximum among the Yakut, Oroqen, Japanese, Hezhen and Daur.

Admixture: Reference populations K=8

Among most South Asian groups in our reference dataset, the Southeast Asian component is much more common than the Northeast Asian one.

Fst divergences between estimated populations for K=8:

C1 C2 C3 C4 C5 C6 C7
C2 0.058
C3 0.052 0.034
C4 0.096 0.133 0.125
C5 0.177 0.211 0.205 0.201
C6 0.093 0.131 0.122 0.046 0.195
C7 0.152 0.159 0.167 0.200 0.266 0.201
C8 0.113 0.113 0.113 0.163 0.231 0.163 0.037

Here's the plot for K=9 ancestral components:

Reference Populations Admixture K=9

The new component here is the Kalash component which is at 94% among the Kalash but is in the 30-40% range for Caucasian and Pakistani populations. It is also present among West Asians, Europeans and Central Asians to a small degree.

Looking at the Kalash samples, they seem fairly uniform and mostly with only a little of the other ancestral components except for one sample which has 70% Kalash component.

Kalash Admixture K=9

Fst divergences between estimated populations for K=9:

C1 C2 C3 C4 C5 C6 C7 C8
C2 0.056
C3 0.064 0.072
C4 0.088 0.126 0.136
C5 0.064 0.061 0.039 0.131
C6 0.167 0.208 0.214 0.202 0.211
C7 0.084 0.124 0.133 0.045 0.127 0.195
C8 0.152 0.173 0.161 0.201 0.172 0.266 0.200
C9 0.115 0.133 0.117 0.165 0.129 0.233 0.164 0.036

From the Fst values, we can see that the Kalash component is closest to the South Asian component and then to the European component.

To summarize, here are the ancestral components inferred for different values of K.

K=2 K=3 K=4 K=5 K=6 K=7 K=8 K=9
Eurasian European S Asian S Asian S Asian S Asian S Asian S Asian
African E Asian European European European SW Asian SW Asian Kalash
African E Asian E Asian E Asian European European SW Asian
African SW Asian SW Asian E Asian SE Asian SE Asian
African Papuan Papuan Papuan European
African W African NE Asian Papuan
E African W African NE Asian
E African W African
E African

Note that for a specific value of K, they are listed in approximately decreasing average percentage among the South Asian samples in our reference dataset I.

Admixture K=6-9, HRP0001 to HRP0010

Let's continue our admixture analysis of the first 10 Harappa Project participants.

Here are their ethnic backgrounds and their admixture analysis results.

You might want to refer to the admixture analysis of the reference dataset.

Let's look at K=6 ancestral components. As seen in the reference admixture results, we got a Papuan ancestral component (C5/blue).

Batch 1 Admixture K=6

You can see the increase in C1/red South Asian component in all the participants. The Papuan component (C5/blue) is present is all except our Assyrian sample. It is lower among the Punjabis though.

The East Asian (C3/green) is about the same as in K=5 analysis. C6/magenta, the African component, is only present in HRP0001 (me) at the same proportion as K=5. The Southwest/West Asian component (C4/cyan) is the same as C4 in K=5 with no changes.

The European component (C2/yellow) reduced in magnitude among the South Asian participants by about 14-19%. My guess about that is that the South Asian component became more "pure" for K=6 due to the separate Papuan component which was merged in the South Asian one in K=5. So it better represents the South Asians now compared to K=5, thus reducing the European proportion.

Batch 1 Admixture K=7

For K=7, C1 is South Asian, C2 European, C4 Southwest/West Asian, C5 Papuan and C7 African. These are all same as before.

The East Asian component has split into two: C3 Southeast Asian and C6 Northeast Asian. For this batch of Harappa participants, most of their East Asian ancestry falls into the Southeast Asian component.

Batch 1 Admixture K=8

For K=8, C1 is South Asian, C4 is Southeast Asian, C5 is Papuan, C6 is Northeast Asian and these have stayed about the same.

C2 (Southwest/West Asian) component has increased for most Harappa members, especially for HRP0010 (Assyrian Iranian). This change in West Asian component is balanced a bit by a decrease in C3 (European) component but the main reason for the West Asian change is that East African component has split from the Southwest/West Asian and the African components.

The African component has split into C7 West African and C8 East African. As usual, HRP0001 (me) is the only one with any West or East African component, though I have more of East African than West which makes sense due to my (part-)Egyptian ancestry.

Batch 1 Admixture K=9

For K=9, C1 is the South Asian component and it decreased in all project members except for South Indians and Bengalis. It even decreased in the Bihari sample (HRP0003) and almost disappeared from the Assyrian Iranian one (HRP0010).

The reason is the appearance of what I am calling the Kalash ancestral component (C2). This component is at 94% among the Kalash reference populaton, followed by 41% among Lezgin (a Caucasian group). It is also high among the Pakistani reference populations and other Caucasian populations. Among our first batch of Harappa participants, this Kalash component is high (27-31%) among the Punjabis and Assyrian Iranian.

C3 is the Southwest/West Asian component which hasn't changed a lot among the project members. The Southeast Asian component (C4) has decreased, as has C5 (European).

The Papuan component (C6) has remained small.

C7 (Northeast Asian), C8 (West African), and C9 (East African) have stayed the same.

I am running admixture for even higher values of K, but it takes a long time. While those are running, I am going to go ahead and start the 2nd batch (HRP0011 to HRP0020). For those, I am not going to run all K values. Instead I'll do only a few. If you have any suggestions on which specific K values I should focus on for the latter batches, please let me know.

PS. I have added the names of components to the spreadsheet for ease of use, but these should be thought of as useful mnemonics rather than these components representing some "pure" ancient population. Also remember that the South Asian (or other) component from one K value to the next might not be the same.

HapMap Gujaratis

Razib is wondering what's going on with the HapMap Houston Gujaratis.

As you can see, the Chinese simply do not vary much, and are a tight cluster. But, there is a somewhat equivalent Gujarati cluster too! The HapMap sample was collected from Gujaratis in Houston. To me, it looks like that Houston population can be divided into two groups: one of the tight cluster, and the rest of the population, which is all over the place. [...] What’s more interesting is to try and understanding what’s going on with Houston Gujaratis. Anyone in the audience know?

And his 3-dimensional PCA plot: (Those on the right are Gujaratis)
PCA Plot of Gujaratis and Chinese

So I thought I would share the admixture results for the Gujaratis for K=8. Here's the spreadsheet of the admixture proportions for Gujaratis. And here is the plot:

Gujaratis Admixture K=8

The ancestral components and their statistics are as follows:

Population Range Mean Median
C1 South Asian 64-89% 81.9% 85.8%
C2 West Asian 0-13% 2.3% 1.6%
C3 European 2-22% 7.6% 5.0%
C4 Southeast Asian 0-9% 4.9% 5.0%
C5 Austronesian 1-6% 2.8% 2.9%
C6 Northeast Asian 0-3% 0.4% 0.0%
C7 West African 0-1% 0.0% 0.0%
C8 East African 0-0% 0.0% 0.0%

It looks like a majority of the Gujarati samples have mostly South Asian ancestral component with small amounts of West Asian, European and Southeast Asian, but some Gujarati samples have much larger West Asian and/or European ancestral components.

Latest on Participants

I have a total of 31 participants in the project right now who have sent me their raw data. The following groups are represented:

  • Punjab: 7
  • Tamil: 4
  • Iran: 4
  • Andhra Pradesh: 2
  • Bengal: 2
  • Bihar: 2
  • Karnataka: 2
  • Caribbean Indian: 2
  • Anglo-Indian: 1
  • Roma: 1
  • Kashmir: 1
  • Goa: 1
  • Uttar Pradesh: 1
  • Sri Lankan: 1

Keep them coming!

I am going to get some admixture analysis on the second batch (HRP0011 to HRP0020) done this week.

HGDP to PED Conversion

For converting the HGDP data (from Stanford University) to Plink PED format, I used the following code.

dos2unix HGDP_FinalReport_Forward.txt
dos2unix HGDP_Map.txt
dos2unix SampleInformation.txt
head --lines=1 HGDP_FinalReport_Forward.txt > header.txt
awk '{for (i=1;i<=NF;i++) print "0",$i,"0","0"}' header.txt > hgdp_nosex.tfam
sed '1d' HGDP_FinalReport_Forward.txt > HGDP_Data_NoHeader.txt
sort -k 1b,1 HGDP_Data_NoHeader.txt > HGDP_Data_Sorted.txt
sort -k 1b,1 HGDP_Map.txt > HGDP_Map_Sorted.txt
join -j 1 HGDP_Map_Sorted.txt HGDP_Data_Sorted.txt > HGDP_compound.txt
awk '{if ($2=="M") $2="MT";printf("%s %s 0 %s ",$2,$1,$3);
    for (i=4;i<=NF;i++)
        printf("%s %s ",substr($i,1,1),substr($i,2,1));
    printf("\n")}' HGDP_compound.txt > hgdp.tped
# Add sex info
sed '1d' SampleInformation.txt > temp.txt
sed '$d' temp.txt > SampleInfo_noheader.txt
awk '{printf("HGDP%05d ",$1);
    if ($6=="m") print "1";
    else if ($6=="f") print "2";
    else print "0";}' SampleInfo_noheader.txt > Sample_sex.txt
awk 'BEGIN {
	while ((getline < "Sample_sex.txt") > 0)
		f2array[$1] = $2}
	{if (f2array[$2])
		print $1, $2, $3, $4, f2array[$2], "0"
		print $2 "not listed in file2" > "unmatched"
	}' hgdp_nosex.tfam > hgdp.tfam
# convert to ped
plink --tfile hgdp --out hgdp --make-bed --missing-genotype - --output-missing-genotype 0
# Filter to 952 (or 940) people using the SampleInformation.txt file
awk '{if ($16=="1") printf("0 HGDP%05d\n",$1);}' SampleInfo_noheader.txt > Sample_keep.txt
plink --bfile hgdp --keep Sample_keep.txt --make-bed --out hgdp940

This (and more) could easily be done in Perl. You can look at the SPSmart code for some help along these lines.

Changes due to San/Pygmy Removal

As mentioned earlier, I removed San and Pygmy groups from my reference datasets.

For the admixture runs on Reference Dataset I, the only major changes are for K=2 ancestral components where most European, Middle Eastern and South/Central Asian groups increase their African component. The changes for K=3,4,5 were minor as shown by these statistics:

K Median Abs Maximum Abs
3 0.01% 0.22%
4 0.02% 0.26%
5 0.02% 0.71%

I have updated the spreadsheet and the plots in the original post.

Looking at the changes in the admixture results I already posted for Harappa Project participants HRP0001 to HRP0010, there is major change for K=2. The African compoent (C1/red) increased by a lot among all project participants. This seems to be due to the African component best representing West Africans now instead of Pygmies as it did before.

For K=3,4,5, the changes are very minor. Let's look at the absolute value of the changes in the percentages of ancestral components for the ten project participants.

K Median Abs Maximum Abs
3 0.05% 0.19%
4 0.05% 0.22%
5 0.09% 0.60%

I have updated the spreadsheets and the charts in the original post.

23andme Conversion to PED

Someone asked about how to convert a 23andme raw data file to the Plink format. I threw together a very simple Unix script to do that and I am sharing it here:

if test -z "$1"
	echo "23andme raw data filename not supplied as argument."
	exit 0
echo "Family ID: "
read fid
echo "Individual ID: "
read id
echo "Paternal ID: "
read pid
echo "Maternal ID: "
read mid
echo "Sex (m/f/u): "
read sexchr
if [[ $sexchr == m* ]]
elif [[ $sexchr == f* ]]
echo "$fid $id $pid $mid $sex $pheno" > $id.tfam
dos2unix $1
sed '/^\#/d' $1 > $id.nocomment
awk '{ if (length($4)==1) print $2,$1,"0",$3,substr($4,1,1),substr($4,1,1); else
    print $2,$1,"0",$3,substr($4,1,1),substr($4,2,1) }' $id.nocomment > $id.tped
plink --tfile $id --out $id --make-bed --missing-genotype - --output-missing-genotype 0

Well that's it! You can easily create a Perl script to do the same but this was faster for me.

This script creates three files: *.bed, *.bim and *.fam, which are the binary format files for Plink. You can then use Plink to merge multiple files, filter SNPs or individuals and do other processing.

UPDATE: A Perl script to do the same:

#!/usr/bin/perl -w
$numArgs = $#ARGV + 1;
if ($numArgs < 1) {
        print "23andme raw data filename not provided.\n";
        exit 0;
$file = $ARGV[0];
print "Family ID: ";
$fid = <STDIN>;
chomp $fid;
print "Individual ID: ";
$id = <STDIN>;
chomp $id;
print "Paternal ID: ";
$pid = <STDIN>;
chomp $pid;
print "Maternal ID: ";
$mid = <STDIN>;
chomp $mid;
print "Sex (m/f/u): ";
$sexchr = <STDIN>;
if (lc(substr($sexchr,0,1)) eq "m") {
        $sex = 1; }
elsif (lc(substr($sexchr,0,1)) eq "f") {
        $sex = 2; }
else {
        $sex = 0; }
$pheno = 0;
$tfamname = ">" . $id . ".tfam";
open(TFAM, $tfamname);
print TFAM "$fid $id $pid $mid $sex $pheno";
close TFAM;
open(TPED,">" . $id . ".tped");
while (<INFILE>) {
        next if /#.*/;
        my($rsid,$chr,$pos,$geno) = split(/\s/);
        if (length($geno)==1) {
                $geno1 = $geno;
                $geno2 = $geno;
        else {
                $geno1 = substr($geno,0,1);
                $geno2 = substr($geno,1,1);
        print TPED "$chr $rsid 0 $pos $geno1 $geno2\n";
close TPED;
close INFILE;

That should work on Linux, Microsoft Windows and Mac OS X.

San and Pygmy

I have removed San and Pygmy groups from my reference datasets. That meant removing 39 samples from Reference Data I and 61 samples from Reference Data II.

The presence of those groups was creating some weird effects in admixture runs at K=8,9. Basically, the ancestral components for Africans I was getting were not stable. Instead they were varying with/without different Harappa participant batches. Also, at K=10,11, there were too many Africa-only ancestral components, forcing me to run even higher values of K.

Since we are not really interested in African diversity in this project and any African admixture among South Asians is most likely to be East, West or North African instead of Pygmy or San, the removal of these groups should not have any implications for the Harappa Ancestry Project.

To make sure that the above assertion is true, I'll re-run admixture analysis for K=2-5 and update later with the results.