HGDP to PED Conversion

For converting the HGDP data (from Stanford University) to Plink PED format, I used the following code.

dos2unix HGDP_FinalReport_Forward.txt
dos2unix HGDP_Map.txt
dos2unix SampleInformation.txt
head --lines=1 HGDP_FinalReport_Forward.txt > header.txt
awk '{for (i=1;i<=NF;i++) print "0",$i,"0","0"}' header.txt > hgdp_nosex.tfam
sed '1d' HGDP_FinalReport_Forward.txt > HGDP_Data_NoHeader.txt
sort -k 1b,1 HGDP_Data_NoHeader.txt > HGDP_Data_Sorted.txt
sort -k 1b,1 HGDP_Map.txt > HGDP_Map_Sorted.txt
join -j 1 HGDP_Map_Sorted.txt HGDP_Data_Sorted.txt > HGDP_compound.txt
awk '{if ($2=="M") $2="MT";printf("%s %s 0 %s ",$2,$1,$3);
    for (i=4;i<=NF;i++)
        printf("%s %s ",substr($i,1,1),substr($i,2,1));
    printf("\n")}' HGDP_compound.txt > hgdp.tped
# Add sex info
sed '1d' SampleInformation.txt > temp.txt
sed '$d' temp.txt > SampleInfo_noheader.txt
awk '{printf("HGDP%05d ",$1);
    if ($6=="m") print "1";
    else if ($6=="f") print "2";
    else print "0";}' SampleInfo_noheader.txt > Sample_sex.txt
awk 'BEGIN {
	while ((getline < "Sample_sex.txt") > 0)
		f2array[$1] = $2}
	{if (f2array[$2])
		print $1, $2, $3, $4, f2array[$2], "0"
		print $2 "not listed in file2" > "unmatched"
	}' hgdp_nosex.tfam > hgdp.tfam
# convert to ped
plink --tfile hgdp --out hgdp --make-bed --missing-genotype - --output-missing-genotype 0
# Filter to 952 (or 940) people using the SampleInformation.txt file
awk '{if ($16=="1") printf("0 HGDP%05d\n",$1);}' SampleInfo_noheader.txt > Sample_keep.txt
plink --bfile hgdp --keep Sample_keep.txt --make-bed --out hgdp940

This (and more) could easily be done in Perl. You can look at the SPSmart code for some help along these lines.


  1. Thank you for sharing this, Zack 🙂 .

    If you find the time, could you also do one for the Behar et al. dataset? Since that one is tad more difficult to convert.

  2. Please also help to convert PAn Asian data in to plink ped format.


    • I have converted it. But it has only 18,000 SNPs in common with 23andme. So I won't use it for Admixture. Let's see if I can get some value out of it for PCA.

  3. Pardon this ignorant question, but what language is this? I tried to run this shell script in terminal on my mac and it didn't recognize the first 4 or so commands... Also, where did you get a SampleInformation.txt file? The only comparable file I saw when I downloaded the Stanford HGDP data was SampleList.txt...but that's just a list of IDs...

    If you could reply I would really appreciate it!

    • I can't guarantee it will work on a Mac. The dos2unix utility (which you are likely missing) just changes line endings from DOS to Unix format so later parsing doesn't run into any problems. You can easily change the files to Mac endings if needed.

      The sample list file can be downloaded here.

  4. Thanks for the quick reply! I ran your program from the UNIX cluster at my school rather than my desktop and it worked just fine. Also that file was key. MANY thanks! =)

  5. harpreet kaur

    Please tell me how to download hgdp imputed snps data.

  6. I know this post is from last year but if I can speak for those noobs trying to catch up, MANY THANKS ZACK!!!

  7. Thank you for sharing your script Zack. I tried it but it blocks at the conversion to bed (i changed: SampleInformation.txt --> HGDP_SampleList.txt). Thanks