Tag Archives: plink

FTDNA FF to PED Conversion

Someone asked about how to convert a FTDNA Family Finder csv data file to the Plink format. I threw together a very simple Unix script to do that and I am sharing it here:

#!/bin/bash
if test -z "$1"
then
        echo "FTDNA raw data filename not supplied as argument."
        exit 0
fi
echo "Family ID: "
read fid
echo "Individual ID: "
read id
echo "Paternal ID: "
read pid
echo "Maternal ID: "
read mid
echo "Sex (m/f/u): "
read sexchr
if [[ $sexchr == m* ]]
then
        sex=1
elif [[ $sexchr == f* ]]
then
        sex=2
else
        sex=0
fi
pheno=0
 
echo "$fid $id $pid $mid $sex $pheno" > $id.tfam
 
dos2unix $1
sed '1d' $1 > $id.nocomment
awk -F, '{gsub(/"/,""); print $2,$1,"0",$3,substr($4,1,1),substr($4,2,1)}' $id.nocomment > $id.tped
rm $id.nocomment
 
plink --tfile $id --out $id --make-bed --missing-genotype - --output-missing-genotype 0

This script creates three files: *.bed, *.bim and *.fam, which are the binary format files for Plink. You can then use Plink to merge multiple files, filter SNPs or individuals and do other processing.

Related Reading:

FF, Vol. 3
FF - Volume 2: Family Freakout (Marvel Now) (Fantastic Four)
A Twisted Bard's Tale
FF by Jonathan Hickman - Volume 4 (Fantastic Four)
THE KING'S SON (The Evidence) (Volume 2)

Pan-Asian to PED Conversion

Even though the Pan-Asian dataset is not public, there was a request for my script to convert the data to Plink's PED format.

Here is how I convert the Pan-Asian data to Plink's transposed file format.

#!/usr/bin/perl -w
 
$file="Genotypes_All.txt";
 
open(INFILE,"<",$file);
open(TFAM,">","panasian.tfam");
open(TPED,">","panasian.tped");
 
$line = <INFILE>;
chomp $line;
@first = split('\t',$line);
foreach my $sample (5..$#first) {
        print TFAM "0 $first[$sample] 0 0 0 -9\n";
}
 
my $alleles;
 
while(<INFILE>) {
        chomp;
        @lines = split('\t',$_);
        my ($major,$minor) = split('/',$lines[4]);
        print TPED "$lines[2] $lines[1] 0 $lines[3]";
        foreach my $snp (5..$#lines) {
                if ($lines[$snp] == 0) {
                        $alleles = "$major $major";}
                elsif ($lines[$snp] == 1) {
                        $alleles = "$major $minor";}
                elsif ($lines[$snp] == 2) {
                        $alleles = "$minor $minor";}
                else {
                        $alleles = "0 0";}
                print TPED " $alleles";
        }
        print TPED "\n";
}
 
close(INFILE);
close(TFAM);
close(TPED);

Again, no guarantees! It's Perl though, so it should be more stable across various operating systems.

Related Reading:

Practical Paleo: A Customized Approach to Health and a Whole-Foods Lifestyle
Asian American Panethnicity: Bridging Institutions and Identities
All-In-One Quilter's Reference Tool Easy-To-Follow Charts, Tables and Illustrations, Yardage Requirements, Cutting Instructions, Setting Secrets, Choosing ... Piecing Techniques, Number Conversions
Houndsley and Catina Plink and Plunk: Candlewick Sparks

Xing to PED Conversion

Following mallu's request, here is the code I used to convert Xing et al data to Plink's PED format.

#!/bin/bash
dos2unix *.csv
 
head --lines=1  JHS_Genotype.csv > header.txt
awk -F, '{for (i=2;i<=NF;i++) print "0",$i,"0","0","0", "0"}' header.txt > xing.tfam
sed '1d' JHS_Genotype.csv > genotype.csv
sort -t',' -k 1b,1 genotype.csv > genotype_sorted.csv
sort -t',' -k 1b,1 JHS_SNP.csv > snp_sorted.csv
join -t',' -j 1 snp_sorted.csv genotype_sorted.csv > xing_compound.csv
awk -F, '{printf("%s %s 0 %s ",substr($2,4),$1,$3); 
        for (i=6;i<=NF;i++)
                printf("%s %s ",substr($i,1,1),substr($i,2,1));
        printf("\n")}' xing_compound.csv > xing.tped
 
plink --tfile xing --out xing --make-bed --missing-genotype N --output-missing-genotype 0

I make no guarantees that it will work for you. I used it on my Ubuntu box, but I am sure it'll have trouble on Mac OS.

Related Reading:

The Chinese Thought of It: Amazing Inventions and Innovations (We Thought Of It)
Nei Jia Quan: Internal Martial Arts Teachers of Tai Ji Quan, Xing Yi Quan, and Ba Gua Zhang
Plink, Plank, Plunk Piano Solo
A Little Town called Plink...and other stories to read with a silly accent
Houndsley and Catina Plink and Plunk: Candlewick Sparks

HGDP to PED Conversion

For converting the HGDP data (from Stanford University) to Plink PED format, I used the following code.

#!/bin/bash
dos2unix HGDP_FinalReport_Forward.txt
dos2unix HGDP_Map.txt
dos2unix SampleInformation.txt
head --lines=1 HGDP_FinalReport_Forward.txt > header.txt
awk '{for (i=1;i<=NF;i++) print "0",$i,"0","0"}' header.txt > hgdp_nosex.tfam
sed '1d' HGDP_FinalReport_Forward.txt > HGDP_Data_NoHeader.txt
sort -k 1b,1 HGDP_Data_NoHeader.txt > HGDP_Data_Sorted.txt
sort -k 1b,1 HGDP_Map.txt > HGDP_Map_Sorted.txt
join -j 1 HGDP_Map_Sorted.txt HGDP_Data_Sorted.txt > HGDP_compound.txt
 
awk '{if ($2=="M") $2="MT";printf("%s %s 0 %s ",$2,$1,$3);
    for (i=4;i<=NF;i++)
        printf("%s %s ",substr($i,1,1),substr($i,2,1));
    printf("\n")}' HGDP_compound.txt > hgdp.tped
 
# Add sex info
sed '1d' SampleInformation.txt > temp.txt
sed '$d' temp.txt > SampleInfo_noheader.txt
awk '{printf("HGDP%05d ",$1);
    if ($6=="m") print "1";
    else if ($6=="f") print "2";
    else print "0";}' SampleInfo_noheader.txt > Sample_sex.txt
awk 'BEGIN {
	while ((getline < "Sample_sex.txt") > 0)
		f2array[$1] = $2}
	{if (f2array[$2])
		print $1, $2, $3, $4, f2array[$2], "0"
	else
		print $2 "not listed in file2" > "unmatched"
	}' hgdp_nosex.tfam > hgdp.tfam
 
# convert to ped
plink --tfile hgdp --out hgdp --make-bed --missing-genotype - --output-missing-genotype 0
 
# Filter to 952 (or 940) people using the SampleInformation.txt file
awk '{if ($16=="1") printf("0 HGDP%05d\n",$1);}' SampleInfo_noheader.txt > Sample_keep.txt
plink --bfile hgdp --keep Sample_keep.txt --make-bed --out hgdp940

This (and more) could easily be done in Perl. You can look at the SPSmart code for some help along these lines.

Related Reading:

Conversion: The Last Great Retail Metric
Houndsley and Catina Plink and Plunk: Candlewick Sparks
Mapping Human History: Discovering the Past Through Our Genes
The Routledge Companion to Race and Ethnicity (Routledge Companions)

23andme Conversion to PED

Someone asked about how to convert a 23andme raw data file to the Plink format. I threw together a very simple Unix script to do that and I am sharing it here:

#!/bin/bash
if test -z "$1"
then
	echo "23andme raw data filename not supplied as argument."
	exit 0
fi
echo "Family ID: "
read fid
echo "Individual ID: "
read id
echo "Paternal ID: "
read pid
echo "Maternal ID: "
read mid
echo "Sex (m/f/u): "
read sexchr
if [[ $sexchr == m* ]]
then
	sex=1
elif [[ $sexchr == f* ]]
then
	sex=2
else
	sex=0
fi
pheno=0
echo "$fid $id $pid $mid $sex $pheno" > $id.tfam
dos2unix $1
sed '/^\#/d' $1 > $id.nocomment
awk '{ if (length($4)==1) print $2,$1,"0",$3,substr($4,1,1),substr($4,1,1); else
    print $2,$1,"0",$3,substr($4,1,1),substr($4,2,1) }' $id.nocomment > $id.tped
plink --tfile $id --out $id --make-bed --missing-genotype - --output-missing-genotype 0

Well that's it! You can easily create a Perl script to do the same but this was faster for me.

This script creates three files: *.bed, *.bim and *.fam, which are the binary format files for Plink. You can then use Plink to merge multiple files, filter SNPs or individuals and do other processing.

UPDATE: A Perl script to do the same:

#!/usr/bin/perl -w
 
$numArgs = $#ARGV + 1;
if ($numArgs < 1) {
        print "23andme raw data filename not provided.\n";
        exit 0;
}
 
$file = $ARGV[0];
 
print "Family ID: ";
$fid = <STDIN>;
chomp $fid;
print "Individual ID: ";
$id = <STDIN>;
chomp $id;
print "Paternal ID: ";
$pid = <STDIN>;
chomp $pid;
print "Maternal ID: ";
$mid = <STDIN>;
chomp $mid;
print "Sex (m/f/u): ";
$sexchr = <STDIN>;
if (lc(substr($sexchr,0,1)) eq "m") {
        $sex = 1; }
elsif (lc(substr($sexchr,0,1)) eq "f") {
        $sex = 2; }
else {
        $sex = 0; }
$pheno = 0;
$tfamname = ">" . $id . ".tfam";
open(TFAM, $tfamname);
print TFAM "$fid $id $pid $mid $sex $pheno";
close TFAM;
 
open(INFILE,$file);
open(TPED,">" . $id . ".tped");
while (<INFILE>) {
        next if /#.*/;
        chomp;
        my($rsid,$chr,$pos,$geno) = split(/\s/);
        if (length($geno)==1) {
                $geno1 = $geno;
                $geno2 = $geno;
        }
        else {
                $geno1 = substr($geno,0,1);
                $geno2 = substr($geno,1,1);
        }
        print TPED "$chr $rsid 0 $pos $geno1 $geno2\n";
}
close TPED;
close INFILE;

That should work on Linux, Microsoft Windows and Mac OS X.

Related Reading:

Conversion Optimization: The Art and Science of Converting Prospects to Customers
Plink, Plank, Plunk!
Automate This: How Algorithms Came to Rule Our World
Landing Page Optimization: The Definitive Guide to Testing and Tuning for Conversions

Admixture: Reference Population

For regular admixture analysis, I am using HapMap, HGDP, SGVP and Behar datasets with some samples removed as I wrote earlier.

For each of these datasets,

  1. I first filtered to keep only the list of SNPs present in 23andme v2 chip.
    plink --bfile data --extract 23andmev2.snplist
  2. I also filtered for founders:
    plink --bfile data --filter-founders
  3. And excluded SNPs with missing rates greater than 1%:
    plink --bfile data --geno 0.01

Then, I merged the datasets one by one. The reason for doing it one by one was that there were conflicts of strand orientation (forward or reverse) between the different datasets. If the merge operation gave an error, I had to flip those strands in one dataset and try the merge again.

plink --bfile data1 --bmerge data2.bed data2.bim data2.fam --make-bed
plink --bfile data2 --flip plink.missnp --make-bed --out data2flip
plink --bfile data1 --bmerge data2flip.bed data2flip.bim data2flip.fam --make-bed

Once all the four datasets were merged, I processed the combined data file:

  1. Removed SNPs with a missing rate of more than 1% in the combined dataset
    plink --bfile data --geno 0.01
  2. Then i performed linkage disequilibrium based pruning using a window size of 50, a step of 5 and r^2 threshold of 0.3:
    plink --bfile data --indep-pairwise 50 5 0.3
    plink --bfile data --extract plink.prune.in --make-bed

This gave me a reference population of 2,693 2,654 individuals with each sample having about 186,000 SNPs. Out of these 2,693 2,654 individuals, we have a total of 398 South Asians belonging to 16 ethnic groups.

Finally, it's time to start having some fun!

UPDATE: I removed 39 Pygmy and San samples because they were causing some trouble with African ancestral components. Since we are not interested in detailed African ancestry and African admixture among South Asians is not likely to be pygmy or San, I decided it would be best to remove them.

Related Reading:

Pocket Ref 4th Edition
It's Not About the Coffee: Lessons on Putting People First from a Life at Starbucks
Traveling Heavy: A Memoir in between Journeys
240 Vocabulary Words Kids Need to Know: Grade 2: 24 Ready-to-Reproduce Packets That Make Vocabulary Building Fun & Effective
The Routledge Companion to Race and Ethnicity (Routledge Companions)