Tag Archives: conversion

FTDNA FF to PED Conversion

Someone asked about how to convert a FTDNA Family Finder csv data file to the Plink format. I threw together a very simple Unix script to do that and I am sharing it here:

#!/bin/bash
if test -z "$1"
then
        echo "FTDNA raw data filename not supplied as argument."
        exit 0
fi
echo "Family ID: "
read fid
echo "Individual ID: "
read id
echo "Paternal ID: "
read pid
echo "Maternal ID: "
read mid
echo "Sex (m/f/u): "
read sexchr
if [[ $sexchr == m* ]]
then
        sex=1
elif [[ $sexchr == f* ]]
then
        sex=2
else
        sex=0
fi
pheno=0
 
echo "$fid $id $pid $mid $sex $pheno" > $id.tfam
 
dos2unix $1
sed '1d' $1 > $id.nocomment
awk -F, '{gsub(/"/,""); print $2,$1,"0",$3,substr($4,1,1),substr($4,2,1)}' $id.nocomment > $id.tped
rm $id.nocomment
 
plink --tfile $id --out $id --make-bed --missing-genotype - --output-missing-genotype 0

This script creates three files: *.bed, *.bim and *.fam, which are the binary format files for Plink. You can then use Plink to merge multiple files, filter SNPs or individuals and do other processing.

Pan-Asian to PED Conversion

Even though the Pan-Asian dataset is not public, there was a request for my script to convert the data to Plink's PED format.

Here is how I convert the Pan-Asian data to Plink's transposed file format.

#!/usr/bin/perl -w
 
$file="Genotypes_All.txt";
 
open(INFILE,"<",$file);
open(TFAM,">","panasian.tfam");
open(TPED,">","panasian.tped");
 
$line = <INFILE>;
chomp $line;
@first = split('\t',$line);
foreach my $sample (5..$#first) {
        print TFAM "0 $first[$sample] 0 0 0 -9\n";
}
 
my $alleles;
 
while(<INFILE>) {
        chomp;
        @lines = split('\t',$_);
        my ($major,$minor) = split('/',$lines[4]);
        print TPED "$lines[2] $lines[1] 0 $lines[3]";
        foreach my $snp (5..$#lines) {
                if ($lines[$snp] == 0) {
                        $alleles = "$major $major";}
                elsif ($lines[$snp] == 1) {
                        $alleles = "$major $minor";}
                elsif ($lines[$snp] == 2) {
                        $alleles = "$minor $minor";}
                else {
                        $alleles = "0 0";}
                print TPED " $alleles";
        }
        print TPED "\n";
}
 
close(INFILE);
close(TFAM);
close(TPED);

Again, no guarantees! It's Perl though, so it should be more stable across various operating systems.

Xing to PED Conversion

Following mallu's request, here is the code I used to convert Xing et al data to Plink's PED format.

#!/bin/bash
dos2unix *.csv
 
head --lines=1  JHS_Genotype.csv > header.txt
awk -F, '{for (i=2;i<=NF;i++) print "0",$i,"0","0","0", "0"}' header.txt > xing.tfam
sed '1d' JHS_Genotype.csv > genotype.csv
sort -t',' -k 1b,1 genotype.csv > genotype_sorted.csv
sort -t',' -k 1b,1 JHS_SNP.csv > snp_sorted.csv
join -t',' -j 1 snp_sorted.csv genotype_sorted.csv > xing_compound.csv
awk -F, '{printf("%s %s 0 %s ",substr($2,4),$1,$3); 
        for (i=6;i<=NF;i++)
                printf("%s %s ",substr($i,1,1),substr($i,2,1));
        printf("\n")}' xing_compound.csv > xing.tped
 
plink --tfile xing --out xing --make-bed --missing-genotype N --output-missing-genotype 0

I make no guarantees that it will work for you. I used it on my Ubuntu box, but I am sure it'll have trouble on Mac OS.

HGDP to PED Conversion

For converting the HGDP data (from Stanford University) to Plink PED format, I used the following code.

#!/bin/bash
dos2unix HGDP_FinalReport_Forward.txt
dos2unix HGDP_Map.txt
dos2unix SampleInformation.txt
head --lines=1 HGDP_FinalReport_Forward.txt > header.txt
awk '{for (i=1;i<=NF;i++) print "0",$i,"0","0"}' header.txt > hgdp_nosex.tfam
sed '1d' HGDP_FinalReport_Forward.txt > HGDP_Data_NoHeader.txt
sort -k 1b,1 HGDP_Data_NoHeader.txt > HGDP_Data_Sorted.txt
sort -k 1b,1 HGDP_Map.txt > HGDP_Map_Sorted.txt
join -j 1 HGDP_Map_Sorted.txt HGDP_Data_Sorted.txt > HGDP_compound.txt
 
awk '{if ($2=="M") $2="MT";printf("%s %s 0 %s ",$2,$1,$3);
    for (i=4;i<=NF;i++)
        printf("%s %s ",substr($i,1,1),substr($i,2,1));
    printf("\n")}' HGDP_compound.txt > hgdp.tped
 
# Add sex info
sed '1d' SampleInformation.txt > temp.txt
sed '$d' temp.txt > SampleInfo_noheader.txt
awk '{printf("HGDP%05d ",$1);
    if ($6=="m") print "1";
    else if ($6=="f") print "2";
    else print "0";}' SampleInfo_noheader.txt > Sample_sex.txt
awk 'BEGIN {
	while ((getline < "Sample_sex.txt") > 0)
		f2array[$1] = $2}
	{if (f2array[$2])
		print $1, $2, $3, $4, f2array[$2], "0"
	else
		print $2 "not listed in file2" > "unmatched"
	}' hgdp_nosex.tfam > hgdp.tfam
 
# convert to ped
plink --tfile hgdp --out hgdp --make-bed --missing-genotype - --output-missing-genotype 0
 
# Filter to 952 (or 940) people using the SampleInformation.txt file
awk '{if ($16=="1") printf("0 HGDP%05d\n",$1);}' SampleInfo_noheader.txt > Sample_keep.txt
plink --bfile hgdp --keep Sample_keep.txt --make-bed --out hgdp940

This (and more) could easily be done in Perl. You can look at the SPSmart code for some help along these lines.

23andme Conversion to PED

Someone asked about how to convert a 23andme raw data file to the Plink format. I threw together a very simple Unix script to do that and I am sharing it here:

#!/bin/bash
if test -z "$1"
then
	echo "23andme raw data filename not supplied as argument."
	exit 0
fi
echo "Family ID: "
read fid
echo "Individual ID: "
read id
echo "Paternal ID: "
read pid
echo "Maternal ID: "
read mid
echo "Sex (m/f/u): "
read sexchr
if [[ $sexchr == m* ]]
then
	sex=1
elif [[ $sexchr == f* ]]
then
	sex=2
else
	sex=0
fi
pheno=0
echo "$fid $id $pid $mid $sex $pheno" > $id.tfam
dos2unix $1
sed '/^\#/d' $1 > $id.nocomment
awk '{ if (length($4)==1) print $2,$1,"0",$3,substr($4,1,1),substr($4,1,1); else
    print $2,$1,"0",$3,substr($4,1,1),substr($4,2,1) }' $id.nocomment > $id.tped
plink --tfile $id --out $id --make-bed --missing-genotype - --output-missing-genotype 0

Well that's it! You can easily create a Perl script to do the same but this was faster for me.

This script creates three files: *.bed, *.bim and *.fam, which are the binary format files for Plink. You can then use Plink to merge multiple files, filter SNPs or individuals and do other processing.

UPDATE: A Perl script to do the same:

#!/usr/bin/perl -w
 
$numArgs = $#ARGV + 1;
if ($numArgs < 1) {
        print "23andme raw data filename not provided.\n";
        exit 0;
}
 
$file = $ARGV[0];
 
print "Family ID: ";
$fid = <STDIN>;
chomp $fid;
print "Individual ID: ";
$id = <STDIN>;
chomp $id;
print "Paternal ID: ";
$pid = <STDIN>;
chomp $pid;
print "Maternal ID: ";
$mid = <STDIN>;
chomp $mid;
print "Sex (m/f/u): ";
$sexchr = <STDIN>;
if (lc(substr($sexchr,0,1)) eq "m") {
        $sex = 1; }
elsif (lc(substr($sexchr,0,1)) eq "f") {
        $sex = 2; }
else {
        $sex = 0; }
$pheno = 0;
$tfamname = ">" . $id . ".tfam";
open(TFAM, $tfamname);
print TFAM "$fid $id $pid $mid $sex $pheno";
close TFAM;
 
open(INFILE,$file);
open(TPED,">" . $id . ".tped");
while (<INFILE>) {
        next if /#.*/;
        chomp;
        my($rsid,$chr,$pos,$geno) = split(/\s/);
        if (length($geno)==1) {
                $geno1 = $geno;
                $geno2 = $geno;
        }
        else {
                $geno1 = substr($geno,0,1);
                $geno2 = substr($geno,1,1);
        }
        print TPED "$chr $rsid 0 $pos $geno1 $geno2\n";
}
close TPED;
close INFILE;

That should work on Linux, Microsoft Windows and Mac OS X.