23andme Conversion to PED

Someone asked about how to convert a 23andme raw data file to the Plink format. I threw together a very simple Unix script to do that and I am sharing it here:

#!/bin/bash
if test -z "$1"
then
	echo "23andme raw data filename not supplied as argument."
	exit 0
fi
echo "Family ID: "
read fid
echo "Individual ID: "
read id
echo "Paternal ID: "
read pid
echo "Maternal ID: "
read mid
echo "Sex (m/f/u): "
read sexchr
if [[ $sexchr == m* ]]
then
	sex=1
elif [[ $sexchr == f* ]]
then
	sex=2
else
	sex=0
fi
pheno=0
echo "$fid $id $pid $mid $sex $pheno" > $id.tfam
dos2unix $1
sed '/^\#/d' $1 > $id.nocomment
awk '{ if (length($4)==1) print $2,$1,"0",$3,substr($4,1,1),substr($4,1,1); else
    print $2,$1,"0",$3,substr($4,1,1),substr($4,2,1) }' $id.nocomment > $id.tped
plink --tfile $id --out $id --make-bed --missing-genotype - --output-missing-genotype 0

Well that's it! You can easily create a Perl script to do the same but this was faster for me.

This script creates three files: *.bed, *.bim and *.fam, which are the binary format files for Plink. You can then use Plink to merge multiple files, filter SNPs or individuals and do other processing.

UPDATE: A Perl script to do the same:

#!/usr/bin/perl -w
 
$numArgs = $#ARGV + 1;
if ($numArgs < 1) {
        print "23andme raw data filename not provided.\n";
        exit 0;
}
 
$file = $ARGV[0];
 
print "Family ID: ";
$fid = <STDIN>;
chomp $fid;
print "Individual ID: ";
$id = <STDIN>;
chomp $id;
print "Paternal ID: ";
$pid = <STDIN>;
chomp $pid;
print "Maternal ID: ";
$mid = <STDIN>;
chomp $mid;
print "Sex (m/f/u): ";
$sexchr = <STDIN>;
if (lc(substr($sexchr,0,1)) eq "m") {
        $sex = 1; }
elsif (lc(substr($sexchr,0,1)) eq "f") {
        $sex = 2; }
else {
        $sex = 0; }
$pheno = 0;
$tfamname = ">" . $id . ".tfam";
open(TFAM, $tfamname);
print TFAM "$fid $id $pid $mid $sex $pheno";
close TFAM;
 
open(INFILE,$file);
open(TPED,">" . $id . ".tped");
while (<INFILE>) {
        next if /#.*/;
        chomp;
        my($rsid,$chr,$pos,$geno) = split(/\s/);
        if (length($geno)==1) {
                $geno1 = $geno;
                $geno2 = $geno;
        }
        else {
                $geno1 = substr($geno,0,1);
                $geno2 = substr($geno,1,1);
        }
        print TPED "$chr $rsid 0 $pos $geno1 $geno2\n";
}
close TPED;
close INFILE;

That should work on Linux, Microsoft Windows and Mac OS X.

13 Comments.

  1. Thanks for posting this! However, when I use my 23andMe file plink complains about my X chromosome which has only 5 columns. It looks like:
    X rs5939319 0 2710157 A
    Any ideas on how to fix this?

  2. Which OS are you running it on?

  3. i have a perl script which filters X, Y & mtDNA. why u want dat anyway?

  4. Hi,

    I'm getting the following error when I run admixture on my raw data pre-processed with your format-conversion code and plink:

    'Error: detected that all genotypes are missing for a SNP locus.
    Please apply quality-control filters to remove such loci.'

    I would like to try filtering my data following the instructions you've given in your post on Jan 29, but could you tell me where I can get the SNP list file for 23andMe?

    Thanks!

    PS: I'm not a geneticist, so I have no clue what most of the technical terms mean!

    • Which of the two scripts are you using, the first or the Perl one?

      Is that an error from Admixture or plink? Can you show me plink's output?

      • Hi,

        I used the Perl script for this run. The given error is from admixture.

        This is how I tried to do the run:

        1. Ran the Perl script to create file 'b.tped'.

        2. Ran plink to create 'b.bed':

        plink --tfile b --out b --make-bed --missing-genotype - --output-missing-genotype 0

        Output:

        ...

        Web-based version check ( --noweb to skip )
        Recent cached web-check found...Problem connecting to web

        Writing this text to log file [ b.log ]
        Analysis started: Sun Feb 27 19:06:16 2011

        Options in effect:
        --tfile b
        --out b
        --make-bed
        --missing-genotype -
        --output-missing-genotype 0

        Reading pedigree information from [ b.tfam ]
        1 individuals read from [ b.tfam ]
        0 individuals with nonmissing phenotypes
        Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
        Missing phenotype value is also -9
        0 cases, 0 controls and 1 missing
        1 males, 0 females, and 0 of unspecified sex
        966977 (of 966977) markers to be included from [ b.tped ]
        Before frequency and genotyping pruning, there are 966977 SNPs
        0 founders and 1 non-founders found
        168 heterozygous haploid genotypes; set to missing
        Writing list of heterozygous haploid genotypes to [ b.hh ]
        966977 SNPs with no founder genotypes observed
        Warning, MAF set to 0 for these SNPs (see --nonfounders)
        Writing list of these SNPs to [ b.nof ]
        Total genotyping rate in remaining individuals is 0.994775
        0 SNPs failed missingness test ( GENO > 1 )
        0 SNPs failed frequency test ( MAF < 0 )
        After frequency and genotyping pruning, there are 966977 SNPs
        After filtering, 0 cases, 0 controls and 1 missing
        After filtering, 1 males, 0 females, and 0 of unspecified sex
        Writing pedigree information to [ b.fam ]
        Writing map (extended format) information to [ b.bim ]
        Writing genotype bitfile to [ b.bed ]
        Using (default) SNP-major mode

        Analysis finished: Sun Feb 27 19:06:28 2011

        3. Ran admixture:

        admixture b.bed 1

        ...
        Random seed: 43
        Point estimation method: Block relaxation algorithm
        Convergence acceleration algorithm: QuasiNewton, 3 secant conditions
        Point estimation will terminate when objective function delta < 0.0001
        Estimation of standard errors disabled; will compute point estimates only.
        Error: detected that all genotypes are missing for a SNP locus.
        Please apply quality-control filters to remove such loci.

        Thanks!

        • Admixture needs a file with multiple individuals, not a single individual.

          Also K is the number of ancestral populations Admixture should find, so setting it to 1 is not useful at all.

          If you want an an admixture analysis, have you sent me your data?

  5. Okay, I'll try out your suggestions. Thanks!

    I haven't sent you my data yet - I was thinking of playing around with admixture, figure out how it works, etc. But I might, in the near future. 🙂

  6. Hi Zack, I know this is an old post but I've ran out of luck and need help. I'm trying to merge my SNPs to the Pan-Asian data. Basically:

    1. Converted 23nMe to bed, bim, bam
    2. Converted Pan-Asian to bed, bim, bam
    3. -extracted the Pan-Asian SNP
    4. Filtered my 23nMe using Pan-Asian.snplist
    5. This is where the problem begins. I tried to merged 23nMe to Pan-Asian.
    ==> plink --bfile PASNP --bmerge 23nMeFil.bed 23nMeFil.bim 23nMeFil.fam --make-bed --out PAnMe

    Error comes outlike this:
    *********************
    Found 479 SNPs that do not match in terms of allele codes
    Might include strand flips, although flipped A/T and C/G SNPs will be undetected)
    Writing problem SNPs to [ PAnGlenn.missnp ]
    479 markers with 3+ alleles present.
    *********************

    6. I tried to --flip option
    ==> plink --bfile 23nMe --flip PAnMe-merge.missnp --make-bed --out 23nMeFlip

    Tried --bmerge again and the same exact error.

    Tried --exclude option and the same exact error.

    Any help would be greatly appreciated. Also, thank you very much for all your blogs. It's a gold mine!

    • It's working now. I tried to filter the PASNP data before combining both and it worked.