23andme Conversion to PED

Posted by Zack on February 4, 2011

Someone asked about how to convert a 23andme raw data file to the Plink format. I threw together a very simple Unix script to do that and I am sharing it here:

#!/bin/bash
if test -z "$1"
then
	echo "23andme raw data filename not supplied as argument."
	exit 0
fi
echo "Family ID: "
read fid
echo "Individual ID: "
read id
echo "Paternal ID: "
read pid
echo "Maternal ID: "
read mid
echo "Sex (m/f/u): "
read sexchr
if [[ $sexchr == m* ]]
then
	sex=1
elif [[ $sexchr == f* ]]
then
	sex=2
else
	sex=0
fi
pheno=0
echo "$fid $id $pid $mid $sex $pheno" > $id.tfam
dos2unix $1
sed '/^\#/d' $1 > $id.nocomment
awk '{ if (length($4)==1) print $2,$1,"0",$3,substr($4,1,1),substr($4,1,1); else
    print $2,$1,"0",$3,substr($4,1,1),substr($4,2,1) }' $id.nocomment > $id.tped
plink --tfile $id --out $id --make-bed --missing-genotype - --output-missing-genotype 0

Well that's it! You can easily create a Perl script to do the same but this was faster for me.

This script creates three files: *.bed, *.bim and *.fam, which are the binary format files for Plink. You can then use Plink to merge multiple files, filter SNPs or individuals and do other processing.

UPDATE: A Perl script to do the same:

#!/usr/bin/perl -w
 
$numArgs = $#ARGV + 1;
if ($numArgs < 1) {
        print "23andme raw data filename not provided.\n";
        exit 0;
}
 
$file = $ARGV[0];
 
print "Family ID: ";
$fid = <STDIN>;
chomp $fid;
print "Individual ID: ";
$id = <STDIN>;
chomp $id;
print "Paternal ID: ";
$pid = <STDIN>;
chomp $pid;
print "Maternal ID: ";
$mid = <STDIN>;
chomp $mid;
print "Sex (m/f/u): ";
$sexchr = <STDIN>;
if (lc(substr($sexchr,0,1)) eq "m") {
        $sex = 1; }
elsif (lc(substr($sexchr,0,1)) eq "f") {
        $sex = 2; }
else {
        $sex = 0; }
$pheno = 0;
$tfamname = ">" . $id . ".tfam";
open(TFAM, $tfamname);
print TFAM "$fid $id $pid $mid $sex $pheno";
close TFAM;
 
open(INFILE,$file);
open(TPED,">" . $id . ".tped");
while (<INFILE>) {
        next if /#.*/;
        chomp;
        my($rsid,$chr,$pos,$geno) = split(/\s/);
        if (length($geno)==1) {
                $geno1 = $geno;
                $geno2 = $geno;
        }
        else {
                $geno1 = substr($geno,0,1);
                $geno2 = substr($geno,1,1);
        }
        print TPED "$chr $rsid 0 $pos $geno1 $geno2\n";
}
close TPED;
close INFILE;

That should work on Linux, Microsoft Windows and Mac OS X.

Code23andme, conversion, plink

← San and Pygmy

Changes due to San/Pygmy Removal →

13 Comments.

Aaron February 4, 2011 at 8:04 pm

Thanks for posting this! However, when I use my 23andMe file plink complains about my X chromosome which has only 5 columns. It looks like:
X rs5939319 0 2710157 A
Any ideas on how to fix this?
Zack February 4, 2011 at 10:23 pm

Which OS are you running it on?
- Aaron February 5, 2011 at 5:24 pm
  
  OS X 10.6
  - Zack February 5, 2011 at 6:54 pm
    
    Aaron: The different line ending characters of Unix, Mac OS and DOS/Windows trip up length($4) in awk.
    
    Once my kid's asleep, I can quickly throw together a Perl script which shouldn't have this problem.
    - Aaron February 7, 2011 at 5:30 pm
      
      Thanks!
razib February 5, 2011 at 6:15 pm

i have a perl script which filters X, Y & mtDNA. why u want dat anyway?
Jayanth February 26, 2011 at 7:08 pm

Hi,

I'm getting the following error when I run admixture on my raw data pre-processed with your format-conversion code and plink:

'Error: detected that all genotypes are missing for a SNP locus.
Please apply quality-control filters to remove such loci.'

I would like to try filtering my data following the instructions you've given in your post on Jan 29, but could you tell me where I can get the SNP list file for 23andMe?

Thanks!

PS: I'm not a geneticist, so I have no clue what most of the technical terms mean!
- Zack February 26, 2011 at 10:41 pm
  
  Which of the two scripts are you using, the first or the Perl one?
  
  Is that an error from Admixture or plink? Can you show me plink's output?
  - Jayanth February 27, 2011 at 7:11 pm
    
    Hi,
    
    I used the Perl script for this run. The given error is from admixture.
    
    This is how I tried to do the run:
    
    1. Ran the Perl script to create file 'b.tped'.
    
    2. Ran plink to create 'b.bed':
    
    plink --tfile b --out b --make-bed --missing-genotype - --output-missing-genotype 0
    
    Output:
    
    ...
    
    Web-based version check ( --noweb to skip )
    Recent cached web-check found...Problem connecting to web
    
    Writing this text to log file [ b.log ]
    Analysis started: Sun Feb 27 19:06:16 2011
    
    Options in effect:
    --tfile b
    --out b
    --make-bed
    --missing-genotype -
    --output-missing-genotype 0
    
    Reading pedigree information from [ b.tfam ]
    1 individuals read from [ b.tfam ]
    0 individuals with nonmissing phenotypes
    Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
    Missing phenotype value is also -9
    0 cases, 0 controls and 1 missing
    1 males, 0 females, and 0 of unspecified sex
    966977 (of 966977) markers to be included from [ b.tped ]
    Before frequency and genotyping pruning, there are 966977 SNPs
    0 founders and 1 non-founders found
    168 heterozygous haploid genotypes; set to missing
    Writing list of heterozygous haploid genotypes to [ b.hh ]
    966977 SNPs with no founder genotypes observed
    Warning, MAF set to 0 for these SNPs (see --nonfounders)
    Writing list of these SNPs to [ b.nof ]
    Total genotyping rate in remaining individuals is 0.994775
    0 SNPs failed missingness test ( GENO > 1 )
    0 SNPs failed frequency test ( MAF < 0 )
    After frequency and genotyping pruning, there are 966977 SNPs
    After filtering, 0 cases, 0 controls and 1 missing
    After filtering, 1 males, 0 females, and 0 of unspecified sex
    Writing pedigree information to [ b.fam ]
    Writing map (extended format) information to [ b.bim ]
    Writing genotype bitfile to [ b.bed ]
    Using (default) SNP-major mode
    
    Analysis finished: Sun Feb 27 19:06:28 2011
    
    3. Ran admixture:
    
    admixture b.bed 1
    
    ...
    Random seed: 43
    Point estimation method: Block relaxation algorithm
    Convergence acceleration algorithm: QuasiNewton, 3 secant conditions
    Point estimation will terminate when objective function delta < 0.0001
    Estimation of standard errors disabled; will compute point estimates only.
    Error: detected that all genotypes are missing for a SNP locus.
    Please apply quality-control filters to remove such loci.
    
    Thanks!
    - Zack February 27, 2011 at 7:32 pm
      
      Admixture needs a file with multiple individuals, not a single individual.
      
      Also K is the number of ancestral populations Admixture should find, so setting it to 1 is not useful at all.
      
      If you want an an admixture analysis, have you sent me your data?
Jayanth March 3, 2011 at 8:56 am

Okay, I'll try out your suggestions. Thanks!

I haven't sent you my data yet - I was thinking of playing around with admixture, figure out how it works, etc. But I might, in the near future. 🙂
manglalayag June 7, 2013 at 12:34 am

Hi Zack, I know this is an old post but I've ran out of luck and need help. I'm trying to merge my SNPs to the Pan-Asian data. Basically:

1. Converted 23nMe to bed, bim, bam
2. Converted Pan-Asian to bed, bim, bam
3. -extracted the Pan-Asian SNP
4. Filtered my 23nMe using Pan-Asian.snplist
5. This is where the problem begins. I tried to merged 23nMe to Pan-Asian.
==> plink --bfile PASNP --bmerge 23nMeFil.bed 23nMeFil.bim 23nMeFil.fam --make-bed --out PAnMe

Error comes outlike this:
*********************
Found 479 SNPs that do not match in terms of allele codes
Might include strand flips, although flipped A/T and C/G SNPs will be undetected)
Writing problem SNPs to [ PAnGlenn.missnp ]
479 markers with 3+ alleles present.
*********************

6. I tried to --flip option
==> plink --bfile 23nMe --flip PAnMe-merge.missnp --make-bed --out 23nMeFlip

Tried --bmerge again and the same exact error.

Tried --exclude option and the same exact error.

Any help would be greatly appreciated. Also, thank you very much for all your blogs. It's a gold mine!
- manglalayag June 7, 2013 at 2:57 pm
  
  It's working now. I tried to filter the PASNP data before combining both and it worked.

Harappa Ancestry Project

Genetics and South Asia