A first look at my exome variants from 23andMe

About 5 months after I sent my saliva to 23andMe, I received an email that my exome results were ready. The data were in a large (4.2 GB) encrypted folder that could be opened only after I had downloaded and installed TrueCrypt. Eventually I was able to download and unpack my data. They consist of 4 files, all labeled with my identifier, “LF1396”, and ending in .bam, .bai, .vcf.gz, and .report.pdf. The .bam file contains the alignments of the Illumina reads to the human reference sequence, the hg19 release. The .bai file is an index of the read alignments. The .vcf.gz is a gzip-compressed .vcf file (“variant call format”, developed by the 1000 Genomes Project), in the latest version, 4.1. And the pdf report is a 17-page summary explaining the file formats, my “exome at a glance” summary statistics, and a description of the filtering scheme used to select 21 variants of interest. The rest of the report describes each of the 21 variants, which were sequentially filtered for high or moderate predicted effect, low frequency (<1%), and location in genes involved in Mendelian disorders.

Figure 1. Bases sequenced and exome coverage. A: Number of bases sequenced; top line indicates total coverage of 117X. B: Number of called bases in exome. Small red sliver indicates variants from reference genome (hg19).

Figure 1A shows that a little under 4 billion bases align to or near the targeted exons. These on-target and near-target bases map to about 120 million exonic positions. The vast majority of the exonic base calls are identical to the human reference genome.

Figure 1C: Variant calls listed in the vcf file.

About 0.1% of the exonic base calls are variant compared to the reference sequence. Figure 1C shows that these variants consist of about 100,000 single-nucleotide polymorphisms (SNPs) and 10,000 insertions/deletions (indels). These numbers are consistent with unrelated humans sharing 99.9% DNA sequence identity.
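
The 0.1% figure can be checked with a quick back-of-the-envelope calculation. This is only a sketch using the rounded numbers quoted above, not counts read from the vcf file, and it assumes each indel is counted as a single variant position.

```python
# Rounded figures quoted in the post (approximate, not read from the VCF).
snps = 100_000
indels = 10_000
exonic_positions = 120_000_000

# Assumption: each indel counts as one variant position.
variant_fraction = (snps + indels) / exonic_positions
print(f"variant fraction: {variant_fraction:.4%}")  # variant fraction: 0.0917%
```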

Given over 100,000 total variants, which should I look at first? Which of these are most likely to influence my health or appearance or behavior? Which of these have the most impact on me being me? Although 23andMe specifically stated that no consumer-level interpretation would be provided as part of their pilot exome sequencing project, they do provide annotation of the variant calls, in the vcf file.

Figure 2. Classification of variants by predicted impact on gene function.

Figure 2 from the 23andMe exome report shows the distribution of my approximately 110,000 variants categorized according to their predicted impact on gene function.

  • High impact variants include gain of premature stop codons (nonsense mutations), frameshifts, splice site alterations, and loss of stop codons. My exome sequence contains 634 of these.
  • Moderate impact variants include non-synonymous substitutions (amino acid changes) and codon insertions and deletions (addition or deletion of amino acid residues). My exome sequence contains 11,504 of these.
  • Low impact variants include synonymous substitutions (no change in amino acid sequence) or gain of a start codon.
  • Unknown impact variants are those “unlikely to affect gene products” – presumably because they occur in non-exonic (intergenic or intronic) sequences.
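
The four categories above could in principle be tallied straight from the vcf file. The following is a minimal Python sketch, not 23andMe's actual procedure. It assumes the INFO column carries a legacy snpEff annotation of the form `EFF=Effect(Impact|...)`, with impact levels HIGH, MODERATE, LOW and MODIFIER, and that MODIFIER corresponds to the report's “unknown” category; the function names are my own.

```python
import re
from collections import Counter

# Assumed legacy snpEff layout inside INFO: EFF=Effect(Impact|...),Effect(Impact|...)
EFF_ENTRY = re.compile(r"([A-Za-z0-9_+&-]+)\(([A-Z]+)\|")
IMPACT_RANK = {"HIGH": 3, "MODERATE": 2, "LOW": 1, "MODIFIER": 0}


def worst_impact(info):
    """Return the most severe impact found in a VCF INFO string, or None."""
    for field in info.split(";"):
        if field.startswith("EFF="):
            impacts = [m.group(2) for m in EFF_ENTRY.finditer(field[4:])]
            impacts = [i for i in impacts if i in IMPACT_RANK]
            if impacts:
                return max(impacts, key=IMPACT_RANK.get)
    return None


def tally_impacts(vcf_lines):
    """Count variants by their most severe predicted impact."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):          # skip header lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:                 # INFO is the 8th VCF column
            continue
        counts[worst_impact(cols[7]) or "UNANNOTATED"] += 1
    return counts
```

To run it on a compressed file, something like `with gzip.open("LF1396.vcf.gz", "rt") as fh: print(tally_impacts(fh))` should work after `import gzip`. The totals may not match the report exactly if 23andMe counted each variant differently (for example, per transcript rather than by worst effect).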

Another way to look at these variants is by frequency in the human population. Variants that occur at high frequency are less likely to have serious adverse consequences. Conversely, it’s tempting to think that rare and unique variants may contribute to me being such a unique and rare individual.

Figure 3. Variant frequencies.

Figure 3 shows that about 15% of my exome variants are rare (occur at <1% frequency) or previously unidentified (unique). As more exomes and whole genomes are sequenced, the proportion of “unique” variants will diminish, but the 15% proportion of rare variants is unlikely to shift significantly. After all, you have to figure that most of the common variants have already been identified.

These classifications can be combined to filter the variants, first by predicted effect, then by frequency, to identify those variants with high or moderate predicted impact, that are rare. Then 23andMe asked whether any of these filtered variants occur among a list of 592 genes “involved in Mendelian disorders” (Figure 4).
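
The sequence of filters just described can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than 23andMe's implementation: the `Variant` record, the `filter_variants` function and the gene names in the test data are hypothetical, and I assume that previously unidentified variants (no known frequency) pass the rarity filter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    gene: str
    impact: str                # "HIGH", "MODERATE", "LOW" or "MODIFIER"
    pop_freq: Optional[float]  # None = not previously observed


def filter_variants(variants, mendelian_genes, max_freq=0.01):
    """Keep high/moderate-impact, rare variants in the given gene set."""
    kept = []
    for v in variants:
        if v.impact not in ("HIGH", "MODERATE"):
            continue                      # step 1: predicted effect
        if v.pop_freq is not None and v.pop_freq >= max_freq:
            continue                      # step 2: frequency below 1%
        if v.gene not in mendelian_genes:
            continue                      # step 3: Mendelian gene list
        kept.append(v)
    return kept
```

Because the three tests are independent, the order only affects how many variants survive each intermediate step, not the final list.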

Figure 4. Variant filtering process.

This filtering scheme resulted in a list of 21 variants. All 21 on my report were predicted to have “moderate” impact, and all were non-synonymous substitutions. But even a cursory look through these 21 amino acid changes reveals that some are more likely to affect protein structure or function than others. For example, some are conservative amino acid changes, where the variant amino acid has physico-chemical properties similar to those of the original amino acid. Examples are L25V (leucine at amino acid position 25 changed to valine; both have hydrophobic side chains) and I929V (again, isoleucine and valine are both hydrophobic). Other changes are potentially more disruptive, where the variant amino acid has very different properties from the original. Examples are R1125W, with arginine (a positively charged side chain) replaced by tryptophan (large hydrophobic side chain); E158K and E482K, which substitute positively charged lysine for a negatively charged glutamic acid; and R150C, which puts cysteine in place of arginine.
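
The distinction between conservative and non-conservative changes can be made mechanical with a very crude rule: call a substitution conservative if both residues fall in the same side-chain class. The sketch below assumes one coarse, textbook-style grouping of the 20 amino acids (other groupings place glycine, proline, cysteine and histidine differently), and `classify_substitution` is a name I made up. It ignores side-chain size and structural context, so it is a first pass, not a prediction of functional effect.

```python
import re

# One coarse grouping of side chains (an assumption; other schemes differ).
SIDE_CHAIN_CLASS = {}
for residues, label in [
    ("AVLIMFWPG", "nonpolar"),
    ("STNQYC", "polar"),
    ("KRH", "positive"),
    ("DE", "negative"),
]:
    for aa in residues:
        SIDE_CHAIN_CLASS[aa] = label

# Matches notation such as R1125W: reference residue, position, variant residue.
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def classify_substitution(change):
    """Classify a change such as 'R1125W' by side-chain class."""
    match = SUBSTITUTION.match(change)
    if match is None:
        raise ValueError(f"not a simple substitution: {change!r}")
    ref, alt = match.group(1), match.group(3)
    if ref not in SIDE_CHAIN_CLASS or alt not in SIDE_CHAIN_CLASS:
        raise ValueError(f"unknown amino acid code in {change!r}")
    same = SIDE_CHAIN_CLASS[ref] == SIDE_CHAIN_CLASS[alt]
    return "conservative" if same else "non-conservative"


for change in ["L25V", "I929V", "R1125W", "E158K", "E482K", "R150C"]:
    print(change, classify_substitution(change))
```

Under this grouping the first two examples come out as conservative and the other four as non-conservative, which matches the reasoning above.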

The report does not say whether I am homozygous or heterozygous for any of these 21 variants. I presume that I am heterozygous for all of them (23andMe excluded X and Y chromosome genes). I can check these myself by looking them up in the vcf file (that will be a later post).
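
For readers who want to do this check themselves, here is a minimal Python sketch of reading zygosity from a single VCF record. It relies only on the standard VCF convention that the FORMAT column names the per-sample fields and that `GT` holds allele indices separated by `/` or `|` (0 is the reference allele, `.` is a missing call). The function name and the example values in the comment are illustrative, not taken from my file.

```python
def zygosity(format_col, sample_col):
    """Interpret the GT field of one sample column in a VCF record."""
    fields = dict(zip(format_col.split(":"), sample_col.split(":")))
    gt = fields.get("GT")
    if gt is None:
        return "unknown"                  # record carries no genotype
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "no-call"
    if len(set(alleles)) > 1:
        return "heterozygous"
    if alleles[0] == "0":
        return "homozygous reference"
    return "homozygous variant"


# Illustrative record columns (made-up values): FORMAT, then the sample.
print(zygosity("GT:AD:DP:GQ:PL", "0/1:10,12:22:99:300,0,280"))  # heterozygous
```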

This post then gives curious readers what they can expect at this point if they have their exome sequenced by 23andMe. Clearly, this barely scrapes the surface of one tiny corner of the exome sequence data. In my next post, I will present some open-source tools for looking at and sifting through the data yourself. In the meantime, I am making my vcf file publicly available here: http://dl.dropbox.com/u/69564734/LF1396.vcf.gz


About jchoigt

I'm an Associate Professor in the School of Biology at Georgia Tech, and Faculty Coordinator of the Professional MS Bioinformatics degree program.
This entry was posted in human genetics.

23 Responses to A first look at my exome variants from 23andMe

  1. Pingback: Quora

  2. Ben Darbro says:

    Would you be willing to share your BAM file?

  3. MSamuels says:

    Very interesting analysis, and brave of you to post these results and make the vcf files public. I am a geneticist working with lots of exome datasets these days. The number you cite of over 5000 novel variants is high; either their software is overcalling due to sequencing or alignment errors, or possibly they are including lots of neighboring intronic sequence – we usually include about 5 bases to look for splice site mutations. Also, we find it very important to subtract against other unrelated exomes to deal with intransigent alignment problems due to lots of duplications throughout the human genome, many of which include parts of genes. Good luck making sense of this!

    • jchoigt says:

      I think the majority of these variants are outside the exons, in the neighboring intronic sequences. My impression is that they aligned all the reads and called the variants after filtering for alignment quality, but did not restrict their variant calls to just the protein coding sequences. Thus many variants in the vcf file are intronic.

  4. Jason Bobe says:

    We would love to have you join the PGP, especially if you would like to donate your exome sequence to our public repository of genomes. We have several folks who have donated their 23andMe exomes already. See for example:


    You may sign-up here:

    Let me know if you have questions!

    Best wishes,

    • jchoigt says:

      I actually started the enrollment process over a year ago, and paused in the middle of the consent form to discuss with my family. Just did not get back to complete it. I’ll be happy to donate my exome sequence.

  5. Paul G. says:


    I’d be interested in even just having the bam header info, so we can see how the alignment was done. Could you post the contents of “samtools view -H bamfile”?


    Paul G.

  6. Angie Hinrichs says:

    Thanks for sharing! One cool thing about the VCF file is that its headers include some info about how it was created, e.g. variant calls by GATK (http://www.broadinstitute.org/gsa/wiki/index.php/Home_Page) and predicted effects by snpEff (http://snpeff.sourceforge.net/). Both are open software packages; GATK development seems heavily intertwined with the 1000 Genomes project (http://1000genomes.org).

  7. Deniz Kural says:


    It looks like through having more exome data we will all learn a lot about LOF variants w/o catastrophic consequences ( http://www.sciencemag.org/content/335/6070/823 ), I’m waiting patiently for my own exome results from 23&Me! 🙂

    We’d love to help analyze the raw data further – we can align with different aligners, use other variant callers… and you can also share the results with others (who can also analyze it). I’d be happy to add you to our early access users: http://www.sbgenomics.com

    We’d also love to host your BAM file & other files. In a few months; one will be able to publish them for others without a login wall to view – but for now happy to invite Ben and any others who’d like to have access & download it — you can invite anyone you’d like as well.

    Best Wishes,

    • jchoigt says:

      Hi Deniz,
      I just went to the site and asked for the invitation. No objections to you hosting my bam file.

  8. You do have X, Y and full MT results hidden in your BAM file.
    samtools view -h LFxxxxx.bam MT > LFxxxxx.chrMT.sam
    samtools view -h LFxxxxx.bam Y > LFxxxxx.chrY.sam
    samtools view -bS LFxxxxx.chrMT.sam > LFxxxxx.chrMT.bam
    samtools view -bS LFxxxxx.chrY.sam > LFxxxxx.chrY.bam

    I am also curious about the distribution of reads (depth) that are present in your results. There are some wide differences in read distributions between individuals.

    Some of the individuals who have been mining the 1000 Genomes data set for new y-SNPs would like to review the y-chromosome data from the exome results. The exome results did include my surname private y-SNP L188. There were also at least 2 novel y-SNPs identified in the first look at my data.

    Can you extract out the Y information and post it? I do have a spot where we can transfer larger files between us. E-mail me……

  9. m86 says:

    Just FYI, I added a link to this page in my personal 23andMe dataset spreadsheet ( http://bit.ly/cNDHf3 ) which is including public exome datasets.

  10. We have a tool for ranking variants from exomes based on lots of functional annotation, including HGMD disease mutations. Is it OK to use your data to demo this? If you are interested in the results, just contact me by mail or via @frankschacherer; I’d be happy to share them.

    • jchoigt says:

      Hi Frank,
      The more the merrier. I’m absolutely happy for you to use my data. Please do let me know if you find anything interesting.

  11. Hi,

    Thanks for posting this. We’d just like to point out that the pdf report does include whether or not you are homozygous for a variant on the first line of the variant report. If you are having any trouble let us know at exome@23andme.com.

    • jchoigt says:

      Yes, I did realize that. At first I thought that was just showing how the mutation differed from the reference, but I now see that it is my heterozygous genotype.

  12. Pingback: DNA Sequencing and Personal Genomics Case Study for Intro Biology | Jung's Biology Blog

  13. Pingback: My 23andMe Exome Trios Arrived: Sneak Peek | Our 2 SNPs…®

  14. Hi,
    May I post a translation of your blog posts? If you would rather I didn’t, I will delete my blog post.
