About 5 months after I sent my saliva to 23andMe, I received an email that my exome results were ready. The data were in a large (4.2 GB) encrypted folder that could be opened only after I had downloaded and installed TrueCrypt. Eventually I was able to download and unpack my data. The data consist of 4 files, all labeled with my identifier, “LF1396”, and ending in .bam, .bai, .vcf.gz, and .report.pdf. The .bam file contains the alignments of the Illumina reads to the human reference sequence, the hg19 release. The .bai file is an index of the read alignments. The .vcf.gz file is a gzip-compressed .vcf file (“variant call format,” developed by the 1000 Genomes Project), in the latest version, 4.1. The PDF report is a 17-page summary explaining the file formats, my “exome at a glance” summary statistics, and the filtering scheme used to select 21 variants of interest. The rest of the report describes each of the 21 variants, which were sequentially filtered for high or moderate predicted effect, low frequency (<1%), and location in genes involved in Mendelian disorders.
Figure 1A shows that a little under 4 billion bases align to or near the targeted exons. These on-target and near-target bases map to about 120 million exonic positions. The vast majority of the exonic base calls are identical to the human reference genome.
About 0.1% of the exonic base calls are variant compared to the reference sequence. Figure 1C shows that these variants consist of about 100,000 single-nucleotide polymorphisms (SNPs) and 10,000 insertions/deletions (indels). These numbers are consistent with unrelated humans sharing 99.9% DNA sequence identity.
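As a quick sanity check on that 0.1% figure, the arithmetic can be sketched in a few lines of Python. The inputs are the rounded numbers quoted above (about 100,000 SNPs, 10,000 indels, and 120 million exonic positions), so the result is only approximate.

```python
# Rounded figures quoted from the report; not exact counts.
snps = 100_000
indels = 10_000
exonic_positions = 120_000_000

# Fraction of exonic positions that differ from the reference.
variant_fraction = (snps + indels) / exonic_positions
identity = 1 - variant_fraction

print(f"variant fraction: {variant_fraction:.4%}")
print(f"identity to reference: {identity:.2%}")
```

With these rounded inputs the variant fraction comes out a little under 0.1%, which is in line with the 99.9% identity figure.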
Given over 100,000 total variants, which should I look at first? Which of these are most likely to influence my health or appearance or behavior? Which of these have the most impact on me being me? Although 23andMe specifically stated that no consumer-level interpretation would be provided as part of their pilot exome sequencing project, they do annotate the variant calls in the vcf file.
Figure 2 from the 23andMe exome report shows the distribution of my approximately 110,000 variants categorized according to their predicted impact on gene function.
- High impact variants include gain of premature stop codons (nonsense mutations), frameshifts, splice site alterations, and loss of stop codons. My exome sequence contains 634 of these.
- Moderate impact variants include non-synonymous substitutions (amino acid changes) and codon insertions and deletions (addition or deletion of amino acid residues). My exome sequence contains 11,504 of these.
- Low impact variants include synonymous substitutions (no change in amino acid sequence) or gain of a start codon.
- Unknown impact variants are those “unlikely to affect gene products” – presumably because they occur in non-exonic (intergenic or intronic) sequences.
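The four tiers above amount to a lookup from predicted effect to impact category. Here is a minimal sketch of that idea in Python. The effect labels (`stop_gained`, `frameshift`, and so on) are names I made up for illustration; the actual annotation strings in the vcf file may well be spelled differently.

```python
from collections import Counter

# Hypothetical effect labels, grouped the way the report describes the tiers.
IMPACT_BY_EFFECT = {
    "stop_gained": "HIGH",
    "frameshift": "HIGH",
    "splice_site": "HIGH",
    "stop_lost": "HIGH",
    "non_synonymous": "MODERATE",
    "codon_insertion": "MODERATE",
    "codon_deletion": "MODERATE",
    "synonymous": "LOW",
    "start_gained": "LOW",
}


def impact_of(effect):
    """Return the impact tier for an effect; anything unlisted is UNKNOWN."""
    return IMPACT_BY_EFFECT.get(effect, "UNKNOWN")


def tally_impacts(effects):
    """Count how many variants fall into each impact tier."""
    return Counter(impact_of(effect) for effect in effects)


# Toy input, not real data from my exome.
example = ["stop_gained", "non_synonymous", "non_synonymous", "synonymous", "intron"]
print(tally_impacts(example))
```

Treating every unlisted effect as "UNKNOWN" mirrors the report's catch-all category for variants unlikely to affect gene products.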
Another way to look at these variants is by frequency in the human population. Variants that occur at high frequency are less likely to have serious adverse consequences. Conversely, it’s tempting to think that rare and unique variants may contribute to me being such a unique and rare individual.
Figure 3 shows that about 15% of my exome variants are rare (occur at <1% frequency) or previously unidentified (unique). As more exomes and whole genomes are sequenced, the proportion of “unique” variants will diminish, but the 15% proportion of rare variants is unlikely to shift significantly. After all, you have to figure that most of the common variants have already been identified.
These classifications can be combined to filter the variants, first by predicted effect and then by frequency, to identify the rare variants with high or moderate predicted impact. 23andMe then asked whether any of these filtered variants occur in a list of 592 genes “involved in Mendelian disorders” (Figure 4).
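The sequential filter is easy to express in code. The sketch below is my own reconstruction of the logic, not 23andMe's actual pipeline: the `Variant` class, the gene names, and the toy records are all invented, and I am assuming that previously unidentified variants (no known frequency) are kept along with the rare ones.

```python
from dataclasses import dataclass
from typing import Optional

RARE_THRESHOLD = 0.01  # the report's <1% frequency cutoff


@dataclass
class Variant:
    gene: str
    impact: str                 # "HIGH", "MODERATE", "LOW" or "UNKNOWN"
    frequency: Optional[float]  # population frequency; None if previously unidentified


def filter_variants(variants, mendelian_genes):
    """Apply the three filters in the order the report describes."""
    # 1. Keep only high or moderate predicted impact.
    kept = [v for v in variants if v.impact in {"HIGH", "MODERATE"}]
    # 2. Keep only rare variants (assumption: unknown frequency counts as rare).
    kept = [v for v in kept if v.frequency is None or v.frequency < RARE_THRESHOLD]
    # 3. Keep only variants in the Mendelian disorder gene list.
    return [v for v in kept if v.gene in mendelian_genes]


# Invented toy data; the real gene list has 592 entries.
mendelian_genes = {"GENE_A", "GENE_B"}
variants = [
    Variant("GENE_A", "MODERATE", 0.004),  # passes all three filters
    Variant("GENE_A", "LOW", 0.002),       # fails the impact filter
    Variant("GENE_B", "HIGH", 0.20),       # fails the frequency filter
    Variant("GENE_C", "HIGH", None),       # fails the gene-list filter
]

for variant in filter_variants(variants, mendelian_genes):
    print(variant)
```

Because each step only removes variants, the order of the filters does not change the final list; it only changes how many variants remain at each intermediate stage.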
This filtering scheme resulted in a list of 21 variants. All 21 on my report were predicted to have “moderate” impact, and all were non-synonymous substitutions. But even a cursory look through these 21 amino acid changes reveals that some are more likely to affect protein structure or function than others. For example, some are conservative amino acid changes, in which the variant amino acid has physicochemical properties similar to those of the original amino acid. Examples are L25V (leucine at amino acid position 25 changed to valine; both have hydrophobic side chains) and I929V (again, isoleucine and valine are both hydrophobic). Other changes are potentially more disruptive, because the variant amino acid has very different properties from the original. Examples are R1125W, with arginine (a positively charged side chain) replaced by tryptophan (a large hydrophobic side chain); E158K and E482K, which substitute positively charged lysine for negatively charged glutamic acid; and R150C, which puts cysteine in place of arginine and so removes a positive charge.
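This kind of cursory look can be roughed out in code by comparing the side-chain class of the original and variant residues. The grouping below is one coarse scheme that I chose for illustration (several residues, such as cysteine and tyrosine, are placed differently by different schemes), and it is no substitute for a real predictor of functional effect.

```python
import re

# One coarse grouping of the 20 amino acids by side chain (an assumption).
_GROUPS = {
    "hydrophobic": "AVLIMFW",
    "positive": "KRH",
    "negative": "DE",
    "polar": "STNQY",
    "special": "CGP",
}
SIDE_CHAIN_CLASS = {aa: cls for cls, residues in _GROUPS.items() for aa in residues}


def classify_change(change):
    """Classify a substitution such as 'R1125W' as conservative or not."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", change)
    if match is None:
        raise ValueError(f"not a substitution like 'L25V': {change!r}")
    original, _position, variant = match.groups()
    if original not in SIDE_CHAIN_CLASS or variant not in SIDE_CHAIN_CLASS:
        raise ValueError(f"unknown amino acid code in {change!r}")
    same_class = SIDE_CHAIN_CLASS[original] == SIDE_CHAIN_CLASS[variant]
    return "conservative" if same_class else "potentially disruptive"


for change in ["L25V", "I929V", "R1125W", "E158K", "E482K", "R150C"]:
    print(change, classify_change(change))
```

Under this grouping, L25V and I929V come out as conservative and the other four as potentially disruptive, matching the informal reading above.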
The report does not say whether I am homozygous or heterozygous for any of these 21 variants. I presume that I am heterozygous for all of them (23andMe excluded X and Y chromosome genes). I can check these myself by looking them up in the vcf file (that will be a later post).
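For the curious, here is roughly what that lookup involves. In a VCF file, each sample's genotype is stored in the GT field, where 0 means the reference allele and 1 or higher means a variant allele, so 0/1 is heterozygous and 1/1 is homozygous for the variant. The sketch below assumes a single-sample file with diploid calls; the example line is invented, not taken from my data.

```python
import re


def zygosity(vcf_line):
    """Report zygosity from the GT field of one single-sample VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")    # FORMAT column, e.g. "GT:DP"
    sample_values = fields[9].split(":")  # first (only) sample column
    genotype = sample_values[format_keys.index("GT")]
    alleles = re.split(r"[/|]", genotype)  # "/" unphased, "|" phased
    if "." in alleles:
        return "no call"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous variant"


# Invented example line: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
line = "\t".join(["1", "12345", ".", "G", "A", "50", "PASS", ".", "GT:DP", "0/1:30"])
print(zygosity(line))
```

Only the tab-separated column layout and the GT encoding come from the VCF format itself; everything else here is illustrative.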
This post gives curious readers an idea of what they can expect at this point if they have their exome sequenced by 23andMe. Clearly, this barely scratches the surface of one tiny corner of the exome sequence data. In my next post, I will present some open-source tools for looking at and sifting through the data yourself. In the meantime, I am making my vcf file publicly available here: http://dl.dropbox.com/u/69564734/LF1396.vcf.gz