Clinical Exome Sequencing: Hope for Rare Genetic Diseases?

Biomedical researchers have often been insulated from the patients they’re studying. At such a distance, it’s sometimes hard to appreciate the day-to-day struggle of people with genetic diseases. That’s why I like this dual perspective: A study in the Journal of Medical Genetics describes the exome sequencing of 12 patients with rare genetic conditions; accompanying it is Hunting Down My Son’s Killer, a blog post by the father of one of the patients chronicling his family’s struggle in understanding the boy’s disease.

You should read both. On one hand, we have the study, in which David Goldstein and colleagues at Duke University describe the exome sequencing of 12 families with unknown but likely genetic diseases. These families met two or more of the following criteria:

  1. Unexplained intellectual disability and/or developmental delay;
  2. One major congenital anomaly;
  3. 2–3 minor congenital anomalies;
  4. Facial dysmorphisms.

Additionally, the researchers required that the proband and both unaffected parents were available for exome sequencing, that previous genetic testing (Affy 6.0 SNP array) had been normal, and that there were no teratogenic or accidental events in the proband’s life that might be causal.

Their analysis also made use of 830 presumably-undiseased control samples that were enrolled at Duke for human genome variation studies.

Variant Identification and Filtering

With exome sequence data in hand, Dr. Goldstein and colleagues began the search for potential disease-causing variants. They were looking in particular for:

  1. Putative recessive or X-linked variants that were homozygous in the proband but never homozygous in the parents or any control.
  2. Putative de novo variants heterozygous in the proband but absent from parents.
  3. Compound protein-altering (missense/nonsense/frameshift) heterozygotes in a single gene that did not occur together in the parents or controls.
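The three searches above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `TrioVariant` record, its field names, and the 0/1/2 genotype coding are all assumptions made for the example. It also simplifies two things: X-linked (hemizygous) calls are not handled, and the compound-heterozygote check only asks that one variant come from each parent, without the control lookup described in the study.

```python
from dataclasses import dataclass

# Genotypes coded as the number of alternate alleles (an assumption of this sketch).
HOM_REF, HET, HOM_ALT = 0, 1, 2


@dataclass(frozen=True)
class TrioVariant:
    """Hypothetical record for one variant call in a proband/mother/father trio."""
    gene: str
    proband: int
    mother: int
    father: int
    control_hom_alt: int      # number of controls homozygous for the variant
    protein_altering: bool    # missense, nonsense, or frameshift


def is_recessive_candidate(v):
    """Homozygous in the proband, never homozygous in parents or controls."""
    return (v.proband == HOM_ALT and v.mother != HOM_ALT
            and v.father != HOM_ALT and v.control_hom_alt == 0)


def is_de_novo_candidate(v):
    """Heterozygous in the proband and absent from both parents."""
    return v.proband == HET and v.mother == HOM_REF and v.father == HOM_REF


def compound_het_genes(variants):
    """Genes with one protein-altering het variant inherited from each parent."""
    from_mother, from_father = set(), set()
    for v in variants:
        if not (v.protein_altering and v.proband == HET):
            continue
        if v.mother == HET and v.father == HOM_REF:
            from_mother.add(v.gene)
        elif v.father == HET and v.mother == HOM_REF:
            from_father.add(v.gene)
    return from_mother & from_father
```

In practice each candidate list would still need manual review and independent confirmation, as the study did in a CLIA setting.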

The authors generated, and confirmed in a CLIA setting, the likely genetic diagnosis for 6 of 12 patients studied.

One Family’s Struggle

On the other side, the patients’ side, is a riveting tale of parents whose son began showing signs of developmental delay at 3 months. A suite of other symptoms followed, accompanied by various hospitalizations, mis-diagnoses, and failed treatments. Several times, the father writes, someone would ask him and his wife “Are you two sure you’re not related?” He’s from an Ohio farm family and she’s from Puerto Rico. So, no. (And, “Duh!”).

Alternatively, some doctors took the wife aside and asked, “Is it possible he’s not the actual father?” Wow, great bedside manner.

At last, the exome sequencing study revealed a likely genetic cause for the boy’s disorder, and suggested a potential treatment. A good thing on both counts, as the mother was pregnant again. They had a daughter who, fortunately, did not carry the mutation.

This is just one family’s chronicle – there are eleven other families with similar struggles whose stories we haven’t heard, and six of them, sadly, have not yet found the answer in exome sequencing.

Need AC, Shashi V, Hitomi Y, Schoch K, Shianna KV, McDonald MT, Meisler MH, & Goldstein DB (2012). Clinical application of exome sequencing in undiagnosed genetic conditions. Journal of medical genetics PMID: 22581936

Grants and Jobs at the Genome Institute

One of the world leaders in genomics and next-generation sequencing is ramping up. The Genome Institute at Washington University has had a busy year already. In January, TGI and its collaborators published three Nature papers that came online the same day: one on the genetic basis of an aggressive pediatric leukemia, another on genomic and epigenetic analyses of childhood retinoblastoma, and a third on tumor evolution in relapsed AML. In February, the funding period began for “A Turnkey System for High-Throughput Variant Discovery and Interpretation”, a U01 grant under which TGI will improve and share its genome analysis tools with the research community. And now, there are job openings in four different groups at the Genome Institute.

Informatics Grant
Staff Scientist in Human Genetics
Postdoc Research Associate in Medical Genomics
Postdoc Research Associate in Parasite Genomics
Programmer/Analyst in Analysis Pipeline Group

Next-Gen Sequencing Informatics Grant

TGI won a four-year, $805,000 grant to develop its analysis pipelines into A Turnkey System for High-Throughput Variant Discovery and Interpretation (NIH project link), one of several informatics grants reported by GenomeWeb’s BioInform last month. TGI has spent years developing a computational framework and innovative tools for NGS analysis, with a particular emphasis on variant discovery and annotation. The goal of the project is to make these tools available to the wider community, both individually and as part of a complete informatics solution from alignment to detection to interpretation. This “turnkey system” will be flexible and powerful enough to be adopted by experienced laboratories, and user-friendly enough to give push-button analysis capabilities to groups with little bioinformatics expertise.

Bottom line, anyone will be able to run “Washington University Genome Institute” analysis on their sequencing datasets with little bioinformatics expertise.

Staff Scientist in Human Genetics

TGI’s Human Genetics Group is looking for a statistical geneticist or biostatistician to work with a dedicated team of researchers investigating inherited human diseases. As you might expect for a major genome center, there are many projects from small family-based studies to large trio studies to massive studies of complex disease involving thousands of unrelated samples. We are looking for someone to help evaluate, design, and execute statistical analysis plans for sequencing projects.

For more details, see Job Posting 23423.

Postdoc Research Associate in Medical Genomics

The Medical Genomics Group is looking for a postdoc research associate in the area of cancer genomics. There are many such projects here, ranging from studies of a single tumor to large-scale studies involving thousands of samples. The ultimate goal is to translate discoveries enabled by next-gen sequencing into medically actionable information. As such, an individual who can assist in the development, implementation, and application of algorithms to characterize and interpret sequence variation in the context of cancer is needed.

For more information, see Job Posting 23424.

Postdoc Research Associate in Parasite Genomics

There is also a postdoc research associate position open in parasite genomics, as part of an established and successful research group focused on integrating ‘omics’ approaches aimed at understanding organisms at a molecular level. This group is working on comparative analysis of parasitic helminths to identify conserved and/or taxonomically restricted proteins that may prove useful as antiparasitic drugs. The responsibilities will include design, development, testing, and implementation of software applications for comparative analyses.

For more information, see Job Posting 23361.

Programmer/Analyst in Analysis Pipeline Group

The analysis pipeline group has a position open for a software engineer in data management & compression. The job will entail working on a team of 20 software engineers on next-gen sequence analysis pipelines, focusing on utilizing the information management capabilities of the analysis system to migrate data appropriately between different tiers of storage and eliminate data duplication. Responsibilities will include development, integration, and support of software tools/pipelines/databases in collaboration with data analysts.

For details, see Job Posting 23269.


The genetic architecture of triple-negative breast cancer

Triple-negative breast cancer (TNBC), a tumor type defined by the absence of estrogen receptor and progesterone receptor expression and of HER2 (ERBB2) amplification, accounts for 16% of breast cancers. This clinically defined tumor type overlaps substantially but not completely with “basal-like” breast cancer, a classification based upon gene expression signature. It is a highly heterogeneous disease with a higher risk of recurrence in the absence of systemic therapy.

This month in Nature, researchers from BC Cancer Agency have characterized the landscape of genomic aberrations in 104 TNBC cases with a combination of whole-genome sequencing, exome sequencing, RNA-seq, and high-density SNP arrays. Using ultra-deep targeted resequencing, the authors validated ~2,500 somatic mutations and characterized their frequencies among heterogeneous clonal populations in each tumor.

[Figure: Genomic architecture (Basal TNBC), Shah et al, Nature 2012]

This is a complex study that’s hard to digest (the supplemental material had over 140 pages – come on, is that really necessary?) so I’ll do my best to break it down. I believe there are three highlights: frequent gene alterations in TNBC, under-representation of mutations in mRNA sequences, and a continuous distribution of mutation frequencies within tumors.

Genetic Alterations in Triple-Negative Breast Cancer

The most frequently mutated gene should be familiar to you: TP53, which harbored validated somatic mutations in 62% of basal and 43% of non-basal TNBCs. Other patterns of alteration were as follows:

  • Significantly mutated genes included TP53, PIK3CA, RB1, PTEN, MYO3A, and GH1. Here, significant means that the gene harbored more mutations than expected from background random mutation processes. The larger the gene, the more likely it is to catch a random mutation. That’s why USH2A, which was mutated in 9.2% of cases, was not significant (it’s a large gene).
  • Recurrent but not statistically significant mutations were observed in the nesprin genes (SYNE1/SYNE2), BRCA2, BRAF, NRAS, ERBB2, and ERBB3.
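To see why gene size matters for significance, here is a rough sketch of a background-rate test. It is not the method used in the paper: it assumes a single uniform mutation rate per base pair (real significance tests also model sequence context, coverage, and more), and every number you feed it would be your own assumption. The idea is simply that the expected count scales with gene length, so the same number of observed mutations is far less surprising in a large gene.

```python
import math


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed upward from k for numerical stability."""
    if lam <= 0:
        raise ValueError("expected count must be positive")
    if k <= 0:
        return 1.0
    total, i = 0.0, k
    while True:
        # Poisson probability of exactly i events, computed in log space.
        term = math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        total += term
        if i > lam and term <= total * 1e-12:
            break
        i += 1
    return min(total, 1.0)


def gene_p_value(n_mutations, coding_length_bp, n_tumors, rate_per_bp):
    """Chance of seeing at least n_mutations under a uniform background rate."""
    expected = coding_length_bp * n_tumors * rate_per_bp
    return poisson_tail(n_mutations, expected)
```

With the same observed count, cohort size, and background rate, a gene with a 1,200 bp coding sequence gets a much smaller p-value than one with 15,000 bp, which is the intuition behind a frequently mutated but very large gene failing to reach significance.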

Somatic Copy Number Alterations (CNAs)

The patterns of somatic copy number changes, as assessed by high-density SNP array, suggest widespread segmental CNA instability:
[Figure: Somatic copy number in TNBC, Shah et al, Nature 2012 (Supp Fig 3)]
These results are largely consistent with a separate study in the same issue that examined CNAs in 2,000 breast tumors. I’ll have to cover that another time. Some of the known CNA patterns evident above include gains of chromosomes 1q, 3q, and 8q (where MYC is located). Note the frequent deletions across many chromosome arms or entire chromosomes, many of which contain tumor suppressor genes (e.g. TP53 on chromosome 17).

Expression of Somatic Mutations

This group at BC Cancer Agency is a leader in transcriptome sequencing (RNA-Seq), which is a key component of this study. Strikingly, the authors found that just 36% of validated somatic mutations discovered in genomic DNA were present in mRNA transcripts. This number is a little deceptive, and I’ll tell you why. Supplementary figure 2 offers a summary of the expression of validated mutations across all cases with RNA-seq data:
[Figure: Expression of somatic mutations in TNBC. Mutation expression patterns (Shah et al, Nature 2012)]
Notably, 23% of somatic mutations occur in genes with no observed transcripts: there’s no allelic effect of the mutation; the genes just aren’t expressed in any form. That leaves:
  • 40.56% of mutations where only the wild-type allele is expressed. Here, it’s possible that the mutation alters mRNA expression or stability and thus only the non-mutated allele is seen.
  • 31.48% where both alleles are expressed. The mutation may not affect expression, but it could still alter the translation or function of the encoded protein.
  • 5% where only the mutant allele is expressed. This could be due to genomic loss of the wild-type allele (LOH), mutations on the X-chromosome (one copy of which is inactivated), or even a gain-of-function mutation causing aberrant gene expression.
Bottom line, just over one-third of somatic mutations in the genome are present in the transcriptome. This has important implications for clinical cancer genome sequencing: just because a druggable mutation is present doesn’t mean it’s expressed.
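The "just over one-third" figure is the sum of the last two categories: 31.48% + 5% = 36.48%. For readers who want to run the same kind of tally on their own data, here is a minimal sketch that sorts a mutation into one of the four categories from RNA-seq read counts at the mutated position. The function names and the depth thresholds are assumptions for illustration, not values from the paper.

```python
def classify_expression(ref_reads, alt_reads, min_depth=10, min_allele_reads=2):
    """Assign a mutation to an expression category from RNA-seq allele counts.

    The thresholds are hypothetical; low coverage is treated as "not expressed".
    """
    if ref_reads + alt_reads < min_depth:
        return "not expressed"
    has_ref = ref_reads >= min_allele_reads
    has_alt = alt_reads >= min_allele_reads
    if has_ref and has_alt:
        return "both alleles"
    return "mutant only" if has_alt else "wild-type only"


def expressed_fraction(labels):
    """Share of mutations whose mutant allele is seen in the transcriptome."""
    seen = sum(label in ("both alleles", "mutant only") for label in labels)
    return seen / len(labels)
```

Applied to a full call set, `expressed_fraction` gives the equivalent of the 36% headline number, while the per-category counts show how much of the remainder is simply unexpressed genes.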

Continuous Distribution of Somatic Mutations

With ultra-deep targeted sequencing, it’s possible to estimate the allele frequency of a somatic mutation with high accuracy, and from that, to infer the relative proportion of tumor cells harboring that mutation. A heterozygous founder mutation, for example, would be present in virtually all tumor cells and have a mutation frequency of 50% in diploid cells. Perhaps surprisingly, the authors find that somatic mutations occur across a continuous distribution of frequencies in TNBC, and this appears independent of copy number alterations and tumor cellularity.
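The step from allele frequency to "proportion of tumor cells" can be made concrete with a commonly used back-of-the-envelope relation. This is a simplified sketch, not the statistical model from the paper; the function name and parameters are assumptions, and it takes tumor purity and local copy number as known inputs.

```python
def cancer_cell_fraction(vaf, purity, tumor_copies=2, mutated_copies=1,
                         normal_copies=2):
    """Estimate the fraction of tumor cells carrying a mutation.

    vaf            observed variant allele frequency (0 to 1)
    purity         fraction of cells in the sample that are tumor cells
    tumor_copies   total copies of the locus in tumor cells
    mutated_copies copies carrying the mutation in a mutated cell
    normal_copies  copies of the locus in contaminating normal cells
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    # Average number of copies of the locus per cell in the mixed sample.
    average_copies = purity * tumor_copies + (1 - purity) * normal_copies
    fraction = vaf * average_copies / (purity * mutated_copies)
    return min(fraction, 1.0)  # sampling noise can push estimates above 1
```

For the founder example in the text, a 50% frequency in a pure diploid sample maps to a cell fraction of 1.0. A 20% frequency in a sample that is 80% tumor maps to roughly half of the tumor cells, which is the kind of subclonal signal the continuous distribution reflects.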

Part of this observation may be technical in nature (i.e. false negatives in mutation discovery). However, this phenomenon has been noted in other epithelial cancers, suggesting that the mutation content of cells within a single tumor may be differently shaped by biological processes and mutational mechanisms. It reinforces the notion that tumors (and TNBC in particular) are not a homogeneous mass of identical cells, but a collection of distinct sub-populations of cells evolving somewhat independently of one another. This is probably why they’re sometimes difficult to eliminate: you might destroy most of the subpopulations with therapy, but one or more minor clones could persist.


Shah SP, Roth A, Goya R, Oloumi A, Ha G, Zhao Y, Turashvili G, Ding J, Tse K, Haffari G, Bashashati A, Prentice LM, Khattra J, Burleigh A, Yap D, Bernard V, McPherson A, Shumansky K, Crisan A, Giuliany R, Heravi-Moussavi A, Rosner J, Lai D, Birol I, Varhol R, Tam A, Dhalla N, Zeng T, Ma K, Chan SK, Griffith M, Moradian A, Cheng SW, Morin GB, Watson P, Gelmon K, Chia S, Chin SF, Curtis C, Rueda OM, Pharoah PD, Damaraju S, Mackey J, Hoon K, Harkins T, Tadigotla V, Sigaroudinia M, Gascard P, Tlsty T, Costello JF, Meyer IM, Eaves CJ, Wasserman WW, Jones S, Huntsman D, Hirst M, Caldas C, Marra MA, & Aparicio S (2012). The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature PMID: 22495314

Fast, Efficient Short Read Alignment with Gaps: Bowtie 2

I’ve always been a fan of Bowtie, one of the first algorithms to leverage the Burrows-Wheeler Transform for short read alignment. When I first encountered it in 2008, it was incredibly fast: faster than Maq and Novoalign, two of the early popular algorithms for read mapping. Perhaps more importantly, it was ultra memory-efficient, enabling one to map millions of reads on a typical desktop computer. You’d still need the technical expertise to do anything with the alignments, but hey, it was a start. I liked it enough that the first version of VarScan included support for native Bowtie alignment formats (this was before the widespread adoption of SAM/BAM format).

Early Bowtie Aligner Limitations

Despite these features, Bowtie had a few limitations. First, it required all reads to have the same length and imposed a maximum read length that made it essentially incompatible with Roche/454 data. This wasn’t a big problem, because there were other aligners for 454 data that could handle its moderate level of throughput.

Even though it was faster, Bowtie was less suitable for paired-end data than Maq because it didn’t leverage the mate pairing information to improve alignment – it simply attempted to map each read in the mate pair independently, then went back to calculate the distance between them. This was kind of a bummer, but Bowtie was still quite suitable for fragment-end data, which made up the majority of sequencing data in 2008.

Another Bowtie limitation was that it didn’t align reads with gaps. In other words, if a read contained an insertion or deletion relative to the reference sequence, Bowtie wouldn’t map it. Side note: This also would have prevented Bowtie from working well on Roche/454 data (and later Ion Torrent data) due to the known homopolymer-associated sequencing errors. At the time, however, everyone was still struggling with SNP detection in next-gen sequencing data, so ungapped alignments weren’t a dealbreaker.

Indels and Gapped Alignments

In time, though, as our capability to detect insertion/deletion variants (indels) increased — due to algorithmic developments as well as longer reads — gapped alignment became more and more important. Benjamin Langmead, the developer and first author, once mentioned to me that it was the most-requested feature for Bowtie. The demand undoubtedly continued to increase as aligners such as BWA offered similar speed and memory performance, while making efforts to align reads across gaps. In paired-end data with one read anchored, BWA will even perform a more sensitive Smith-Waterman alignment to align its mate while allowing gaps. There was also Novoalign, a commercial aligner, which seemed the most sensitive to gaps in reads according to findings by Heng Li, myself, and others.

Interestingly, the Pindel algorithm, which identifies indels by splitting up the unmapped mate in a read pair where only one read mapped, nicely compensates for this limitation. In fact, the original Bowtie software paired with Pindel seems like it would be a powerful combination for efficient read mapping with indel detection.

Bowtie 2: Fast Alignment with Gaps

Several subsequent releases of Bowtie addressed some of the early limitations, and continued to increase its performance. And finally, we got the gapped alignment feature we were waiting for in Bowtie 2, which was just published in Nature Methods.

In the publication, Langmead and Salzberg describe a sort of hybrid algorithm that allows efficient gapped alignment of short reads. It essentially has four steps to it:

  1. “Seed” substrings, which are short segments that are likely to have unique matches in the genome, are extracted from each read.
  2. Seeds are aligned to the reference genome in ungapped fashion using the compressed index.
  3. Seed placements in the genome are prioritized to find the most likely map location(s).
  4. Seeds are extended into full alignments (allowing gaps) with a hardware-accelerated dynamic programming algorithm.
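For a feel of how these four steps fit together, here is a toy seed-and-extend aligner in Python. It is a sketch under loud assumptions, not Bowtie 2's implementation: a plain dictionary of k-mers stands in for the compressed index, seed hits are prioritized by simple voting, the extension step is an unaccelerated edit-distance calculation, and only the forward strand is searched. All function names and default parameters are made up for the example.

```python
from collections import Counter


def extract_seeds(read, seed_len=10, interval=5):
    """Step 1: fixed-length substrings taken at regular offsets along the read."""
    return [(i, read[i:i + seed_len])
            for i in range(0, len(read) - seed_len + 1, interval)]


def build_kmer_index(reference, seed_len=10):
    """Dictionary of k-mer positions; a stand-in for the compressed index."""
    index = {}
    for pos in range(len(reference) - seed_len + 1):
        index.setdefault(reference[pos:pos + seed_len], []).append(pos)
    return index


def candidate_starts(read, index, seed_len=10, interval=5):
    """Steps 2-3: exact seed hits vote for the read start position they imply."""
    votes = Counter()
    for offset, seed in extract_seeds(read, seed_len, interval):
        for pos in index.get(seed, []):
            votes[pos - offset] += 1
    return [start for start, _ in votes.most_common()]


def gapped_edit_distance(read, window):
    """Step 4: align the whole read anywhere inside the window.

    Mismatches and single-base gaps each cost 1; leading and trailing
    window bases are free.
    """
    prev = [0] * (len(window) + 1)
    for i in range(1, len(read) + 1):
        cur = [i] + [0] * len(window)
        for j in range(1, len(window) + 1):
            cur[j] = min(prev[j - 1] + (read[i - 1] != window[j - 1]),
                         prev[j] + 1,      # extra base in the read
                         cur[j - 1] + 1)   # base missing from the read
        prev = cur
    return min(prev)


def align(read, reference, index, seed_len=10, interval=5, pad=5):
    """Return (approximate start, edit cost) for the best candidate, or None."""
    best = None
    for start in candidate_starts(read, index, seed_len, interval)[:5]:
        lo = max(0, start - pad)
        hi = min(len(reference), start + len(read) + pad)
        cost = gapped_edit_distance(read, reference[lo:hi])
        if best is None or cost < best[1]:
            best = (start, cost)
    return best
```

The design point to notice is the division of labor: the cheap exact-match lookup narrows millions of possible locations to a handful, so the expensive gapped alignment only runs on small windows.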

Here, Bowtie leverages the speed of its “full-text minute index” for ungapped alignment to rapidly place seed segments without gaps, and then an accelerated algorithm to do the full read alignment with gaps. According to the authors, it’s a combination that allows for high speed, sensitivity, and accuracy.
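If you’ve never seen how a Burrows-Wheeler index finds exact matches, the following sketch may help. It builds the transform from a naively sorted suffix array and then runs the standard backward search, which consumes the pattern from its last character to its first. This is a teaching version under obvious assumptions: it stores the full occurrence table and suffix array uncompressed, whereas a real index samples both to save memory, and the function names are my own.

```python
from collections import Counter


def bwt_from_text(text):
    """Return (suffix array, BWT) for text with a '$' terminator appended."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each suffix
    return sa, bwt


def build_fm_tables(bwt):
    """first[c]: count of characters smaller than c; occ[c][i]: c's in bwt[:i]."""
    counts = Counter(bwt)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return first, occ


def find(pattern, sa, first, occ):
    """Backward search: return sorted start positions of exact matches."""
    lo, hi = 0, len(sa)  # half-open range of suffix array rows
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])
```

A quick usage example: after `sa, bwt = bwt_from_text("GATTACAGATTACA")` and `first, occ = build_fm_tables(bwt)`, calling `find("ATTA", sa, first, occ)` returns the two start positions of that substring. Each pattern character costs a constant number of table lookups regardless of genome size, which is where the speed for ungapped seed placement comes from.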

The ability of this new Bowtie algorithm to align with gaps will also aid RNA-Seq analysis using the TopHat package, which utilizes Bowtie as its core aligner, because the gaps that are present in mature mRNA are likely to be better handled.

Bottom line, even if you’re using something else to align reads right now, Bowtie might be worth a look.

Download Bowtie 2:


Langmead B, & Salzberg SL (2012). Fast gapped-read alignment with Bowtie 2. Nature methods, 9 (4), 357-9 PMID: 22388286