    The genetic architecture of triple-negative breast cancer

    April 19th, 2012

    Triple-negative breast cancer (TNBC), a tumor type defined by its lack of estrogen receptor, progesterone receptor, and Her2 (ERBB2) amplification, accounts for 16% of breast cancers. This clinically defined tumor type overlaps substantially but not completely with “basal-like” breast cancer, a classification based upon gene expression signature. This is a highly heterogeneous disease with a higher risk of recurrence in the absence of systemic therapy.

    This month in Nature, researchers from BC Cancer Agency have characterized the landscape of genomic aberrations in 104 TNBC cases with a combination of whole-genome sequencing, exome sequencing, RNA-seq, and high-density SNP arrays. Using ultra-deep targeted resequencing, the authors validated ~2,500 somatic mutations and characterized their frequencies among heterogeneous clonal populations in each tumor.

    [Figure: Genomic architecture (Basal TNBC), Shah et al, Nature 2012]

    This is a complex study that’s hard to digest (the supplemental material had over 140 pages – come on, is that really necessary?) so I’ll do my best to break it down. I believe there are three highlights: frequent gene alterations in TNBC, under-representation of mutations in mRNA sequences, and a continuous distribution of mutation frequencies within tumors.

    Genetic Alterations in Triple-Negative Breast Cancer

    The most frequently mutated gene should be familiar to you: TP53, which harbored validated somatic mutations in 62% of basal and 43% of non-basal TNBCs. Other patterns of alteration were as follows:

    • Significantly mutated genes included TP53, PIK3CA, RB1, PTEN, MYO3A, and GH1. Here, significant means that the gene harbored more mutations than expected from background random mutation processes. The larger the gene, the more likely it is to catch a random mutation. That’s why USH2A, which was mutated in 9.2% of cases, was not significant (it’s a large gene).
    • Recurrent but not statistically significant mutations were observed in the nesprin genes (SYNE1/SYNE2), BRCA2, BRAF, NRAS, ERBB2, and ERBB3.
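
To make the "more mutations than expected" idea concrete, here is a minimal sketch of a gene-size-aware significance test. It assumes a single uniform background mutation rate per base and a Poisson model, which is my simplification and not necessarily the test used in the paper. The function names, the rate, and the gene lengths are all hypothetical values chosen for illustration.

```python
import math


def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), summing the upper tail directly."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    # First term P(X = k), computed in log space to avoid overflow.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    while True:
        total += term
        i += 1
        term *= lam / i
        # Past the mean the terms shrink monotonically, so we can stop
        # once they no longer change the running total.
        if i > lam and term <= 1e-16 * total:
            break
    return min(total, 1.0)


def gene_pvalue(n_mutations, coding_length_bp, n_tumors, rate_per_bp):
    """P-value for seeing at least n_mutations in a gene by chance alone.

    Assumption: every base in every tumor mutates independently at the
    same (hypothetical) background rate.
    """
    expected = coding_length_bp * n_tumors * rate_per_bp
    return poisson_tail(n_mutations, expected)


if __name__ == "__main__":
    # Hypothetical inputs: same mutation count, very different gene sizes.
    for name, length in [("small_gene", 1_200), ("large_gene", 15_000)]:
        p = gene_pvalue(n_mutations=8, coding_length_bp=length,
                        n_tumors=100, rate_per_bp=1e-6)
        print(name, p)
```

With the same number of observed mutations, the larger gene gets the larger (less significant) p-value because its expected count from background mutation is higher. That is the intuition behind a big gene like USH2A failing to reach significance.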

    Somatic Copy Number Alterations (CNAs)

    The patterns of somatic copy number changes, as assessed by high-density SNP array, suggest widespread segmental CNA instability:
    [Figure: Somatic copy number in TNBC, Shah et al, Nature 2012 (Supp Fig 3)]
    These results are largely consistent with a separate study in the same issue that examined CNAs in 2,000 breast tumors. I’ll have to cover that another time. Some of the known CNA patterns evident above include gains of chromosomes 1q, 3q, and 8q (where MYC is located). Note also the frequent deletions across many chromosome arms or entire chromosomes, many of which contain tumor suppressor genes (e.g. TP53 on chromosome 17).

    Expression of Somatic Mutations

    This group at BC Cancer Agency is a leader in transcriptome sequencing (RNA-Seq), which is a key component of this study. Strikingly, the authors found that just 36% of validated somatic mutations discovered in genomic DNA were present in mRNA transcripts. This number is a little deceptive, and I’ll tell you why. Supplementary figure 2 offers a summary of the expression of validated mutations across all cases with RNA-seq data:
    [Figure: Mutation expression patterns in TNBC, Shah et al, Nature 2012 (Supp Fig 2)]
    Notably, 23% of somatic mutations occur in genes with no observed transcripts: there’s no allelic effect of the mutation; the genes just aren’t expressed in any form. That leaves:
    • 40.56% of mutations where only the wild-type allele is expressed. Here, it’s possible that the mutation alters mRNA expression or stability and thus only the non-mutated allele is seen.
    • 31.48% where both alleles are expressed. The mutation may not affect expression, but it could still alter the translation or function of the encoded protein.
    • 5% where only the mutant allele is expressed. This could be due to genomic loss of the wild-type allele (LOH), mutations on the X-chromosome (one copy of which is inactivated), or even a gain-of-function mutation causing aberrant gene expression.
    Bottom line, just over one-third of somatic mutations in the genome are present in the transcriptome. This has important implications for clinical cancer genome sequencing: just because a druggable mutation is present doesn’t mean it’s expressed.
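
The 36% figure is simply the sum of the last two categories (31.48% + 5%, or about 36.5%). Here is a rough sketch of how one might sort mutations into these four bins from RNA-seq read counts at each mutated position. The function names and the `min_reads` cutoff are hypothetical, and the paper's actual criteria for calling an allele "expressed" may well differ.

```python
from collections import Counter


def classify_expression(ref_reads, alt_reads, min_reads=1):
    """Bin one mutation by which alleles show up in the RNA-seq reads.

    ref_reads / alt_reads: counts of RNA-seq reads supporting the
    wild-type and mutant allele at the mutated position.
    min_reads: hypothetical cutoff for calling the gene expressed at all.
    """
    if ref_reads + alt_reads < min_reads:
        return "not expressed"
    if alt_reads == 0:
        return "wild-type only"
    if ref_reads == 0:
        return "mutant only"
    return "both alleles"


def summarize(read_counts):
    """Return the percentage of mutations falling in each category."""
    counts = Counter(classify_expression(ref, alt) for ref, alt in read_counts)
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()}


if __name__ == "__main__":
    # Made-up (ref_reads, alt_reads) pairs, one per validated mutation.
    toy = [(0, 0), (25, 0), (12, 9), (0, 14), (30, 0)]
    for category, pct in summarize(toy).items():
        print(f"{category}: {pct:.1f}%")
```

A mutation counts as "present in the transcriptome" only if it lands in the "both alleles" or "mutant only" bin.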

    Continuous Distribution of Somatic Mutations

    With ultra-deep targeted sequencing, it’s possible to estimate the allele frequency of a somatic mutation with high accuracy, and from that, to infer the relative proportion of tumor cells harboring that mutation. A heterozygous founder mutation, for example, would be present in virtually all tumor cells and have a mutation frequency of 50% in diploid cells. Perhaps surprisingly, the authors find that somatic mutation frequencies follow a continuous distribution in TNBC, and this appears independent of copy number alterations and tumor cellularity.
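
As a back-of-the-envelope illustration of going from allele frequency to the fraction of tumor cells, here is a small sketch. It uses a simple textbook-style relationship rather than the statistical model from the paper, and it assumes normal cells are diploid and that you know the tumor purity, the local tumor copy number, and how many copies carry the mutation. All parameter names are my own.

```python
def tumor_cell_fraction(vaf, purity=1.0, tumor_copy_number=2, mutant_copies=1):
    """Estimate the fraction of tumor cells carrying a mutation.

    vaf: observed variant allele frequency (0-1).
    purity: fraction of cells in the sample that are tumor cells.
    tumor_copy_number: copies of the locus per tumor cell.
    mutant_copies: copies carrying the mutation in a mutated cell.

    Assumes contaminating normal cells are diploid at this locus.
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    if mutant_copies < 1 or tumor_copy_number < mutant_copies:
        raise ValueError("mutant_copies must be between 1 and tumor_copy_number")
    # Average number of copies of the locus per cell across the whole sample.
    average_copies = purity * tumor_copy_number + (1 - purity) * 2
    fraction = vaf * average_copies / (purity * mutant_copies)
    return min(fraction, 1.0)  # sampling noise can push the estimate above 1


if __name__ == "__main__":
    print(tumor_cell_fraction(0.50))              # heterozygous founder mutation
    print(tumor_cell_fraction(0.25))              # present in about half the cells
    print(tumor_cell_fraction(0.20, purity=0.8))  # same idea with normal contamination
```

In a pure diploid sample this reduces to twice the allele frequency, which is why a heterozygous founder mutation sits at 50%.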

    Part of this observation may be technical in nature (i.e. false negatives in mutation discovery). However, this phenomenon has been noted in other epithelial cancers, suggesting that the mutation content of cells within a single tumor may be differently shaped by biological processes and mutational mechanisms. It reinforces the notion that tumors (and TNBC in particular) are not a homogeneous mass of identical cells, but a collection of distinct sub-populations of cells evolving somewhat independently of one another. This is probably why they’re sometimes difficult to eliminate: you might destroy most of the subpopulations with therapy, but one or more minor clones could persist.

    References

    Shah SP, Roth A, Goya R, Oloumi A, Ha G, Zhao Y, Turashvili G, Ding J, Tse K, Haffari G, Bashashati A, Prentice LM, Khattra J, Burleigh A, Yap D, Bernard V, McPherson A, Shumansky K, Crisan A, Giuliany R, Heravi-Moussavi A, Rosner J, Lai D, Birol I, Varhol R, Tam A, Dhalla N, Zeng T, Ma K, Chan SK, Griffith M, Moradian A, Cheng SW, Morin GB, Watson P, Gelmon K, Chia S, Chin SF, Curtis C, Rueda OM, Pharoah PD, Damaraju S, Mackey J, Hoon K, Harkins T, Tadigotla V, Sigaroudinia M, Gascard P, Tlsty T, Costello JF, Meyer IM, Eaves CJ, Wasserman WW, Jones S, Huntsman D, Hirst M, Caldas C, Marra MA, & Aparicio S (2012). The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature PMID: 22495314

    Fast, Efficient Short Read Alignment with Gaps: Bowtie 2

    April 12th, 2012

    I’ve always been a fan of Bowtie, one of the first algorithms to leverage the Burrows-Wheeler Transform for short read alignment. When I first encountered it in 2008, it was incredibly fast: faster than Maq and Novoalign, two of the early popular algorithms for read mapping. Perhaps more importantly, it was ultra memory-efficient, enabling one to map millions of reads on a typical desktop computer. You’d still need the technical expertise to do anything with the alignments, but hey, it was a start. I liked it enough that the first version of VarScan included support for native Bowtie alignment formats (this was before the widespread adoption of SAM/BAM format).

    Early Bowtie Aligner Limitations

    Despite these features, Bowtie had a few limitations: First, it required all reads to have the same length and had an upper read length maximum that made it essentially incompatible with Roche/454 data. This wasn’t a big problem, because there were other aligners for 454 data that could handle its moderate level of throughput.

    Even though it was faster, Bowtie was less suitable for paired-end data than Maq because it didn’t leverage the mate pairing information to improve alignment: it simply attempted to map each read in the mate pair independently, then went back to calculate the distance between them. This was kind of a bummer, but Bowtie was still quite suitable for fragment-end data, which made up the majority of sequencing data in 2008.

    Another Bowtie limitation was that it didn’t align reads with gaps. In other words, if a read contained an insertion or deletion relative to the reference sequence, Bowtie wouldn’t map it. Side note: This also would have prevented Bowtie from working on Roche/454 data (and later IonTorrent data) due to the known homopolymer-associated sequencing errors. At the time, however, everyone was still struggling with SNP detection in next-gen sequencing data, so ungapped alignments weren’t a dealbreaker.

    Indels and Gapped Alignments

    In time, though, as our capability to detect insertion/deletion variants (indels) increased — due to algorithmic developments as well as longer reads — gapped alignment became more and more important. Benjamin Langmead, the developer and first author, once mentioned to me that it was the most-requested feature for Bowtie. The demand undoubtedly continued to increase as aligners such as BWA offered similar speed and memory performance, while making efforts to align reads across gaps. In paired-end data with one read anchored, BWA will even perform a more sensitive Smith-Waterman alignment to align its mate while allowing gaps. There was also Novoalign, a commercial aligner, which seemed the most sensitive to gaps in reads according to findings by Heng Li, myself, and others.

    Interestingly, the Pindel algorithm, which identifies indels by splitting up the unmapped mate in a read pair where only one read mapped, nicely compensates for this limitation. In fact, the original Bowtie software paired with Pindel seems like it would be a powerful combination for efficient read mapping with indel detection.

    Bowtie 2: Fast Alignment with Gaps

    Several subsequent releases of Bowtie addressed some of the early limitations, and continued to increase its performance. And finally, we got the gapped alignment feature we were waiting for in Bowtie 2, which was just published in Nature Methods.

    In the publication, Langmead and Salzberg describe a sort of hybrid algorithm that allows efficient gapped alignment of short reads. It essentially has four steps to it:

    1. “Seed” substrings, which are short segments that are likely to have unique matches in the genome, are extracted from each read
    2. Seeds are aligned to the reference genome in ungapped fashion using the compressed index.
    3. Seed placements in the genome are prioritized to find the most likely map location(s)
    4. Seeds are extended into full alignments (allowing gaps) with a hardware-accelerated dynamic programming algorithm

    Here, Bowtie leverages the speed of its “full-text minute index” for ungapped alignment to rapidly place seed segments without gaps, and then an accelerated algorithm to do the full read alignment with gaps. According to the authors, it’s a combination that allows for high speed, sensitivity, and accuracy.
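
The four steps can be sketched in a few lines of toy Python. To be clear, this is not Bowtie 2's code: a plain k-mer dictionary stands in for the compressed FM index, an unaccelerated edit-distance routine stands in for the hardware-accelerated dynamic programming, reverse-complement seeds are skipped, and the function names and parameters (`k`, `interval`, `pad`) are values I picked for illustration.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to its start positions (FM index stand-in)."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def extract_seeds(read, k, interval):
    """Step 1: pull (offset, substring) seeds from the read at a fixed interval."""
    return [(off, read[off:off + k]) for off in range(0, len(read) - k + 1, interval)]


def gapped_edit_distance(read, window):
    """Step 4: fewest mismatches/gaps needed to align the whole read
    somewhere inside the window (free start and end in the window)."""
    prev = [0] * (len(window) + 1)
    for i, read_base in enumerate(read, 1):
        cur = [i] + [0] * len(window)
        for j, ref_base in enumerate(window, 1):
            cur[j] = min(prev[j - 1] + (read_base != ref_base),  # match/mismatch
                         prev[j] + 1,                            # extra base in read
                         cur[j - 1] + 1)                         # base missing from read
        prev = cur
    return min(prev)


def align(read, reference, index, k=8, interval=4, pad=3, max_candidates=5):
    """Return (window_start, edits) for the best candidate, or None if no seed hits."""
    # Step 2: ungapped, exact seed lookups; each hit votes for an implied read start.
    votes = defaultdict(int)
    for off, seed in extract_seeds(read, k, interval):
        for pos in index.get(seed, []):
            votes[pos - off] += 1
    # Step 3: prioritize candidate locations by number of supporting seeds.
    ranked = sorted(votes.items(), key=lambda item: -item[1])[:max_candidates]
    best = None
    for start, _ in ranked:
        lo = max(0, start - pad)
        hi = min(len(reference), start + len(read) + pad)
        edits = gapped_edit_distance(read, reference[lo:hi])
        if best is None or edits < best[1]:
            best = (lo, edits)
    return best


if __name__ == "__main__":
    reference = "ACGTTGCATGCCGATAGCTAGGCTTAACGGATCCGATTACAGGCATTCGA"
    index = build_index(reference, k=8)
    read = reference[10:25] + reference[26:40]  # read with a 1-base deletion
    print(align(read, reference, index))
```

The point of the design is that the expensive gapped step only runs on a handful of candidate windows that the cheap exact-match step has already nominated.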

    The ability of this new Bowtie algorithm to align with gaps will also aid RNA-Seq analysis using the TopHat package, which utilizes Bowtie as its core aligner, because the gaps that are present in mature mRNA are likely to be better handled.

    Bottom line, even if you’re using something else to align reads right now, Bowtie might be worth a look.

    Download Bowtie 2: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml

    References

    Langmead B, & Salzberg SL (2012). Fast gapped-read alignment with Bowtie 2. Nature methods, 9 (4), 357-9 PMID: 22388286

    Genetic Evolution of Secondary AML from MDS

    March 14th, 2012

    Myelodysplastic syndromes (MDS) are a group of disorders of ineffective blood production and the most common cause of acquired bone marrow failure in adults. One-third of cases go on to develop secondary AML (sAML), yet there remains uncertainty among patients, insurers, and funding agencies about whether the myelodysplastic syndromes are actually cancers. A study online today at the New England Journal of Medicine has characterized the genetic evolution from MDS to sAML using whole-genome sequencing.

    Whole-genome Sequencing of sAML

    Matthew J. Walter and colleagues of the Washington University School of Medicine performed whole-genome sequencing of tumor samples and matched normal DNA from seven patients with secondary AML. For each subject, hundreds of somatic mutations were genotyped in sAML and MDS-stage samples to characterize the clonal architecture of each tumor. Figure 1A from the paper demonstrates the resolution that can be obtained from deep resequencing of somatic mutations in both sAML and MDS samples:

    [Figure: Somatic mutation frequencies in MDS and AML. Credit: Walter et al, NEJM (2012)]

    Notice the five clusters (differently colored) representing five clonal populations. In yellow (cluster 1) are mutations present in virtually all cells of both the MDS and the sAML sample. In orange (cluster 2) are mutations present at low frequency in MDS but enriched in sAML. Three more clusters (red, purple, and black) along the y-axis represent mutations that were absent in the MDS sample but acquired during the progression to sAML. The patterns of these mutations suggest that sAML evolved from a clonal population of MDS cells that acquired new mutations along the way.
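
To illustrate the logic of reading such a plot, here is a deliberately crude sketch that bins a mutation by its allele frequency in the two samples. The paper derives its clusters from the data; the fixed cutoffs below (`detect`, `clonal`) and the category names are hypothetical and only meant to show the reasoning.

```python
def classify_mutation(vaf_mds, vaf_saml, detect=0.02, clonal=0.35):
    """Assign a coarse category from variant allele frequencies (0-1).

    detect: hypothetical frequency below which a mutation counts as absent.
    clonal: hypothetical frequency above which a heterozygous mutation is
            treated as present in nearly all cells.
    """
    if vaf_mds < detect and vaf_saml < detect:
        return "not detected"
    if vaf_mds < detect:
        return "acquired in sAML"   # absent in MDS, sits along the y-axis
    if vaf_mds >= clonal and vaf_saml >= clonal:
        return "founder"            # in virtually all cells of both samples
    if vaf_saml > vaf_mds:
        return "enriched in sAML"   # low in MDS, expanded in sAML
    return "other"


if __name__ == "__main__":
    # Made-up (MDS, sAML) frequencies for three mutations.
    for pair in [(0.45, 0.47), (0.05, 0.40), (0.00, 0.30)]:
        print(pair, classify_mutation(*pair))
```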

    Identification of Recurrently Mutated Genes

    In the very near future, it may become feasible and cost-effective to perform whole-genome sequencing (WGS) on hundreds or thousands of tumors of a certain type to exhaustively identify recurrently mutated genes. Until then, WGS of a discovery cohort followed by extension screening in a larger cohort offers a powerful and cost-effective strategy. Two genes were already recurrently mutated in the 7 WGS cases: RUNX1, a known myeloid tumor suppressor, and UMODL1, for which mutations were recently reported in multiple myeloma and ovarian cancer. The authors extended their findings via targeted screening for additional coding mutations in 200 AML cases. This enabled the identification of 9 more recurrently mutated genes, for a total of 11.

    Recurrently Mutated Genes in MDS and sAML

    Gene      Mutation(s)
    CDH23     1235insL
    NPM1      W288fs
    PTPN11    G60R
    RUNX1     G170fs; del21q22.11
    SMC3      e8-1 splice
    STAG2     H738fs
    TP53      V272M
    U2AF1     S34F
    UMODL1    T533P; V882M
    WT1       D436E
    ZSWIM4    P18A

    Notably, four of the genes (CDH23, SMC3, UMODL1, and ZSWIM4) had not been implicated in MDS or AML. A specific codon (34) in U2AF1 harbored missense mutations in multiple AML tumors, suggesting a gain-of-function for the splicing factor encoded by that gene. The recurrent mutations in STAG2, a gene located on the X-chromosome, were all protein truncation mutations (nonsense or frameshift) suggesting that a loss-of-function of this gene contributes to MDS and AML pathogenesis.

    Clonal Evolution: from MDS to AML

    By characterizing mutations from secondary AML tumors in the MDS precursors for the same patient, the authors reconstructed the clonal architecture of the disease from early to advanced stages. The findings are summarized in Figure 2A:

    [Figure: Clonal evolution from MDS to AML. Credit: Walter et al, NEJM (2012)]

    In all 7 cases, the results suggest a linear model of clonal evolution, in which progression from MDS to sAML was characterized by persistence of a single founder clone (defined by ~200-700 mutations) and the outgrowth of at least one new subclone which contained dozens or hundreds of additional mutations. In other words, a single population of MDS cells underwent multiple rounds of mutation and selection, giving rise to multiple subpopulations present in full-blown secondary AML.

    Please go read this fascinating study at the New England Journal of Medicine.

    References

    Walter MJ, Shen D, Ding L, Shao J, Koboldt DC, Chen K, Larson DE, McLellan MD, Dooling D, Abbott R, Fulton R, Magrini V, Schmidt H, Kalicki-Veizer J, O’Laughlin M, Fan X, Grillot M, Witowski S, Heath S, Frater JL, Eades W, Tomasson M, Westervelt P, DiPersio JF, Link DC, Mardis ER, Ley TJ, Wilson RK, & Graubert TA (2012). Clonal architecture of secondary acute myeloid leukemia New England Journal of Medicine

    5 Things to Know About SAMtools Mpileup

    March 2nd, 2012

    Next-generation sequencing instruments might be considered a disruptive technology. The incredible throughput of these machines, even 4-5 years ago, clearly mandated the development of a new generation of algorithms and data formats capable of storing, processing, and analyzing huge amounts of sequence data. One key achievement in next-generation sequencing bioinformatics was the specification of sequence alignment/map format (SAM) and its binary equivalent (BAM). These formats were widely adopted by a community of scientists desperate to have a common format in which to store next-gen sequencing reads and their alignments to a reference. BAM files quickly became a standard for the Cancer Genome Atlas, the 1000 Genomes Project, and other large-scale sequencing efforts. The formats were accompanied by a software package, SAMtools, that is probably the most pervasive tool for next-gen sequencing in the world.

    To aid in variant calling and other analyses, SAMtools can generate a pileup of read bases using the alignments to a reference sequence. There’s a lot you can do with pileup-like output, and indeed, SAMtools variant calling is quite popular. The actual command is samtools mpileup, and here are five things that you should know about it.

    1. SAMtools mpileup has permanently replaced pileup. Replaced as in the latter command no longer works. This is generally a good thing; mpileup can do nearly everything that pileup could, and a lot more. You still use it to generate pileup output for a single sample. However, some features have gone away, such as simple consensus calling with the -c parameter, and the option to output mapping qualities for each base (I think that was -k). Consensus calling can be done in mpileup with a couple of extra steps using bcftools; see the mpileup page for details.
    2. Base alignment quality (BAQ) computation is turned on by default. BAQ is a phred-like score representing the probability that a read base is mis-aligned; it lowers the base quality score of mismatches that are near indels. This is to help rule out false positive SNP calls due to alignment artifacts near small indels. There have been recent suggestions, however, that BAQ may be too strict and cause real SNPs to be missed. Several users of the VarScan variant caller have reported that its read counts disagree with what is seen in IGV, or somatic mutations were missed when mpileup was used instead of pileup. These issues are almost always due to BAQ’s downgrade of base qualities to 0 or 1. This adjustment can’t be seen in IGV, but it’s below VarScan’s default base quality threshold. You can disable BAQ with the -B parameter, or perform a more sensitive BAQ calculation with -E. I’ve heard that the latter option will be turned on by default in the next version of SAMtools.
    3. Analyze multiple samples at once. The principal feature of SAMtools mpileup is the ability to analyze data from multiple samples simultaneously. You do this by providing more than one BAM file. This feature is nice because it provides data across all samples on a per-position basis. The first three columns (chromosome, position, and reference base) are the same. Following those, you get three columns per BAM file indicating the read depth, bases, and base qualities for that sample at that position. The VarScan mpileup2cns command will take this raw input and call a genotype for each sample, as well as a consensus genotype based on the data from ALL samples. This is useful for detecting variants in low-coverage regions by leveraging data across samples. You can also use the bcftools pipeline for multi-sample calling.
    4. Rule out false positives with strand bias or poor mapping. Many groups working on variant calling in next-generation sequencing have independently converged on several key factors that influence false positive rates. The VarScan 2 paper, for example, describes 9 empirically-derived filtering criteria that we use to identify and remove artifacts. The strand representation and number of mismatches in supporting reads are two important indicators of false positives arising from systematic alignment artifacts. SAMtools mpileup helps users address these issues as well: the -C parameter lets you downgrade the mapping quality of reads with lots of mismatches, and the -S parameter tells SAMtools to report a per-sample strand bias p-value.
    5. Random position retrieval that works. One of the most powerful features of mpileup is that you can specify a region with -r chrom:start-stop and it will report pileup output for the specified position(s). The old pileup command had this option, but took a long time because it looked at all positions and just reported the ones within your desired region. Instead, mpileup leverages BAM file indexing to retrieve data quite rapidly: In my experience, it takes about 1 second to retrieve the pileup for several samples at any given position in the human genome. Multi-sample, rapid random access has lots of uses for bio-informaticians; for example, I can retrieve all bases observed in all samples at a variant of interest to look at the evidence in each sample.
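
For anyone who wants to poke at multi-sample mpileup output directly, here is a small Python sketch that assembles a command line from the options discussed above and tallies the bases in one output line. It is not how VarScan or bcftools parses pileups. I'm assuming Phred+33 base qualities, that every base call (including the `*` deletion placeholder) has one quality character, and a `min_base_qual` of 15 picked for illustration. The file names and function names are hypothetical.

```python
import re


def build_mpileup_command(reference_fasta, bam_files, region=None, disable_baq=False):
    """Assemble (but do not run) a samtools mpileup command as an argument list."""
    cmd = ["samtools", "mpileup", "-f", reference_fasta]
    if disable_baq:
        cmd.append("-B")        # turn off BAQ, as described in point 2
    if region:
        cmd += ["-r", region]   # chrom:start-stop, as described in point 5
    return cmd + list(bam_files)


def count_bases(ref_base, bases, quals, min_base_qual=15):
    """Count alleles in one sample's bases column, skipping low-quality bases."""
    counts = {}
    i = q = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":            # read start marker, followed by a mapping quality char
            i += 2
        elif c == "$":          # read end marker
            i += 1
        elif c in "+-":         # indel: sign, length, then that many bases
            length = re.match(r"\d+", bases[i + 1:]).group()
            i += 1 + len(length) + int(length)
        else:                   # a base call with one matching quality character
            qual = ord(quals[q]) - 33
            q += 1
            i += 1
            if c == "*" or qual < min_base_qual:
                continue        # deletion placeholder, or quality too low (e.g. after BAQ)
            allele = ref_base if c in ".," else c.upper()
            counts[allele] = counts.get(allele, 0) + 1
    return counts


def parse_mpileup_line(line, sample_names, min_base_qual=15):
    """Split one multi-sample line into per-sample depth and allele counts."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref = fields[0], int(fields[1]), fields[2].upper()
    samples = {}
    for n, name in enumerate(sample_names):
        depth, bases, quals = fields[3 + 3 * n: 6 + 3 * n]
        depth = int(depth)
        counts = count_bases(ref, bases, quals, min_base_qual) if depth else {}
        samples[name] = {"depth": depth, "counts": counts}
    return chrom, pos, ref, samples


if __name__ == "__main__":
    print(build_mpileup_command("ref.fa", ["normal.bam", "tumor.bam"],
                                region="chr1:100-200", disable_baq=True))
    line = "chr1\t100\tA\t4\t.,Gg\tIII!\t2\t^].-1C,$\tII"
    print(parse_mpileup_line(line, ["normal", "tumor"]))
```

Note how the lowercase `g` in the first sample carries a quality of 0 (`!`) and drops out of the counts. That is the same effect that makes a caller's read counts disagree with IGV when BAQ has downgraded a base.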

    These features are the results of hard work by Heng Li and others who contribute to the development and improvement of SAMtools. It’s great to see a key piece of software under continued, active development, and I think most of us look forward to what the next SAMtools has in store.

    References

    Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, & 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England), 25 (16), 2078-9 PMID: 19505943