Making the Leap: Maq to BWA

December 23, 2009 by Dan Koboldt

Like most curmudgeons I fought the change as stubbornly as I could. Leave Maq behind for something else? Never! Yet over the past few months I have come to realize that BWA, as it’s called, is not bad. At our genome center we still generate both Maq and BWA alignments for Illumina data; thanks to the hard work of our automated pipeline (apipe) group, I never have to run either on my own. Admittedly, even without a learning curve to daunt me, I found myself reluctant to use the BWA results over Maq.

How I Came to Love BWA

Speed was obviously the most compelling argument. Whereas Maq can take 1-2 days to map a single lane of Illumina data to the human genome, BWA can do it in a few hours. Also, BWA outputs directly into SAM format, which we’ve universally (and gratefully) adopted as the standard for next-gen sequencing data. There are, as it turns out, many differences between Maq and BWA, which I’ve summarized in the table below:

Aligner:	Maq	BWA
Algorithm:	Hash lookup table	Burrows-Wheeler Transform
Input format:	Binary FastQ	FastQ
Output format:	MAP file	SAM file
Gapped Alignment:	Paired-end only; one mate must map	Single-end and paired-end
Maximum Read Length:	63 bp	200 bp (aln) or 1,000 bp (dbtwsw)
Multi-threading:	No	Yes
Assembly / SNP calling:	Tools included	None. Use SAMtools.
Simulation Tools:	Tools included.	None.
Continued development:	No	Yes

There are other differences, but these are the ones I find most important. The ability to take standard FastQ input is a plus for BWA, but its native SAM output is critical. Granted, it’s possible to convert the output of several aligners (Maq, Bowtie, NovoCraft, even BLAST) to SAM format, but a direct output is convenient. Compared to Maq, BWA is lean and mean – no assembly, SNP calling, or simulation tools – with the single purpose of mapping reads to a reference, and a reliance on companion software (SAMtools) for subsequent analysis.

Two new features in BWA – gapped alignment and the ability to take longer reads – are a devastating blow to other next-generation sequencing aligners. Now, it’s possible to map single-end reads with gaps (Novoalign) and to map longer 454 reads (SSAHA2). Like Maq, BWA has color-space functions as well.

The take-home message: BWA can map data from all three available next-generation sequencers – Illumina/Solexa, Roche/454, and ABI SOLiD.

The Heng Li Factor

No software tool can remain competitive without continued development. Thus, when I heard that Heng Li did not plan any further releases of Maq, I knew its days were numbered. Some of us had already gotten the impression that Heng was focusing on BWA more than Maq anyway, but that made it official. Maq is still an incredible piece of software, and represented a technological leap for NGS analysis that helped bring about the era of whole-genome sequencing. We were among the first to climb aboard the Maq train, and it took us far. Now, sadly but with eager anticipation, we transferred to the express train, the bullet train. BWA.

References
Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics, 25 (14), 1754-1760 DOI: 10.1093/bioinformatics/btp324

Mapping Bias in Short Read Alignment

December 11, 2009 by Dan Koboldt

A recent paper in Bioinformatics investigates the effect of read-mapping biases on detecting allele-specific expression (ASE) from RNA-Seq data. The authors generated 16 million 36-bp cDNA reads in each of two HapMap individuals on the Illumina/Solexa platform. When evaluating known SNPs for evidence of ASE, they observed that heterozygous SNPs exhibited a mapping bias favoring the reference allele.

mapping-bias-header

This alone is perhaps not surprising, as we already knew that indels suffer from such bias. Initially, most short read aligners simply ignored gapped alignments. Now, even with aligners like BWA and Novoalign that allow for gaps when mapping short reads, alignments supporting the reference allele (ungapped) will be favored over alignments supporting an indel (gapped). The longer the indel, the larger the gap, and the less likely a short read would be to be mapped across it.

It is easy to see how SNPs might have a similar effect. Clusters of SNPs in close proximity, for example, may result in reads with more mismatches than are permitted by the aligner. In simulations, the authors found that random error (i.e. sequencing error) exacerbated the mapping bias. At an error rate of 0.01, some 51.4% of reads at heterozygous sites supported the reference allele, while an error rate of 0.05 increased the proportion to 59%. My own conclusion based on these results is that a variant allele, combined with nearby sequence changes that result from random error, pushes the mismatch profile of certain reads above the threshold at which alignments are discarded.

SNP-Masking Reveals Inherent Bias

What is surprising in this study by Degner et al is that even after they masked SNP positions in the reference sequence, some 5-10% of SNPs still had an inherent mapping bias favoring one allele. For 1.4% of SNPs, in fact, all of the reads came from a single allele. This obviously has important implications for evaluating ASE in RNA-Seq data, since the relative frequency of alleles from read mapping is used to infer allelic expression. It also affects the now-widespread application of Illumina/Solexa and ABI/SOLiD sequencing to characterize genetic variation from genomic DNA. Because virtually every variant calling algorithm relies on the ratio of reads supporting variant versus reference alleles, an inherent mapping bias favoring the reference allele will reduce the detection sensitivity.

Mapping Bias and Sequence Homology

To better understand the causes of inherent mapping bias, the authors investigated some of the most severely affected SNPs. The strongest biases occured among SNPs in regions of the genome with homology to other locations. When the SNP position was not masked, variant-containing reads matched another locus equally or even better than the true location. When the SNP position was masked, both reference- and variant-containing reads had a 1-bp mismatch to the reference, but either allele might match better elsewhere in the genome. In Figure 3, two examples of such SNPs demonstrate how variant-containing reads either mapped incorrectly or were “not mapped.” Some of these “not mapped” reads may have exceeded the number of allowable mismatches, while others may have become non-unique (i.e. matching multiple places). The authors filtered any alignments with mapping quality of 0, so it’s unclear which caused the mapping failure.

I should point out here that the masking approach may have contributed to this result. The authors “masked” heterozygous SNPs by changing the reference base to a third allele that matched neither reference nor the known variant. A superior approach might be to mask heterozygous SNPs to N, so that any base call at that position is considered a match. This would reduce the number of read mismatches overall, and might help improve the bias. Then again, some read aligners may consider any base at “N” to be a mismatch, which would have essentially no effect. What might have been interesting, though, is increasing the # and base-quality-sum of mismatches allowed by Maq to see if the read bias was removed.

Implications Moving Forward for ASMB

Your reaction might be to shrug, since Illumina/Solexa now routinely generates 76-bp and 100-bp reads. There are, however, a number of reasons why this might not address the bias issue. First, while read lengths are getting longer, alignment “seeds” for short reads are essentially unchanged, and if the SNP occurs in the ~22-25 bp alignment seed, it can still have an effect. Second, many published datasets these days are still based on read lengths of 50 bp or less, especially from groups running ABI/SOLiD or older Illuminas. Third, at least one promising single-molecule sequencer is still generating reads in the 30 bp range. And finally, there’s a practical reason that we’ll continue to see short read datasets: running a 75-bp or 100-bp Illumina flowcell takes several days and multiple kits – expenses of time and dollars that may not always be available. Thus, allele-specific mapping bias (ASMB) [acronym invented, D. Koboldt, 12/11/09] in short reads will remain a key issue in next-generation sequencing.

References
Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, & Pritchard JK (2009). Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics (Oxford, England), 25 (24), 3207-12 PMID: 19808877

Short Read Aligners and Variant Detection

November 6, 2009 by Dan Koboldt

In recent weeks I’ve had conversations with many people in the NGS community who are attempting to call variants, accurately, in Illumina/Solexa data. Part of it stems from VarScan, my SNP and indel caller for next-gen sequencing data that works with Bowtie, Novoalign, cross_match, and other aligners.

Another part of it stems from my involvement in 1,000 Genomes Pilot 3, for which several participants have applied their own variant detection pipelines to the same dataset. Last month, Goncalo Abecasis, with input from David Craig, Heng Li, Gerton Lunter, and Fiona Hyland, proposed an exercise comparing several read mappers on real and simulated ABI SOLiD and Illumina/Solexa data. The initial list of aligners – Maq, BWA, Stampy, BFAST, BioScope, and KARMA – demonstrated just how rapidly the field has grown since my aligner comparison last year at AGBT. I’d looked at Maq and BFAST, and knew about (but hadn’t tried) BWA, but the others on the list (Stampy, BioScope, and KARMA) were ones I’d never heard of.

I proposed adding three aligners to the list: Bowtie and Novoalign for Illumina data, and SHRiMP for SOLiD data. My suggestions were politely declined by Richard Durbin (WTSI), who said “In our hands Bowtie doesn’t seem accurate enough for variant calling. It is a great tool for fast assignment of reads for some other purposes. Novoalign is accurate and good, but perhaps a little slow. SHRiMP is also I think very slow.”

Personally, I think that Bowtie works very well for variant calling, I know of several groups who are using it for that exact purpose. And while Novoalign *is* a bit slow, in my experience it’s just as fast as Maq, one of the two aligners out of Durbin’s lab that were already on the list. Of course, Maq remains the most widely used tool for Illumina data (for now), and that’s an important consideration. Most NGS analysts know and love Maq as much as I do.

Balancing Speed and Sensitivity

However, these assessments bring into focus the key issue surrounding short read alignment for variant detection – finding the balance between speed and sensitivity. Bowtie and Novoalign exemplify this well. Bowtie is ultrafast – the fastest short read aligner I’ve used – and maps an entire lane (~15m reads) in just 1-2 hours. Yet in my experience, it places slightly fewer reads than BWA/Maq. And it performs only ungapped alignments, so indels won’t be detected. In contrast, Novoalign typically maps more reads than Maq and BWA, seems very accurate, and remains one of the few aligners to allow gaps in fragment-end reads. In general, my comparisons demonstrated that Novoalign speed is comparable to Maq on typical datasets. However, longer reads and lower-quality data can make Novoalign very slow indeed. The ultimate short read aligner, in my opinion, would have Bowtie-like speed, Novoalign-like sensitivity, and the widespread community support that Maq enjoys.

Ask the Guru: Heng Li

Heng Li, who led development of both Maq and BWA, told me that he’s not worried about sensitivity. “Most aligners nowadays are sensitive enough,” Heng wrote to me in an e-mail this week. “For detecting variations, specificity is of more importance. Nonetheless, how much wrong alignments may contribute to wrong SNPs is an open question. As long as alignment errors are random, more wrong alignments may not necessarily lead to worse SNP calls.” Clearly, he has already given some thought to these issues. If we’re lucky, Heng Li may begin to address these open questions in his new post at the Broad Institute.

Underlying Causes of False Positives

Read mis-alignment would not be a serious problem if it occurred randomly across the genome. The trouble is that wrong alignments don’t seem to be random, at least in my experience. In projects like TCGA Ovarian, we see numerous false positives (particularly in tumors) that seem to arise from read mis-alignment. These typically manifest as clusters of variants, often present in each of a subset of reads whose true alignment is probably a paralogous region of the genome. It’s also possible that they’re caused by an indel, which (as Kiran Garimella of the Broad Institute recently showed) sometimes manifest as clusters of substitutions at several positions near one another. We can aggressively filter these by looking for clusters of predicted SNPs, but even better would be to remove the mis-alignments before variant calling even begins.

Read Mis-Alignment and Paired-End Sequencing

Here at WashU, we have a growing concern that the alignment scores for short reads are continually over-estimated. Often our manual reviewers find that reads supporting false-positives have mate pairs that align to a different chromosome altogether. In the absence of translocation events, when this occurs, one of the two reads is incorrectly placed, and any variant it supports is probably not real. Personally, I’d rather remove both reads in such situations, and rely on correctly mapped read pairs for detection of small variants.

The pervasive spread of paired-end sequencing is beginning to reveal just how often short aligners can get it wrong. The corollary here is that taking read pair information into account during alignment is of critical importance, and those hopeful short read aligners that don’t do it yet (crossmatch, for example) are destined for inferiority.

High-Throughput Sequencing: Speed Matters

Yet what I’m learning from discussions with others in the community – particularly the growing surge of users making the leap from Maq to BWA – is that speed matters. With Illumina machines cranking out 20 gigabases in a single run, and projects like the 1,000 Genomes generating terabytes of sequence over the course of months, we can’t afford to be using the slower aligners, no matter their sensitivity. At worst, we might apply a two-stage approach to alignment that rapidly maps reads that precisely match the reference, and passes only the variant reads to a more sensitive aligner for mapping.

Of course, as a colleague of mine recently joked, by the time we write the perfect aligner, Pac Bio will have come along and sequenced the entire genome, kilobases at a time.

The Search for Somatic Changes

October 29, 2009 by Dan Koboldt

As cancer genome sequencing ramps up here and pretty much everywhere around the world, I got to thinking about strategies for identifying somatic changes, with confidence, from massively parallel sequencing data. As part of the Cancer Genome Atlas Project (TCGA), we’ve been applying both targeted (capture-based) and whole-genome sequencing approaches to tumor samples and matched normal controls. Ideally, the resulting data will yield high (>20x) coverage in both tumor and normal across our positions of interest. What happens next, at least at WashU, is the culmination of a multiple-year effort to develop a comprehensive pipeline for detecting somatic variants.

First up: Single Nucleotide Variants (SNVs)

With more than 15 million entries in dbSNP, single nucleotide polymorphisms (SNPs) remain the most common form of DNA sequence variation in humans. In cancer, most of the well-characterized somatic mutations are single nucleotide changes as well. Conceptually, SNVs should be the easiest things to find in next-gen sequencing data. They occur at a single position that can be directly compared between tumor and normal. They should have minimal effects on sequence alignments to the reference genome. For example, here’s a putative somatic variant in TP53:

What you see above is SAMtools “pileup” output at a single position (7518990 on chr17), for Normal and Tumor. The Normal shows 4 reads that all support the reference on the – strand (,,,,). The Tumor, however, shows 6 reads that all support a G variant, 2 on the + strand (GG) and 4 on the – strand (gggg). It seems reasonable that, given this output across the entire genome for Normal and Tumor, one can compare them at every position and look for differences such as these.

Yet we struggle to validate even high-confidence SNVs that look to be somatic. Some are real, but Germline (probably under-sampled or missed in the Normal); most are simply false positives in the tumor. These might arise from a number of causes – homopolymers, paralogs, repeats, sequencing error, alignment error, etc. Only a small fraction of variants that appear somatic in NGS data will validate as such.

Why is that? In general, it’s because by screening for somatic variants, we remove all of the variants that are most likely to be real. First, we exclude any variants that are present in the normal (germline) – which account for the majority of true sequence variations. We also exclude known variants from dbSNP and 1,000 Genomes databases, which are also likely to be real but almost certainly germline events. Then, we prioritize variants that are predicted to have functional effects – on protein coding, on splicing, in conserved regions, etc. Such regions are often under negative selection for damaging mutations, meaning that variants should be exceedingly rare. Every one of these filters selects for variants that are less likely to be valid.

Small Indels

With longer (>50 bp) fragment-end reads and/or paired-end libraries, it’s possible to detect small insertion/deletion variants (indels) in next-gen sequencing data. Here, detection and specificity are the challenges. In 454 data, the reads are [hopefully] of sufficient length (250 bp) for accurate gapped alignment to a reference sequence, and indeed, aligners commonly used with 454 data (Newbler, BLAT, cross_match, SSAHA2) do so. Unfortunately, indels are both the strength and the weakness of 454 data – due to the underlying pyrosequencing, homopolymeric regions are often under- or over-called, resulting in numerous false positives. Many can be filtered, but often homopolymer-associated errors cause mis-alignment of reads, yielding indels that might not look like homopolymer artifacts.

Indel detection is also possible with Illumina data, though the shorter read lengths make this challenging. Few short read aligners can handle the throughput of Illumina data and allow for gaps in read alignments, because speed and gapped alignment are at odds with one another. Fortunately, paired-end sequencing on Illumina offers a solution implemented by Maq some time ago – first, map all reads that you can without gaps, and then, look for gapped alignments in unplaced reads whose mate is mapped nearby. This reduces the search space considerably for gapped alignment, and also limits the query space to reads that likely contain indels (gaps).

In cancer sequencing, small indels present one additional problem – determining whether they are present in the normal. Even the best aligners can’t always precisely define where an indel starts or stops. Thus, a germline indel might have different coordinates in the tumor than in its matched control; when comparing the samples, it might appear to be somatic.

Loss of Heterozygosity (LOH)

It is well known that the genomes of tumor show extensive loss of heterozygosity (LOH). Generally, this occurs because a position that is heterozygous in the germline is affected by some kind of structural event – deletion, gene conversion, chromosome loss, etc. – that results in the loss of one allele. Of course, to detect LOH, one needs a variant that’s heterozygous in the Normal, and to precisely define the region of LOH, one needs a dense set of heterozygotes. Even so, the maximum precision for the start and stop of an LOH region is the interSNP distance, since only SNPs can inform on LOH, and that can be hundreds or thousands of bases. But LOH calls do tend to cluster, and detection of LOH regions is not really the problem. Even lower-resolution array technologies identify recurrent LOH regions in tumor samples.

But what exactly does LOH mean in terms of cancer development and growth? It’s hard to say. Quite possibly, a tumor suppressor gene was deleted, or an oncogenic allele was duplicated. Unfortunately, LOH regions tend to be kilobases or megabases in size, containing dozens or hundreds of genes, and identifying which ones are truly affected in terms of cancer remains challenging. We see a lot of LOH in cancer, but sadly, it never seems to get anyone excited.

Structural and Copy Number Variation

Image Credit: Wikipedia

Last and most difficult to characterize are the sub-microscopic structural changes – insertions, deletions, inversions, translocations, duplications, etc. – that often occur in tumor genomes. These tend to be large, complex events that are tough to infer from NGS data. We run Ken Chen’s breakDancer, of course, and it predicts numerous SVs. But how do you validate a massive, complex variant spanning thousands of bases? We do our best with PCR and 3730/454 sequencing, but until read lengths get really really long (perhaps on single-molecule sequencing), validating such events and determining their breakpoints is tough.

There are well-characterized, recurrent copy number alterations in cancer, like EGFR amplification on chromosome 7. Here’s my question: where are all of those extra copies? Are they just tandem duplications of part of a chromosome, or are they duplications that get inserted elsewhere in the genome? In the absence of a complete, linear, high-confidence genome, I’m not sure we can tell.

Fruits of Our Labors

It occurs to me that this is a bit of a negative article – focusing entirely on the challenges and failures, without highlighting the successes. And there are many successes. Every cancer genome tells us something, and every new piece of knowledge goes into our arsenal in the war against cancer. As sequencing ramps up, we’ll see exponential growth in the number of known somatic mutations across a wide array of cancers. With the help of cancer biologists, these data will be leveraged to better understand the genes, proteins, and pathways underlying tumorigenesis. Greater understanding will undoubtedly improve the detection, diagnosis, prognosis, and treatment of cancer patients.

« Previous Page