Mapping Bias in Short Read Alignment

December 11, 2009 by Dan Koboldt

A recent paper in Bioinformatics investigates the effect of read-mapping biases on detecting allele-specific expression (ASE) from RNA-Seq data. The authors generated 16 million 36-bp cDNA reads in each of two HapMap individuals on the Illumina/Solexa platform. When evaluating known SNPs for evidence of ASE, they observed that heterozygous SNPs exhibited a mapping bias favoring the reference allele.

This alone is perhaps not surprising, as we already knew that indels suffer from such bias. Initially, most short read aligners simply ignored gapped alignments. Now, even with aligners like BWA and Novoalign that allow for gaps when mapping short reads, alignments supporting the reference allele (ungapped) will be favored over alignments supporting an indel (gapped). The longer the indel, the larger the gap, and the less likely a short read would be to be mapped across it.

It is easy to see how SNPs might have a similar effect. Clusters of SNPs in close proximity, for example, may result in reads with more mismatches than are permitted by the aligner. In simulations, the authors found that random error (i.e. sequencing error) exacerbated the mapping bias. At an error rate of 0.01, some 51.4% of reads at heterozygous sites supported the reference allele, while an error rate of 0.05 increased the proportion to 59%. My own conclusion based on these results is that a variant allele, combined with nearby sequence changes that result from random error, pushes the mismatch profile of certain reads above the threshold at which alignments are discarded.
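To make that intuition concrete, here is a toy calculation, not the simulation from the paper. It assumes 36-bp reads, a hard cap of two mismatches per alignment, and independent sequencing errors at every base other than the SNP. Those parameter values and the function names are illustrative assumptions. Under this model a variant-allele read has already spent one mismatch on the SNP, so it tolerates one fewer sequencing error than a reference-allele read.

```python
from math import comb

def prob_at_most(n, p, k):
    """P(X <= k) for X ~ Binomial(n, p); returns 0.0 when k < 0."""
    if k < 0:
        return 0.0
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(min(k, n) + 1))

def reference_fraction(error_rate, read_len=36, max_mismatches=2):
    """Expected share of mapped reads carrying the reference allele at a het SNP."""
    other_bases = read_len - 1  # positions besides the SNP itself
    # Reference-allele reads can absorb max_mismatches sequencing errors.
    p_ref = prob_at_most(other_bases, error_rate, max_mismatches)
    # Variant-allele reads already spend one mismatch on the SNP.
    p_var = prob_at_most(other_bases, error_rate, max_mismatches - 1)
    return p_ref / (p_ref + p_var)

for rate in (0.0, 0.01, 0.05):
    print(f"error rate {rate:.2f}: reference fraction {reference_fraction(rate):.3f}")
```

With no sequencing error the two alleles are recovered equally often, and the reference share climbs as the error rate rises. The exact values depend entirely on the assumed read length and mismatch cap, so they should not be expected to reproduce the published percentages.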

SNP-Masking Reveals Inherent Bias

What is surprising in this study by Degner et al is that even after they masked SNP positions in the reference sequence, some 5-10% of SNPs still had an inherent mapping bias favoring one allele. For 1.4% of SNPs, in fact, all of the reads came from a single allele. This obviously has important implications for evaluating ASE in RNA-Seq data, since the relative frequency of alleles from read mapping is used to infer allelic expression. It also affects the now-widespread application of Illumina/Solexa and ABI/SOLiD sequencing to characterize genetic variation from genomic DNA. Because virtually every variant calling algorithm relies on the ratio of reads supporting variant versus reference alleles, an inherent mapping bias favoring the reference allele will reduce the detection sensitivity.
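Since both ASE inference and variant calling come down to the ratio of reads supporting each allele, a small sketch shows how much the assumed null ratio matters. This is a plain exact binomial test, not the method used by Degner et al or by any particular variant caller. The function name and the read counts are hypothetical, and the 0.514 null is borrowed from the simulation figure quoted earlier purely for illustration.

```python
from math import comb

def binomial_two_sided_p(ref_reads, alt_reads, expected_ref_fraction=0.5):
    """Exact two-sided binomial p-value for the observed reference read count."""
    n = ref_reads + alt_reads
    pmf = [
        comb(n, k) * expected_ref_fraction**k * (1 - expected_ref_fraction)**(n - k)
        for k in range(n + 1)
    ]
    observed = pmf[ref_reads]
    # Sum the probabilities of all outcomes no more likely than the observed one.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))

# Hypothetical heterozygous SNP covered by 60 reference and 40 variant reads.
naive = binomial_two_sided_p(60, 40)                                  # assumes no mapping bias
adjusted = binomial_two_sided_p(60, 40, expected_ref_fraction=0.514)  # assumed biased null
print(f"p-value with a 50/50 null: {naive:.3f}")
print(f"p-value with a reference-biased null: {adjusted:.3f}")
```

The same read counts look less extreme once the null expectation is shifted toward the reference allele, which is why an uncorrected mapping bias can masquerade as allele-specific expression.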

Mapping Bias and Sequence Homology

To better understand the causes of inherent mapping bias, the authors investigated some of the most severely affected SNPs. The strongest biases occurred among SNPs in regions of the genome with homology to other locations. When the SNP position was not masked, variant-containing reads matched another locus equally or even better than the true location. When the SNP position was masked, both reference- and variant-containing reads had a 1-bp mismatch to the reference, but either allele might match better elsewhere in the genome. In Figure 3, two examples of such SNPs demonstrate how variant-containing reads either mapped incorrectly or were “not mapped.” Some of these “not mapped” reads may have exceeded the number of allowable mismatches, while others may have become non-unique (i.e. matching multiple places). The authors filtered any alignments with mapping quality of 0, so it’s unclear which caused the mapping failure.

I should point out here that the masking approach may have contributed to this result. The authors “masked” heterozygous SNPs by changing the reference base to a third allele that matched neither reference nor the known variant. A superior approach might be to mask heterozygous SNPs to N, so that any base call at that position is considered a match. This would reduce the number of read mismatches overall, and might help reduce the bias. Then again, some read aligners may consider any base at “N” to be a mismatch, which would have essentially no effect. What might have been interesting, though, is increasing the number and base-quality sum of mismatches allowed by Maq to see if the read bias was removed.
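The difference between these masking schemes is easy to see with a mismatch counter. The sketch below is a simplified illustration, not how Maq or any other aligner scores reads. The 9-bp window, the A/G SNP, and the `n_is_wildcard` switch are all hypothetical.

```python
def count_mismatches(read, reference, n_is_wildcard):
    """Count mismatching positions between a read and an equal-length reference window."""
    mismatches = 0
    for read_base, ref_base in zip(read, reference):
        if ref_base == "N" and n_is_wildcard:
            continue  # this aligner treats N as matching any base
        if read_base != ref_base:
            mismatches += 1
    return mismatches

# Hypothetical window with a heterozygous A/G SNP at the middle position.
ref_read = "ACGTAACGT"      # read carrying the reference allele (A)
var_read = "ACGTGACGT"      # read carrying the variant allele (G)
references = {
    "unmasked":     "ACGTAACGT",
    "third allele": "ACGTCACGT",  # C matches neither A nor G
    "N mask":       "ACGTNACGT",
}

for label, reference in references.items():
    for wildcard in (True, False):
        ref_mm = count_mismatches(ref_read, reference, wildcard)
        var_mm = count_mismatches(var_read, reference, wildcard)
        print(f"{label:12s} N-as-wildcard={wildcard!s:5s} ref={ref_mm} var={var_mm}")
```

Only the unmasked reference treats the two alleles differently. The third-allele mask charges both alleles one mismatch, while the N mask charges both alleles zero or one depending on how the aligner handles N.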

Implications Moving Forward for ASMB

Your reaction might be to shrug, since Illumina/Solexa now routinely generates 76-bp and 100-bp reads. There are, however, a number of reasons why this might not address the bias issue. First, while read lengths are getting longer, alignment “seeds” for short reads are essentially unchanged, and if the SNP occurs in the ~22-25 bp alignment seed, it can still have an effect. Second, many published datasets these days are still based on read lengths of 50 bp or less, especially from groups running ABI/SOLiD or older Illuminas. Third, at least one promising single-molecule sequencer is still generating reads in the 30 bp range. And finally, there’s a practical reason that we’ll continue to see short read datasets: running a 75-bp or 100-bp Illumina flowcell takes several days and multiple kits – expenses of time and dollars that may not always be available. Thus, allele-specific mapping bias (ASMB) [acronym invented, D. Koboldt, 12/11/09] in short reads will remain a key issue in next-generation sequencing.

References
Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, & Pritchard JK (2009). Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics (Oxford, England), 25 (24), 3207-12 PMID: 19808877

Cancer Genomics Meeting in St. Louis

December 3, 2009 by Dan Koboldt

The Genome Center at Washington University is currently hosting a remarkable two-day event focused on the study of cancer genomics. Yesterday there was a symposium on the School of Medicine campus, featuring speakers from major genome centers around the world, who delivered an excellent series of talks on recent advances in cancer genome research. Here were the highlights.

Mutational Signatures in Lung Cancer (Peter Campbell, WTSI)

First up was Peter Campbell from Wellcome Trust Sanger Institute, who presented their first cancer genome – a small-cell lung cancer cell line called NCI-H209. Using the ABI SOLiD platform, they sequenced NCI-H209 and a matched B-cell sample from the same individual to 30-40x coverage. Extensive PCR-based validation yielded almost 23,000 somatic substitutions and over a hundred structural events (indels and rearrangements). As expected, the mutational spectrum was enriched for G->T and C->A changes associated with adduct formation on guanine nucleotides induced by benzopyrene, the chemical mutagen found in tobacco smoke. Dr. Campbell also described some of the complex rearrangements observed in the paired-end sequencing data, which were particularly convincing when overlaid with spectral karyotyping images.

Next-Gen Sequencing Strategies to Study Cancer Genomes (Elaine Mardis, WU)

Next was our own Elaine Mardis, who gave an excellent overview of the strategies developed here to apply NGS to cancer genomes. She described five key elements to success in this arena:

Genomic characterization prior to sequencing. For example, at WashU we type tumor and normal samples on genome-wide SNP arrays, which yield tumor purity/ploidy estimates, LOH information, and a dense set of SNPs for tracking the coverage of genomes by Illumina sequencing.
Resource characterization. The tissue preservation method, DNA/RNA quality and quantity, and pathology information are all critical components. Also important are high-quality clinical data (diagnosis, chemotherapy/radiation protocols, and outcome), informed consent, IRB approval, and additional cases of the same cancer subtype for recurrency screening.
Data production capacity. US genome centers seem to have this, either in the form of Illumina (WashU and Broad) or ABI SOLiD (Baylor). It’s not just the throughput of the machines, either – it’s the ability to construct sequencing libraries from ever-shrinking DNA inputs. Tumor samples are precious, and the ability to use only a tiny amount of DNA or RNA while achieving informative results is one of the key areas of focus of tech development groups.
Informatics and bioinformatics. We have entire groups devoted to LIMS, pipeline automation, medical genomics, and sequence data submission. Other important elements of bioinformatics that Elaine touched on were data display interfaces for collaborators and high-end data storage and computational infrastructure.
Validation and recurrent site screening. This is the essential coup de grace for tumor genome characterization, in which we validate somatic mutations and identify those that are recurrent in other samples of the same subtype, the best indication that we currently have of pathological relevance.

Elaine also discussed the rapid scaling up of TCGA (which is adding 20 tumor types thanks to ARRA funds) and other projects, which will only exacerbate the challenges of scale that NGS platforms have already presented.

Integrating Genomics with Biology (Richard Gibbs, Baylor)

Richard Gibbs gave an action-packed talk on some relevant work going on at Baylor, both for cancer and inherited diseases. They are applying an intriguing if controversial multiple-platform strategy for whole genome sequencing: deep (20-30x) coverage on ABI SOLiD and light (6-10x) coverage on 454. “We’re just telling people that if you do it twice, you’ll get it right,” Dr. Gibbs said. One interesting project is an investigation of Charcot-Marie-Tooth (CMT) syndrome, a recessive inherited disorder where the locus is unknown. Whole-genome sequencing of an affected individual on ABI SOLiD identified a few dozen novel missense mutations; among them lurked the causal variant, which was found to segregate with the disease in a family cohort.

Dr. Gibbs also gave an overview of their investigations into heritable variants in pediatric cancers (in collaboration with MD Anderson). There’s also a lot of work under way for TCGA, not just the 6K capture project, but also adjunct analyses of gene expression, DNA copy number, microRNA, and DNA methylation data being generated on TCGA samples.

Insights into Rare Tumors (Steven Jones, BC Cancer Agency)

Steven Jones from BC Cancer Agency retold the story of the rare tongue adenocarcinoma that I heard at AGBT 2009. What I didn’t know about BCCA is that under the Canadian universal healthcare system, they see all of the cancer patients in the surrounding population of over 4 million citizens. One of these was a rare one – an 80 year old man with adenocarcinoma of the tongue. It was removed surgically, of course, but in a short time metastasized to the lungs. The clinician prescribed erlotinib, an EGFR inhibitor, but unfortunately the patient did not respond. To help the patient, and also make some advances in tech development, Jones and his colleagues did whole-genome and RNA-Seq of the tumor samples and matched normals. There were just four somatic mutations: two in known cancer genes and two in zinc finger proteins (these remain unexplained). Transcriptome and copy number analysis showed that the tumor had loss of PTEN and down-regulation of SMAD4. Unfortunately, it had recently been shown that tumors lacking PTEN and TP53 don’t respond to TK inhibitors like erlotinib. However, this particular tumor showed an amplification of Ret, and as it happened, the drug bank had a single drug, sunitinib, that was known to inhibit Ret. The patient’s response, initially, was quite dramatic – all of the metastases vanished. Sadly, several months later they turned up again, and this time were resistant even to sunitinib. Still, the results of this effort were promising, because genomic information was used to keep cancer at bay, if only for a short time.

Genomic Medicine in Pediatric Brain Tumors (Ching C. Lau, Baylor)

Ching Lau of Baylor presented genomic studies of medulloblastoma (MBM), which accounts for 20% of all brain tumors and has a 60% survival rate. Classification of MBM patients in the past was relatively crude – based on the amount of residual tumor post-surgery and metastatic status. Using gene expression profiling, Lau and colleagues identified 4-5 distinct clusters. Two clusters were associated with known cancer pathways – SHH signaling and WNT activation. The same four clusters could also be isolated by unsupervised miRNA clustering. Also, gene expression analysis showed that ERBB2 expression correlates with outcome (higher expression = poor prognosis).

Finally, Dr. Lau mentioned some future directions for targeted cancer therapy. One of these that I readily admit I don’t understand: cytotoxic T-cells with Chimeric TCRs. Evidently these are T-cells that recognize and attack cancer cells in the body. There was a short movie, courtesy of Dr. Lau’s collaborators, in which we saw these specially programmed immune cells recognizing and attacking a tumor that was roughly four times their size. It was like watching ants swarm a piece of fruit on the sidewalk, and very compelling.

Evolution of a Breast Cancer Tumor (Samuel Aparicio, BC Cancer Agency)

Dr. Aparicio presented a study recently published in Nature and already discussed on Massgenomics. However, he did discuss the continuing challenge of mutation heterogeneity in tumors – we can no longer refer to mutations as present or absent, but instead should report their frequency, which represents the proportion of clones with each mutation. The question of how deep we need to sequence to find the very rare variants has yet to be answered.

Breast Cancer Genomics (Matthew Ellis, Siteman Cancer Center)

Matthew Ellis, our collaborator from the Siteman Cancer Center, presented very recent work we’ve done on a basal subtype breast cancer. A quartet of samples were sequenced in this study – the primary breast tumor, the matched normal tissue, the brain metastasis (from which the patient died), and finally, a mouse xenograft model developed in “humanized” NODSCID mice. We validated some 50 tier 1 mutations, all of which were detected (at some level) in all four samples. Deep read counts for these mutations in each sample revealed some interesting stories about the progression of the cancer from tumor to metastasis.

Genomic Signatures and Cancer (Todd Golub, Broad / Dana Farber)

Todd Golub of the Broad Institute and Dana Farber Cancer Center presented his group’s work on Hepatocellular Carcinoma (liver cancer), which is the fifth most common cancer worldwide. It’s a disease of growing concern on the African and Asian continents, and presents numerous challenges. Molecular classification “is a mess,” Dr. Golub said, and recurrence is common. The problem is that there are few frozen samples with long-term outcome information. Thus, Dr. Golub and his group applied the Illumina DASL assay – which enables very small, highly multiplexed, locus-specific PCR – to perform expression profiling in formalin-fixed paraffin-embedded (FFPE) samples. They achieved up to 90% success across 6,000 genes in samples that were 25 years old. Doing so opened up a vast bank of viable samples for gene expression profiling, from which Dr. Golub and colleagues made some interesting findings.

The AML Genome (Tim Ley, WashU and Siteman Cancer Center)

Tim Ley gave the last talk, which highlighted the work that he and colleagues at WashU began around a decade ago on the disease acute myeloid leukemia (AML). Our goal, he said, was to find 95% of the mutations that occur in at least 5% of AMLs. To do so will require whole genome sequencing of at least 30 genomes, according to statistics from my colleague Mike Wendl. Two of these (AML-1 and AML-2) are already done and published, and a number of others are currently under way. One intriguing bit of work that Dr. Ley described was on the “Mouse APL” project, a knock-in mouse with the PML-RARA gene fusion backcrossed 10+ generations to CBL/BL6 mice. This yielded inbred strains of mice, some of which developed AML after ~6 months, presumably after acquiring “cooperative” mutations. One mouse was sequenced to 15x coverage, and among the handful of somatic nonsynonymous mutations found, one was recurrent, not only in the APL mice, but also in the same gene in human tumors.

Crossbow: NGS Informatics in the Cloud

November 23, 2009 by Dan Koboldt

Just online at Genome Biology is a new paper from the Steven Salzberg lab (UMD) on searching for SNPs with cloud computing. Using $85 of computing time rented from Amazon’s EC2, Langmead et al processed an entire human genome – 3.3 billion reads totaling 38x coverage – in three hours.

The “Cloud” Can Be Nebulous

Cloud computing is a term bandied about often these days. What it boils down to is this: places with huge banks of computers (Providers, e.g. Amazon) rent out processing time to people who need it (Users). The “cloud” refers to a software layer between providers and users that acts like a virtual operating system – it loads any software needed by the user, and also provides an access point for running highly parallelized tasks on the cluster. Next-gen sequencing data is well suited to this kind of processing, since a large NGS dataset can usually be broken into smaller subsets (e.g. Illumina lanes) and processed at the same time on different computers, without affecting the results.

Map, Sort, Reduce

Crossbow – the cloud computing software featured in this publication – cleverly breaks down the analysis into a series of map, sort, and reduce steps. It takes a large sequencing dataset, breaks the reads into subsets, and maps them to the human genome using Bowtie (map). Then, it divides the 3.2 gigabase human genome into 1,600 non-overlapping 2-megabase partitions and assigns every mapped read to a bin (sort). The SNP caller, in this case SOAPsnp, is applied to each of these smaller bins rather than to the entire genome (reduce).
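The sort and reduce steps can be sketched in a few lines. This is a toy illustration of the binning idea only, not Crossbow's code. The function names are hypothetical, `len()` stands in for the SNP caller, and bins are assigned per chromosome, which is an assumption about how the partitions are laid out.

```python
from collections import defaultdict

PARTITION_SIZE = 2_000_000  # 2-megabase bins, as described above

def partition_key(chrom, pos):
    """Return the (chromosome, bin index) partition that a 0-based position falls in."""
    return (chrom, pos // PARTITION_SIZE)

def sort_into_partitions(alignments):
    """Sort step: group (chrom, pos, read_id) alignments by partition, ordered by position."""
    bins = defaultdict(list)
    for chrom, pos, read_id in alignments:
        bins[partition_key(chrom, pos)].append((pos, read_id))
    for reads in bins.values():
        reads.sort()
    return bins

def reduce_partitions(bins, caller):
    """Reduce step: run a per-bin function (a stand-in for the SNP caller) on each partition."""
    return {key: caller(reads) for key, reads in sorted(bins.items())}

# Pretend output of the map step: where each read aligned.
alignments = [
    ("chr1", 150, "r1"),
    ("chr1", 2_000_010, "r2"),
    ("chr1", 90, "r3"),
    ("chr2", 5, "r4"),
]
bins = sort_into_partitions(alignments)
depth_per_bin = reduce_partitions(bins, len)  # len() stands in for SOAPsnp here
print(depth_per_bin)
```

Because each bin is independent of the others, the reduce calls can be handed to separate machines, which is what makes the approach a good fit for a rented cluster.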

The Need for Parallelization

The CHB dataset is ~3.3 billion reads, with an average read length of 35 bp. Even with Bowtie’s multi-threading and incredible speed, this massive dataset would take months to process on a single computer. However, the authors divided the input reads into smaller subsets and aligned them in parallel, then processed the 2-Mbp genome “bins” in parallel as well. Throw all of these parallel tasks at Amazon’s Elastic Compute Cloud (EC2), and it eats them up. The high-performance EC2 cluster (40 nodes, each with 8 CPUs and 7 GB of RAM) finished all of the tasks in about 3 hours.

Digging into the Numbers

There are a couple of inconsistencies in the numbers that need to be ironed out. For example, the BGI study reported 36X coverage from 3.3 billion reads (2.02 billion single-end, 658 million paired-end), whereas Langmead et al downloaded 2.7 billion reads from the “YanHuang Site” and noted that it represented 38X coverage. Where did that extra 2X come from? Langmead et al do cite the Nature paper by Wang et al, and I believe it’s the same dataset.

At first I was concerned that the Salzberg group had only downloaded the mapped reads and run them, which would have been a biased test of alignment performance. However, I don’t believe this is the case. Instead, I believe they meant to say that they’d downloaded 2.02 billion single-end reads, and they’d also downloaded 657 million read pairs (1.314 billion paired-end reads). This would yield the correct total of 3.3 billion reads. I realize this is nitpicky.

More of a concern and hopefully less nitpicky are the SNP calling numbers. Langmead et al reported over 21% more SNPs (3.73 million) than BGI did (3.07 million) on the same dataset, and attributed the difference to less stringent filtering. Yet both groups used the same SNP caller, so is it possible that the Bowtie alignment, not the SNP filters, was responsible for what we presume are false positives? This is an important question that Heng Li and others are already considering.
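For anyone who wants to check the arithmetic in this section, here it is spelled out. The inputs are the figures quoted in this post (35-bp average read length, a 3.2-gigabase genome). Both are rounded values, so the coverage estimate is approximate.

```python
single_end_reads = 2.02e9   # single-end reads
read_pairs = 657e6          # paired-end fragments (two reads each)
avg_read_length = 35        # bp, as quoted for the CHB dataset
genome_size = 3.2e9         # bp

# Counting each pair once gives the "2.7 billion" figure.
reads_pairs_counted_once = single_end_reads + read_pairs
# Counting both mates of each pair gives the "3.3 billion" figure.
reads_pairs_counted_twice = single_end_reads + 2 * read_pairs

coverage = reads_pairs_counted_twice * avg_read_length / genome_size
snp_excess = (3.73e6 - 3.07e6) / 3.07e6

print(f"pairs counted once:  {reads_pairs_counted_once / 1e9:.2f} billion reads")
print(f"pairs counted twice: {reads_pairs_counted_twice / 1e9:.2f} billion reads")
print(f"approximate coverage: {coverage:.1f}x")
print(f"extra SNPs relative to BGI: {snp_excess:.1%}")
```

The two read totals differ only in whether a pair counts as one read or two, and the coverage estimate lands in the mid-30s, which does not by itself explain the 36X versus 38X discrepancy.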

Whole-genome Sequencing Analysis for the Masses

I like the Salzberg group because they’re all about the small lab, about putting NGS processing capabilities into the hands of people without substantial computing resources. Bowtie made it possible to map a lane of Illumina/Solexa data in a few hours, using only a laptop with 4 GB of RAM. Now, Crossbow lets anyone with $85 in their budget run entire WGS datasets on borrowed (or rented) CPU time. There’s no need to purchase, maintain, or continuously upgrade expensive computing hardware. Even the storage space can be rented (e.g. from Amazon S3, which the authors used). It is literally now possible for someone to analyze an entire human genome while sitting with their laptop at the local coffee house.

References
Ben Langmead, Michael C. Schatz, Jimmy Lin, Mihai Pop and Steven L. Salzberg (2009). Searching for SNPs with cloud computing Genome Biology, 10 (R134) : doi:10.1186/gb-2009-10-11-r134

Short Read Aligners and Variant Detection

November 6, 2009 by Dan Koboldt

In recent weeks I’ve had conversations with many people in the NGS community who are attempting to call variants, accurately, in Illumina/Solexa data. Part of it stems from VarScan, my SNP and indel caller for next-gen sequencing data that works with Bowtie, Novoalign, cross_match, and other aligners.

Another part of it stems from my involvement in 1,000 Genomes Pilot 3, for which several participants have applied their own variant detection pipelines to the same dataset. Last month, Goncalo Abecasis, with input from David Craig, Heng Li, Gerton Lunter, and Fiona Hyland, proposed an exercise comparing several read mappers on real and simulated ABI SOLiD and Illumina/Solexa data. The initial list of aligners – Maq, BWA, Stampy, BFAST, BioScope, and KARMA – demonstrated just how rapidly the field has grown since my aligner comparison last year at AGBT. I’d looked at Maq and BFAST, and knew about (but hadn’t tried) BWA, but the others on the list (Stampy, BioScope, and KARMA) were ones I’d never heard of.

I proposed adding three aligners to the list: Bowtie and Novoalign for Illumina data, and SHRiMP for SOLiD data. My suggestions were politely declined by Richard Durbin (WTSI), who said “In our hands Bowtie doesn’t seem accurate enough for variant calling. It is a great tool for fast assignment of reads for some other purposes. Novoalign is accurate and good, but perhaps a little slow. SHRiMP is also I think very slow.”

Personally, I think that Bowtie works very well for variant calling; I know of several groups who are using it for that exact purpose. And while Novoalign *is* a bit slow, in my experience it’s just as fast as Maq, one of the two aligners out of Durbin’s lab that were already on the list. Of course, Maq remains the most widely used tool for Illumina data (for now), and that’s an important consideration. Most NGS analysts know and love Maq as much as I do.

Balancing Speed and Sensitivity

However, these assessments bring into focus the key issue surrounding short read alignment for variant detection – finding the balance between speed and sensitivity. Bowtie and Novoalign exemplify this well. Bowtie is ultrafast – the fastest short read aligner I’ve used – and maps an entire lane (~15m reads) in just 1-2 hours. Yet in my experience, it places slightly fewer reads than BWA/Maq. And it performs only ungapped alignments, so indels won’t be detected. In contrast, Novoalign typically maps more reads than Maq and BWA, seems very accurate, and remains one of the few aligners to allow gaps in fragment-end reads. In general, my comparisons demonstrated that Novoalign speed is comparable to Maq on typical datasets. However, longer reads and lower-quality data can make Novoalign very slow indeed. The ultimate short read aligner, in my opinion, would have Bowtie-like speed, Novoalign-like sensitivity, and the widespread community support that Maq enjoys.

Ask the Guru: Heng Li

Heng Li, who led development of both Maq and BWA, told me that he’s not worried about sensitivity. “Most aligners nowadays are sensitive enough,” Heng wrote to me in an e-mail this week. “For detecting variations, specificity is of more importance. Nonetheless, how much wrong alignments may contribute to wrong SNPs is an open question. As long as alignment errors are random, more wrong alignments may not necessarily lead to worse SNP calls.” Clearly, he has already given some thought to these issues. If we’re lucky, Heng Li may begin to address these open questions in his new post at the Broad Institute.

Underlying Causes of False Positives

Read mis-alignment would not be a serious problem if it occurred randomly across the genome. The trouble is that wrong alignments don’t seem to be random, at least in my experience. In projects like TCGA Ovarian, we see numerous false positives (particularly in tumors) that seem to arise from read mis-alignment. These typically manifest as clusters of variants, often present in each of a subset of reads whose true alignment is probably a paralogous region of the genome. It’s also possible that they’re caused by an indel, which (as Kiran Garimella of the Broad Institute recently showed) sometimes manifests as a cluster of substitutions at several positions near one another. We can aggressively filter these by looking for clusters of predicted SNPs, but even better would be to remove the mis-alignments before variant calling even begins.
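A cluster filter of the kind described above might look like the following sketch. It is not the filter used in VarScan or in any production pipeline. The window size, the threshold, and the function name are arbitrary assumptions, and the positions are taken to come from a single chromosome.

```python
def flag_clustered_snps(positions, window=10, max_snps=2):
    """Return the positions that sit in a window holding more than max_snps predicted SNPs."""
    ordered = sorted(positions)
    flagged = set()
    start = 0
    for end, pos in enumerate(ordered):
        # Advance the left edge until the window spans fewer than `window` bases.
        while pos - ordered[start] >= window:
            start += 1
        if end - start + 1 > max_snps:
            flagged.update(ordered[start:end + 1])
    return flagged

# Hypothetical SNP calls on one chromosome: three within 8 bp, the rest isolated.
calls = [100, 103, 107, 500, 900, 1200]
suspect = flag_clustered_snps(calls)
print(sorted(suspect))
print([pos for pos in calls if pos not in suspect])
```

A filter like this throws away real variants in genuinely polymorphic regions along with the artifacts, which is one more reason to prefer fixing the alignments upstream.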

Read Mis-Alignment and Paired-End Sequencing

Here at WashU, we have a growing concern that the alignment scores for short reads are continually over-estimated. Often our manual reviewers find that reads supporting false-positives have mate pairs that align to a different chromosome altogether. In the absence of translocation events, when this occurs, one of the two reads is incorrectly placed, and any variant it supports is probably not real. Personally, I’d rather remove both reads in such situations, and rely on correctly mapped read pairs for detection of small variants.
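Removing both mates of such pairs is simple if the alignments are in SAM format, where column 3 (RNAME) names the reference a read aligned to and column 7 (RNEXT) names the reference of its mate. The sketch below is a hypothetical filter, not part of any WashU pipeline. The helper names and toy records are invented, and it ignores flags entirely.

```python
def mates_on_different_chromosomes(sam_line):
    """True if a SAM record and its mate are aligned to different reference sequences."""
    fields = sam_line.rstrip("\n").split("\t")
    rname, rnext = fields[2], fields[6]
    if rname == "*":
        return False  # the read itself is unaligned, so there is nothing to compare
    # "=" means the mate is on the same reference; "*" means no mate information.
    return rnext not in ("=", "*", rname)

def drop_interchromosomal_pairs(sam_lines):
    """Keep header lines and every read whose pair does not span two chromosomes."""
    bad_names = {
        line.split("\t")[0]
        for line in sam_lines
        if not line.startswith("@") and mates_on_different_chromosomes(line)
    }
    return [
        line for line in sam_lines
        if line.startswith("@") or line.split("\t")[0] not in bad_names
    ]

def make_record(name, chrom, pos, mate_chrom, mate_pos):
    """Build a minimal tab-separated SAM record (toy values for the unused fields)."""
    return "\t".join([name, "0", chrom, str(pos), "30", "36M",
                      mate_chrom, str(mate_pos), "0", "*", "*"])

records = [
    make_record("pairA", "chr1", 1000, "=", 1200),
    make_record("pairA", "chr1", 1200, "=", 1000),
    make_record("pairB", "chr1", 5000, "chr5", 800),
    make_record("pairB", "chr5", 800, "chr1", 5000),
]
kept = drop_interchromosomal_pairs(records)
print([line.split("\t")[0] for line in kept])
```

Filtering by read name drops both ends of a suspect pair, even if only one record happens to show the discrepancy.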

The pervasive spread of paired-end sequencing is beginning to reveal just how often short read aligners can get it wrong. The corollary here is that taking read pair information into account during alignment is of critical importance, and those hopeful short read aligners that don’t do it yet (cross_match, for example) are destined for inferiority.

High-Throughput Sequencing: Speed Matters

Yet what I’m learning from discussions with others in the community – particularly the growing surge of users making the leap from Maq to BWA – is that speed matters. With Illumina machines cranking out 20 gigabases in a single run, and projects like the 1,000 Genomes generating terabytes of sequence over the course of months, we can’t afford to be using the slower aligners, no matter their sensitivity. At worst, we might apply a two-stage approach to alignment that rapidly maps reads that precisely match the reference, and passes only the variant reads to a more sensitive aligner for mapping.
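The two-stage idea can be illustrated with a toy example: a hash lookup handles reads that match the reference exactly, and only the leftovers go to a slower mismatch-tolerant scan. This is a sketch of the concept rather than a description of how any existing aligner works. The function names, the tiny reference, and the one-mismatch limit are all assumptions, and the reads are assumed to share one length.

```python
def build_exact_index(reference, read_len):
    """Map every read_len-mer in the reference to the positions where it occurs."""
    index = {}
    for pos in range(len(reference) - read_len + 1):
        index.setdefault(reference[pos:pos + read_len], []).append(pos)
    return index

def sensitive_align(read, reference, max_mismatches):
    """Slow fallback: scan every offset and keep placements within the mismatch limit."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append(pos)
    return hits

def two_stage_align(reads, reference, max_mismatches=1):
    """Place exact-match reads by lookup; send only the remainder to the slow scan."""
    index = build_exact_index(reference, len(reads[0]))
    placements = {}
    for read in reads:
        hits = index.get(read)  # stage 1: exact matches only
        stage = "exact"
        if hits is None:
            hits = sensitive_align(read, reference, max_mismatches)  # stage 2
            stage = "sensitive"
        placements[read] = (stage, hits)
    return placements

reference = "ACGTTGCAAGCTTAGGCTA"
reads = ["TTGCAAGC", "TTGCTAGC"]  # the second read carries one substitution
placements = two_stage_align(reads, reference)
for read, (stage, hits) in placements.items():
    print(read, stage, hits)
```

One caveat of this design is that an exact hit in the first stage is accepted without checking whether a near-identical locus elsewhere would make the placement ambiguous, which is exactly the kind of shortcut that can feed the mis-alignment problem discussed above.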

Of course, as a colleague of mine recently joked, by the time we write the perfect aligner, Pac Bio will have come along and sequenced the entire genome, kilobases at a time.
