Mapping Bias in Short Read Alignment

December 11, 2009 by Dan Koboldt

A recent paper in Bioinformatics investigates the effect of read-mapping biases on detecting allele-specific expression (ASE) from RNA-Seq data. The authors generated 16 million 36-bp cDNA reads in each of two HapMap individuals on the Illumina/Solexa platform. When evaluating known SNPs for evidence of ASE, they observed that heterozygous SNPs exhibited a mapping bias favoring the reference allele.

mapping-bias-header

This alone is perhaps not surprising, as we already knew that indels suffer from such bias. Initially, most short read aligners simply ignored gapped alignments. Now, even with aligners like BWA and Novoalign that allow for gaps when mapping short reads, alignments supporting the reference allele (ungapped) will be favored over alignments supporting an indel (gapped). The longer the indel, the larger the gap, and the less likely a short read would be to be mapped across it.

It is easy to see how SNPs might have a similar effect. Clusters of SNPs in close proximity, for example, may result in reads with more mismatches than are permitted by the aligner. In simulations, the authors found that random error (i.e. sequencing error) exacerbated the mapping bias. At an error rate of 0.01, some 51.4% of reads at heterozygous sites supported the reference allele, while an error rate of 0.05 increased the proportion to 59%. My own conclusion based on these results is that a variant allele, combined with nearby sequence changes that result from random error, pushes the mismatch profile of certain reads above the threshold at which alignments are discarded.

SNP-Masking Reveals Inherent Bias

What is surprising in this study by Degner et al is that even after they masked SNP positions in the reference sequence, some 5-10% of SNPs still had an inherent mapping bias favoring one allele. For 1.4% of SNPs, in fact, all of the reads came from a single allele. This obviously has important implications for evaluating ASE in RNA-Seq data, since the relative frequency of alleles from read mapping is used to infer allelic expression. It also affects the now-widespread application of Illumina/Solexa and ABI/SOLiD sequencing to characterize genetic variation from genomic DNA. Because virtually every variant calling algorithm relies on the ratio of reads supporting variant versus reference alleles, an inherent mapping bias favoring the reference allele will reduce the detection sensitivity.

Mapping Bias and Sequence Homology

To better understand the causes of inherent mapping bias, the authors investigated some of the most severely affected SNPs. The strongest biases occured among SNPs in regions of the genome with homology to other locations. When the SNP position was not masked, variant-containing reads matched another locus equally or even better than the true location. When the SNP position was masked, both reference- and variant-containing reads had a 1-bp mismatch to the reference, but either allele might match better elsewhere in the genome. In Figure 3, two examples of such SNPs demonstrate how variant-containing reads either mapped incorrectly or were “not mapped.” Some of these “not mapped” reads may have exceeded the number of allowable mismatches, while others may have become non-unique (i.e. matching multiple places). The authors filtered any alignments with mapping quality of 0, so it’s unclear which caused the mapping failure.

I should point out here that the masking approach may have contributed to this result. The authors “masked” heterozygous SNPs by changing the reference base to a third allele that matched neither reference nor the known variant. A superior approach might be to mask heterozygous SNPs to N, so that any base call at that position is considered a match. This would reduce the number of read mismatches overall, and might help improve the bias. Then again, some read aligners may consider any base at “N” to be a mismatch, which would have essentially no effect. What might have been interesting, though, is increasing the # and base-quality-sum of mismatches allowed by Maq to see if the read bias was removed.

Implications Moving Forward for ASMB

Your reaction might be to shrug, since Illumina/Solexa now routinely generates 76-bp and 100-bp reads. There are, however, a number of reasons why this might not address the bias issue. First, while read lengths are getting longer, alignment “seeds” for short reads are essentially unchanged, and if the SNP occurs in the ~22-25 bp alignment seed, it can still have an effect. Second, many published datasets these days are still based on read lengths of 50 bp or less, especially from groups running ABI/SOLiD or older Illuminas. Third, at least one promising single-molecule sequencer is still generating reads in the 30 bp range. And finally, there’s a practical reason that we’ll continue to see short read datasets: running a 75-bp or 100-bp Illumina flowcell takes several days and multiple kits – expenses of time and dollars that may not always be available. Thus, allele-specific mapping bias (ASMB) [acronym invented, D. Koboldt, 12/11/09] in short reads will remain a key issue in next-generation sequencing.

References
Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, & Pritchard JK (2009). Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics (Oxford, England), 25 (24), 3207-12 PMID: 19808877

Comments

Luke says

December 12, 2009 at 3:13 am

Coding ambiguous bases as a third, unrelated base is a bit of a nasty hack that makes their results very hard to interpret. I’d expect virtually all of the ASMB to result from the masking technique.

The natural solution, that doesn’t seem like it’d be hard to implement, would be for the reference to contain IUPAC ambiguity codes. Then you’d just need to make a relatively minor change to the alignment code to make them consider these a match to either base.
sm says

December 14, 2009 at 11:39 am

I find novoalign to be a very sensitive mapping tool. Trying their analysis on novoalign would be interesting to see as well.

I agree with Luke – reference contains IUPAC ambiguous codes; the only thing to check here is, how much does this change the unique mappability/sequencability of the genome.
lh3 says

December 14, 2009 at 1:34 pm

The idea of incorporating IUPAC code has been there for quite some time. It is may be easy to implement for some algorithms, but harder for others. I used to think about supporting IUPAC reference in bwa, but it turns out to be very complicated and in the end I gave up given limited time. I know mosaik and novoalign support IUPAC code. Are there others?

The reference bias is present not only in expression profiling, but also in SNP calling. The bias tend to be higher given short error-prone reads. I guess when reads are >100bp long, the bias should be fairly small. just a guess. Might be wrong.
Colin says

December 16, 2009 at 2:22 am

If you’d like to try Novoalign it already handles IUPAC codes in the reference sequence. It’s just a matter of putting the right codes in the reference before running alignments. An alignment to an N will score 6 vs 30 for a full mismatch and alignment to an IUPAC code for 2 NAs (e.g M, R, W,..) will score 2 points, a lot less than a full mismatch.
graveley says

December 23, 2009 at 8:04 am

While this study definitely indicates that there will be a bias if the analysis is done in the manner that the authors conducted it, I am very concerned that the paper is being interpreted to mean that all allele-specific RNA-Seq experiments will have this bias, which it does not. The mapping bias observed is a consequence of the approach taken. If one instead separately maps all reads to each haploid reference genome (which one must have and is becoming increasingly possible), then assigns the reads to one genome or the other (or to map equally to both), then the bias is eliminated.
JFD says

December 29, 2009 at 10:18 am

Many thanks to D.K. for highlighting our paper on this blog and to everyone for the feedback. As Ih3 has mentioned, the read mapping programs that were available to us at the time did deal well with IUPAC ambiguity codes (e.g., randomly assigning one of the 4 canonical bases to a location marked as N or completely disallowing alignments overlapping ambiguity codes) This left us with few options other than developing a new read-mapping algorithm or masking the SNP locations artificially with a non-segregating base.

It is great to hear that some of the read mapping algorithms are now supporting ambiguity codes. However, I am not yet convinced that this will offer a complete solution to the problem (although I agree it will help a great deal and plan to post a more complete analysis later). Take for example the SNP in figure 3A of our paper. The 3rd non-reference read is a perfect match to a location that shares homology with the SNP location. Thus, even if the mapping algorithm were to consider both the reference and non-reference alleles as perfect matches to the true location, there would still be a bias at this SNP as reads like this would not map unambiguously. It seems to me that approaches such as that suggested by Gravely might suffer in the same way.

I think that simulations of short reads around SNP locations will remain a useful tool for identifying the locations for which ASMB will be a problem even as reads get longer and mapping programs improve.

Finally, our intention with this paper was certainly not to suggest that all next-gen sequencing studies must suffer from these biases, but only to describe the problem we encountered so that others could avoid some of the puzzlement we endured while trying to think this problem through. As we worked through this problem, we found that some seemingly sensible solutions did not work as well as we had hoped.
J Pickrell says

December 30, 2009 at 12:10 pm

To follow up on JFD’s comment, consider this following situation:

A region of the genome (region 1) has an A/G SNP. An homologous region of the genome (region 2) is identical, but has a G at the position where there’s a SNP in region 1.

Consider a read from region 1 carrying the A allele–it matches perfectly back to region 1, and will be correctly assigned. Now consider a read from region 1 carrying the G allele. No matter how you do the mapping (with/without ambiguity codes, or mapping to both genomes as suggested by Gravely), it’s unclear whether this read came from region 1 or region 2, simply because it perfectly matches both places. If this read is binned, only reads from the A allele will be mapped in region 1, leading to a mapping bias.

This is independent of how the masking/mapping is done–one allele is unique in the genome, and the other is not. It’s unclear how a mapping algorithm could deal with this situation efficiently.
graveley says

January 3, 2010 at 9:07 pm

Bias in the situation presented by Pickrell can be dramatically reduced by using paired-end reads.
lh3 says

October 22, 2010 at 11:02 am

I happen to read this great paper again, and think we might be able to get less bias. In the paper, the authors always regard a masked site as a mismatch. They have shown this does not help much. However, if we always regard a masked site as a match, I believe we will not have any biases in Fig 3. When we do this, A-ref/nonRef-read-1/2 are mapped correctly, A-ref/nonRef-read-3 unmapped (no bias); all six reads in Fig. 3B will be mapped correctly (no bias again). This masking strategy (regarding masked sites as matches) also solves J Pickrell’s example because at the SNP site, no reads can be mapped.

To implement this idea, the aligner in use must give GR a full matching score, not a reduced score. Novoalign is not suitable. I do not know how Mosaik and Gsnap are working.