I got an early look at data from IonTorrent’s forthcoming semiconductor sequencing chip, called the 316. The data comes from a single sequencing run of E. coli, specifically the laboratory reference strain DH10B, not to be confused with the hybrid E. coli strain in the European outbreak that IonTorrent sequenced earlier this year. According to the run report, some 1.69 million reads were generated, of which 1.35 million were 100 base pairs or longer. It’s a nice, shiny report with color images, and of course I summarily ignored it in favor of looking at the data myself.
The data package included an SFF file, a data format often used for Roche/454 data that contains read sequences and base qualities. There was also a BAM file, which included the read sequences aligned to the E. coli reference, indexed for random access. And there was a FASTQ file, which is the common sequence-and-quality input format used by many short read aligners. So how does one go about evaluating a new sequencing dataset? This is a relevant question for bioinformaticians, who are often handed a lane or a run’s worth of data and asked, “How does it look?”. Personally, I like to start with the basics: reads, bases, and qualities. I extracted sequence and quality files from the SFF archive, downloaded the latest E. coli reference genome from NCBI, and got to work.
Distribution of Read Lengths
There were 1.69 million reads totaling 175,570,901 bases, which works out to an average read length of about 104 bp. Here’s my own plot of the read length distribution:
The distribution was very tight, peaking right around 100 bases. The longest read was 127 bp. Using BLAT, I mapped 1.63 million reads to the E. coli reference (96.36%), and 1.51 million (about 90%) were uniquely placed. Yes, I could have simply used the alignments provided in the BAM file, but I knew little about how they were generated. I’ll talk about those later. The BLAT-like alignment score distribution closely mirrors the read length:
What this means is that the majority of bases in each read were mapped, and most of the time, they matched the reference. You’d expect a leftward shift in the distribution if the reads contained undesired primer or barcode sequences that didn’t match the reference.
Base Quality by Read Length
Since this is a new sequencing technology, I was keenly interested in base quality along the length of the read. If you’ll recall, early versions of the Illumina/Solexa (1×32 bp or 1×36 bp, remember those days?) showed a decline in base quality as the read went on, with a substantial drop at around 28 bp, while early 454 data had fairly uniform base quality along the length of the read. Thus, I calculated the average base quality from the first 1,600 reads in the IonTorrent FASTA file. This is where things got interesting.
You’ll notice that, like early Illumina/Solexa data, average base quality declines along the length of the read. I should point out that these are variable-length reads, so more values were used for read position 1 than, say, read position 60. Thus, it’s possible that only the last few bases in each read are low-quality, which reduces the average score as you reach the end of the reads. Also notable is the fact that the highest average base quality is 23. Remember, this is the platform’s own estimate of error rate; a score of 20 indicates that 1/100 bases will be wrong. I also noticed a lot of base quality values of 8, and wonder if this is the equivalent of Illumina’s q=2 indicating virtually no confidence in the base.
The E. coli reference genome totals about 4.69 Mbp. With 175 Mbp of data, the theoretical coverage is around 37.5-fold across the E. coli genome. Because this is a laboratory reference strain, we expect it to be genetically homogenous, and essentially identical to the reference sequence. Any apparent SNPs or indels are likely due to sequencing error. To get a handle on this, I used VarScan 1 (which parses BLAT output) to detect SNPs and indels using the ~1.51m uniquely mapped reads. Some 142,920 substitution (SNP) events were detected, reflecting a substitution error rate of 0.092%.The distribution of where those substitutions occurred indicates some interesting biases:
Most substitutions happen near the ends of reads, which is consistent with my earlier observations of base quality. There’s also a notable peak in the first few bases, which is more likely to represent an alignment artifact than true error rates. The average sequence depth across all substitutions was around 36, which is very consistent with the theoretical 37.5-fold coverage genome-wide. Looking at the context of substitutions, if one ignores a homopolymer issue (see below), the most frequent changes were T->C and A>G transitions, which often occurred when flanked by G/C bases.
Homopolymers and Insertion/Deletion Errors
VarScan detected an astonishing 1,122,276 insertion/deletion events, reflecting an indel error rate of 0.726%, or about eight-fold higher than the substitution error rate. VarScan’s heuristic filter immediately removed 983,367 indels (87.6%) because they were clear artifacts in homopolymer runs of 4 or more bases. Even the indels that passed this filter (28,182 insertions and 110,727 deletions) are mostly single base indels matching one of the flanking bases. In other words, an overcall or undercall in a homopolymer. This is very interesting, because I asked Jonathan Rothberg about this precise issue (homopolymer errors) at AGBT 2010, and he answered glibly that they wouldn’t be an issue because there wouldn’t be signal loss from the semiconductor (something about a linear pH change).
Homopolymers are obviously an issue for this platform. In fact, when I went back and looked at the context of substitutions, at least 28% of them were due to an overcall followed by an undercall, e.g. AATT sequenced as AAAT, or vice-versa. Looking at the indel distributions, it seems that C and G were most likely to be under-called (accounting for 58% of all indels), and they were far more likely to be under-called (i.e. bases missed) than over-called.
Applications and Implications
The 454-like homopolymer issue raises some concerns about using this platform for variant discovery. Yet despite the errors, I’m impressed at how rapidly the technology has matured. In just a couple of years, IonTorrent is delivering over 100 Mbp per run. In this run, at least. There’s no guarantee that Dr. Rothberg himself will do the sequencing in future experiments (just an educated guess, based on his initials in the library name). What can you do with 100 Mbp of data? Human genome and exome sequencing are out, obviously, but more targeted efforts (PCR or custom capture) could achieve high read depth. Microbial sequencing is an obvious application as well, particularly when time is of the essence. At a price that’s 1/10 of most NGS systems, IonTorrent seems rather promising.