The emergence of next-generation sequencing has presented numerous significant challenges to the bioinformatics community. NGS instruments have given rise to a new generation of software tools for the alignment, assembly, management, and visualization of incredible amounts of data. New algorithms have also been developed to assess coverage, assess genomic copy number, call variants (SNPs/indels), and infer large-scale structural variation.
Regardless of their purpose, most tools for NGS data analysis are under increased demand for the same things:
- Efficiency – in the face of ever-growing throughputs from NGS instruments
- Flexibility – to accommodate new sequencing platforms, experimental protocols, and input formats
- Scalability – to continually improve upon and enhance their features as needs evolve
The definition and widespread acceptance of the Sequence Alignment Map (SAM) as the standard format for representing NGS data was a key development for the field. Aaron McKenna and colleagues at the Broad Institute have just published another advance – the Genome Analysis Toolkit (GATK), a structured programming framework for NGS data anlysis. Essentially, GATK is a foundation of code that takes advantage of the SAM/BAM input format to simplify many of the common requirements for data analysis tools. The core system can accommodate reads from any sequencing platform, as long as they’ve been converted to SAM/BAM format. It therefore supports most sequence aligners, and also recognizes public database formats (HapMap, dbSNP) and some of the common data-exchange file formats (e.g. GLF and VCF). It’s written in Java, which means that the framework is operating-system-independent as well.
GATK implements something called a “mapreduce” paradigm to allow analysis tasks to be performed in parallel. If you’re developing a new analysis tool, there are a few different ways (traversals) to get to the data that’s in a BAM file. For example, if you wanted to compute the average read length, you could use the TraverseReads scheme to pull out every read and walk through them. Alternatively, if you wanted to calculate the average read depth across the genome, you could use the TraverseLoci scheme to pull out information (reference base, read bases, etc.) at every base in the genome. The best part is that you don’t have to write any of the code for indexing, retrieving, and parsing NGS data – that’s already done. You can focus on your analysis tool, while the GATK developers can continually improve the core engine.
Analysis Tools Built on GATK
The authors demonstrate two simple applications that were developed using the GATK framework. The first, a depth-of-coverage tool, took just 83 lines of code to generate a depth-of-coverage report for every position in a given locus (or the whole genome). This might easily be developed into a highly automated, graphic-supported system for reporting coverage on, say, an exome sequencing project. The second demonstration tool was a simple Bayesian genotyper (57 lines), which uses posterior probability to determine the most likely genotype at each position in the reference.
I’m aware of at least two more valuable NGS data analysis tools that were built on this framework. The first is actually the framework’s foundation, Picard (http://picard.sourceforge.net), which contains a number of SAM/BAM parsing elements, but perhaps more importantly, has the widely used “MarkDuplicates” tool for identifying redundant sequences in NGS data. The second tool, one that I’ve recently been evaluating, is the GATK indel genotyper. Given a pair of BAM files from a tumor sample and matched (normal) control, the GATK indel genotyper implements a stringent algorithm to call indels and determine their somatic status (Germline or Somatic) based on the evidence in both files. Optionally, this can be done with local realignment of reads around indel positions, which helps remove some false positive variant calls. Compared to other tools for indel calling, GATK seems to offer greater precision (fewer false positives), while maintaining sensitivity, in the datasets that I’ve tested.
I readily admit that I don’t know enough about parallelization to discuss it in detail, but what I read in the paper seems encouraging. On a single CPU, the simple Bayesian genotyper took something like 14 hours to complete chromosome 1 of a whole-genome sequence using a single CPU. But when offered 12 CPUs, the built-in parallel processing support of GATK brought down execution time almost 12-fold, to about an hour and a half. It strikes me that frameworks such as this, coupled with the latest 4-core, 8-core, even 50-core CPUs, may finally be bioinformatics’ answer to the challenge of massively parallel sequencing.
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, & Depristo MA (2010). The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome research PMID: 20644199