Next-generation sequencing instruments might be considered a disruptive technology. The incredible throughput of these machines, even 4-5 years ago, clearly mandated the development of a new generation of algorithms and data formats capable of storing, processing, and analyzing huge amounts of sequence data. One key achievement in next-generation sequencing bioinformatics was the specification of sequence alignment/map format (SAM) and its binary equivalent (BAM). These formats were widely adopted by a community of scientists desperate to have a common format in which to store next-gen sequencing reads and their alignments to a reference. BAM files quickly became a standard for the Cancer Genome Atlas, the 1,000 Genomes Project, and other large-scale sequencing efforts. The formats were accompanied with a software package, SAMtools, that is probably the most pervasive tool for next-gen sequencing in the world.
To aid in variant calling and other analyses, SAMtools can generate a pileup of read bases using the alignments to a reference sequence. There’s a lot you can do with pileup-like output, and indeed, SAMtools variant calling is quite popular. The actual command is samtools mpileup, and here are five things that you should know about it.
1. SAMtools mpileup has permanently replaced pileup.
Replaced as in the latter command no longer works. This is generally a good thing; mpileup can do nearly everything that pileup could, and a lot more. You still use it to generate pileup output for a single sample. However, some features have gone away, such as simple consensus calling with the -c parameter, and the option to output mapping qualities for each base (I think that was -k). Consensus calling can be done in mpileup with a couple of extra steps using bcftools; see the mpileup page for details.
2. Base alignment quality (BAQ) computation is turned on by default.
BAQ is a phred-like score representing the probability that a read base is mis-aligned; it lowers the base quality score of mismatches that are near indels. This is to help rule out false positive SNP calls due to alignment artifacts near small indels. There have been recent suggestions, however, that BAQ may be too strict and cause real SNPs to be missed. Several users of the VarScan variant caller have reported that its read counts disagree with what is seen in IGV, or somatic mutations were missed when mpileup was used instead of pileup. These issues are almost always due to BAQ’s downgrade of base qualities to 0 or 1. This adjustment can’t be seen in IGV, but it’s below VarScan’s default base quality threshold. You can disable BAQ with the -B parameter, or perform a more sensitive BAQ calculation with -E. I’ve heard that the latter option will be turned on by default in the next version of SAMtools.
3. Analyze multiple samples at once.
The principal feature to SAMtools mpileup is the ability to analyze data from multiple samples simultaneously. You do this by providing more than one BAM file. This feature is nice because it provides data across all samples on a per-position basis. The first three columns (chromosome, position, and reference base) are the same. Following those, you get three columns per BAM file indicating the read depth, bases, and base qualities for that sample at that position. The VarScan mpileup2cns command will take this raw input and call a genotype for each sample, as well as a consensus genotype based on the data from ALL samples. This is useful for detecting variants in low-coverage regions by leveraging data across samples. You can also use the bcftools pipeline for multi-sample calling.
4. Rule out false positive with strand bias or poor mapping.
Many groups working variant calling in next-generation sequencing have independently converged on several key factors that influence false positive rates. The VarScan 2 paper, for example, describes 9 empirically-derived filtering criteria that we use to identify and remove artifacts. The strand representation and number of mismatches in supporting reads are two important indicators of false positive arising from systematic alignment artifacts. SAMtools mpileup helps users address these issues as well: the -C parameter lets you downgrade the mapping quality of reads with lots of mismatches, and the -S parameter tells SAMtools to report a per-sample strand bias p-value.
5. Random position retrieval that works.
One of the most powerful features of mpileup is that you can specify a region with -r chrom:start-stop and it will report pileup output for the specified position(s). The old pileup command had this option, but took a long time because it looked at all positions and just reported the ones within your desired region. Instead, mpileup leverages BAM file indexing to retrieve data quite rapidly: In my experience, it takes about 1 second to retrieve the pileup for several samples at any given position in the human genome. Multi-sample, rapid random access has lots of uses for bio-informaticians; for example, I can retrieve all bases observed in all samples at a variant of interest to look at the evidence in each sample.
These features are the results of hard work by Heng Li and others who contribute to the development and improvement of SAMtools. It’s great to see a key piece of software under continued, active development, and I think most of us look forward to what the next SAMtools has in store.
References
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, & 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England), 25 (16), 2078-9 PMID: 19505943
Thanks!
Thanks for these valuable comments on Samtools. One question concerning point 2) BAQ: I want to have the “better” BAQ computation using -E option – do I still have to disable normal BAQ computation using -B, or will it be disabled automatically?
JMS, All you should need is -E.
Thanks for this post! I have been looking for something like that for a while. I have a question though. According to the SAMtools manual, for mpileup each one of your samples has to be from a single individual. It is my understanding that the old pileup could handle samples that contained multiple individuals per sample which are very frequently used to address questions in population genetics. Is that true or am I completely misunderstanding how the two commands are different?
Great post. Thanks much.
Another one: For paired-end data, samtools mpileup ignores reads that have an unmapped mate (SAM flag 8 set) by default. If you want to include reads that have an unmapped mate, use the -A option. Dan, maybe the VarScan User’s Manual page should mention the -B and -A options in the “How to build a samtools (m)pileup file” section?
Dan, in the post you make the comment “You can also use the bcftools pipeline for multi-sample calling.” Can you give me a pointer how this is done? I tried this a while ago and could not figure out how to get consensus calls; instead, I always got a vcf with one column for each sample. This contrasts with VarScan, which (as you mention) gives the consensus calls in a straightforward way.