Capture and Subassembly with Jay Shendure

Yesterday our 2010 Genetics Seminar Series kicked off with Jay Shendure (Univ. Washington) whose twelve-exome paper landed in Nature late last year. His talk covered three very different applications of next-generation sequencing: high-throughput mutational studies of core promoters, sub-assembly of Illumina reads to 454-length contigs, and exome capture to unravel Mendelian disorders.

Mutational Profiling

First, Dr. Shendure described some interesting experiments under way in his lab to elucidate the function of non-coding regulatory variants – specifically, single nucleotide changes in the core promoter that alter gene transcription. The approach is called “saturation mutagenesis” and involves generating every possible mutant in a construct, and then assaying the effect of each construct on transcription. By leveraging high-density Agilent arrays and next-generation sequencing, Shendure and his colleagues performed saturation mutagenesis in vitro in high-throughput fashion. Their process involves three steps:

Synthesize mutant constructs on an Agilent array. The oligos (probably ~150 bp) include the core promoter region surrounding a gene’s transcription start site (TSS). They generate a single mutation (SNP or single-base indel) per construct, and label each construct with a sequence barcode downstream of the TSS.
Cleave mutant templates from the array, amplify, and sequence on Illumina to measure relative construct abundance.
Perform in vitro transcription, then Illumina RNA-Seq, to measure the expression of each construct.

Dr. Shendure noted that there was some sequencing bias between barcodes, so they used multiple barcodes (6) per mutant construct and normalized the results. Then, by combining the construct abundance data (Seq) and the expression data (RNA-Seq) for mutants and comparing them to the results for the wild-type construct, they could assess the functional impact of each synthesized mutation on transcription.

As far as results go, Dr. Shendure showed a histogram: on the X-axis was each base of the core promoter region that they evaluated, and on the Y-axis, the effect of mutating that position on transcription. Most of the values were negative, indicating that mutations reduced transcriptional activity, particularly around the TATA box and INR site. Essentially, the plot neatly described the footprint of RNA polymerase binding, with the most effective mutations centered on the TSS. Intriguingly, the single-base deletion mutants consistently showed the greatest reduction of transcription, suggesting, perhaps, that indels in promoter regions are likely to be functional variants.

Short Read Subassembly

The next area of interest was very pertinent to groups with access to next-generation sequencing, but not the 454 “length matters” platform. While Illumina read lengths are still growing (most groups currently run 75- or 100-bp protocols), they still cannot rival the ~450 bp reads consistently produced on 454 Titanium. And yet, many applications of NGS benefit from longer reads – de novo assembly, metagenomics, and the core promoter assays I’ve just described, to name a few. Thus, Shendure and his group sought to combine some Tech D cleverness with Illumina’s incredible read depth to generate localized assemblies of kilobase-length fragments.

First, they sheared DNA into fragments that were a few kilobases long, ligated adapters to the ends of each fragment, and did a round of amplification. Now they had many copies of each fragment with adapters on each end. The fragments are concatemerized, then somehow randomly sheared to variable-length pieces of the original fragment such that each piece has one of the original adapters on one end. A new adapter is ligated to the sheared end. Then there’s another round of PCR, followed by Illumina paired-end sequencing. The resulting paired-end reads (75-mers) have a “read2” that’s the same for all pieces of the same kilobase-fragment, but a read1 that comes from some random location within the fragment.

Then, it’s possible to perform a localized assembly for each kilobase fragment. It’s an interesting approach, but here’s the problem: after assembly, in their proof-of-principle experiment, they achieved a median contig size of 350 bp. Granted, the per-base quality was very high (85% of bases had Q>40), but the lengths are unimpressive. As Dr. Shendure joked, they managed to get similar read lengths to a 454 run and make it cost just as much. There’s still a lot of work to do. Or they could just pick up one of those cute little GS-Juniors.

Human Exomes and Mendelian Disease

Finally, Dr. Shendure gave an overview of last year’s elegant Nature paper, in which exome sequencing of four individuals, followed up by careful downstream informatics, correctly identified the causative gene. Their defined “exome” was 30 Mb, which they targeted using two solid-phase array capture chips. Illumina sequencing of the exome capture generated about 6.4 gigabases per individual. Exome sequencing makes a lot of sense in certain Mendelian disorders, where (1) the pattern of inheritance, e.g. autosomal recessive, is known, and (2) the causative mutations occur in a single gene.

By sequencing the exomes of multiple individuals, isolating what we’d call “tier 1” variants – Nonsynonymous, nonsense, splice site, or frameshift-indel – and then removing all known common variants from public databases, Dr. Shendure and colleagues can reduce 20,000 gene candidates down to a handful. It worked out beautifully in the Nature paper – all four individuals had rare, tier 1 mutations in the same gene.

But in another cohort (4 individuals from 3 kindreds with Miller syndrome, a rare developmental disorder) Dr. Shendure and colleagues discovered the danger of overfiltering. They removed all variants from dbSNP 129, but when they limited the scope to only mutations predicted to be “damaging” or “deleterious”, the number of genes dropped to zero. Apparently the deleteriousness of at least one of the causal mutations wasn’t predicated correctly.

Obviously, the need is for better filters of common variants. But with projects like the 1,000 Genomes in full swing, I wonder, will filtering out using dbSNP get better, or worse? Already, as Shendure pointed out, certain genes have basically a SNP reported at every position. I know that TP53 does. What’s more, with the advent of next-generation sequencing, I hate to tell you, but people are going to be reporting a lot of false positives. I guarantee it. So when you filter all of the variants, you might actually remove the ones you’re looking for.

References
Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, & Shendure J (2009). Targeted capture and massively parallel sequencing of 12 human exomes. Nature, 461 (7261), 272-6 PMID: 19684571

Trackbacks

GenomeQuest Blog says:

January 18, 2010 at 2:30 pm

Challenges in 1000 Genomes Data…

Variant reports are not the right deliverable for a re-sequencing study.
A well written technical blog ‘MassGenomics‘ written by Dan Koboldt illustrates why. Dan says “What’s more, with the advent of next-generation sequencing, I ha…