Capture and Illumina Sequencing of Human Exomes

September 24, 2009 by Dan Koboldt

This month in Nature, a group from Jay Shendure’s lab reported perhaps the most ambitious targeted resequencing study to date – the whole exome sequences of 12 individuals.

Targeted capture and massively parallel sequencing of human exomes

Using an array-based hybridization capture method (2 microarrays, 10 µg of input DNA), Ng et al. selectively targeted CCDS regions totaling 26.6 Mb of sequence (~0.83% of the human genome). Capture specificity was similar to that of other published methods (35-55% of reads mapping to targets), but the completeness was astonishing – on average, 99.7% of target bases covered at least once and 96.3% covered at 8x with q>=30.

By focusing on coding exons, the authors achieved 51x coverage (on average) with just 6.4 Gb of mappable sequence per individual. Illumina 76-bp single-end sequencing was the platform of choice. If I make some rough empirical estimates of mapping rate and reads per lane, they generated a single Illumina run of data (7-8 lanes) per individual. Compared to whole-genome sequencing, the authors claim a 20-fold reduction in the amount of sequence required. I’d say this estimate is pretty close. Our second leukemia genome, which had 23x haploid coverage, took 16.5 Illumina runs to complete.

Strong Illumina Pipeline

It’s not simply the technological feat that impressed me about this study. The presentation of the work and underlying analytical approaches are just outstanding. While reading through the methods, I couldn’t help but think that nearly every step the authors took in processing their data was something that we’ve implemented here – Maq alignment, start site de-duplication, mining Maq-unplaced reads for indels, etc. We have a bit of a friendly rivalry with University of Washington (since we are, after all, Washington University), so I looked for weak points. Try as I might, I couldn’t find much to criticize about the analysis. When it comes to Illumina sequencing, UW seems to know what they’re doing.

How to Write A Nature Paper

And the paper itself is just clear, concise, well-written – everything I’d expect from a Nature publication. Take Figure 1, for example. Figure 1, in general, is the focal point of most research papers, and for that reason I think many authors try to cram way too much into it. Not this time. Four histograms that all have “Number of observations of minor allele” as their X-axis. Yet each one tells a different story: (a) how novel-to-dbSNP variants were rare; (b) how nonsynonymous variant frequencies are shifted to lower values relative to those of synonymous variants; (c) how this shift in allele frequencies is more pronounced for damaging nsSNPs, consistent with natural selection; and (d) how the sizes of observed indels are enriched for non-frameshift events divisible by 3.

Illumina Sequencing and Deduplication

Early into our days of Illumina/Solexa sequencing, we observed a strange phenomenon in the data: lots of reads with identical start sites and orientations. The theory was that these occasional pileups were PCR-related, and each one arose from a single molecule that somehow was sequenced over and over again. Since just about every downstream analysis (coverage, mutation detection, etc.) relies on unbiased read counts, it’s important to normalize for such events. This requires a “de-duplication” step in which multiple reads with the same start site and orientation (presumably the same molecule) are discarded and only one is kept.
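The de-duplication step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the pipeline used in the paper or at our center: the `Read` record, its field names, and the toy reads are all hypothetical, and it assumes the reads are already aligned.

```python
from typing import Iterable, List, NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal record for an aligned single-end read."""
    name: str
    chrom: str
    start: int   # mapped start site of the read
    strand: str  # "+" or "-"


def deduplicate(reads: Iterable[Read]) -> List[Read]:
    """Keep the first read seen for each (chrom, start, strand) key."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.chrom, read.start, read.strand)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


# Toy reads, invented for illustration only.
reads = [
    Read("r1", "chr17", 1000, "+"),
    Read("r2", "chr17", 1000, "+"),  # same start and strand as r1: discarded
    Read("r3", "chr17", 1000, "-"),  # same start, opposite strand: kept
    Read("r4", "chr17", 1001, "+"),
]
unique_reads = deduplicate(reads)
print([r.name for r in unique_reads])  # ['r1', 'r3', 'r4']
```

Which read to keep from a pileup is a design choice. This sketch keeps the first one encountered; a real pipeline might prefer the read with the best base or mapping quality.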

Credit: Nature 461:272-276 (2009)

The implications of this deduplication requirement, as pointed out by Ng et al, are that the maximum read depth for any given position in the genome is twice the read length for single-end libraries. In their case, 152x. One might be concerned that even with de-duplication there would be substantial bias in targeted capture. But look at the bell curve of the coverage distribution from supplemental figure 1 (left).

Someone had better call O’Reilly, because that’s just beautiful data. Importantly, the deduplication paradigm changes somewhat for paired-end sequencing, which is largely what we do here. With paired ends, you have two reads from each molecule, each with a start site and orientation. So the maximum coverage immediately jumps to 4 times the read length. Furthermore, due to the variation in fragment sizes of sheared DNA, insert sizes add further distinction for different molecules, allowing for read depths of 1000x or more after de-duplication for paired-end reads.
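The single-end ceiling of twice the read length can be checked by brute force: count the distinct (start site, orientation) keys whose read would overlap one base. The function name and the arbitrary position below are assumptions made for this sketch, and it covers only the single-end case.

```python
def max_dedup_depth_single_end(read_length: int) -> int:
    """Count distinct (start, strand) keys whose read overlaps one base."""
    position = 10_000  # arbitrary base of interest
    keys = set()
    for strand in ("+", "-"):
        # A read of length L covers `position` when its leftmost aligned
        # base lies in [position - L + 1, position]: L choices per strand.
        for leftmost in range(position - read_length + 1, position + 1):
            keys.add((leftmost, strand))
    return len(keys)


print(max_dedup_depth_single_end(76))  # 152
```

For 76-bp reads this gives the 152x quoted above. Paired-end libraries raise the ceiling because the mate's position adds to the key that distinguishes one molecule from another.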

Identifying Disease-Causing Mutations

What pleased me most about this study is that the authors didn’t just present exome capture and sequencing of “undiseased” individuals. In addition to 8 HapMap samples, they included four samples from unrelated individuals with Freeman–Sheldon syndrome (FSS), an autosomal-dominant disorder caused by mutations in MYH3. After collecting the set of coding variants in each individual, the authors asked a simple question: could we have pinpointed the disease gene from mutation data? With the knowledge in hand that this was a monogenic, autosomal-dominant disorder, the authors assumed that the same gene might be mutated in most (or all) samples. And since the disease itself is uncommon, the authors inferred that common variants could be excluded. So, with the full set of mutations for each affected individual in hand, the authors looked for genes where:

There was at least one (but not necessarily the same) nonsynonymous SNP, splice-site SNP, or coding indel in all four samples.
The mutations were novel; that is, they weren’t found in dbSNP or the other 8 HapMap samples.
The mutations were predicted to damage the encoded protein.

When these criteria were applied, the authors whittled down a list of 4,510 genes with mutations in at least one sample to just 1, and that gene was MYH3. Thus, whole-exome sequencing allowed for direct identification of a disease-causing gene with just a few samples from affected individuals. Granted, the authors got lucky. The causal mutations might have been structural variants (SVs), or missed by variant callers, or not covered sufficiently by sequence data. Or, the disorder might be caused by a single mutation in one of several genes, as is the case with autosomal dominant retinitis pigmentosa (RP), a monogenic disorder for which at least 16 genes have been implicated.

Even so, the authors applied a relatively straightforward approach and got the right answer. With whole-exome sequencing capability within reach, finding the genes behind autosomal disorders is only a matter of time.
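The three gene-level criteria listed above amount to a simple set operation, sketched here in Python. Everything in it is illustrative: the `Variant` record, the effect labels, the sample names, and the toy variant calls are invented for the example and are not data from Ng et al.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical annotated variant call."""
    sample: str
    gene: str
    effect: str     # "nonsynonymous", "splice_site", "coding_indel", ...
    novel: bool     # absent from dbSNP and from the control exomes
    damaging: bool  # predicted to damage the encoded protein


QUALIFYING_EFFECTS = {"nonsynonymous", "splice_site", "coding_indel"}


def candidate_genes(variants, affected_samples):
    """Genes with a qualifying, novel, damaging variant in every affected sample."""
    samples_by_gene = {}
    for v in variants:
        if v.effect in QUALIFYING_EFFECTS and v.novel and v.damaging:
            samples_by_gene.setdefault(v.gene, set()).add(v.sample)
    wanted = set(affected_samples)
    return sorted(g for g, s in samples_by_gene.items() if wanted <= s)


affected = ["FSS1", "FSS2", "FSS3", "FSS4"]  # made-up sample names
toy_variants = (
    # MYH3: a qualifying variant in all four affected samples
    [Variant(s, "MYH3", "nonsynonymous", True, True) for s in affected]
    # GENE_B (placeholder): hit in only three of the four samples
    + [Variant(s, "GENE_B", "coding_indel", True, True) for s in affected[:3]]
    # GENE_C (placeholder): hit in all four, but the variants are already known
    + [Variant(s, "GENE_C", "nonsynonymous", False, True) for s in affected]
)
print(candidate_genes(toy_variants, affected))  # ['MYH3']
```

The variants need not be identical across samples; the filter only asks that each affected individual carries at least one qualifying variant in the same gene.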

References
Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, & Shendure J (2009). Targeted capture and massively parallel sequencing of 12 human exomes. Nature, 461 (7261), 272-6 PMID: 19684571

Genetics of Human Longevity

August 19, 2009 by Dan Koboldt

A new study in PLoS ONE resequenced candidate genes in a cohort of the “healthy oldest-old” – individuals aged 85 or older that are healthy and have never been diagnosed with cancer, cardiovascular disease, Alzheimer’s, pulmonary disease, or diabetes. The idea is that these robust old-timers harbor genetic variants that reduce susceptibility to, or even protect against, the prevalent age-related disorders that tend to shorten lifespans. Demographic data suggest that less than 36% of the population of western nations will live to see 85, and only a third of these (12% overall) will do so while remaining in good health. Since longevity is highly heritable (~25%), it stands to reason that genetics play a key role.

Tortoise Still Winning the Race

Intriguingly, despite the sum of human technological achievements – in agriculture, sanitation, medicine, etc. – our maximum observed lifespan (122 years) is not the longest on the planet even among animals. Indeed, the authors point out that rougheye rockfish, bowhead whales, red sea urchins, and Galapagos tortoises easily outlive us, with lifespans of 150-200 years. We can extend the lifespan of other animals – mice, by putting them on reduced-calorie diets, and C. elegans, by inhibiting expression of insulin/IGF receptor daf-2 – but can’t seem to change our own.

Rounding Up the Usual Suspects (Genes)

Next, the authors selected 24 candidate genes known to be involved in age-related processes. These included genes implicated in dietary restriction (SIRT1/3, UCP2/3, PPARG), autophagy (FRAP1, BECN1), stem cell activation (NOTCH1, DLL1), progeria syndromes (LMNA, ZMPSTE24, KL), tumor suppression (TP53, ING1, CDKN2A), and DNA methylation (TRDMT1, DNMT3A/B). Also included were the human homologs of several genes known to be differentially expressed in long-lived daf-2 mutant worms: IGF1R (growth factor receptor), SCD and APOB (lipid metabolism), and CRYAB and HSPB2 (heat shock proteins). Such an eclectic gene list allowed the authors to screen for variants across a wide range of gene functions and biological pathways that might contribute to longevity.

Ye Olde Candidate Gene Resequencing

Some 716 PCR amplicons were designed to isolate the exons, 5′ and 3′ UTRs, 1.5 kbp promoters, intron-exon junctions, and selected conserved noncoding sequences (CNSs) for each of the 24 genes. Altogether, ~360 kbp of DNA was sequenced bidirectionally, producing a grand total of ~35 million high-quality (phred > 20) bases.

Variant detection with phred/phrap/polyphred and Mutation Surveyor identified 935 sequence variants (848 SNPs and 87 small indels), of which 59% were previously known to dbSNP. Unsurprisingly, the majority of variants found mapped to introns or conserved noncoding regions. About 50 novel coding SNPs were identified, though the authors point out that they were far less common (average MAF 1.6%) than the 80 or so previously known coding SNPs (average MAF 19%).

Tag SNPs: Leveraging the HapMap Resource

Here the authors took a rather puzzling turn and sought to compile a set of longevity tag SNPs by combining their data with the findings of the International HapMap Project. Only 12% of the combined variant set was shared between HapMap and the resequencing dataset, but that’s hardly surprising – HapMap variants were selected on the basis of high frequency (MAF > 5%), whereas many of the novel variants identified in this study were rare (in coding regions, MAF=1.6%). Thus the SNP sets are very likely to complement one another.

The authors selected 682 tag SNPs representing 1,550 non-redundant variants from the combined datasets (using an LD threshold of r-squared > 0.8 for HapMap SNPs and r-squared = 1 for resequencing SNPs). These were used to genotype a larger cohort (493 healthy oldest-old and 439 random controls), but unfortunately, the data were not shown. How disappointing! It seems to me that if the authors had found any significant association between their tag SNPs and longevity, that would have been an important result.
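For readers unfamiliar with tag SNP selection, here is a rough Python sketch of the idea: compute pairwise r-squared from phased haplotypes, then greedily pick SNPs until every variant is tagged at the chosen threshold. This is a generic greedy scheme assumed for illustration, not the procedure the authors used, and the SNP names and haplotypes are made up.

```python
def r_squared(hap_a, hap_b):
    """r^2 between two biallelic SNPs, given 0/1 alleles per phased haplotype."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(a & b for a, b in zip(hap_a, hap_b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    d = p_ab - p_a * p_b
    return d * d / denom


def greedy_tags(haplotypes, threshold):
    """Greedily choose tag SNPs until every SNP is tagged at r^2 >= threshold."""
    untagged = set(haplotypes)
    tags = []
    while untagged:
        def covered(snp):
            return {other for other in untagged
                    if other == snp
                    or r_squared(haplotypes[snp], haplotypes[other]) >= threshold}
        # Pick the SNP that tags the most remaining SNPs (ties broken by name).
        best = max(sorted(untagged), key=lambda s: len(covered(s)))
        tags.append(best)
        untagged -= covered(best)
    return tags


# Toy phased haplotypes (8 chromosomes), invented for illustration.
haps = {
    "snpA": [0, 0, 1, 1, 0, 1, 0, 1],
    "snpB": [0, 0, 1, 1, 0, 1, 0, 1],  # identical to snpA
    "snpC": [1, 1, 0, 0, 1, 0, 1, 0],  # mirror image of snpA
    "snpD": [0, 1, 0, 1, 0, 1, 0, 1],  # only weakly correlated with the others
}
print(greedy_tags(haps, threshold=0.8))  # ['snpA', 'snpD']
```

A singleton variant is rarely in strong LD with any common SNP, which is why tag sets built from common variation tend to leave rare variants untagged.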

Common vs. Rare Variants: Is HapMap Enough?

One conclusion that was perhaps over-emphasized was that HapMap SNPs were inadequate to capture rare variation in the study population. Some 264 of the 935 variants identified by resequencing were singletons, i.e. present in just one individual, and only around 2.5% of these could be captured by HapMap tag SNPs using r-squared of 0.8. The authors conclude that “This shows that HapMap tagSNPs generally do not adequately represent, private re-sequencing SNPs. This analysis highlights a major challenge for genetic association studies. Using only HapMap SNPs, effects due to uncommon variants would often be missed.” Well, yes, but also, duh. HapMap was intended to represent common, and not rare, variation. Far more compelling would have been if the authors found rare variants actually associated with their phenotype of healthy aging. But alas…

The authors raise a fair point in that association studies cannot rely on the HapMap alone. Obtaining the complete picture of genetic variation underlying a phenotype of interest requires a hybrid strategy that includes both common and rare variants. At some point this will require whole-genome resequencing of affected individuals, and for that, we’ll need something more than the 3730.

References
Halaschek-Wiener, J., Amirabbasi-Beik, M., Monfared, N., Pieczyk, M., Sailer, C., Kollar, A., Thomas, R., Agalaridis, G., Yamada, S., Oliveira, L., Collins, J., Meneilly, G., Marra, M., Madden, K., Le, N., Connors, J., & Brooks-Wilson, A. (2009). Genetic Variation in Healthy Oldest-Old PLoS ONE, 4 (8) DOI: 10.1371/journal.pone.0006641

ABI SOLiD Joins the WGS Party

July 1, 2009 by Dan Koboldt

At last published in early access at Genome Research is the whole-genome sequencing of a Yoruban male on ABI SOLiD technology. A year ago, this might have merited a Nature or Science publication. That window seems to have closed for whole-genome sequencing of a single, undiseased individual. By my count, this is the sixth published individual genome sequenced on next-gen platforms. I begin to wonder if this ABI SOLiD paper is too little, too late.


Well, it’s probably not too little. The advance access PDF is over 60 pages, and I must admit that the authors did a substantial amount of work to identify, characterize, and discuss the sequence variation in this genome. Despite a relatively modest coverage level (18x), the combination of paired-end sequencing and two-base encoding made it possible to simultaneously detect SNPs, small indels (3-11 bp), large indels (30 bp-97 kbp), and structural variants.

Two-Base Encoding in Colorspace for Calling SNPs

My central interest, however, is how much the two-base encoding aids distinguishing SNPs from sequencing errors. The ABI SOLiD study identified ~3.8 million SNPs in the genome, compared to 4.1 million SNPs identified by Illumina sequencing of the same individual, an anonymous African male from the HapMap collection. However, the ABI study did it with less than half the coverage (18x compared to 40x), and called a greater fraction of novel-to-dbSNP SNPs (19% compared to 12.7%). Experimental validation confirmed 280 of 299 (94%) of the novel SNPs, suggesting that most of these variants are real.

The authors performed a rather elegant comparison with HapMap data for this individual, by comparing not only SNP genotypes but the phase of the genotypes, which they inferred on the basis of mate pair information. Some 21.74% of HapMap-phased heterozygotes were covered by at least one ABI read pair, and the phase agreement was 98.95%. Thus, the read-pairing strategy employed by ABI can serve to produce more accurate and complete haplotyping of the sequenced individual. I find this side-benefit of whole-genome sequencing to be very valuable, given the huge amount of money and effort spent to build the human haplotype map.

Lots of Indels and Structural Variants

Perhaps the greatest strength of this study is that it represents, to my knowledge, the most extensive and detailed effort to characterize indels/SVs from WGS of a single individual. Small intra-read indels (<=13 bp) had a high dbSNP concordance (67%), perhaps helped by the terminating chemistry and two-base encoding of ABI SOLiD. Using mate pair information to identify discordant insert clones, the authors called 1,515 insertions (30-1,287 bp in size) and 4,075 deletions (86-96,957 bp in size), many of which were also detected in the Venter, Watson, and YH (Han Chinese) genomes.

Cross-WGS Comparisons: Key Illumina Study Ignored

In a direct comparison, 20% of the SNPs identified in the ABI study were also seen in the Watson, Venter, and YH genomes. Fewer structural variants were shared between genomes, but this very well may be related to the difficulty in calling such types of variation on different platforms, rather than true biological diversity. Here’s something I find both irritating and amusing. The ABI study authors made no comparisons whatsoever to the results from the Bentley et al. (Illumina WGS) study, which is surprising since BOTH STUDIES SEQUENCED THE SAME INDIVIDUAL. I refer you to:

“We sequenced the genome of a male Yoruba from Ibadan, Nigeria (YRI, sample NA18507).” [Bentley et al]

“We compared the SNPs and structural variations identified in NA18507 to those found in the Venter (Levy et al. 2007), Watson (Wheeler et al. 2008) and YH (Wang et al. 2008) genomes.” [McKernan et al].

I’m sorry, but when you do whole-genome sequencing on an individual that’s been sequenced already on a different technology, you have to do that comparison. Whatever their reasons, the ABI study authors’ decision to blatantly avoid comparisons with the Bentley et al. results is outright negligence.

Functional Consequences of Genetic Variation

The authors embarked on a long exploration of the putative phenotypic impact of variants in NA18507 using OMIM and HGMD databases along with a comprehensive literature review. They developed a pipeline to map the poorly-formatted OMIM entries to genomic coordinates, and successfully obtained 9,239 uniquely mapped nonsynonymous OMIM variants. I’d hoped for a supplemental table of these, or better yet that the results might be shared back with OMIM, but alas. No dice. NA18507 is apparently a carrier for over 50 disease-associated alleles, including five which appear to be homozygous. These are all listed in supplemental tables 4 and 5; however, no supplemental data appear to be available at present.

There were 2,477 large indels in NA18507 that potentially disrupted genes. Among 2,015 genes affected, some 303 were disease-associated genes from OMIM, HGMD, or the literature review. The authors conclude “we can see a trend for disruption events to cluster around genes, but no clear preference to cluster around disease genes. Further analysis of these disruption events along with an evaluation of whether an exon is disrupted is warranted.” This is why individual HapMap genomes no longer merit Nature papers. Without a phenotype to study, “further investigation is warranted” is as far as such studies can go to assess the functional impact of many mutations.

Signatures of Natural Selection

All gripes aside, the study did provide evidence of purifying selection, notably an under-representation of damaging nsSNPs, and an under-representation of variation inside exons in general. Using the Panther database, the authors identified several protein families with evidence of purifying selection (fewer than expected damaging nsSNPs) – nucleic acid binding proteins, ligases, transferases, transcription factors, and of course kinases. There were also categories over-represented for damaging nsSNPs, which may reflect either higher mutation rates or positive selection. These included G-protein coupled receptors, extracellular matrix glycoproteins, cell adhesion molecules, as well as genes related to olfactory perception. Ah yes, sense-of-smell diversity.

The Outlook for ABI SOLiD

With a high-profile publication of an individual human genome, ABI SOLiD officially joins the ranks of WGS-enabling platforms. In my opinion, they’re a little late to the game. I recall seeing a poster presenting much of this data about a year ago, and even that was after Illumina had taken the lead in whole genome sequencing. According to a report by Julia Karow on Genomeweb, SOLiD accounts for just 17% of next-gen sequencers at major genome centers, just ahead of Roche/454 (14%) but well behind Illumina, which claims 2/3 of the market. ABI can’t compete with 454 on read length, and it can’t compete with Illumina on data throughput or market share. In short, SOLiD needs to find a niche, and find it quickly, or this platform will go the way of the dodo.

References
McKernan, K., Peckham, H., Costa, G., et al. (2009). Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two base encoding Genome Research DOI: 10.1101/gr.091868.109

HapMap Continues to Bear Fruit

April 22, 2009 by Dan Koboldt

You might have thought that the 1,000 Genomes Project would render the International HapMap obsolete. But just yesterday I heard a talk about how some groups are still leveraging the HapMap resource in numerous ways to better understand the relationship between genotype and phenotype. The speaker was Wei Zhang, a postdoc at the University of Chicago who’s published an astonishing 25 papers in the last 2 years.

One key advantage of the HapMap samples is the availability of transformed cell lines for all samples at Coriell. This allows researchers to assess various phenotypes with cell-based assays (e.g. gene expression, drug toxicity) and then mine the rich HapMap genotype dataset to perform genotype-phenotype associations. In a collaboration with Affymetrix, Zhang and his colleagues measured gene expression in 87 CEU samples and 89 YRI samples using the Human Exon 1.0 ST array, which captures ~1.4 million annotated exons from ~18,000 transcript clusters in the human genome. The data are available in the SCAN Database hosted at the University of Chicago.

Differentially Expressed Genes and SNP Association

The researchers found ~9,100 expressed genes in the CEU and YRI samples, including 383 that were differentially expressed between the populations (247 had higher expression in YRI than CEU, 136 had higher expression in CEU than YRI). Next, they used sample-level data in each population to correlate expression of those 383 genes with SNP genotypes. They successfully identified 75 genes with significant expression-genotype correlations, 11 of which were in cis (same chromosome within 2.5 Mb) and 64 of which were in trans.
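The cis/trans distinction used above reduces to a distance check. The Python sketch below assumes the 2.5 Mb window is measured from the nearest gene boundary, which is a guess on my part since the talk did not spell that out; the function name and coordinates are hypothetical.

```python
CIS_WINDOW_BP = 2_500_000  # 2.5 Mb, the window quoted in the post


def classify_association(snp_chrom, snp_pos, gene_chrom, gene_start, gene_end):
    """Label a SNP-gene pair 'cis' or 'trans' (window measured from gene edges)."""
    if snp_chrom != gene_chrom:
        return "trans"
    if gene_start <= snp_pos <= gene_end:
        distance = 0  # SNP lies inside the gene
    else:
        distance = min(abs(snp_pos - gene_start), abs(snp_pos - gene_end))
    return "cis" if distance <= CIS_WINDOW_BP else "trans"


# Made-up coordinates for illustration.
print(classify_association("chr1", 1_000_000, "chr1", 2_000_000, 2_050_000))    # cis
print(classify_association("chr1", 1_000_000, "chr1", 10_000_000, 10_050_000))  # trans
print(classify_association("chr2", 1_000_000, "chr1", 2_000_000, 2_050_000))    # trans
```

Under this definition, a SNP on the same chromosome but more than 2.5 Mb from the gene counts as trans, just like a SNP on a different chromosome.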

Isoform Variation

Isoform variation was also detectable in the exon array data – by examining expressed genes with 3 or more exons, the researchers could compare probe intensities for each exon to see if any were differentially expressed. They identified a number of genes with differential isoform expression between YRI and CEU populations, and when they performed GO analysis, the most enriched gene category was, interestingly, genes that encode splicing factors.

SNPs, Gene Expression, and Pharmacogenetics

The Chicago group also performed a number of cell-based assays on the HapMap samples to measure toxicity induced by a number of anti-cancer drugs. In this case their phenotype was IC50, the drug concentration at which cell growth was inhibited by 50%. Such a drug study seems ideal for the HapMap samples since they happen to be transformed (i.e. continuously proliferating) cells. They measured IC50 for several types of anti-cancer agents (6 total), including DNA antimetabolites, platinating agents, and topoisomerase II (TopoII) inhibitors.

First, using the HapMap trio (mother-father-child) information in the CEU panel, Zhang and colleagues determined the “heritability” of IC50, which proved to be high (values in the 0.3-0.4 range) for all of the drugs. This provides more evidence for what seems to be an accepted fact: pharmaceutical response is a phenotype with a significant inheritable genetic component.

What they did next was very interesting: they performed an integrated analysis of HapMap genotypes, gene expression, and drug response to identify predictors of drug-induced toxicity. Zhang described their method as a “triangle approach”: first, SNPs were associated with drug response, then those SNPs were analyzed with the expression data to determine if any were also associated with gene expression. The correlated genes were then compared back to the response data, to see if any were also associated with drug response. As a result, they’re able to identify SNPs that influence gene expression which in turn influences response to the drug. Genotype-mechanism-phenotype. I like it.

As an example of their findings, Zhang presented a SNP in GALNTL4 that was associated with response to Cisplatin, which I presume is a platinating agent. SNP genotypes were correlated with expression of GALNTL4, and that in turn was correlated with IC50 to Cisplatin. But here’s what I liked most about this example: the SNP they presented was intronic. It’s another reminder that it’s time to look outside the exons, people!
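The logic of the “triangle approach” can be sketched as three chained association tests. The Python below is a toy version under loud assumptions: a plain Pearson correlation cutoff stands in for whatever statistical tests and multiple-testing corrections the Chicago group actually used, and the SNP names, gene names, and measurements are all invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def triangle(genotypes, expression, ic50, min_abs_r=0.8):
    """Return (snp, gene) pairs that pass all three sides of the triangle."""
    hits = []
    for snp, g in genotypes.items():
        if abs(pearson(g, ic50)) < min_abs_r:        # side 1: SNP vs drug response
            continue
        for gene, e in expression.items():
            if abs(pearson(g, e)) < min_abs_r:       # side 2: SNP vs expression
                continue
            if abs(pearson(e, ic50)) >= min_abs_r:   # side 3: expression vs response
                hits.append((snp, gene))
    return hits


# Toy data for six cell lines, invented for illustration.
genotypes = {"snp1": [0, 0, 1, 1, 2, 2], "snp2": [0, 2, 1, 0, 2, 1]}
expression = {
    "geneX": [1.0, 1.2, 2.1, 1.9, 3.0, 3.2],
    "geneY": [2.0, 2.1, 1.9, 2.0, 2.1, 1.9],
}
ic50 = [10.0, 9.5, 6.0, 6.5, 3.0, 2.5]

print(triangle(genotypes, expression, ic50))  # [('snp1', 'geneX')]
```

Filtering on drug response first keeps the expression step small, since only SNPs that already show an association with IC50 are tested against the expression data.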

Future Directions: miRNA and Methylation

Efforts are currently under way at the University of Chicago to measure two more cell phenotypes on the HapMap samples. One is micro-RNA (miRNA) expression, which they’re assessing with something called the Exiqon miRCURY platform. The other is DNA methylation, as measured by ChIP-chip assays with CpG antibodies. I seem to recall that another group has already identified methylation-associated SNPs using HapMap data, but even so, I look forward to what Zhang and his colleagues will find.
