CSHL 2010: Genomes Get Personal

September 22, 2010 by Dan Koboldt

Last week I attended the third annual “Personal Genomes” meeting at Cold Spring Harbor. The meeting opened with a keynote talk by NHGRI director Eric Green, who reminded us that finding the pathway to genomic medicine is the central mission of NHGRI. He mentioned several of the past successful initiatives that have yielded key findings concerning human genetic variation and its relationship to phenotype: The HapMap Project (common variation), the ENCODE Project (functional variation), and the 1,000 Genomes Project (rare variation), to name a few. He showed the absolutely stunning growth of the NHGRI-hosted genome-wide association study (GWAS) catalog, which currently holds ~2,600 associations from 780 publications.

Dr. Green also discussed the dichotomy of genetic architecture underlying human diseases. He took the position that while we’ve made substantial progress studying rare, monogenic, Mendelian disorders (predominantly caused by coding mutations), we face a more daunting task with common, complex, multigenic diseases, because he believes that these arise primarily from noncoding mutations.

Theme 1: Human Mutation Rates

Several talks addressed the topic of mutation rate in human genomes. Donald Conrad, who will be joining the WashU Genetics Department next year, presented mutation rate as a quantitative trait based on 1,000 Genomes Project trio data. Three of the primary sources of variation in mutation rate are age (males have 3x-6x higher rates), environment, and genetic variation (e.g. inherited aging disorders).

Lee Hood gave an excellent keynote on “Systems Genetics and P4 Medicine”, part of which was a discussion of mutation rate. His group uses whole-genome sequencing (WGS) of family cohorts (in this case, the Miller syndrome family quartet), focusing on the ~2.3 Gbp of non-repetitive reference sequence. Using the family information and inheritance modeling, they identify de novo mutations in the offspring, which manifest as errors of Mendelian inheritance. Validation using a custom capture array for 60,000 candidate sites followed by deep sequencing showed that only 1/1,000 “new” mutations in the offspring were real; the vast majority proved to be sequencing errors. That works out to a mutation rate of 1.1 x 10^-8 per base per generation, or roughly 70 mutations per child.
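For what it's worth, the "roughly 70 mutations per child" figure can be sanity-checked with one multiplication. The sketch below is my own back-of-the-envelope check, not anything shown in the talk; it assumes a diploid genome of about 6.4 Gbp (two haploid copies of ~3.2 Gbp), and the function name is just illustrative.

```python
def expected_de_novo_mutations(rate_per_base, diploid_bases):
    """Expected new mutations per child = per-base rate x bases inherited."""
    return rate_per_base * diploid_bases

RATE = 1.1e-8          # per base per generation, as reported in the talk
DIPLOID_BASES = 6.4e9  # ASSUMPTION: 2 x ~3.2 Gbp haploid genome (not from the talk)

per_child = expected_de_novo_mutations(RATE, DIPLOID_BASES)
print(f"{per_child:.0f} expected de novo mutations per child")
```

Under that genome-size assumption the product lands right around 70, which is consistent with the number quoted above.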

Lynn Jorde (Univ. of Utah) later gave a talk on directly estimating human mutation rate by WGS, also using the Miller syndrome quartet. Sequencing by Complete Genomics yielded >50x coverage per subject; there were ~4 million positions in the 1.8 Gbp of “useful” reference sequence in which at least one subject differed from the reference. Only 330,000 or so SNPs were novel (not known to dbSNP), and 20% of these proved to be sequencing errors. More array validation, more calculations, and the same answer as given by Dr. Hood: a mutation rate of 1.1 x 10^-8.

Theme 2: Personal Cancer Genomes

Cancer genomes were another focus of the meeting. Sean Grimmond (Univ. of Queensland, Brisbane, Australia) presented some of his group’s work on pancreatic cancer as part of the International Cancer Genome Consortium (ICGC). Pancreatic cancer is one of the deadliest forms of cancer; about 90% of patients diagnosed die within one year. The Brisbane group has assembled a very nice workflow from sample collection to sequencing that includes pathology review, tumor dissection, QA, and microarray analysis to determine tumor cellularity. The sequencing strategy (WGS, exome, and RNA-seq) differs between high-cellularity (70-100%) and low-cellularity (~30%) tumors. The ultimate deliverable is a “tumor report” documenting cellularity estimates, microarray findings, cytogenetics, what sequencing was done, and what mutations were found.

James Brugarolas (UT Southwestern Medical Center) described the genome evaluation and functional studies of a patient with clear cell renal carcinoma. I learned a bit more about this form of cancer – 85% of tumors prove to be the “clear cell” carcinoma; common lesions include 3p loss (VHL gene) and 5q35 gain. This particular tumor underwent Illumina whole-genome sequencing to 35x coverage; some 46 somatic mutations were validated. One of these was in a gene whose protein product complexes with mTOR, the central player in a known cancer pathway. The tumor was successfully xenografted to a mouse model; some 43/46 somatic mutations were retained, and all had higher frequencies (similar to our findings on basal-like breast cancer). The xenograft let them test a few different cancer drugs – erlotinib (an EGFR inhibitor that had no effect), sunitinib (the front-line therapy for these patients, also no effect), and others. Intriguingly, however, the tumor was sensitive to an mTOR inhibitor compound.

Rick Wilson (The Genome Center at Washington University) gave a talk on whole-genome sequencing of leukemia patients at WashU. Of the 50+ leukemia patients sequenced to date, most have fewer than 20 valid protein-altering mutations. For most patients, low-resolution cytogenetic screens are the paradigm for disease classification and treatment decisions. Favorable-risk patients (17% of cases) undergo light chemotherapy. For adverse-risk patients (22% of cases), an allo-matched bone marrow transplant is the standard of care. That leaves a large body of patients (~61%) with “intermediate” risk according to cytogenetics; here, the correct treatment decision is harder to make. Better stratification of intermediate-risk patients is the first goal. Dr. Wilson related a fascinating case study, a 39-year-old female with suspected acute promyelocytic leukemia, in which rapid-turnaround WGS was able to provide an accurate diagnosis that was not obtained by conventional FISH, and ultimately guided her treatment.

Theme 3: Genome Regulation and Epigenetics

Peter Laird (Univ. of Southern California, LA) led us out of the genome to the epigenome with his talk on mining the cancer methylome. He argued that the first steps in oncogenesis may be epigenetic changes, specifically, the dysregulation of genes due to abnormal methylation. Dr. Laird presented what he’s calling the first cancer methylome – a tumor sample and matched normal control that underwent bisulfite treatment and sequencing to ~30x coverage. As expected, bisulfite sequencing yielded very accurate estimates of DNA methylation (r=0.97 with Illumina Infinium) but was able to do so across the complete human genome with base-pair resolution.

Theme 4: Exome Sequencing

There is a ton of exome sequencing going on. I saw at least two posters describing “whole” exome sequencing in 1,000 cases and 1,000 controls. I put “whole” in quotes because it’s not true at this point; people really shouldn’t be going around saying that the “whole exome” was sequenced. It’s more like 80-90% of known genes. Rick Lifton spoke about some of the valuable applications of exome sequencing – finding dominant reproductive lethal mutations, unraveling recessive traits with high locus heterogeneity, characterizing somatic mutations in cancer, and identifying rare variants associated with common disease. He described recently published work in which recessive mutations in WDR62 were linked to severe brain malformations by exome sequencing. Matt Bainbridge gave a nice overview of the exome sequencing currently under way at Baylor. So yes, it turns out that groups outside of WashU are doing exome sequencing too.

Rapid Human Adaptation to High Altitudes

August 31, 2010 by Dan Koboldt

Two studies in the journal Science demonstrated that genes in the hypoxia-inducible factor (HIF) oxygen signaling pathway have undergone strong, recent positive selection in Tibetan highlanders. One study was a genome-wide scan using SNP arrays; the other a large-scale exome sequencing effort. The exome study was particularly interesting; using the Nimblegen 2.1M exon capture array and Illumina GAIIx instruments, Yi et al sequenced the exons of nearly 20,000 genes (92% of CCDS) in 50 unrelated Tibetans.

Exome Sequencing Summary

To my knowledge, this represents the largest published study of human exome sequencing to date. The main text in the report to Science was necessarily brief, so I used the supplemental materials to glean the following information:

Genes Targeted:	18,654
Total Target Size:	34 Mbp
Number of Samples:	50
Data per Sample:	3.4 Gbp
Avg. Read Length:	71 bp
Reads per Sample:	47.87 m
Map Rate (SOAPaligner):	67.79%
Mapped Reads per Sample:	32.45 m
On Target (+/- 500 bp):	68.1%
Avg. Target Depth:	17.58x
Avg. Target Breadth:	95.48%

The production numbers are consistent with a single lane of 2×75 bp reads (3.4 Gbp) per exome. The low mapping rate (68%) is slightly alarming, but I’d guess (hope) that only uniquely mapped reads are counted here. The on-target mapping rate, a measure of capture specificity, was 68%, well within the expected enrichment of large-scale capture technologies.
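Here's a quick way to check that the table's numbers hang together. This is my own arithmetic on the supplemental figures quoted above, not a calculation from the paper, and the function names are made up for illustration.

```python
def yield_gbp(n_reads, read_length_bp):
    """Total sequence yield in Gbp = reads x average read length."""
    return n_reads * read_length_bp / 1e9

def mapped_reads(n_reads, map_rate):
    """Number of reads that aligned, given the overall map rate."""
    return n_reads * map_rate

READS_PER_SAMPLE = 47.87e6  # from the summary table
AVG_READ_LENGTH = 71        # bp, from the summary table
MAP_RATE = 0.6779           # SOAPaligner map rate, from the summary table

total_gbp = yield_gbp(READS_PER_SAMPLE, AVG_READ_LENGTH)
mapped_m = mapped_reads(READS_PER_SAMPLE, MAP_RATE) / 1e6

print(f"{total_gbp:.2f} Gbp per sample")
print(f"{mapped_m:.2f} million mapped reads per sample")
```

Both values should come out close to the 3.4 Gbp and 32.45 million reads listed in the table, so the reported yield and map rate are at least internally consistent.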

Highly Variable Coverage Across Samples

I do feel obligated to point out that while the average target depth was 18x, which seems appropriate for variant calling, the actual target depth varies widely across the 50 samples. Here’s my plot of target coverage breadth (% of bases) by average target depth (redundancy) using data from supplemental table 1:

[Figure: target coverage breadth (% of bases) plotted against average target depth for the 50 exomes, from supplemental table 1]

Almost every sample reaches 90% coverage breadth, but 7 of them have less than 10x coverage on average. This will undoubtedly affect the ability to call variants accurately, though only a statistician might be able to extrapolate the effects of such variable coverage on the study’s outcome.
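For readers new to these two metrics, here is a minimal sketch of how breadth and average depth might be computed from per-base depths over the target. The depth values are toy numbers I made up, and the function is illustrative rather than anything used in the study.

```python
def coverage_summary(depths, min_depth=1):
    """Return (average depth, breadth) for a list of per-base target depths.

    Breadth is the fraction of target bases covered by at least
    `min_depth` reads.
    """
    if not depths:
        raise ValueError("no target bases supplied")
    average = sum(depths) / len(depths)
    breadth = sum(1 for d in depths if d >= min_depth) / len(depths)
    return average, breadth

# Toy data (made up): read depth at ten target bases
toy_depths = [0, 5, 12, 20, 8, 30, 15, 0, 9, 11]

avg, breadth = coverage_summary(toy_depths)
avg10, breadth10 = coverage_summary(toy_depths, min_depth=10)

print(f"average depth {avg:.1f}x, breadth at >=1x {breadth:.0%}")
print(f"breadth at >=10x {breadth10:.0%}")
```

The point of the second call is that breadth depends heavily on the depth threshold: a sample can look well covered at 1x and still leave much of the target below a depth where variant calls are reliable.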

Searching for Selection

To look for evidence of positive selection for altitude, they compared SNP allele frequencies between Tibetans and 40 Han Chinese whose genomes were sequenced to low (4x) coverage as part of the 1,000 Genomes Project. About 100,000 high-confidence SNPs (>99% probability) were called in the Tibetan samples. A subset (53/56) were validated by Sanger sequencing, suggesting that ~95% of sites are valid polymorphisms. Allele frequency estimates showed an excess of low-frequency variants, particularly among nonsynonymous SNPs.

Using synonymous sites in both populations, the population historical modeling estimated that Tibetans and Han Chinese diverged 2,750 years ago, with Han expanding from a small initial population, and Tibetans shrinking from a larger one. Migrational evidence suggests that Han Chinese migrated from the Tibetan region, with recent admixture in the opposite direction.

Exon Targets, Intron Findings

Intriguingly, though the “exome” sequencing strategy focused on coding regions, no amino-acid changing variants differed by more than 6% between Han and Tibetan populations. Fortunately, hybrid selection (capture) also captures some of the noncoding regions that flank target exons. This happens because randomly sheared DNA fragments (200-250 bp) may overlap both exon and intron sequence, yet still have enough sequence overlapping a probe to be captured. This creates a “shoulder” of coverage upstream and downstream of target exons, often in intronic or UTR sequences.
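To make the “shoulder” effect concrete, here is a toy model of my own (not from the paper). It places a fixed-length fragment at every possible position and treats it as captured if it overlaps the exon by at least some minimum number of bases. The 225 bp fragment length sits in the range mentioned above; the 60 bp minimum overlap and the exon coordinates are arbitrary assumptions chosen for illustration.

```python
def capture_coverage(exon_start, exon_end, frag_len=225, min_overlap=60, flank=300):
    """Per-base coverage from every fragment placement that overlaps the
    exon [exon_start, exon_end) by at least `min_overlap` bases.

    ASSUMPTION: overlap with the exon stands in for overlap with a probe.
    """
    lo, hi = exon_start - flank, exon_end + flank
    coverage = {pos: 0 for pos in range(lo, hi)}
    for start in range(lo, hi - frag_len + 1):
        end = start + frag_len
        overlap = min(end, exon_end) - max(start, exon_start)
        if overlap >= min_overlap:
            for pos in range(start, end):
                coverage[pos] += 1
    return coverage

EXON_START, EXON_END = 1000, 1150  # hypothetical 150 bp exon

cov = capture_coverage(EXON_START, EXON_END)
shoulder = [p for p, c in cov.items()
            if c > 0 and not (EXON_START <= p < EXON_END)]

print(f"covered flanking bases span {min(shoulder)}..{max(shoulder)}")
print(f"upstream shoulder: {EXON_START - min(shoulder)} bp")
```

In this simplified model the shoulder on each side extends for the fragment length minus the minimum overlap, with coverage tapering off away from the exon. Real capture behaves less tidily, but the geometry is the same.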

This side-benefit of exome capture proved serendipitous because intronic sequences harbored the most divergent SNP between Han (9% frequency) and Tibetan (87% frequency) populations. The gene in question was endothelial PAS-domain protein 1 (EPAS1), also known as hypoxia-inducible factor 2-alpha (HIF2A). Hypoxia in the name of a candidate gene for high altitude adaptation was a good sign. A protein-stabilizing mutation in EPAS1 had already been linked to erythrocytosis, suggesting a possible link between this gene and red blood cell production.

Even more promising was the fact that another study published in the same issue of Science had pinpointed the same gene by high-density SNP array genotyping. The irony here is priceless: an expensive exome sequencing project finds an intronic SNP, implicating a gene that was just as easily identified by genotyping. Of course, if the relevant haplotypes had been composed of rare variants – ones absent from the Han population and not covered by current SNP arrays – only one group would have identified this gene, and the other would have gone home empty-handed.

Perspective
Storz, J. (2010). Genes for High Altitudes Science, 329 (5987), 40-41 DOI: 10.1126/science.1192481

Reports
Simonson TS, Yang Y, Huff CD, Yun H, Qin G, Witherspoon DJ, Bai Z, Lorenzo FR, Xing J, Jorde LB, Prchal JT, & Ge R (2010). Genetic evidence for high-altitude adaptation in Tibet. Science, 329 (5987), 72-5 PMID: 20466884

Yi X, Liang Y, Huerta-Sanchez E, et al. (2010). Sequencing of 50 human exomes reveals adaptation to high altitude. Science, 329 (5987), 75-8 PMID: 20595611

Not-so-whole Exome Sequencing

June 30, 2010 by Dan Koboldt

There is growing interest in applying next-generation sequencing to targeted regions of interest, particularly the “exome” – the set of coding exons in the human genome. A paper in Genome Biology from Matthew Bainbridge and colleagues at Baylor describes solution-phase exome capture and sequencing of a HapMap sample with just 3 GB of data. The 1,000 Genomes Project recently announced a new pilot study focused on exome sequencing for hundreds of individuals. A few studies of human exome resequencing to identify disease genes have been published, and more are sure to come as genome centers ramp up their exome capabilities.

Yet this week’s In Sequence magazine writes that there are concerns about what exome capture is missing. For example, at CHI’s Beyond Sequencing meeting this week, researchers from NCI reported that current exome capture projects omit some medically important genes, such as insulin, ABO blood group, and HLA. Of course, some of this can be attributed to GC-rich exons and other tough-to-capture regions. The concern is that many RefSeq coding sequences aren’t even targeted by the two commercial platforms – 23% are missing from Nimblegen’s 2.1M array, and 17% are missing from Agilent’s SureSelect (according to the NCI group).

Exome Sequencing on Illumina and SOLiD

Even so, exome sequencing is rapidly reaching maturity. The Baylor study, led by Matt Bainbridge, used a customized Nimblegen solution-phase capture product to target 36 Mbp of consensus coding sequence (CCDS), and sequenced capture libraries on both ABI SOLiD and Illumina GAII platforms. Six individual capture libraries were generated from HapMap sample NA12812. Four were sequenced as technical replicates on SOLiD, while two more libraries went to Illumina single-end and paired-end sequencing.

On average, some 49.6% of mappable reads from the four SOLiD libraries were derived from target regions, with the remainder mapping elsewhere in the genome. The target coverage correlation between the four replicates was 98%, suggesting that reproducibility across capture and SOLiD sequencing was pretty good.

Duplication Rates in Exome Capture

The authors performed a detailed analysis of duplication rates in their data, a metric that is critical to the unique coverage and downstream analysis. The duplication rate for three SOLiD libraries with 3GB of data was ~22%, and highly consistent between replicates. Duplication was higher (~33%) in the fourth SOLiD library, which is not surprising since it had more than three times (10 GB) the data.

Intriguingly, the authors used simulations to demonstrate that the “expected” duplication rates for 3GB and 10GB of data are 14% and 22% by random chance. Set against the observed rates of ~22% and ~33%, that suggests as many as two-thirds of observed duplicates are not artifactual, but chance events.

Paired-end sequencing offers the opportunity to identify duplicates using both reads in a read pair. Theoretically, this should help distinguish artifacts from chance events. Indeed, the authors observed a dramatic difference in duplication rate between the Illumina fragment-end (30.97%) and paired-end (8.3%) libraries, even though both generated about 2.5 GB of data. They surmised that the improved identification of duplicates from paired-end sequencing, not a difference in library construction, was the reason. When pairing information was ignored, the duplication rate in the PE library more than tripled to 27.6%.
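The logic is easy to see in a toy example. The sketch below is my own illustration, not the authors' pipeline: it counts duplicates by a position key, first using only the read 1 start (as with fragment reads) and then using both the read 1 and mate starts. The read pairs are made up, and strand is ignored to keep things simple.

```python
from collections import Counter

def duplication_rate(keys):
    """Fraction of reads (or pairs) sharing a key with an earlier one."""
    if not keys:
        return 0.0
    counts = Counter(keys)
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(keys)

# Toy read pairs (made up): (chromosome, read 1 start, mate start)
pairs = [
    ("chr1", 100, 310),
    ("chr1", 100, 310),  # same start and same mate start: likely a true duplicate
    ("chr1", 100, 295),  # same start, different fragment end
    ("chr1", 100, 330),
    ("chr1", 250, 460),
    ("chr1", 250, 445),
]

# Single-end view: only the read 1 position is available
se_rate = duplication_rate([(chrom, start) for chrom, start, _ in pairs])
# Paired-end view: both ends of the fragment define the key
pe_rate = duplication_rate(pairs)

print(f"single-end duplication rate: {se_rate:.1%}")
print(f"paired-end duplication rate: {pe_rate:.1%}")
```

Fragments that merely start at the same base get flagged when only one end is visible, while the paired-end key only flags pairs that agree at both ends. That is the same direction of effect the authors reported when they ignored pairing information.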

SNP Discovery and HapMap Concordance

Because this was a HapMap sample, the authors were able to compare SNPs identified in sequencing to known genotypes from the HapMap Project. Genotype concordance in the target regions was 82% for 3GB libraries and 92% for 10GB libraries, but importantly, this considered all sites regardless of coverage. When the authors limited comparisons to sites with >=9x unique read depth, concordance was ~95%. That’s still a bit low for my taste, but within the realm of expectation for sequence-to-genotype comparisons.
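To show why the depth filter matters so much, here is a small sketch of a concordance calculation with and without a minimum-depth cutoff. It is my own illustration with made-up genotypes; in particular, counting uncalled sites as discordant in the unfiltered comparison is an assumption on my part, not something I verified in the paper.

```python
def concordance(calls, known, depth, min_depth=0):
    """Fraction of known-genotype sites where the sequencing call matches.

    calls: site -> called genotype (sites may be missing)
    known: site -> known genotype (e.g. from array genotyping)
    depth: site -> unique read depth
    ASSUMPTION: a site with no call counts as discordant.
    """
    sites = [s for s in known if depth.get(s, 0) >= min_depth]
    if not sites:
        return None
    matches = sum(1 for s in sites if calls.get(s) == known[s])
    return matches / len(sites)

# Toy data (made up)
known = {1: "AG", 2: "GG", 3: "CT", 4: "AA", 5: "CT"}
calls = {1: "AG", 2: "GG", 3: "CC", 4: "AA"}   # site 5 has no call
depth = {1: 20, 2: 12, 3: 4, 4: 9, 5: 0}

all_sites = concordance(calls, known, depth)
deep_sites = concordance(calls, known, depth, min_depth=9)

print(f"all sites: {all_sites:.0%}")
print(f"sites with >=9x depth: {deep_sites:.0%}")
```

In the toy data the missed heterozygote at 4x and the uncovered site drag down the unfiltered number, and both drop out once the 9x cutoff is applied. That is the same pattern as the jump from 82% to ~95% described above.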

SOLiD Versus Illumina Sequencing

I was pleased that Bainbridge and his colleagues made some direct comparisons between SOLiD and Illumina sequencing. This is a delicate issue, from the point of view of the sequencing vendors, but one of great interest to the NGS community. The Illumina PE data yielded ~25% more SNP calls in target regions, with higher HapMap concordance (98%) than ABI SOLiD data (95%). The authors attribute this to the better mapping, higher coverage, and low duplication rate made possible by paired-end sequencing. Considering only HapMap heterozygous SNPs, SOLiD out-performed Illumina at low (<9x) coverage, but Illumina consistently yielded 2-3% higher concordance at high coverage.

In their concluding section, the authors write “Interestingly, Illumina sequencing consistently shows higher levels of enrichment than SOLiD sequencing. This is unexpected because both sequencing platforms yield similar coverage distributions in whole genome sequencing data… therefore we suspect that differences in efficiency are due to an increase in initial library complexity from better annealing efficiencies of the Illumina adapter.”

Such a frank conclusion, from a group that’s highly invested in SOLiD sequencers, is especially poignant. When it comes to exome sequencing, Illumina seems to have the advantage.

References
Bainbridge MN, Wang M, Burgess DL, Kovar C, Rodesch MJ, D’Ascenzo M, Kitzman J, Wu YQ, Newsham I, Richmond TA, Jedeloh JA, Muzny D, Albert TJ, & Gibbs RA (2010). Whole exome capture in solution with 3Gbp of data. Genome biology, 11 (6) PMID: 20565776

Outsourced Sequencing and Analysis

May 21, 2010 by Dan Koboldt

A company in Malaysia is offering to map whole-genome sequencing data and call variants in one week’s time for $4,000.

I readily admit that I have not taken sequencing-as-a-service companies very seriously. The idea of sending precious samples off to a third party and getting back the sequence and variants doesn’t appeal to me for a number of reasons. Outsourcing just the analysis of sequence data is even more anathema. Why would anyone want to do that? Analysis is the best part! Then again, I’m fairly biased in this matter because (1) I work at a major genome center with significant in-house sequencing resources, and (2) sequence analysis and variant detection are among my job responsibilities. Obviously I don’t want those to go away.

That said, there seems to be a growing interest in outsourcing sequencing and/or analysis in the wider research community. Complete Genomics had a strong presence at Marco Island this year, and has a growing customer list that includes (perhaps surprisingly) at least two genome centers. Beijing Genomics Institute (BGI) announced a purchase of 128 Illumina HiSeq2000 instruments in January; a month later in Science magazine I saw a full-page ad indicating that they’re open for business as a sequencing provider. No big deal, they’re half a world away, right? So I thought, until I heard whispers of a BGI facility in San Francisco.

Second and third-generation sequencing technologies are bringing about volatile changes in the fields of genetics and genomics. Throughput continues to skyrocket, while the costs of sequencing plummet. It’s now possible to sequence a complete human or mammalian genome to high coverage on a single instrument run at ~$20,000. This has had three effects on the research community:

Genomes abound. At least a dozen individual human genomes have been published, but NGS technologies are being applied to a wide range of studies – exomes, transcriptomes, model organisms, you name it.
Everyone wants to sequence. Thanks to a lot of press and some high-profile publications, massively parallel sequencing is known to every corner of the biomedical research world. Suddenly every clinician with a patient cohort wants in, because if they don’t find the disease-causing genes, someone else will.
Not everyone can buy an NGS instrument. Commercially-available sequencers currently cost a quarter to a half million dollars or more each, which is a significant purchase even for labs flush with ARRA funding. This means that a lot of small labs will not be looking to buy a machine, but rather to rent space from someone who has one. Music, no doubt, to the ears of BGI and Complete Genomics.

One thing is clear. These new sequencers and service providers are going to put high-throughput sequencing into the hands of many investigators. Investigators, I might add, who likely have never dealt with NGS data. I think that’s potentially very exciting, and I hope that the experiences of major genome centers will help newcomers address the challenges of massively parallel sequencing.
