GWAS and the Genetics of Human Disease

May 4, 2010 by Dan Koboldt

An essay published last week in Cell dismissed the findings of genome-wide association studies (GWAS) and questioned their value to the study of human disease. In their article Genetic Heterogeneity in Human Disease, McClellan and King argue that because common diseases exhibit a high degree of allelic, locus, and phenotypic heterogeneity, their causality “can almost never be resolved by large-scale association studies.” Instead, the authors believe that rare mutations underlie most of the disease-relevant genetic variation in humans, and as such, their causal relationships can only be uncovered by sequencing-based approaches. The article as a whole comes off as uninformed and misleading. Thankfully, genomics bloggers have taken them to task: p-ter at the Gene Expression blog explains how noncoding variants influence disease risk, and Kai Wang guest-posts at Genetic Future with a full-on criticism of the essay.

GWAS Overload

I am tempted to agree with McClellan and King in some respects, particularly in their concern that the myriad of GWAS publications often fail to advance our understanding of many common diseases. And I do mean myriad. Once upon a time, Nature Genetics was my favorite journal for cutting-edge genetics and genomics discoveries. Take, for example, the summer of 2006 when three high-profile papers revealed the presence of extensive structural variation in the human genome. In recent months, however, I find myself underwhelmed by the content of this particular journal, as it seems saturated with GWAS, GWAS, and more GWAS.

In fact, when I looked at the ~70-80 research articles published in 2010 in Nat. Genetics, more than half (46) were association studies, or worse, meta-analyses of association studies. It’s like every investigator in the world with a disease cohort got hold of an Affy or Illumina SNP array. When I scan the titles each month in my RSS reader, my eyes begin to glaze over with each new title that reads “Common variants associated with…” or “Genome-wide association study identifies…” Unless you happen to be an investigator studying the phenotype or disease of interest, these cookie-cutter papers probably hold little interest for you.

That said, I took issue with much of what was written in the McClellan and King essay. Specifically:

Their disparagement of the value of GWAS based upon the observation that most associations come from intergenic regions. As my colleagues in the blogosphere have pointed out, the aim of high-density SNP arrays is not to pinpoint the causal SNP; in fact, a high-frequency variant is more likely to be included than a rare nonsynonymous SNP simply because the former is more informative as a genetic marker.
Their blanket dismissal of most GWAS findings as artifacts of “cryptic population stratification.” The authors suggest that although outliers based on population substructure may be excluded, “hypervariable polymorphisms remain vulnerable to stratification.” As Kai Wang points out in a guest post on Genetic Future, the methods to account for hidden population structure are well established in the GWAS community.
Their apparent misunderstanding of how genome-wide association studies work. They write: “Had sickle cell anemia been investigated among affected individuals worldwide, the number of responsible mutations would be far greater and hence no one allele at any SNP would be consistently associated with the disease.” This is flat-out wrong. Although there are hundreds of known mutations in HBB (the gene that encodes the beta-globin subunit of hemoglobin and, when mutated, causes sickle-cell anemia), most cases are caused by a single amino acid change (glutamic acid -> valine). Sickle-cell is autosomal recessive, so it’s rather preposterous to assume that a worldwide study would fail to associate the homozygous variant with the disease.

Common Disease, Common Variants

The authors seem convinced that the common disease, common variant theory no longer holds because (according to them) not many have been found. Rather, McClellan and King believe that “the overall magnitude of human genetic variation, the high rate of de novo mutation, the range of mutational mechanisms that disrupt gene function, and the complexity of biological processes underlying pathophysiology all predict a substantial role for rare severe mutations in complex human diseases.” Do humans have a high rate of de novo mutation? That’s news to me.

Unfortunately, the difficulties of associating common variants with complex disease are also faced by rare variants: namely, picking out causal relationships among complex networks of interactions between many genes and environmental factors. The observation that few such relationships have been elucidated, if true, does not mean that we are looking at the wrong variants. An important fact that seems to have been overlooked by the authors is that the vast majority of human genetic variation *is* shared. From the dozen or so individual genomes published so far, it is clear that perhaps 10% of variants are novel; as databases like dbSNP continue to grow, this will shrink even further. I am reluctant to believe that this small fraction of “rare” mutations accounts for the numerous prevalent human diseases.

A Time to Sequence

Strangely, the emphasis on rare variation seems to indicate that the authors would make a strong case for sequencing. Yet the issue does not even come to light until the last 3/4 of a page in a section entitled “A Time to Sequence – With an appreciation to Maynard Olson.” Surely, I thought, they’ll wow us with the capabilities of next-generation sequencing technologies and their promise for studying complex disease. Not so. Instead, the authors vaguely hint that “new sequencing technologies provide conceptual and practical advantages over current approaches (Olson, 1995).” Why are they citing a fifteen-year-old article to support the advantages of new sequencing technologies? Where are the citations of landmark sequencing/WGS papers? The only citation related to NGS that I see is McKernan 2009, and you know how I feel about that one.

This ending is unfortunate, because sequencing ultimately will provide us with many of the answers. I’m tired of seeing Yet-Another-GWAS that concludes with a table of loci and p-values, or at most, a list of genes. Comprehensive, convincing studies of genetic association should have a strong sequencing component, in which the regions implicated by genotyping are exhaustively sequenced to identify all putative causal variants. Such variants could then be analyzed computationally and experimentally to characterize their effects on gene structure or regulation. Thus, I find myself reluctantly agreeing with McClellan and King on this point: genetic association is not enough.

References
McClellan J, & King MC (2010). Genetic heterogeneity in human disease. Cell, 141 (2), 210-7 PMID: 20403315

Transcriptome Genetics with HapMap and RNA-Seq

April 22, 2010 by Dan Koboldt

Two papers in Nature this month leverage the power of second-generation sequencing technologies to investigate gene expression variation in human cell lines. By performing RNA-Seq in HapMap cell lines, the authors generated the most extensive gene expression data to date for these samples, and were able to use publicly available HapMap genotypes to associate expression differences with genetic variation. This strategy was applied to the HapMap samples two years ago using expression microarrays. Using RNA-Seq instead of microarrays, however, offers a few key advantages:

More accurate quantification of highly abundant transcripts, where microarrays reach saturation
Access to rare transcripts below the sensitivity threshold for microarrays
Detection of novel gene structure from alternative splicing and unannotated exons
Identification of allele-specific expression

The first study, from Jonathan Pritchard’s lab at the University of Chicago, sequenced RNA from 69 Yoruban (African) individuals on the Illumina GAII platform. They generated at least two lanes per individual, for a total of 1.2 billion reads, of which 964 million (80%) mapped uniquely to the genome or to exon-exon boundaries. The second study, from Emmanouil Dermitzakis’s group at the Sanger center, sequenced RNA from 60 CEU (CEPH Europeans from Utah) individuals, also on the Illumina GAII platform. They generated one lane of paired-end data per individual, for a total of about 1.0 billion reads. Since neither study provided a table summarizing their data (which I’d have liked), I put one together:

Study	Pickrell et al.	Montgomery et al.
Samples	69, African descent (YRI)	60, European descent (CEU)
Sequencing	Illumina 1X35 or 1X46	Illumina 2X37
Reads/Sample	17.4 million	16.9 million
SNP Dataset	HapMap II/III	HapMap III
Total SNPs	3.8 million	1.2 million

Up to this point, the two studies sounded nearly identical. For the data analysis, however, each group went in a different (and interesting) direction.

Pooled Data for Discovery of Novel Gene Structures

Novel Exons. The Pritchard group pooled all data to examine the completeness of current gene annotations. Some 86% of uniquely mapped reads corresponded to known exons. Using conservation data from alignments of 28 vertebrate genomes, the authors identified 4,031 regions that are evolutionarily conserved and show evidence of transcription. About one-quarter of these appear to be part of spliced transcripts, but most appear to be novel untranslated regions (UTRs). Some 115 regions, however, had sequences consistent with protein-coding exons. To investigate the possibility that their novel exons are real, the authors used RNA-Seq data from several human tissues and chimpanzee cell lines. The evidence suggests that their regions do represent novel exons, but ones that are expressed in a more tissue-specific fashion than annotated exons.

Novel Poly-A Sites. The authors next screened the ~70 million unmapped sequence reads for long runs of A or T nucleotides, which might indicate novel poly-adenylation sites. Of the ~8,000 novel sites that they identified, some 45% fell within 10 bp of a known cleavage site. To further validate their findings, they screened their poly-A regions for the binding site of the CPSF polyadenylation factor, and found a 32-fold enrichment for the CPSF target hexamer. The net result was a high confidence set of 3,481 cleavage sites that show evidence of poly-A (from RNA-Seq data) and CPSF binding.
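The first step of that screen, picking out unmapped reads that carry a long run of A or T, can be sketched roughly as below. This is only an illustration and not the authors' pipeline: the function name `find_polya_candidates` and the 10-base run threshold are assumptions made for the example. The trimmed remainder of each read is what one would presumably try to map back to the genome to locate the cleavage site.

```python
import re

def find_polya_candidates(reads, min_run=10):
    """Return (read, trimmed_sequence) pairs for reads that end in a run of A
    or begin with a run of T at least min_run bases long.

    min_run=10 is an arbitrary illustrative threshold, not a value taken
    from the paper.
    """
    tail_a = re.compile(r"A{%d,}$" % min_run)   # poly-A at the 3' end of the read
    head_t = re.compile(r"^T{%d,}" % min_run)   # poly-T at the 5' end (reverse strand)
    candidates = []
    for read in reads:
        seq = read.upper()
        match = tail_a.search(seq)
        if match:
            # Keep only the sequence upstream of the poly-A run
            candidates.append((read, seq[:match.start()]))
            continue
        match = head_t.search(seq)
        if match:
            # Keep only the sequence downstream of the poly-T run
            candidates.append((read, seq[match.end():]))
    return candidates

# Made-up reads for demonstration only
example_reads = [
    "GATTACAGGCT" + "A" * 12,   # ends in poly-A
    "T" * 12 + "GGCATCGA",      # starts with poly-T
    "GATTACAGATTACA",           # no long homopolymer run
]
for read, trimmed in find_polya_candidates(example_reads):
    print(trimmed)
```

A real screen would also need to handle sequencing errors inside the homopolymer run and to discard genomically encoded A-rich stretches, neither of which this sketch attempts.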

RNA-Seq: 10 million Reads Is All You Need

The Dermitzakis study generated 16.9 million (+/- 5.9 million) reads per individual, which were mapped to the NCBI 36 reference sequence using Maq (with a maximum insert size of 2 megabases). The resulting alignments were filtered to remove those with low mapping quality and those on the X, Y, or MT chromosomes. Discordant read pairs (by distance or orientation) were also removed. To quantify the expression of known exons/transcripts/genes, the authors scaled read counts for each individual to a theoretical yield of 10 million reads, and only considered exons with data in >90% of individuals. This resulted in data for 90,064 exons from 10,777 genes, of which 95% had at least 10 reads (on average) per individual. While the normalization seems to reduce the dataset to less than half of known genes, it nevertheless provided an extensive view of gene expression across these 60 individuals.
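The scaling and filtering described above amount to simple arithmetic, sketched here under stated assumptions. The function names, the dictionary layout (exon ID mapped to read count), and the reading of "data" as "at least one read" are all illustrative choices of mine, not details from the paper.

```python
def scale_to_depth(exon_counts, total_reads, target=10_000_000):
    """Scale one individual's exon read counts to a common theoretical yield.

    exon_counts: dict of exon ID -> raw read count (hypothetical layout).
    total_reads: that individual's total read count after filtering.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    factor = target / total_reads
    return {exon: count * factor for exon, count in exon_counts.items()}

def exons_with_coverage(samples, min_fraction=0.9):
    """Return exons with at least one read in more than min_fraction of individuals.

    Treating "data" as "one or more reads" is an assumption for illustration.
    """
    n = len(samples)
    all_exons = set().union(*(s.keys() for s in samples))
    kept = []
    for exon in sorted(all_exons):
        covered = sum(1 for s in samples if s.get(exon, 0) > 0)
        if covered / n > min_fraction:
            kept.append(exon)
    return kept

# Toy example: an individual with 20 million reads has its counts halved
print(scale_to_depth({"exon_A": 400, "exon_B": 30}, total_reads=20_000_000))
```

An individual sequenced to 20 million reads has every count halved, while one sequenced to 5 million has every count doubled, so exon counts become comparable across the 60 samples.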

Cis-Regulatory Effects on Gene Expression

Using HapMap genotypes for 1.2 million SNPs, the Dermitzakis group identified 836 genes associated with cis-regulatory variants (compared to 539 genes identified in microarray studies of the same individuals). Even when normalized for the number of genes tested, the increased resolution of RNA-Seq over microarrays yielded a larger number of genetic regulatory effects. The RNA-Seq exon eQTLs (expression quantitative trait loci) were enriched for abundant transcripts, suggesting that saturation of highly expressed exons reduces the sensitivity for microarrays to detect some cis-regulatory effects.

The Pritchard group searched for cis-regulatory variation with an even larger dataset – RNA-Seq for 69 individuals and 3.8 million HapMap SNPs. They identified 929 genes with local eQTLs (4.6% of annotated genes); consistent with previous findings, virtually all SNPs associated with expression level were near the corresponding gene. They also reported the overlap with the CEU study results: the top 500 associations reported in CEU samples were enriched 10 to 40-fold for significant eQTLs in YRI samples. Given the marked genetic differences between these two populations, this result suggests that these studies are identifying replicable cis-regulatory events.
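At its core, a cis-eQTL test asks whether expression of a gene changes with the number of copies of an allele (0, 1, or 2) at a nearby SNP. The sketch below shows the simplest version of that idea, an ordinary least-squares fit of expression on allele dosage. It is a hedged illustration only: the function name and the toy numbers are made up, and the statistical models actually used in the two papers (including normalization, covariates, and multiple-testing correction) are more involved and not reproduced here.

```python
from math import sqrt

def eqtl_slope(dosages, expression):
    """Least-squares slope and Pearson correlation of expression on allele dosage.

    dosages: per-individual allele counts at one SNP (0, 1, or 2).
    expression: per-individual normalized expression of one gene.
    """
    n = len(dosages)
    if n != len(expression) or n < 3:
        raise ValueError("need matched vectors of at least 3 individuals")
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in genotype or expression")
    slope = sxy / sxx            # expression change per extra allele copy
    r = sxy / sqrt(sxx * syy)    # strength of the linear association
    return slope, r

# Made-up data: expression rises with each copy of the allele
slope, r = eqtl_slope([0, 0, 1, 1, 2, 2], [1.1, 0.9, 2.0, 2.2, 2.9, 3.1])
print(round(slope, 2), round(r, 3))
```

In a genome-wide scan this fit would be repeated for every SNP near every gene, which is why the significance thresholds in such studies have to account for a very large number of tests.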

Mechanism of Cis-Regulatory Effects

An important feature of RNA-Seq data is that it can be used not only to detect cis-regulatory variation, but to assess the mechanism by which these variants act. The Pritchard group looked at 222 of their 929 eQTLs for which the associated SNPs fell within the gene exons. They classified the RNA-Seq reads as originating from the high-expression haplotype or the low-expression haplotype, and found that for 195 of the genes (88%), more than 50% of the expressed transcripts carried the allele associated with high expression. Therefore, the modulation of gene expression is a direct result of the associated variation (probably by activating nearby cis-regulatory elements). In other words: the eQTL tells us that variants near the gene are associated with its expression. That means something nearby is regulating it. The fact that the haplotype associated with increased expression is the haplotype that predominates tells us that the high-expression allele drives expression of the gene copy on its own chromosome, as opposed to, say, driving expression of the gene from both chromosomes.
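For a single heterozygous individual, the allele-specific comparison boils down to counting reads that carry each allele at the exonic SNP and asking whether the high-expression allele accounts for more than half. A minimal sketch follows, using a one-sided binomial test against a 50/50 expectation. The choice of test and the function names are my assumptions for illustration; the paper's own procedure is not described in enough detail here to reproduce.

```python
from math import comb

def binom_upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def allelic_imbalance(high_reads, low_reads):
    """Fraction of reads from the high-expression haplotype, plus a one-sided
    p-value for that fraction exceeding 0.5.

    high_reads / low_reads: counts of reads carrying each allele at one
    heterozygous exonic SNP (hypothetical inputs).
    """
    n = high_reads + low_reads
    if n == 0:
        raise ValueError("no informative reads at this site")
    return high_reads / n, binom_upper_tail(high_reads, n)

# Made-up counts: 30 reads from the high-expression allele, 10 from the other
fraction, p_value = allelic_imbalance(30, 10)
print(fraction, p_value)
```

If the variant acted in trans, both chromosomes would be affected equally and the fraction would hover around 0.5, which is what makes a consistent skew toward the high-expression allele evidence for a cis-acting mechanism.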

Allelic Effects on Splicing

Finally, both groups looked at the actual content of expressed transcripts, to find SNPs associated with alternative splicing. The Pritchard group calls these splicing quantitative trait loci (sQTLs), and found 187 genes with significant associations. Binding sites for known splice factors (U1 snRNP and U2AF) were enriched for sQTLs, as were SNPs within 2 bp of a canonical splice site. The Dermitzakis group found 110 genes with significant associations, and stratified splicing-associated variants according to their position in the gene structure. When tested against the exons upstream and downstream of where they resided, splice donor variants were enriched 3.17-fold with the upstream (5′) exon, while splice acceptor variants were enriched 7.02-fold with the downstream (3′) exon. Thus, these SNPs affect the inclusion/exclusion of their exons in the mature transcript.

Dermitzakis’s group visually examined their most significant associations to characterize the mechanism of splicing regulation. Of the 110 significant sQTLs identified in CEU samples:

41% were single exon skipping events
17% created an alternate acceptor
13% were double or triple exon skipping events
6% created an alternate donor
5% were mutually exclusive exons
5% were retained introns.

In summary, these studies establish the feasibility of transcriptome sequencing to assess gene expression and characterize regulatory variation. Indeed, as the title of one study suggests, RNA sequencing is a powerful tool for studying the mechanisms underlying human gene expression variation, and will undoubtedly yield better understanding of the complex relationships between genotype and phenotype.

References
Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, & Pritchard JK (2010). Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature, 464 (7289), 768-72 PMID: 20220758

Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, & Dermitzakis ET (2010). Transcriptome genetics using second generation sequencing in a Caucasian population. Nature, 464 (7289), 773-7 PMID: 20220756

The Four Dimensions of a Breast Cancer Genome

April 15, 2010 by Dan Koboldt

Published today in the journal Nature is the whole-genome sequencing of a basal-like breast cancer tumor, metastasis, and xenograft. There’s also a News and Views article by Joe Gray of Lawrence Berkeley National Laboratory, as well as a news feature on large-scale cancer projects.


This study is a bit unlike our previous cancer genomes (AML1 and AML2). By my count it is the sixth cancer genome to be sequenced, and the third to come out of the Genome Center at Washington University. Obviously, it’s our first solid tumor. What’s particularly interesting about this study, however, is that we sequenced four DNA samples from a single patient with “double-negative” breast cancer: the primary tumor, peripheral blood (normal), a brain metastasis, and a mouse xenograft derived from the primary tumor. The xenograft is a success story in itself – we managed to create a human-in-mouse (HIM) transplant of the primary tumor that was >90% pure when harvested 101 days after engraftment.

The genomes of these four samples (tumor, normal, metastasis, and xenograft), examined with the incredible power of Illumina massively parallel sequencing, offer an unprecedented view of the somatic changes that underlie breast cancer development, growth, and metastasis.

Repertoire of Somatic Mutations

We validated a total of 50 somatic sites in at least one of the three cancer genomes, including:

28 missense mutations predicted to alter the sequence of an encoded protein
11 synonymous (silent) mutations in coding sequences
4 small insertions ranging in size from 1 to 6 bp
3 small deletions ranging in size from 1 to 13 bp
2 splice site mutations at intron-exon junctions
1 nonsense mutation predicted to result in a truncated protein
1 RNA mutation in a gene encoding a signal recognition particle (SRP) RNA.

We employed deep Illumina sequencing of PCR amplicons to assess the frequencies of each mutation across all four tissues. Intriguingly, more than half of them exhibited differential frequencies between primary tumor, metastasis, and/or xenograft. Two mutations (a nonsense mutation in MYCBP2 and a missense mutation in TGFBI) were significantly enriched in the primary tumor (88-89% vs 14-44%). Some 26 mutations were significantly enriched in the metastasis and/or xenograft. Perhaps most interesting, however, were two sites (a missense mutation in SNED1 and a silent mutation in FLNC) that appear to be de novo mutations unique to the metastasis.
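Comparing a mutation's frequency between two tissues comes down to comparing read counts: variant-supporting versus reference-supporting reads in each sample. One common way to test such a 2x2 table is Fisher's exact test, sketched below in pure Python. Note the assumptions: this post does not say which statistical test was used, so Fisher's test is swapped in here as a plausible stand-in, and the read counts in the example are invented.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Rows are two tissues; columns are variant and reference read counts.
    Sums the probabilities of all tables (with the same margins) that are
    no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x variant reads in the first tissue
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Invented counts: 88 of 100 reads carry the variant in one tissue, 14 of 100 in another
print(fisher_exact_two_sided(88, 12, 14, 86))
```

With read depths in the thousands, as deep amplicon sequencing can produce, even modest frequency differences become statistically significant, so the size of the difference matters as much as the p-value.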

Acquired Structural Variation

Using our internally developed tools for structural variant prediction (BreakDancer) and de novo assembly (TIGRA), we predicted 59 deletions and 18 inversions that were putative somatic events. Validation by PCR and 454/3730 sequencing showed that 73/77 (94.8%) were real structural variants, of which 34 (28 deletions and 6 inversions) were somatic alterations not present in the normal genome. Among them was a 46.5 kbp heterozygous deletion affecting FBXW7 (a known cancer gene) and two overlapping 500-kb deletions affecting CTNNA1 and a handful of other genes. The latter was particularly interesting, because loss of CTNNA1 has been shown to result in global loss of cell adhesion in human breast cancer cell lines.

We also validated seven translocations with a combination of manual review (Pairoscope), assembly, and PCR/3730 sequencing. One translocation that we assembled in all three tumor samples involves a long terminal repeat (LTR) from the ERVL-MaLR family on chromosome 4 and the ABCA2 gene on chromosome 9. Two other validated translocations that assembled in all three tumors are on chromosome 2, and separated only by a 393-bp TcMar-Tigger repeat.

Insights from Comparisons of Tumor, Metastasis, and Xenograft

One of the most intriguing findings from our study was the differential mutation frequencies and structural variation patterns that we observed in the metastasis and xenograft, compared to the primary tumor. More than half of the somatic mutations (26/50) were significantly enriched in the metastasis and xenograft, while observed at relatively low frequencies in the primary tumor. This suggests that a sub-population of tumor cells, not the primary clone, gave rise to the cerebellar metastasis that eventually killed the patient.

Is there a fitness cost to the mutations that enabled metastasis? Can we develop sensitive tests to detect the cells that are likely to spread? Genome sequencing has brought us to a point where we can begin to ask these questions, and answering them brings us one step closer to unraveling the complex, devastating, deadly disease that is cancer.

References
Li Ding, Matthew J. Ellis, Shunqiang Li, David E. Larson, Ken Chen, John W. Wallis, Christopher C. Harris, Michael D. McLellan, Robert S. Fulton, Lucinda L. Fulton, Rachel M. Abbott, Jeremy Hoog, David J. Dooling, Daniel C. Koboldt, Heather Schmidt, Joell (2010). Genome remodelling in a basal-like breast cancer metastasis and xenograft. Nature, 464 (15), 999-1005 DOI: 10.1038/nature08989

Next-Gen Sequencing in 2010

March 9, 2010 by Dan Koboldt

On the shuttle from Marco Island to the airport last week, I happened to sit next to a very nice gentleman from Illumina. We got to talking, of course, and I asked him if they saw a threat from any of the new sequencing platforms presented at AGBT. I’m aware that Illumina currently enjoys a greater-than-50% share of the next-gen sequencing market, so I was curious about his impressions.

“We definitely see a segmentation of the market,” he admitted.

Something had been bothering me about the sequencing-company presentations this year, and I finally realized what it was. During AGBT 2009, every player was gunning to take over the world. This year it seems like every sequencing platform has a niche in mind.

General Sequencing: Illumina vs. Life Technologies

Illumina’s HiSeq2000 and Life Tech’s SOLiD 4 are after the general sequencing market – whole genome, transcriptome, and targeted (capture) sequencing. It’s a constant game of one-upmanship in throughput and claimed accuracy. In February this year, Illumina launched the HiSeq2000 with expected throughput of 200 Gb per run. Life Technologies launched SOLiD 4 with 100 Gb per run, but promised 300 Gb per run later this year. On the read length front, Illumina remains the clear winner – 2×100 is in production at many genome centers, and even longer reads have been promised. Life Tech, to their credit, is pushing the SOLiD 4 platform pretty hard.

When Length Matters: 454

Roche/454 has wisely backed away from large-scale sequencing, and instead seems to be targeting applications where longer (450 bp) reads are a requirement. At AGBT, Henry Erlich (Roche) gave an interesting talk about genotyping and haplotyping human HLA regions to improve donor matching for organ transplants. Here’s a key challenge of modern medicine where sequencing can offer tangible benefits. Here at the genome center, we use 454 runs for validation and for small-scale targeted sequencing. There are many applications where relatively inexpensive long-read sequencing runs are ideal; full-length cDNA sequencing, for example, comes to mind.

Complete Genomics: Sequencing as a Service

The business model of Complete Genomics seems a bit of a gamble to me. They aim to be the provider of relatively inexpensive, start-to-finish sequencing services. No technology or reagent sales for these guys. Instead, they want to take your samples and give you back the SNPs. In the coming years, they hope to build as many as 10 facilities throughout the world that provide these services. I’m a bit leery of Complete Genomics, not only because their proprietary technology lags behind others (currently it’s at 2X35 bp), but because they’ll need to do something like 10,000 genomes a year just to stay in business. I don’t think we’re ready for that.

Sequencing for the Masses: IonTorrent

Many of us were impressed by IonTorrent this year at AGBT. The incredibly low cost of their instrument ($50K) and sequencing runs ($300-500) mean that nearly any lab could write a grant around this technology. The sample prep, accuracy, and throughput are still a grey area, but if they prove to be good enough, high-throughput sequencing will suddenly be available to just about everyone.

Single Molecule Applications: Pac Bio and Oxford Nanopore

The true single-molecule sequencing platforms that are close to market are certainly getting everyone excited. In the next few years, however, it’s unlikely that Pacific Biosciences, Oxford Nanopore, mystery-Chinese-platform, or other companies will displace massively parallel sequencing. No, I think Illumina and SOLiD will remain the “workhorses” for discovery, certainly at major genome centers. Where SMS technologies can excel, however, is ultra-long reads – think about PacBio’s strobe sequencing to resolve structural variation or finish assemblies – and lots of molecule-kinetics stuff that I don’t understand.

I think that 2010 will be an exciting and telling time for all of these platforms. In a year’s time, we should have results in hand from HiSeq, SOLiD4, PacBio, and even IonTorrent, and be able to distinguish between marketing claims and sequencing reality.
