Variant Prioritization in Rare Mendelian Disorders

March 11, 2014 by Dan Koboldt

Few areas of biomedical research have benefited more from next-gen sequencing than studies of rare inherited diseases. Rapid, inexpensive exome sequencing in individuals with rare, presumably-monogenic diseases has been hugely successful over the past few years. There’s been a lot of discussion in the NGS community about the analysis burden of the large-scale whole-genome sequencing that will be possible with Illumina HiSeqX Ten systems, but even exome sequencing analysis brings considerable challenges.

In the March 2014 issue of the American Journal of Human Genetics, we present a software package called MendelScan to aid the analysis of exome data in rare Mendelian disorders.

Exome Sequencing Challenges

Every individual harbors thousands of coding variants, 5-10% of which are not in public databases such as dbSNP. A study led by my friend Daniel MacArthur found that, even after correcting for annotation errors and other artifacts, the genome of a healthy individual contains ~100 loss of function coding variants. This is just one of the reasons that exome sequencing of Mendelian disorders can fail.

And it does fail. The current solve rate for incoming cases at NIH Mendelian Centers remains at around 25%. Dominant disease pedigrees remain more difficult to solve than recessive ones.

Mendelian Disorder: Retinitis Pigmentosa

Retinitis pigmentosa (RP) offers a wonderful example of challenging Mendelian disorders. It’s a “genetically heterogeneous” disease, which is a fancy way of saying that many different mutations in dozens of different genes can cause dominant, recessive, or X-linked disease. The disease affects around 1 in 3,500 individuals in the U.S., and it’s incurable.

No matter the genetic cause, the progression of RP is remarkably uniform. Basically, it’s a disease of rod photoreceptors (the light-sensing kind, not the color-sensing kind), whose slow, inexorable attrition usually causes night blindness (usually apparent by adolescence) and a sustained narrowing of the visual field (tunnel vision). Most RP patients will be legally blind by the age of 40.

Mutations in about 18 different genes can cause dominant RP, which is the form that we’re studying. Routine genetic testing of common disease-causing mutations explains about 50% of incoming cases right off the bat. We’re interested in the cases that come back negative from these screens, the ones that may have rare or as-yet-unknown causal mutations.

Exome Sequencing in 24 Families

We did exome sequencing for 24 families that lacked common disease-causing mutations. The typical family had a proband, the affected parent (because it’s dominant), the unaffected parent, and a distant affected relative. Overall we did 2-7 affected and 0-2 unaffected samples per family, for a total of 91 samples. On average, in each family, we identified ~30,000 single nucleotide variants (SNVs) and 600 insertions/deletions (indels) in coding regions. That’s a lot to sort through when you’ve got 24 families.

Variant Prioritization Strategy

Based on our knowledge of dominant RP, and an analysis of 762 disease-causing mutations downloaded from HGMD, we expected that most disease-causing mutations would exhibit some key characteristics:

Segregation. In dominant pedigrees with full penetrance, all affected individuals should carry the causal mutation, and none of the unaffected individuals should.
Rareness. All of the mutations known to cause dominant RP are quite rare. In the HGMD set, 68% of mutations were novel to dbSNP 137, and another 21% were present only because they were pulled in from OMIM and other mutation databases.
Protein impact. We expect that most (but not all) causal mutations will alter the encoded protein. When classified by current VEP annotation, most of the mutations were predicted to alter protein sequence (66%), reading frame (13.5%), splicing (4.3%), or length (6.8%).
Retinal expression. Genes in which mutations cause retinal disease tend to be highly expressed in the retina. According to recent human retina RNA-seq data, about 97% of genes in RetNet (a retinal disease gene database) are in the top 50% of all genes when ranked by retinal expression.

You’ll note that there are exceptions to every rule above. That’s why we were uncomfortable with simply ruling out variants that don’t segregate perfectly or ones that appear to be synonymous. Instead, we developed a scoring algorithm to prioritize variants based on segregation, rareness, annotation, and retinal expression.
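To make the prioritize-rather-than-filter idea concrete, here is a minimal Python sketch. It is not MendelScan's actual formula: the weights, the field names, and the choice to multiply the four components together are all assumptions made for illustration. The point is only that a variant which breaks one expectation gets penalized, not thrown away.

```python
from dataclasses import dataclass

# Hypothetical annotation weights, chosen for illustration only.
# These are NOT the values used by MendelScan.
ANNOTATION_WEIGHT = {
    "nonsense": 1.0,
    "frameshift": 1.0,
    "splice_site": 1.0,
    "missense": 0.8,
    "synonymous": 0.05,
}


@dataclass
class Variant:
    name: str
    affected_carriers: int        # affected individuals carrying the variant
    affected_total: int           # affected individuals sequenced
    unaffected_carriers: int      # unaffected individuals carrying the variant
    in_dbsnp: bool                # crude stand-in for "not rare"
    annotation: str
    expression_percentile: float  # 0-100 rank of the gene's retinal expression


def segregation_score(v, penalty=0.5):
    """Halve the score for every individual who breaks perfect segregation."""
    violations = (v.affected_total - v.affected_carriers) + v.unaffected_carriers
    return penalty ** violations


def priority_score(v):
    """Combine the four criteria; no single criterion can zero out a variant."""
    rareness = 0.2 if v.in_dbsnp else 1.0
    annotation = ANNOTATION_WEIGHT.get(v.annotation, 0.01)
    expression = 1.0 if v.expression_percentile >= 50 else 0.5
    return segregation_score(v) * rareness * annotation * expression


def rank_variants(variants):
    """Return variants sorted from highest to lowest priority."""
    return sorted(variants, key=priority_score, reverse=True)
```

In this toy version, a novel nonsense variant that is missing from one affected individual (say, because of a pedigree error) scores 0.5 and stays near the top of the list. A hard segregation filter would have discarded it.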

So how well does it work? In our exome dataset, 8 of 24 families harbored a likely-pathogenic mutation in a known RP gene. When we sorted the variants in those families by prioritization score, the causal mutation never ranked lower than 13. Out of 20,000+ variants. There was one exception, a family that turned out to have an error in the pedigree. The causal mutation there ranked #439 out of 26,666 SNVs, so it was still in the top 2%.

This robust performance — even in the face of incorrect assumptions — is why we prefer to prioritize rather than filter-and-remove candidate variants.

Mapping Dominant Disease Genes

Koboldt et al, AJHG 2014

Even though our scoring algorithm seemed to be working well, we still had hundreds or thousands of variants to sift through in some families. And while those pedigrees often weren’t large enough for traditional linkage analysis, we asked whether the dense information provided by exome sequencing could help nominate or exclude regions based on segregation.

Disease-causing variants usually don’t occur in isolation. They’re part of a haplotype that segregates within a family pedigree (example at right). For dominant disease, all affected individuals have one haplotype in common, the one that hosts the causal variant (denoted in black, on the left):

Rare Heterozygote Rule Out

That haplotype (orange) might also host other variants that aren’t disease-causing, but might still be picked up by exome sequencing. Some may be quite rare, and because they’re physically linked to the causal mutation, they’ll be heterozygous in affected individuals. There also should be no homozygous differences between pairs of affecteds (red) because all affecteds share at least one haplotype. So a cluster of shared rare (heterozygous) variants, and an absence of homozygous differences, helps us map haplotypes shared by affecteds in the pedigree. We call this rare heterozygote rule out (RHRO).
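The RHRO logic described above can be sketched in a few lines of Python. This is a simplified illustration, not the MendelScan implementation: the genotype coding (0 = homozygous reference, 1 = heterozygous, 2 = homozygous variant), the fixed 1 Mbp windows, and the function names are all assumptions.

```python
from collections import defaultdict


def is_shared_rare_het(genotypes, is_rare):
    """True if the variant is rare and heterozygous in every affected."""
    return is_rare and all(g == 1 for g in genotypes)


def has_homozygous_difference(genotypes):
    """True if two affecteds are opposite homozygotes (0 vs 2).

    Such a pair cannot share a haplotype at this position.
    """
    return 0 in genotypes and 2 in genotypes


def rhro_windows(variants, window_size=1_000_000, min_shared=1):
    """Return windows consistent with a haplotype shared by all affecteds.

    variants: iterable of (chrom, pos, is_rare, genotypes), where genotypes
    holds one 0/1/2 code per affected individual.
    A window is kept if it has at least `min_shared` shared rare
    heterozygotes and no homozygous differences.
    """
    shared = defaultdict(int)
    hom_diff = defaultdict(int)
    for chrom, pos, is_rare, genotypes in variants:
        key = (chrom, pos // window_size)
        if is_shared_rare_het(genotypes, is_rare):
            shared[key] += 1
        if has_homozygous_difference(genotypes):
            hom_diff[key] += 1
    return sorted(
        key for key, count in shared.items()
        if count >= min_shared and hom_diff.get(key, 0) == 0
    )
```

Fixed windows keep the sketch short. A real implementation would more likely grow and merge regions between informative variants, and would need to tolerate genotyping errors.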

Shared IBD Analysis

Another approach would take the same principles, but use identity by descent (IBD). Since the haplotype shared by affecteds was inherited from a common ancestor, we can also search for regions that are IBD between most or all pairs of affecteds. We call this shared IBD (SIBD) analysis, and like RHRO, its discriminatory power grows with the number of affecteds in a pedigree and the genetic distance between them.
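As a rough sketch of the "IBD between most or all pairs" step, suppose some IBD caller has already produced segments for each pair of affecteds on one chromosome. The input format, the function name, and the `min_pairs` parameter below are assumptions for illustration; the sketch only finds where enough pairwise segments stack up, and says nothing about how the segments were inferred.

```python
def shared_ibd_regions(pair_segments, min_pairs=None):
    """Find regions covered by IBD segments from at least `min_pairs` pairs.

    pair_segments: {(sample_a, sample_b): [(start, end), ...]} for a single
    chromosome, with half-open, non-overlapping segments per pair.
    min_pairs defaults to the number of pairs (i.e. all pairs must share).
    """
    if min_pairs is None:
        min_pairs = len(pair_segments)

    # Turn every segment into a +1 event at its start and a -1 at its end.
    events = []
    for segments in pair_segments.values():
        for start, end in segments:
            events.append((start, 1))
            events.append((end, -1))
    # Tuples sort by position, then delta, so ends come before starts
    # at the same position (consistent with half-open intervals).
    events.sort()

    regions = []
    depth = 0
    region_start = None
    for pos, delta in events:
        depth += delta
        if depth >= min_pairs and region_start is None:
            region_start = pos
        elif depth < min_pairs and region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    return regions
```

Lowering `min_pairs` below the total corresponds to the "most pairs" case, which leaves some room for a missed segment or a phenocopy.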

We applied these approaches to families with 3+ sequenced affecteds. When you put both mapping methods together, you get something like this:

Koboldt et al, AJHG 2014

Some of them had orthogonal information — traditional linkage peaks or an identified pathogenic mutation — that told us where the disease-causing variant resided (blue line, above), so we could test the performance of these approaches. Our mapping methods did well: they recapitulated known linkage regions and/or captured the region of the causal mutation. And they did so with far fewer affected individuals than were used for the linkage analysis.

Our approaches also identified new candidate regions. These might be eliminated by adding more affected individuals, or they might reflect linkage that was missed by traditional approaches.

Improving Exome Analysis for Rare Disorders

So the MendelScan tool is out and freely available. We would love to have your feedback and suggestions for it! The current JAR release is v1.2.1. Let’s go forth and conquer some Mendelian disorders.

References

Koboldt DC, Larson DE, Sullivan LS, Bowne SJ, Steinberg KM, Churchill JD, Buhr AC, Nutter N, Pierce EA, Blanton SH, Weinstock GM, Wilson RK, & Daiger SP (2014). Exome-Based Mapping and Variant Prioritization for Inherited Mendelian Disorders. American Journal of Human Genetics. PMID: 24560519

Illumina’s Other Business: HiSeq, MiSeq, 23andMe, and BaseSpace

January 24, 2014 by Dan Koboldt

Although Illumina’s two new NGS platforms overshadowed their press conference at the J.P. Morgan Healthcare Conference, there were other interesting tidbits as well. When one gets past the hype and somewhat-unrealistic calculations of per-genome sequencing costs, it’s good to remember that Illumina’s current platforms already have ~80% of the next-gen sequencing market and a similar command of the genotyping array market. They’re investing heavily on the informatics side as well. Here are some of the highlights.

HiSeq and MiSeq Instruments

I’m guessing that the majority of those customers don’t have $10 million handy to plunk down on a HiSeqX Ten cluster. Thus, it seems likely that the MiSeq and HiSeq will remain workhorse platforms for much of the research community. In the last quarter, Illumina booked around 400 instruments, and 300 of those were MiSeqs. Half of the overall demand came from non-academic customers, and Japan was second only to the U.S. in terms of growth. Illumina expects MiSeq output to reach 15 Gbp per run this year, which is enough for 2-3 exomes or a lot of more targeted sequencing.

Longer Reads in Rapid Run

In the second quarter of 2014, Illumina plans to roll out longer paired-end reads (2×250 bp) for the HiSeq2500 rapid mode. As a bioinformatician, I’m a huge fan of (slightly) longer reads with faster turnaround times. As a manager, however, I can’t ignore the cost of said capability. Running the HiSeq2500 in rapid mode is already expensive, reagents-wise, and when you generate longer reads, you need more kits per run.

There’s another reason 2×250 bp reads might be wasteful: if the typical DNA fragment size is 250-300 bp, most fragments will be sequenced twice (once from each end). That was great in the 3730 days, but many NGS pipelines soft-clip the overlapping portions of paired-end reads from the same fragment. In other words, they only use the information from one read. Then again, speed and accuracy are likely higher priorities than cost in clinical settings, where the rapid run is more likely to be employed.
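The overlap arithmetic is easy to check. Here is a small Python sketch; it ignores adapter read-through and quality trimming, and the function names are mine.

```python
def paired_end_overlap(fragment_length, read_length):
    """Return (overlap_bp, unique_bp) for one paired-end fragment.

    overlap_bp: fragment bases covered by both reads.
    unique_bp:  distinct fragment bases covered by at least one read.
    """
    # A read cannot cover more of the fragment than the fragment contains.
    covered_per_read = min(read_length, fragment_length)
    overlap = max(0, 2 * covered_per_read - fragment_length)
    unique = 2 * covered_per_read - overlap
    return overlap, unique


def redundant_fraction(fragment_length, read_length):
    """Fraction of sequenced bases that add no new fragment coverage."""
    _, unique = paired_end_overlap(fragment_length, read_length)
    return 1 - unique / (2 * read_length)
```

With 2×250 bp reads, a 300 bp fragment has 200 bp covered by both reads, so 40% of the sequenced bases are redundant; for a 250 bp fragment it is 50%. With 2×100 bp reads on the same 300 bp fragment, there is no overlap at all.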

Speaking of which, the HiSeq2500 will be submitted for FDA clearance this year. Well, that’s the tagline. Specifically, they’ll submit the use of the 2500 on a specific genetic test for FDA approval, and build out from there.

Illumina’s SNP Arrays

There’s also the SNP array business, which has been booming for a long time. We like to talk about 10,000 or 50,000 exomes, but chip-based GWAS cohorts typed by SNP arrays often number hundreds of thousands of samples. The exome chip, for better or worse, saw rapid and widespread adoption over the last couple of years. They’re cheap, they’re easier to analyze, and they require simpler consenting procedures than exome or genome sequencing. These arrays are selling like hotcakes.

During the J.P. Morgan conference, Jay Flatley dropped this interesting line: they have seen no decrease in demand from their biggest SNP array customer, 23andMe, despite the FDA’s warning letter. I knew that 23andMe used Illumina genotyping arrays (their own customized version), but I was surprised to learn that they’re the biggest customer. It’s certainly interesting to learn that the personal genotyping service business remains strong even though 23andMe can’t currently market the health or medical benefits.

Then again, since it costs only $99 to get genotyped, even the ancestry stuff is fascinating.

Illumina’s BaseSpace Informatics

One other area of movement is BaseSpace, Illumina’s cloud computing platform for NGS informatics. They’ve connected over 2,000 instruments to it already, and registered 12,000 user accounts. There are 25 BaseSpace “apps” available now, and that number is expected to double by the end of the year. Supposedly, the amount of data in Illumina’s cloud already exceeds the size of the NCBI short read archive (SRA, not to be confused with dbGaP).

Also, there’s a “private” version coming out: Illumina BaseSpace Onsite, which contains all of the reporting without being on the cloud. Yet another development aimed at the clinical market.

Sure, BaseSpace is growing, and it’s reassuring to see Illumina investing heavily in informatics (remember when all they had was ELAND?). Bioinformatics already represents a slower and more expensive prospect than sequencing. That bottleneck will only grow worse when HiSeqX Ten installations start churning out 20,000 whole genomes per year. If they can find that many samples, that is.

Remaining Challenges of Next-Gen Sequencing

August 7, 2013 by Dan Koboldt

Although I often write about the challenges of next-gen sequencing, it occurred to me that these technologies have reached a certain point of maturity. Thanks to the happy co-evolution of technological and informatics advances, many of the initial problems are essentially solved. Consider the achievements this field has seen over the past several years:

We developed new algorithms for rapid, accurate mapping of short reads
Read lengths and base qualities improved dramatically while costs continue to drop
SAM/BAM formats became community standards for data storage and sharing
Sequence variants, particularly SNVs, can be detected with high accuracy using NGS data
Data submission and sharing with appropriate safeguards are now possible in dbGaP

Perhaps most importantly, sequencing service providers and new benchtop instruments have made NGS available to the wider research community. Nevertheless, some key challenges remain before we can exploit the full potential of high-throughput sequencing.

Finding Samples to Sequence

Once upon a time, the cost and laborious nature of sequencing limited how many samples could be included in a study. Now that throughput is no longer a problem, we are beginning to recognize that samples are the new commodity. Obtaining enough high-quality, properly consented samples remains a significant challenge for our field, particularly when it comes to rare disorders and minority populations.

If every investigator who studied a rare disease provided samples, clinical data, and consent letters to a central collection, we might overcome the sample bottleneck for many diseases. We all know that this will often be unrealistic. Samples have intrinsic value to them. Investigators have careers to build and grants to win. Where we have seen some success is in large-scale, multi-center studies like The Cancer Genome Atlas and the Alzheimer’s Disease Sequencing Project, which bring together investigators and samples and the resources required for some kind of integration.

Compute and Storage for Sequence Data

One of the more obvious challenges of NGS is the sheer volume of data. With sequencing throughputs rapidly outpacing Moore’s Law for compute power (see NHGRI’s timeline of sequencing costs), we find ourselves facing a major CPU and storage problem:

Cost per Genome vs. Moore’s Law (NHGRI)

There have been many innovative algorithmic developments that help address the growing data-to-CPU ratio. Storage, however, is another matter. Without losing information, storing aligned next-gen sequencing data takes disk space. The more data you have, the more space required.

How long should we store NGS data? Storage isn’t free — it requires hardware and maintenance and physical space — so at some point (especially given shrinking research budgets), we will need to address this issue.

Indels, SVs, and Other Difficult Variants

Detection of SNVs in NGS data for individuals or tumor-normal pairs is largely a solved problem, but that can’t be said for other types of variation. You know the kinds of variants I’m talking about: insertions, deletions, duplications, inversions, and more complex rearrangements are tough to characterize using short-read sequencing data. Many impressive algorithms have come out to detect and characterize these, but they all suffer much higher false-positive and false-negative rates than we see for SNVs.

It is possible that the current paradigm for high-throughput sequencing data — paired-end reads in the 100-250 bp range — may never yield an optimal solution for detecting these kinds of variants. At least, I think it’s more likely that some of the long-read technologies racing to market stand a better chance at letting us characterize these variants with >90% accuracy.

Discovery and Interpretation

Next-gen sequencing technologies have opened the floodgates for genomic data. Thousands of whole genomes and hundreds of thousands of exomes have been sequenced already. You will notice, however, that the rate at which new genomic discoveries are made (published), while impressive, does not match the rate at which new samples are sequenced. In other words, we have far more data than findings.

Previously I wrote about some of the reasons exome sequencing studies can fail even for well-characterized, monogenic Mendelian disorders, and the need for functional validation of genomic findings before we accept them as fact. Our ability to find, associate, and implicate genetic variants and candidate disease genes far outstrips our ability to understand them.

There is a pressing need for better downstream analysis tools to help interpret genomic data, especially in the ~97% of the genome that lies outside of the exons of known protein-coding genes.

Clinical Translation of Genomic Discoveries

Perhaps more importantly, the findings enabled by next-gen sequencing must eventually translate into improvements to human health. That, essentially, is our funded mandate.

There are many ways to make this happen. Identification of new disease genes may provide new therapeutic targets, and improve the predictive abilities of genetic testing. Clinical sequencing of patients suffering from disease may eventually guide diagnosis and treatment decisions.

Importantly, these pathways to clinical translation will require expertise in many disciplines: molecular biology, pharmacology, genetic counseling, clinical care, and many others.

We “genomics” people can’t do it alone. Reaching our ultimate goal will probably require ambitious multi-disciplinary collaborations focused on specific health problems. Genomics and NGS will certainly be a part of that, but not the biggest part. Not even half.

Data Sharing, Embargo, and Big Science

June 27, 2013 by Dan Koboldt

Data sharing is essential in the fields of genetics and genomics. It remains one of the core principles of federally-funded “big science” — large consortium efforts to conduct research at incredible scale. The Human Genome Project and its descendants — HapMap, 1000 Genomes, ENCODE, TCGA — are prime examples of such efforts. The resources that they have created (and provided, virtually without restriction) for the research community are priceless.

What makes these resources long-lasting and significant is the fact that they’re open-access for all. Perhaps more importantly, these impressive datasets were made available during the project, not afterward. This is no small thing. Anyone who’s participated in large-scale genomics projects probably understands just how much effort is required to QC and submit data incrementally, make data freezes, and ensure that they’re available to the community.

The Downside of Data Sharing

There are, of course, some disadvantages to sharing data in real time. During the HGP, a certain maverick took advantage of the public genome data as it was generated, using it to scaffold his company’s private, competing human genome draft assembly. Today, the “scoop” of public data is more ubiquitous for a number of reasons:

Central repositories. Most large-scale projects submit their data to a single place (e.g. dbGaP) where it’s relatively easy to find.
Rapid access to compressed data. High-speed internet access is everywhere — heck, Google’s floating balloons in New Zealand to bring it to distant tribes — and it’s faster than ever. An exome BAM file is about 10 gigabytes; even at modest speeds, you can download one in a matter of minutes. Compressed VCFs come even quicker.
Democratization of sequencing. Most investigators now have access to rapid, inexpensive sequencing of exomes or whole genomes. They can do 20 or 30 exomes, find a gene of interest, and then quickly look in large datasets such as TCGA for recurrence.
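As a back-of-the-envelope check on the download claim above, here is the arithmetic in Python. The link speeds are assumed examples, and the calculation ignores protocol overhead and server-side limits.

```python
def download_minutes(size_gigabytes, megabits_per_second):
    """Idealized transfer time in minutes for a file of the given size.

    Uses decimal units: 1 GB = 8e9 bits, 1 Mbps = 1e6 bits per second.
    """
    bits = size_gigabytes * 8e9
    seconds = bits / (megabits_per_second * 1e6)
    return seconds / 60
```

A 10 GB exome BAM takes about 13 minutes at 100 Mbps and a little over a minute at 1 Gbps, so "a matter of minutes" holds for a typical institutional connection.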

In short, obtaining and utilizing the datasets of “big science” initiatives is easier than ever. This is good news for the research community, so long as everyone plays fair. But let’s be honest, we live in the real world where that doesn’t always happen.

Data Embargo and First Rights

Most of these ambitious, expensive, long-term projects have a data use policy designed to protect the investment of money, samples, time, and other resources. Some data may be under embargo for a certain time, meaning that it’s submitted to public repositories but not available for download. This isn’t really open access, though, so it’s a policy that’s being used less and less. Instead, there’s usually a publication embargo: an understanding that the participants in the project get first rights to publish on their data.

In TCGA, the data use policy seems quite clear: no one gets to publish on a TCGA cancer type’s data until the first major publication, the “marker paper”, has come out. This is understood quite well, at least by TCGA participants. Remember that they’re in the unique position of generating the data, meaning that they see it before anyone else and are usually quite capable of analyzing it on their own. Nevertheless, everyone waits for and collaborates on the marker paper before going off on their own.

The Enforcement Problem

There’s a major problem with this policy, however, and that’s enforcement. When the data are made available, there’s no way to physically stop outside investigators from using them, even from writing up manuscripts and submitting them for publication. Unless the editor or peer reviewers are aware of a project’s data embargo status, it’s quite possible for those manuscripts to reach publication before the marker paper. Just this month, there was a paper in a high-profile journal that used (among other things) embargoed TCGA data.

Obviously the two lines of defense (the data use policy, and the editors and referees who reviewed the manuscript) failed to prevent this from happening. What happens now? I have heard anecdotal evidence that, in the past, such violations ultimately resulted in a paper being withdrawn. This is perhaps what should come to pass, though I don’t know if it will. In either case, the damage is done.

Points Against Publication Embargo

It needs to be said that there are questions about whether these embargo policies are in the best interest of the research community. Data that was generated using public funding belongs to the public, and that includes the project’s competitors. I’m keenly aware of the fact that some people disagree with funding “big science” projects. It means that many smaller grant proposals must go without funding. There are also those of the opinion that, if the participants in a project can’t get their shit together and publish before someone else does, that’s their problem.

Unfortunately, it’s not quite so simple, as anyone who’s tried to write a marker paper with a consortium understands quite well. With the vast amount of data (and egos) involved, these projects are cumbersome. I would argue, however, that the landmark publications coming out of these studies are worth the wait. Look at the incredible resources that “big science” projects have provided our community:

The HGP provided the reference, enabling us to annotate and find variants in the genome.
The HapMap Project yielded a comprehensive genetic map and gave rise to high-throughput genotyping, without which GWAS would not have been possible.
The 1000 Genomes Project helped spur the development of NGS technologies, algorithms, and file formats, as well as dramatically expanding the catalogue of human genetic variation.
The Cancer Genome Atlas has yielded, and continues to yield, critical comprehensive molecular profiles of common cancer types.
The ENCODE project has laid the groundwork for understanding the composition and function of elements in the genome.

These resources are priceless, and I would argue that they would not exist without the big science projects behind them. And yet, such efforts are doomed if investigators, journal editors, and peer referees fail to respect and enforce their data use policies.
