Sequencing Finnish Population Isolates (SISu)

If you compare any individual’s genome to the human reference sequence, you’ll find around 3 million differences. Most of these (95%) are already known and have been catalogued in databases like dbSNP. Many are common, shared by 5% or more of human populations. They may still have biomedical relevance, of course; genome-wide association studies (GWAS) of common genetic variation have found thousands of genetic loci associated with disease susceptibility and other complex traits.

But there are still huge numbers of rare (MAF<0.5%) and low-frequency (MAF 0.5-5%) genetic variants. Their contribution to human health is harder to understand, particularly because such variants:

  • Are usually not included on high-density SNP arrays
  • Occur in few individuals, and thus require large cohorts
  • Have low individual power for genetic association
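The frequency bins used throughout these posts can be written down as a small helper. This is only a sketch: the function names are made up for illustration, and the cutoffs are simply the ones quoted above (rare below 0.5%, low-frequency from 0.5% up to 5%, common at 5% or more).

```python
def minor_allele_frequency(alt_count: int, total_alleles: int) -> float:
    """Return the frequency of the less common allele at a biallelic site."""
    if total_alleles <= 0 or not 0 <= alt_count <= total_alleles:
        raise ValueError("need 0 <= alt_count <= total_alleles and total_alleles > 0")
    freq = alt_count / total_alleles
    return min(freq, 1.0 - freq)


def classify_maf(maf: float) -> str:
    """Bin a minor allele frequency using the thresholds quoted in the text."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must be between 0 and 0.5")
    if maf < 0.005:
        return "rare"
    if maf < 0.05:
        return "low-frequency"
    return "common"


# Example: 3 alternate alleles among 1,000 diploid samples (2,000 chromosomes)
print(classify_maf(minor_allele_frequency(3, 2000)))
```

With only 3 copies in 2,000 chromosomes, a variant like this sits firmly in the rare bin, which is why such variants need large cohorts or founder populations to study.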

One way to address the challenges of rare variants is to study them in founder populations in which such variants are more common. Ashkenazi Jews and Amish families, for example, have undergone population bottleneck effects: a limited number of founders gave rise to the current populations.

This breeding isolation, whether cultural or geographic in nature, increases the frequency of some variants that are otherwise quite rare in broad populations. And if those variants underlie a genetic disorder, the risk of the disease is increased. Ashkenazi Jews, for example, have increased risk of many uncommon genetic disorders.

Finland has a unique population history — a bottleneck followed by geographic isolation — the result of which is a Finnish “disease heritage”: a high incidence of 40+ Mendelian disorders. Dozens of rare Mendelian disease genes were mapped in Finns, and that knowledge is valuable for understanding disease biology. What about rare variants underlying common, complex disease? Here the Finns have an important resource: nationalized health records with decades of follow-up data.

Sequencing Initiative Suomi (SISu)

The Sequencing Initiative Suomi (SISu) aims to combine the unique population structure, the health records, and the substantial Finnish interest in genetics. The first study from SISu, just out in PLoS Genetics, compares the exomes of 3,000 Finns to an equal number of non-Finnish Europeans (NFEs). They found:

  • A depletion of “singletons” (variants only seen in one individual) in Finns: 3.7 times fewer singletons than NFEs
  • An excess of low-frequency variants (MAF 0.5-5%) in Finns relative to NFEs
  • Similar patterns of common variants between Finns and NFEs

Finnish loss of function variants

Lim et al, PLoS Genetics 2014

All of these are consistent with the expected bottleneck effect on Finnish populations. When variants were stratified by annotation (i.e. their predicted effect on genes), Finns had a higher proportion of likely-deleterious missense variants and more severe loss-of-function (LoF, or protein-truncating) variants. The average Finn had 0.160 homozygous LoF variants, whereas the average NFE had 0.095.

To determine if some of these enriched LoF variants have phenotypic effects, the authors genotyped 83 of them in 36,262 individuals from three large Finnish cohorts. Using the deep phenotype data — quantitative traits like blood pressure, lipids, etc. — they found 5 significant associations.

LPA association

LPA level (Lim et al, PLoS Genet 2014)

One of these was an association between splice site variants in LPA, the gene encoding lipoprotein(a), and decreased levels of circulating lipoprotein(a). As it happens, circulating lipoprotein(a) is a risk factor for coronary heart disease. Looking at the medical records showed that LPA splice variants are protective against cardiovascular disease.

This is only a proof-of-principle study, the tip of the SISu iceberg. Yet it shows the value of sequencing Finnish populations to identify rare variants contributing to complex diseases. Undoubtedly, as large-scale sequencing of Finnish cohorts continues at places like WashU and the Broad Institute, we’ll have even more power to identify genes relevant for common diseases.




Population Whole-genome Sequencing: Dutch Edition

Genomes of the netherlands

GoNL Consortium, Nat. Gen. 2014

The last time I checked, the database of human genetic variation (dbSNP) contained over 50 million unique sequence variants. And yet, as anyone who analyzes exome or whole-genome sequencing data can tell you, every individual harbors a significant number of variants (usually around 5% of single nucleotide variants, or SNVs) that dbSNP has never seen.

These “private” or rare variants undoubtedly contribute to important phenotypes, such as disease susceptibility. Non-SNV variants, like indels and structural variants, are also under-represented in public databases. The only way to fully elucidate the genetic basis of a trait is to consider all of these types of variants, and the only way to find them is by large-scale sequencing.

In this month’s issue of Nature Genetics, the Genome of the Netherlands (GoNL) Consortium reports the whole-genome sequencing of 250 Dutch families from 5 biobanks across the Netherlands. The families comprised mostly parent-child trios (n=231), along with some family quartets with monozygotic (n=11) or dizygotic (n=8) twins. All told, 769 individuals had their genomes sequenced to ~13x depth.

Variant Calling

Granted, this is a modest coverage depth, considering that sequencing to 30x or 40x might be considered standard. To help address this, the authors performed joint sample calling with GATK. That, along with a combination of 10 indel/SV calling tools, yielded the following:

  • 20.4 million biallelic SNVs
  • 1.2 million biallelic indels of 1-20 bp
  • 27,500 larger deletions (>20 bp)

Here’s a quick tour of the highlights in each variant class.

Single Nucleotide Variants (SNVs)

Dutch genome SNPs dbSNP

GoNL Consortium, Nat. Gen. 2014

Half of the 20.4 million SNVs discovered in this study were rare, with MAF < 0.5%. The others were not quite evenly divided between low-frequency SNVs (4.0 million with MAF 0.5-5%) and common SNVs (6.2 million with MAF >5%). Altogether, there were around 7.6 million SNVs that were novel to dbSNP 137. Most of those (75%) were singletons, meaning that they were observed in just one individual. If we consider only the 500 unrelated individuals sequenced (the parents), that’s about 15,200 novel variants contributed per sequenced genome.
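For anyone who wants to check the arithmetic behind those figures, here is a quick back-of-the-envelope calculation. The variable names are mine; the inputs are the rounded numbers quoted in this post, so the outputs are approximations rather than values from the paper.

```python
# Rounded counts quoted in the text (not exact values from the paper)
total_snvs = 20_400_000
low_frequency_snvs = 4_000_000   # MAF 0.5-5%
common_snvs = 6_200_000          # MAF > 5%
novel_snvs = 7_600_000           # not in dbSNP 137
singleton_fraction = 0.75        # share of novel SNVs seen in one individual
unrelated_individuals = 500      # the parents

# Whatever is neither low-frequency nor common must be rare (MAF < 0.5%)
rare_snvs = total_snvs - low_frequency_snvs - common_snvs
rare_share = rare_snvs / total_snvs

novel_singletons = novel_snvs * singleton_fraction
novel_per_genome = novel_snvs / unrelated_individuals

print(f"rare SNVs: {rare_snvs:,} ({rare_share:.0%} of all SNVs)")
print(f"novel singletons: {novel_singletons:,.0f}")
print(f"novel SNVs per unrelated genome: {novel_per_genome:,.0f}")
```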

Among the ~2 million singletons uncovered in the European panel of a different project (1,000 Genomes), 16.5% were observed in GoNL. The authors therefore expect that a “substantial number” of singleton variants reported by these projects will be seen again as larger European cohorts are sequenced. Even so, that’s a lot of “private” variation. Remember, too, that these cohorts are from northwest Europe, arguably one of the best-characterized ancestry groups thus far.

Indels and Structural Variants

Compared to SNV calling, the detection of indels and larger SVs remains a considerable challenge. Anyone working in NGS informatics can tell you that. The authors have put forth a good effort in this arena by combining the results of 10 different variant callers: GATK UnifiedGenotyper, Pindel, 1-2-3SV, BreakDancer, DWAC, CNVnator, FACADE, MATE-CLEVER, GenomeSTRiP and SOAPdenovo. These are all different algorithms, but they boil down to five approaches for uncovering indels and SVs:

  1. Gapped read alignments to the reference (e.g. GATK)
  2. Split reads, an approach pioneered by Pindel
  3. Paired-end read distance/orientation (e.g. BreakDancer)
  4. Overall read depth changes (e.g. CNVnator)
  5. De novo assembly of SV breakpoints.

Dutch genome SV calls

GoNL Consortium, Nat. Gen 2014

Some of the tools use one approach, while others employ multiple approaches. No single indel/SV caller has emerged as vastly superior to all others, so combining the results from a suite of different tools seems like a good strategy. The authors divided variants into three size categories (1-20 bp, 20-100 bp, and >100 bp) and kept any SV detected by at least two orthogonal tools. Their validation rate (138/144, or 96.5%) for randomly-chosen SVs of at least 20 bp is impressive.
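To make the "at least two tools" idea concrete, here is a minimal sketch of consensus filtering for deletion calls. It is not the GoNL pipeline: the `Deletion` record, the 50% reciprocal-overlap rule for deciding that two calls describe the same event, and the choice to count distinct tool names are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Deletion(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive; must be greater than start
    tool: str   # name of the caller that reported it


def reciprocal_overlap(a: Deletion, b: Deletion) -> float:
    """Fraction of overlap relative to the longer-suffering (smaller-fraction) call."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def consensus_deletions(calls: List[Deletion],
                        min_tools: int = 2,
                        min_overlap: float = 0.5) -> List[Deletion]:
    """Keep every call that at least `min_tools` distinct tools agree on."""
    kept = []
    for call in calls:
        # A call always supports itself, so its own tool is counted once
        supporters = {other.tool for other in calls
                      if reciprocal_overlap(call, other) >= min_overlap}
        if len(supporters) >= min_tools:
            kept.append(call)
    return kept


calls = [
    Deletion("1", 100, 400, "Pindel"),
    Deletion("1", 120, 410, "BreakDancer"),
    Deletion("1", 5000, 5300, "CNVnator"),   # no second tool agrees
    Deletion("2", 100, 400, "Pindel"),       # same coordinates, other chromosome
]
for call in consensus_deletions(calls):
    print(call)
```

The sketch returns each supported call rather than merging them into one event, and its all-against-all comparison is quadratic; a real pipeline would merge breakpoints and use an interval index.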

The size distribution of consensus calls showed peaks at +/- 4 bp (microsatellite instability), ~300 bp (SINEs), and ~6 kbp (LINEs). Not remarked upon in the manuscript is the largest peak right near zero, since 1-2 bp indels are by far the most common. While 54.4% of short indels (<20 bp) were already in dbSNP, virtually all of the mid-size deletions (30-500 bp) were not (98.4%). Thus, this study helps fill an important gap in the catalogue of human sequence variation.

Functional Variation

LOF variants in dutch genomes

Rare, low freq, and common variant distributions (GoNL, Nat. Gen 2014)

Because these families were not obtained “on the basis of phenotype or disease,” their patterns of genetic variation provide a useful model for apparently healthy individuals.

Rare Loss-of-Function Variants

Among rare variants identified in this study, the authors observed an excess of nonsense SNVs and frameshift indels, consistent with the expectation that damaging variants would be under strong purifying selection.

A similar excess-of-rare-events was evident for larger deletions that removed the first exon or >50% of the coding sequence of a gene. The effect was even stronger when considering only genes in the OMIM database, reflecting strong purifying selection against structural changes in key genes.

On average, each individual in GoNL had about 60 nonsense or splice-site SNVs. Most of these, however, were common in the cohort (MAF>5%, and thus unlikely to be deleterious), which illustrates the need for cautious interpretation of apparent loss-of-function (LOF) variants. Looking at rare variants, and using synonymous SNVs as a baseline, the authors estimate that each individual might have 4-5 rare loss-of-function SNVs.

Compound Heterozygous Events

Individuals that were compound-heterozygous (i.e. one variant on each parental haplotype) for rare loss-of-function SNVs/indels/SVs were extremely rare. The authors found just 3 such instances across the cohort (an average of 0.01 events per individual). Such events are thus of considerable interest for disease studies.
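In code terms, finding compound heterozygotes comes down to asking whether one gene carries a loss-of-function variant on each parental haplotype. The sketch below assumes the variants have already been phased and annotated; the input layout and the gene names are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def compound_het_genes(lof_variants: Iterable[Tuple[str, str]]) -> List[str]:
    """Return genes hit by heterozygous LoF variants on both parental haplotypes.

    `lof_variants` holds (gene, haplotype) pairs for one individual, where
    haplotype is "maternal" or "paternal" (an assumed, simplified input).
    """
    haplotypes_hit = defaultdict(set)
    for gene, haplotype in lof_variants:
        haplotypes_hit[gene].add(haplotype)
    return sorted(gene for gene, hits in haplotypes_hit.items()
                  if {"maternal", "paternal"} <= hits)


# Hypothetical individual: GENE_A is hit on both haplotypes, GENE_B only on one
variants = [
    ("GENE_A", "maternal"),
    ("GENE_A", "paternal"),
    ("GENE_B", "paternal"),
    ("GENE_B", "paternal"),
]
print(compound_het_genes(variants))
```

Two variants on the same haplotype (GENE_B here) leave the other copy of the gene intact, which is why phasing matters for this analysis.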

Compound heterozygosity for common LOF variants should have been far more prevalent, because these are less likely to be truly deleterious. Indeed it was; there were about 3 compound heterozygous events of common LOF variants per individual. Interestingly, the 1,917 such events observed across the entire cohort were confined to 11 genes (C11orf40, DEFB126, GSTT2, HTR3D, KRTAP4-8, MS4A14, OR13C2, SIGLEC12, TRY6, VWDE, and WNK1) which all seem to have high mutation tolerance.

HGMD False Positives

The Human Gene Mutation Database (HGMD) is a commercial repository of “disease causing” variation in humans. When the authors of this study annotated variants with HGMD information, each individual harbored about 20 such variants. In other words, HGMD annotation would suggest that a large number of GoNL individuals have diseases with profound physical (or even lethal) consequences. Whoops.

It’s possible that the HGMD variants simply cause disease in non-Dutch populations, or have low penetrance. An alternative possibility is that HGMD has a lot of false positives. Among the 1,093 HGMD variants in GoNL, almost a third had MAF>1%, which is much higher than the frequency of the diseases they’re reported to cause.

De Novo Mutations

de novo mutation calling

De novo mutation calling performance, GoNL, Nat. Gen 2014

One of the most fascinating aspects of this study was the exploration of de novo mutations (variants present in a child but absent from both parents). These events are extremely rare (occurring at a rate of around 1 in 100 million bases), and identifying them absolutely requires sequencing at least three genomes: an individual and both biological parents.

Even then, they’re very difficult to find: across the 258 independent offspring in GoNL there were 4.5 million apparent Mendelian violations. The authors applied a method (PhaseByTransmission) to refine this to 29,162 candidate autosomal de novo mutations. That’s about 113 per offspring, far too many.
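A Mendelian violation is easy to define for a single autosomal site in a trio, which is part of why raw counts are so inflated by genotyping errors. The toy check below is not PhaseByTransmission, just an illustration of the underlying logic; genotypes are assumed to be unphased pairs of allele strings, and the function names are mine.

```python
from typing import List, Tuple

Genotype = Tuple[str, str]  # e.g. ("A", "G") at an autosomal site


def is_mendelian_violation(child: Genotype, mother: Genotype, father: Genotype) -> bool:
    """True if the child's genotype cannot be built from one allele per parent."""
    a, b = child
    consistent = (a in mother and b in father) or (a in father and b in mother)
    return not consistent


def candidate_de_novo_alleles(child: Genotype, mother: Genotype, father: Genotype) -> List[str]:
    """Alleles seen in the child but in neither parent."""
    return sorted(set(child) - set(mother) - set(father))


# Child carries a G that neither parent has: a de novo candidate
print(is_mendelian_violation(("A", "G"), ("A", "A"), ("A", "A")))
print(candidate_de_novo_alleles(("A", "G"), ("A", "A"), ("A", "A")))

# Child is G/G but the father is A/A: a violation with no new allele,
# which points to a genotyping error rather than a new mutation
print(is_mendelian_violation(("G", "G"), ("A", "G"), ("A", "A")))
print(candidate_de_novo_alleles(("G", "G"), ("A", "G"), ("A", "A")))
```

The second case is the kind of event a filtering step has to throw out: at modest coverage, a missed heterozygous call in a parent or child looks exactly like a violation.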

So the authors attempted to independently validate over 1,000 candidate de novo mutations by orthogonal sequencing, and found that around 50% were false positives. Some independent Complete Genomics data for 19 parents and 1 child revealed another 1,137 events that were false positives. From these 2,270 observations, the authors developed a random forest classifier to predict whether a predicted mutation would be truly de novo or not based on a number of different properties. This is something that many other groups (including ours) have done for somatic mutation calling in cancer genomes.

The classifier in this study, which had an estimated accuracy of 92%, relied primarily on factors related to the sequencing depth and read counts, which happens to be the basis for mutation detection with VarScan 2. When applied to the GoNL dataset, the classifier nominated 11,020 high-confidence de novo mutations, roughly 42.7 per offspring, with a range of 18 to 74 per offspring. That’s a bit higher than it should be, but still reasonable for downstream analysis.

Paternal Age and de novo Mutation Rate

paternal age and mutation rate

GoNL Consortium, Nat. Gen 2014

The authors observed a significant correlation between the father’s age at conception and the number of de novo mutations in the child. This is the third study to report such a trend, and the largest sample size yet. Although the ages of mother and father are highly correlated, the effect of parental age on de novo mutation rate was primarily due to paternal influence.

The authors estimate that each additional year of paternal age caused a 2.5% increase in the number of de novo mutations in the child. Under their model, about 75% of de novo mutations come from the father, and 25% from the mother. Phase analysis using read pairs (a complex process I won’t go into) revealed that 76% of de novo mutations were indeed on the paternal haplotype. So you can thank your dad for 3/4 of your de novo mutations.

De Novo Indels and SVs

The authors attempted to find de novo indels and structural variants. It didn’t go well.

In Summary

The authors have employed moderate-coverage whole genome sequencing to build a resource of 1,000 haplotypes for a small, densely-populated country in northwestern Europe. They added 7.6 million SNVs to dbSNP, and also characterized a large number of new indels and SVs. Many more studies which apply genome sequencing to large population cohorts will be necessary to fill out the catalogue of human genetic variation.

New Challenges of Next-Gen Sequencing

I first started MassGenomics in the early days of next-gen sequencing, when Illumina sequencing was called “Solexa” and came as fragment-end, 35-bp reads. Even so, the unprecedented throughput of NGS and the nature of the sequencing technology brought a whole host of difficulties to overcome, notably:

  • Bioinformatics algorithms developed for capillary-based sequencing didn’t scale.
  • Sequencing reads were shorter and more error-prone.
  • The instruments were expensive, limiting access to the technology.
  • Most of the genetics/genomics/clinical community had no experience with NGS.

All of these are essentially solved problems: new bioinformatics tools and algorithms were developed, the reads became longer and more accurate, benchtop sequencers and sequencing-service-providers hit the market, and NGS was widely adopted by the research community. Mission accomplished!

Yet these victories were short-lived, because we find ourselves facing new challenges. Harder challenges. Here are a few of them.

1. Data storage

You’ve probably seen the plot of Moore’s Law compared to sequencing throughput. In short, the cost of DNA sequencing has plummeted much faster than the cost of disk storage and CPU. A run on the Illumina HiSeq2000 provides enough capacity for about 48 human exomes. Even if you don’t keep the images, each exome requires about 10 gigabytes of disk space to store the bases, qualities, and alignments in compressed (BAM) format. At three runs a month, each instrument generates about 1.4 terabytes of data files per month. It adds up quickly.
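The storage estimate is simple multiplication, but it helps to have it in a form where you can plug in your own numbers. The function name and the decimal gigabyte-to-terabyte conversion are my choices; the defaults are the rough figures quoted above.

```python
def monthly_storage_gb(exomes_per_run: int = 48,
                       gb_per_exome: float = 10,
                       runs_per_month: int = 3) -> float:
    """Disk space needed per instrument per month, in gigabytes."""
    return exomes_per_run * gb_per_exome * runs_per_month


per_month_gb = monthly_storage_gb()
print(f"{per_month_gb:,.0f} GB per month (~{per_month_gb / 1000:.2f} TB)")
print(f"{per_month_gb * 12 / 1000:.1f} TB per instrument per year")
```

This counts only the BAM files; the downstream analysis products mentioned next come on top of it.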

Analysis of sequencing data — variant calling, annotation, expression analysis, genetic analysis — also requires disk space. Most non-BGI research budgets are finite, so investigators must choose between (1) deleting data, (2) spending money, or (3) holding up data production/analysis. None of those sound very appealing, do they?

2. Achieving Statistical Significance

NGS is no longer an exploratory tool, and descriptive studies reporting a dozen or a couple hundred genomes/exomes are harder and harder to publish. This is particularly true for common diseases, in which large numbers of samples are typically required to achieve statistical significance. A figure of 10,000 samples has been discussed as appropriate. Even if that many samples could be found, the cost of sequencing them is substantial. If you had an Illumina X Ten system and could do whole genomes for $1,000 each (that only covers reagents, by the way), it’s still ten million dollars. That’s probably over budget for most groups, so they’ll have to take another tack:

  • Sequencing fewer samples, which will make the work harder to fund/publish
  • Combining some sequencing with follow-up genotyping, which limits the discovery power
  • Collaborating with other labs/consortia, whose sample populations, phenotypes, or study designs may vary

How many of your project planning meetings have ended with someone saying, “Well, maybe we’ll get lucky”?

3. Finding Samples

Getting access to large sample cohorts is another challenge. As I’ve previously written, given the widespread availability of exome and genome sequencing, samples are the new commodity. High-quality DNA samples from informative sources (tumor tissue, diabetes patients, families with rare disorders, even healthy members of minority populations) are increasingly valuable. Why should an investigator collaborate with you, when they might send the samples off for sequencing on their own?

Sequencing samples with public funds (i.e. NIH grants) adds another layer of difficulty: all sequencing data must be submitted to public repositories. This means that the volunteer must have given informed consent not just for study but for data sharing. Local IRBs even need to sign off. The net result is that many of the samples that come to us for sequencing don’t meet the criteria, and must be returned.

4. Privacy

Even if you have an outstanding, comprehensive informed consent document, it might be difficult to get volunteers to sign it. There’s a growing public concern about the privacy of genetic information. As Yaniv Erlich demonstrated by hacking the identities of CEPH sample contributors, genetic profiles obtained from SNP arrays, exome, or genome sequencing can be used to identify individual people. They also contain some very private details (like ancestry and disease risk alleles) that might be exploited, made public, or used for discrimination.

How long is it before genetic profiling replaces Google-stalking as a screening tool for job candidates or romantic interests? Thanks for coming in, Mr. Johnson. All we need now is your Facebook password and a cheek swab.

5. Functional Validation of Genomic Findings

Numerous research groups have demonstrated the immense discovery power of NGS. The mere fact that dbSNP — the NCBI database of human sequencing variation — has swelled to more than 50 million distinct variants tells us something about what pervasive genome sequencing capabilities might uncover. And yet, the variants implicated in sequencing-based studies of human disease are increasingly difficult to “sell” to peer reviewers on genetic information alone. Our inability to predict the phenotypic impact of genetic variants lurks beneath the veneer of genetic discoveries like a shark following a deep-sea trawler.

Referees of most high-impact journals want to see some form of functional validation of genomic discoveries. That’s a daunting challenge for many of us accustomed to the rapid turnaround, high-throughput nature of NGS. Most functional validation experiments are slow and laborious by comparison.

6. Translation of NGS to the Clinic

We all know that NGS is destined for the clinic. Targeted sequencing panels are already in routine use at many cancer centers; in time, this will likely become exome/genome sequencing. Possibly transcriptome (RNA-Seq) and methylome (Methyl-Seq) as well. Undiagnosed inherited diseases, and rare genetic disorders whose genetic cause is unknown, are two other common-sense applications. There are many hurdles to overcome in order to apply a new technology to patient care. CLIA/CAP certification is a complex, expensive, and time-consuming process.

The reporting is more difficult, too. Unlike the research setting in which most NGS results have arisen, a clinical setting requires very high confidence in order to report anything back to the patient or treating physician. This is a good thing, since patient care decisions might be made based on genomic findings. Yet it means that we have a considerable amount of work ahead to ensure that genomic discoveries are followed up, replicated, and otherwise vetted to the point where they can be of clinical use.


Genetic Studies in Isolated Populations: Greenland

Many large-scale genetic studies are conducted in broad/homogenous population cohorts, like the panels chosen for the 1,000 Genomes Project. Yet there are advantages to studying smaller and more isolated founder populations, as exhibited in a study in this month’s Nature. Moltke et al report a genetic association study in the Greenlandic population, a small and historically isolated population (57,000 people). Of particular interest: type II diabetes (T2D), the prevalence of which has skyrocketed over the past 25 years.

Granted, T2D is a complex disease with many contributing factors. From a genetics perspective, the small and historically isolated population offers some advantages. The limited genetic diversity means that linkage disequilibrium is extended, which boosts the power of genetic association studies. Further, deleterious variants can reach higher frequencies due to the founder effect (genetic drift in a small population across many generations). This phenomenon has already been documented in the Finnish population, whose unique population history has contributed to the increased prevalence of at least 40 genetic disorders.

Part I: The Genome-Wide Association Study

GWAS manhattan plot

Moltke et al, Nature 2014

In the current study, the authors began with a GWAS in ~2,700 samples collected from 12 locations in Greenland. They did the genotyping on the Illumina Metabochip, a customized SNP array with 200K variants marking regions implicated in metabolic, cardiovascular, and anthropometric traits. Notice how the authors are stacking odds in their favor? A founder population, with an increased prevalence of T2D, and a custom genotyping array to study the relevant traits.

The authors employed a linear-mixed model to search for association of variant alleles with T2D status. You can see the hit on chromosome 13. That’s rs7330796, a variant in intron 11 of the TBC1D4 gene that was selected for inclusion on the Metabochip because it was in the top 5,000 candidate SNPs for association with waist-to-hip ratio.

Part II: Exome Sequencing and Fine Mapping

GWAS fine mapping in T2D

Moltke et al, Nature 2014

Even though this was a specialized SNP chip, it was unlikely that rs7330796 was the causal variant. It’s intronic, and not in a large LD block. So the authors performed exome sequencing on nine trios, and identified four coding variants in high LD with rs7330796. One of these was a nonsense variant, introducing an early stop codon (p.Arg684Ter) in the longer isoform of TBC1D4.

When genotyped in the main cohort, p.R684X was strongly associated with 2-h plasma glucose levels. When conditioning on rs7330796 alleles, it was also associated with 2-h serum insulin levels.

Functional Assessments and Follow-up

Functional impact of T2D variant

Moltke et al, Nature 2014

The fascinating thing about this nonsense variant was that it had an almost Mendelian effect on plasma glucose levels. Heterozygous carriers had a slight elevation, but subjects homozygous for the alternate allele had significantly elevated plasma glucose. Strikingly, the frequency of T2D in homozygous carriers was also about 5x higher.

Thus, the p.R684X variant confers increased risk of a subset of diabetes that features deterioration of glucose homoeostasis. And with an odds ratio of 10.3, the variant exhibits an effect size much higher than any other variants associated with T2D to date.

Interestingly, though p.R684X has a MAF of ~17% in Greenlanders, the variant was essentially absent from other sequenced populations (the 1,000 Genomes panels, 6,500 ESP exomes, and 2,000 Danish exomes), except for one Japanese sample from 1,000 Genomes. However, the Greenlandic population is an admixed population with European and Inuit heritage. Sure enough, when the authors looked at the Inuit population, they found the variant at a MAF of 23%. Thus, though this variant is not unique to Greenland, it’s far more common there than in other populations.

One criticism of classic GWAS approaches is that they typically identify genetic markers associated with certain phenotypes, not the causal variants themselves. This study went beyond that with exome sequencing and follow-up genotyping. The beauty is that TBC1D4 makes a lot of sense as a diabetes gene. It acts as a mediator of insulin-stimulated, Akt-induced glucose uptake by regulating the GLUT4 transporter. Knockout mice have decreased basal plasma glucose levels, and are resistant to insulin-stimulated glucose uptake in muscle and adipose tissue.

TBC1D4, Glucose, and Diabetes

As I mentioned, TBC1D4 has two different isoforms: a full-length protein, and a shorter isoform that’s missing exons 11 and 12. The p.R684X variant is in exon 11, so it likely only impacts the full-length isoform. Interestingly, that isoform is only expressed in skeletal muscle. Therefore it doesn’t affect TBC1D4 signaling in B-cells, the liver, or adipose tissue. In short, this variant’s disruption of the full-length protein causes higher plasma glucose levels, insulin resistance, and thereby confers risk of type II diabetes. It’s a sensible, straightforward story.

The work demonstrates that isolated founder populations are an important resource for mapping physiologically-relevant genetic variation associated with complex disease. 
