The release of the Illumina HiSeq X Ten sequencing system, with its current use restrictions (human samples only, whole-genome sequencing only), is going to cause a major paradigm shift in human genetics studies over the next few years. Until now, we've seen relatively few large-scale efforts to apply whole-genome sequencing (WGS) to large numbers of samples. But the capability of a single X Ten installation to sequence ~18,000 genomes per year at a relatively low cost means that, for the first time, it may become practical to apply WGS as the primary discovery tool.
I've already written about the realities of the sequencing GWAS, discussing some of the considerations in moving from genotyping (SNP arrays) to sequencing (next-gen) for genetic association studies. Unlike genotyping arrays, sequencing enables both variant discovery and genotyping, with the caveat that you'll end up with:
- Many rare variants private to an individual or family
- Increased missingness in the resulting genotypes
- More false-positive variants
- Additional QC challenges
These are simply the realities of going from clean, well-defined SNP array datasets (>99.1% call rate) to next-gen sequencing data, which depends on alignment, variant calling, and the depth/breadth of coverage.
Data Storage Demands
One of the major practical considerations for whole-genome sequencing is the computational side: data processing, storage, and retention. A binary alignment/map (BAM) file, which contains the sequences, base qualities, and alignments to a reference sequence, is about 80-90 gigabytes for a 30x whole genome. The BAM files for a modest study of 1,000 samples would consume 80-90 terabytes of disk space. And that disk space is not free; it costs actual dollars to purchase and maintain over time.
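For a rough sense of scale, here is a back-of-the-envelope sketch in Python. Both figures in it are assumptions for illustration: ~85 GB is just the midpoint of the 80-90 GB range above, and $60 per terabyte per year is a ballpark for maintained disk, not a quote.

```python
# Back-of-the-envelope storage math for 30x whole-genome BAM files.
# Assumed figures: ~85 GB per 30x BAM (midpoint of the 80-90 GB range)
# and an illustrative $60 per TB per year for maintained, replicated disk.

GB_PER_30X_BAM = 85
COST_PER_TB_YEAR = 60.0

def storage_estimate(n_samples):
    """Return (total terabytes, annual dollars) for n 30x BAM files."""
    total_tb = n_samples * GB_PER_30X_BAM / 1000.0
    return total_tb, total_tb * COST_PER_TB_YEAR

for n in (1_000, 18_000):  # a modest study vs. one year of X Ten output
    tb, cost = storage_estimate(n)
    print(f"{n:>6} genomes: ~{tb:,.0f} TB of BAMs, ~${cost:,.0f} per year on disk")
```

For 1,000 samples that lands right around the 80-90 TB figure above; for a single year of full X Ten output, it is roughly 1.5 petabytes of BAMs, every year.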
I’m resisting the urge to show you that cost of sequencing / Moore’s law comparison plot here.
Because disk space is both finite and costly, and these files are so huge, at some point researchers will have to delete old data to make room for new data. Kind of like a “one in, one out” policy at a crowded bar. No one likes throwing data away. We NGS analysts shudder at the idea of not being able to go back to the BAM file to run yet another variant caller, or to review that interesting variant. At some point, though, we may have to call a sample's analysis DONE and leave it that way. Because, let's be honest, 99% of the bases in a BAM file match the reference. It's the variants that we're truly interested in.
Data Transfer: Traffic Jams Ahead
Another consideration is the simple act of moving data around. With a $10 million price tag, few research groups will be able to afford an X Ten cluster, and those who can't will be unable to stay competitive on the cost of WGS. On the other side of the table, the lucky X Ten installation sites will need to find samples. This means that most whole-genome sequencing will take place at a few locations, with the resulting data transferred back to the investigators who submitted the samples.
Have you tried to download an 80 gigabyte file lately? The regular internet is just not going to work for this.
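Some rough numbers make the point. This is an illustrative Python sketch; the link speeds are assumed examples, and sustained throughput over the regular internet is usually well below the nominal line rate, so real transfers will be slower.

```python
# Best-case transfer times for a single 80 GB BAM at a few nominal link speeds.
# The link speeds are assumed examples; real-world sustained throughput is
# usually well below the line rate, so treat these as optimistic estimates.

FILE_GB = 80
FILE_BITS = FILE_GB * 8e9    # decimal gigabytes converted to bits

links_mbps = {
    "typical office/home connection (~20 Mbps)": 20,
    "campus link (~1 Gbps)": 1_000,
    "dedicated research network (~10 Gbps)": 10_000,
}

for name, mbps in links_mbps.items():
    seconds = FILE_BITS / (mbps * 1e6)
    study_days = seconds * 1_000 / 86_400    # scale up to a 1,000-genome study
    print(f"{name}: ~{seconds / 60:.0f} min per genome, "
          f"~{study_days:.1f} days for 1,000 genomes")
```

Even at campus-network speeds, and even in this best case, moving a full study's worth of BAMs takes about a week; over an ordinary connection it is effectively a year-long job.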
You There, with the Samples!
A couple of years ago, I wrote that in a world with widespread genome sequencing capacity, samples are the new commodity. That has never been more true than in the world of the X Ten. The institutions that have them will need to find several thousand samples per year in order to achieve the optimal per-genome cost.
I don't know too much about the details of sample consent, but I do know that many, many research samples are not consented for whole-genome sequencing. Because a whole genome has everything: your Y haplogroup (for males), your APOE allele, your BRCA1/2 risk variants, etc. There's no “we will only look at this gene or region” nonsense.
The Awkward Question
Who is going to pay for sequencing all of these samples? Don’t count on the X Ten centers to do it; remember, they had to shell out $10 million just to buy the thing. Even at a reagents/personnel cost of $1,000 per genome, an X Ten running at full capacity will cost $18 million per year. That’s a lot of cash, in an era when research budgets seem to be flat (if not shrinking). So now you need samples and the funds to sequence them.
It may actually be more difficult to persuade researchers to make the switch to sequencing, because it will still be five times more expensive than running a SNP array.
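To put rough numbers on the funding question, here is an illustrative sketch. The $1,000 per genome is the reagents/personnel figure above; the ~$200 per sample array price is an assumption chosen only to match the roughly five-fold gap, not a quoted price.

```python
# Illustrative cost comparison: WGS vs. SNP arrays for an association study.
# $1,000/genome is the reagents/personnel figure discussed above; the
# ~$200/sample array price is an assumption matching the ~5x gap.

WGS_PER_SAMPLE = 1_000
ARRAY_PER_SAMPLE = 200       # assumed
X_TEN_CAPACITY = 18_000      # genomes per year at full throughput

print(f"One year of full-capacity X Ten sequencing: "
      f"${X_TEN_CAPACITY * WGS_PER_SAMPLE:,}")

for n in (1_000, 10_000):
    print(f"{n:>6}-sample study: WGS ~${n * WGS_PER_SAMPLE:,} "
          f"vs. arrays ~${n * ARRAY_PER_SAMPLE:,}")
```

For a 10,000-sample association study, that is the difference between a two-million-dollar genotyping budget and a ten-million-dollar sequencing one.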
The Promise Ahead
I know that this post has had a bit of a negative tone, but I felt it necessary to get people thinking about the challenges ahead. Now, perhaps, we should talk about the promise of large-scale whole-genome sequencing. At last, we'll have sequencing studies that aren't biased towards coding regions or certain genes. Every sequenced genome will harbor over 3 million sequence variants. We can go after non-SNP variation, too: indels and structural variants are far easier to detect with WGS than with arrays or exome data, though SV calling is still a nascent area of bioinformatics.
The wonderful thing about WGS is that it both enables and forces us to look beyond the obvious (e.g. the nonsynonymous variants in known protein-coding genes). We’re headed into the unknown, the dark matter of the genome, whether we like it or not. And that is a good thing.
BrianKrueger says
It costs about $12 a year to store 200 GB of data (BAM/FASTQ) on a commodity object store system, even less on tape (but who wants to deal with that?). If the raw data is important, then it makes sense to keep it. I fully agree, though, that if you are confident in your variant calls, that's really all that needs to be stored; I just don't think we're very confident yet. Minor pipeline tweaks and major upgrades have had, and probably will continue to have, big impacts on those calls as the pipelines improve. It might make sense to keep the raw data around, since it still costs $1,500-2,000 to get a genome fully prepped, sequenced, and analyzed using an X Ten. (Illumina's $1,000 genome calculation forgets to include informatics personnel and IT hardware costs.) Keeping the raw data means it stays available for re-analysis or for combination with other techniques as they become more widely available (e.g., re-aligning samples with long reads) without having to pay that up-front cost again.
keiranmraine says
The BAM storage size issue has been largely solved by the gradual uptake of the CRAM format.
Dan Koboldt says
Thank you for the comment! I agree that CRAM may ultimately replace the BAM format, reducing file sizes by 40% or more. Even those improvements, however, won't entirely solve the problem when a single X Ten can produce 18,000 genomes per year.
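To put that in rough numbers (illustrative only: ~85 GB per 30x BAM is an assumed midpoint of the 80-90 GB range, and the 40% reduction is simply taken at face value):

```python
# Illustrative only: how much does a ~40% CRAM size reduction help at X Ten scale?
# The ~85 GB per 30x BAM is an assumed midpoint of the 80-90 GB range above.

GB_PER_BAM = 85
CRAM_REDUCTION = 0.40        # 40% smaller, per the discussion above
GENOMES_PER_YEAR = 18_000

bam_pb = GENOMES_PER_YEAR * GB_PER_BAM / 1e6   # decimal GB -> PB
cram_pb = bam_pb * (1 - CRAM_REDUCTION)
print(f"One year of X Ten output: ~{bam_pb:.2f} PB as BAM, ~{cram_pb:.2f} PB as CRAM")
```

Roughly 1.5 petabytes per year shrinks to just under a petabyte: a real saving, but still petabyte-scale storage every single year.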