
    The Search for Somatic Changes

    October 29th, 2009

    As cancer genome sequencing ramps up here and pretty much everywhere around the world, I got to thinking about strategies for identifying somatic changes, with confidence, from massively parallel sequencing data.  As part of the Cancer Genome Atlas Project (TCGA), we’ve been applying both targeted (capture-based) and whole-genome sequencing approaches to tumor samples and matched normal controls.  Ideally, the resulting data will yield high (>20x) coverage in both tumor and normal across our positions of interest.  What happens next, at least at WashU, is the culmination of a multiple-year effort to develop a comprehensive pipeline for detecting somatic variants.

    First up: Single Nucleotide Variants (SNVs)

    With more than 15 million entries in dbSNP, single nucleotide polymorphisms (SNPs) remain the most common form of DNA sequence variation in humans.  In cancer, most of the well-characterized somatic mutations are single nucleotide changes as well.  Conceptually, SNVs should be the easiest things to find in next-gen sequencing data.  They occur at a single position that can be directly compared between tumor and normal.  They should have minimal effects on sequence alignments to the reference genome. For example, here’s a putative somatic variant in TP53:

    [Figure: SAMtools pileup output for Normal and Tumor at position 7518990 on chr17]

    What you see above is SAMtools “pileup” output at a single position (7518990 on chr17), for Normal and Tumor.  The Normal shows 4 reads that all support the reference on the – strand (,,,,).  The Tumor, however, shows 6 reads that all support a G variant, 2 on the + strand (GG) and 4 on the – strand (gggg).  It seems reasonable that, given this output across the entire genome for Normal and Tumor, one can compare them at every position and look for differences such as these.
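
    To make the comparison concrete, here is a minimal Python sketch of how one might tally the base column of a pileup line by strand and flag a position like the one above. The function names (`count_pileup_bases`, `looks_somatic`) and the `min_tumor_reads` cutoff are assumptions for illustration only; this is not the pipeline we run, and it ignores base and mapping qualities entirely.

```python
import re
from collections import Counter


def count_pileup_bases(bases):
    """Tally reference and variant calls by strand from a pileup base string.

    '.' / ',' are reference matches on the + / - strand; upper-case letters
    are mismatches on +, lower-case letters are mismatches on -.
    """
    counts = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            # Read-start marker: the next character is a mapping quality.
            i += 2
            continue
        if c in "+-":
            # Indel: sign, length digits, then that many inserted/deleted bases.
            m = re.match(r"\d+", bases[i + 1:])
            if m:
                i += 1 + len(m.group()) + int(m.group())
                continue
        if c == ".":
            counts["ref+"] += 1
        elif c == ",":
            counts["ref-"] += 1
        elif c in "ACGTN":
            counts[c + "+"] += 1
        elif c in "acgtn":
            counts[c.upper() + "-"] += 1
        # '$' (read end) and '*' (deleted base) fall through and are skipped.
        i += 1
    return counts


def looks_somatic(normal_bases, tumor_bases, allele, min_tumor_reads=2):
    """Naive check: allele seen in Tumor but absent from Normal (hypothetical rule)."""
    normal = count_pileup_bases(normal_bases)
    tumor = count_pileup_bases(tumor_bases)
    normal_support = normal[allele + "+"] + normal[allele + "-"]
    tumor_support = tumor[allele + "+"] + tumor[allele + "-"]
    return normal_support == 0 and tumor_support >= min_tumor_reads
```

    For the TP53 position above, `looks_somatic(",,,,", "GGgggg", "G")` returns `True`. Note how thin the evidence is on the Normal side: if each read samples either allele with equal probability, a heterozygous germline variant would show no variant reads in 4 reads about 6% of the time (0.5^4).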

    Yet we struggle to validate even high-confidence SNVs that look to be somatic.  Some are real but germline (probably under-sampled or missed in the Normal); most are simply false positives in the tumor. These might arise from a number of causes – homopolymers, paralogs, repeats, sequencing error, alignment error, etc.  Only a small fraction of variants that appear somatic in NGS data will validate as such.

    Why is that?  In general, it’s because by screening for somatic variants, we remove all of the variants that are most likely to be real. First, we exclude any variants that are present in the normal (germline) – which account for the majority of true sequence variations.  We also exclude known variants from the dbSNP and 1000 Genomes databases, which are also likely to be real but almost certainly germline events.  Then, we prioritize variants that are predicted to have functional effects – on protein coding, on splicing, in conserved regions, etc.  Such regions are often under negative selection for damaging mutations, meaning that variants should be exceedingly rare.  Every one of these filters selects for variants that are less likely to be valid.

    Small Indels

    With longer (>50 bp) fragment-end reads and/or paired-end libraries, it’s possible to detect small insertion/deletion variants (indels) in next-gen sequencing data.  Here, detection and specificity are the challenges.  In 454 data, the reads are [hopefully] of sufficient length (250 bp) for accurate gapped alignment to a reference sequence, and indeed, aligners commonly used with 454 data (Newbler, BLAT, cross_match, SSAHA2) do so.  Unfortunately, indels are both the strength and the weakness of 454 data - due to the underlying pyrosequencing, homopolymeric regions are often under- or over-called, resulting in numerous false positives.  Many can be filtered, but often homopolymer-associated errors cause mis-alignment of reads, yielding indels that might not look like homopolymer artifacts.

    Indel detection is also possible with Illumina data, though the shorter read lengths make this challenging.  Few short read aligners can handle the throughput of Illumina data and allow for gaps in read alignments, because speed and gapped alignment are at odds with one another.  Fortunately, paired-end sequencing on Illumina offers a solution implemented by Maq some time ago – first, map all reads that you can without gaps, and then, look for gapped alignments in unplaced reads whose mate is mapped nearby.  This reduces the search space considerably for gapped alignment, and also limits the query space to reads that likely contain indels (gaps).
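
    The two-pass idea can be sketched in a few lines of Python. This is only an illustration of the selection step, not Maq's actual implementation: the `Read` record, the function name, and the `max_insert` window are all hypothetical, and the gapped alignment itself is left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Read:
    name: str
    chrom: Optional[str]  # None if the ungapped pass could not place the read
    pos: Optional[int]


def gapped_alignment_candidates(
    pairs: List[Tuple[Read, Read]], max_insert: int = 500
) -> List[Tuple[str, str, int, int]]:
    """Pick unplaced reads whose mate is mapped, plus the window to search.

    Returns (read_name, chrom, window_start, window_end) tuples. Only these
    reads, and only these windows, would be handed to a gapped aligner.
    """
    candidates = []
    for read1, read2 in pairs:
        for unplaced, mate in ((read1, read2), (read2, read1)):
            if unplaced.chrom is None and mate.chrom is not None:
                start = max(0, mate.pos - max_insert)
                end = mate.pos + max_insert
                candidates.append((unplaced.name, mate.chrom, start, end))
    return candidates
```

    Pairs where both reads mapped, or where neither did, drop out of the candidate list, which is where the reduction in search space comes from.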

    In cancer sequencing, small indels present one additional problem – determining whether they are present in the normal.  Even the best aligners can’t always precisely define where an indel starts or stops.  Thus, a germline indel might have different coordinates in the tumor than in its matched control; when comparing the samples, it might appear to be somatic.
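
    A common remedy for this coordinate ambiguity is to shift every indel to its left-most equivalent position before comparing samples. The Python sketch below does this for a simple deletion; the function name and the 0-based coordinates are my own choices for illustration, and real normalization tools handle insertions and complex alleles as well.

```python
def left_align_deletion(ref, pos, length):
    """Shift a deletion of `length` bases starting at 0-based `pos` as far left
    as possible without changing the resulting sequence.

    In a repeat, moving the deleted window one base left yields the same
    haplotype whenever the base entering the window on the left equals the
    base leaving it on the right.
    """
    while pos > 0 and ref[pos - 1] == ref[pos + length - 1]:
        pos -= 1
    return pos


def apply_deletion(ref, pos, length):
    """Return the sequence with ref[pos:pos+length] removed."""
    return ref[:pos] + ref[pos + length:]
```

    In the toy reference `GGCACACATT`, deleting `CA` at position 4 or at position 6 produces the same sequence, and both calls normalize to position 2. Comparing normalized coordinates keeps a germline indel from looking somatic just because the tumor and normal alignments placed it differently.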

    Loss of Heterozygosity (LOH)

    It is well known that the genomes of tumors show extensive loss of heterozygosity (LOH).  Generally, this occurs because a position that is heterozygous in the germline is affected by some kind of structural event – deletion, gene conversion, chromosome loss, etc. – that results in the loss of one allele.  Of course, to detect LOH, one needs a variant that’s heterozygous in the Normal, and to precisely define the region of LOH, one needs a dense set of heterozygotes.  Even so, the maximum precision for the start and stop of an LOH region is the inter-SNP distance, since only SNPs can inform on LOH, and that can be hundreds or thousands of bases.  But LOH calls do tend to cluster, and detection of LOH regions is not really the problem.  Even lower-resolution array technologies identify recurrent LOH regions in tumor samples.
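
    Here is a rough Python sketch of that logic: call LOH at sites that are heterozygous in the Normal and near-homozygous in the Tumor, then group consecutive LOH sites into regions whose true boundaries lie somewhere between the flanking informative SNPs. The allele-frequency thresholds and function names are assumptions picked for illustration, not validated cutoffs.

```python
def call_loh(sites, het_range=(0.35, 0.65), hom_cutoff=0.15):
    """sites: sorted list of (position, normal_vaf, tumor_vaf).

    Returns (position, is_loh) for every site heterozygous in the Normal;
    sites that are not heterozygous in the Normal are uninformative.
    """
    informative = []
    for pos, normal_vaf, tumor_vaf in sites:
        if het_range[0] <= normal_vaf <= het_range[1]:
            lost = tumor_vaf <= hom_cutoff or tumor_vaf >= 1 - hom_cutoff
            informative.append((pos, lost))
    return informative


def loh_regions(informative):
    """Group consecutive LOH sites; record the flanking retained heterozygotes.

    The true breakpoint lies between left_flank and first, and between
    last and right_flank (None means no flanking heterozygote was seen).
    """
    regions, current, prev_retained = [], None, None
    for pos, is_loh in informative:
        if is_loh:
            if current is None:
                current = {"left_flank": prev_retained, "first": pos,
                           "last": pos, "right_flank": None}
            else:
                current["last"] = pos
        else:
            if current is not None:
                current["right_flank"] = pos
                regions.append(current)
                current = None
            prev_retained = pos
    if current is not None:
        regions.append(current)
    return regions
```

    The gap between `left_flank` and `first` (and between `last` and `right_flank`) is exactly the inter-SNP uncertainty described above.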

    But what exactly does LOH mean in terms of cancer development and growth? It’s hard to say.  Quite possibly, a tumor suppressor gene was deleted, or an oncogenic allele was duplicated.  Unfortunately, LOH regions tend to be kilobases or megabases in size, containing dozens or hundreds of genes, and identifying which ones are truly affected in terms of cancer remains challenging.  We see a lot of LOH in cancer, but sadly, it never seems to get anyone excited.

    Structural and Copy Number Variation

    Image Credit: Wikipedia


    Last and most difficult to characterize are the sub-microscopic structural changes – insertions, deletions, inversions, translocations, duplications, etc. – that often occur in tumor genomes. These tend to be large, complex events that are tough to infer from NGS data.  We run Ken Chen’s breakDancer, of course, and it predicts numerous SVs.  But how do you validate a massive, complex variant spanning thousands of bases? We do our best with PCR and 3730/454 sequencing, but until read lengths get really really long (perhaps on single-molecule sequencing), validating such events and determining their breakpoints is tough.

    There are well-characterized, recurrent copy number alterations in cancer, like EGFR amplification on chromosome 7.  Here’s my question: where are all of those extra copies? Are they just tandem duplications of part of a chromosome, or are they duplications that get inserted elsewhere in the genome?  In the absence of a complete, linear, high-confidence genome, I’m not sure we can tell.

    Fruits of Our Labors

    It occurs to me that this is a bit of a negative article – focusing entirely on the challenges and failures, without highlighting the successes.  And there are many successes.  Every cancer genome tells us something, and every new piece of knowledge goes into our arsenal in the war against cancer.  As sequencing ramps up, we’ll see exponential growth in the number of known somatic mutations across a wide array of cancers. With the help of cancer biologists, these data will be leveraged to better understand the genes, proteins, and pathways underlying tumorigenesis.  Greater understanding will undoubtedly improve the detection, diagnosis, prognosis, and treatment of cancer patients.


    Capture and Illumina Sequencing of Human Exomes

    September 24th, 2009

    This month in Nature, a group from Jay Shendure’s lab reported perhaps the most ambitious targeted resequencing study to date – the whole exome sequences of 12 individuals.

    Targeted capture and massively parallel sequencing of human exomes

    Using an array-based hybridization capture method (2 microarrays, 10 µg of input DNA), Ng et al. selectively targeted CCDS regions totaling 26.6 Mb of sequence (~0.83% of the human genome). Capture specificity was similar to that of other published methods (35-55% of reads mapping to targets), but the completeness was astonishing – on average, 99.7% of target bases covered at least once and 96.3% covered at 8x with q>=30.

    By focusing on coding exons, the authors achieved 51x coverage (on average) with just 6.4 Gb of mappable sequence per individual.  Illumina 76-bp single-end sequencing was the platform of choice.  If I make some rough empirical estimates of mapping rate and reads per lane, they generated a single Illumina run of data (7-8 lanes) per individual.  Compared to whole-genome sequencing, the authors claim a 20-fold reduction in the amount of sequence required.  I’d say this estimate is pretty close.  Our second leukemia genome, which had 23x haploid coverage, took 16.5 Illumina runs to complete.

    Strong Illumina Pipeline

    It’s not simply the technological feat that impressed me about this study.  The presentation of the work and underlying analytical approaches are just outstanding.  While reading through the methods, I couldn’t help but think that nearly every step the authors took in processing their data was something that we’ve implemented here – Maq alignment, start site de-duplication, mining Maq-unplaced reads for indels, etc.  We have a bit of a friendly rivalry with University of Washington (since we are, after all, Washington University), so I looked for weak points.  Try as I might, I couldn’t find much to criticize about the analysis.  When it comes to Illumina sequencing, UW seems to know what they’re doing.

    How to Write A Nature Paper

    And the paper itself is just clear, concise, well-written – everything I’d expect from a Nature publication.  Take Figure 1, for example.  Figure 1, in general, is the focal point of most research papers, and for that reason I think many authors try to cram way too much into it.  Not this time.  Four histograms that all have “Number of observations of minor allele” as their X-axis.  Yet each one tells a different story: (a) how novel-to-dbSNP variants were rare; (b) how nonsynonymous variant frequencies are shifted to lower values relative to those of synonymous variants; (c) how this shift in allele frequencies is more pronounced for damaging nsSNPs, consistent with natural selection; and (d) how the sizes of observed indels are enriched for non-frameshift events divisible by 3.

    Illumina Sequencing and Deduplication

    Early into our days of Illumina/Solexa sequencing, we observed a strange phenomenon in the data: lots of reads with identical start sites and orientations.  The theory was that these occasional pileups were PCR-related, and each one arose from a single molecule that somehow was sequenced over and over again.  Since just about every downstream analysis (coverage, mutation detection, etc.) relies on unbiased read counts, it’s important to normalize for such events.  This requires a “de-duplication” step in which multiple reads with the same start site and orientation (presumably the same molecule) are discarded and only one is kept.
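
    As a minimal sketch of that step for single-end reads, the Python below keeps one read per (chromosome, start site, orientation). The tuple layout and the keep-the-first-read rule are assumptions made for brevity; a production pipeline would more likely keep the best-quality read from each group.

```python
def deduplicate(reads):
    """Keep the first read seen for each (chrom, start, strand) combination.

    reads: iterable of (chrom, start, strand, name) tuples.
    """
    seen = set()
    kept = []
    for chrom, start, strand, name in reads:
        key = (chrom, start, strand)
        if key not in seen:
            seen.add(key)
            kept.append((chrom, start, strand, name))
    return kept
```

    Reads that share a start site but lie on opposite strands are treated as distinct molecules, so both survive.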

    Credit: Nature 461:272-276 (2009)


    The implication of this deduplication requirement, as pointed out by Ng et al, is that the maximum read depth for any given position in the genome is twice the read length for single-end libraries.  In their case, 152x.  One might be concerned that even with de-duplication there would be substantial bias in targeted capture.  But look at the bell curve of the coverage distribution from supplemental figure 1 (left).

    Someone had better call O’Reilly, because that’s just beautiful data.  Importantly, the deduplication paradigm changes somewhat for paired-end sequencing, which is largely what we do here.  With paired ends, you have two reads from each molecule, each with a start site and orientation.  So the maximum coverage immediately jumps to 4 times the read length.  Furthermore, due to the variation in fragment sizes of sheared DNA, insert sizes add further distinction for different molecules, allowing for read depths of 1000x or more after de-duplication for paired-end reads.
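
    The single-end ceiling is easy to check by brute force: a read of length L covers a position only if it starts at one of L sites, on one of two strands. The Python sketch below counts those keys, and then shows a paired-end key in which the mate's start site also participates. Function names and the key layout are my own assumptions for illustration.

```python
def max_depth_single_end(read_length):
    """Ceiling on depth after start-site de-duplication: L starts x 2 strands."""
    return 2 * read_length


def brute_force_single_end(read_length, position=1000):
    """Enumerate every distinct (start, strand) key that covers `position`."""
    keys = set()
    for start in range(position - read_length + 1, position + 1):
        for strand in "+-":
            keys.add((start, strand))
    return len(keys)


def deduplicate_pairs(pairs):
    """Paired-end de-duplication keyed on both reads of a fragment.

    pairs: iterable of (chrom, start1, strand1, start2, strand2). Fragments
    that share a read-1 start but differ in mate position (that is, in insert
    size) are distinct molecules and are all kept.
    """
    return sorted(set(pairs))
```

    For 76-bp reads both single-end functions give 152, matching the figure quoted from Ng et al. With pairs, every additional insert size seen at a given start site adds another distinct key, which is why the practical ceiling climbs so far beyond the single-end case.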

    Identifying Disease-Causing Mutations

    What pleased me most about this study is that the authors didn’t just present exome capture and sequencing of “undiseased” individuals.  In addition to 8 HapMap samples, they included four samples from unrelated individuals with Freeman–Sheldon syndrome (FSS), an autosomal-dominant disorder caused by mutations in MYH3.  After collecting the set of coding variants in each individual, the authors asked a simple question: could we have pinpointed the disease gene from mutation data? With the knowledge in hand that this was a monogenic, autosomal-dominant disorder, the authors assumed that the same gene might be mutated in most (or all) samples.  And since the disease itself is uncommon, the authors inferred that common variants could be excluded. So, with the full set of mutations for each affected individual in hand, the authors looked for genes where:

    1. There was at least one (but not necessarily the same) nonsynonymous SNP, splice-site SNP, or coding indel in all four samples.
    2. The mutations were novel; that is, they weren’t found in dbSNP or the other 8 HapMap samples.
    3. The mutations were predicted to damage the encoded protein.
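
    The three criteria above amount to a per-sample filter followed by an intersection across samples. Here is a hedged Python sketch of that idea; the record fields (`gene`, `id`, `effect`, `damaging`), the effect labels, and the function name are hypothetical and do not come from the paper's methods.

```python
CODING_EFFECTS = {"nonsynonymous", "splice_site", "coding_indel"}


def candidate_genes(variants_by_sample, known_ids, require_damaging=True):
    """Return genes with a qualifying variant in every affected sample.

    variants_by_sample: {sample: [{"gene", "id", "effect", "damaging"}, ...]}
    known_ids: variant IDs seen in dbSNP or in the unaffected control samples.
    """
    per_sample = []
    for variants in variants_by_sample.values():
        genes = set()
        for v in variants:
            if v["effect"] not in CODING_EFFECTS:
                continue  # criterion 1: protein-altering variant classes only
            if v["id"] in known_ids:
                continue  # criterion 2: novel variants only
            if require_damaging and not v["damaging"]:
                continue  # criterion 3: predicted to damage the protein
            genes.add(v["gene"])
        per_sample.append(genes)
    # The variant need not be the same in each sample, only the gene.
    return set.intersection(*per_sample) if per_sample else set()
```

    Because the intersection is taken on gene names rather than variant IDs, two patients with different mutations in the same gene still count as a hit, which is what criterion 1 asks for.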

    When these criteria were applied, the authors whittled down a list of 4,510 genes with mutations in at least one sample to just 1, and that gene was MYH3.  Thus, whole-exome sequencing allowed for direct identification of a disease-causing gene with just a few samples from affected individuals.  Granted, the authors got lucky.  The causal mutations might have been SVs, or missed by variant callers, or not covered sufficiently by sequence data.  Or, the disorder might be caused by a single mutation in one of several genes, as is the case of autosomal dominant RP, a monogenic disorder for which at least 16 genes have been implicated.

    Even so, the authors applied a relatively straightforward approach and got the right answer.  With whole-exome sequencing capability within reach, finding the genes behind autosomal disorders is only a matter of time.

    References
    Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, & Shendure J (2009). Targeted capture and massively parallel sequencing of 12 human exomes. Nature, 461 (7261), 272-6 PMID: 19684571


    Second Cancer Genome in New England Journal

    August 6th, 2009

    Today our group published the second cancer genome, AML2, in the New England Journal of Medicine. In this study, we sequenced the complete genomes of tumor cells and matched normal (skin) cells from a patient with cytogenetically normal de novo FAB M1 AML.  This is an exciting publication for many reasons, the foremost of which may be the venue: with an impact factor of 52.59, the NEJM is almost certainly the most widely read biomedical journal in the world.


    Diagnosed with Leukemia: It Could Happen to You

    The story begins three years ago, with a previously healthy 38-year-old man of European ancestry who went to his doctor complaining of fatigue and a persistent cough.  After finding an elevated white blood cell count, his physician ordered a bone marrow biopsy, which revealed 90% cellularity and 86% blasts.  Diagnosis: Leukemia.

    The patient underwent ten days of chemotherapy with cytarabine (7 days) followed by daunorubicin (3 days).  Five weeks later he’d obtained morphologically complete remission and recovered counts.  Now, three years later, he remains in complete remission.  According to my conversations with an oncologist, this kind of happy ending is not very common with leukemia.  Most leukemia patients are diagnosed at an advanced age, and don’t do as well.


    Acute myelogenous leukemia cells. Credit: Univ. of Virginia

    Moving Beyond Cytogenetics

    At the time of his diagnosis, routine cytogenetic analysis of the patient’s tumor cells showed a normal 46,XY karyotype.  Bone marrow and skin samples were banked with informed consent for whole genome sequencing in accordance with our IRB.  There was no family history of leukemia, though the patient’s mother had developed breast cancer and later non-Hodgkin lymphoma.  Her half sister had also developed breast cancer.  The field for discovery of mutations underlying this AML was wide open.

    Whole Genome Sequencing with Illumina

    We sequenced the genomes of tumor cells and matched normal (skin) cells to high depth (23.3x and 21.3x, respectively) on the Illumina/Solexa platform.  The tumor sample required just 16.5 runs (most of which were 2×75 PE) to reach 98% diploid coverage. That’s a dramatic improvement over our first cancer genome, AML1, which took 98 runs (36 bp SE) to achieve 91% diploid coverage.  At current rates, we really can sequence a genome a week.  As any bioinformatician knows, however, the analysis usually takes a bit longer.

    Dave Larson in my group really deserves the credit for the whole genome variant detection pipeline applied to AML2.  With direction from Elaine Mardis, Rick Wilson, Tim Ley, and others, Dave created a pipeline for automated variant calling, somatic scoring, and tiered classification of variants for cancer genomes (see Figure 1 of the paper).  We identified 3.87 million single nucleotide variants (SNVs) in the tumor genome, of which 97.5% were in the skin genome and another 1.7% were previously described (i.e. dbSNP).  That left 20,256 putative somatic variants which we classified as follows:

    • Tier 1 variants were coding variants that alter amino acid sequences, like nonsynonymous, nonstop, and splice-site mutations.
    • Tier 2 variants were variants in evolutionarily conserved or regulatory-potential sequences of the genome.
    • Tier 3 were the remaining variants that were in non-repetitive regions of the genome.
    • Tier 4 were the remaining variants that were in repetitive regions of the genome.
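
    The tiers above form a simple priority cascade, which can be sketched in Python as follows. The field names (`effect`, `conserved`, `regulatory`, `repetitive`) and effect labels are assumptions for illustration; the annotation sources behind each flag are described in the paper, not here.

```python
# Effect classes named in the Tier 1 description; the real list may be longer.
TIER1_EFFECTS = {"nonsynonymous", "nonstop", "splice_site"}


def assign_tier(variant):
    """Return the tier (1-4) for a putative somatic variant.

    variant: dict with keys "effect" (str), and booleans "conserved",
    "regulatory", and "repetitive". Tiers are checked in priority order,
    so a coding variant inside a repeat is still Tier 1.
    """
    if variant["effect"] in TIER1_EFFECTS:
        return 1
    if variant["conserved"] or variant["regulatory"]:
        return 2
    if not variant["repetitive"]:
        return 3
    return 4
```

    Checking the tiers in order is what makes the last two definitions ("the remaining variants") work: a variant only reaches the repeat test if it has already failed the coding and conservation tests.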

    Validation and Deep 454 Read Counts

    We used 3730 sequencing to validate somatic variants in Tiers 1 and 2.  Some 62 mutations were validated, of which 10 were tier 1 (amino acid-altering) mutations.  Additionally, we validated two somatic indels, one of which (NPM1) was previously described; the other was an insertion in the CEP170 gene predicted to add a leucine residue to the encoded protein.

    In the absence of true functional validation, there are at least two approaches to evaluating whether or not a somatic mutation is a driver – a mutation that confers some advantage to drive tumor development – or a passenger – a background mutation that’s just along for the ride.  First, driver mutations should be present in most tumor cells, since the dominant clone will be the most “fit” in the tumor population.  To assess mutation frequencies in our patient’s tumor cells, we applied 454 sequencing of mutation-containing amplicons in the tumor DNA, tumor cDNA, and skin DNA.  Deep read counts for somatic events on the X and Y chromosomes showed allele frequencies of around 98%, consistent with the fact that nearly all cells in the bone marrow sample were part of the malignant clone.  For the rest of the somatic mutations, variant frequencies hovered near the 50% mark (as expected) with a few exceptions.  The CEP170 indel had a reduced (~35%) frequency in tumor DNA, suggesting that perhaps it’s not a driver mutation.
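
    The "as expected" figures follow from a simple mixture model, sketched below in Python. This is a textbook purity and copy-number formula, not something taken from the paper, and the 0.98 purity is my assumption read off the X/Y allele frequencies.

```python
def variant_allele_frequency(variant_reads, total_reads):
    """Observed fraction of reads supporting the variant."""
    return variant_reads / total_reads


def expected_vaf(purity, mutant_copies=1, tumor_copies=2, normal_copies=2):
    """Expected variant allele frequency for a mutation present in every tumor cell.

    purity: fraction of cells in the sample that belong to the tumor clone.
    mutant_copies: copies of the mutant allele per tumor cell.
    tumor_copies / normal_copies: total copies of the locus per cell.
    """
    mutant = purity * mutant_copies
    total = purity * tumor_copies + (1 - purity) * normal_copies
    return mutant / total
```

    With 98% purity, a heterozygous autosomal mutation is expected at `expected_vaf(0.98)`, about 0.49, while a mutation on the single X or Y of a male patient is expected at `expected_vaf(0.98, 1, 1, 1)`, about 0.98. Under this model, a frequency near 35% falls short of what a heterozygous mutation in the dominant clone would give, which fits the reading that the CEP170 indel is present in only a subset of tumor cells.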

    Recurrence of Mutations in Other AMLs

    The other measure of importance of a somatic mutation is recurrence in other tumors of the same type.  Thus, we screened for the presence of validated somatic mutations in a panel of 187 additional leukemia patients to see if any were recurrent.  Most, unfortunately, were not.  However, two variants were found in other samples, suggesting an important role in the development of AML.  One was a noncoding conserved mutation (tier 2) on chromosome 10 which was detected in one other sample.  Recurrence in just one other sample might not seem impressive, but by our estimation, the odds of such an event happening by chance are 1.1 x 10^-9.  Thus, we may have uncovered a noncoding functional mutation that contributes to carcinogenesis via an as-yet-unknown mechanism.

    The other was a nonsynonymous (tier 1) mutation in IDH1 at residue 132.  Sixteen of 187 other leukemia samples carried mutations at the same residue in IDH1, suggesting an important role for this gene in the development of AML. Somatic mutations in IDH1 were recently characterized in glioblastoma (GBM) by our friends at Johns Hopkins, but this is the first time that IDH1 mutations were described in AML.

    Conclusions: Lots of Passengers, Not Many Drivers

    After sequencing the complete tumor genomes of two AML patients, we estimate that these cancers carry about 750 somatic events.  Most such events will be background passenger mutations, acquired in the progenitor tumor cell before it became cancerous.  Admittedly, that means there’s much more work to do to fully characterize the sequence changes underlying development of AML and other cancers.  Our group is eager for the challenge.  With the ever-growing throughput of the Illumina platform and our automated pipelines for whole-cancer-genome analysis, we hope to sequence at least a hundred more cancers in the coming year.

    References

    Mardis, E., Ding, L., Dooling, D., Larson, D., McLellan, M., Chen, K., Koboldt, D., Fulton, R., Delehaunty, K., McGrath, S., Fulton, L., Locke, D., Magrini, V., Abbott, R., Vickery, T., Reed, J., Robinson, J., Wylie, T., Smith, S., Carmichael, L., Eldred, J., Harris, C., Walker, J., Peck, J., Du, F., Dukes, A., Sanderson, G., Brummett, A., Clark, E., McMichael, J., Meyer, R., Schindler, J., Pohl, C., Wallis, J., Shi, X., Lin, L., Schmidt, H., Tang, Y., Haipek, C., Wiechert, M., Ivy, J., Kalicki, J., Elliott, G., Ries, R., Payton, J., Westervelt, P., Tomasson, M., Watson, M., Baty, J., Heath, S., Shannon, W., Nagarajan, R., Link, D., Walter, M., Graubert, T., DiPersio, J., Wilson, R., & Ley, T. (2009). Recurring Mutations Found by Sequencing an Acute Myeloid Leukemia Genome. New England Journal of Medicine. DOI: 10.1056/NEJMoa0903840


    ABI SOLiD Joins the WGS Party

    July 1st, 2009

    At last published in early access at Genome Research is the whole-genome sequencing of a Yoruban male on ABI SOLiD technology.  A year ago, this might have merited a Nature or Science publication.  That window seems to have closed for whole-genome sequencing of a single, undiseased individual.  By my count, this is the sixth published individual genome sequenced on next-gen platforms.  I begin to wonder if this ABI SOLiD paper is too little, too late.


    Well, it’s probably not too little.  The advance access PDF is over 60 pages, and I must admit that the authors did a substantial amount of work to identify, characterize, and discuss the sequence variation in this genome.  Despite a relatively modest coverage level (18x), the combination of paired-end sequencing and two-base encoding made it possible to simultaneously detect SNPs, small indels (3-11 bp), large indels (30 bp-97 kbp), and structural variants.

    Two-Base Encoding in Colorspace for Calling SNPs

    My central interest, however, is how much the two-base encoding helps distinguish SNPs from sequencing errors.  The ABI SOLiD study identified ~3.8 million SNPs in the genome, compared to 4.1 million SNPs identified by Illumina sequencing of the same individual, an anonymous African male from the HapMap collection.  However, the ABI study did it with less than half the coverage (18x compared to 40x), and called a greater fraction of novel-to-dbSNP SNPs (19% compared to 12.7%).  Experimental validation confirmed 280 of 299 (94%) novel SNPs, suggesting that most of these variants are real.
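
    For readers who haven't worked in colorspace, here is a small Python sketch of why two-base encoding helps. It uses the commonly described scheme in which each color is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3); the function names are mine, and this is an illustration rather than ABI's software.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode(seq):
    """Translate a base sequence into colors, one per adjacent base pair."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]


def decode(first_base, colors):
    """Rebuild the base sequence from a known first base and a color string."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


def color_mismatches(ref_colors, read_colors):
    """Indices at which the read's colors differ from the reference's."""
    return [i for i, (r, q) in enumerate(zip(ref_colors, read_colors)) if r != q]
```

    Take the reference `ACGGTCA` and a read carrying a G-to-T SNP at the fourth base, `ACGTTCA`. Their color strings differ at two adjacent positions (indices 2 and 3), because each base participates in two colors. A single miscalled color, by contrast, changes only one position, and decoding it naively would alter every base downstream, so an isolated color mismatch can be treated as a likely sequencing error rather than a SNP.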

    The authors performed a rather elegant comparison with HapMap data for this individual, by comparing not only SNP genotypes but the phase of the genotypes, which they inferred on the basis of mate pair information.  Some 21.74% of HapMap-phased heterozygotes were covered by at least one ABI read pair, and the phase agreement was 98.95%.  Thus, the read-pairing strategy employed by ABI can serve to produce more accurate and complete haplotyping of the sequenced individual.  I find this side-benefit of whole-genome sequencing to be very valuable, given the huge amount of money and effort spent to build the human haplotype map.

    Lots of Indels and Structural Variants

    Perhaps the greatest strength of this study is that it represents, to my knowledge, the most extensive and detailed effort to characterize indels/SVs from WGS of a single individual.  Small intra-read indels (<=13 bp) had a high dbSNP concordance (67%), perhaps helped by the terminating chemistry and two-base encoding of ABI SOLiD.  Using mate pair information to identify discordant insert clones, the authors called 1,515 insertions (30-1,287 bp in size) and 4,075 deletions (86-96,957 bp in size), many of which were also detected in the Venter, Watson, and YH genomes.

    Cross-WGS Comparisons: Key Illumina Study Ignored

    In a direct comparison, 20% of the SNPs identified in the ABI study were also seen in the Watson, Venter, and YH genomes.  Fewer structural variants were shared between genomes, but this very well may be related to the difficulty in calling such types of variation on different platforms, rather than true biological diversity.  Here’s something I find both irritating and amusing.  The ABI study authors made no comparisons whatsoever to the results from the Bentley et al. (Illumina WGS) study, which is surprising since BOTH STUDIES SEQUENCED THE SAME INDIVIDUAL. I refer you to:

    “We sequenced the genome of a male Yoruba from Ibadan, Nigeria (YRI, sample NA18507).” [Bentley et al]
    “We compared the SNPs and structural variations identified in NA18507 to those found in the Venter (Levy et al. 2007), Watson (Wheeler et al. 2008) and YH (Wang et al. 2008) genomes.” [McKernan et al].

    I’m sorry, but when you do whole-genome sequencing on an individual that’s been sequenced already on a different technology, you have to do that comparison.  Whatever their reasons, the ABI study authors’ decision to blatantly avoid comparisons with Bentley et al results is outright negligence.

    Functional Consequences of Genetic Variation

    The authors embarked on a long exploration of the putative phenotypic impact of variants in NA18507 using OMIM and HGMD databases along with a comprehensive literature review.  They developed a pipeline to map the poorly-formatted OMIM entries to genomic coordinates, and successfully obtained 9,239 uniquely mapped nonsynonymous OMIM variants.  I’d hoped for a supplemental table of these, or better yet that the results might be shared back with OMIM, but alas.  No dice.  NA18507 is apparently a carrier for over 50 disease-associated alleles, including five which appear to be homozygous.  These are all listed in supplemental tables 4 and 5; however, no supplemental data appears to be available at present.

    There were 2,477 large indels in NA18507 that potentially disrupted genes.  Among 2,015 genes affected, some 303 were disease-associated genes from OMIM, HGMD, or the literature review.  The authors conclude “we can see a trend for disruption events to cluster around genes, but no clear preference to cluster around disease genes.  Further analysis of these disruption events along with an evaluation of whether an exon is disrupted is warranted.”  This is why individual HapMap genomes no longer merit Nature papers. Without a phenotype to study, “further investigation is warranted” is as far as such studies can go to assess the functional impact of many mutations.

    Signatures of Natural Selection

    All gripes aside, the study did provide evidence of purifying selection, notably an under-representation of damaging nsSNPs, and an under-representation of variation inside exons in general.  Using the Panther database, the authors identified several protein families with evidence of purifying selection (fewer than expected damaging nsSNPs) – nucleic acid binding proteins, ligases, transferases, transcription factors, and of course kinases.  There were also categories over-represented for damaging nsSNPs, which may reflect either higher mutation rates or positive selection.  These included G-protein coupled receptors, extracellular matrix glycoproteins, cell adhesion molecules, as well as genes related to olfactory perception.  Ah yes, sense-of-smell diversity.

    The Outlook for ABI SOLiD

    With a high-profile publication of an individual human genome, ABI SOLiD officially joins the ranks of WGS-enabling platforms.  In my opinion, they’re a little late to the game.  I recall seeing a poster presenting much of this data about a year ago, and even that was after Illumina had taken the lead in whole genome sequencing.  According to a report by Julia Karow on Genomeweb, SOLiD accounts for just 17% of next-gen sequencers at major genome centers, just ahead of Roche/454 (14%) but well behind Illumina, which claims 2/3 of the market.  ABI can’t compete with 454 on read length, and it can’t compete with Illumina on data throughput or market share.  In short, SOLiD needs to find a niche, and find it quickly, or this platform will go the way of the dodo.

    McKernan, K., Peckham, H., Costa, G., et al. (2009). Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two base encoding. Genome Research. DOI: 10.1101/gr.091868.109