Zerbino on de novo genome assembly

This week we had a visit from Daniel Zerbino, of Velvet fame, who’s on a world tour of about a dozen institutions interested in hiring him. He gave a talk on de Bruijn graphs and Velvet’s assembly programs, and despite the intimidating subject matter it was a full house. There’s obviously a lot of interest in de novo sequence assembly.

De Bruijn Graphs

Zerbino gave a nice introduction to de Bruijn graphs; between that and the papers I’ve read, I can say that I almost understand the approach. The idea is to build a dictionary of all “words” (in this case, 4-bp sequences) present in the dataset (the reads). They do this by sliding a 4-bp window along every read, counting the number of times each word is seen. It becomes a graph because you track which words were adjacent in the reads as the window slides along. At least, that’s my understanding.
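To make the sliding-window idea concrete, here is a minimal Python sketch of it as I understand it. The function name `build_de_bruijn` and the two toy reads are my own inventions for illustration, not anything from Velvet. A real assembler would also use a much longer word length and handle reverse complements, which this sketch ignores.

```python
from collections import Counter, defaultdict

def build_de_bruijn(reads, k=4):
    """Count every k-bp word and record which word follows which within a read."""
    counts = Counter()              # word -> number of times it was seen
    edges = defaultdict(Counter)    # word -> {next word: times seen adjacent}
    for read in reads:
        # Slide a k-bp window along the read, one base at a time.
        words = [read[i:i + k] for i in range(len(read) - k + 1)]
        counts.update(words)
        # Consecutive windows overlap by k-1 bases; link them in the graph.
        for left, right in zip(words, words[1:]):
            edges[left][right] += 1
    return counts, edges

# Two hypothetical overlapping reads.
reads = ["ACGTAC", "CGTACG"]
counts, edges = build_de_bruijn(reads, k=4)

for word, n in sorted(counts.items()):
    print(word, n, dict(edges[word]))
```

Words shared by both reads (here CGTA and GTAC) end up with a count of 2, and that per-word count is the coverage information the later cleanup steps lean on.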

Velvet’s Assembly Algorithm

The Velvet assembler has four steps (that no doubt greatly oversimplify the underlying algorithms):

Build the initial de Bruijn graph using the read set
Simplify the graph, focusing on regions with higher coverage
Remove “tips” – low-frequency words at the end of branches that don’t connect to another word.
Remove “bubbles” – splits in a single branch that rejoin together – by selecting the side with more read coverage. This, as I understand it, is absolutely key to the algorithm because it uses the high sequence coverage of NGS technologies to correct for sequencing errors.
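The last two steps can be sketched on a toy graph. This is a deliberately simplified illustration under my own assumptions, not Velvet's actual implementation: the node labels, the `min_cov` threshold, and the helper names are all hypothetical, and the bubble check only handles bubbles that are a single node long on each side.

```python
def predecessors(edges, node):
    """Return every node that has an edge pointing at `node`."""
    return {src for src, dsts in edges.items() if node in dsts}

def remove_node(edges, coverage, node):
    """Delete a node along with all edges into and out of it."""
    edges.pop(node, None)
    for dsts in edges.values():
        dsts.discard(node)
    coverage.pop(node, None)

def clip_tips(edges, coverage, min_cov=2):
    """Repeatedly drop low-coverage nodes that dead-end on one side."""
    changed = True
    while changed:
        changed = False
        for node in list(coverage):
            dead_end = not edges.get(node) or not predecessors(edges, node)
            if dead_end and coverage[node] < min_cov:
                remove_node(edges, coverage, node)
                changed = True

def pop_bubbles(edges, coverage):
    """Collapse one-node bubbles, keeping the side with more coverage."""
    for node in list(edges):
        succs = list(edges.get(node, ()))
        for a in succs:
            for b in succs:
                if a < b and a in coverage and b in coverage:
                    # Two branches that rejoin at the same successor(s).
                    if edges.get(a) and edges.get(a) == edges.get(b):
                        loser = a if coverage[a] < coverage[b] else b
                        remove_node(edges, coverage, loser)

# Hypothetical graph: A -> B -> (C | Cx) -> D -> E, plus a stray tip B -> T.
edges = {"A": {"B"}, "B": {"C", "Cx", "T"}, "C": {"D"},
         "Cx": {"D"}, "D": {"E"}, "E": set(), "T": set()}
coverage = {"A": 10, "B": 10, "C": 9, "Cx": 1, "D": 10, "E": 10, "T": 1}

clip_tips(edges, coverage)
pop_bubbles(edges, coverage)
print(sorted(coverage))
```

In this toy case the low-coverage tip `T` and the low-coverage bubble branch `Cx` (think: a read with a sequencing error) are discarded, leaving a single path from `A` to `E`. The well-covered ends `A` and `E` survive tip clipping because only low-coverage dead ends are removed.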

On to the Simulated Data

I admit that it makes sense to first evaluate assembly algorithms (and aligners, for that matter) on simulated data where you already know the correct “answer” – the correct assembled sequence. What I liked about Zerbino’s talk is that they did four simulations, each with an organism of increasing complexity – E. coli, S. cerevisiae, C. elegans, and H. sapiens. Obviously, as the genome gets larger and more complex, the assembly becomes more difficult. Add errors and SNPs into the mix, and you’ve got even more problems. But the simulated data was useful because it showed how certain key steps in Velvet’s algorithms improve the resulting assemblies.

Targeted De Novo Assembly

One interesting anecdote that Zerbino gave was not from animals, but from plants. A group was sequencing wild strains of Arabidopsis (or was it rice?). It seems that a certain region of the genome (a 2-base segment) had no coverage in a certain strain despite being flanked by regions of high coverage. So they took the unmapped reads, along with the reads that mapped near the region, and put them into an assembly.

You know what they found? A 7-base insertion, which introduced a gap between reads and reference causing 2 bases to never get coverage. My guess is that 1-2 bases at the start and end of the insertion got coverage because most aligners allow 1-2 mismatches, but no gaps. It’s another observation supporting what many of us already know: when you use aligners that don’t allow gaps, the unmapped reads are enriched for gap-inducing sequences.

The Future of Assembly

Since my focus is targeted resequencing for variant discovery, I’m less interested in de novo assembly. However, there’s a large community of researchers sequencing new organisms with NGS platforms, and I can see an immediate and pressing need for assembly algorithms like Velvet. Unfortunately, once you reach a certain genome complexity, it simply won’t be possible to get the full genome assembly by computation alone. For that reason, Zerbino highlighted a few groups working on approaches to reduce genome complexity – by reduced representation libraries (Margulies), by combined remapping and de novo assembly (Cheetham et al., Erin Pleasance at Sanger), etc.

He’s clearly not only a pioneer, but also keeping tabs on others in the field. I must admit that Daniel Zerbino impressed me. He could take his skills to any industry, but wants to do something important, which is why he’s in genome research. Also, he speaks several languages – English (obviously), French (which I vetted; turns out he’s French, so I was actually out of my league), Portuguese (which I vetted; turns out he’s lived in Brazil, so I was again out of my league), among others. Quite an interesting fellow – we will be watching his career with great interest.