The loblolly pine genome is big. Bloated with retrotransposons and other repetitive sequences, it is seven times larger than the human genome and easily big enough to overwhelm standard genome assembly methods.
This forced the loblolly pine genome sequencing team, led by David Neale at the University of California, Davis, to look for ways to reduce the enormous complexity of their task. The draft genome sequence, described in the latest issue of GENETICS and the journal Genome Biology, was pieced together from over 16 billion sequence reads. Spanning around 23 billion base pairs, it only just beats out the Norway spruce as the largest genome ever sequenced, but it is substantially more complete. For example, the N50 scaffold size of the current loblolly assembly is 66.9 Kbp, compared to 0.72 Kbp in the Norway spruce.
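For readers unfamiliar with the N50 statistic quoted above, it is the scaffold length at which scaffolds of that size or longer account for at least half of the total assembly. The short sketch below shows the usual way of computing it. The function name `n50` and the sample lengths are illustrative only and are not taken from either assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length
    of the scaffold at which the running total first reaches half of the
    combined length of all scaffolds.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must contain at least one scaffold")


# Made-up scaffold lengths (not real data): the total is 20, and the two
# longest scaffolds (6 + 5 = 11) already cover half of it.
print(n50([2, 3, 4, 5, 6]))  # 5
```

A higher N50 means that more of the assembly sits in long, contiguous pieces, which is why the statistic is commonly used as a rough measure of how fragmented a draft genome is.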
So how did they do it? One strategy was to generate most of the sequence from part of a single pine nut. This tiny source material was the megagametophyte, the haploid tissue that provides nutrients to the developing diploid embryo. Although only a limited amount of DNA can be extracted from this source, the reduced complexity of a haploid genome makes it easier to assemble. To link up all the sequence fragments from the haploid genome, the team also created DNA libraries from diploid needles of the parent genotype.
But this still left the assembly team, led by Steven Salzberg at Johns Hopkins University and James Yorke at the University of Maryland, with more data than their computational methods could handle.
The solution was a method of pre-processing the data into “super reads”, or larger chunks of contiguous haploid sequence that condensed many individual reads. In essence, they were dealing with the unambiguous parts of the problem first, and getting rid of a huge amount of overlapping and redundant data in the process.
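To make the idea concrete, here is a toy sketch of how overlapping reads might be condensed into a single longer sequence. It is not the assembly team's software. It assumes one simple rule: a read is extended one base at a time for as long as the short words (k-mers) seen in the data allow exactly one possible next base. The function names, the toy reads and the choice of k = 4 are all hypothetical.

```python
BASES = "ACGT"


def kmer_set(reads, k):
    """Collect every k-letter word that occurs in any read."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def extend(seq, kmers, k, to_right):
    """Grow seq in one direction while exactly one extension is possible."""
    if len(seq) < k:
        return seq
    # Track k-mers already used so a repeat cannot cause an endless loop.
    seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    while True:
        if to_right:
            edge = seq[-(k - 1):]
            options = [edge + b for b in BASES if edge + b in kmers]
        else:
            edge = seq[:k - 1]
            options = [b + edge for b in BASES if b + edge in kmers]
        # Stop at a dead end, at a fork (ambiguity), or on revisiting a k-mer.
        if len(options) != 1 or options[0] in seen:
            return seq
        seen.add(options[0])
        seq = seq + options[0][-1] if to_right else options[0][0] + seq


def super_reads(reads, k):
    """Extend every read both ways and keep each distinct result once."""
    kmers = kmer_set(reads, k)
    return {extend(extend(r, kmers, k, True), kmers, k, False) for r in reads}


# Four overlapping toy reads (invented for illustration).
reads = ["ATGCGT", "GCGTAC", "GTACCT", "ACCTGA"]
print(super_reads(reads, 4))  # {'ATGCGTACCTGA'}
```

In this toy case all four reads grow into the same 12-base sequence, so four records collapse into one. Extension stops wherever the data offer more than one way forward, which is how the unambiguous parts of the problem get dealt with first.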