I was in Rome for two days attending the Next Generation Sequencing Workshop organized by EMBRACE (EU FP6 NoE), UPPMAX and CASPUR with the support of the Italian Society of Bioinformatics. It was a pretty interesting event and I want to share a couple of interesting things I learned there.
The first day was devoted mainly to the hardware side of NGS. It started with a presentation by Tony Cox from the Sanger Institute, who described the hardware setup used to support their sequencing projects. At 400 gigabases a week (current output), Sanger’s IT infrastructure is stretched in every direction (capacity, availability, redundancy), and Tony pointed out that every sequencing laboratory is going to face similar issues sooner or later. His advice for such labs was to first estimate the number of bases produced and then use multipliers to assess the storage requirements for the project. A minor thing I noticed in his talk was exposing databases as a filesystem via a FUSE layer; I might use that approach in some projects too.
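To make the multiplier advice concrete, here is a minimal sketch of how such an estimate could be computed. The multiplier values and all names below are my own placeholders, not figures from Tony’s talk; they should be replaced with numbers measured on your own platform.

```python
# Rough storage estimate: number of bases times per-base multipliers.
# All multiplier values are illustrative assumptions, not measured figures.

GIGA = 10**9

# Hypothetical bytes of storage needed per sequenced base, per data class.
BYTES_PER_BASE = {
    "sequence_and_quality": 2.0,  # assumed: one byte per base, one per quality
    "alignments": 3.0,            # assumed
    "backup_copy": 2.0,           # assumed: one extra copy of sequence + quality
}


def storage_estimate_bytes(bases, multipliers=BYTES_PER_BASE):
    """Return estimated bytes per data class, plus a 'total' entry."""
    estimate = {name: bases * factor for name, factor in multipliers.items()}
    estimate["total"] = sum(estimate.values())
    return estimate


def weekly_to_yearly_terabytes(gigabases_per_week, multipliers=BYTES_PER_BASE, weeks=52):
    """Scale a weekly output in gigabases to a yearly storage need in terabytes."""
    yearly_bases = gigabases_per_week * GIGA * weeks
    return storage_estimate_bytes(yearly_bases, multipliers)["total"] / 10**12


if __name__ == "__main__":
    # 400 gigabases a week is the Sanger output quoted above.
    print(f"{weekly_to_yearly_terabytes(400):.1f} TB per year")
```

The point of keeping the multipliers in one table is that the estimate can be redone quickly whenever the sequencing output or the retention policy changes.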
George Magklaras from The Biotechnology Centre of Oslo described a number of approaches they took while implementing their infrastructure. He talked about FCoE (Fibre Channel over Ethernet) and pointed out that it’s cheaper than, and almost as efficient as, Fibre Channel alone. At the Centre they use Lustre, a high-performance networked file system (Sanger does too), but they benchmark other solutions as well, because some situations and projects require transparent and efficient data encryption (mostly medical data). Like Tony, George pointed out that compartmentalization of data is necessary, as moving large amounts of files over the network creates an unnecessary bottleneck.
Another interesting talk was from Guy Cochrane from EBI about the Sequence Read Archive. It was an overview of the project, but again with a few interesting tidbits that drew my attention. One of them was Aspera, a much faster (and at the same time secure) alternative to good old FTP. He also presented a data reduction strategy that, if I understood correctly, is not yet implemented at SRA but might be some day. The first point was deletion of intensity data, which is perfectly reasonable but heavily opposed by a number of scientists. The next step keeps only the consensus plus the second most frequent base (important for polymorphism studies). The minimum for long-term storage was proposed to consist only of sequence and quality data.
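As I understood the “consensus plus second most frequent base” idea, it could be sketched roughly like this. This is only my illustration of the principle, not SRA’s actual procedure; the function names and the alphabetical tie-breaking rule are my own assumptions.

```python
from collections import Counter


def reduce_column(bases):
    """Return (consensus, runner_up) for the bases observed at one position.

    runner_up is None when only one distinct base was observed.
    Ties are broken alphabetically (an assumption) so results are deterministic.
    """
    if not bases:
        raise ValueError("no bases observed at this position")
    counts = Counter(bases)
    # Most frequent first; alphabetical order among equally frequent bases.
    ranked = sorted(counts, key=lambda base: (-counts[base], base))
    consensus = ranked[0]
    runner_up = ranked[1] if len(ranked) > 1 else None
    return consensus, runner_up


def reduce_pileup(columns):
    """Apply reduce_column to every position of a pileup (one string per position)."""
    return [reduce_column(column) for column in columns]


if __name__ == "__main__":
    # Three positions covered by a handful of reads each (made-up data).
    print(reduce_pileup(["AAAAGGT", "CCCC", "TTTA"]))
```

Everything except the two most frequent bases per position is thrown away, which is where the storage saving would come from.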
The majority of the second day was devoted to software. It doesn’t make sense to list all the projects described, so I will share only my general impression.
Despite the large number of scientists devoting their time to developing new tools for next generation sequencing data, I think that software lags a little behind other technological advances in this area. With really large amounts of data, assembly becomes hard or impossible, mapping erroneous, and annotation too slow (the pilot study of the 1000 Genomes Project generated so much data that the computing farm was busy for a full 60 days; on a single CPU it would take 25 000 days). Software development for NGS differs dramatically from scientific software development in general and needs much, much better programmers than we usually are. For example, Desmond Higgins was praising open source software: they found an extremely fast implementation of the UPGMA algorithm (much faster than their own), and with it they could speed up their tool (SeedMap) so much that it now runs on even the largest families of sequences in a reasonable time.
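The two numbers from the 1000 Genomes pilot give a feel for the scale. A back-of-the-envelope sketch, assuming the work parallelizes perfectly (which it rarely does); the `efficiency` parameter is my own addition to model that:

```python
def effective_parallelism(single_cpu_days, wall_clock_days):
    """How many CPUs were effectively busy for the whole run."""
    return single_cpu_days / wall_clock_days


def wall_clock_days(single_cpu_days, cpus, efficiency=1.0):
    """Days needed on `cpus` CPUs; efficiency < 1 models imperfect scaling."""
    return single_cpu_days / (cpus * efficiency)


if __name__ == "__main__":
    # 25 000 single-CPU days done in 60 days of wall-clock time.
    print(round(effective_parallelism(25000, 60)))
    # The same job on a hypothetical 100-CPU cluster at 80% efficiency.
    print(wall_clock_days(25000, 100, efficiency=0.8))
```

In other words, the farm kept the equivalent of roughly 400 CPUs fully busy for two months, and a small departmental cluster would need most of a year for the same analysis.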
Another bottleneck was the data presentation layer: there are some attempts to make digging into the data easier, but getting a biologically meaningful overview is as hard as it was before. Other people pointed out that problem too (I wasn’t the only biologist there).
Need for stronger community
Probably the funniest part of the workshop was the discussion about creating an organized community of people working with next generation sequencing technologies. It was funny in the sense that consensus on the need for a community emerged quite fast. How to build it was another story. Lots of participants were sure that if they build a site, people will come. Yeah, sure. I suggested starting with a wiki and additionally hiring a community manager if they really want to gather people from many different forums, sites, groups etc. Lots of people didn’t buy these ideas and suggested a more traditional approach, so, curious whether they are right, I’m going to follow the development of this community.
NGS = high tech
Probably the most important lesson was realizing that sequencing is a field with very high requirements for infrastructure and even higher requirements for skilled staff. Basically every element of the infrastructure may become a bottleneck, and if you want to avoid that, the cost of data maintenance and analysis very quickly exceeds the cost of producing the data. When I talked about this with many people during the last year (I’m involved in some sequencing projects at the analysis/annotation step), they often felt I was overestimating infrastructure needs. Now I have some specific numbers to back it up.