Software portability and virtual appliances


Bioinformatics can mean developing new algorithms for biological data analysis. Scientists who write and release such software often face the issue of making their programs portable. I see three clear solutions. First, one can spend a lot of time porting the source to other platforms (plus testing, fixing and yelling at incompatibilities). This is not easy even among Linux distributions (remember the broken HMMER binary packages on Debian and Ubuntu?), not to mention porting to OS X or Windows. What can we do? The second solution is to build a web interface around the software. This is extremely popular and makes almost everyone’s life easier. However, there are drawbacks: maintenance of the service (it costs money, and grant agencies are not willing to spend a dime on it) and batch-access requests from some users (there’s always somebody who wants to feed your software five million sequences or fifty thousand structures). The third solution addresses at least the second of these drawbacks: one can create a virtual machine with a proper environment for the developed software and release them together. Yes, release the software together with its whole environment. And it’s not as difficult as it seems.
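To make the batch-access drawback concrete: millions of sequences won’t fit through a web form, so either the user or the service ends up chunking the input. A minimal sketch of that chore in Python (the `batch_fasta` helper and the record counts are purely illustrative, not part of any released tool):

```python
def batch_fasta(text, batch_size):
    """Split multi-FASTA text into batches of at most batch_size records."""
    records, current = [], []
    for line in text.splitlines():
        # A new header line (">") closes the previous record, if any.
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current))
    # Group whole records into fixed-size batches for submission.
    return ["\n".join(records[i:i + batch_size])
            for i in range(0, len(records), batch_size)]

# Example: three records in batches of two yield two batches.
fasta = ">a\nACGT\n>b\nGGCC\n>c\nTTAA\n"
print(len(batch_fasta(fasta, 2)))  # 2
```

Scale the toy numbers up to millions of records and it is clear why a maintainer would rather hand out the whole environment than field such jobs through a web form.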

We now have computing clouds, internet companies that do not own a single server, and virtual appliances for quick installation of, say, a blog server with WordPress, without any knowledge of the software requirements. Virtual appliances, that is, complete virtual machines, can contain already-configured software (the most trivial example would be LAMP: Linux, Apache, MySQL and PHP). So far I have found only one such appliance for bioinformatics: it’s called DNALinux Virtual Desktop Edition and contains, among others, BLAST, EMBOSS, PyMOL, BioPerl and Biopython. Since VMware Server is free (although registration is required), this makes a pretty nice alternative for those on Windows machines, as it allows running a windowed Linux at roughly two-thirds the speed of a native system. VMware software can create a virtual machine out of a working system, but I wouldn’t recommend that, as we usually have much more software installed than is needed to run our own programs. Creating a virtual appliance for, say, BLAST would instead mean installing a fresh copy of our favourite Linux under VMware Server with nothing more than the necessary libraries, a copy of the BLAST executables and possibly a web interface. Voilà. A virtual appliance for BLAST, anybody?

While it may seem like overkill at first, I don’t think it is in the long run. Porting the software to other operating systems is only part of the story; maintaining it so it keeps working with newer versions of the libraries is another. Plenty of programs go unmaintained for long stretches. Two quick examples where the virtual-appliance approach would save software from oblivion: PovChem (rendering of molecules; it depends on some ancient libraries) and MACAW (it doesn’t work on anything but Mac OS 9, and the Windows version crashes the system). OK, MACAW may be unfair, as licensing of the operating system gets in the way there, but I believe any heavy software user has long lost count of how many times they had to skip a well-designed program because of its requirements.

Have a look and try. I’m already running two operating systems (goodbye, dual boot), and this is definitely the future for our desktops, which already have too much processing power. But honestly, I dream of the day when every bioinformatics algorithm and all biological data are available in some computing cloud, and running Taverna is a good alternative to all-day data munging.



Posted on November 27, 2007 in bioinformatics, Services, Software



9 responses to “Software portability and virtual appliances”

  1. nsaunders

    November 29, 2007 at 00:59

    I suppose the only comparable service is BioKnoppix – basically a live Knoppix CD with a bunch of bioinformatics apps.

    I’d also like to see the day when we all use workflows and clouds, but I suspect it’s still a long way off. Let’s face it, convincing many biologists to learn how to use a new computing resource of any kind is difficult. I’ve always felt that the most effective solution is for every academic department to employ a “Bio IT” specialist – someone who provides IT support but aimed specifically at bioinformatics. Give them a few cheap servers, let them install whatever local resources people need and make sure that people know what they offer and where to find them. Need a custom BioPerl script, a local BLAST database, a Pise interface to a software package? Go to the Bio IT person. Might even convince a few researchers to extend their computing skills. Any kind of “one size fits all” solution to Bio IT needs is sure to be deficient in some way to some person with some problem, IMHO.

  2. freesci

    November 29, 2007 at 18:06

    Good point, Neil – I forgot about liveCDs. Of course both solutions have their advantages and disadvantages…

    Concerning Bio IT – I had thought about a similar approach, but I couldn’t name it as well as you did. Hiring technical personnel is already a fact in experimental biology, so maybe it’s time to do the same for biology in silico? So far I haven’t heard of such a case…

  3. lazy

    November 29, 2007 at 22:17

    I like to use a bioinformatics platform that contains many essential bioinformatics applications, such as Biomatters’ Geneious Pro:

    Sequence alignment
    Phylogenetic tree building
    Primer design
    Motif and ORF finding
    Restriction analysis
    Sequence editing
    Interactive 3D molecular structure viewing

  4. hanif

    November 30, 2007 at 21:31

    Most of my time is spent on the boundary between genomic data analysis and “Bio IT” (munging data and building tools aplenty).

    I recently spent some time reading up on recent developments in distributed scientific computing – not quite the same issue as “getting everything to run everywhere”, but more like “being able to run lots of things faster more easily”.

    In that respect, there are a number of packages that seem similar to Taverna, such as KNIME, as well as running MapReduce-transformed algorithms on Amazon Web Services. The latter is basically a way of accessing a cloud cheaply without buying any hardware. Maybe there should be a community-driven project implementing EMBOSS in KNIME on AWS…

  5. Animesh Sharma

    December 5, 2007 at 06:19

    Like Hanif points out, I feel that doing analysis in such a high-dimension, low-sample space (e.g., microarrays) requires some serious computing power (multiple samplings, so as to have a high probability of reaching the global minimum), especially when one is prototyping on real data in a high-level language such as Perl (to save coding time). I am sure even the people who work on simple mammalian genome comparisons will admit that a laptop is not good enough and that we have to move in the MapReduce direction, and towards x-core clusters in general.
    Hope gconsole (google+gdrive+gcluster) comes into existence soon. Maybe then we can do brute-force searches as well 🙂

  6. Asif M. Khan

    December 6, 2007 at 21:00

    Our lab has constructed the APBioKnoppix software and set up a second-generation liveCD system, BioSLAX, to facilitate bioinformatics software usage. Both APBioKnoppix3, a remaster of the popular Knoppix (specifically version 5.01), and BioSLAX, a new liveCD suite of bioinformatics tools, have since been deployed for teaching and training (international conference tutorials/workshops, practical courses offered by the S* Life Science Informatics Alliance, and undergraduate courses offered by the National University of Singapore (>500 students a year)). Distribution of the liveCD systems (as bootable CDs or their ISO images, ~700 MB) is free to educational institutions and so far has been done via HTTP, FTP, BitTorrent p2p and physical CD. Given the bandwidth constraints in developing countries, distribution by BitTorrent p2p and physical CD has been the preferred option there.

    With regard to the “Bio IT” specialist, the server versions of our liveCD systems enable one to set up a fully operational bioinformatics node within half a day on a system with large-capacity hard disks (such as a system with two 500 GB hard disks, relatively cheap nowadays at ~USD 1,000). We recently set up such a server at the Institute of Biotechnology in Hanoi, Vietnam, as part of our workshop funded by UNESCO-IUBMB-FAOBMB. With these portable and scalable systems, it is now easy to set up bioinformatics labs fully equipped with LAMP, MediaWiki and more than 200 freely available bioinformatics software packages. New software packages can always be made available by remastering the liveCD systems.

    LiveCDs (or perhaps LiveThumbdrives in the future) may be the way to go forward 🙂

  7. freesci

    December 10, 2007 at 07:24

    Hanif, thanks for pointing me to KNIME – I wasn’t aware of it.

    Animesh – I don’t do large-scale analysis anymore, so my computing needs are pretty small. But of course I wouldn’t run a whole-genome analysis on my desktop computer anyway. Virtualization, from the end user’s point of view, removes some of the overhead of running software at all. If speed is an issue, maybe virtualization along the lines of the AWS/S3 services will become more popular? 🙂

    Asif, thanks for the information about these bio-liveCDs. As I said, both solutions have their advantages, and they complement each other. Can you comment on the scalability of the liveCD solution? Does it require remastering the CD?

  8. Timothy Edwards

    February 21, 2008 at 18:25

    I agree with nsaunders on BioKnoppix…I’m kinda new to this stuff but I’m definitely a fan.
