Category Archives: bioinformatics

Complex systems and biology – introduction

What you can read here is a set of my loose notes on complex systems and biology. I want to learn about the topic as fast as I can, so if I'm wrong anywhere, please point it out to me. This post is an overview of the issues I'd like to cover.

Complex adaptive systems (CAS) are at the heart of many phenomena we observe every day, such as global trade, ecosystems, the human body, the immune system, the internet and even language. The complexity of a CAS is not equal to its amount of information; rather, it's an indication of the complex positive and negative interactions of its components. All CAS feature a common set of dualisms:

  • distinct/connected – CAS are built of a large number of agents that interact simultaneously and independently, yet together become a tightly regulated system (other names: individual/system or distributed/collective)
  • robust/sensitive – CAS are pretty robust, yet at the same time quite sensitive to initial conditions and to some signals (see the butterfly effect); both features are unpredictable
  • local/global – a protein is a CAS, a protein network is a CAS, a cell is a CAS, a tissue is a CAS, an organism is a CAS, a society is a CAS; the agents of a CAS can be CAS themselves
  • adaptive/evolving – a CAS is able to adapt as a system and usually its agents are also mutually adaptive, while at the same time the CAS is evolving; even if the local landscape prefers simpler solutions (adaptation), CAS usually evolve toward greater complexity

These dualisms are in some sense as artificial as wave-particle duality. A complex system has all these features at the same time – which of them is visible depends only on the design of an experiment. As a result, CAS present a common set of features: they are self-organizing, coherent, emergent and non-linear.

Probably the best representation of a CAS so far is a network, which has a number of important features: it is scale-free (the distribution of links in the network tends to follow a power law), clustered ("a friend of my friend is likely my friend too") and small-world-like (the diameter of the network is small, aka "six degrees of separation"). This representation has been applied with great success to biological complex systems, such as metabolic networks or protein-protein interaction networks. However, please remember that it's only a representation, and people have argued many times that scale-free networks may not be the best approximation of natural networks (see for example this recent paper).
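To make the scale-free idea a bit more concrete, here is a toy sketch (my own illustration, not code from any of the papers mentioned) of preferential attachment – the "rich get richer" growth process that produces power-law degree distributions:

```python
import random

def barabasi_albert(n, m, seed=0):
    """Grow a graph to n nodes; each new node links to m existing
    nodes chosen with probability proportional to their degree."""
    rng = random.Random(seed)
    # Start from a small complete core of m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # A node appears in `targets` once per unit of degree, so picking
    # uniformly from this list is degree-proportional attachment.
    targets = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(targets))
        for old in chosen:
            edges.append((new, old))
            targets.extend((new, old))
    return edges

edges = barabasi_albert(1000, 2)
degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1
avg = sum(degree.values()) / len(degree)
# A few hubs dominate: the maximum degree far exceeds the average,
# which is the signature of a heavy-tailed degree distribution.
print(max(degree.values()), round(avg, 2))
```

Run it and you'll see that while the average degree stays around 2m, a handful of early nodes accumulate dozens of links – exactly the hub structure seen in metabolic and protein-protein interaction networks.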

Scale-free or not, the network representation doesn't address all the dualities mentioned above, especially the last two. Naturally emerging levels of organisation and the relation between adaptation and evolution of complex systems are rarely studied from a biological point of view, probably because we don't have a clear idea how to reduce these phenomena to something measurable.

In the next posts, I will try to cover other CAS representations and computational approaches to CAS modeling.


Posted by on December 4, 2009 in bioinformatics


Science 2.0 in Poland – getting popular, recognized as important

A few days ago I had a chance to speak about Science 2.0 at the Institute of Biochemistry and Biophysics of the Polish Academy of Sciences (the one I'm affiliated with). Compared to the seminar on the same topic I gave at the same place (but for a much smaller audience) 4 years ago, I had many more stories to tell, way more real-life examples and a better idea of where the whole "2.0" meme is leading us. I also got better at speaking (4 years ago some of my colleagues literally slept through my seminar). So the message got clearer, and the messenger improved.

But given the wide interest in the topic from inside and outside the academic environment already before the seminar, I think two things have happened in Poland in the last 4 years. First, the internet got recognized as a game-changing technology, and people are simply interested in any new way they can use this tool (yes, I know it's 2009 – if you live on the nets it's hard to realize how slow the adoption rate is outside of virtual worlds). Second, the internet as a tool is also recognized as important – for example, people had ideas to include Science 2.0 topics in the program of PhD studies (I will follow up on this in a week or two). Getting popular, important… Now all we need is wide adoption :).


Posted by on November 28, 2009 in bioinformatics


Notes from Next Generation Sequencing Workshop in Rome

I was in Rome for two days attending the Next Generation Sequencing Workshop organized by EMBRACE (EU FP6 NoE), UPPMAX and CASPUR with the support of the Italian Society of Bioinformatics. It was a pretty interesting event and I want to share a couple of interesting things I learned there.

Hardware layer

The first day was devoted mainly to the hardware side of NGS. It started with a presentation from Tony Cox from the Sanger Institute, who described the hardware setup used to support their sequencing projects. At 400 gigabases a week (their current output), Sanger's IT infrastructure is stretched in every direction (capacity, availability, redundancy), and Tony pointed out that every sequencing laboratory is going to face similar issues sooner or later. His advice for such labs was to first estimate the number of bases produced and then use multipliers to assess the storage requirements for the project. A minor thing I noticed in his talk was exposing databases as a filesystem via a FUSE layer – I might use that approach in some projects too.
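The "bases first, then multipliers" advice can be sketched as a back-of-the-envelope calculation. The multiplier values below are my own illustrative guesses, not numbers from Tony's talk:

```python
def storage_estimate_tb(bases_per_week, weeks,
                        bytes_per_base=1.0,      # called sequence
                        quality_multiplier=2.0,  # per-base quality data
                        analysis_multiplier=3.0, # alignments, indexes
                        redundancy=2.0):         # backups / mirrors
    """Rough storage footprint, in terabytes, for a sequencing project."""
    raw = bases_per_week * weeks * bytes_per_base
    return raw * quality_multiplier * analysis_multiplier * redundancy / 1e12

# A year at Sanger-scale throughput (400 gigabases a week):
print(round(storage_estimate_tb(400e9, 52), 1))  # ~249.6 TB
```

Even with conservative multipliers, the raw base count turns into hundreds of terabytes per year – which is why every lab ends up facing the same capacity problems.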

George Magklaras from The Biotechnology Centre of Oslo described a number of approaches they took while implementing their infrastructure. He talked about FCoE, Fibre Channel over Ethernet, and pointed out that it's cheaper and almost as efficient as Fibre Channel alone. At the Centre they use Lustre (as does Sanger), a high-performance networked file system, but they benchmark other solutions too, because some situations/projects require transparent and efficient data encryption (mostly for medical data). Similarly to Tony, George pointed out that compartmentalization of data is necessary, as moving large numbers of files over the network creates an unnecessary bottleneck.

Another interesting talk was from Guy Cochrane from the EBI about the Sequence Read Archive. It was an overview of the project, but again with a few interesting tidbits that drew my attention. One of them was Aspera, a much faster (and at the same time secure) alternative to good old FTP. He also presented a data reduction strategy that, if I understood correctly, is not yet implemented over at the SRA, but might be some day in the future. The first step was deletion of intensity data – something perfectly reasonable, but heavily opposed by a number of scientists. Then, only the consensus is preserved, plus the second most frequent base (important for polymorphism studies). The minimum for long-term storage was proposed to consist only of sequence and quality data.
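If I understood the reduction scheme correctly, per alignment position it would keep only the consensus and the runner-up base. A hypothetical sketch of my reading of the idea (not actual SRA code):

```python
from collections import Counter

def reduce_column(bases):
    """Keep only the consensus base and the second most frequent base
    observed at one alignment position; everything else is discarded."""
    ranked = Counter(bases).most_common(2)
    consensus = ranked[0][0]
    second = ranked[1][0] if len(ranked) > 1 else None
    return consensus, second

# Ten reads covering one position: mostly A, some G (a possible SNP).
print(reduce_column("AAAAAAAGGG"))  # ('A', 'G')
```

The appeal is obvious: instead of storing every read (or worse, every intensity trace), you store two symbols per position, while still keeping enough signal for polymorphism studies.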


The majority of the second day was devoted to software. It doesn't make sense to list all the described projects – I will only share my general impression.

Despite the large number of scientists devoting their time to developing new tools for next generation sequencing data, I think the software lags a little behind the other technological advances in this area. With really large amounts of data, assembly becomes hard or impossible, mapping erroneous, annotation too slow (the pilot study of the 1000 Genomes Project generated so much data that the computing farm was busy for a full 60 days – on a single CPU it would take 25,000 days). Software development for NGS differs dramatically from scientific software development in general and needs much better programmers than we usually are. For example, Desmond Higgins was praising open source software – his group found an extremely fast implementation of the UPGMA algorithm (much faster than their own), and they could speed up their tool (SeedMap) so much that it now runs on even the largest families of sequences in a reasonable time.


Another bottleneck was the data presentation layer – there are some attempts to make digging into the data easier, but getting a biologically meaningful overview is as hard as it ever was. Other people pointed out that problem too (I wasn't the only biologist there).

Need for stronger community

Probably the funniest part of the workshop was the discussion about creating an organized community of people working with next generation sequencing technologies. It was funny in the sense that some consensus about the need for a community emerged quite fast. How to build it – that was another story. Obviously lots of participants were sure that if they build a site, people will come. Yeah, sure. 🙂 I suggested using a wiki in the first place, and additionally hiring a community manager if they really want to gather people from many different forums, sites, groups etc. Lots of people didn't buy these ideas, suggesting a more traditional approach, so, curious whether they were right, I'm going to follow the development of this community.

NGS = high tech

Probably the most important lesson was realizing that sequencing is a field with very high requirements for infrastructure and even higher requirements for skilled staff. Basically every element of the infrastructure may become a bottleneck, and if you want to avoid that, the cost of data maintenance and analysis very quickly exceeds the cost of producing the data. When I talked about this with many people during the last year (I'm involved in some sequencing projects at the analysis/annotation step), people often felt I was overestimating the infrastructure needs. Now I have some specific numbers to back it up :).


Posted by on November 21, 2009 in bioinformatics



All 2.0 – an attempt to connect disciplines

Last year I bought a domain name. Initially I had an idea to launch a huge portal around the "2.0" meme – essentially tracking changes in communication methods across various areas. I wanted to quit science and start a consulting career helping people communicate more efficiently (new channels and tools, efficient visual communication, etc.). However, the market for such services in Poland is nonexistent, and I wasn't in the mood for relocation, so I turned to other opportunities (and as a result, I stayed in science). Nevertheless, I still had the domain but no clear idea what to use it for.

So, with only a little time left, the next option I took was a tracker/aggregator. In theory, once done, it wouldn't need much maintenance. There are quite a lot of services for that purpose out there, but they didn't necessarily allow for certain things I wanted to have, so I had to code my own script. As I didn't have much time, the resulting site is a little rough (it cannot compete with the wonderful sites Euan is coding, such as the recently released preview of Streamosphere). However, you should get an idea of what I'm aiming for. Currently it tracks blog posts and conversations in the areas of Science 2.0, Health 2.0 and Culture 2.0 (with Enterprise and Government to follow). Because within these categories I sort all entries by date, I had to remove some bloggers from the "Key People" list, as their high-speed blogging didn't let anyone else appear in the box at all. 🙂

At this stage, the set of sources is far from perfect – outside of science, the conversations seem highly homogeneous. When I improve the sources (maybe I will use Twitter and custom FriendFeed searches), I plan to add some kind of visual summary of the tracked conversations to see if I can find patterns that will let me establish a connection between disciplines. Let's see…

While I was collecting links, I found one interesting thing: you can find people interested in these three areas both over at FriendFeed and over at Twine. However, it seems that only scientists are actively talking with each other on these services – where are the other groups storing their discussions?


Posted by on June 28, 2009 in bioinformatics


Open Science, what is your message?

It recently occurred to me that maybe Open Science could be marketed more efficiently by simplifying its messages and targeting them better. I often find it difficult to convince scientists to support the idea, because Open Science does not seem to solve their problems. Western scientists have one main problem: not enough money – the rest are just details (I will be happy to be proven wrong, but I constantly notice that the majority of scientists will happily play in the current academic system as long as there's enough money for their research). How about having the main message of the OS movement along the lines of "Open Science = Less Expensive Science" (something that Jean-Claude and Cameron have been saying for some time)? I know we don't have enough evidence to say so, but on the other hand nobody seems to care that there are better measurements of scientific productivity than the impact factor (and for that we do have some evidence).

Simple message – but also better targeting

Such a message is not going to resonate in places that have much more significant problems than lack of money. To me, there are several places in the world that suffer from a different issue – isolation. Thomas Erren, in his short commentary on Phil Bourne's "Ten Simple Rules for Getting Published", cites Rosalyn Yalow, a Nobel prize laureate:

… I am in full sympathy with rejecting papers from unknown authors working in unknown institutions. How does one know that the data are not fabricated? … on the average, the work of established investigators in good institutions is more likely to have had prior review from competent peers and associates even before reaching the journal.

And it’s just only one side of isolation – there are many more. So, maybe in such places the message of OS should be along the lines of “Open Science = Connected Science” (following one of Deepak’s blog themes), explaining that openness creates connection through which knowledge, experience and recognition can flow both ways?


Posted by on June 22, 2009 in bioinformatics


Dreaming about bio-spreadsheet

One of the frequently occurring tasks in my work is presenting the results of an analysis in some kind of table. I have used quite a number of approaches for this purpose, from generating a simple HTML file, through fetching SQL data into a table stored in a wiki, to using Rails. One of my recent dreams is a web-based spreadsheet that would allow me to apply a specific piece of code over every row/column and show the resulting table.

A simple mockup is shown above. In this example, the code:

print "<img src='{column_1}_bio_r_250.jpg'>"

… iterated over the first column containing PDB codes, would substitute those codes with images of the proteins from the PDB server.

In other words, I dream about a simple (a single file would be best – I like the approach the Sinatra framework takes) web-based programmable spreadsheet. Something like Resolver One, but simpler. Is there anything like that available?
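The core of the idea fits in a few lines – apply a user-supplied template to every row of a table. A minimal sketch, with all names made up for illustration:

```python
def apply_to_rows(rows, template):
    """Substitute {column_N} placeholders with each row's values."""
    out = []
    for row in rows:
        mapping = {"column_%d" % (i + 1): value
                   for i, value in enumerate(row)}
        out.append(template.format(**mapping))
    return out

# First column holds PDB codes; the template turns them into image tags.
rows = [("1abc", "protein A"), ("2xyz", "protein B")]
print(apply_to_rows(rows, '<img src="{column_1}_bio_r_250.jpg">'))
```

Of course, the dream spreadsheet would accept arbitrary code per column, not just string templates – but even this much, wrapped in a small web UI, would cover most of my table-generation needs.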


Posted by on May 19, 2009 in bioinformatics, Software


HMMER3 testing notes – my skills are (finally) becoming obsolete

Hidden Markov Model with Output (image via Wikipedia)

It's been quite a while since I started extensively testing the performance of HMMER3. As many other people have noticed, the speed of the search has improved dramatically – I'm really impressed by how fast it is. However, that's only part of the story. The smaller part, actually.

As some readers may know, most of my projects so far have revolved around protein sequence analysis and sequence-structure relationships. Mainly I was analyzing sequences that had no clear similarity to anything known, without functional annotation. The usual task was to run sequence comparison software and look at the end of the hit list, trying to make sense of hits beyond any reasonable E-value threshold (for example, I often run BLAST at an E-value of 100 or 1000). I use a very limited number of tools, because it takes quite a while to understand which specific patterns a particular piece of software fails on.

The high-end tool I use most often is HHpred – HMM-HMM comparison software. It's slow but very sensitive – my personal benchmarks show that it is able to identify very subtle sequence patterns, just above the level of similar secondary structure (in other words, from a set of equally dissimilar sequences with identical secondary structure order, it correctly identifies the ones with similar tertiary structure).

The most surprising thing about HMMER3 is that in my personal benchmarks it's almost as sensitive as HHpred. I wasn't expecting that HMM-sequence comparison could be as good as HMM-HMM. This observation suggests that there's still room for improvement in the latter approach; however, it already has big implications.

PFAM will soon migrate to HMMER3 (the PFAM team is now resolving overlaps between families that arose due to the increased sensitivity), and the moment it is available, it will make a huge number of publications obsolete, or simply wrong. There are thousands of articles that discuss in detail the evolutionary history of some particular domain (many of these will become obsolete) or draw conclusions from the observation that some domain is not present in an analyzed sequence/system (many of these will need to be revised). It will also make my own skills quite obsolete, but that is always to be expected, no matter which branch of science one works in. I also imagine that systems biology people will be very happy to have much better functional annotation of proteins.

I don't want to call the development of HMMER3 a revolution, but it will definitely have an impact on biology similar to that of BLAST and HMMER2. Not only because of its speed, but also because it will create a picture of the similarities between all proteins comparable to the picture state-of-the-art methods could previously calculate only for a small subset of them.


Posted by on April 22, 2009 in bioinformatics, Research, Software

