Category Archives: bioinformatics

Complex systems and biology – introduction

What you can read here is a set of my loose notes on complex systems and biology. I want to learn about the topic as fast as I can, so if I'm wrong anywhere, please point it out to me. This post is an overview and an indication of the issues I'd like to cover.

Complex adaptive systems (CAS) are at the heart of many phenomena we observe every day, such as global trade, ecosystems, the human body, the immune system, the internet and even language. The complexity of a CAS is not equal to its amount of information; rather, it's an indication of the complex positive and negative interactions of its components. All CAS feature a common set of dualisms:

  • distinct/connected – CAS are built of a large number of agents that interact simultaneously and independently, yet together form a tightly regulated system (other names: individual/system or distributed/collective)
  • robust/sensitive – CAS are pretty robust, yet at the same time quite sensitive to initial conditions and to some signals (see the butterfly effect); both features are unpredictable
  • local/global – a protein is a CAS, a protein network is a CAS, a cell is a CAS, a tissue is a CAS, an organism is a CAS, a society is a CAS; the agents of a CAS can be CAS themselves
  • adaptive/evolving – a CAS is able to adapt as a system and usually its agents are also mutually adaptive, and at the same time the CAS is evolving; even if the local landscape favors simpler solutions (adaptation), CAS usually evolve toward greater complexity

These dualisms are in some sense as artificial as wave–particle duality. A complex system has all these features at the same time – their visibility depends only on the design of an experiment. As a result, CAS present a common set of features: they are self-organizing, coherent, emergent and non-linear.
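Since I like toy models, here's a minimal sketch of the robust/sensitive dualism. The logistic map is fully deterministic, yet two trajectories started a millionth apart end up completely different (the parameters below are just illustrative choices, not anything measured):

```python
# Toy illustration of sensitivity to initial conditions (the "butterfly
# effect"): a deterministic system whose trajectories diverge quickly
# when the starting points differ only slightly.
def logistic_trajectory(x0, r=3.9, steps=50):
    """Iterate the logistic map x -> r*x*(1-x) from x0 for `steps` steps."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

a = logistic_trajectory(0.200000)
b = logistic_trajectory(0.200001)  # perturbed by one millionth
# Despite the tiny perturbation, the trajectories separate completely
# long before the 50th step.
print(max(abs(x - y) for x, y in zip(a, b)))
```

Of course a single chaotic map is not a CAS, but it shows how "sensitive to initial conditions" can coexist with "fully deterministic".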

Probably the best representation of a CAS found so far is a network, which has a number of important features: it is scale-free (the distribution of links in the network tends to follow a power law), clustered ("a friend of my friend is likely my friend too") and small-world-like (the diameter of the network is small, aka "six degrees of separation"). Such a representation has been applied with great success to biological complex systems, such as metabolic networks or protein–protein interaction networks. However, please remember that it's only a representation, and people have argued many times that scale-free networks may not be the best approximation of natural networks (see for example this recent paper).
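To make the "scale-free" point concrete, here is a quick sketch of preferential attachment, the classic mechanism behind such networks (the network size and hub threshold below are made-up illustration values):

```python
import random
from collections import Counter

# Preferential attachment (the Barabási–Albert mechanism): each new node
# links to an existing node with probability proportional to its current
# degree, so a few hubs accumulate many links while most nodes keep one
# or two.
def preferential_attachment(n_nodes, seed=42):
    rng = random.Random(seed)
    # Start with a single edge between nodes 0 and 1. Each edge
    # contributes both endpoints to this list, so sampling uniformly
    # from it is sampling proportional to degree.
    endpoints = [0, 1]
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        endpoints += [new, target]
    return Counter(endpoints)  # node -> degree

degrees = preferential_attachment(5000)
hubs = [node for node, deg in degrees.items() if deg >= 50]
print(f"{len(hubs)} hubs out of {len(degrees)} nodes")
```

Plotting the degree counts on a log–log scale gives the roughly straight line that the power-law claim refers to.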

Scale-free or not, the network representation doesn't address all the dualities mentioned above, especially the last two. Naturally emerging levels of organisation and the relation between adaptation and evolution of complex systems are rarely studied from a biological point of view, probably because we don't have a clear idea how to reduce these phenomena to something measurable.

In the next posts, I will try to cover other CAS representations and computational approaches to CAS modeling.


Posted by on December 4, 2009 in bioinformatics


Science 2.0 in Poland – getting popular, recognized as important

A few days ago I had a chance to speak about Science 2.0 at the Institute of Biochemistry and Biophysics of the Polish Academy of Sciences (the one I'm affiliated with). Compared to the seminar on the same topic I gave at the same place (but for a much smaller audience) 4 years ago, I had many more stories to tell, way more real-life examples and a better idea of where the whole "2.0" meme is leading us. I also got better at speaking (4 years ago some of my colleagues literally slept through my seminar). So the message got clearer, and the messenger improved.

But given the wide interest in the topic from inside and outside the academic environment already before the seminar, I think two things have happened in Poland in the last 4 years. First, the internet has been recognized as a game-changing technology, and people are simply interested in any new way they can use this tool (yes, I know it's 2009 – if you live on the net it's hard to realize how slow the adoption rate is outside of virtual worlds). Second, the internet as a tool is also recognized as important – for example, people have proposed including Science 2.0 topics in the program of PhD studies (I will follow up on this in a week or two). Getting popular, recognized as important… now wide adoption is all we need :).


Posted by on November 28, 2009 in bioinformatics


Notes from Next Generation Sequencing Workshop in Rome

I was in Rome for two days attending the Next Generation Sequencing Workshop organized by EMBRACE (EU FP6 NoE), UPPMAX and CASPUR with the support of the Italian Society of Bioinformatics. It was a pretty interesting event and I want to share a couple of interesting things I learned there.

Hardware layer

The first day was devoted mainly to the hardware side of NGS. It started with a presentation from Tony Cox of the Sanger Institute, who described the hardware setup used to support their sequencing projects. At 400 gigabases a week (their current output), the Sanger IT infrastructure is stretched in every direction (capacity, availability, redundancy), and Tony pointed out that every sequencing laboratory is going to face similar issues sooner or later. His advice for such labs was to first estimate the number of bases to be produced and then use multipliers to assess the storage requirements for the project. A minor thing I noticed in his talk was exposing databases as filesystems via a FUSE layer – I might use that approach in some projects too.
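Tony's back-of-the-envelope recipe might look something like this in code. Note that the multipliers below are my own illustrative guesses, not his actual figures:

```python
# Back-of-the-envelope storage estimate in the spirit of the advice from
# the talk: start from the number of bases produced, then apply
# multipliers for each derived data type. All multipliers here are
# illustrative assumptions only.
def estimate_storage_gb(gigabases,
                        bytes_per_base=1.0,       # raw called sequence
                        quality_multiplier=1.0,   # per-base quality scores
                        alignment_multiplier=2.0, # alignments / derived files
                        backup_multiplier=2.0):   # replicated copies
    raw_gb = gigabases * bytes_per_base
    total = raw_gb * (1 + quality_multiplier + alignment_multiplier)
    return total * backup_multiplier

# e.g. a lab producing 400 gigabases a week, the figure quoted for Sanger:
weekly_gb = estimate_storage_gb(400)
print(f"~{weekly_gb:.0f} GB of storage per week")  # ~3200 GB
```

Even with conservative multipliers, the storage bill grows several times faster than the raw base count, which is exactly the point of the advice.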

George Magklaras from The Biotechnology Centre of Oslo described a number of approaches they took while implementing their infrastructure. He talked about FCoE, Fibre Channel over Ethernet, and pointed out that it's cheaper and almost as efficient as Fibre Channel alone. At the Centre they use Lustre (Sanger does too), a high-performance networked file system, but they benchmark other solutions as well, because some situations/projects require transparent and efficient data encryption (mostly for medical data). Similarly to Tony, George pointed out that compartmentalization of data is necessary, as moving large numbers of files over the network creates an unnecessary bottleneck.

Another interesting talk was from Guy Cochrane of the EBI, about the Sequence Read Archive. It was an overview of the project, but again with a few interesting tidbits that drew my attention. One of them was Aspera, a much faster (and at the same time secure) alternative to good old FTP. He also presented a data reduction strategy that, if I understood correctly, is not yet implemented at the SRA, but might be some day in the future. The first step is deletion of intensity data – something perfectly reasonable, but heavily opposed by a number of scientists. Then, only the consensus is preserved, plus the second most frequent base (important for polymorphism studies). The minimum for long-term storage was proposed to consist of sequence and quality data only.
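As I understood it, the per-position reduction could be sketched like this (my own toy code, not an actual SRA format):

```python
from collections import Counter

# Sketch of the data-reduction idea described in the talk: for each
# alignment column, keep only the consensus base and the second most
# frequent base (useful for polymorphism studies), discarding the rest.
def reduce_column(bases):
    counts = Counter(bases).most_common(2)
    consensus = counts[0][0]
    second = counts[1][0] if len(counts) > 1 else None
    return consensus, second

column = ["A", "A", "A", "G", "A", "G", "C"]
print(reduce_column(column))  # ('A', 'G') – the rare 'C' read is dropped
```

Everything below the top two bases is thrown away, which is precisely why some scientists oppose the idea.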


The majority of the second day was devoted to software. It doesn't make sense to list all the projects described – I will share only my general impression.

Despite the large number of scientists devoting their time to developing new tools for next generation sequencing data, I think the software lags a little behind the other technological advances in this area. With really large amounts of data, assembly becomes hard or impossible, mapping erroneous, annotation too slow (the pilot study of the 1000 Genomes Project generated so much data that the computing farm was busy for a full 60 days – on a single CPU it would take 25,000 days). Software development for NGS differs dramatically from scientific software development in general and needs much, much better programmers than we usually are. For example, Desmond Higgins was praising open source software – they found an extremely fast implementation of the UPGMA algorithm (much faster than their own), and they could speed up their tool (SeedMap) so much that it now runs on even the largest families of sequences in a reasonable time.
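For the record, the 1000 Genomes numbers quoted above imply a farm of roughly 400+ CPUs running flat out:

```python
# Back-of-the-envelope: how many CPUs must have been busy in parallel,
# given the figures quoted for the 1000 Genomes pilot study.
cpu_days = 25_000   # single-CPU time quoted for the pilot analysis
wall_days = 60      # how long the farm was actually busy
cpus = cpu_days / wall_days
print(f"~{cpus:.0f} CPUs running in parallel")  # ~417 CPUs
```

That is a lot of hardware to keep fed with data, which connects back to the first day's storage and networking talks.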


Another bottleneck was the data presentation layer – there are some attempts to make digging into the data easier, but getting a biologically meaningful overview is as hard as it was before. Other people pointed out that problem too (I wasn't the only biologist there).

Need for stronger community

Probably the funniest part of the workshop was the discussion about creating an organized community of people working with next generation sequencing technologies. It was funny in the sense that some consensus about the need for a community emerged quite fast. How to build it – that was another story. Obviously lots of participants were sure that if they build a site, people will come. Yeah, sure. 🙂 I suggested using a wiki in the first place and additionally hiring a community manager if they really want to gather people from many different forums, sites, groups etc. Lots of people didn't buy these ideas, suggesting a more traditional approach, so, curious whether they were right, I'm going to follow the development of this community.

NGS = high tech

Probably the most important lesson was realizing that sequencing is a field with very high requirements for infrastructure and even higher requirements for skilled staff. Basically every element of the infrastructure may become a bottleneck, and if you want to avoid that, the cost of data maintenance and analysis very quickly exceeds the cost of producing the data. When I talked about this with many people during the last year (I'm involved in some sequencing projects at the analysis/annotation step), people often felt I was overestimating the infrastructure needs. Now I have some specific numbers to back it up :).


Posted by on November 21, 2009 in bioinformatics



All 2.0 – an attempt to connect disciplines

Last year I bought a domain name. Initially I had an idea to launch a huge portal around the "2.0" meme – essentially tracking changes in communication methods across various areas. I wanted to quit science and start a consulting career helping people communicate more efficiently (new channels and tools, efficient visual communication, etc.). However, the market for such services in Poland is nonexistent, and I wasn't in the mood for relocation, so I turned to other opportunities (and as a result, I've stayed in science). Nevertheless, I still had the domain but no clear idea what to use it for.

So, with only a little time left, the next option I took was a tracker/aggregator. In theory, once done, it wouldn't need much maintenance. There are quite a lot of services for this purpose out there, but they didn't necessarily allow for certain things I wanted to have, so I had to code my own script. As I didn't have much time, the resulting site is a little rough (it cannot compete with the wonderful sites Euan is coding, such as the recently released preview of Streamosphere). However, you should get an idea of what I'm aiming for. Currently it tracks blog posts and conversations in the areas of Science 2.0, Health 2.0 and Culture 2.0 (with Enterprise and Government to follow). Because within these types I sort all entries by date, I had to remove some bloggers from the "Key People" list, as their high-speed blogging did not allow others to appear in the box at all. 🙂

At this stage, the set of sources is far from perfect – outside of science, the conversations seem highly homogeneous. When I improve the sources (maybe using Twitter and custom FriendFeed searches), I plan to add some kind of visual summary to the tracked conversations, to see if I can find patterns that will let me establish connections between disciplines. Let's see…

While I was collecting links, I found one interesting thing: you can find people interested in these three areas both on FriendFeed and on Twine. However, it seems that only scientists are actively talking with each other on these services – where are the other groups storing their discussions?


Posted by on June 28, 2009 in bioinformatics


Open Science, what is your message?

It recently occurred to me that maybe Open Science could be marketed more efficiently by simplifying its messages and targeting them better. I often find it difficult to convince scientists to support the idea, because Open Science does not seem to solve their problems. Western scientists have one main problem: not enough money – the rest are just details (I will be happy to be proven wrong, but I constantly notice that the majority of scientists will happily play along with the current academic system as long as there's enough money for their research). How about having the main message of the OS movement along the lines of "Open Science = Less Expensive Science" (something that Jean-Claude and Cameron have been saying for some time)? I know we don't have enough evidence to say so, but on the other hand nobody seems to care that there are better measurements of scientific productivity than the impact factor (and we do have some evidence for that).

Simple message – but also better targeting

Such a message is not going to resonate in places that have much more significant problems than lack of money. To me, there are several places in the world that suffer from a different issue – isolation. Thomas Erren, in his short commentary on Phil Bourne's "Ten Simple Rules for Getting Published", cites Rosalyn Yalow, a Nobel prize laureate:

… I am in full sympathy with rejecting papers from unknown authors working in unknown institutions. How does one know that the data are not fabricated? … on the average, the work of established investigators in good institutions is more likely to have had prior review from competent peers and associates even before reaching the journal.

And it’s just only one side of isolation – there are many more. So, maybe in such places the message of OS should be along the lines of “Open Science = Connected Science” (following one of Deepak’s blog themes), explaining that openness creates connection through which knowledge, experience and recognition can flow both ways?


Posted by on June 22, 2009 in bioinformatics


Dreaming about bio-spreadsheet

One of the frequently occurring tasks in my work is presenting the results of an analysis in some kind of table. I have used quite a number of approaches for this purpose, from generating a simple HTML file, through fetching SQL data into a table stored in a wiki, up to using Rails. One of my recent dreams is a web-based spreadsheet that would allow me to apply some specific piece of code over every row/column and show the resulting table.

A simple mockup is shown above. In this example, the code:

print "<img src='{column_1}_bio_r_250.jpg'>"

… iterated over the first column containing PDB codes, would substitute those codes with images of the proteins from a PDB server.

In other words, I dream about a simple (a single file would be best – I like the approach the Sinatra framework takes), web-based, programmable spreadsheet. Something like Resolver One, but simpler. Is there anything like that available?
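Just to make the dream a bit more concrete, here's a minimal sketch of the core mechanic I have in mind – a per-column snippet applied to every row, rendered as an HTML table (the table contents and the image-URL pattern are made-up examples, not real PDB URLs):

```python
# Minimal sketch of the "bio-spreadsheet" idea: apply a user-supplied
# transform to each cell of a chosen column and render the result as an
# HTML table. Real spreadsheet UI, persistence, etc. are left out.
def render_table(rows, transforms):
    """transforms maps a column index to a one-argument cell function."""
    html = ["<table>"]
    for row in rows:
        cells = [transforms.get(i, str)(cell) for i, cell in enumerate(row)]
        html.append("<tr>" + "".join(f"<td>{c}</td>" for c in cells) + "</tr>")
    html.append("</table>")
    return "\n".join(html)

rows = [("1abc", 42), ("2xyz", 17)]  # hypothetical PDB codes + some score
# Column 0: replace each PDB code with an image tag (illustrative URL).
transforms = {0: lambda pdb: f"<img src='{pdb}_bio_r_250.jpg'>"}
print(render_table(rows, transforms))
```

Wrap that in a tiny web framework with an editable code box per column and you have most of the mockup above.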


Posted by on May 19, 2009 in bioinformatics, Software


HMMER3 testing notes – my skills are (finally) becoming obsolete

[Image via Wikipedia: Hidden Markov Model with Output]

It’s already quite a while since I’ve started to extensively test performance of HMMER3. As many other people noticed before, speed of the search has improved dramatically – I’m really impressed how fast it is. However, it’s only part of the story. The smaller part actually.

As some readers may know, most of my projects so far have revolved around protein sequence analysis and sequence–structure relationships. Mainly I was analyzing sequences that had no clear similarity to anything known and no functional annotation. The usual task was to run sequence comparison software and look at the end of the hit list, trying to make sense of hits beyond any reasonable E-value threshold (for example, I often run BLAST at an E-value of 100 or 1000). I use a very limited number of tools, because it takes quite a while to understand which specific patterns a particular piece of software fails on.

The high-end tool I use most often is HHpred – HMM–HMM comparison software. It's slow but very sensitive – my personal benchmarks show that it is able to identify very subtle patterns in sequence, formed slightly above the level of similar secondary structures (in other words, from a set of equally dissimilar sequences with an identical secondary structure order, it correctly identifies the ones with similar tertiary structure).

The most surprising thing about HMMER3 is that in my personal benchmarks it's almost as sensitive as HHpred. I wasn't expecting HMM–sequence comparison to be nearly as good as HMM–HMM. This observation suggests that there's still room for improvement in the latter approach; however, it already has big implications.

PFAM will soon migrate to HMMER3 (the PFAM team is now resolving overlaps between families that arose due to the increased sensitivity), and the moment it is available, it will make a huge number of publications obsolete, or simply wrong. There are thousands of articles that discuss in detail the evolutionary history of some particular domain (many of these will become obsolete) or draw conclusions from the observation that some domain is not present in the analyzed sequence/system (many of these will need to be revised). It will also make my skills quite obsolete, but that is always to be expected, no matter what branch of science one works in. I also imagine that systems biology people will be very happy to have much better functional annotation of proteins.

I don’t want to call development of HMMER3 a revolution, but it will definitely have similar impact on biology as BLAST and HMMER2 had. Not only because of its speed, but also because it will create a picture of similarities between all proteins comparable to the picture state-of-the-art methods could only calculate for their small subset.


Posted by on April 22, 2009 in bioinformatics, Research, Software



Structure prediction without structure – visual inspection of BLAST results

My recent post on visual analytics in bioinformatics lacked a specific example, but I'm happy to finally provide one (the happiness comes also from the fact that the respective publication is finally in press). The image above shows a multiple pairwise alignment from BLAST of a putative inner membrane protein from Porphyromonas gingivalis. The image is small, but it does not really matter – the colour patches are visible anyway.

The regions marked with ovals are clearly less conserved than the rest of the protein. There are five hydrophobic regions (green patches, underlined with blue lines) in this alignment (I ignore the N-terminus, as it is likely the signal peptide); the three inner ones appear to be of similar length, while the outer ones seem to be half as long as the inner ones. If we assume that the single unit is the short one, we can summarize the protein as follows: eight beta structures, four long loops, four short loops. It looks like an eight-stranded outer membrane beta-barrel. Almost a structure prediction, but without a structure.
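The counting argument above can be written down as a tiny rule: take the short segment as the unit, and each long segment counts as two units. The segment lengths below are illustrative numbers, not measurements from the actual alignment:

```python
# Sketch of the visual counting argument: with the short hydrophobic
# segment as the unit, 2 short + 3 double-length segments add up to 8
# units, i.e. an eight-stranded barrel.
def strand_units(segment_lengths):
    unit = min(segment_lengths)
    return sum(round(length / unit) for length in segment_lengths)

# Hypothetical lengths: outer segments ~half as long as the inner ones.
hydrophobic_segments = [10, 21, 19, 20, 11]
print(strand_units(hydrophobic_segments))  # 8
```

It's almost embarrassing how simple the "prediction" is once the pattern is visible in the coloured alignment.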

I could have ended the story here, but the model didn't fit previously published data. The protein's localization in the inner membrane had been confirmed by an experiment, yet pores in the inner membrane are considered very harmful 😉 . Fortunately, one of my colleagues explained to me that this particular localization technique is not 100% reliable, so I gathered more evidence and created a detailed description of the topology, and the other group designed experiments which confirmed my visual analysis.

Lessons learned? Maybe without this feedback on the quality of that experimental technique, I would still claim that this is an OM beta-barrel. Or maybe not. But I've learned that to safely ignore experimental results, one needs more than intuition. It also shows that sometimes looking at the results is all one needs to make a reasonable prediction (I still have no idea what the E-values of these BLAST hits were, but does it matter?).


Posted by on February 3, 2009 in bioinformatics, Research, Visualization



Database query and ranked results

[Image via Wikipedia: The Autophagy network extracted from the recen…]

Some time ago I read a piece by Marcelo Calbucci: Is it a database or a search engine?. While it deals with searching for information within a real estate database, I think his comments are applicable in many areas of the life sciences.

In short, Marcelo points out that people miss a lot of interesting entries while looking for a house because of the inflexibility of the query; the number of bedrooms, price, distance from some point – these are all fixed. However, users are flexible, and in such a case they need a search engine that gives them a close-enough answer or allows them to assign a weight to each filter.

In the life sciences we search for similarities and analogies all the time. Sometimes it's a direct comparison of sequences; on other occasions it's a high-level meta-comparison between two systems. And while we have various (statistical) metrics of similarity, and they sometimes become part of database designs, the interfaces of biological databases don't allow ranking query results according to these metrics. For example, I can easily find all human proteins related to disease X or disease Y or disease Z, but I cannot specify that I want proteins related to Z AND Y first on the list. Another example is searching PubMed – I can look for articles related to "synthetic biology", but I have no way to specify that I want papers by James Collins from HHMI AND articles related to these papers first on the list. I guess it is possible to obtain such results without going through the whole list, but I doubt the method would be very simple. Flexible ranking still seems to be a neglected aspect of database design in the life sciences.
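To make the idea concrete, here's a minimal sketch of weighted ranking instead of hard filtering. The protein records and weights are made-up illustrative data, not anything from a real database:

```python
# Ranked retrieval instead of hard filtering: score each entry by a
# weighted sum of how well it matches each criterion, then sort by
# score descending instead of excluding non-matches outright.
def rank(entries, weights):
    def score(entry):
        return sum(weights.get(k, 0.0) * v
                   for k, v in entry["features"].items())
    return sorted(entries, key=score, reverse=True)

proteins = [
    {"name": "P1", "features": {"disease_Z": 1.0, "disease_Y": 0.0}},
    {"name": "P2", "features": {"disease_Z": 1.0, "disease_Y": 1.0}},
    {"name": "P3", "features": {"disease_Z": 0.0, "disease_Y": 1.0}},
]
# "I want proteins related to Z AND Y first on the list":
weights = {"disease_Z": 1.0, "disease_Y": 1.0}
ranked = [p["name"] for p in rank(proteins, weights)]
print(ranked)  # P2 first, since it matches both criteria
```

The sliders I dream about below would simply be live controls over the entries of `weights`.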

My dream biological search engine would have a series of sliders (or ideally, a device with a series of mechanical knobs attached to the computer) and would allow me to dynamically change the weights of various aspects of the query and immediately see how they affect the results. It would resemble the interactivity of Gapminder World, but on dynamically generated data. The technology and proof of concept seem to be there, but I guess we need to wait quite a few years before this approach is adopted in the life sciences.


Posted by on January 22, 2009 in bioinformatics, Data mining, Software



Science and art. New theme for the new year.

[Image via Wikipedia: Bose–Einstein condensate in the July 14, 1995 …]

In 2007 this blog was mainly scientific. Last year I explored the possibilities of being a freelance scientist. As I announced earlier on Twitter, the theme for this year will be science and art. And I should explain right away: I'm not going to write about such extraordinary artistic endeavours as creating music from DNA/protein sequences, try to convince you that science is beautiful, or claim that my pictures of molecules are true art. I'm more interested in seeing whether there's anything I can learn from The Art, its history and its approach. While I'm not yet sure what I will end up writing about, here are two topics I may start with to see in which direction this theme unfolds.

Holistic approach to science

This is something I've been thinking about for a while. I haven't come up with anything interesting yet, but I think it's worth exploring further. Some first ideas came from reading the Wikipedia entry on lateralization of brain functions and Steve Brenner's comments about the "middle-out approach" (as opposed to top-down or bottom-up). I've also found peculiar Mihaly Csikszentmihalyi's answer to the Edge 2009 question, where he wrote about "the end of analytic science". Very recently I also found an interesting interview with Daniel Tammet, an autistic savant, who explains his theory of exceptional creativity coming from "hyper-connectivity" of distinct brain regions. I have no idea yet whether there's anything practical to be found in such theories, but exploring them will be appealing enough.

Dashboard design for scientific data

This is something more practical, although again I expect to get no points for this topic. An information dashboard is a very cool concept, rarely used in the life sciences. One of the best-known examples in bioinformatics may be the InterPro domain page (here's an example entry on the pore-forming lobe of aerolysins) – almost everything is on a single page, and it has some nice graphical overviews of particular features (like species distribution). It's not the prettiest dashboard around, but at least you don't need to click anywhere to get an overview of the stored information (compare it to the PFAM approach to a similar domain). I hope to learn what makes a great dashboard, experiment a little and see if the result is worth the effort.

Other topics

I will still be blogging about bioinformatics, visualizations and open science – that stays in place. Especially the last topic is something I expect to write about quite a lot – my feeling is that this year will bring a couple of interesting events in this area (and I hope to initiate some of them). So if you don't like the "science and art" theme, I think I will give you some other reasons to visit this blog once in a while.


Posted by on January 11, 2009 in bioinformatics