Category Archives: bioinformatics

Domain annotation in trimeric autotransporter adhesins

The first major outcome of my PhD project has just appeared in Bioinformatics (open access). It describes a system we have designed to annotate a specific group of bacterial proteins.

Trimeric autotransporter adhesins (TAAs) form one of the many families of bacterial surface proteins. In medically relevant species they adhere to host cells (in non-pathogenic species we don’t know what they adhere to), and therefore they are considered essential virulence factors. They are autotransporters, which means that they cross the outer membrane by themselves – the C-terminal part forms a pore through which the rest of the protein is exported. In contrast to many other autotransporters, the exported part is not cleaved off but stays attached to the membrane through the C-terminal autotransport domain. TAAs are also trimeric – the pore is made of three subunits and the exported fiber is also a trimer. This last feature is pretty unique – so far it’s the only family of bacterial surface proteins known to form fibrous trimers. Interestingly, they range in length from a few hundred to five thousand residues.

What’s so special about these proteins for a bioinformatician? The structure of the fiber is not homogeneous – it is a mixture of globular domains and coiled coils. At the sequence level, they have lots of internal repeats (see the picture) and heavily biased residue composition, and their domain composition and architecture vary from protein to protein. The only part conserved in all TAAs is the autotransport domain. Systems designed to identify and annotate typical protein domains (such as PFAM) don’t handle them very well – the average coverage of PFAM annotation of TAAs is about 30%. The server we have built relies on the fact that the domains of TAAs are exclusive to this family (they do not appear anywhere else because of its unique structural constraints). Therefore we could use different thresholds, manually curated alignments and rules derived from domain context to improve the annotation.
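To make the coverage figure concrete, here is a minimal sketch of one way such a number could be computed. It assumes coverage means the fraction of residues that fall inside at least one annotated domain; the function name and the example coordinates are hypothetical and are not taken from the paper or from PFAM.

```python
def annotation_coverage(length, domains):
    """Fraction of residues covered by at least one domain hit.

    length:  protein length in residues
    domains: iterable of (start, end) pairs, 1-based and inclusive
    Overlapping hits are counted only once.
    """
    covered = 0
    last_end = 0  # last residue already counted as covered
    for start, end in sorted(domains):
        start = max(start, last_end + 1)  # skip residues counted before
        end = min(end, length)            # ignore anything past the C-terminus
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / length


# Hypothetical 1000-residue protein with three hits, two of them overlapping.
hits = [(1, 100), (50, 200), (900, 1000)]
print(f"coverage: {annotation_coverage(1000, hits):.1%}")
```

Averaging this value over all proteins in a family would give a figure comparable to the "about 30%" quoted above, under the stated assumption.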

Manual analysis of TAA sequences is pretty tedious (well, it was; now we have the server), but on the other hand I have learnt a lot about how to read a protein sequence. I mean really read it and understand how a particular combination of letters influences its structure.


Posted by on April 10, 2008 in bioinformatics


BioBrick as a functional role


When I initially saw The MIT Registry of Standard Biological Parts, I just fell in love with the idea. However, after closer inspection I realized that it’s not what I hoped to find. The Registry collects interchangeable functional modules that can be assembled into novel biological systems. And it does this as well as it sounds, but only to a certain extent. Pedro wrote some time ago about unavoidable complexity and potential issues with the collected parts. I completely agree with his arguments, but I have even more doubts about the Registry’s current approach.

First of all, my feeling is that the DNA-centric view of life is starting to limit our understanding of what is happening at the molecular level. It has moved science forward a lot and it is still extremely useful, but with genetics alone we are not going to understand and avoid emergent properties of biological systems. DNA, RNA and proteins all interact with each other at the sequence and structure level. These properties are encoded in DNA, I agree. However, as Pedro pointed out, we have no way to predict what happens after transferring a part to another organism. It is possible to select for mutations that will render this part usable in the other organism, but I don’t think this approach would be of much use if we deal with organisms that are hard to grow (imagine you want to insert a specific system into an extremophile). What is more, it’s not necessarily practical if we transfer the part to an organism which already has a similar element encoded in its genome.

In my humble opinion, the Registry can be extended in two directions: transforming parts into containers that have a specific functional role, and including sub-gene elements, like domains or tectons. Let me describe both in more detail.

Currently a BioBrick is assigned a function and a sequence. I would rather see a functional role that can be fulfilled by many different sequences. For example, if we have an enzymatic function, the BioBrick would include not only a single DNA sequence from a single organism, but also a protein sequence, domains, sequence motifs and a structure (whatever is available), and all these should be available for all organisms for which we can reliably assign this information. To clarify, I’m far from suggesting that we populate the Registry with BLAST results. I would rather have it done manually, or at least in the way The SEED allows experts to create subsystems and assign functional roles to proteins. In this way we could just take a gene from the target organism instead of mutating the original one. Having a container would also mean that we could include different flavors of the same gene (for example, after optimization).

As for the second direction, I’m a big fan of creating novel functions out of existing elements. That’s the reason why I believe the Registry should include building blocks of proteins as well as other fancy things, like riboswitches. One obvious example would be a signal transduction element, where one can attach different receptor domains to the same membrane component. This has been done thousands of times already – why not standardize it?

Maybe with these two changes we could finally connect some dots and make the complexity of biological systems more understandable, or at least traceable. Future directions of the Registry are not very well defined, so I believe there’s room for at least a discussion about such ideas.



Semi-automated workflows – Taverna Interaction Service

I was still thinking about Neil’s recent musings on the possibility of automating every scientific workflow when I saw this (Bioinformatics Advance Access abstract):

The Taverna Interaction Service: enabling manual interaction in workflows by Anders Lanzén and Tom Oinn

Taverna is an application that eases the integration of tools and databases for life science research by the construction of workflows. The Taverna Interaction Service extends the functionality of Taverna by defining human interaction within a workflow and acting as a mediation layer between the automated workflow engine and one or more users.

I have not tried it yet, but this Taverna plugin is very likely an answer to the doubts I often have when automation of bioinformatics workflows is discussed: we shouldn’t always remove ourselves from the workflow, as interaction with the software can often be critical in making a discovery. For example, a conscious decision about which sequences should be included during a PSI-BLAST search can dramatically influence the quality of the resulting profile. So I agree with Neil that not every workflow can be automated, but more importantly, not every workflow should be. The possibility of wrapping one’s mind around a problem is gone when there’s no feedback loop in the process.


Posted by on March 12, 2008 in bioinformatics, Papers, PubMed



Mining PubMed – more tools available

There are two new tools available that semantically mine PubMed abstracts: e-LiSe and Anne O’Tate. The first was made by my colleagues from the Institute of Biochemistry and Biophysics in Warsaw, while the second is from the University of Illinois, Chicago. Female-sounding names are not the only thing that makes them look similar; they both provide analogous functionality, such as keywords or author names associated with the user’s query.

There are quite a lot of third-party interfaces to PubMed (see David Rothman’s excellent list), so I couldn’t resist running a few queries on both servers and comparing them to GoPubmed, which currently wins hands down in terms of features and interface. I wasn’t surprised to see that the results overlap significantly, although not completely. Each of the servers found valuable keywords that the other two did not – that’s understandable, as they use different algorithms. I wonder if we will see a meta-server of PubMed data miners, like the ones that exist for protein structure prediction. In theory, an exhaustive search for meaningful keywords by different methods, followed by their classification and analysis, should work better than any single method, but this is just a guess.
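The simplest form such a meta-server could take is a vote across tools. The sketch below is only an illustration of that guess: the tool names and keyword lists are made up, and none of the three services is actually queried here.

```python
from collections import Counter


def consensus_keywords(results_by_tool, min_votes=2):
    """Return keywords reported by at least `min_votes` different tools.

    results_by_tool: dict mapping a tool name to the list of keywords it returned.
    Keywords are compared case-insensitively; each tool gets one vote per keyword.
    """
    votes = Counter()
    for keywords in results_by_tool.values():
        votes.update({keyword.lower() for keyword in keywords})
    return sorted(k for k, count in votes.items() if count >= min_votes)


# Hypothetical output of three PubMed miners for the same query.
results = {
    "tool_a": ["Adhesin", "Yersinia"],
    "tool_b": ["adhesin", "autotransporter"],
    "tool_c": ["yersinia", "adhesin"],
}
print(consensus_keywords(results))
```

A real meta-server would also need to weight tools and merge synonyms, which a plain vote ignores.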


Posted by on March 5, 2008 in bioinformatics, Data mining, PubMed



Importance of null models – slides by Kevin Karplus

Again, a short note today (but I have some longer posts on the way). I’ve just finished reading the slides of the talk Kevin Karplus gave at 3DSig (a satellite conference of the last ISMB in Vienna). The talk was entitled: Better than chance: the importance of null models. If you weren’t there, I hope the take-home messages will convince you to have a look:

  • Base your null models on biologically meaningful null hypotheses, not just computationally convenient math.
  • Generative models and simulation can be useful for more complicated models.
  • Picking the right model remains more art than science.

A very good combination of math skills and a feel for biological problems.


Posted by on February 21, 2008 in bioinformatics, Structure prediction



Can a biologist fix a radio?

[via Molecule of the Day] Go and read (if you haven’t before) this brilliant piece on modern biology: Can a Biologist Fix a Radio? Read it twice if you call yourself a bioinformatician…


Posted by on February 5, 2008 in bioinformatics, Fun



“Startup weekends” in science

News about yet another “startup-weekend-like” event keeps hitting me more and more often. These events are not always about creating a company or a product. Sometimes it’s about collaboratively coding a game or writing a novel – all in a very short time. In many cases it works amazingly well – being so tight on time forces people to be ultra-productive and to focus only on the important parts of the project. I envy people attending such meetings, not necessarily because of the possible outcomes, but because of the energetic atmosphere there.

Deepak wrote some time ago about “Bursty work” – the idea that work can be done by distributed teams focused around high-value projects, instead of teams gathered around a company or startup. That actually made me wonder whether we can join these two ideas in science: an ultra-productive, distributed team working on a time-constrained project.

Let’s assume that the average publication in bioinformatics/computational biology takes six months of work by one scientist. It doesn’t really matter if it’s a new server, a database or a protein family annotation. A team of four people should then do the same work in about six weeks or faster (why faster? knowledge and skills are not distributed evenly, so someone else may code the necessary script faster than I would). If we increased the number of people involved even further, created a distraction-free environment and prepared enough coffee for everyone, the whole process could be done in a week. Even if the assumptions here are not really correct, I’m pretty sure that quite a number of valuable papers could be produced this way in a week.
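The back-of-the-envelope arithmetic can be written down explicitly. This sketch assumes that six months is roughly 26 working weeks and that the work divides perfectly between people, which is exactly the assumption that is "not really correct"; the `efficiency` parameter is a hypothetical knob for coordination overhead or for skill gains.

```python
def weeks_needed(person_weeks, team_size, efficiency=1.0):
    """Calendar weeks for a team, assuming the work splits evenly.

    efficiency < 1 models coordination overhead,
    efficiency > 1 models gains from complementary skills.
    """
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return person_weeks / (team_size * efficiency)


PERSON_WEEKS = 26  # roughly six months of one scientist

for size in (1, 4, 26):
    print(f"{size:2d} people: {weeks_needed(PERSON_WEEKS, size):.1f} weeks")
```

Under these assumptions four people need 6.5 weeks, and finishing in one week would take about 26 people, or fewer if the team really is more efficient than its members working alone.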

So what do you think? What about creating a platform that allows for:

  • creating a project that has a clear and appealing outcome (for example a publication, or at least a manuscript in Nature Precedings)
  • creating a project workspace with all necessary tools (wiki, chat, svn, etc. plus a small computational backend for testing)
  • creating a number of roles that need to be filled by people with certain skills
  • joining the project if one’s skills match the requirements
  • setting a clear deadline (for example, a countdown clock that blocks commits to the project after a certain amount of time, leaving the workspace read-only)

I agree that science takes time, especially quality science. But on the other hand, I have a feeling that we waste a lot of time learning things by ourselves instead of learning from others, we waste time because the outcome is not well defined, and finally we waste time solving everything ourselves instead of bouncing ideas off other people (this is what collaboration is all about). So what about creating an artificial environment that forbids wasting time?

Utopian? Maybe. Naive? Most likely. Worth considering? I hope so. Let me know.



Jane – Journal/Author Name Estimator

Jane – Journal/Author Name Estimator is a new web-based application that can suggest potential reviewers or target journals for a manuscript based on its title and abstract. It was just published by Bioinformatics under Advance Access (but unfortunately it’s not an open access article). I have tested it on two of my upcoming publications and Jane performed well: I wasn’t surprised by most of the predicted names and journal titles. The topic I’m writing about in these papers is rather narrow, so don’t treat this as a performance measure – test it yourself if you are interested.

I’m probably not going to use it as the authors suggest – I consider this application a helpful literature research tool.


Posted by on January 28, 2008 in bioinformatics, Papers, Research, Services



Visualization of internal repeats in proteins (or DNA)

There are a number of protein families that have internal repeats (such as TPR, Armadillo, ankyrin, etc.). I’m very interested in many of them for reasons I will explain in another post. Assessing the arrangement of these repeats is straightforward in the majority of cases – most of them tend to occur next to each other, with few or no insertions between them (finding them in the first place is a completely different story). However, there are proteins where internal repeats are separated by other domains or repeats, which can result in a real mess (or in scientific language: a mosaic-like architecture). When, a couple of months ago, I looked for a visualization method that would give me a quick overview of the internal structure of such proteins, I stumbled across The Shape of Song – a visualization method developed by Martin Wattenberg, a researcher at IBM. It fitted my requirements, so I implemented it with some help from Processing (and later added it to a protein analysis server that has a chance of being published next month). The resulting visualization is below:

Internal repeats in a protein

Repeats are colored according to repeat type and are connected according to repeat family. If you think about it in terms of the SCOP (Structural Classification of Proteins) hierarchy, colors represent class, while arcs connect superfamilies. The longer and more complicated the analysed sequence is, the more useful this approach seems to be; for short proteins, typical domain bubbles would work better.
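The geometry behind such an arc diagram is simple enough to sketch. The code below is not the Processing implementation mentioned above; it is a minimal illustration that assumes each repeat is given as start, end and family, and that an arc joins the midpoints of consecutive repeats of the same family. The `Repeat` and `Arc` names are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Repeat(NamedTuple):
    start: int   # first residue of the repeat
    end: int     # last residue of the repeat
    family: str  # repeats of the same family get connected


class Arc(NamedTuple):
    center: float  # x position of the semicircle's center on the sequence axis
    radius: float  # half the distance between the two repeat midpoints
    family: str


def arcs_for_repeats(repeats):
    """Connect each repeat to the next repeat of the same family."""
    by_family = defaultdict(list)
    for repeat in sorted(repeats, key=lambda r: r.start):
        by_family[repeat.family].append(repeat)

    arcs = []
    for family, members in by_family.items():
        for left, right in zip(members, members[1:]):
            left_mid = (left.start + left.end) / 2
            right_mid = (right.start + right.end) / 2
            arcs.append(Arc((left_mid + right_mid) / 2,
                            (right_mid - left_mid) / 2,
                            family))
    return arcs


# Two repeats of family "A" separated by an unrelated repeat "B".
example = [Repeat(0, 10, "A"), Repeat(20, 30, "B"), Repeat(40, 50, "A")]
for arc in arcs_for_repeats(example):
    print(arc)
```

Each `Arc` can then be drawn as a semicircle above the sequence axis in any plotting tool, with the color chosen by repeat type.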

People who are into genomic sequences may notice the similarity of this approach to Circos, developed by Martin Krzywinski (whose work I really admire, especially on HDTR). Basically the idea behind both is pretty much the same, but I had never thought about straightening that circle until I saw The Shape of Song. My thinking is sometimes dramatically schematic…



CLANS – java tool for cluster analysis of sequences

As frequent visitors of this blog have already noticed, I am a big fan of tools for data visualization. Today I would like to point you to a piece of Java software called CLANS (CLuster ANalysis of Sequences), developed by my former colleague Tancred Frickey. CLANS runs (PSI)BLAST on your sequences, all vs all, and clusters them in 2D or 3D according to their similarity. This method allows for rapid classification of huge datasets and has an advantage over, let’s say, a phylogenetic tree: one can quickly assess the results of the clustering visually (I cannot imagine making any sense of a phylogenetic tree with 1500 branches, while the graphical output, as in the animation below, is pretty easy to read).

CLANS animation

The beauty of the idea behind CLANS is that you can apply this method to almost any dataset that can be translated into all-vs-all relations. The CLANS page has examples from protein clustering and microarray analysis, and (the one I like the most) an image showing how the standard amino acids cluster in space according to BLOSUM62.
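The general idea of turning all-vs-all relations into a 2D picture can be illustrated with a toy force-directed layout. This is not CLANS code and makes no claim about its internals: it assumes that pairwise attraction weights have already been derived somehow (for example by scaling BLAST scores), and the function name, parameters and starting coordinates are all hypothetical.

```python
import math


def layout(positions, attraction, steps=300, step_size=0.05, repulsion=1.0):
    """Toy 2D layout: every pair repels, similar pairs also attract.

    positions:  dict name -> (x, y) starting coordinates
    attraction: dict (name_a, name_b) -> weight >= 0; missing pairs count as 0
    """
    pos = {name: list(xy) for name, xy in positions.items()}
    names = sorted(pos)
    for _ in range(steps):
        force = {name: [0.0, 0.0] for name in names}
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                dx = pos[b][0] - pos[a][0]
                dy = pos[b][1] - pos[a][1]
                dist = max(math.hypot(dx, dy), 1e-6)  # avoid division by zero
                weight = attraction.get((a, b), attraction.get((b, a), 0.0))
                # Positive magnitude pulls the pair together, negative pushes apart.
                magnitude = weight * dist - repulsion / dist
                fx = magnitude * dx / dist
                fy = magnitude * dy / dist
                force[a][0] += fx
                force[a][1] += fy
                force[b][0] -= fx
                force[b][1] -= fy
        for name in names:
            pos[name][0] += step_size * force[name][0]
            pos[name][1] += step_size * force[name][1]
    return {name: tuple(xy) for name, xy in pos.items()}


# Three items on a triangle; only A and B are similar to each other.
start = {"A": (1.0, 0.0), "B": (-0.5, 0.866), "C": (-0.5, -0.866)}
final = layout(start, {("A", "B"): 1.0})
for name, (x, y) in final.items():
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

In this toy setup the similar pair settles close together while the unrelated item drifts away, which is the visual effect that makes clusters easy to spot. Any dataset that yields pairwise weights could be fed in the same way.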

