Tag Archives: bioinformatics

HMMER3 testing notes – my skills are (finally) becoming obsolete


It has been quite a while since I started extensively testing the performance of HMMER3. As many other people have noticed, the speed of the search has improved dramatically – I’m really impressed by how fast it is. However, that’s only part of the story. The smaller part, actually.

As some readers may know, most of my projects so far have revolved around protein sequence analysis and sequence-structure relationships. Mainly I analyzed sequences that had no clear similarity to anything known and no functional annotation. The usual task was to run sequence comparison software and look at the end of the hit list, trying to make sense of hits beyond any reasonable E-value threshold (for example, I often run BLAST at an E-value of 100 or 1000). I use a very limited number of tools, because it takes quite a while to understand on which specific patterns a particular program fails.

The high-end tool I use most often is HHpred – HMM-HMM comparison software. It’s slow but very sensitive – my personal benchmarks show that it is able to identify very subtle sequence patterns, just above the level of similar secondary structure (in other words, from a set of equally dissimilar sequences with identical secondary structure order, it correctly identifies the ones with similar tertiary structure).

The most surprising thing about HMMER3 is that in my personal benchmarks it’s almost as sensitive as HHpred. I wasn’t expecting that HMM-sequence comparison could be as good as HMM-HMM. This observation suggests that there is still room for improvement in the latter approach; however, it already has big implications.

PFAM will soon migrate to HMMER3 (the PFAM team is now resolving overlaps between families that arose due to increased sensitivity), and the moment it is available, it will make a huge number of publications obsolete or simply wrong. There are thousands of articles that discuss in detail the evolutionary history of some particular domain (many of these will become obsolete) or draw conclusions from the observation that some domain is not present in the analyzed sequence or system (many of these will need to be revised). It will also make my skills quite obsolete, but that is always to be expected, no matter what branch of science one works in. I also imagine that systems biology people will be very happy to have much better functional annotation of proteins.

I don’t want to call the development of HMMER3 a revolution, but it will definitely have an impact on biology similar to the one BLAST and HMMER2 had. Not only because of its speed, but also because it will create a picture of similarities between all proteins comparable to the picture that state-of-the-art methods could so far calculate only for a small subset of them.


Posted by on April 22, 2009 in bioinformatics, Research, Software



Structure prediction without structure – visual inspection of BLAST results

My recent post on visual analytics in bioinformatics lacked a specific example, but I’m happy to finally provide one (the happiness also comes from the fact that the respective publication is finally in press). The image above shows a multiple pairwise alignment from BLAST of a putative inner membrane protein from Porphyromonas gingivalis. The image is small, but it does not really matter – the colour patches are visible anyway.

Regions marked with ovals are clearly less conserved than the rest of the protein. There are five hydrophobic regions in this alignment (green patches, underlined with blue lines; I ignore the N-terminus, as this is likely the signal peptide). The three inner ones appear to be of similar length, while the two outer ones seem to be half as long. If we assume that the single unit is the short one, each inner region counts as two units (3 × 2 + 2 = 8), and we can summarize the protein as follows: eight beta strands, four long loops, four short loops. It looks like an eight-stranded outer membrane beta-barrel. Almost structure prediction, but without a structure.

I could end the story here, but the model didn’t fit previously published data. The protein’s localization in the inner membrane had been confirmed by an experiment, whereas pores in the inner membrane are considered very harmful 😉 . Fortunately, one of my colleagues explained to me that this particular localization technique is not 100% reliable, so I gathered more evidence and created a detailed description of the topology, and another group designed experiments which confirmed my visual analysis.

Lessons learned? Maybe without this feedback on the quality of that experimental technique, I would still claim that this is an OM beta-barrel. Or maybe not. But I’ve learned that to safely ignore experimental results, one needs more than intuition. It also shows that sometimes looking at the results is all one needs to make a reasonable prediction (I still have no idea what the E-values of these BLAST hits were, but does it matter?).


Posted by on February 3, 2009 in bioinformatics, Research, Visualization



Microblogging in PLoS

I don’t usually repost news, as my FriendFeed stream (also available from the sidebar of this blog) is a more efficient way to let you know about interesting things, but this one deserves a special mention. The recent coverage of the ISMB 2008 conference over at FriendFeed ended up as a publication in PLoS Computational Biology:

Saunders, N., Beltrão, P., Jensen, LJ, Jurczak, D., Krause, R., Kuhn, M. and Wu, S. (2009).
Microblogging the ISMB: A New Approach to Conference Reporting.
PLoS Comput Biol 5(1): e1000263

This is very exciting, but it also has some interesting implications. Of course it means that more and more people will participate in our community and that BioGang projects will finally start to take off (hopefully), but I am also thinking about something else. Do you remember Neil’s post about why you should have an online presence? I think there’s one more thing to add to this list. The authors of this publication and lots of other scientists over at FriendFeed will sooner or later climb to PI-equivalent positions, where they will decide about hiring people. And for them a strong online presence will be an important asset in a CV. Much more important than you’d think today ;).


Posted by on January 30, 2009 in Community



Database query and ranked results


Some time ago I read a piece by Marcelo Calbucci: Is it a database or a search engine?. While it deals with searching for information within a real estate database, I think his comments are applicable in many areas of life sciences.

In short, Marcelo points out that people looking for a house miss a lot of interesting entries because of the inflexibility of the query: number of bedrooms, price, distance from some point – these are all fixed. However, users are flexible, and in such cases they need a search engine that gives them a close-enough answer or allows them to assign a weight to each filter.

In life sciences we search for similarities and analogies all the time. Sometimes it’s a direct comparison of sequences; on other occasions it’s a high-level meta-comparison between two systems. And while we have various (statistical) similarity metrics, and they sometimes become part of a database design, the interfaces of biological databases don’t allow us to rank query results according to these metrics. For example, I can easily find all human proteins related to disease X or disease Y or disease Z, but I cannot specify that I want proteins related to both Z AND Y first on the list. Another example would be searching PubMed – I can look for articles related to “synthetic biology”, but I have no way to specify that I want papers by James Collins from HHMI AND articles related to these papers to be first on the list. I guess it is possible to obtain such results without going through the whole list, but I doubt the method will be very simple. Filtering still seems to be a neglected aspect of database design in life sciences.

My dream biological search engine would have a series of sliders (or ideally, I would like to have a device with a series of mechanical knobs attached to the computer) and would allow me to dynamically change the weights of various aspects of the query and see immediately how that affects the results. It would resemble the interactivity of Gapminder World, but on dynamically generated data. The technology and proof of concept seem to be there, but I guess we need to wait quite a few years before this approach is adopted within life sciences.
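To make the "weights instead of filters" idea a bit more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the `Entry` class, the criterion names and the scores are hypothetical, and a weighted sum is just one simple way to combine criteria, not something any existing database offers. The point is only that no entry is thrown away; changing a weight (moving a slider) merely reorders the list.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """A hypothetical database record with one score per query criterion."""
    name: str
    scores: dict  # criterion name -> score between 0 and 1


def rank(entries, weights):
    """Return entries sorted by the weighted sum of their scores, best first."""
    def total(entry):
        # A criterion missing from an entry contributes 0 instead of excluding it.
        return sum(w * entry.scores.get(c, 0.0) for c, w in weights.items())
    return sorted(entries, key=total, reverse=True)


# Made-up example: proteins linked to diseases X, Y and Z.
entries = [
    Entry("protein_A", {"disease_X": 1.0}),
    Entry("protein_B", {"disease_Y": 1.0, "disease_Z": 1.0}),
    Entry("protein_C", {"disease_Z": 1.0}),
]

# "Sliders": Y and Z matter twice as much as X.
weights = {"disease_X": 1.0, "disease_Y": 2.0, "disease_Z": 2.0}
ranked = [entry.name for entry in rank(entries, weights)]
print(ranked)
```

With these weights the protein linked to both Y and Z comes first, but the other two are still on the list, only lower.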


Posted by on January 22, 2009 in bioinformatics, Data mining, Software



Bioinformatics is a visual analytics (sometimes)

A short description of my research interest is “I do proteins” (I took this phrase from my friend Ana). I try to figure out what a particular protein, protein family, or set of proteins does in the wider context. Usually I start where automated methods have ended – I have all kinds of annotation, so I try to put the data together and form a hypothesis. I recently realized that the process is basically visualizing different kinds of data – or rather looking at the same issue from many different perspectives.

It starts with alignments. Lots of alignments. And they all end up in different forms of visual representation. Sometimes it’s conservation with secondary structure prediction (with AlignmentViewer or Jalview):


Sometimes I look for transmembrane beta-barrels (with ProfTMB):


Sometimes I try to find a pattern in hydrophobicity and side-chain size values across the alignment (Aln2Plot):


Afterwards I look for patterns and interesting correlations in domain organization (PFAM, Smart):


Sometimes I map all these findings onto a structure or a model that I build somewhere along the way from the data found so far (Pymol, VMD, Chimera):


I also try to make sense out of genomic context (works for eukaryotic organisms as well – The SEED):


I investigate how the proteins cluster together according to their similarity (CLANS):


And figure out how the protein or the system I’m studying fits into interaction or metabolic networks (Cytoscape, Medusa, STRING, STITCH):


If there’s some additional numerical information I dump it into analysis software (R, for simpler things DiVisa):


And I make notes throughout the process in the form of a mindmap (Freemind, recently switched to Xmind, because it allows storing attachments and images in the mindmap file, not just linking to them like Freemind does):

So it turns out that I mainly do visual analytics. I spend a considerable amount of time preparing various representations of biological data, and then the rest of the time I look at the pictures. While that’s not something every bioinformatician does, many of my colleagues have their own workflows that also rely heavily on pictures. In some areas it’s more prominent, in others less so, but the fact is that pictures are everywhere.

There are two reasons I use a manual workflow with lots of looking at intermediate results: I work with weak signals (for example, sometimes I need to run BLAST at an E-value of 1000) or I need to deeply understand the system I study. Making connections between two seemingly unrelated biological entities requires wrapping one’s brain around the problem and… lots of looking at it.

And here comes the frustration. I counted that I use more than twenty (!) different programs for visualization. And even though I’m enjoying a monitor setup 4500 pixels wide, which is almost enough to put all that data on screen, the main issue is that the software isn’t connected. AlignmentViewer cannot adjust its display automatically based on the domain I’m looking at or a network node I’m investigating – I need to do it myself. Of course I can couple alignments and structure in Jalview, Chimera or VMD, but I don’t find such a solution usable in the long run. To have the best of all worlds, I need to juggle all these applications.

I’ve been longing for some time for a generic visualization platform that can show 2D and 3D data within a single environment, so I follow the development of the SecondLife visualization environment and the Croquet/Cobalt initiatives. While these don’t look very exciting right now, I hope they will provide a common platform for different visualization methods (and of course a visual collaboration environment).

But to be realistic, visual analytics in biology is not going to become mainstream. It’s far more efficient to improve algorithms for multidimensional data analysis than to spend more time looking at pictures. I have already had a few situations in which I could see a weak signal that became obvious a year or two later. But I’m still going to enjoy scientific visualization. I came to science for aesthetic reasons after all. 🙂


Data from Bioinformatics Career Survey posted

Data analysis of Bioinformatics Career Survey


Michael Barton did a great job of collecting and cleaning the data for the First Bioinformatics Career Survey. Raw results are available at Github; please also read the details on the analysis and on sharing results over at the OWW page.

Michael encouraged everyone to go wild with the analysis, so here’s my quick look at the data. On the image above you can see a scatter plot of salary vs years in the field (top), a histogram of salaries (bottom left), a histogram of planned years in the field and a histogram of positions (bottom right). All plots are colored according to position.

There are some obvious things in these graphs, such as correlations between position and salary or between years in the field and position (see also the video below). But what strikes me is the plot showing the estimated number of future years in the field. There are some local maxima at around 5, 20 and 30 years, but it’s very interesting to see that ca. half of the people see themselves in bioinformatics for another 25-30 years or longer, and there’s no clear correlation between the positions of these people and these predictions (other than that senior/PI-level staff don’t like the idea of working for another 30-40 years). The reason I find it interesting is that I have no idea what bioinformatics will look like in 20-30 years (and that was the reason I put a conservative 5 years in this field). Do you know? Do you have an idea what bioinformatics will look like so far ahead?


Posted by on September 2, 2008 in bioinformatics, Career



BadA head structure

Modularity is one of the most interesting features of the trimeric autotransporter adhesins, and probably one of the most frustrating. As I wrote before, domain annotation is quite difficult, especially since these proteins can often be a few thousand residues long.

BadA, the major adhesin of Bartonella henselae, is probably the best known large TAA out there. Its sequence served us as an unofficial benchmark for a domain annotation tool. Its head consists of three domains, one resembling the head of YadA and two others which we claimed are similar to Hia head domains. When we started this project the claim wasn’t supported very well – E-values of HHpred alignments were around 1 (and of course all less sensitive tools didn’t see anything), but we knew they must be similar (because the two or three conserved residues were exactly where we expected them). The crystal structure of these two domains from BadA couldn’t be solved directly, so we attempted molecular replacement, and that worked. On the picture above you can see three known head structures for TAAs, BadA (ours), Hia and YadA (the full BadA head model is on the right), and the arrangement of corresponding domains in all three proteins. The whole story and lots of pretty pictures (you must see the EM figures) was published yesterday in PLoS Pathogens (OA).

Today the story isn’t as exciting as it was at the beginning. HHpred now easily finds the domains from Hia and BadA to be similar with high probability – an advantage of a bigger database and more intermediate sequences. But I’m still pretty happy about how it went – such projects build confidence in one’s analysis skills.


Posted by on August 9, 2008 in bioinformatics



Surprises in biological databases – nr

If you wonder why clustering with cd-hit of a recent nr database from NCBI takes ages, here’s an answer:

>gi|10955428|ref|NP_053140.1| hypothetical protein pB171_078 [Escherichia coli]gi|16082681|ref|NP_395228.1|
 transposase/IS protein [Yersinia pestis CO92]gi|16082847|ref|NP_395401.1| transposase/IS protein [Yersinia
 pestis CO92]gi|16120383|ref|NP_403696.1| transposase/IS protein [Yersinia pestis CO92]gi|16120444|ref|NP_4
03757.1| transposase/IS protein [Yersinia pestis CO92]gi|16120514|ref|NP_403827.1| transposase/IS protein [
Yersinia pestis CO92]gi|16120586|ref|NP_403899.1| transposase/IS protein [Yersinia pestis CO92]gi|16120719|
ref|NP_404032.1| transposase/IS protein [Yersinia pestis CO92]gi|16120857|ref|NP_404170.1| transposase/IS p
rotein [Yersinia pestis CO92]gi|16120894|ref|NP_404207.1| transposase/IS protein [Yersinia pestis CO92]gi|1
6120962|ref|NP_404275.1| transposase/IS protein [Yersinia pestis CO92]gi|16121092|ref|NP_404405.1| transpos
ase/IS protein [Yersinia pestis CO92]gi|16121136|ref|NP_404449.1| transposase/IS protein [Yersinia pestis C
O92]gi|16121228|ref|NP_404541.1| transposase/IS protein [Yersinia pestis CO92]gi|16121314|ref|NP_404627.1|
transposase/IS protein [Yersinia pestis CO92]gi|16121385|ref|NP_404698.1| transposase/IS protein [Yersinia
pestis CO92]gi|16121430|ref|NP_404743.1| transposase/IS protein [Yersinia pestis CO92]gi|16121620|ref|NP_40
4933.1| transposase/IS protein [Yersinia pestis CO92]gi|16121706|ref|NP_405019.1| transposase/IS protein [Y
ersinia pestis CO92]gi|16121792|ref|NP_405105.1| transposase/IS protein [Yersinia pestis CO92]gi|16121890|r
ef|NP_405203.1| transposase/IS protein [Yersinia pestis CO92]gi|16121951|ref|NP_405264.1| transposase/IS pr
otein [Yersinia pestis CO92]gi|16121988|ref|NP_405301.1| transposase/IS protein [Yersinia pestis CO92]gi|16
122008|ref|NP_405321.1| transposase/IS protein [Yersinia pestis CO92]gi|16122148|ref|NP_405461.1| transposa
se/IS protein [Yersinia pestis CO92]gi|16122266|ref|NP_405579.1| transposase/IS protein [Yersinia pestis CO
92]gi|16122324|ref|NP_405637.1| transposase/IS protein [Yersinia pestis CO92]gi|16122408|ref|NP_405721.1| t
ransposase/IS protein [Yersinia pestis CO92]gi|16122588|ref|NP_405901.1| transposase/IS protein [Yersinia p
estis CO92]gi|16122620|ref|NP_405933.1| transposase/IS protein [Yersinia pestis CO92]gi|16122738|ref|NP_406
051.1| transposase/IS protein [Yersinia pestis CO92]gi|16122852|ref|NP_406165.1| transposase/IS protein [Ye
rsinia pestis CO92]gi|16122926|ref|NP_406239.1| transposase/IS protein [Yersinia pestis CO92]gi|16123007|re
f|NP_406320.1| transposase/IS protein [Yersinia pestis CO92]gi|16123118|ref|NP_406431.1| transposase/IS pro
tein [Yersinia pestis CO92]gi|16123368|ref|NP_406681.1| transposase/IS protein [Yersinia pestis CO92]gi|161
23410|ref|NP_406723.1| transposase/IS protein [Yersinia pestis CO92]gi|16123439|ref|NP_406752.1| transposas
e/IS protein [Yersinia pestis CO92]gi|16123584|ref|NP_406897.1| transposase/IS protein [Yersinia pestis CO9
2]gi|16123688|ref|NP_407001.1| transposase/IS protein [Yersinia pestis CO92]gi|16123734|ref|NP_407047.1| tr
ansposase/IS protein [Yersinia pestis CO92]gi|16123839|ref|NP_407152.1| transposase/IS protein [Yersinia pe
stis CO92]gi|16123892|ref|NP_407205.1| transposase/IS protein [Yersinia pestis CO92]gi|16123908|ref|NP_4072
21.1| transposase/IS protein [Yersinia pestis CO92]gi|16124133|ref|NP_407446.1| transposase/IS protein [Yer
sinia pestis CO92]gi|22123963|ref|NP_667386.1| transposase/IS protein [Yersinia pestis KIM]gi|22124031|ref|
NP_667454.1| transposase/IS protein [Yersinia pestis KIM]gi|22124203|ref|NP_667626.1| transposase/IS protei
n [Yersinia pestis KIM]gi|22124372|ref|NP_667795.1| transposase/IS protein [Yersinia pestis KIM]gi|22124391
|ref|NP_667814.1| transposase/IS protein [Yersinia pestis KIM]gi|22124420|ref|NP_667843.1| transposase/IS p
rotein [Yersinia pestis KIM]gi|22124556|ref|NP_667979.1| transposase/IS protein [Yersinia pestis KIM]gi|221
24665|ref|NP_668088.1| transposase/IS protein [Yersinia pestis KIM]gi|22124814|ref|NP_668237.1| transposase
/IS protein [Yersinia pestis KIM]gi|22124844|ref|NP_668267.1| transposase/IS protein [Yersinia pestis KIM]g
i|22124913|ref|NP_668336.1| transposase/IS protein [Yersinia pestis KIM]gi|22125025|ref|NP_668448.1| transp
osase/IS protein [Yersinia pestis KIM]gi|22125118|ref|NP_668541.1| transposase/IS protein [Yersinia pestis
KIM]gi|22125219|ref|NP_668642.1| transposase/IS protein [Yersinia pestis KIM]gi|22125447|ref|NP_668870.1| t
ransposase/IS protein [Yersinia pestis KIM]gi|22125565|ref|NP_668988.1| transposase/IS protein [Yersinia pe
stis KIM]gi|22125833|ref|NP_669256.1| transposase/IS protein [Yersinia pestis KIM]gi|22125913|ref|NP_669336
.1| transposase/IS protein [Yersinia pestis KIM]gi|22126032|ref|NP_669455.1| transposase/IS protein [Yersin
ia pestis KIM]gi|22126111|ref|NP_669534.1| transposase/IS protein [Yersinia pestis KIM]gi|22126227|ref|NP_6
69650.1| transposase/IS protein [Yersinia pestis KIM]gi|22126294|ref|NP_669717.1| transposase/IS protein [Y
ersinia pestis KIM]gi|22126458|ref|NP_669881.1| transposase/IS protein [Yersinia pestis KIM]gi|22126621|ref
|NP_670044.1| transposase/IS protein [Yersinia pestis KIM]gi|22126672|ref|NP_670095.1| transposase/IS prote
in [Yersinia pestis KIM]gi|22126967|ref|NP_670390.1| transposase/IS protein [Yersinia pestis KIM]gi|2212702
6|ref|NP_670449.1| transposase/IS protein [Yersinia pestis KIM]gi|22127088|ref|NP_670511.1| transposase/IS
protein [Yersinia pestis KIM]gi|22127284|ref|NP_670707.1| transposase/IS protein [Yersinia pestis KIM]gi|22
127489|ref|NP_670912.1| transposase/IS protein [Yersinia pestis KIM]gi|22127607|ref|NP_671030.1| transposas
e/IS protein [Yersinia pestis KIM]gi|22127670|ref|NP_671093.1| transposase/IS protein [Yersinia pestis KIM]
gi|22127690|ref|NP_671113.1| transposase/IS protein [Yersinia pestis KIM]gi|22127900|ref|NP_671323.1| trans
posase/IS protein [Yersinia pestis KIM]gi|31795384|ref|NP_857837.1| transposase/IS protein [Yersinia pestis
 KIM]gi|31795462|ref|NP_857912.1| transposase/IS protein [Yersinia pestis KIM]gi|32470047|ref|NP_862989.1|
putative ATP-binding protein [Escherichia coli]gi|45439896|ref|NP_991435.1| transposase/IS protein [Yersini
a pestis biovar Microtus str. 91001]gi|45439948|ref|NP_991487.1| transposase/IS protein [Yersinia pestis bi
ovar Microtus str. 91001]gi|45440109|ref|NP_991648.1| transposase/IS protein [Yersinia pestis biovar Microt
us str. 91001]gi|45440257|ref|NP_991796.1| transposase/IS protein [Yersinia pestis biovar Microtus str. 910
01]gi|45440297|ref|NP_991836.1| transposase/IS protein [Yersinia pestis biovar Microtus str. 91001]gi|45440
401|ref|NP_991940.1| transposase/IS protein [Yersinia pestis biovar Microtus str. 91001]

But to tell the honest truth, this is not the whole problem – it is less than 10% of just one of many such problems. This particular protein (gi number: 10955428) has over three hundred other gi numbers in its header in the non-redundant database from NCBI, which apparently made cd-hit stand still for weeks, amazed by such a lengthy description. A quick fix in Perl, and now the clustering is going to finish within a few hours, as it should.
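For the curious, this kind of fix can be sketched in a few lines. The version below is Python rather than my Perl script, and it is only an illustration under two assumptions: that the merged descriptions in an nr header are separated by the Ctrl-A character (`\x01`), and that keeping just the first description (cut to a hypothetical `max_len` of 200 characters) is good enough for clustering. The function name `shorten_headers` is made up.

```python
import io


def shorten_headers(lines, separator="\x01", max_len=200):
    """Yield FASTA lines with each header cut down to its first description.

    Assumption: merged nr descriptions are joined by `separator` (Ctrl-A).
    Sequence lines are passed through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            first = line.rstrip("\n").split(separator, 1)[0]
            yield first[:max_len] + "\n"
        else:
            yield line


# Tiny made-up record with two merged descriptions in one header.
fasta = (
    ">gi|1|ref|A.1| protein one [E. coli]"
    "\x01gi|2|ref|B.1| protein two [Y. pestis]\n"
    "MKV\n"
)
cleaned = "".join(shorten_headers(io.StringIO(fasta)))
print(cleaned)
```

For a real file one would pass the open input file instead of the `io.StringIO` object and write the yielded lines to a new file before running cd-hit on it.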


Posted by on May 25, 2008 in bioinformatics



Joining ONS club – classification and prediction of bacteriocins

It’s finally time to jump into the Open Notebook Science pool with my small project: classification and prediction of bacteriocins. The main page of this project is on the Freelancing Science wiki. After reading a recent post by Michael Barton on ONS, I’ve decided to stick to the wiki only – I had already set up another blog for this project, but if a blog doesn’t work very well for Michael, I doubt it will work for me. Since it’s completely a side project, updates on the project blog would be embarrassingly rare. So far the wiki doesn’t contain much data, nothing more than a plan in fact. But I think it’s important to at least start somewhere.
The direct inspiration for the project was this post at Microbiology Blog. It describes results of some experiments on growth inhibition of bacteria by haloarchaeal organisms, which could in some cases be explained by novel archaeocins, peptide or protein antibiotics from Archaea. After a quick look I realised that I could see sequence similarity between seemingly unrelated bacteriocins. That of course led to the question of whether I am able to repeat the procedure from my PhD project – understand the protein family, and then write an annotation/prediction tool. I don’t expect outstanding results, but at least this will be a good occasion to document my approach to protein sequence annotation. So if not scientific, it should have at least a little educational value.


Posted by on May 3, 2008 in bioinformatics, Research



Bug tracking systems in science

I’m not going to describe the painful process of correcting entries in biological databases or errors in publications when one is not the author – we all know how difficult and unrewarding it is. All major databases contain wrong entries – I see misannotated (or nonexistent) genes in Genbank, artificial domains in PFAM and poorly solved structures in PDB. It’s even worse in publications, where across the whole spectrum of journals I see errors which in theory shouldn’t slip through peer review (this includes such prominent publishers as NPG).

One of the best ideas I have heard for addressing this issue is to build a bug tracking system (I would like to give credit to the author, but I cannot find the source; wasn’t that one of the biobloggers?). It’s simple and efficient. Something is wrong? File a bug report. The report would link to the original entry, would be available for aggregation (for example to track the activity of a report’s author), and could possibly be closed by somebody other than the database maintainers or authors if the report itself is wrong. Because it would be external to all databases, maybe it could grow to provide “community corrected” versions of these databases?
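Just to show how little is needed for a start, here is a rough sketch of such a report as a data structure. It is purely illustrative: the `BugReport` class, its fields, the status values and the placeholder entry identifiers are all my assumptions, not a description of any existing system.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class BugReport:
    """A hypothetical bug report kept outside the database it refers to."""
    database: str      # e.g. "Genbank", "PFAM", "PDB"
    entry_id: str      # identifier of the original entry (placeholder here)
    reporter: str
    description: str
    status: str = "open"
    closed_by: str = ""

    def close(self, who, resolution):
        """Close the report; `who` need not be a maintainer or the author."""
        self.status = resolution  # e.g. "fixed" or "invalid"
        self.closed_by = who


def reports_per_author(reports):
    """Aggregate reports by reporter, e.g. to track a reporter's activity."""
    return Counter(report.reporter for report in reports)


reports = [
    BugReport("Genbank", "ENTRY_1", "alice", "gene does not exist"),
    BugReport("PFAM", "ENTRY_2", "alice", "artificial domain"),
    BugReport("PDB", "ENTRY_3", "bob", "poorly solved structure"),
]
# A third person closes a report that turned out to be wrong.
reports[2].close("carol", "invalid")
activity = reports_per_author(reports)
print(activity)
```

A "community corrected" view of a database would then be the original entries plus all reports that were closed as fixed.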

What do you think? How useful could such a system be?


Posted by on April 18, 2008 in Comments, Community, Software

