HMMER3 testing notes – my skills are (finally) becoming obsolete


22 Apr


It’s been quite a while since I started extensively testing the performance of HMMER3. As many other people have noticed, the speed of the search has improved dramatically, and I’m really impressed by how fast it is. However, that’s only part of the story, and actually the smaller part.

As some readers may know, most of my projects so far have revolved around protein sequence analysis and sequence-structure relationships. Mainly I analyzed sequences that had no clear similarity to anything known and no functional annotation. The usual task was to run sequence comparison software and look at the end of the hit list, trying to make sense of hits beyond any reasonable E-value threshold (for example, I often run BLAST at an E-value of 100 or 1000). I use a very limited number of tools, because it takes quite a while to understand which specific patterns a particular piece of software fails on.
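A minimal sketch of that "look at the end of the hit list" step, assuming the hits were saved in the standard 12-column tab-separated format (what BLAST+ writes with `-outfmt 6`, where the E-value is column 11 and the bit score is column 12). The function name `tail_hits`, the `min_evalue` cut-off of 10 and the example rows are made up for illustration; this is not the exact procedure I use, just the general idea.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    evalue: float
    bitscore: float


def tail_hits(lines: Iterable[str], min_evalue: float = 10.0) -> List[Hit]:
    """Return hits with E-value >= min_evalue, sorted so the weakest come last.

    Assumes 12-column tabular rows: column 11 is the E-value,
    column 12 is the bit score. min_evalue is an illustrative cut-off.
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a standard 12-column row
        hit = Hit(fields[0], fields[1], float(fields[10]), float(fields[11]))
        if hit.evalue >= min_evalue:
            hits.append(hit)
    return sorted(hits, key=lambda h: h.evalue)


if __name__ == "__main__":
    # Made-up rows, only to show the expected input shape.
    example = [
        "q1\ts1\t30.0\t100\t70\t0\t1\t100\t1\t100\t1e-20\t95.0",
        "q1\ts2\t22.0\t50\t39\t0\t1\t50\t1\t50\t250\t20.1",
        "q1\ts3\t25.0\t40\t30\t0\t1\t40\t1\t40\t12.0\t24.3",
    ]
    for hit in tail_hits(example):
        print(hit.subject, hit.evalue, hit.bitscore)
```

The script only narrows the list down to the hits that a default threshold would hide. Deciding which of them mean anything still comes from reading the alignments.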

The high-end tool I use most often is HHpred, an HMM-HMM comparison program. It’s slow but very sensitive: my personal benchmarks show that it can identify very subtle sequence patterns that lie just above the level of similar secondary structure (in other words, from a set of equally dissimilar sequences with identical secondary structure order, it correctly identifies the ones with similar tertiary structure).

The most surprising thing about HMMER3 is that in my personal benchmarks it’s almost as sensitive as HHpred. I wasn’t expecting HMM-sequence comparison to be as good as HMM-HMM comparison. This suggests that there is still room for improvement in the latter approach, but the result already has big implications.

PFAM will soon migrate to HMMER3 (the PFAM team is now resolving overlaps between families that arose from the increased sensitivity), and the moment it becomes available, it will make a huge number of publications obsolete or simply wrong. There are thousands of articles that discuss in detail the evolutionary history of some particular domain (many of these will become obsolete) or draw conclusions from the observation that some domain is not present in the analyzed sequence or system (many of these will need to be revised). It will also make my skills quite obsolete, but that is always to be expected, no matter what branch of science one works in. I also imagine that systems biology people will be very happy to have much better functional annotation of proteins.

I don’t want to call the development of HMMER3 a revolution, but it will definitely have an impact on biology similar to that of BLAST and HMMER2. Not only because of its speed, but also because it will create a picture of similarities between all proteins comparable to the one that state-of-the-art methods could so far calculate only for a small subset of them.



Posted by Pawel Szczesny on April 22, 2009 in bioinformatics, Research, Software

Tags: bioinformatics, biology, HMM, HMMER, PFAM

3 responses to “HMMER3 testing notes – my skills are (finally) becoming obsolete”

Richard Karpinski

April 22, 2009 at 16:24

I was fascinated to discover the link from several unique forms of an apparently unrelated gene in a few families to subsequent muscular dystrophy. The middle of the RNA became “sticky” and hybridized with RNA of a signaling protein affecting expression of a hundred or so other genes which are related to the dystrophy. Accidental creation of a double helix of RNA acting as RNAi. Perfectly sensible once you think of it, but weird.

alexbateman

April 30, 2009 at 11:48

Thanks for the interesting article. I’d be interested to read more about your sequence comparison benchmark. I found that PRC outperformed HHsearch on my Pfam Clans benchmark [1]. You might also be interested to know about a recent publication that appeared to significantly improve the performance of HHsearch by looking at a network of matches [2].

References

1. Bateman A, Finn RD. SCOOP: a simple method for identification of novel protein superfamily relationships. Bioinformatics. 2007;23(7):809-814.
2. Jung I, Kim D. SIMPRO: simple protein homology detection method by using indirect signals. Bioinformatics. 2009;25(6):729-735.

Pawel Szczesny

May 2, 2009 at 10:49

Thank you for the interesting comment, Dr. Bateman. I had indeed missed the publication about SIMPRO.

As for my sequence comparison benchmark, I’ll prepare a separate post about that, but in short: it’s not a large-scale benchmark, more of a CASP-like one (I have access to a few structures that are not yet published). Also, I don’t mind false positives (even in cases where I don’t know the answer, I can usually filter these out by looking at the alignment), so it’s very likely that I rank tools in a different order than most other people.