I have the privilege of hosting the next edition of Bio::Blogs. If you have anything you would like to have included, please send an email to szczesny dot pawel at gmail dot com or to bioblogs at gmail dot com before the 1st of November.
Genome Commons – knowledgebase of human genetic variation
The title says it all – have a look at Steve Brenner’s commentary in Nature (it looks like it’s freely accessible) and the Genome Commons web page.
Blender in visualization of molecules
Yes, you can use Blender to prepare figures for your next paper, and the results will certainly look different from those obtained with standard software (hemoglobin [1HBG] as an example below)… But given the amount of work and the really steep learning curve (at least for somebody trying it for the very first time), I would not recommend Blender that much… 🙂

UPDATE: if you are looking for a way to import a PDB file into Blender, some instructions are at the bottom of this page.
Type VII secretion system
Yet another secretion system has been described, this time from Gram-positive bacteria (types I to VI were from Gram-negative). I expect that the further microbiology moves away from E. coli, the more secretion systems will be found. Across the large spectrum of bacterial species, we still know very little about bacteria outside the proteobacterial group.
This is from Nature Reviews Microbiology, and subscription may be required.
iBioSeminars
Another example of a web-based educational site: iBioSeminars was launched by The American Society for Cell Biology and contains seminars on medicine, cell biology and biological mechanisms. All are available for download in QuickTime, mp4, iPodVideo or PowerPoint formats. Via ScienceRoll.
iBioSeminars is a freely available library of seminars from outstanding scientists. Our mission is to host lectures that describe on-going research in leading laboratories (they are not basic, survey-style lectures as might be found in undergraduate or graduate student biology courses). However, iBioSeminars features a more extensive introduction into the subject matter than a typical 50 min university seminar. Thus, these lectures are intended to be more accessible than many typical department seminars to advanced undergraduates/beginning graduate students and researchers outside of the specific field.
Healia and third party PubMed/Medline tools
David Rothman describes Healia, an easy-to-use interface to PubMed. But it’s just one of the many third-party PubMed/Medline tools David has described. Check out his posts related to the one about Healia.
Tenure dossier
Janet D. Stemwedel from Adventures in Ethics and Science publishes photographs of the three-ring binder containing her tenure dossier. She ends this post with the sentence: “I seem to recall that there are important aspects of life that you can’t cram into a three-hole punch.”
Manual sequence analysis – some common mistakes
This is a topic I will probably come back to on many occasions. Publications with badly flawed sequence analysis, like the one Stephen Spiro pointed out on his blog, are not an exception. I agree that large-scale analysis can tolerate quick and dirty treatment of protein sequences (and some error propagation at the same time). In large-scale analysis nobody cares if the domain assignment is 100% right (it isn’t), if there are false positives (there are) or even if the starting material (protein sequences, for example) is free of errors (it is not), as long as the overall quality of the work is acceptable. However, this optimistic approach cannot be applied to manual protein sequence analysis, simply because errors introduced in such cases matter far more. How to avoid some of these errors? A few common mistakes that come to my mind are:
- missing, inaccurate or quick and dirty domain annotation: this is probably a topic for a separate post, but in short: relying on a single method or a strict E-value cutoff, excluding overlaps, ignoring internal repeats, forgetting about structural elements like transmembrane helices etc. all lead to mistakes in domain annotation
- running PSI-BLAST searches on unclustered databases: for many query sequences the profile will get biased and diverge in a random direction if PSI-BLAST runs on an unclustered database (remember 500 copies of the same protein in the results?); after all these years I still don’t get why NCBI does not provide nr90 (a non-redundant database clustered at a 90% identity threshold) for PSI-BLAST
- running PSI-BLAST without looking at the results of each iteration: if you don’t assess what goes into the profile, you risk letting some garbage in
- masking low-complexity, coiled-coil and transmembrane regions in every single BLAST search: while most of the time this is a valid approach, there are cases where the answer is revealed only after turning the masking off
- skipping other sequence analysis tools, like predictors of signal sequences, motifs and functional sites
- skipping analysis of the genomic context: while not applicable to all systems, analysis of the genomic context may dramatically influence function prediction
That’s all I could think of so far. Do you have any other suggestions? Let me know.
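To make the point about unclustered databases a bit more concrete, here is a toy sketch of what "clustering at a 90% identity threshold" means. Everything in it is an assumption made for illustration: the function names are made up, and the identity estimate uses Python's difflib instead of a real sequence alignment. For actual work you would use a dedicated tool such as CD-HIT.

```python
from difflib import SequenceMatcher


def approx_identity(a: str, b: str) -> float:
    """Crude identity estimate: residues in matching blocks / shorter length.

    This is a stand-in for alignment-based identity, not a replacement for it.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(seqs: dict[str, str], threshold: float = 0.9) -> dict[str, list[str]]:
    """Greedy incremental clustering, processing the longest sequences first.

    Each sequence joins the first representative it matches at or above
    the threshold; otherwise it becomes a new representative.
    """
    clusters: dict[str, list[str]] = {}
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for rep in clusters:
            if approx_identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


# Hypothetical toy sequences: three near-identical copies and one unrelated.
toy = {
    "a": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "a_copy": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "a_mut": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVA",
    "b": "WWWWWCCCCCNNNNNPPPPPDDDDDWWWWW",
}
clusters = greedy_cluster(toy)
```

Searching only against the cluster representatives (the dictionary keys) is what keeps those 500 copies of the same protein from dominating the profile.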
Survey of domain bubbles in protein sequence analysis
One of the key steps in the analysis of an unknown protein sequence is the identification of the domains that constitute that protein. There are many online tools that will search for the presence of known domains or identify them ab initio. Usually only the former present results in a graphical way called “domain bubbles”. Below you can find examples of common approaches to presenting the results of a sequence annotation. Since most of them use the same domain definitions, the names of the hits are the same in almost all cases.
One note: it’s not a comparison of the servers’ performance. The sequence is the same in all cases, but that was to show the differences between visualization methods, not the quality of the annotation.

This is an example of sequence annotation by the SMART server. Domains are colored according to their source (SMART has a collection of domain definitions from various sources), and non-domain sequence features (like transmembrane segments, low-complexity regions and disorder) are clearly differentiated from domains. The picture is generated with GIMP and its Perl-Fu extension, and the script is available for download from the homepage of Ivica Letunic.

The color scheme used by PFAM is quite clear: the same domains have the same colors. PFAM (as well as the following two servers) also shows partial hits in the picture. These are cases where the similarity between the domain and the protein spans only a fragment of the domain (which may indicate many things, like genomic rearrangements, frameshifts, a weak domain definition, etc). The PFAM script can actually plot many other sequence features onto the picture as well. You can use the script with your own annotation data here – the input is coded as an XML file conforming to PFAM’s schema.
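A partial hit can be flagged with a simple coverage check. This is only a sketch under my own assumptions: the field names and the 80% cutoff are hypothetical, and it is not how PFAM itself decides what to draw as a partial hit.

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    name: str
    model_start: int   # first matched position in the domain model (1-based)
    model_end: int     # last matched position in the domain model (inclusive)
    model_length: int  # total length of the domain model


def model_coverage(hit: DomainHit) -> float:
    """Fraction of the domain model covered by the hit."""
    return (hit.model_end - hit.model_start + 1) / hit.model_length


def is_partial(hit: DomainHit, min_coverage: float = 0.8) -> bool:
    """Flag hits that cover less than min_coverage of the model (assumed cutoff)."""
    return model_coverage(hit) < min_coverage


full_hit = DomainHit("example_domain", 1, 100, 100)
fragment = DomainHit("example_domain", 40, 100, 100)
```

Here `fragment` matches only positions 40 to 100 of a 100-position model, so it would be drawn as a partial hit under this cutoff.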

CDD looks pretty similar to PFAM and shares some visual features. However, the CD-Search page shows more than one line of hits in its graphics. Usually the first line contains the best hits for a particular fragment, and the following lines show overlapping hits with worse scores. Only the first line is shown here.
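The "best hits first, overlapping hits on the lines below" layout can be sketched with a small greedy procedure. This is my own illustration, not CD-Search's actual algorithm; the `Hit` type and the assumption that a higher score is better are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    name: str
    start: int    # 1-based, inclusive
    end: int      # inclusive
    score: float  # assumption: higher is better


def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two hits share at least one sequence position."""
    return a.start <= b.end and b.start <= a.end


def layout_rows(hits: list[Hit]) -> list[list[Hit]]:
    """Place hits best-first into the topmost row where they overlap nothing."""
    rows: list[list[Hit]] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        for row in rows:
            if not any(overlaps(hit, other) for other in row):
                row.append(hit)
                break
        else:
            rows.append([hit])
    # Order each row left to right for drawing.
    return [sorted(row, key=lambda h: h.start) for row in rows]


hits = [
    Hit("A", 1, 100, 50.0),
    Hit("B", 80, 200, 30.0),
    Hit("C", 150, 300, 40.0),
    Hit("D", 90, 160, 10.0),
]
rows = layout_rows(hits)
```

With these made-up hits, A and C share the first row, while B and D overlap better-scoring hits and are pushed down to rows of their own.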

OK, I may be biased here, since HHpred is coded by my former colleagues, but I really like the domain bubbles from this server. The color scheme is different from that of any other server: bubbles are colored according to the score, from red (the best) to blue (the worst). It also shows partial and overlapping hits (only a few are shown here; the actual results page spans a few screens in my browser). Similar to CDD, HHpred does not plot any sequence features other than domains.
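Coloring by score is easy to reproduce in your own figures. The sketch below maps a score linearly onto the hue range from red to blue. It is only an approximation of the idea: the function name and the linear mapping are my assumptions, not HHpred's actual palette.

```python
import colorsys


def score_to_hex(score: float, best: float, worst: float) -> str:
    """Map a score to a hex color: red for the best, blue for the worst."""
    if best == worst:
        t = 0.0
    else:
        # t = 0 at the best score, 1 at the worst; clamp values outside the range.
        t = (best - score) / (best - worst)
        t = min(1.0, max(0.0, t))
    hue = t * (240.0 / 360.0)  # hue 0 is red, hue 240 degrees is blue
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))


best_color = score_to_hex(100.0, best=100.0, worst=0.0)
worst_color = score_to_hex(0.0, best=100.0, worst=0.0)
```

Intermediate scores pass through yellow, green and cyan on the way from red to blue.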
So these are the major domain annotation servers that present prediction results in a nice graphical way (there are many others, but not all of them use this simple way of presenting data; InterPro is one example). Do these approaches, which are after all pretty similar, exhaust all possible ways of presenting the domain structure of a protein? I don’t think so. Watch this site, I may have something to add pretty soon.





Thoughts on CASP – Critical assessment of methods of protein structure prediction
I’ve just read the introduction to the supplemental issue of the journal PROTEINS dedicated to the most recent round of the CASP experiment. It describes the progress in protein structure prediction over the last few CASP editions.
The list of advancements includes:
I believe that this was possible thanks to the progress that has been made in the area of sequence homology searches. Finding similarity between two sequences well beyond any reasonable identity threshold is now doable thanks to profile-to-profile comparison, meta-servers (combining predictions from many different methods) and recent HMM-to-HMM algorithms (comparison of Hidden Markov Models). If you can find a suitable template for your protein, the rest is then much easier, isn’t it?
There are of course areas that still need some work. One of these often stirs a lot of discussion: automated assessment of a model’s similarity to the real structure. The current methods have proven their suitability, I definitely agree. However, I hope that at some point protein structure comparison software will refuse to superimpose eight- and ten-stranded beta-barrels, or left- and right-handed coiled coils, with the message: “It doesn’t make sense.”
Posted by Pawel Szczesny on October 10, 2007 in Comments, Papers, Research, Structure prediction
Tags: bioinformatics, casp, Proteins, Research, Structure prediction