Configuring Torque and InterProScan

10 07 2008

If you happen to want to use InterProScan with the Torque Resource Manager (a queueing system based on the PBS project), it doesn’t work by default (it’s tested with LSF, and configuration files are supplied for the original PBS and Sun Grid Engine). Fortunately, only two small changes to the InterProScan config files are needed to make it work. First, during iprscan configuration, choose PBS54 as your queueing system. Then, in the file pbs54.conf (in IPRSCANHOME/conf), remove the “-d” switch from the following two lines:

asyncsub=qsub [%optqueue][%optresource] -d -o /dev/null -e /dev/null "[%toolcmd]"
syncsub=qsub [%optqueue][%optresource] -d -o /dev/null -e /dev/null -I "[%toolcmd]"

Next, assuming that the Torque binaries (qsub, qdel etc.; on my machine they sit under /usr/local/bin) are available in the global PATH, change the default shell in the environment file pbs54env.sh from #!/bin/sh to #!/bin/bash. You can also add other directories to the PATH in that file (I didn’t). Voilà. InterProScan jobs are now queued.
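For anyone who prefers to script the two edits, here is a minimal Python sketch. The function names are made up for illustration and are not part of InterProScan or Torque. The functions work on file contents as strings, and the first one assumes that -d appears as a standalone token on the asyncsub and syncsub lines, as in the two lines quoted above.

```python
import re


def strip_d_switch(conf_text: str) -> str:
    """Drop the standalone -d switch from the asyncsub= and syncsub= lines."""
    fixed = []
    for line in conf_text.splitlines(keepends=True):
        if line.startswith(("asyncsub=", "syncsub=")):
            # Match " -d" only as a whole token, so "-o /dev/null" is untouched.
            line = re.sub(r"\s-d(?=\s)", "", line)
        fixed.append(line)
    return "".join(fixed)


def use_bash_shebang(env_text: str) -> str:
    """Replace a leading #!/bin/sh shebang with #!/bin/bash."""
    first, sep, rest = env_text.partition("\n")
    if first.strip() == "#!/bin/sh":
        first = "#!/bin/bash"
    return first + sep + rest
```

To apply it, read pbs54.conf and pbs54env.sh, pass their contents through the matching function and write the result back. Keeping a backup copy of both files first is probably a good idea.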





Human genobiome and disease risk assessment

6 07 2008

I’ve recently attended a talk on the progress of human metagenomics projects. As the speaker admitted, the whole field is a researchers’ gold mine - almost everything they find is new and interesting. There were a couple of interesting points, mainly concerning how limited our knowledge in this area is. For example, there was an unconfirmed feeling among microbiologists that all of modern microbiology is in fact nothing more than the biology of E. coli and its relatives. Now we know that for sure - the number of microbial species known to us is estimated at 0.5% of all existing microbial species. I also heard a nice story about a Polish doctor who in the 19th century described Helicobacter pylori and its role in gastric diseases (there was a Nobel Prize for that in 2005), wrote a book and then trashed the whole thing because he couldn’t grow the bacteria in pure culture. Another important issue was the amount of data and the lack of new ways of handling it.

But the most interesting thing for me was the connection between the human microbiome and diseases. Or rather the possibility of such a connection. I am not aware of any single case where the composition of the human microbiome has been proven to influence the chance of getting ill, and I don’t think many such correlations will be found soon. My impression is that correlations will be found once we have both a complete human genome and a complete metagenome of everything that lives on a particular person - a human genobiome, as I’ve called it (BTW, the word “genobiome” is not present in Google - is there a better word for that?). And I believe that getting the first full human genobiome will be an achievement comparable to sequencing the human genome for the first time. Not because of technical difficulties, but because of all the discoveries that need to be made to make it happen. For example, the gut of every human carries a species performing some sulfur reaction - but its population is only up to a few thousand cells. How many such cases do we have in our organisms? That is a very good question. The field is brand new, and the possibilities for speculation are endless.





PhD thesis in LaTeX

19 06 2008

For the record: here you can see a single (still unfinished) page of my PhD thesis prepared in LaTeX. I used the PhD thesis style prepared by Jamie Stevens and wrote the whole thing in the Kile editor. An image can be placed in the margin with the following command:

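% Requires \usepackage{graphicx} in the preamble.
\marginpar{
     % \centering is a declaration, so it takes no braces;
     % it centers everything up to the end of the margin note.
     \centering
     \includegraphics[width=3cm]{image.pdf}
     Caption text
}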




Surprises in biological databases - nr

25 05 2008

If you wonder why clustering a recent nr database from NCBI with cd-hit takes ages, here’s an answer:

>gi|10955428|ref|NP_053140.1| hypothetical protein pB171_078 [Escherichia coli]gi|16082681|ref|NP_395228.1|
 transposase/IS protein [Yersinia pestis CO92]gi|16082847|ref|NP_395401.1| transposase/IS protein [Yersinia
 pestis CO92]gi|16120383|ref|NP_403696.1| transposase/IS protein [Yersinia pestis CO92]gi|16120444|ref|NP_4
03757.1| transposase/IS protein [Yersinia pestis CO92]gi|16120514|ref|NP_403827.1| transposase/IS protein [
Yersinia pestis CO92]gi|16120586|ref|NP_403899.1| transposase/IS protein [Yersinia pestis CO92]gi|16120719|
ref|NP_404032.1| transposase/IS protein [Yersinia pestis CO92]gi|16120857|ref|NP_404170.1| transposase/IS p
rotein [Yersinia pestis CO92]gi|16120894|ref|NP_404207.1| transposase/IS protein [Yersinia pestis CO92]gi|1
6120962|ref|NP_404275.1| transposase/IS protein [Yersinia pestis CO92]gi|16121092|ref|NP_404405.1| transpos
ase/IS protein [Yersinia pestis CO92]gi|16121136|ref|NP_404449.1| transposase/IS protein [Yersinia pestis C
O92]gi|16121228|ref|NP_404541.1| transposase/IS protein [Yersinia pestis CO92]gi|16121314|ref|NP_404627.1|
transposase/IS protein [Yersinia pestis CO92]gi|16121385|ref|NP_404698.1| transposase/IS protein [Yersinia
pestis CO92]gi|16121430|ref|NP_404743.1| transposase/IS protein [Yersinia pestis CO92]gi|16121620|ref|NP_40
4933.1| transposase/IS protein [Yersinia pestis CO92]gi|16121706|ref|NP_405019.1| transposase/IS protein [Y
ersinia pestis CO92]gi|16121792|ref|NP_405105.1| transposase/IS protein [Yersinia pestis CO92]gi|16121890|r
ef|NP_405203.1| transposase/IS protein [Yersinia pestis CO92]gi|16121951|ref|NP_405264.1| transposase/IS pr
otein [Yersinia pestis CO92]gi|16121988|ref|NP_405301.1| transposase/IS protein [Yersinia pestis CO92]gi|16
122008|ref|NP_405321.1| transposase/IS protein [Yersinia pestis CO92]gi|16122148|ref|NP_405461.1| transposa
se/IS protein [Yersinia pestis CO92]gi|16122266|ref|NP_405579.1| transposase/IS protein [Yersinia pestis CO
92]gi|16122324|ref|NP_405637.1| transposase/IS protein [Yersinia pestis CO92]gi|16122408|ref|NP_405721.1| t
ransposase/IS protein [Yersinia pestis CO92]gi|16122588|ref|NP_405901.1| transposase/IS protein [Yersinia p
estis CO92]gi|16122620|ref|NP_405933.1| transposase/IS protein [Yersinia pestis CO92]gi|16122738|ref|NP_406
051.1| transposase/IS protein [Yersinia pestis CO92]gi|16122852|ref|NP_406165.1| transposase/IS protein [Ye
rsinia pestis CO92]gi|16122926|ref|NP_406239.1| transposase/IS protein [Yersinia pestis CO92]gi|16123007|re
f|NP_406320.1| transposase/IS protein [Yersinia pestis CO92]gi|16123118|ref|NP_406431.1| transposase/IS pro
tein [Yersinia pestis CO92]gi|16123368|ref|NP_406681.1| transposase/IS protein [Yersinia pestis CO92]gi|161
23410|ref|NP_406723.1| transposase/IS protein [Yersinia pestis CO92]gi|16123439|ref|NP_406752.1| transposas
e/IS protein [Yersinia pestis CO92]gi|16123584|ref|NP_406897.1| transposase/IS protein [Yersinia pestis CO9
2]gi|16123688|ref|NP_407001.1| transposase/IS protein [Yersinia pestis CO92]gi|16123734|ref|NP_407047.1| tr
ansposase/IS protein [Yersinia pestis CO92]gi|16123839|ref|NP_407152.1| transposase/IS protein [Yersinia pe
stis CO92]gi|16123892|ref|NP_407205.1| transposase/IS protein [Yersinia pestis CO92]gi|16123908|ref|NP_4072
21.1| transposase/IS protein [Yersinia pestis CO92]gi|16124133|ref|NP_407446.1| transposase/IS protein [Yer
sinia pestis CO92]gi|22123963|ref|NP_667386.1| transposase/IS protein [Yersinia pestis KIM]gi|22124031|ref|
NP_667454.1| transposase/IS protein [Yersinia pestis KIM]gi|22124203|ref|NP_667626.1| transposase/IS protei
n [Yersinia pestis KIM]gi|22124372|ref|NP_667795.1| transposase/IS protein [Yersinia pestis KIM]gi|22124391
|ref|NP_667814.1| transposase/IS protein [Yersinia pestis KIM]gi|22124420|ref|NP_667843.1| transposase/IS p
rotein [Yersinia pestis KIM]gi|22124556|ref|NP_667979.1| transposase/IS protein [Yersinia pestis KIM]gi|221
24665|ref|NP_668088.1| transposase/IS protein [Yersinia pestis KIM]gi|22124814|ref|NP_668237.1| transposase
/IS protein [Yersinia pestis KIM]gi|22124844|ref|NP_668267.1| transposase/IS protein [Yersinia pestis KIM]g
i|22124913|ref|NP_668336.1| transposase/IS protein [Yersinia pestis KIM]gi|22125025|ref|NP_668448.1| transp
osase/IS protein [Yersinia pestis KIM]gi|22125118|ref|NP_668541.1| transposase/IS protein [Yersinia pestis
KIM]gi|22125219|ref|NP_668642.1| transposase/IS protein [Yersinia pestis KIM]gi|22125447|ref|NP_668870.1| t
ransposase/IS protein [Yersinia pestis KIM]gi|22125565|ref|NP_668988.1| transposase/IS protein [Yersinia pe
stis KIM]gi|22125833|ref|NP_669256.1| transposase/IS protein [Yersinia pestis KIM]gi|22125913|ref|NP_669336
.1| transposase/IS protein [Yersinia pestis KIM]gi|22126032|ref|NP_669455.1| transposase/IS protein [Yersin
ia pestis KIM]gi|22126111|ref|NP_669534.1| transposase/IS protein [Yersinia pestis KIM]gi|22126227|ref|NP_6
69650.1| transposase/IS protein [Yersinia pestis KIM]gi|22126294|ref|NP_669717.1| transposase/IS protein [Y
ersinia pestis KIM]gi|22126458|ref|NP_669881.1| transposase/IS protein [Yersinia pestis KIM]gi|22126621|ref
|NP_670044.1| transposase/IS protein [Yersinia pestis KIM]gi|22126672|ref|NP_670095.1| transposase/IS prote
in [Yersinia pestis KIM]gi|22126967|ref|NP_670390.1| transposase/IS protein [Yersinia pestis KIM]gi|2212702
6|ref|NP_670449.1| transposase/IS protein [Yersinia pestis KIM]gi|22127088|ref|NP_670511.1| transposase/IS
protein [Yersinia pestis KIM]gi|22127284|ref|NP_670707.1| transposase/IS protein [Yersinia pestis KIM]gi|22
127489|ref|NP_670912.1| transposase/IS protein [Yersinia pestis KIM]gi|22127607|ref|NP_671030.1| transposas
e/IS protein [Yersinia pestis KIM]gi|22127670|ref|NP_671093.1| transposase/IS protein [Yersinia pestis KIM]
gi|22127690|ref|NP_671113.1| transposase/IS protein [Yersinia pestis KIM]gi|22127900|ref|NP_671323.1| trans
posase/IS protein [Yersinia pestis KIM]gi|31795384|ref|NP_857837.1| transposase/IS protein [Yersinia pestis
 KIM]gi|31795462|ref|NP_857912.1| transposase/IS protein [Yersinia pestis KIM]gi|32470047|ref|NP_862989.1|
putative ATP-binding protein [Escherichia coli]gi|45439896|ref|NP_991435.1| transposase/IS protein [Yersini
a pestis biovar Microtus str. 91001]gi|45439948|ref|NP_991487.1| transposase/IS protein [Yersinia pestis bi
ovar Microtus str. 91001]gi|45440109|ref|NP_991648.1| transposase/IS protein [Yersinia pestis biovar Microt
us str. 91001]gi|45440257|ref|NP_991796.1| transposase/IS protein [Yersinia pestis biovar Microtus str. 910
01]gi|45440297|ref|NP_991836.1| transposase/IS protein [Yersinia pestis biovar Microtus str. 91001]gi|45440
401|ref|NP_991940.1| transposase/IS protein [Yersinia pestis biovar Microtus str. 91001]

But to tell the honest truth, this is not the problem - it is less than 10% of only one of many such problems. This particular protein (gi number: 10955428) has over three hundred other gi numbers in its header in the non-redundant database from NCBI, which apparently made cd-hit stand still for weeks in amusement at such a lengthy description. A quick fix in Perl, and now the clustering is going to finish within a few hours, as it should.
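The Perl fix itself is not shown here, so what follows is only a Python sketch of the same idea, not the original script: keep the first entry of every FASTA header and drop the rest. It assumes that merged entries in an nr header are separated by a Ctrl-A character (\x01), which does not show up in the listing above. When no separator is found, it falls back to a hard length cap; the 200-character default is arbitrary. The function name is hypothetical.

```python
from typing import Iterable, Iterator


def shorten_headers(lines: Iterable[str], sep: str = "\x01",
                    max_len: int = 200) -> Iterator[str]:
    """Yield FASTA lines with each header cut down to its first entry."""
    for line in lines:
        if line.startswith(">"):
            # Keep only the text before the first separator (assumed Ctrl-A).
            first_entry = line.rstrip("\n").split(sep, 1)[0]
            # Hard cap in case the separator is missing from the header.
            yield first_entry[:max_len] + "\n"
        else:
            # Sequence lines pass through unchanged.
            yield line
```

Since the function takes and returns an iterator of lines, an open file handle can be passed in and the result handed to writelines() on the output file, so the whole database never has to sit in memory.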





Joining ONS club - classification and prediction of bacteriocins

3 05 2008

It’s finally time to jump into the Open Notebook Science pool with my small project: classification and prediction of bacteriocins. The main page of this project is on the Freelancing Science wiki: freelancingscience.wikispaces.com/bacteriocins. After reading a recent post by Michael Barton on ONS, I’ve decided to stick to the wiki only - I had already set up another blog for this project, but if a blog doesn’t work very well for Michael, I doubt it will work for me. Since it’s entirely a side project, updates on the project blog would be embarrassingly rare. So far the wiki doesn’t contain much data - nothing more than a plan, in fact. But I think it’s important to at least start somewhere.
The direct inspiration for the project was this post at Microbiology Blog. It describes the results of some experiments on growth inhibition of bacteria by haloarchaeal organisms, which could in some cases be explained by novel archaeocins, peptide or protein antibiotics from Archaea. After a quick look I realised that I could see sequence similarity between seemingly unrelated bacteriocins. That of course led to the question of whether I am able to repeat the procedure from my PhD project - understand the protein family, and then write an annotation/prediction tool. I don’t expect outstanding results, but at least this will be a good occasion to document my approach to protein sequence annotation. So if not scientific, it should have at least a little educational value.





Domain annotation in trimeric autotransporter adhesins

10 04 2008

The first major outcome of my PhD project has just appeared in Bioinformatics (open access). It describes a system we have designed to annotate a specific group of bacterial proteins.

Trimeric autotransporter adhesins (TAAs) form one of the many families of bacterial surface proteins. In medically relevant species they adhere to host cells (in non-pathogenic species we don’t know what they adhere to), and are therefore considered essential virulence factors. They are autotransporters, which means that they pass through the outer membrane by themselves - the C-terminal part forms a pore through which the rest of the protein goes out. In contrast to many other autotransporters, the exported part is not cut off but stays attached to the membrane by the C-terminal autotransport domain. TAAs are also trimeric - the pore is made of three subunits and the exported fiber is also a trimer. The last feature is pretty unique - so far it’s the only family of bacterial surface proteins that forms fibrous trimers. Interestingly, they vary in length from a few hundred to five thousand residues.

What’s so special about these proteins for a bioinformatician? The structure of the fiber is not homogeneous - it is a mixture of globular domains and coiled coils. At the sequence level, they have lots of internal repeats (see the picture) and a heavily biased residue composition, and their domain composition and architecture vary from protein to protein. The only part conserved in all TAAs is the autotransport domain. Systems designed to identify and annotate typical protein domains (such as PFAM) don’t handle them very well - the average coverage of PFAM annotation of TAAs is about 30%. The server we have built relies on the fact that the domains of TAAs are exclusive to this family (they do not appear anywhere else because of its unique structural constraints). Therefore we could use different thresholds, manually curated alignments and rules derived from domain context to improve the annotation.

Manual analysis of TAA sequences is pretty tedious (well, it was; now we have the server), but on the other hand I have learnt a lot about how to read a protein sequence. I mean really read it and understand how a particular combination of letters influences its structure.





BioBrick as a functional role

3 04 2008

When I first saw The MIT Registry of Standard Biological Parts, I just fell in love with the idea. However, after closer inspection I realized that it’s not what I hoped to find. The Registry collects interchangeable functional modules that can be assembled into novel biological systems. And it does this as well as it sounds, but only to a certain extent. Pedro wrote some time ago about the unavoidable complexity and potential issues with the collected parts. I completely agree with his arguments, but I have even more doubts about the Registry’s current approach.

First of all, my feeling is that the DNA-centric view of life is starting to limit our understanding of what is happening at the molecular level. It has moved science forward a lot and it is still extremely useful, but with genetics we are not going to understand and avoid the emergent properties of biological systems. DNA, RNA and proteins all interact with each other at the sequence and structure level. These properties are encoded in DNA, I agree. However, as Pedro pointed out, we have no way to predict what happens after transferring a part to another organism. It is possible to select for mutations that will render the part usable in the other organism, but I don’t think this approach would be of much use when dealing with organisms that are hard to grow (imagine you want to insert a specific system into an extremophile). What is more, it’s not necessarily practical if we transfer the part to an organism which already has a similar element encoded in its genome.

In my humble opinion, the Registry can be extended in two directions: transforming parts into containers that have a specific functional role, and including sub-gene elements, like domains or tectons. Let me describe both in more detail.

Currently a BioBrick is assigned a function and a sequence. I would rather see a functional role that can be fulfilled by many different sequences. For example, for an enzymatic function the BioBrick would include not only a single DNA sequence from a single organism, but also a protein sequence, domains, sequence motifs and a structure (whatever is available), and all of these should be available for every organism for which we can reliably assign this information. To clarify, I’m far from suggesting that the Registry be populated with BLAST results. I would rather have it done manually, or at least in the way The SEED allows experts to create subsystems and assign functional roles to proteins. In this way we could just take a gene from the target organism instead of mutating the original one. Having a container would also mean that we could include different flavors of the same gene (for example, after optimization).

As for the second direction, I’m a big fan of creating novel functions out of existing elements. That’s the reason why I believe the Registry should include building blocks of proteins as well as other fancy things, like riboswitches. One obvious example would be a signal transduction element, where one can attach different receptor domains to the same membrane component. This has already been done thousands of times - why not standardize it?

Maybe with these two changes we could finally connect some dots and make the complexity of biological systems more understandable, or at least traceable. The future directions of the Registry are not very well defined, so I believe there’s room for at least a discussion about such ideas.





Semi-automated workflows - Taverna Interaction Service

12 03 2008

I was still thinking about Neil’s recent musings on the possibility of automating every scientific workflow when I saw this (Bioinformatics Advance Access abstract):

The Taverna Interaction Service: enabling manual interaction in workflows by Anders Lanzén and Tom Oinn

Taverna is an application that eases the integration of tools and databases for life science research by the construction of workflows. The Taverna Interaction Service extends the functionality of Taverna by defining human interaction within a workflow and acting as a mediation layer between the automated workflow engine and one or more users.

I have not tried it yet, but this Taverna plugin is very likely an answer to the doubts I often have when automation of bioinformatics workflows is discussed: we shouldn’t always remove ourselves from the workflow, as interaction with software can often be critical in making a discovery. For example, a conscious decision about which sequences should go in during a PSI-BLAST search can dramatically influence the quality of the resulting profile. So I agree with Neil that not every workflow can be automated, but more importantly, not every workflow should be. The possibility of wrapping one’s mind around a problem is gone when there’s no feedback loop in the process.





Mining PubMed - more tools available

5 03 2008

There are two new tools available that semantically mine PubMed abstracts: e-LiSe and Anne O’Tate. The first was made by my colleagues from the Institute of Biochemistry and Biophysics in Warsaw, while the second is from the University of Illinois, Chicago. Female-sounding names are not the only thing that makes them look similar; they both provide analogous functionality, such as keywords or author names associated with the user’s query.

There are quite a lot of third-party interfaces to PubMed (see David Rothman’s excellent list), so I couldn’t resist running a few queries on both servers and comparing them to GoPubmed, which currently wins hands down in terms of features and interface. I wasn’t surprised to see that the results overlap significantly, although not completely. Each of the servers found valuable keywords that the other two did not - that’s understandable, as they use different algorithms. I wonder if we will see a meta-server of PubMed data-miners, like there are for protein structure prediction (for example meta.bioinfo.pl). In theory, an exhaustive search for meaningful keywords by different methods, followed by their classification and analysis, should work better than any single method, but this is just a guess.





Importance of null models - slides by Kevin Karplus

21 02 2008

Again, a short note today (but I have some longer posts on the way). I’ve just finished reading the slides of the talk Kevin Karplus gave at 3DSig (a satellite conference of the last ISMB in Vienna). The talk was entitled: Better than chance: the importance of null models. If you weren’t there, I hope the take-home messages will convince you to have a look:

  • Base your null models on biologically meaningful null hypotheses, not just computationally convenient math.
  • Generative models and simulation can be useful for more complicated models.
  • Picking the right model remains more art than science.

A very good combination of math skills and a feel for biological problems.
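To make the point about simulation a bit more concrete, here is a small Python sketch (not taken from the slides). It estimates how often a score at least as high as the observed one turns up in data simulated from a null model. The residue-shuffling null used here is only a placeholder: by the first take-home message, it is exactly the kind of convenient choice that should be swapped for something biologically meaningful. All function names are hypothetical.

```python
import random
from typing import Callable


def empirical_p_value(observed_score: float,
                      simulate_null_score: Callable[[random.Random], float],
                      n_samples: int = 1000,
                      seed: int = 0) -> float:
    """Fraction of null simulations scoring at least as high as observed."""
    rng = random.Random(seed)
    hits = sum(simulate_null_score(rng) >= observed_score
               for _ in range(n_samples))
    # Add-one correction keeps the estimate away from an impossible zero.
    return (hits + 1) / (n_samples + 1)


def longest_run(seq: str, residue: str) -> int:
    """Length of the longest uninterrupted run of one residue."""
    best = current = 0
    for ch in seq:
        current = current + 1 if ch == residue else 0
        best = max(best, current)
    return best


def shuffled_run_score(seq: str, residue: str) -> Callable[[random.Random], float]:
    """Placeholder null: score a composition-preserving shuffle of seq."""
    def sample(rng: random.Random) -> float:
        letters = list(seq)
        rng.shuffle(letters)
        return longest_run("".join(letters), residue)
    return sample
```

The null model is passed in as a function, so replacing the shuffle with a more realistic generator does not require touching the p-value code.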