Category Archives: bioinformatics

Bioinformatics is visual analytics (sometimes)

A short description of my research interest is “I do proteins” (I took this phrase from my friend Ana). I try to figure out what a particular protein, protein family, or set of proteins does in the wider context. Usually I start where automated methods have ended: I have all kinds of annotation, so I try to put the data together and form a hypothesis. I recently realized that the process is basically visualizing different kinds of data, or rather looking at the same issue from many different perspectives.

It starts with alignments. Lots of alignments. And they all end up in different forms of visual representation. Sometimes it’s conservation with secondary structure prediction (with AlignmentViewer or Jalview):

blog-0005

Sometimes I look for transmembrane beta-barrels (with ProfTMB):

blog-0005

Sometimes I try to find a pattern in hydrophobicity and side-chain size values across the alignment (Aln2Plot):

blog-0005

Afterwards I look for patterns and interesting correlations in domain organization (Pfam, SMART):

blog-0008

Sometimes I map all these findings onto a structure, or onto a model that I build somewhere along the way from the data found so far (PyMOL, VMD, Chimera):

blog-0006

I also try to make sense out of genomic context (works for eukaryotic organisms as well – The SEED):

blog-0005

I investigate how the proteins cluster together according to their similarity (CLANS):

blog-0013

And figure out how the protein or the system I’m studying fits into interaction or metabolic networks (Cytoscape, Medusa, STRING, STITCH):

blog-0007

If there’s some additional numerical information I dump it into analysis software (R, for simpler things DiVisa):

blog-0005

And I make notes along the way in the form of a mindmap (FreeMind; I recently switched to XMind because it can store attachments and images in the mindmap file, not just link to them as FreeMind does):

blog-0010

So it turns out that I mainly do visual analytics. I spend a considerable amount of time preparing various representations of biological data, and the rest of the time I look at the pictures. While that’s not something every bioinformatician does, many of my colleagues have their own workflows that also rely heavily on pictures. In some areas it’s more prominent than in others, but the fact is that pictures are everywhere.

There are two reasons I use a manual workflow with lots of looking at intermediate results: either I work with weak signals (for example, sometimes I need to run BLAST with an E-value cutoff of 1000), or I need to deeply understand the system I study. Making connections between two seemingly unrelated biological entities requires wrapping one’s brain around the problem and… lots of looking at it.

And here comes the frustration. I counted that I use more than twenty (!) different programs for visualization. And even though I enjoy a monitor setup 4500 pixels wide, which is almost enough to put all that data on screen, the main issue is that the software isn’t connected. AlignmentViewer cannot adjust its display automatically based on the domain I’m looking at or the network node I’m investigating; I need to do it myself. Of course I can couple alignments and structure in Jalview, Chimera or VMD, but I don’t find such a solution usable in the long run. To have the best of all worlds, I need to juggle all these applications.

For some time I’ve been longing for a generic visualization platform that can show 2D and 3D data within a single environment, so I follow the development of the SecondLife visualization environment and the Croquet/Cobalt initiatives. While these don’t look very exciting right now, I hope they will provide a common platform for different visualization methods (and, of course, a visual collaboration environment).

But to be realistic, visual analytics in biology is not going to become mainstream. It’s far more efficient to improve algorithms for multidimensional data analysis than to spend more time looking at pictures. I have already had a few situations where I could see some weak signal and a year or two later it became obvious. But I’m still going to enjoy scientific visualization. I came to science for aesthetic reasons after all. 🙂

6 Comments

Posted by Pawel Szczesny on December 18, 2008 in bioinformatics, Proteins, Research, Software, Visualization

Tags: bioinformatics, biology, Chimera, Cytoscape, Online Services, protein, Protein family, Visual analytics, Visualization

Synthetic biology is not engineering, it’s programming

19 Nov

The topic of this post has been sitting in my head for a very long time, but I couldn’t come up with a good enough opening. I found one recently in the comment thread under the post on systems biology by Derek Lowe over at In the Pipeline. Citing Cellbio:

A trick of the human mind has us believe that if we rename something, we have changed the fundamental nature of the beast, but we have not.

I have taken it out of context, but it applies very well to the current situation in synthetic biology. My enormous frustration with this field comes from the fact that most of so-called synthetic biology is nothing more than genetic engineering with a more systematic approach. The whole engineering meme has stuck in people’s heads, and many of them seem to care more about characterizing the system than about understanding how it works.

If we take a bearing from a car and one from a bike, the two will differ in shape, and very likely one couldn’t be replaced by the other. However, their role and mechanism are the same, no matter which machine we put them in (this is, BTW, what I tried to say in my previous post on BioBricks, but judging from the comments I failed). Mainstream synthetic biology doesn’t seem to be interested in understanding how the car and the bike work. It’s interested in taking both of them apart as fast as possible, putting labels on the parts, and pretending that now we understand how they work. And while this approach can be successful to a certain extent in engineering, biology, and especially synthetic biology, is not engineering; it’s rather programming.

If we look at a particular component of a conserved signalling pathway in two different organisms, its sequence will most likely differ. And for some pairs of organisms the sequences of this component are no longer freely exchangeable: they need to be mutated to fit a particular chassis. A repository of information about what works where is a great starting point, but it’s about time to move further. It’s about time to express biological systems as sets of functional roles and to build a compiler that transforms an abstract description of a biological system into a sequence understandable by a particular architecture (organism). This is what I think synthetic biology is all about. It’s designing by understanding.

A formalized language of biological processes sounds like a domain of systems biology, but a compiler certainly doesn’t, so such a programming framework could use the best of both worlds. Can you imagine the “Hello world” equivalent of a living cell? Or how you would debug a program in such a language? Sounds like lots of fun.
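To make the compiler idea a bit more tangible, here is a toy sketch in Python. It is nothing more than a lookup table: the role names, the chassis names used as keys, the part identifiers and the function `compile_design` are all made up for illustration, and a real framework would have to do far more than look things up.

```python
# Hypothetical table: functional role -> chassis (organism) -> part identifier.
# Every name below is a placeholder, not a real part or sequence.
PARTS = {
    "sensor_kinase": {"E. coli": "part_A_ecoli", "B. subtilis": "part_A_bsub"},
    "response_regulator": {"E. coli": "part_B_ecoli"},
}


def compile_design(roles, chassis):
    """Translate an abstract list of functional roles into parts for one chassis."""
    compiled = []
    for role in roles:
        variants = PARTS.get(role, {})
        if chassis not in variants:
            # The "compiler error": nothing is known to fill this role in this organism.
            raise LookupError(f"no known part for role {role!r} in {chassis!r}")
        compiled.append(variants[chassis])
    return compiled


if __name__ == "__main__":
    design = ["sensor_kinase", "response_regulator"]
    print(compile_design(design, "E. coli"))
```

The same abstract design fails to "compile" for the second chassis, because the table has no response regulator for it. That is the point where, in the terms used above, a component would have to be mutated to fit.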

8 Comments

Posted by Pawel Szczesny on November 19, 2008 in bioinformatics, Biological engineering, Synthetic biology

Tags: BioBrick, Genetic engineering, Synthetic biology, Systems biology

Data from Bioinformatics Career Survey posted

02 Sep

Data analysis of Bioinformatics Career Survey

Michael Barton did a great job of collecting and cleaning the data for the First Bioinformatics Career Survey. The raw results are available at GitHub; please also read the details on analysing and sharing the results over at the OWW page.

Michael encouraged everyone to go wild with the analysis, so here’s my quick look at the data. In the image above you can see a scatter plot of salary vs years in the field (top), a histogram of salaries (bottom left), a histogram of planned years in the field, and a histogram of positions (bottom right). All plots are colored according to position.

There are some obvious things in these graphs, such as the correlations between position and salary, or between years in the field and position (see also the video below). But what strikes me is the plot showing the estimated number of years people plan to stay in the field. There are some local maxima at around 5, 20 and 30 years, but it’s very interesting to see that ca. half of the people see themselves in bioinformatics for another 25-30 years or longer, and that there’s no clear correlation between the positions of these people and these predictions (other than that senior/PI-level staff don’t like the idea of working for another 30-40 years). The reason I find it interesting is that I have no idea what bioinformatics will look like in 20-30 years (and that was the reason I put a conservative 5 years in this field). Do you know? Do you have an idea what bioinformatics will look like that far ahead?

Comments Off

Posted by Pawel Szczesny on September 2, 2008 in bioinformatics, Career

Tags: bioinformatics, Data analysis, Statistic

BadA head structure

09 Aug

Modularity is one of the most interesting features of the trimeric autotransporter adhesins (TAAs), and probably one of the most frustrating. As I wrote before, domain annotation is quite difficult, especially since these proteins are often a few thousand residues long.

BadA, the major adhesin of Bartonella henselae, is probably the best-known large TAA out there. Its sequence served us as an unofficial benchmark for the domain annotation tool. Its head consists of three domains: one resembling the head of YadA, and two others which we claimed are similar to the Hia head domains. When we started this project the claim wasn’t very well supported: E-values of the HHpred alignments were around 1 (and of course less sensitive tools didn’t see anything), but we knew they must be similar, because two or three conserved residues were exactly where we expected them. The crystal structure of these two domains from BadA couldn’t be solved directly, so we attempted molecular replacement, and that worked. In the picture above you can see the three known head structures for TAAs, BadA (ours), Hia and YadA (the full BadA head model is on the right), and the arrangement of the corresponding domains in all three proteins. The whole story, with lots of pretty pictures (you must see the EM figures), was published yesterday in PLoS Pathogens (OA).

Today the story isn’t as exciting as it was at the beginning. HHpred now easily finds that the domains from Hia and BadA are similar, with high probability, thanks to a bigger database and more intermediate sequences. But I’m still pretty happy about how it went; such projects build confidence in one’s analysis skills.

Domain annotation in trimeric autotransporter adhesins

2 Comments

Posted by Pawel Szczesny on August 9, 2008 in bioinformatics

Tags: Annotation, bioinformatics, protein, Protein domain, protein structure

Bioinformatics Career Survey – two weeks left

21 Jul

If you haven’t filled in the survey yet, please spend a few minutes to do so over at Bioinformatics Zen. There are only two weeks left.

Comments Off

Posted by Pawel Szczesny on July 21, 2008 in bioinformatics

Configuring Torque and InterProScan

10 Jul

If, by any chance, you want to use InterProScan with the Torque Resource Manager (a queueing system based on the PBS project), it doesn’t work by default: InterProScan is tested with LSF, and configuration files are supplied for the original PBS and Sun Grid Engine. Fortunately, only two small changes in the InterProScan config files are needed to make it work. First, during iprscan configuration, choose PBS54 as your queueing system. Then, in the file pbs54.conf (IPRSCANHOME/conf), remove the “-d” switch from the following two lines:

asyncsub=qsub [%optqueue][%optresource] -d -o /dev/null -e /dev/null "[%toolcmd]"
syncsub=qsub [%optqueue][%optresource] -d -o /dev/null -e /dev/null -I "[%toolcmd]"

Assuming that the Torque binaries (qsub, qdel etc.) are available in the global PATH (on my machine they sit under /usr/local/bin), change the default shell in the environment file pbs54env.sh from #!/bin/sh to #!/bin/bash. You can also add other directories to the PATH in that file (I didn’t). Voilà. InterProScan jobs are now queued.
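If you would rather not edit the config by hand, the removal of the “-d” switch can be sketched in a few lines of Python. This is only an illustration, not something shipped with InterProScan: the helper names `strip_d_switch` and `patch_conf` are made up, and the sketch works on the text of pbs54.conf, so reading and writing the actual file is left to you.

```python
import re

# Keys in pbs54.conf whose qsub command line carries the "-d" switch.
QSUB_KEYS = ("asyncsub=", "syncsub=")


def strip_d_switch(line: str) -> str:
    """Remove a standalone "-d" switch from asyncsub/syncsub lines only."""
    if not line.startswith(QSUB_KEYS):
        return line
    # Match whitespace + "-d" only when followed by whitespace, so that
    # other options such as "-o /dev/null" are left untouched.
    return re.sub(r"\s-d(?=\s)", "", line, count=1)


def patch_conf(text: str) -> str:
    """Apply strip_d_switch to every line of the config text."""
    return "".join(strip_d_switch(line) for line in text.splitlines(keepends=True))


if __name__ == "__main__":
    before = 'asyncsub=qsub [%optqueue][%optresource] -d -o /dev/null -e /dev/null "[%toolcmd]"\n'
    print(patch_conf(before), end="")
```

Keep a backup copy of the original pbs54.conf before overwriting it with the patched text.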

3 Comments

Posted by Pawel Szczesny on July 10, 2008 in bioinformatics, Software

Tags: InterProScan, qsub, queueing system, torque

Human genobiome and disease risk assessment

06 Jul

I’ve recently attended a talk on the progress of human metagenomics projects. As the speaker admitted, the whole field is a researcher’s gold mine: almost everything they find is new and interesting. There were a couple of interesting points, mainly concerning how limited our knowledge in this area is. For example, there was an unconfirmed feeling among microbiologists that all of modern microbiology is in fact nothing more than the biology of E. coli and its relatives. Now we know that for sure: the number of microbial species known to us is estimated at 0.5% of all existing microbial species. Also, I heard a nice story about a Polish doctor who in the 19th century described Helicobacter pylori and its role in gastric diseases (there was a Nobel Prize for that in 2005), wrote a book, and then trashed the whole thing because he couldn’t grow the bacteria in pure culture. Another important issue was the amount of data and the lack of new ways of handling it.

But the most interesting thing for me was the connection between the human microbiome and diseases. Or rather the possibility of such a connection. I am not aware of a single case where the composition of the human microbiome has been proven to influence the chance of getting ill, and I don’t think many such correlations will be found soon. My impression is that correlations will be found once we have both a complete human genome and a complete metagenome of everything that lives on a particular person: a human genobiome, as I’ve called it (BTW, the word “genobiome” is not present in Google; is there a better word for that?). And I believe that getting the first full human genobiome will be an achievement comparable to sequencing the human genome for the first time. Not because of technical difficulties, but because of all the discoveries that need to be made to make it happen. For example, the gut of every human carries a species performing some sulfur reaction, but its population is only up to a few thousand cells. How many such cases do we have in our bodies? That is a very good question. The field is brand new, and the possibilities for speculation are endless.

Comments Off

Posted by Pawel Szczesny on July 6, 2008 in bioinformatics, Research

Tags: biology, Helicobacter pylori, metagenomics, microbiology

PhD thesis in LaTeX

19 Jun

For the record: here you can see a single (still unfinished) page of my PhD thesis prepared in LaTeX. I used the PhD thesis style prepared by Jamie Stevens and wrote the whole thing in the Kile editor. An image can be placed in the margin with the following command:

% Requires \usepackage{graphicx} in the preamble.
\marginpar{%
     \centering  % a declaration, so it takes no braced argument
     \includegraphics[width=3cm]{image.pdf}
     Caption text
}

6 Comments

Posted by Pawel Szczesny on June 19, 2008 in bioinformatics, Comments

Tags: Dissertation, LaTeX, TeX, Typesetting

Surprises in biological databases – nr

25 May

If you wonder why clustering a recent nr database from NCBI with cd-hit takes ages, here’s an answer:

>gi|10955428|ref|NP_053140.1| hypothetical protein pB171_078 [Escherichia coli]gi|16082681|ref|NP_395228.1|
 transposase/IS protein [Yersinia pestis CO92]gi|16082847|ref|NP_395401.1| transposase/IS protein [Yersinia
 pestis CO92]gi|16120383|ref|NP_403696.1| transposase/IS protein [Yersinia pestis CO92]gi|16120444|ref|NP_4
03757.1| transposase/IS protein [Yersinia pestis CO92]gi|16120514|ref|NP_403827.1| transposase/IS protein [
Yersinia pestis CO92]gi|16120586|ref|NP_403899.1| transposase/IS protein [Yersinia pestis CO92]gi|16120719|
ref|NP_404032.1| transposase/IS protein [Yersinia pestis CO92]gi|16120857|ref|NP_404170.1| transposase/IS p
rotein [Yersinia pestis CO92]gi|16120894|ref|NP_404207.1| transposase/IS protein [Yersinia pestis CO92]gi|1
6120962|ref|NP_404275.1| transposase/IS protein [Yersinia pestis CO92]gi|16121092|ref|NP_404405.1| transpos
ase/IS protein [Yersinia pestis CO92]gi|16121136|ref|NP_404449.1| transposase/IS protein [Yersinia pestis C
O92]gi|16121228|ref|NP_404541.1| transposase/IS protein [Yersinia pestis CO92]gi|16121314|ref|NP_404627.1|
transposase/IS protein [Yersinia pestis CO92]gi|16121385|ref|NP_404698.1| transposase/IS protein [Yersinia
pestis CO92]gi|16121430|ref|NP_404743.1| transposase/IS protein [Yersinia pestis CO92]gi|16121620|ref|NP_40
4933.1| transposase/IS protein [Yersinia pestis CO92]gi|16121706|ref|NP_405019.1| transposase/IS protein [Y
ersinia pestis CO92]gi|16121792|ref|NP_405105.1| transposase/IS protein [Yersinia pestis CO92]gi|16121890|r
ef|NP_405203.1| transposase/IS protein [Yersinia pestis CO92]gi|16121951|ref|NP_405264.1| transposase/IS pr
otein [Yersinia pestis CO92]gi|16121988|ref|NP_405301.1| transposase/IS protein [Yersinia pestis CO92]gi|16
122008|ref|NP_405321.1| transposase/IS protein [Yersinia pestis CO92]gi|16122148|ref|NP_405461.1| transposa
se/IS protein [Yersinia pestis CO92]gi|16122266|ref|NP_405579.1| transposase/IS protein [Yersinia pestis CO
92]gi|16122324|ref|NP_405637.1| transposase/IS protein [Yersinia pestis CO92]gi|16122408|ref|NP_405721.1| t
ransposase/IS protein [Yersinia pestis CO92]gi|16122588|ref|NP_405901.1| transposase/IS protein [Yersinia p
estis CO92]gi|16122620|ref|NP_405933.1| transposase/IS protein [Yersinia pestis CO92]gi|16122738|ref|NP_406
051.1| transposase/IS protein [Yersinia pestis CO92]gi|16122852|ref|NP_406165.1| transposase/IS protein [Ye
rsinia pestis CO92]gi|16122926|ref|NP_406239.1| transposase/IS protein [Yersinia pestis CO92]gi|16123007|re
f|NP_406320.1| transposase/IS protein [Yersinia pestis CO92]gi|16123118|ref|NP_406431.1| transposase/IS pro
tein [Yersinia pestis CO92]gi|16123368|ref|NP_406681.1| transposase/IS protein [Yersinia pestis CO92]gi|161
23410|ref|NP_406723.1| transposase/IS protein [Yersinia pestis CO92]gi|16123439|ref|NP_406752.1| transposas
e/IS protein [Yersinia pestis CO92]gi|16123584|ref|NP_406897.1| transposase/IS protein [Yersinia pestis CO9
2]gi|16123688|ref|NP_407001.1| transposase/IS protein [Yersinia pestis CO92]gi|16123734|ref|NP_407047.1| tr
ansposase/IS protein [Yersinia pestis CO92]gi|16123839|ref|NP_407152.1| transposase/IS protein [Yersinia pe
stis CO92]gi|16123892|ref|NP_407205.1| transposase/IS protein [Yersinia pestis CO92]gi|16123908|ref|NP_4072
21.1| transposase/IS protein [Yersinia pestis CO92]gi|16124133|ref|NP_407446.1| transposase/IS protein [Yer
sinia pestis CO92]gi|22123963|ref|NP_667386.1| transposase/IS protein [Yersinia pestis KIM]gi|22124031|ref|
NP_667454.1| transposase/IS protein [Yersinia pestis KIM]gi|22124203|ref|NP_667626.1| transposase/IS protei
n [Yersinia pestis KIM]gi|22124372|ref|NP_667795.1| transposase/IS protein [Yersinia pestis KIM]gi|22124391
|ref|NP_667814.1| transposase/IS protein [Yersinia pestis KIM]gi|22124420|ref|NP_667843.1| transposase/IS p
rotein [Yersinia pestis KIM]gi|22124556|ref|NP_667979.1| transposase/IS protein [Yersinia pestis KIM]gi|221
24665|ref|NP_668088.1| transposase/IS protein [Yersinia pestis KIM]gi|22124814|ref|NP_668237.1| transposase
/IS protein [Yersinia pestis KIM]gi|22124844|ref|NP_668267.1| transposase/IS protein [Yersinia pestis KIM]g
i|22124913|ref|NP_668336.1| transposase/IS protein [Yersinia pestis KIM]gi|22125025|ref|NP_668448.1| transp
osase/IS protein [Yersinia pestis KIM]gi|22125118|ref|NP_668541.1| transposase/IS protein [Yersinia pestis
KIM]gi|22125219|ref|NP_668642.1| transposase/IS protein [Yersinia pestis KIM]gi|22125447|ref|NP_668870.1| t
ransposase/IS protein [Yersinia pestis KIM]gi|22125565|ref|NP_668988.1| transposase/IS protein [Yersinia pe
stis KIM]gi|22125833|ref|NP_669256.1| transposase/IS protein [Yersinia pestis KIM]gi|22125913|ref|NP_669336
.1| transposase/IS protein [Yersinia pestis KIM]gi|22126032|ref|NP_669455.1| transposase/IS protein [Yersin
ia pestis KIM]gi|22126111|ref|NP_669534.1| transposase/IS protein [Yersinia pestis KIM]gi|22126227|ref|NP_6
69650.1| transposase/IS protein [Yersinia pestis KIM]gi|22126294|ref|NP_669717.1| transposase/IS protein [Y
ersinia pestis KIM]gi|22126458|ref|NP_669881.1| transposase/IS protein [Yersinia pestis KIM]gi|22126621|ref
|NP_670044.1| transposase/IS protein [Yersinia pestis KIM]gi|22126672|ref|NP_670095.1| transposase/IS prote
in [Yersinia pestis KIM]gi|22126967|ref|NP_670390.1| transposase/IS protein [Yersinia pestis KIM]gi|2212702
6|ref|NP_670449.1| transposase/IS protein [Yersinia pestis KIM]gi|22127088|ref|NP_670511.1| transposase/IS
protein [Yersinia pestis KIM]gi|22127284|ref|NP_670707.1| transposase/IS protein [Yersinia pestis KIM]gi|22
127489|ref|NP_670912.1| transposase/IS protein [Yersinia pestis KIM]gi|22127607|ref|NP_671030.1| transposas
e/IS protein [Yersinia pestis KIM]gi|22127670|ref|NP_671093.1| transposase/IS protein [Yersinia pestis KIM]
gi|22127690|ref|NP_671113.1| transposase/IS protein [Yersinia pestis KIM]gi|22127900|ref|NP_671323.1| trans
posase/IS protein [Yersinia pestis KIM]gi|31795384|ref|NP_857837.1| transposase/IS protein [Yersinia pestis
 KIM]gi|31795462|ref|NP_857912.1| transposase/IS protein [Yersinia pestis KIM]gi|32470047|ref|NP_862989.1|
putative ATP-binding protein [Escherichia coli]gi|45439896|ref|NP_991435.1| transposase/IS protein [Yersini
a pestis biovar Microtus str. 91001]gi|45439948|ref|NP_991487.1| transposase/IS protein [Yersinia pestis bi
ovar Microtus str. 91001]gi|45440109|ref|NP_991648.1| transposase/IS protein [Yersinia pestis biovar Microt
us str. 91001]gi|45440257|ref|NP_991796.1| transposase/IS protein [Yersinia pestis biovar Microtus str. 910
01]gi|45440297|ref|NP_991836.1| transposase/IS protein [Yersinia pestis biovar Microtus str. 91001]gi|45440
401|ref|NP_991940.1| transposase/IS protein [Yersinia pestis biovar Microtus str. 91001]

But to tell the honest truth, this is not the problem: it is less than 10% of only one of many other problems. This particular protein (gi number: 10955428) has over three hundred other gi numbers in its header in the non-redundant database from NCBI, which apparently made cd-hit stand still for weeks in amusement at such a lengthy description. A quick fix in Perl, and now the clustering is going to finish within a few hours, as it should.
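The Perl fix itself isn’t shown here, so what follows is only a sketch, in Python, of one way such a fix could look. It assumes that keeping the first whitespace-delimited token of each header (the first gi entry) is enough for clustering. The function names are made up, and the sequence in the demo is a placeholder, not the real protein.

```python
import io


def shorten_header(line: str) -> str:
    """Keep only the first whitespace-delimited token of a FASTA header line."""
    if not line.startswith(">"):
        return line  # sequence lines pass through unchanged
    parts = line[1:].split(None, 1)
    first = parts[0] if parts else ""
    return ">" + first + "\n"


def shorten_fasta_headers(src, dst) -> int:
    """Copy FASTA text from src to dst with shortened headers; return the header count."""
    count = 0
    for line in src:
        if line.startswith(">"):
            count += 1
            line = shorten_header(line)
        dst.write(line)
    return count


if __name__ == "__main__":
    # Placeholder record: real header start, made-up three-residue sequence.
    demo = io.StringIO(
        ">gi|10955428|ref|NP_053140.1| hypothetical protein pB171_078 [Escherichia coli]\n"
        "MKT\n"
    )
    out = io.StringIO()
    shorten_fasta_headers(demo, out)
    print(out.getvalue(), end="")
```

Both functions accept any file-like objects, so the same loop can stream a multi-gigabyte nr file line by line without loading it into memory. The obvious cost is that all the merged descriptions are thrown away, so keep the original file if you need to map cluster members back to their organisms.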

2 Comments

Posted by Pawel Szczesny on May 25, 2008 in bioinformatics

Tags: bioinformatics, Online Services

Joining ONS club – classification and prediction of bacteriocins

03 May

It’s finally time to jump into the Open Notebook Science pool with my small project: classification and prediction of bacteriocins. The main page of this project is on the Freelancing Science wiki: freelancingscience.wikispaces.com/bacteriocins. After reading a recent post by Michael Barton on ONS, I’ve decided to stick to the wiki only. I had already set up another blog for this project, but if a blog doesn’t work very well for Michael, I doubt it will work for me. Since it’s completely a side project, updates on the project blog would be embarrassingly rare. So far the wiki doesn’t contain much data, nothing more than a plan in fact. But I think it’s important to at least start somewhere.
The direct inspiration for the project was this post at Microbiology Blog. It describes the results of some experiments on growth inhibition of bacteria by haloarchaeal organisms, which in some cases could be explained by novel archaeocins, peptide or protein antibiotics from Archaea. After a quick look I realised that I could see sequence similarity between seemingly unrelated bacteriocins. That of course led to the question of whether I am able to repeat the procedure from my PhD project: understand the protein family, and then write an annotation/prediction tool. I don’t expect outstanding results, but at least this will be a good occasion to document my approach to protein sequence annotation. So if not scientific, it should at least have a little educational value.

4 Comments

Posted by Pawel Szczesny on May 3, 2008 in bioinformatics, Research

Tags: bioinformatics, microbiology, ONS, open notebook science, wikispaces