Visualization of internal repeats in proteins (or DNA)

24 Jan

A number of protein families have internal repeats (TPR, Armadillo, ankyrin, etc.). I’m very interested in many of them, for reasons I will explain in another post. Assessing the arrangement of these repeats is straightforward in the majority of cases: most of them tend to occur next to each other, with few or no insertions between them (finding them in the first place is a completely different story). However, there are proteins where internal repeats are separated by other domains or repeats, which can result in a real mess (or, in scientific language, a mosaic-like architecture). A couple of months ago, when I was looking for a visualization method that would give me a quick overview of the internal structure of such proteins, I stumbled across The Shape of Song, a visualization method developed by Martin Wattenberg, a researcher at IBM. It fitted my requirements, so I implemented it with some help from Processing (and later added it to a protein analysis server that has a chance of being published next month). The resulting visualization is below:

Internal repeats in a protein

Repeats are colored according to repeat type and are connected according to repeat family. If you think about it in terms of the SCOP (Structural Classification of Proteins) hierarchy, colors represent class, while arcs connect superfamilies. The longer and more complicated the analysed sequence is, the more useful this approach seems to be; for short proteins, typical domain bubbles would work better.
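
For readers who want to try the idea without Processing, here is a minimal sketch of the same kind of arc diagram in plain Python, written out as an SVG string. It is not the script behind the figure above. The `Repeat` record, the `arc_pairs` and `render_svg` helpers, the palette and the example coordinates are all made up for illustration, and linking each repeat to the next one of the same family (rather than to every other member) is an assumption borrowed from how The Shape of Song draws its arcs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Repeat:
    start: int    # first residue of the repeat (1-based, inclusive)
    end: int      # last residue of the repeat (inclusive)
    rtype: str    # repeat type, decides the fill colour
    family: str   # repeat family, decides which repeats get linked by arcs


# Hypothetical palette; any list of colours would do.
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd"]

# Invented coordinates, only there to have something to draw.
EXAMPLE = [
    Repeat(10, 43, "TPR", "fam1"),
    Repeat(50, 83, "TPR", "fam1"),
    Repeat(120, 152, "ANK", "fam2"),
    Repeat(200, 233, "TPR", "fam1"),
    Repeat(260, 292, "ANK", "fam2"),
]


def arc_pairs(repeats):
    """Pair each repeat with the next repeat of the same family along the sequence."""
    last_seen = {}
    pairs = []
    for rep in sorted(repeats, key=lambda r: r.start):
        if rep.family in last_seen:
            pairs.append((last_seen[rep.family], rep))
        last_seen[rep.family] = rep
    return pairs


def render_svg(repeats, seq_len, width=800):
    """Return an SVG document: coloured boxes on a baseline, semicircles above it."""
    scale = width / seq_len
    height = width // 2 + 40          # tall enough for the widest possible semicircle
    baseline = height - 20
    types = sorted({r.rtype for r in repeats})
    colour = {t: PALETTE[i % len(PALETTE)] for i, t in enumerate(types)}

    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    parts.append(f'<line x1="0" y1="{baseline}" x2="{width}" y2="{baseline}" stroke="#999"/>')

    # One semicircular arc per linked pair, drawn between the repeat midpoints.
    for a, b in arc_pairs(repeats):
        x1 = (a.start + a.end) / 2 * scale
        x2 = (b.start + b.end) / 2 * scale
        radius = (x2 - x1) / 2
        parts.append(
            f'<path d="M {x1:.1f} {baseline} A {radius:.1f} {radius:.1f} 0 0 1 {x2:.1f} {baseline}" '
            f'fill="none" stroke="#888" stroke-opacity="0.6"/>'
        )

    # One box per repeat, coloured by repeat type.
    for r in repeats:
        x = (r.start - 1) * scale
        w = (r.end - r.start + 1) * scale
        parts.append(
            f'<rect x="{x:.1f}" y="{baseline - 5}" width="{w:.1f}" height="10" fill="{colour[r.rtype]}"/>'
        )

    parts.append("</svg>")
    return "\n".join(parts)
```

Calling `render_svg(EXAMPLE, seq_len=300)` returns the markup as a string, which can be saved to a file and opened in a browser. Because no window is involved, the same function could also run on a server.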

People who are into genomic sequences may notice the similarity of this approach to Circos, developed by Martin Krzywinski (whose work I really admire, especially on HDTR). The idea behind both is pretty much the same, but I never thought about straightening that circle until I saw The Shape of Song. My thinking is sometimes dramatically schematic…



8 responses to “Visualization of internal repeats in proteins (or DNA)”

  1. ignasi

    January 24, 2008 at 17:12

    Simply beautiful. Processing is an amazing language.
    By the way, do you recommend it for nice “bioinformatics outputs”? I’ve got to do some work where the graphical output is an important part… and I’m strongly considering using Processing, although I’m afraid of its learning curve. I know nothing about Java, but I’m getting into OO programming with Python right now…
    Any advice? Thanks a lot.

  2. freesci

    January 24, 2008 at 17:26

    I hardly know Java at all, but I can still code something in Processing (it’s pretty easy to learn if you know at least one programming language), so the learning curve may not be an issue. More importantly, with Processing you cannot create graphics without popping up a window. In other words, it’s unusable if you plan to do things through a web server or a remotely executed script (you may include it as an applet in the web page, which is what I did, but keep that in mind). Other than that, it’s probably the best choice currently for preparing sophisticated output in a reasonable amount of coding time (the Processing script that generates the graphics above is around 100 lines, including defining variables, reading/parsing the input file, setting the sizes, the legend, etc.).

    For Python I would recommend PIL (Python Imaging Library) and matplotlib. However, if you could say a little more about what you are aiming for, I could probably come up with different recommendations.

  3. ignasi

    January 26, 2008 at 14:56

    Well, I have to represent some protein sequences, and subsequences of them of at least 5 different types, after some analyses and predictions.
    I’ve been looking at graphical tools in Perl (my scripts are actually in Perl). I’ve found the GD library and some modules in BioPerl which are more or less what I want. But I’d like it to be interactive as well, so that when passing the pointer over the graphic, for instance, it could display some values or the actual bit of sequence or whatever… Not just plain pictures, although those should also be available for larger analyses rather than single-sequence runs. There is also the question of how to represent many sequences and their information… but I guess this is an information visualization problem of an “omics” approach.
    And your point that on a website it has to work as an applet is important, since the goal of this work is to build a web service… But if the output is displayed as a picture, can’t it also be generated on the server and then sent back?

  4. Paulo Nuin

    January 30, 2008 at 21:02

    Hi Paweł

    Would you distribute the script that generates the visualization? If so, drop me a note.


  5. freesci

    January 30, 2008 at 21:09

    Hi Paulo, the script is very crude, so I don’t plan to make it freely available, although I may send it upon request.

  6. wrightfisher

    February 1, 2008 at 18:27

    This visualization reminds me of the RNA structural diagrams which indicate base pairing partnerships. Excellent examples appear in the published writings of E. Rivas and S.R. Eddy.
