Author Archives: Pawel Szczesny

Closing down Freelancing Science shop

It’s finally time to close down the Freelancing Science shop. I will be posting in a different place, under a more general domain name and on a self-hosted WordPress installation. Visit my new site over at

I’m moving because the existing form and scope of this blog have become more and more frustrating. I’m going to continue my experiments with different approaches to a scientific career, but that is not going to be the main topic of the new site. Additionally, I don’t want to suddenly spam people who subscribed to this blog, back when I was more interested in bioinformatics, with non-scientific topics.

The new site will explore a large number of different fields, such as systems science, photography, dynamic processes, biocomplexity and memetics, but also the topics covered here, such as Science 2.0, bioinformatics, structural biology and data visualization. If you aren’t interested in any of the new topics, you can subscribe only to selected notebooks (categories).

Within a month or so, commenting will be closed.


Posted on April 8, 2010 in Comments


Proposal for Science 2.0 lectures

I’ve just submitted a proposal for three lectures on different aspects of Science 2.0. The target audience is PhD students. Below you can find a brief overview. The details will probably change a bit once I start preparing the lectures (for example, I’m aware that Etherpad is on its way out), but nevertheless you are very welcome to comment and suggest a different approach.

Science 2.0 – practical aspects of the internet revolution

Part 1 – communication, collaboration, visibility

New communication channels (blogs, microblogs, aggregators, virtual conferences and poster sessions) and examples of their successful application in science. New roles of blogs; the Research Blogging initiative. Wikis, Etherpad and Google Documents/Wave as platforms for co-writing documents. Collaboration for programmers: Git. Visibility and recognition on the internet: StackOverflow and ResearcherID.

Part 2 – practical open science

The spectrum of openness in science. Community annotation of genes/proteins/structures and why these efforts aren’t so successful. Crowdsourcing and citizen science. An overview of open data repositories, focusing on open data coming from the pharma industry. Mechanisms of Open Access and Open Notebook Science. Current discussions on intellectual property – what’s not protected and what’s not licensable?

Part 3 – searching for information and literature management

Information overflow – myth or fact? Searching for information – differences between PubMed and Google Scholar. Semantic analysis of abstracts based on GoPubMed and NovoSeek. Targeted text-mining tools. Literature management: online (Connotea, CiteULike) and desktop (Zotero, Mendeley) approaches. Alternatives to EndNote. Automated or not – literature recommendations.


Posted on December 7, 2009 in Community



Complex systems and biology – introduction

What you can read here is a set of my loose notes on complex systems and biology. I want to learn about the topic as fast as I can, so if I’m wrong anywhere, please point it out to me. This post is an overview and an indication of the issues I’d like to cover.

Complex adaptive systems (CAS) are at the heart of many phenomena we observe every day, such as global trade, ecosystems, the human body, the immune system, the internet and even language. The complexity of a CAS is not equivalent to its amount of information; rather, it reflects the complex positive and negative interactions of its components. All CAS feature a common set of dualisms:

  • distinct/connected – CAS are built of a large number of agents that interact simultaneously and independently, yet together they become a tightly regulated system (other names: individual/system or distributed/collective)
  • robust/sensitive – CAS are fairly robust, yet at the same time quite sensitive to initial conditions and certain signals (see the butterfly effect); both behaviours are hard to predict
  • local/global – a protein is a CAS, a protein network is a CAS, a cell is a CAS, a tissue is a CAS, an organism is a CAS, a society is a CAS; the agents of a CAS can be CAS themselves
  • adaptive/evolving – a CAS is able to adapt as a system, and usually its agents are mutually adaptive as well; at the same time the CAS is evolving, and even if the local landscape favours simpler solutions (adaptation), CAS usually evolve toward greater complexity

These dualisms are in some sense as artificial as wave–particle duality. A complex system has all of these features at the same time – which of them is visible depends only on the design of the experiment. As a result, CAS share a common set of features: they are self-organizing, coherent, emergent and non-linear.

Probably the best representation of a CAS so far is a network, which has a number of important features: it is scale-free (the distribution of links in the network tends to follow a power law), clustered (“a friend of my friend is likely my friend too”) and small-world-like (the diameter of the network is small, aka “six degrees of separation”). Such a representation has been applied with great success to biological complex systems, such as metabolic networks or protein–protein interaction networks. However, please remember that it is only a representation, and it has often been argued that scale-free networks may not be the best approximation of natural networks (see for example this recent paper).
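To make the scale-free idea concrete, here is a toy sketch of preferential attachment, the classic growth rule behind scale-free networks: new nodes link preferentially to already well-connected nodes, producing a few heavily linked hubs and many sparsely linked nodes. This is my own illustration in plain Python, not code from the paper mentioned above:

```python
import random
from collections import Counter

random.seed(42)

def preferential_attachment(n, m):
    """Grow a network of n nodes; each new node adds m links,
    choosing targets with probability proportional to their degree."""
    edges = []
    pool = list(range(m))      # node appears in the pool once per link it has
    targets = list(range(m))   # the first new node links to the seed nodes
    for new_node in range(m, n):
        for t in set(targets):
            edges.append((new_node, t))
            pool.extend([new_node, t])
        # picking from the pool is implicitly degree-weighted
        targets = [random.choice(pool) for _ in range(m)]
    return edges

edges = preferential_attachment(n=2000, m=2)
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

# A few hubs dominate while the median node stays poorly connected.
degrees = sorted(degree.values())
print("max degree:", degrees[-1], "median degree:", degrees[len(degrees) // 2])
```

Note that this minimal model reproduces only the power-law tail; the clustering and small-world properties mentioned above need richer models (and for real analyses a library such as networkx is the usual choice).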

Scale-free or not, the network representation doesn’t address all the dualities mentioned above, especially the last two. Naturally emerging levels of organisation, and the relation between adaptation and evolution of complex systems, are rarely studied from a biological point of view, probably because we don’t have a clear idea how to reduce these phenomena to something measurable.

In the next posts, I will try to cover other CAS representations and computational approaches to CAS modeling.


Posted on December 4, 2009 in bioinformatics


Science 2.0 in Poland – getting popular, recognized as important

A few days ago I had a chance to speak about Science 2.0 at the Institute of Biochemistry and Biophysics of the Polish Academy of Sciences (the one I’m affiliated with). Compared to the seminar on the same topic I gave at the same place (but to a much smaller audience) 4 years ago, I had many more stories to tell, far more real-life examples and a better idea of where the whole “2.0” meme is leading us. I have also got better at speaking (4 years ago some of my colleagues literally slept through my seminar). So, the message got clearer, and the messenger had improved.

But given the wide interest in the topic from inside and outside the academic environment even before the seminar, I think two things have happened in Poland in the last 4 years. First, the internet got recognized as a game-changing technology, and people are simply interested in any new way they can use this tool (yes, I know it’s 2009 – if you live on the net it’s hard to realize how slow the adoption rate is outside of virtual worlds). Second, the internet as a tool is also recognized as important – for example, people have had ideas to include Science 2.0 topics in the programme of PhD studies (I will follow up on this topic in a week or two). Getting popular, recognized as important… Now wide adoption is all we need :).


Posted on November 28, 2009 in bioinformatics


Notes from Next Generation Sequencing Workshop in Rome

I was in Rome for two days attending the Next Generation Sequencing Workshop organized by EMBRACE (an EU FP6 Network of Excellence), UPPMAX and CASPUR with the support of the Italian Society of Bioinformatics. It was a pretty interesting event and I want to share a couple of interesting things I learned there.

Hardware layer

The first day was devoted mainly to the hardware side of NGS. It started with a presentation from Tony Cox of the Sanger Institute, who described the hardware setup used to support their sequencing projects. At 400 gigabases a week (their current output), the Sanger IT infrastructure is stretched in every direction (capacity, availability, redundancy), and Tony pointed out that every sequencing laboratory is going to face similar issues sooner or later. His advice for such labs was to first estimate the number of bases produced and then use multipliers to assess the storage requirements for the project. A minor thing I noticed in his talk was exposing databases as a filesystem via a FUSE layer – I might use that approach in some projects too.
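A back-of-the-envelope version of that advice might look like the sketch below; the multiplier values are my own placeholder assumptions for illustration, not figures from the talk:

```python
def estimate_storage_tb(gigabases_per_week, weeks, multipliers):
    """Estimate storage needs (in TB) from raw sequence output.

    multipliers maps a data layer to the bytes kept per base.
    """
    total_bases = gigabases_per_week * 1e9 * weeks
    return {layer: total_bases * bytes_per_base / 1e12
            for layer, bytes_per_base in multipliers.items()}

# Hypothetical per-base footprints: sequence + quality, alignments, backup.
layers = {"sequence+quality": 2.0, "alignments": 1.5, "backup": 2.0}
estimate = estimate_storage_tb(gigabases_per_week=400, weeks=4, multipliers=layers)
for layer, tb in estimate.items():
    print(f"{layer}: {tb:.1f} TB")
```

At the Sanger-scale output of 400 gigabases a week, even these conservative toy multipliers put a single month of data in the multi-terabyte range per layer, which is why every layer of the infrastructure matters.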

George Magklaras from The Biotechnology Centre of Oslo described a number of approaches they took while implementing their infrastructure. He talked about FCoE (Fibre Channel over Ethernet) and pointed out that it’s cheaper and almost as efficient as Fibre Channel alone. At the Centre they use Lustre (Sanger does too), a high-performance networked file system, but they benchmark other solutions as well, because some situations/projects require transparent and efficient data encryption (mostly for medical data). Similarly to Tony, George pointed out that compartmentalization of data is necessary, as moving large numbers of files over the network creates an unnecessary bottleneck.

Another interesting talk was from Guy Cochrane of the EBI about the Sequence Read Archive. It was an overview of the project, but again with a few interesting tidbits that drew my attention. One of them was Aspera, a much faster (and at the same time secure) alternative to good old FTP. He also presented a data reduction strategy that, if I understood correctly, is not yet implemented at the SRA but might be some day in the future. The first step is deletion of intensity data – perfectly reasonable, but heavily opposed by a number of scientists. Then only the consensus is preserved, plus the second most frequent base (important for polymorphism studies). The minimum for long-term storage was proposed to consist only of sequence and quality data.
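As a rough illustration of that reduction step (my own toy sketch, not SRA code), keeping only the consensus plus the second most frequent base per alignment column could look like this:

```python
from collections import Counter

def reduce_column(bases):
    """Return (consensus base, second most frequent base) for one column."""
    ranked = Counter(bases).most_common(2)
    consensus = ranked[0][0]
    second = ranked[1][0] if len(ranked) > 1 else None
    return consensus, second

# Toy alignment: five reads over four positions.
reads = ["ACGT", "ACGA", "ACGT", "TCGT", "ACGT"]
reduced = [reduce_column(column) for column in zip(*reads)]
print(reduced)  # [('A', 'T'), ('C', None), ('G', None), ('T', 'A')]
```

Positions where all reads agree keep only the consensus, while a variant position like the first one also records the minority base – which is exactly the information a polymorphism study needs.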


The majority of the second day was devoted to software. It doesn’t make sense to list all the projects described – I will share only my general impression.

Despite the large number of scientists devoting their time to developing new tools for next-generation sequencing data, I think the software lags a little behind other technological advances in this area. With really large amounts of data, assembly becomes hard or impossible, mapping erroneous, and annotation too slow (the pilot study of the 1000 Genomes Project generated so much data that the computing farm was busy for a full 60 days – on a single CPU it would have taken 25 000 days). Software development for NGS differs dramatically from scientific software development in general and needs much, much better programmers than we usually are. For example, Desmond Higgins was praising open source software – his group found an extremely fast implementation of the UPGMA algorithm (much faster than their own), and they could speed up their tool (SeedMap) so much that it now runs on even the largest families of sequences in a reasonable time.


Another bottleneck is the data presentation layer – there are some attempts to make digging into the data easier, but getting a biologically meaningful overview is as hard as it was before. Other people pointed out that problem too (I wasn’t the only biologist there).

Need for stronger community

Probably the funniest part of the workshop was the discussion about creating an organized community of people working with next-generation sequencing technologies. It was funny in the sense that some consensus about the need for a community emerged quite fast; how to build it – that was another story. Obviously lots of participants were sure that if they build a site, people will come. Yeah, sure. 🙂 I suggested using a wiki in the first place, and additionally hiring a community manager if they really want to gather people from many different forums, sites, groups etc. Lots of people didn’t buy these ideas and suggested a more traditional approach, so, curious whether they were right, I’m going to follow the development of this community.

NGS = high tech

Probably the most important lesson was realizing that sequencing is a field with very high requirements for infrastructure and even higher requirements for skilled staff. Basically every element of the infrastructure can become a bottleneck, and if you want to avoid that, the cost of data maintenance and analysis very quickly exceeds the cost of producing the data. When I talked about this to people over the last year (I’m involved in some sequencing projects at the analysis/annotation step), they often felt I was overestimating the infrastructure needs. Now I have some specific numbers to back it up :).


Posted on November 21, 2009 in bioinformatics



Microstocks are for scientists too

Money is rarely discussed directly on science blogs, but science bloggers rarely say they don’t care. Quite a number of them run advertising or affiliate programs on their sites, trying to monetize the traffic they generate. And while I don’t know specific numbers, my estimate (some time ago I ran such programs on a photography blog which was way more popular than this one) is that in the majority of cases it buys them a coffee or two per week. This blog is hosted over at and the team forbids inserting your own scripts into the blog (occasional affiliate links seem to be fine, if you’re interested). Making money from Google ads wasn’t an option for me. But I have tried to earn money by sending images of molecules to microstock sites, and that seems to be more profitable than the previous strategy.

The inspiration for this post came from the fact that I recently logged into one of the sites and was quite surprised to see that, despite not having uploaded anything for almost two years, my images are still selling quite well. On the majority of microstock sites your gallery’s exposure is bigger if you upload new material on a regular basis. So the conclusion is that after two years there still aren’t many similar images of molecules to choose from.


Above you can see one of my attempts to create a nice picture of a hemoglobin molecule. That should give you an idea of what kind of images sell well: simple, clean, bright colours etc. A few other suggestions:

  • Pay attention to the license under which the software you use to generate images is distributed. For example, you cannot use VMD or Chimera (both have non-commercial licenses), while QuteMol (under the GPL) is fine.
  • Use automated submitters (available for all platforms) instead of relying on FTP – you just don’t want to manually annotate dozens of images on the web. The other route is to fill in the IPTC tags.
  • Submit to all the microstock sites that let you in, but start with the bigger ones (iStockPhoto, Dreamstime, Fotolia, Shutterstock etc.).
  • If you have time, experiment with graphics or 3D software. Additional modifications in GIMP or Blender occasionally produce interesting images.
  • If you live in a strange country, first check the regulations under which you can earn money via microstock. In Poland, for example, you need to start a company first (which, as my Polish readers can confirm, is a really painful process).



Posted on November 5, 2009 in Money

