Manual sequence analysis – some common mistakes

25 Sep

This is a topic I will probably come back to on many occasions. Publications with badly wrong sequence analysis, like the one Stephen Spiro pointed out on his blog, are not an exception. I agree that large-scale analysis can tolerate quick and dirty treatment of protein sequences (and some error propagation along with it). In large-scale analysis nobody cares whether the domain assignment is 100% right (it isn’t), whether there are false positives (there are), or even whether the starting material (protein sequences, for example) is free of errors (it is not), as long as the overall quality of the work is acceptable. However, this optimistic approach cannot be applied to manual protein sequence analysis: there, each error carries far more weight. How can some of these errors be avoided? A few common mistakes that come to mind are:

missing, inaccurate, or quick and dirty domain annotation: this is probably a topic for a separate post, but in short, relying on a single method or a strict E-value cutoff, excluding overlaps, ignoring internal repeats, and forgetting about structural elements such as transmembrane helices all lead to mistakes in domain annotation
running PSI-BLAST searches on unclustered databases: for many query sequences the profile will become biased and diverge in a random direction if PSI-BLAST runs on an unclustered database (remember 500 copies of the same protein in the results?); after all these years I still don’t get why NCBI does not provide nr90 (the non-redundant database clustered at a 90% identity threshold) for PSI-BLAST
running PSI-BLAST without looking at the results of each iteration: if you don’t assess which hits go into the profile, you risk letting in some garbage
masking low-complexity, coiled-coil and transmembrane regions in every single BLAST search: while most of the time this is a valid approach, there are cases where the answer is revealed only after turning the masking off
skipping other sequence analysis tools, such as predictors of signal sequences, motifs and functional sites
skipping analysis of the genomic context: while not applicable to all systems, analysis of the genomic context may dramatically influence function prediction
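
To make the point about unclustered databases a bit more concrete, here is a toy sketch of greedy redundancy filtering at a 90% threshold. It is only an illustration under loud assumptions: the sequences and IDs are invented, the function names are hypothetical, and `difflib.SequenceMatcher` is a crude stand-in for real alignment-based percent identity. For actual work you would use a dedicated clustering tool rather than anything like this.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def crude_identity(a: str, b: str) -> float:
    """Rough similarity in [0, 1]; NOT a real alignment-based percent identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_representatives(seqs: Dict[str, str], threshold: float = 0.9) -> List[str]:
    """Keep a sequence only if it is below `threshold` similarity to every
    representative chosen so far (longest sequences are considered first)."""
    reps: List[str] = []
    for seq_id in sorted(seqs, key=lambda k: len(seqs[k]), reverse=True):
        if all(crude_identity(seqs[seq_id], seqs[r]) < threshold for r in reps):
            reps.append(seq_id)
    return reps


# Invented toy data: three near-identical copies plus one unrelated sequence.
toy = {
    "copy_1": "MKTAYIAKQRQISFVKSHFSRQ",
    "copy_2": "MKTAYIAKQRQISFVKSHFSRQ",
    "copy_3": "MKTAYIAKQRQLSFVKSHFSRQ",  # one substitution relative to copy_1
    "other": "GGSPWLLDNCEVTHHAAGPWLE",
}

representatives = greedy_representatives(toy)
print(representatives)
```

In this toy set the three near-identical copies collapse into a single representative, so the unrelated sequence is no longer outvoted three to one. That is roughly the effect one hopes for when building a PSI-BLAST profile from a clustered database instead of a redundant one.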

That’s all I could think of so far. Do you have any other suggestions? Let me know.