Database query and ranked results

22 Jan

Some time ago I read a piece by Marcelo Calbucci: Is it a database or a search engine?. While it deals with searching for information within a real estate database, I think his comments are applicable to many areas of the life sciences.

In short, Marcelo points out that people miss a lot of interesting entries while looking for a house because the query is inflexible: the number of bedrooms, the price and the distance from some point are all fixed. Users, however, are flexible, and in such a case they need a search engine that gives them close-enough answers or lets them assign a weight to each filter.
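To make the difference concrete, here is a minimal Python sketch of a hard filter next to a weighted "close enough" ranking. The listings, field names, tolerances and weights are all invented for illustration; none of them come from Marcelo's post, and the linear closeness function is just one simple choice among many.

```python
# Hypothetical listings; every value here is made up for illustration.
LISTINGS = [
    {"id": "A", "bedrooms": 3, "price": 300_000, "distance_km": 5.0},
    {"id": "B", "bedrooms": 3, "price": 310_000, "distance_km": 1.0},
    {"id": "C", "bedrooms": 2, "price": 200_000, "distance_km": 12.0},
]

def hard_filter(listings, bedrooms, max_price, max_distance_km):
    """Database-style query: an entry either matches every condition or is dropped."""
    return [
        item for item in listings
        if item["bedrooms"] == bedrooms
        and item["price"] <= max_price
        and item["distance_km"] <= max_distance_km
    ]

def closeness(value, target, tolerance):
    """1.0 for an exact match, falling linearly to 0.0 at `tolerance` away.

    Symmetric on purpose: being cheaper than the target is penalised too,
    which is a simplification of this sketch.
    """
    return max(0.0, 1.0 - abs(value - target) / tolerance)

def soft_score(item, targets, tolerances, weights):
    """Weighted average of per-field closeness, between 0.0 and 1.0."""
    total_weight = sum(weights.values())
    weighted = sum(
        weights[field] * closeness(item[field], targets[field], tolerances[field])
        for field in targets
    )
    return weighted / total_weight

def soft_rank(listings, targets, tolerances, weights):
    """Search-engine-style query: keep everything, best match first."""
    return sorted(
        listings,
        key=lambda item: soft_score(item, targets, tolerances, weights),
        reverse=True,
    )

# Assumed user preferences; distance counts twice as much as the other fields.
targets = {"bedrooms": 3, "price": 300_000, "distance_km": 2.0}
tolerances = {"bedrooms": 2, "price": 100_000, "distance_km": 10.0}
weights = {"bedrooms": 1.0, "price": 1.0, "distance_km": 2.0}

strict = hard_filter(LISTINGS, bedrooms=3, max_price=300_000, max_distance_km=5.0)
ranked = soft_rank(LISTINGS, targets, tolerances, weights)
```

With these made-up numbers the hard filter returns only listing A, while the weighted ranking puts B first: it is slightly over budget but much closer to the target location, which is exactly the kind of entry a strict query hides.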

In the life sciences we search for similarities and analogies all the time. Sometimes it is a direct comparison of sequences; on other occasions it is a high-level meta-comparison between two systems. And while we have various (statistical) metrics of similarity, and they sometimes become part of a database design, the interfaces of biological databases don’t allow query results to be ranked according to these metrics. For example, I can easily find all human proteins related to disease X or disease Y or disease Z, but I cannot specify that I want proteins related to both Z AND Y first on the list. Another example is searching PubMed: I can look for articles related to “synthetic biology”, but I have no way to specify that I want papers by James Collins from HHMI, AND articles related to these papers, to be first on the list. I guess it is possible to obtain such results without going through the whole list, but I doubt the method would be very simple. Filtering still seems to be a neglected aspect of database design in the life sciences.
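The protein example can be sketched in a few lines of Python. The protein identifiers and disease annotations below are invented, and `ranked_or_query` is a hypothetical helper, not the interface of any existing biological database. It retrieves everything matching X OR Y OR Z, then orders the hits so that proteins matching all of the preferred diseases come first.

```python
# Hypothetical annotations: protein -> set of associated diseases (made-up data).
ANNOTATIONS = {
    "P1": {"X"},
    "P2": {"Y", "Z"},
    "P3": {"Z"},
    "P4": {"X", "Y", "Z"},
    "P5": {"W"},
}

def ranked_or_query(annotations, any_of, prefer_all):
    """Return proteins matching ANY disease in `any_of`, ranked so that
    proteins matching ALL diseases in `prefer_all` are listed first."""
    any_of = set(any_of)
    prefer_all = set(prefer_all)
    # Boolean OR retrieval: keep a protein if it shares at least one disease.
    hits = [p for p, diseases in annotations.items() if diseases & any_of]

    def sort_key(protein):
        diseases = annotations[protein]
        matches_all_preferred = prefer_all <= diseases
        # False sorts before True, so negate the flag to put full matches first;
        # then more matched diseases first; then the name, for a stable order.
        return (not matches_all_preferred, -len(diseases & any_of), protein)

    return sorted(hits, key=sort_key)

result = ranked_or_query(ANNOTATIONS, any_of={"X", "Y", "Z"}, prefer_all={"Z", "Y"})
```

Here `result` lists P4 and P2 (both related to Z and Y) ahead of P1 and P3, and P5 is not returned at all. Nothing is filtered out compared with the plain OR query; only the order changes.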

My dream biological search engine would have a series of sliders (or, ideally, a device with a series of mechanical knobs attached to the computer) that would let me dynamically change the weights of various aspects of the query and immediately see how this affects the results. It would resemble the interactivity of Gapminder World, but on dynamically generated data. The technology and the proof of concept seem to be there, but I guess we will need to wait quite a few years before this approach is adopted within the life sciences.

Posted by Pawel Szczesny on January 22, 2009 in bioinformatics, Data mining, Software

Tags: bioinformatics, Database, PubMed, Search, Web search engine

4 responses to “Database query and ranked results”

Kay at Suicyte

January 22, 2009 at 23:12

I don’t quite see the connection between your query examples and your description of an ideal search engine. If I am not mistaken, all of your examples could be dealt with by SQL access to a relational version of PubMed. The business with fuzzy searches, term weights and sliders would be useful for solving different questions, such as finding papers that are more related to disease X than to disease Y.

I am generally not a friend of fuzzy searches and weighting; I prefer pure boolean searches that retrieve only those entries that fully match my search criteria. I must admit, however, that there are situations where a certain fuzziness comes in handy.

What I really hate is PubMed’s habit of searching for terms other than those I specified (similar-looking or similar-sounding ones), with the pathetic excuse that there are more hits. I guess that most people consider this a useful feature, but I don’t. If there are no valid matches to my query, I can stand the answer ‘no hits’. There is no need to search for the next best thing just to be able to report something.
Pawel Szczesny

January 23, 2009 at 00:06

Kay, thanks. I agree: my examples are about sorting, not fuzzy search. But I wrote them that way because I was afraid of having to explain what I mean by “abstract _more related_ to disease X”, which I’m not sure I could do. I just couldn’t come up with a fuzzy search example that wouldn’t, in my mind, raise the question “why would one want to do such a thing?”

On the other hand, the two approaches seem to me closely connected. One thing is filtering (and sorting options). SRS does it quite well, and that is probably about it among biological databases. The other thing is fuzziness (and weighting of filters). I believe both should be implemented and available as user-switchable options, not only to improve search but also (as you’ve pointed out) to avoid situations in which the server does more than one has asked it to do.
Mr. Gunn

January 28, 2009 at 03:36

You and I are looking for similar things, Pawel.

I think people are just starting to realize that, in this new paradigm of Big Data, it’s more likely that you’ll get too many results than not enough. Being able to comprehend the results through sorting, filtering, and visualizations that allow you to make sense of everything and how it’s related is becoming increasingly important.
John Woods

January 30, 2009 at 21:27

I’ve been thinking about similar things. How would Google Biology–or, rather, Bioogle, or Biologoogle, or something clever–look?

I’m not sure I have any answers yet. It’s a tough problem. But that’s why I’m in systems biology. =)

Greetings, by the way. I just found you through Google Reader.

John