Category Archives: Data mining

Database query and ranked results

Some time ago I read a piece by Marcelo Calbucci: Is it a database or a search engine? While it deals with searching for information in a real estate database, I think his comments apply to many areas of the life sciences.

In short, Marcelo points out that people looking for a house miss a lot of interesting entries because the query is inflexible: number of bedrooms, price, distance from some point are all fixed values. Users, however, are flexible, and in such cases they need something closer to a search engine, one that gives them a close-enough answer or lets them assign a weight to each filter.

In the life sciences we search for similarities and analogies all the time. Sometimes it's a direct comparison of sequences, on other occasions it's a high-level meta-comparison between two systems. And while we have various (statistical) similarity metrics, and they sometimes become part of a database design, the interfaces of biological databases don't allow ranking query results according to these metrics. For example, I can easily find all human proteins related to disease X or disease Y or disease Z, but I cannot specify that I want proteins related to both Z AND Y first on the list. Another example would be searching PubMed: I can look for articles related to “synthetic biology”, but I have no way to specify that I want papers by James Collins from HHMI, AND articles related to these papers, to be first on the list. I guess it is possible to obtain such results without going through the whole list, but I doubt the method would be very simple. Filtering still seems to be a neglected aspect of database design in the life sciences.

My dream biological search engine would have a series of sliders (or ideally, a device with a series of mechanical knobs attached to the computer) that would let me dynamically change the weights of various aspects of the query and see immediately how that affects the results. It would resemble the interactivity of Gapminder World, but on dynamically generated data. The technology and proof of concept seem to be there, but I guess we will need to wait quite a few years before this approach is adopted within the life sciences.
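To make the idea a bit more concrete, here is a minimal sketch of weighted ranking, where each weight plays the role of one slider. Everything in it is an assumption made for illustration: the protein names, the disease associations, the `rank_proteins` function and the simple "sum of matched weights" score are all made up, and no real database works this way as far as I know.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy data: protein name -> diseases it is associated with.
PROTEINS: Dict[str, Set[str]] = {
    "P1": {"X"},
    "P2": {"Y", "Z"},
    "P3": {"Z"},
    "P4": {"X", "Y", "Z"},
    "P5": {"W"},
}


def rank_proteins(
    records: Dict[str, Set[str]], weights: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Return (protein, score) pairs for records matching any weighted term.

    The score is the sum of the weights of all matched terms, so records
    matching several highly weighted terms come first. Ties are broken
    alphabetically by name to keep the order stable.
    """
    scored = []
    for name, diseases in records.items():
        matched = weights.keys() & diseases
        if matched:  # behaves like an OR query: one match is enough
            scored.append((name, sum(weights[d] for d in matched)))
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored


# "X or Y or Z, but Z AND Y first": give Y and Z a higher weight than X.
print(rank_proteins(PROTEINS, {"X": 1, "Y": 2, "Z": 2}))

# Moving the "slider" for X up changes the order without changing the hits.
print(rank_proteins(PROTEINS, {"X": 5, "Y": 1, "Z": 1}))
```

The set of hits is the same as for a plain OR query; only the order depends on the weights, which is the part a slider interface would recompute on every change.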

Posted by Pawel Szczesny on January 22, 2009 in bioinformatics, Data mining, Software

Tags: bioinformatics, Database, PubMed, Search, Web search engine

Mining PubMed – more tools available

There are two new tools available that semantically mine PubMed abstracts: e-LiSe and Anne O’Tate. The first was made by my colleagues from the Institute of Biochemistry and Biophysics in Warsaw, while the second is from the University of Illinois, Chicago. Female-sounding names are not the only thing that makes them look similar; they both provide analogous functionality, such as listing keywords or author names associated with the user's query.

There are quite a lot of third-party interfaces to PubMed (see David Rothman’s excellent list), so I couldn’t resist running a few queries on both servers and comparing them to GoPubmed, which currently wins hands down in terms of features and interface. I wasn’t surprised to see that the results overlap significantly, although not completely. Each of the servers found valuable keywords the other two did not have, which is understandable, since they use different algorithms. I wonder if we will see a meta-server of PubMed data-miners, like the ones that exist for protein structure prediction (for example meta.bioinfo.pl). In theory, an exhaustive search for meaningful keywords by different methods, followed by their classification and analysis, should work better than any single method, but this is just a guess.
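As a rough illustration of what the simplest possible meta-server could do, the sketch below merges keyword lists from several miners and ranks each keyword by how many tools reported it. The tool names, the keyword lists and the `consensus_keywords` function are all hypothetical; real tools return richer output, and a real meta-server would need something smarter than a vote count.

```python
from collections import Counter
from typing import Dict, List, Tuple


def consensus_keywords(
    results_by_tool: Dict[str, List[str]]
) -> List[Tuple[str, int]]:
    """Rank keywords by the number of tools that reported them.

    Keywords are lower-cased and counted once per tool, so a tool that
    repeats a keyword does not get extra votes. Ties are broken
    alphabetically.
    """
    votes: Counter = Counter()
    for keywords in results_by_tool.values():
        votes.update({k.lower() for k in keywords})
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up output of three hypothetical miners for the same query.
results = {
    "tool_a": ["Apoptosis", "p53", "kinase"],
    "tool_b": ["p53", "apoptosis", "chromatin"],
    "tool_c": ["p53", "autophagy"],
}

for keyword, n_tools in consensus_keywords(results):
    print(f"{keyword}: reported by {n_tools} tool(s)")
```

Keywords found by only one tool end up at the bottom of the list rather than being dropped, since those are exactly the ones a single miner would have missed.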

Posted by Pawel Szczesny on March 5, 2008 in bioinformatics, Data mining, PubMed

Tags: bioinformatics, Data mining, literature search, PubMed