Now that the autumn is here I have some new projects starting, and running for approximately this month (before jumping out into the real world and trying to get a real job :) ). Good thing is, it all builds upon previous work.
First thing is, I'll work with Egon Willighagen to hook up Bioclipse with Semantic MediaWiki, via the RDFIO extension, which I developed as part of GSoC this year, mentored by Denny Vrandecic. I'm excited about putting RDFIO into some real world usage!
Second thing is I'll work part time at the Bioclipse group to improve the ways user documentation for plugins is authored and published, enabling some automation of publishing content to the Bioclipse website as well as the wiki etc. This will hopefully make it a lot easier for end users to find their desired extra functionality, and make more users see the value of a research platform with an open and modular structure, as is Bioclipse!
The last administrative details of my thesis project are now finished, and the report is now available in final form, for download as PDF in on this page (no 14 in the list), or this direct link. (The title of the project was "SWI-Prolog as a Semantic Web Tool for semantic querying in Bioclipse: Integration and performance benchmarking").
I reported earlier that Jena/SPARQL outperformed Prolog for a lookup query with some numerical value comparison. It later on turned out that the results were flawed and finally that Prolog indeed was the fastest as soon as turning to datasets with more than a few hundred peaks.
The Prolog program I was using was rather complicated with recursive operations on double lists etc. Then, some week ago, I tried, in order to highlight differences in expressivity between Prolog and SPARQL, to implement a Prolog query that mimicked the structure of the SPARQL query I used, as close as possible. Interestingly it turned out that this Prolog query can be optimized to become blazing fast by reversing the order of shift values to search for, so that the largest values are searched for first. With this optimization the query outperforms both the SPARQL and the earlier used prolog code. See figure 1 below for results (The new prolog query is named "SWI-Prolog Minimal"). It appears that the querying time does not even increase with the number of triples in the RDF store!
The explanation seems to stem from the fact that larger NMR Shift values are in general more unique than smaller values (see histogram of shift values in the full data set in figure 2 below). Thus, by testing for the largest value first, the query will be much less prone to get stuck in false leads. (Well, looking at the histogram, it appears that one could in fact do even better sorting than just from larger to smaller, like testing for values around 100 before values around 130 etc.)
(Find the new Prolog code below. The SPARQL query, and earlier Prolog code, can be found as attachments to this blog post.)
Egon pointed to an interesting blog post about a feature that is available as a an extension to Jena, the semantic web framework available in Bioclipse. It allows to very easily query multiple SPARQL endpoints from a single SPARQL query (using the
SERVICE keyword), and use variable bound from one endpoint when querying the next.
This is very useful in general. I was also thinking of the specific scenario (along the lines we have partly already been thinking) to use multiple Semantic MediaWikis as community maintained databanks, for querying back into Bioclipse. Being able to use multiple MediaWiki installs is very useful because it is hard to incorporate a very efficient access restriction system in MediaWiki (due to the nature of how it works, with template calls and all), so then it is better to be able to have separate wikis for content which needs special restrictions.
Note that this is still at the experimental stage, so things are a bit rough around the edges!
During my short stay at EBI, Egon had kindly arranged an opportunity to talk to Janna from the Steinbeck group, about some interesting work she has done on searching for cage structures in molecules, using Prolog, so we met over a coffee, together with Nico who's now at the Steinbeck group (visiting).
Bot Nico and Janna kindly gave a lot of good advice about research in general (as I'm currently looking into possibly doing PhD somewhere), which I highly appreciated. :)
And we talked some about the original topic too :), that is the cage structure problem. The cage structure problem is kind of an extension to the problem of expressing rings, which has previously been reported as a problem for OWL-DL. So because of this, it is interesting that Janna came up with a working solution, using Prolog.
As a highlighting example from the DL-side, Michel Dumontier has done some work on representing molecules, including rings. But they also had to use rules, not plain OWL.
So that seem to be the general conclusion: In order to express ring structures (or extensions of it, such as cage structures), you'll need to use rules in some way.
Unfortunatly my project is now running out of time, so I might not have much time to look more into this topic as part of my project :(. Will see if I can include this as a part of another course I still have to finish ("knowledge based systems in bioinformatics"), but that remains to see.
The problem I've had, to find a working SPARQL query for doing similarity search for spectra, according to a list of peak shift values (documented on this blog: post1, post 2 and on semanticoverflow) is now finally solved, thanks to helpful advice from Brandon Ibach on the [pellet-users] mailing list.
Found a very interesting quote:
"Both OWL-DL and function-free Horn rules are decidable fragments of first-order logic with interesting, yet orthogonal expressive power"
"Horn rules", is what prolog builds upon (a prolog statement are horn rules, AFAIS), so maybe Prolog fits into the category of "function-free horn rules"? (Gotta try to figure that out), and OWL-DL is the W3C standard for expressive semantics, that reasoners like pellet (which is available in bioclipse build upon.
What do you think of that title? :) To me it sounds like one of the (many) natural next steps forward for Bioclipse sometime in future1.
There are lots of things that can't be answered by a computer from data alone. Maybe the majority of what we humans perceive as knowledge is inferred from a combination of data (simple fact statements about reality) and rules that tell how facts can be combined together to allow making implicit knowledge (knowledge that is not persisted as facts anywhere, but has to be inferred from other facts and rules) become explicit.
One can easily imagine though, that storing every single piece of knowledge that could be stated, as an explicit fact, would require more storage than can probably ever be made available in this universe.
It is not too hard to come up with some processes which are just too complex and involves too much variability2 that it is unrealistic to try to capture every imaginable state of of that system or process in explicit facts. Instead we must seek the "first principles" that defines the process, and through simulations make explicit any knowledge we are looking for, at the time we need it (one can of course cache often accessed knowledge).