Automating answering of questions with no answers - by wrapping simulations in semantics

What do you think of that title? :) To me it sounds like one of the (many) natural next steps forward for Bioclipse sometime in the future [1].

Explicit knowledge is too expensive

There are lots of things that can't be answered by a computer from data alone. Maybe the majority of what we humans perceive as knowledge is inferred from a combination of data (simple factual statements about reality) and rules that tell how facts can be combined, so that implicit knowledge (knowledge that is not persisted as facts anywhere, but has to be inferred from other facts and rules) becomes explicit.

One can easily imagine, though, that storing every single piece of knowledge that could be stated as an explicit fact would require more storage than can probably ever be made available in this universe.
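To make the facts-plus-rules idea a bit more concrete, here is a minimal sketch of forward chaining in Python. Everything in it is made up for illustration: the triples, the `part_of` predicate and the single transitivity rule are my assumptions, not something taken from Bioclipse or any real ontology.

```python
def infer(facts, rules):
    """Forward-chain: apply every rule repeatedly until no new facts appear."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            # Each rule returns a fresh set, so adding to `known` here is safe.
            for new_fact in rule(known):
                if new_fact not in known:
                    known.add(new_fact)
                    changed = True
    return known


def part_of_is_transitive(known):
    """Hypothetical rule: if A part_of B and B part_of C, then A part_of C."""
    parts = [(s, o) for (s, p, o) in known if p == "part_of"]
    return {(a, "part_of", d) for (a, b) in parts for (c, d) in parts if b == c}


# Explicit facts, stored as (subject, predicate, object) triples.
facts = {
    ("nucleus", "part_of", "cell"),
    ("cell", "part_of", "tissue"),
}

inferred = infer(facts, [part_of_is_transitive])
implicit = inferred - facts  # knowledge that was never stored as a fact
print(implicit)
```

The fact that the nucleus is part of the tissue is stored nowhere; it only becomes explicit once the rule is applied to the two stored facts.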

Simulations can make knowledge explicit, from first principles

It is not too hard to come up with processes that are so complex and involve so much variability [2] that it is unrealistic to try to capture every imaginable state of the system or process in explicit facts. Instead we must seek the "first principles" that define the process and, through simulations, make explicit any knowledge we are looking for at the time we need it (one can of course cache often-accessed knowledge).

So: simulations can (temporarily) make explicit the knowledge we are looking for but which does not exist in explicit form.

Simulations should be automated

There is lots of simulation software for biological systems out there ... deterministic, stochastic and agent-based, to mention a few categories. The plethora of different systems to choose from does not make life easier for the bench scientist. To improve that situation and provide some automation, a standardized way of dealing with simulations is needed.

What is done already?

There are some efforts to increase interoperability, most notably through the SBML standard, a standard file format for molecular biology models (pathways, gene regulatory networks, and the like). But how do we make things more "automatable", which is one of the main goals of the semantic web (AFAIK)?

Interestingly, Egon hinted at a blog post about a very semantic annotation of SBML models! Nice indeed that SBML and semantic tech can now be integrated.

Still more to do

I'm wondering whether that is enough, though. If we want to automate knowledge discovery using these kinds of systems, we also need to capture the outcome of simulations in a computer-readable way, not only the underlying model [3].

So, all in all, I think there is some work to be done in wrapping simulation software in a "semantic shell" that knows all it needs to understand the language of incoming data (it might even need to be able to produce such data from semantic queries), and that can also analyze the simulation results and provide the questioner with an answer in a semantic way.

Thus, by wrapping simulations in semantics, one might be able to automate answering of questions with no answers! (at least, that's the idea :) )
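As a very rough sketch of what such a "semantic shell" could look like, the Python below wraps a toy simulation so that a structured query goes in and triples come out. All of it is hypothetical: the query keys, the predicate names (`hasExpressionLevel`, `atDay`, `derivedBy`) and the decay model are stand-ins I made up, and a real wrapper would talk to an actual simulator and use a real vocabulary.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # cache often-accessed results instead of re-simulating
def simulate_expression(initial_level, decay_rate, days, steps_per_day=100):
    """Toy deterministic model: explicit Euler steps for dx/dt = -decay_rate * x."""
    dt = 1.0 / steps_per_day
    level = initial_level
    for _ in range(days * steps_per_day):
        level += dt * (-decay_rate * level)
    return level


def answer(query):
    """Hypothetical semantic shell: translate a query into simulation
    parameters, run the simulation, and return the result as triples."""
    level = simulate_expression(
        query["initial_level"], query["decay_rate"], query["days"]
    )
    subject = f"{query['gene']}@{query['compartment']}"
    return [
        (subject, "hasExpressionLevel", round(level, 3)),
        (subject, "atDay", query["days"]),
        (subject, "derivedBy", "simulation"),
    ]


# Made-up question: what is the level of gene X in compartment Y after 3 days?
query = {
    "gene": "geneX",
    "compartment": "compartmentY",
    "initial_level": 10.0,
    "decay_rate": 0.5,
    "days": 3,
}

triples = answer(query)
triples_again = answer(query)  # same question again: served from the cache
for triple in triples:
    print(triple)
```

The point of the design is that the questioner never sees the simulator itself, only a question and an answer expressed in the same triple-like form as stored facts, with a marker saying the answer was derived by simulation rather than looked up.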

  • 1. And something I'd love to work on, if the future in any way permits.
  • 2. Think of the embryonic development process, for example, and then add dimensions like species, environmental factors, mutations, etc.
    Say, for example, that we are looking for the expression level of gene X, in compartment Y, in species Z, after A days of growth, with a temperature that goes from +17 C to +4 C in a gradient during that A-day period. You can easily see that there are quite a lot of possible combinations of factors affecting the state of the system in every time step ...
  • 3. Most probably, principles from the SADI framework will come in useful here.


Building blocks for this are already there

Very interestingly, I found out during my 3-day stay at EBI that the building blocks for this are already there.