The problem I've had with finding a working SPARQL query for doing similarity search for spectra, based on a list of peak shift values (documented on this blog in post 1 and post 2, and on SemanticOverflow), is now finally solved, thanks to helpful advice from Brandon Ibach on the [pellet-users] mailing list.
We found a query that works fine. Unfortunately, though, it scales exponentially with the number of shift values (and corresponding tests in the FILTER) searched for, as explained below.
First, the sample data I'm using now:
(Yes, just one single spectrum for now; that is due to the scaling problems.)
The file contains a spectrum with peaks (nmr:hasPeak), each with one shift (nmr:hasShift), with the following values (for convenience, if anyone wants to try): [17.6, 18.3, 22.6, 26.5, 31.7, 33.5, 33.5, 41.8, 42.0, 42.2, 78.34, 140.99, 158.3, 193.4, 203.0, 0]
Searching for just a few of these peaks (so as not to run into performance problems) against the one-spectrum file results in this SPARQL query, which works fine and finishes in 3 s:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX afn: <http://jena.hpl.hp.com/ARQ/function#>
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
PREFIX nmr: <http://www.nmrshiftdb.org/onto#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s ?s1 ?s2 ?s3 ?s4
WHERE {
  ?s nmr:hasPeak [ nmr:hasShift ?s1 ] ,
                 [ nmr:hasShift ?s2 ] ,
                 [ nmr:hasShift ?s3 ] ,
                 [ nmr:hasShift ?s4 ] .
  FILTER ( fn:abs(?s1 - 17.6) < 0.3 &&
           fn:abs(?s2 - 18.3) < 0.3 &&
           fn:abs(?s3 - 22.6) < 0.3 &&
           fn:abs(?s4 - 26.5) < 0.3 ) .
}
LIMIT 1
So far so good. Unfortunately, though, this query scales very badly: exponentially with the number of variables / shift values tested for.
To verify this, I tried going from # = 1 (where # denotes the number of shift values and variables) up to # = 6, as exemplified by the queries below (prefixes omitted):
With # = 1:
SELECT ?s ?s1
WHERE {
  ?s nmr:hasPeak [ nmr:hasShift ?s1 ] .
  FILTER ( fn:abs(?s1 - 17.6) < 0.3 ) .
}
LIMIT 1
With # = 2:
SELECT ?s ?s1 ?s2
WHERE {
  ?s nmr:hasPeak [ nmr:hasShift ?s1 ] ,
                 [ nmr:hasShift ?s2 ] .
  FILTER ( fn:abs(?s1 - 17.6) < 0.3 &&
           fn:abs(?s2 - 18.3) < 0.3 ) .
}
LIMIT 1
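The pattern generalizes mechanically: one variable, one bracketed peak node, and one FILTER test per shift value. A small Python helper can generate the query text for any number of shifts (the function name `build_shift_query` is my own, not something from Bioclipse):

```python
# Hypothetical helper: build the SPARQL query text for a list of target
# shift values, with one variable and one FILTER test per shift.
def build_shift_query(shifts, tolerance=0.3, limit=1):
    vars_ = ["?s%d" % (i + 1) for i in range(len(shifts))]
    peaks = " , ".join("[ nmr:hasShift %s ]" % v for v in vars_)
    tests = " && ".join(
        "fn:abs(%s - %s) < %s" % (v, s, tolerance)
        for v, s in zip(vars_, shifts)
    )
    return (
        "SELECT ?s %s WHERE { ?s nmr:hasPeak %s . FILTER ( %s ) . } LIMIT %d"
        % (" ".join(vars_), peaks, tests, limit)
    )

# Reproduces the # = 2 query above (modulo whitespace):
print(build_shift_query([17.6, 18.3]))
```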
 # | Reasoning time (s) | log10(Reasoning time)
---------------------------------------------------
 1 |   0.043            | -1.37
 2 |   0.058            | -1.24
 3 |   0.295            | -0.53
 4 |   3.332            |  0.52
 5 |  44.183            |  1.65
 6 | 488.612            |  2.69
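The exponential claim can be checked directly from the table: a least-squares line fitted to log10(reasoning time) against # has a roughly constant slope, i.e. each added shift value multiplies the runtime by a near-constant factor. A minimal sketch, using the timings from the table above:

```python
# Sketch: fit a line to log10(reasoning time) vs. the number of shift
# values tested for. Data taken from the table above.
import math

ns = [1, 2, 3, 4, 5, 6]
times = [0.043, 0.058, 0.295, 3.332, 44.183, 488.612]
logs = [math.log10(t) for t in times]

# Least-squares slope of log10(time) against #:
mean_x = sum(ns) / len(ns)
mean_y = sum(logs) / len(logs)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(ns, logs)) / \
        sum((x - mean_x) ** 2 for x in ns)

# A slope near 1 would mean roughly a tenfold slowdown per added shift.
print("slope = %.2f" % slope)
```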
Plotting log10(reasoning time) against # gives a straight line (except at the very beginning).
(I'm using Pellet 2.0.0 in Bioclipse, on a 1.3 GHz Intel SU7300 dual-core laptop (ASUS UL30A).)
This is of course problematic, since in the worst case one might want to search for spectra with as many as 50 peaks, and of course against datasets much bigger than one spectrum. Any hints about alternative strategies with more linear scaling behavior are therefore still highly welcome.
Comments
trickery
Interesting results! I'm happy to see you making good progress... at a cheminformatics level, some trickery is actually used for this problem... NMRShiftDB actually uses indexing on the spectra, which allows a simpler filter to be applied first: it assigns spectral regions, e.g. 0-2.5 ppm, 2.5-5, 5-7.5, 7.5-10 (or so, I don't know the exact margins). Then, before the full query is matched, a simpler version is matched first, against the small number of regions instead of the individual peaks :)
The trickery here is to reformulate your question. If your question is too hard, you reformulate the problem into smaller and/or different questions, and do things step by step.
Nested SELECTs
Egon Willighagen wrote:
"The trickery here is to reformulate your question. If your question is too hard, you reformulate the problem in smaller and/or different questions, and do things step by step."
Yeah, that's true. One could probably do it in iterative steps (first picking out the spectra that have at least one matching peak, then running on those, etc.), if we get nested SELECTs to work in the Bioclipse SPARQL support (I didn't succeed so far... maybe I made some mistake...).
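The iterative idea can also be driven from outside SPARQL, without nested SELECTs: issue one cheap single-FILTER query per shift value and intersect the result sets, so the work grows linearly with the number of shifts. A hedged sketch, where `run_query` is a placeholder for whatever query-execution call Bioclipse provides (returning spectrum identifiers):

```python
# Sketch of the iterative strategy from the comment: one single-FILTER
# query per shift value, intersecting candidate sets as we go.
# run_query is a hypothetical callable, not an actual Bioclipse API.
def iterative_match(run_query, shifts, tolerance=0.3):
    candidates = None
    for shift in shifts:
        q = (
            "SELECT DISTINCT ?s WHERE { "
            "?s nmr:hasPeak [ nmr:hasShift ?v ] . "
            "FILTER ( fn:abs(?v - %s) < %s ) . }" % (shift, tolerance)
        )
        matched = set(run_query(q))
        candidates = matched if candidates is None else candidates & matched
        if not candidates:
            break  # no spectrum matches all shifts seen so far
    return candidates or set()
```

Note this is not quite equivalent to the original query: it does not enforce that distinct shift values hit distinct peaks, so an exact-match pass on the surviving (hopefully few) candidates would still be needed.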