The problem I've had with finding a working SPARQL query for doing similarity search for spectra, based on a list of peak shift values (documented on this blog in post 1 and post 2, and on SemanticOverflow), is now finally solved, thanks to helpful advice from Brandon Ibach on the [pellet-users] mailing list.
We found a query that works fine. Unfortunately, though, it scales exponentially with the number of shift values (and corresponding tests in the FILTER) searched for, as explained below.
First, the sample data I'm using now:
(Yes, just one single spectrum for now; that is due to the scaling problems.)
The file contains a spectrum with peaks (nmr:hasPeak), each with one shift (nmr:hasShift), with the following values (for convenience, if anyone wants to try): [17.6, 18.3, 22.6, 26.5, 31.7, 33.5, 33.5, 41.8, 42.0, 42.2, 78.34, 140.99, 158.3, 193.4, 203.0, 0]
Searching for just a few of these peaks (so as not to run into performance problems) against the one-spectrum file results in this SPARQL query, which works fine and finishes in 3 s:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX afn: <http://jena.hpl.hp.com/ARQ/function#>
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
PREFIX nmr: <http://www.nmrshiftdb.org/onto#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s ?s1 ?s2 ?s3 ?s4
WHERE {
  ?s nmr:hasPeak [ nmr:hasShift ?s1 ] ,
                 [ nmr:hasShift ?s2 ] ,
                 [ nmr:hasShift ?s3 ] ,
                 [ nmr:hasShift ?s4 ] .
  FILTER ( fn:abs(?s1 - 17.6) < 0.3 &&
           fn:abs(?s2 - 18.3) < 0.3 &&
           fn:abs(?s3 - 22.6) < 0.3 &&
           fn:abs(?s4 - 26.5) < 0.3 ) .
}
LIMIT 1
So far so good. Unfortunately, though, this query scales very badly: exponentially with the number of variables / shift values tested for.
To verify this, I tried going from # = 1 (where # denotes the number of shift values and variables) up to # = 6, as exemplified by the queries below (prefixes omitted):
With # = 1:
SELECT ?s ?s1
WHERE {
  ?s nmr:hasPeak [ nmr:hasShift ?s1 ] .
  FILTER ( fn:abs(?s1 - 17.6) < 0.3 ) .
}
LIMIT 1
With # = 2:
SELECT ?s ?s1 ?s2
WHERE {
  ?s nmr:hasPeak [ nmr:hasShift ?s1 ] ,
                 [ nmr:hasShift ?s2 ] .
  FILTER ( fn:abs(?s1 - 17.6) < 0.3 &&
           fn:abs(?s2 - 18.3) < 0.3 ) .
}
LIMIT 1
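The pattern generalizes mechanically: one variable, one bracketed peak node, and one FILTER test per shift value. A small Python helper can generate the query text for any number of shifts (the function name `build_shift_query` is my own, not something from Bioclipse):

```python
# Hypothetical helper: build the SPARQL query text for a list of target
# shift values, with one variable and one FILTER test per shift.
def build_shift_query(shifts, tolerance=0.3, limit=1):
    vars_ = ["?s%d" % (i + 1) for i in range(len(shifts))]
    peaks = " , ".join("[ nmr:hasShift %s ]" % v for v in vars_)
    tests = " && ".join(
        "fn:abs(%s - %s) < %s" % (v, s, tolerance)
        for v, s in zip(vars_, shifts)
    )
    return (
        "SELECT ?s %s WHERE { ?s nmr:hasPeak %s . FILTER ( %s ) . } LIMIT %d"
        % (" ".join(vars_), peaks, tests, limit)
    )

# Reproduces the # = 2 query above (modulo whitespace):
print(build_shift_query([17.6, 18.3]))
```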
 # | Reasoning time (s) | log10(Reasoning time)
---------------------------------------------------
 1 |   0.043            | -1.37
 2 |   0.058            | -1.24
 3 |   0.295            | -0.53
 4 |   3.332            |  0.52
 5 |  44.183            |  1.65
 6 | 488.612            |  2.69
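The exponential claim can be checked directly from the table: a least-squares line fitted to log10(reasoning time) against # has a roughly constant slope, i.e. each added shift value multiplies the runtime by a near-constant factor. A minimal sketch, using the timings from the table above:

```python
# Sketch: fit a line to log10(reasoning time) vs. the number of shift
# values tested for. Data taken from the table above.
import math

ns = [1, 2, 3, 4, 5, 6]
times = [0.043, 0.058, 0.295, 3.332, 44.183, 488.612]
logs = [math.log10(t) for t in times]

# Least-squares slope of log10(time) against #:
mean_x = sum(ns) / len(ns)
mean_y = sum(logs) / len(logs)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(ns, logs)) / \
        sum((x - mean_x) ** 2 for x in ns)

# A slope near 1 would mean roughly a tenfold slowdown per added shift.
print("slope = %.2f" % slope)
```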
Plotting log10(reasoning time) against # gives a straight line (except at the very beginning).
(I'm using Pellet 2.0.0 in Bioclipse, on a 1.3 GHz Intel SU7300 dual-core laptop (ASUS UL30A).)
This is of course problematic, since in the worst case one might want to search for spectra with as many as 50 peaks, and of course against datasets much bigger than one spectrum. Any hints about alternative strategies with more linear scaling behavior are therefore still highly welcome.
Comments
trickery
Interesting results! I'm happy to see you making good progress... at a cheminformatics level, some trickery is actually used for this problem... NMRShiftDB actually uses indexing on the spectra, which allows a simpler filter to be applied first: it assigns spectral regions, e.g. 0-2.5 ppm, 2.5-5, 5-7.5, 7.5-10 (or so, I don't know the exact margins). Then, before the full query is matched, a simpler version is matched first, against the small number of regions instead of the individual peaks :)
The trickery here is to reformulate your question. If your question is too hard, you reformulate the problem into smaller and/or different questions, and do things step by step.
Nested SELECTs
Egon Willighagen wrote:
"The trickery here is to reformulate your question. If your question is too hard, you reformulate the problem in smaller and/or different questions, and do things step by step."
Yeah, that's true. One could probably do it in iterative steps (first picking out the spectra that have at least one matching peak, then running on those, etc.), if we get nested SELECTs to work in the Bioclipse SPARQL support (I didn't succeed so far... maybe I made some mistake...).
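The iterative idea can also be driven from outside SPARQL, without nested SELECTs: issue one cheap single-FILTER query per shift value and intersect the result sets, so the work grows linearly with the number of shifts. A hedged sketch, where `run_query` is a placeholder for whatever query-execution call Bioclipse provides (returning spectrum identifiers):

```python
# Sketch of the iterative strategy from the comment: one single-FILTER
# query per shift value, intersecting candidate sets as we go.
# run_query is a hypothetical callable, not an actual Bioclipse API.
def iterative_match(run_query, shifts, tolerance=0.3):
    candidates = None
    for shift in shifts:
        q = (
            "SELECT DISTINCT ?s WHERE { "
            "?s nmr:hasPeak [ nmr:hasShift ?v ] . "
            "FILTER ( fn:abs(?v - %s) < %s ) . }" % (shift, tolerance)
        )
        matched = set(run_query(q))
        candidates = matched if candidates is None else candidates & matched
        if not candidates:
            break  # no spectrum matches all shifts seen so far
    return candidates or set()
```

Note this is not quite equivalent to the original query: it does not enforce that distinct shift values hit distinct peaks, so an exact-match pass on the surviving (hopefully few) candidates would still be needed.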