Prolog query much faster when mimicking SPARQL

I reported earlier that Jena/SPARQL outperformed Prolog for a lookup query with some numerical value comparison. It later on turned out that the results were flawed and finally that Prolog indeed was the fastest as soon as turning to datasets with more than a few hundred peaks.

The Prolog program I was using was rather complicated with recursive operations on double lists etc. Then, some week ago, I tried, in order to highlight differences in expressivity between Prolog and SPARQL, to implement a Prolog query that mimicked the structure of the SPARQL query I used, as close as possible. Interestingly it turned out that this Prolog query can be optimized to become blazing fast by reversing the order of shift values to search for, so that the largest values are searched for first. With this optimization the query outperforms both the SPARQL and the earlier used prolog code. See figure 1 below for results (The new prolog query is named "SWI-Prolog Minimal"). It appears that the querying time does not even increase with the number of triples in the RDF store!

Figure 1: Spectrum similarity search comparison: SWI-Prolog vs. JenaFigure 1: Spectrum similarity search comparison: SWI-Prolog vs. Jena

The explanation seems to stem from the fact that larger NMR Shift values are in general more unique than smaller values (see histogram of shift values in the full data set in figure 2 below). Thus, by testing for the largest value first, the query will be much less prone to get stuck in false leads. (Well, looking at the histogram, it appears that one could in fact do even better sorting than just from larger to smaller, like testing for values around 100 before values around 130 etc.)

Figure 2: Histogram of NMR Shift values in 25000 spectrum datasetFigure 2: Histogram of NMR Shift values in 25000 spectrum dataset

(Find the new Prolog code below. The SPARQL query, and earlier Prolog code, can be found as attachments to this blog post.)

Test info: The tests are performed with the respective tools integrated in the Bioclipse workbench. Three iterations of each test was performed, restarting Bioclipse between each, to avoid bias from the amount of available memory.
Operating System: 32 bit Ubuntu 9.10 desktop, with CPU speed locked to 1.3GHz.
Hardware:  used was an Asus UL30A laptop with an 1.3GHz Intel SU7300 CPU, 4Gb of RAM and a 5400 RMP Hard drive.

New Prolog code for the NMR Shift near-value comparison

:- rdf_register_ns(nmr, '').  
:- rdf_register_ns(xsd, '').

select_mol_w_pshifts( Mol ) :-
  q( Mol, 203.0 ),
  q( Mol, 193.4 ),
  q( Mol, 158.3 ),
  q( Mol, 140.99 ),
  q( Mol, 78.34 ),
  q( Mol, 42.2 ),
  q( Mol, 42.0 ),
  q( Mol, 41.8 ),
  q( Mol, 33.5 ),
  q( Mol, 33.5 ),
  q( Mol, 31.7 ),
  q( Mol, 26.5 ),
  q( Mol, 22.6 ),
  q( Mol, 18.3 ),
  q( Mol, 17.6 ),
  q( Mol, 0 ).

%%% Query method %%%
q( Mol, RefShiftVal ) :-
  rdf( Mol, nmr:hasSpectrum, Spec),
  rdf( Spec, nmr:hasPeak, Peak),
  rdf( Peak, nmr:hasShift, literal(type(xsd:decimal, ShiftValLiteral))),
  atom_number_fixzero( ShiftValLiteral, ShiftVal ),
  abs(ShiftVal - RefShiftVal) =< 3.

atom_number_fixzero( Atom, Num ) :-
  atom_length( Atom, AtomLen ), AtomLen > 0 -> % IF atom is not empty
  atom_number( Atom, Num );                    % THEN Convert to num. val.
  atom_number( '0', Num ).