File-based RDF storage in Pellet: first tests

As reported in a previous blog post, I ran into Java stack size errors when importing large amounts of data into Pellet. Pellet was using only Jena's in-memory RDF store, which limits the amount of data it can handle.

Jena offers other options for RDF storage, though, including SDB for SQL back ends and TDB for pure-Java, file-based storage. The latter is said to be the faster and easier to set up, which makes it suitable for the Pellet/RDF Bioclipse plugin, so I went ahead and implemented it (which was indeed very easy, once all the Eclipse classpath horror had been sorted out). I will have to refine it with better handling of file paths, good method naming, etc., before hopefully committing a patch tomorrow or so.

First like-for-like performance comparison: Pellet vs. Prolog

Meanwhile, I compared the time for the two operations from the last blog post (loading ~1 million triples and extracting all unique predicates) between Pellet and Prolog. The results are now much closer (Pellet's performance has improved around 1000-fold :) ), though Prolog is still the faster of the two:

Comparison results

Operation	Pellet	Prolog
Loading ~1 million triples	71.703 s	49.519 s
Retrieving all unique predicates	4.371 s	3.508 s
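
For readers unfamiliar with the second operation, here is a minimal sketch of what "retrieving all unique predicates" amounts to. It is not the code that was timed above, and it uses neither Pellet, Jena, nor Prolog: it just scans simple N-Triples lines with the Java standard library. The class name `UniquePredicates`, the method `distinctPredicates`, and the `example.org` subjects are all made up for illustration, and the line splitting assumes whitespace-separated terms with IRI predicates.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: collects the distinct predicates from simple N-Triples lines. */
public class UniquePredicates {

    /**
     * Returns the sorted set of predicate IRIs found in the given lines.
     * Assumes each triple line looks like "<subject> <predicate> object ."
     * with the first two terms separated by whitespace.
     */
    public static Set<String> distinctPredicates(List<String> lines) {
        Set<String> predicates = new TreeSet<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            // Split into subject, predicate, and the rest (object plus final dot).
            String[] parts = trimmed.split("\\s+", 3);
            if (parts.length < 3) {
                continue; // not a complete triple
            }
            String predicate = parts[1];
            // Strip the angle brackets around the IRI, if present.
            if (predicate.startsWith("<") && predicate.endsWith(">")) {
                predicate = predicate.substring(1, predicate.length() - 1);
            }
            predicates.add(predicate);
        }
        return predicates;
    }

    public static void main(String[] args) {
        // Hypothetical sample data; the subjects and literals are invented.
        List<String> triples = List.of(
            "<http://example.org/mol/1> <http://purl.org/dc/elements/1.1/title> \"Ethanol\" .",
            "<http://example.org/mol/1> <http://www.nmrshiftdb.org/onto#hasSpectrum> <http://example.org/spec/1> .",
            "<http://example.org/mol/2> <http://purl.org/dc/elements/1.1/title> \"Methanol\" ."
        );

        long start = System.nanoTime();
        Set<String> predicates = distinctPredicates(triples);
        double seconds = (System.nanoTime() - start) / 1e9;

        System.out.println(predicates);
        System.out.printf("Total time for retrieving all predicates: %.3f s%n", seconds);
    }
}
```

Against a real triple store, the same question is usually asked with a SPARQL query such as `SELECT DISTINCT ?p WHERE { ?s ?p ?o }`, leaving the store to do the scan and the de-duplication.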


Pellet

Total time for importing nmrshiftdata with Pellet: 71.703 s
[[http://www.nmrshiftdb.org/onto#moleculeId],
[http://xmlns.com/foaf/0.1/homepage],
[http://www.blueobelisk.org/chemistryblogs/casnumber],
[http://www.w3.org/2002/07/owl#sameAs],
[http://www.blueobelisk.org/chemistryblogs/inchi],
[http://www.blueobelisk.org/chemistryblogs/inchikey],
[http://www.nmrshiftdb.org/onto#hasSpectrum],
[http://purl.org/dc/elements/1.1/title],
[http://www.nmrshiftdb.org/onto#spectrumId],
[http://www.nmrshiftdb.org/onto#spectrumType],
[http://www.nmrshiftdb.org/onto#hasPeak],
[http://www.nmrshiftdb.org/onto#temperature],
[http://www.nmrshiftdb.org/onto#solvent],
[http://www.nmrshiftdb.org/onto#field],
[http://www.w3.org/1999/02/22-rdf-syntax-ns#type],
[http://www.nmrshiftdb.org/onto#hasShift],
[http://purl.org/dc/elements/1.1/source],
[http://purl.org/ontology/bibo/doi]]
Total time for retreiving all predicates, with Pellet: 4.371 s

Note that Pellet returns some extra built-in OWL predicates.

Prolog

Total time for importing nmrshiftdata with Prolog: 49.519 s
[['.'('http://purl.org/dc/elements/1.1/source',
'.'('http://purl.org/dc/elements/1.1/title',
'.'('http://purl.org/ontology/bibo/doi',
'.'('http://www.blueobelisk.org/chemistryblogs/casnumber',
'.'('http://www.blueobelisk.org/chemistryblogs/inchi',
'.'('http://www.blueobelisk.org/chemistryblogs/inchikey',
'.'('http://www.nmrshiftdb.org/onto#field',
'.'('http://www.nmrshiftdb.org/onto#hasPeak',
'.'('http://www.nmrshiftdb.org/onto#hasShift',
'.'('http://www.nmrshiftdb.org/onto#hasSpectrum',
'.'('http://www.nmrshiftdb.org/onto#moleculeId',
'.'('http://www.nmrshiftdb.org/onto#solvent',
'.'('http://www.nmrshiftdb.org/onto#spectrumId',
'.'('http://www.nmrshiftdb.org/onto#spectrumType',
'.'('http://www.nmrshiftdb.org/onto#temperature',
'.'('http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
'.'('http://www.w3.org/2002/07/owl#sameAs',
'.'('http://xmlns.com/foaf/0.1/homepage', []))))))))))))))))))]]
Total time for retreiving all predicates with Prolog: 3.508 s

To be continued...

Samuel's Tech Blog
