semantic web

Autumn is here - new projects starting

Now that the autumn is here I have some new projects starting, and running for approximately this month (before jumping out into the real world and trying to get a real job :) ). Good thing is, it all builds upon previous work.

First thing is, I'll work with Egon Willighagen to hook up Bioclipse with Semantic MediaWiki, via the RDFIO extension, which I developed as part of GSoC this year, mentored by Denny Vrandecic. I'm excited about putting RDFIO into some real world usage!

Second thing is I'll work part time at the Bioclipse group to improve the ways user documentation for plugins is authored and published, enabling some automation of publishing content to the Bioclipse website as well as the wiki etc. This will hopefully make it a lot easier for end users to find their desired extra functionality, and make more users see the value of a research platform with an open and modular structure, as is Bioclipse!

RDFIO 0.4.0 released - GSoC Finished!

With the release of RDFIO 0.4.0, my GSoC 2010 project is now over!

I want to thank especially my mentor Denny Vrandečić, and also the SMW community at large for a great time! I also want to sincerely thank my masters project mentor Egon Willighagen who mentioned about, and encouraged me to apply to the program. Without this encouragement, I'd never taken the step. It has been a good time and rewarding, and I've much enjoyed to have time to get a bit into MediaWiki/SMW extensions development, as well as to provide some new functionality to these great bits of software.

The main GSoC coding is now over, and I will need to take a little break for an exam this friday, but surely I'll continue to refine the RDFIO extension later, especially as Egon and me are looking into using it to integrate Bioclipse with SMW in order to make RDF data in Bioclipse "Community editable" (could turn out to be some real useful stuff!).

The 0.4.0 release

This new release brings a lot of refactoring and reworkings under the hood, as well as quite a few minor bugfixes and improvements here and there, so upgrading is recommended. We also got the issues in SMWWriter fixed now, so patching it is no longer needed, which hopefully will make installing easier!

Of the more notable changes are the improved selection of wiki titles on import, as described in this blog post. Another important fix is that the default output format (if not specifying any) is now "SPARQL Resultset XML" which now makes the SPARQL endpoint fully "SPARQL compliant" and queryable from typical SPARQL tools like Jena. It is a remaining topic though how to allow update operations without leaving the endpoint wide open ... i.e. how to implement some form of user rights checking, when used as a webservice.

A little technical note also is that RDFIO now takes the $wgDBprefix parameter into account, so if you are using RDFIO with table prefixes in the database, you will need to regenerate the tables and the triples in the store (can be done at the Special:ARC2Admin and Special:SMWAdmin pages respectively).

I should not end without a note about some great bits of existing code that I've had the pleasure to make use of:

  • ARC, the PHP RDF library I've been using, and on which RDFIO is very heavily dependent
    I found it very powerful and super conveniant to work with!
  • I enjoyed making use of the SMWWriter and PageObjectModel extensions, which also definitely made life easier for me and saved me tons of work.

Sensible wiki titles on RDF import with "pseudo RDF namespaces"

This week I just finished the last remaining items on my todo list for my Google Summer of Code project, (which is available in the form of the RDFIO MediaWiki extension). Those things, which I also mentioned in my last blog post were to:

  • Add ability to use ("pseudo") namespaces for general RDF entities (non-properties) in order to choose wiki titles for them on RDF import.
  • Add a screen that shows URIs lacking a namespace prefix to abbreviate it with.

Regarding the first point, it might not be overly easy to see the usefulness of it at once, so I just created a screencast to show the difference between using it and not:

It demonstrates the problem of choosing sensible wiki titles for general RDF entities in case no good property for naming is available, (such as rdfs:label etc) ... since "entity" URIs often just consist of nonsensible id:s and often no namespace prefixes are defined for them. RDFIO lets you add "pseudo" namespaces (using a simplified splitting pattern, not necessarily consistent with XMLns specs), in order to come around this problem.

  • The new functionality is so far only available in the svn trunk
  • More info, install instructions etc on the extension page

Hopefully I'll find time to also demonstrate the second point above, as well as the "filter by ontology" feature for the SPARQL endpoint, with screencasts early next week.

Otherwise, the coming week I'll use for doing some refactoring of the currently quite unmanageable code, as well as add commenting, and hopefully also add the feature to filter RDF export by a [[Export RDF::false]] SMW property (which was the "it time permits" item of my TODO list).

GSoC status update July 14

(Figured I better do regular status updates, as a lot of small things tend to get missed if blogging only when there is something to show off)

As you might have seen among the GSoC2010 tagged posts I've had a rudimental RDF/XML import, and a SPARQL endpoint (only for querying so far!) up running for a while. You should be able to set up these yourself by following one or more of the instructions in the Google code repo:

I have since worked a bit on some use cases, which revealed a lot of intricacies to take into account on RDF import. One of them was a spinoff discussion, from a blog post by Egon Willighagen, which quite nicely outlines one of the motivations for having general RDF import in MediaWiki (read post, read discussion).

The last few days I've been working on heavly refactoring the import code, so that it is more general and easy to modify in new ways. There is still a lot to be improved in the code, like error handling, documentation, adding more options etc, so feel free to give feedback on the code! (Especially RDFImporter.php and EquivalendURIHandler.php, and preferrably use the mailing lists: semediawiki-devel, semediawiki-user or mediawiki-l)

The RDF import seems to be the most challenging part in my project (and on which the export feature heavily depends) - since it is the part where I'm breaking a bit of new ground, so here feedback is much welcome.

Choosing wiki titles for RDF entities on import - Feedback wanted

The one most challenging issue is about how to select reasonable wikititles to use for RDF entities on RDF import, based on the RDF data (one relevant blog post here). The question of being able to export the page with the original URI, should not limit the choice directly, since this is already solved by storing the original URI as a property on each page.

The thoughts we have had so far - in short - is:

  • First look if the RDF entity in question has one of a list of properties, in prioritized order, that should be used as wiki title.
    • The first of these, could be a special property which can be used to manually specify this by including it in the imported RDF, like "hasWikiTitle".
    • The suggested list or properties so far can be seen in this blog post. This list should of course be configurable, and one question is also how to best implement this configuration? A setting in LocalSettings.php? A wiki page?
  • If no matching property is in this list, then the label for the RDF entity should be used. For example, if the entity:s URI is http://bio2rdf.org/go:0032283, then 0032283 is used.

How to configure namespace prefixes / preudo-namespaces?

Using only the label of course has the risk that multiple RDF entities converts to the same wiki title, which is not acceptable for example if using the wiki as a "one time RDF editor", which is one of the motivations for this project.

To solve this, one alternative (as a configurable option) could be to use a pseudo-namespace in the wiki title (e.g. "go" in the above example, which would result in "go:0032283" as the wikititle). This could be configured by creating a mapping between base URI:s and pseudo-namespaces (.e. "http://bio2rdf.org/go:" and "go", in this case).

But then there is the question how to configure this mapping. We've been thinking of a few options:

  • Let it be configured in the incoming RDF, by using a custom predicate "hasPseudoNameSpace"
  • Store a config in a Wiki article
  • Config in LocalSettings.php
  • On submitting data for import, analyze it first and present a screen with all the base URI:s used, with fields to manually fill in the pseudo-prefixes to use.
  • Any combination of the above

I will be working ahead, and try to figure out the most reasonable strategy together with Denny (who is my GSoC mentor), but feedback and comments are always welcome! (As said, preferrably send feedback on the mailing lists; semediawiki-devel, semediawiki-user or mediawiki-l!)

If you want to follow the project progress, see the status page for options

Final version of degree project report

The last administrative details of my thesis project are now finished, and the report is now available in final form, for download as PDF in on this page (no 14 in the list), or this direct link. (The title of the project was "SWI-Prolog as a Semantic Web Tool for semantic querying in Bioclipse: Integration and performance benchmarking").

Report approved - preliminary version for download

My degree project, titled "SWI-Prolog as a Semantic Web tool for semantic querying in Bioclipse" is getting closer to finish. Now my report is approved by the Scientific Reviewer (thanks, Prof. Mats Gustafsson), so I wanted to make it available here (Download PDF). Reports on typos are welcome of course! :)

Browse semantic data in parallell

The semantic web field has seemed quite void of successful general user interfaces to browse semantic data in an efficient way (SPARQL querying is not really for everybody and their aunt). An interesting approach is the Freebase Parallax which lets you continuously views sets of data all while you narrow down or extend your search criteria, thus "browsing in parallell". Seems to make a lot of sence in its simplicity.

3rd Project Update (Integrating SWI-Prolog for Semantic Reasoning in Bioclipse)

I just had my 3rd, and last project update presentation (before the final presenation on April 28th), presenting results from comparing the performance of the integrated SWI-Prolog against Jena and Pellet, for a spectrum similarity search query. Find the sldes below.

Correction of flawed results: Close competition between Jena and Prolog

UPDATE 29/3: See new results here

I reported in a previous blog post (with a bit of surprise) that Jena clearly outperformed SWI-Prolog for a NMR Spectrum similarity search run inside Bioclipse. I have now realized that indeed these previous results were flawed for a number of reasons.

Semantic web for LAMP systems

Just aquainted myself a tiny bit with ARC, RDF Classes for PHP. Might be useful in the Bioclipse / Semantic MediaWiki integration that is in the thoughts for the coming months. ARC "framework", or RDF API used by RDF modules in Drupal, and has been talked about being the substitute for the currently used RAP framework in Semantic MediaWiki (used if one wants to set up a SPARQL endpoint for a Semantic MediaWiki).

On a different note, there are a bunch of nice web apps built on ARC over at semsol.com, most notably trice, a semantic web application framework.