I have a new blog!

GSoC status update July 14

(Figured I better do regular status updates, as a lot of small things tend to get missed if blogging only when there is something to show off)

As you might have seen among the GSoC2010 tagged posts I've had a rudimental RDF/XML import, and a SPARQL endpoint (only for querying so far!) up running for a while. You should be able to set up these yourself by following one or more of the instructions in the Google code repo:

I have since worked a bit on some use cases, which revealed a lot of intricacies to take into account on RDF import. One of them was a spinoff discussion, from a blog post by Egon Willighagen, which quite nicely outlines one of the motivations for having general RDF import in MediaWiki (read post, read discussion).

The last few days I've been working on heavly refactoring the import code, so that it is more general and easy to modify in new ways. There is still a lot to be improved in the code, like error handling, documentation, adding more options etc, so feel free to give feedback on the code! (Especially RDFImporter.php and EquivalendURIHandler.php, and preferrably use the mailing lists: semediawiki-devel, semediawiki-user or mediawiki-l)

The RDF import seems to be the most challenging part in my project (and on which the export feature heavily depends) - since it is the part where I'm breaking a bit of new ground, so here feedback is much welcome.

Choosing wiki titles for RDF entities on import - Feedback wanted

The one most challenging issue is about how to select reasonable wikititles to use for RDF entities on RDF import, based on the RDF data (one relevant blog post here). The question of being able to export the page with the original URI, should not limit the choice directly, since this is already solved by storing the original URI as a property on each page.

The thoughts we have had so far - in short - is:

  • First look if the RDF entity in question has one of a list of properties, in prioritized order, that should be used as wiki title.
    • The first of these, could be a special property which can be used to manually specify this by including it in the imported RDF, like "hasWikiTitle".
    • The suggested list or properties so far can be seen in this blog post. This list should of course be configurable, and one question is also how to best implement this configuration? A setting in LocalSettings.php? A wiki page?
  • If no matching property is in this list, then the label for the RDF entity should be used. For example, if the entity:s URI is http://bio2rdf.org/go:0032283, then 0032283 is used.

How to configure namespace prefixes / preudo-namespaces?

Using only the label of course has the risk that multiple RDF entities converts to the same wiki title, which is not acceptable for example if using the wiki as a "one time RDF editor", which is one of the motivations for this project.

To solve this, one alternative (as a configurable option) could be to use a pseudo-namespace in the wiki title (e.g. "go" in the above example, which would result in "go:0032283" as the wikititle). This could be configured by creating a mapping between base URI:s and pseudo-namespaces (.e. "http://bio2rdf.org/go:" and "go", in this case).

But then there is the question how to configure this mapping. We've been thinking of a few options:

  • Let it be configured in the incoming RDF, by using a custom predicate "hasPseudoNameSpace"
  • Store a config in a Wiki article
  • Config in LocalSettings.php
  • On submitting data for import, analyze it first and present a screen with all the base URI:s used, with fields to manually fill in the pseudo-prefixes to use.
  • Any combination of the above

I will be working ahead, and try to figure out the most reasonable strategy together with Denny (who is my GSoC mentor), but feedback and comments are always welcome! (As said, preferrably send feedback on the mailing lists; semediawiki-devel, semediawiki-user or mediawiki-l!)

If you want to follow the project progress, see the status page for options

Debugging MediaWiki extensions with Eclipse for PHP and XDebug on Ubuntu 10.04

 I ran into some troubles with the debugging with XDebug in Eclipse for PHP Developers / PDT (breakpoints stopped to take / catch, after I changed location of my www folder - which I figured out later), so I wanted to document the full setup procedure it here. I mainly followed this blog post. (Assuming you have apache and php set up!).

  • Download Eclipse for PHP developers
    • I'm using the Linux 32 bit version
    • Because of a bug with PDT for Eclipse Helios, I chose to go with Eclipse PDT SR2 for now.
  • Install XDebug by typing:
 apt-get install php5-xdebug
  • Edit /etc/php5/conf.d/xdebug.ini to look something like this:
# xdebug.remote_log="/tmp/xdebug.log" 
    • The log line can be good to enable in case you run into troubles.
    • The first line can be different depending on PHP version. Use the line that you can find in the file (which is the default for your version of Ubuntu):
  • Restart apache:
 sudo apache2ctl restart

Configure Eclipse

  • Create a PHP Project. Mine is named "RDF4SMW".
  • Choose Run > Debug Configurations in Eclipse
  • Turn the configuration to your web server setup. For me, it looks like this ("smw" is the folder where MediaWiki is installed):

Custom location of www folder

Then an important thing If you have changed the location of your www folder from /var/www, then don't miss this! This was what caused the problem for me, with breakpoints not taking:

  • Click the "Configure" button in the Debug configuration screen seen above
  • Click the "Path mapping" tab
  • Configure according to the location of your www folder. My current config looks like so:
  • This should be it. Happy debugging!

Considering name change: SMW RDF Connector --> SMW RDF IO

Most probably, the extended RDF import/export in Semantic MediaWiki, which I'm working on, will be made an extension, to start with (at least that's my idea).

First I was calling it "RDFIO", but in fear of being a too general name, or too undescriptive, I switched to "SMW RDF Connector", and the project currently lives in a repo called SMW RDF Connector.

Now, realize this is awfully long, I'm thinking to go back to "RDFIO", or rather "SMW RDF IO" again, before it's "too late", (haven't even created a MediaWiki extension page yet). Seems like it's the most clever, short choice after all.

Any comments?

New release of Bioclipse (2.4)

Bioclipse, the open source chemoinformatics and bioinformatics workbench, which was the context for my thesis project, just released a new major version, 2.4:


Also, the site got a major overhaul just a few days ago, so why don't you have a look! :)


Final version of degree project report

The last administrative details of my thesis project are now finished, and the report is now available in final form, for download as PDF in on this page (no 14 in the list), or this direct link. (The title of the project was "SWI-Prolog as a Semantic Web Tool for semantic querying in Bioclipse: Integration and performance benchmarking").

RDF properties to use for wiki titles on import - suggestions?

One of the things we try to do in my GSoC project is to select suitable wiki titles when importing arbitrary RDF triples into Semantic MediaWiki (The full RDF URI:s are very ugly to use as wiki titles!).

Simply shorting the namespace in the entitiy's URI to it's prefix (as specified in the import data) could be a general fallback (see screenshot) but for many types of data it could be nice to make use of a property that puts some "natural language" label for the entity instead. We came up with this list of properties so far:

More suggestions?

ARC2 based SPARQL Endpoint for Semantic MediaWiki up running

I now managed to create a SPARQL endpoint for Semantic MediaWiki, in a MediaWiki SpecialPage, based on the ARC2 RDF library for PHP (including its built-in triplestore). See screenshot below, and code in the svn trunk. (I have to say I'm impressed by the ease of working with ARC2!)

Code is still quite ugly and the "Equivalent URI" handling that we talked about is still not implemented. Will turn to that now, while doing some refactoring (more object oriented etc).

Even nicer with more prefixes


Ok, even nicer now ... after allowing to introduce custom namespace prefixes also with '?' and '=' as separators (Don't know if that is valid for QNames though? Anyone knows?) than before (blog post 1, and post 2):

Code (see code repo) is still quite ugly though. Just typed together to get things working. Will look at structuring things more as soon as I have slept a little better :)

I also got to think of these random issues, which is foremost questions to myself, that might not be fully thought-through yet, but jotting them down here for possible feedback (hoping it's readable ^^):

  • Imports will have to be done according to a special "mapping configuration", so that the same config can be used on import and export, if wanted. Such a config should be generated on import based on the RDF namespaces used. It should also be manually expandable with a prioritized list of properties (like dc:title) to use as substitute for URIs. Maybe it could be stored in a (timestamped?) wiki article?
  • When using "natural language" substitutes for URI:s, such as "dc:title", one has to put a property such as swivt:originalURI or similar into the page, so that it can be exported again with its original URI (or there is already the swivt:importedFrom (or sth like that...) ... is it better to use that?
  • The SPARQL endpoint will need to be accessed according to the same kinds of mapping configs if the wiki should be able to be queried in the same format as the import data ...

Using RDF namespace prefixes in wiki titles


As you might have seen in my previous blog post, using the full RDF URI:s as wiki titles on importing, is not optimal. I now managed to use the RDF namespace prefixes submitted on import time, to make titles nicer. Definitely improves things a bit. At least all properties are not all treated as the same as in the previous blog post for example.

But, also still a bit to go. Not all URI:s have prefixes, and for some things you would probably want to use the corresponding dc:title or similar instead of an abbreviated URI, etc.


Populating SMW from RDF/XML (First test)

I now managed, (using ARC2 and SMWWriter, in a MediaWiki extension) to populate Semantic MediaWiki pages with triples, from a snippet of RDF/XML (Thanks to Egon Willighagen for the RDF-ized NMRShiftDB data, submitted further below), yay! But ... as you can see, using the full URI:s as wiki names is not a good idea. URI:s as wiki titles is ugly ... and the predicates in this case even got truncated and all treated as the same one, since they had hashes in it, which MediaWiki obviously doesn't allow in titles), so that's the whole background for our talking so much about the "equivalent URI handler" (mentioned here for example), which is thought to be a configurable handler of mappings from URI patterns to (sensible) wiki titles. Also, optimally the same pattern can then be used both on import and export so that the format is kept, allowing to use SMW as a (collaborative) RDF editor! (which is one of the main motivations for my GSoC project).

Well, the hard bits (URI -> Wiki title mapping) remains. Diving in to it now ...