Samuel Lampa's blog

FIMS Project Status update - Thinking about CLI wrapper XML formats

Time for a more overview-style status update on the Bioclipse HPC Client part of the FIMS Project I'm working on for UPPNEX at UPPMAX.

Right now I'm hacking away on the batch job config wizard (just fixed a "select remote file" dialog, for file path parameters, which I actually even screencasted :=)).

Otherwise, I'm reaching the stage where I need to make a decision about command line tool wrapper formats.

So far I've tried to use the Galaxy (bioinformatics portal) XML format, hoping to take advantage of the vast number of already wrapped bioinformatics tools (actually, I'm using the format now - that is what drives the wizard in the Bioclipse screencast above).

Unfortunately though, I figured out that most (if not all?) tool configs in Galaxy do not wrap the command line tool itself, but rather an accompanying script file (Python/Bash/Perl) that does some additional stuff (different for each tool), which makes it hard to use the tool configs right away outside the Galaxy platform.

So, realizing this, I'm considering whether we should go for something even more general, for this light-weight batch config wizard (which is not trying to be a complete replacement for Galaxy).

I actually just got to know about another such format (via a question on Stack Overflow). Apparently the DocBook package contains such a format! So, if it turns out that there are already lots of ready-made DocBook definitions for a bunch of CLI tools, then this could be quite interesting ... so that's something I'll look into now. Otherwise, I might as well stick with the Galaxy format (have to admit though that the DocBook one feels like the more general choice ... or what do you think?).

Then one could of course also have converters between the DocBook and Galaxy XML formats too ... should be pretty straightforward with XSLT.
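To give a feel for what such a converter could look like, here is a minimal sketch in Java, using only the XSLT support that ships with the JDK (javax.xml.transform). Everything specific in it is an assumption made for illustration: the sample tool config is a made-up, heavily simplified Galaxy-style snippet (not a real wrapper), the class and method names are hypothetical, and the stylesheet only maps the tool id and its input parameters onto a DocBook-style cmdsynopsis element. A real converter would have to handle far more of both formats.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class GalaxyToDocBookSketch {

    // Hypothetical, heavily simplified Galaxy-style tool config (not a real wrapper).
    public static final String SAMPLE_TOOL =
        "<tool id=\"mytool\" name=\"My Tool\">"
      + "<command>mytool --in $input</command>"
      + "<inputs><param name=\"input\" type=\"data\"/></inputs>"
      + "</tool>";

    // Illustrative stylesheet: maps <tool> onto a DocBook-style <cmdsynopsis>,
    // with one required <arg> per input parameter.
    public static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/tool\">"
      + "<cmdsynopsis>"
      + "<command><xsl:value-of select=\"@id\"/></command>"
      + "<xsl:for-each select=\"inputs/param\">"
      + "<arg choice=\"req\"><replaceable>"
      + "<xsl:value-of select=\"@name\"/>"
      + "</replaceable></arg>"
      + "</xsl:for-each>"
      + "</cmdsynopsis>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the given tool XML and returns the result as a string.
    public static String convert(String toolXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(toolXml)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT conversion failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(convert(SAMPLE_TOOL));
    }
}
```

Going the other direction (DocBook to Galaxy) would just be a second stylesheet, so the Java side could stay the same.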

Ok, so that's where I am, and what I'm thinking about right now! Feel free to drop a line of feedback!

Got "select remote file" to work in the Cluster Job Config Wizard

As hinted at in this blog post, I'm working on a cluster batch job configuration wizard for Bioclipse, making heavy use of the TM / RSE components for Eclipse.

In the wizard, I of course wanted to be able to fire up a file selection dialog for filling in fields for file paths on the cluster. The only problem was that the RSE API is (as usual for Java projects) not of the simplest kind, and as a newcomer I found it a bit challenging to figure out where to start.

As pointed out in this short blog post, I finally found the (simple) solution, and here we go (the final code needed some additions though, but in principle it was simple):

Some more info can be found at the project's wiki page.

Opening a remote file selection dialog with the RSE for Eclipse

This was easier than expected. Helped by the RSE File UI API Docs and this forum post, I figured out how to do it:

SystemRemoteFileDialog dialog = new SystemRemoteFileDialog(SystemBasePlugin.getActiveWorkbenchShell());
dialog.open();

// Nothing is selected if the user cancels the dialog, so guard against null
IRemoteFile file = (IRemoteFile) dialog.getSelectedObject();
if (file != null) {
    System.out.println("Selected file's absolute path: " + file.getAbsolutePath());
}

Now also committed!

Update: Using proper interface for dealing with remote files (commit).

iRODS for managing next gen sequencing data in an HPC environment

My work in the UPPNEX project at UPPMAX HPC Center includes helping develop/find a solution for Next Generation Sequencing data management. The main aim is to automate data handling and make it easier for users.

One of the main challenges is how to handle all this data in a multi-project, multi-user environment where access restrictions are critical, while many users also want to share certain data, etc.

Jonas Hagberg found out about iRODS, which after some initial research seems to fit our needs very well. It also now seems to be what some big sequencing centers are focusing on right now (Sanger, Broad ...).

Basically, iRODS is a rule-oriented data management system that sits as a logical layer on top of actual file systems, provides a unified file identifier namespace, and can automate things like data migration between fast cache-like storage and longer-term archival storage, metadata tagging etc. (or, the automation itself can be controlled by manual tagging). Client access is done via the shell through the i-commands, via a web file manager interface, a FUSE module, or the Java or PHP API. All in all, iRODS looks surprisingly mature, and seems to provide good flexibility while keeping the tech stack reasonably simple.

The Sanger iRODS slides (March 2011) were very good indeed (they pretty much describe the problems that we face). There also seem to be some slides (April 2011) from a corresponding initiative at the Broad Institute.

Installing iRODS on Ubuntu mainly consists of running an automated shell script, and is done in a few minutes, so I expect to be diving into iRODS rather full time from now on.

Some iRODS Links:

Installing PDT 2.2 with XDebug 2.1.1 on Eclipse Helios on Ubuntu

(To be improved... got problems with image upload at the moment ... :/ )

After one year of very little PHP / MediaWiki / SMW coding, the time has finally come to get going a bit again. So the first thing was getting a decent PHP IDE with a debugger up and running.

Last year I used Eclipse with PDT (PHP Development Tools) and Xdebug. The installation was a bit tricky though and I had to stay with Eclipse Galileo due to some bugs, but now things have changed. With PDT 2.2 and the latest XDebug (2.1.1), installing on Ubuntu is a snap.

This is what should work:

Installing ArgoUML-Python for python code generation from UML

I've been looking into ways to generate Python code from UML. I tried Enterprise Architect from Sparx Systems, which is indeed an impressive product, but since its code generation lacked features I wished for anyway, I thought looking into the open source alternatives might be worth it.

I tried some UML tools like Umbrello, Gaphor, StarUML etc., but they were all either not mature enough, or had problems running on Linux or Linux/Wine.

I finally settled on ArgoUML, which feels mature enough. And there is the ArgoUML-python plugin for Python code generation (see also this question on StackOverflow for some background to my choice). Unfortunately there are no binaries available for download, so you have to build them yourself. That was not much of a hassle at all in Eclipse though, so I'll go through the steps here:

A graphical client for running bioinformatics tools on HPC clusters

This blog has been silent for a while and someone might wonder what I've been doing.

One answer is: developing a graphical client for non-Linux-experienced users to connect securely to a computer cluster and configure batch jobs for common bioinformatics software. The work is financed within the UPPNEX project, so the focus is primarily on analysis of Next Generation Sequencing data, but the client will be fully usable with any software installed on the cluster.

The client meets a rising need in the next generation sequencing community, since biologists generally have far less experience with *nix systems and programming than, say, physicists, while the vast amounts of sequencing data increase the need to use large scale computing resources such as the ones provided in the UPPNEX project.

I demonstrated a proof-of-concept version at the UPPNEX-SciLifeLab bioinformatics forum on Feb 22nd, and the slides are now available:

As can be seen in the slides, the client is based on the very capable Bioclipse platform.

Repository of scripts for Next Generation Sequencing (NGS) at UPPNEX

The other day we set up a repository at the UPPNEX website, for typical workflows/scripts, used for mangling Next Generation Sequencing data at the SNIC-UPPMAX HPC clusters. Only two scripts there so far, but we are expecting more to come!

We discussed different solutions for such a repository, including sites like gist.github.com and www.myexperiment.org, but settled on our own repository in order to provide maximally convenient access for the not-always-so-computer-mileaged NGS community. We'll encourage users to also publish their scripts on sites like myexperiment.org though (actually, that kind of publicity should be a good thing in general for UU).

Looping over a List<String> in Bioclipse's JS console

I had some trouble finding out how to loop over the results of a manager method in Bioclipse that returns a List<String> to Bioclipse's JavaScript console. Since I didn't find it documented anywhere (probably it is, somewhere?), I wanted to document the snippet here:

var strings = myManager.methodReturningListOfStrings(someParams);
for (var i=0; i<strings.size(); i++) {
  js.say(strings.get(i));
}

Operator and Bracket Syntax in PyDev

Loving the new operator and bracket syntax highlighting in PyDev for Eclipse! (I installed it from the nightly builds Eclipse update site, as it is included only in the upcoming 1.6.5.) See the screenshot below for my favourite highlighting scheme (creds: Rolf Lampa):