Lessons learned from UPPNEX just published in GigaScience

I forgot to blog about it, but let me at least put the link here to our new GigaScience paper, summarizing our "Lessons learned from implementing a national infrastructure in Sweden for storage and analysis of next-generation sequencing data":


UPPNEX reaching 1 PB of NGS storage

My frequent need to produce plots and graphs of different statistics at UPPMAX finally forced me to learn some R (which is a good thing). With the help of RStudio, I finally got started.

My first plot was an overview of the Next Generation Sequencing (NGS) storage at UPPNEX, from its start in 2009/2010 until today. It's reaching 1 PB even though this graph does not even include all data! (Temporary working data is excluded, because it is difficult to track NGS and non-NGS data separately.) You can imagine some exponential component there, no? :)


UPPMAX arranges NextGen Sequencing Cloud/Hadoop hackathon in May

If you are a next-gen sequencing bioinformatician interested in getting some hands-on experience with new cloud-based technologies for computation, mainly around the Hadoop framework, and you will be around Uppsala at the end of May 2012, you may want to have a look at this:

As the event page on UPPMAX states:
"The hackathon will focus on next challenges that cloud adoption poses: massively distributed data processing frameworks such as Hadoop, distributed cloud databases and distributed bioinformatics applications."

Based on my experience from a very useful (as in getting new hands-on experience) and interesting (as in making new contacts) hackathon at CSC in Helsinki, Finland, I am sure this will be a highly interesting and useful hackathon as well, for everyone faced with the challenges of big sequencing data.

Apply before April 30 to get (EU/COST action: SeqAhead) funding! See you in Uppsala at the end of May!

HPC Client Screencast: Experimental Job Config Wizard

My work at UPPMAX on the Bioclipse-based HPC Client is progressing, slowly but steadily. I just screencasted an experimental version of the job configuration wizard, which loads command line tool definitions from the Galaxy workbench and uses them to generate a GUI for configuring the parameters of the command line tool in question, as well as the parameters for the Slurm resource manager (used at UPPMAX). Have a look if you want :) :

The wizard obviously still has quite a few rough edges. My current TODO is as follows:

  • Set sensible default values in widgets (e.g. when there is just one alternative)
  • Use checkboxes and radiobuttons for select fields with few options
  • Use a progress bar between wizard pages that take time to load
  • Decide how to take care of the Cheetah #if/#else/#end if syntax, available in some Galaxy tool config files.
  • Add validation
  • Use a time widget for the job time
  • Add a custom view with just a "connect" button, and showing only remote files for the configured host.
  • More modular loading of modules (hierarchical etc.)
  • More advanced parsing of options (e.g. allowing params to be omitted, rather than just saying "no" to them).

Etc ... More suggestions? :)
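
For readers who want a feel for the idea behind the wizard, here is a minimal Python sketch (the client itself is Bioclipse-based, so this is not its actual code). It reads the parameters from a tool definition written in the style of a Galaxy tool config and fills in a Slurm batch script. The tool definition, the function names and the chosen #SBATCH settings are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from string import Template

# Illustrative tool definition in the style of a Galaxy tool config.
# The tool id, command line and parameters are invented for this sketch.
TOOL_XML = """
<tool id="example_tool" name="Example tool">
  <command>example_tool --input $input --mode $mode</command>
  <inputs>
    <param name="input" type="data" label="Input file"/>
    <param name="mode" type="select" label="Mode">
      <option value="fast">Fast</option>
      <option value="sensitive">Sensitive</option>
    </param>
  </inputs>
</tool>
"""


def read_tool(tool_xml):
    """Return the command template and a list of parameter descriptions."""
    root = ET.fromstring(tool_xml)
    command = (root.findtext("command") or "").strip()
    params = []
    for param in root.findall("./inputs/param"):
        params.append({
            "name": param.get("name"),
            "type": param.get("type"),
            "label": param.get("label"),
            # Only select fields carry options; other types get an empty list.
            "options": [opt.get("value") for opt in param.findall("option")],
        })
    return command, params


def build_batch_script(command, values, account, partition, ntasks, time_limit):
    """Fill the $name placeholders and prepend Slurm #SBATCH directives."""
    filled = Template(command).substitute(values)
    lines = [
        "#!/bin/bash",
        f"#SBATCH -A {account}",     # project/account to charge
        f"#SBATCH -p {partition}",   # partition to run in
        f"#SBATCH -n {ntasks}",      # number of tasks
        f"#SBATCH -t {time_limit}",  # wall-clock time limit
        "",
        filled,
    ]
    return "\n".join(lines) + "\n"
```

A GUI would build one widget per entry returned by read_tool (for instance a drop-down for a "select" param) and pass the chosen values on to build_batch_script. The Cheetah conditionals mentioned in the TODO list are not handled here.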

FIMS Project Status update - Thinking about CLI wrapper XML formats

Time for a more overview-like status update on the Bioclipse HPC Client part of the FIMS project I'm working on for UPPNEX at UPPMAX.

Right now I'm hacking away on the batch job config wizard (just fixed a "select remote file" dialog, for file path parameters, which I actually even screencasted :=)).

Otherwise, I'm getting to the stage where I need to make a decision about command line tool wrapper formats.

So far I've tried to use the Galaxy (bioinformatics portal) XML format, hoping to take advantage of the vast number of already wrapped bioinformatics tools. (Actually, I'm using the format now - that is what drives the wizard in the Bioclipse screencast above.)

Unfortunately though, I figured out that most (if not all?) tool configs in Galaxy do not wrap the command line tool itself, but rather an accompanying script file (Python/Bash/Perl) that does some additional stuff (different for each tool), which makes it hard to use the tool configs right away outside the Galaxy platform.
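
One rough way to see which tool configs are affected is to look at the command element. The Python sketch below assumes that a wrapper script is marked with an interpreter attribute on the command element, as in some Galaxy tool configs; configs that call a script without that attribute would slip through. The function name and the example XML in the usage note are made up.

```python
import xml.etree.ElementTree as ET


def wrapped_script(tool_xml):
    """Return (interpreter, script) if the command runs a wrapper script.

    Returns None when the <command> element is missing or has no
    "interpreter" attribute, i.e. when it looks like a direct tool call.
    """
    command = ET.fromstring(tool_xml).find("command")
    if command is None:
        return None
    interpreter = command.get("interpreter")
    if interpreter is None:
        return None
    words = (command.text or "").split()
    # The first word of the command text is taken to be the script file.
    script = words[0] if words else None
    return interpreter, script
```

For example, a config whose command element reads `<command interpreter="python">wrapper.py $input</command>` yields ("python", "wrapper.py"), while a plain `<command>sometool -i $input</command>` yields None.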

So, realizing this, I'm considering whether we should go for something even more general, for this light-weight batch config wizard (which is not trying to be a complete replacement for Galaxy).

I actually just got to know about another such format (via a question on Stack Overflow). Apparently the DocBook package contains such a format! So, if it turns out that there are already lots of ready-made DocBook definitions for a bunch of CLI tools, then this could be quite interesting ... so that's something I'll look into now. Otherwise, I might as well stick with the Galaxy format. (I have to admit though that the DocBook one feels like the more general choice ... or what do you think?)

Then one could of course also have converters between the DocBook and Galaxy XML formats too ... should be pretty straightforward with XSLT.

Ok, so that's where I am, and what I'm thinking about right now! Feel free to drop a line of feedback!

iRODS for managing next gen sequencing data in an HPC environment

My work in the UPPNEX project at UPPMAX HPC Center includes helping develop/find a solution for Next Generation Sequencing data management. The main aim is to automate data handling and make it easier for users.

One of the main challenges is how to handle all this data in a multi-project, multi-user environment where access restrictions are critical, while many users also want to share certain data, etc.

Jonas Hagberg found out about iRODS, which after some initial research seems to fit our needs very well. It also now seems to be what some big sequencing centers are focusing on right now (Sanger, Broad ...).

Basically, iRODS is a rule-oriented data management system that sits as a logical layer on top of the actual file systems, provides a unified file identifier namespace, and can automate things like data migration between fast cache-like storage and longer-term archival storage, metadata tagging etc. (or, the automation itself can be controlled by manual tagging). Client access is done via the shell through the i-commands, via a web file manager interface, a FUSE module, or a Java or PHP API. All in all, iRODS looks surprisingly mature, and seems to provide good flexibility while keeping the tech stack reasonably simple.

The Sanger iRODS slides (March 2011) were very good indeed (they describe very much the problems that we face). There also seem to be some slides (April 2011) from a corresponding initiative at the Broad Institute.

Installing iRODS on Ubuntu mainly consists of running an automated shell script and is done in a few minutes, so I expect to be diving into iRODS more or less full time from now on.
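
To give an idea of what the manual tagging could look like from the shell side, here is a small Python sketch that only builds the argument lists for two i-commands: iput (upload a file into a collection) and imeta (attach an attribute/value pair to a data object). It does not run anything. The function names are invented for this sketch, and any paths, attribute names or values you pass in are up to your own setup.

```python
def iput_args(local_path, collection):
    """Arguments for uploading a local file into an iRODS collection."""
    return ["iput", local_path, collection]


def imeta_add_args(data_object, attribute, value):
    """Arguments for tagging a data object (-d) with an attribute/value pair."""
    return ["imeta", "add", "-d", data_object, attribute, value]
```

On a machine with the i-commands installed and an initialized iRODS session, such a list could be handed to subprocess.run. Whether a given tag then triggers any automation depends entirely on the rules configured on the server.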

Some iRODS Links:

A graphical client for running bioinformatics tools on HPC clusters

This blog has been silent for a while and someone might wonder what I've been doing.

One answer is: developing a graphical client for non-Linux-experienced users to connect securely to a computer cluster and configure batch jobs for common bioinformatics software. The project is financed within the UPPNEX project, so the focus is primarily on analysis of Next Generation Sequencing data, but the client will be fully usable with any software installed on the cluster.

The client meets a rising need in the next generation sequencing community, since biologists generally have far less experience with *nix systems and programming than, say, physicists, while the vast amounts of sequencing data increase the need to use large-scale computing resources such as the ones provided in the UPPNEX project.

I demonstrated a proof-of-concept version at the UPPNEX-SciLifeLab bioinformatics forum on Feb 22nd, and the slides are now available:

As can be seen in the slides, the client is based on the very capable Bioclipse platform.