iRODS for managing next gen sequencing data in an HPC environment

My work in the UPPNEX project at the UPPMAX HPC Center includes helping to develop or find a solution for Next Generation Sequencing data management. The main aim is to automate data handling and make it easier for users.

One of the main challenges is how to handle all this data in an environment with multiple projects and users, where access restrictions are critical but many users also want to share certain data.

Jonas Hagberg found out about iRODS, which after some initial research seems to fit our needs very well. It also seems to be what some big sequencing centers (Sanger, Broad ...) are focusing on right now.

Basically, iRODS is a rule-oriented data management system that sits as a logical layer on top of the actual file systems and provides a unified namespace of file identifiers. It can automate things like data migration between fast, cache-like storage and longer-term archive storage, metadata tagging, etc. (or the automation itself can be controlled by manual tagging). Client access is available from the shell through the i-commands, through a web-based file manager, a FUSE module, or the Java or PHP APIs. All in all, iRODS looks surprisingly mature, and seems to provide good flexibility while keeping the tech stack reasonably simple.
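
To make the "rule-oriented logical layer" idea a bit more concrete, here is a minimal toy model in Python. It is not the iRODS API or its rule language, just a sketch of the concept: every name in it (DataObject, archive_rule, apply_rules, the "cache" and "archive" resources, the example paths and the archive=yes tag) is made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    logical_path: str   # stable identifier that users see
    resource: str       # physical storage currently holding the data
    metadata: dict = field(default_factory=dict)


def archive_rule(obj):
    """Hypothetical rule: objects tagged archive=yes should leave the fast cache."""
    if obj.resource == "cache" and obj.metadata.get("archive") == "yes":
        return "archive"
    return None


def apply_rules(catalog, rules):
    """Run the rules over every object; return the logical paths that were migrated."""
    moved = []
    for obj in catalog:
        for rule in rules:
            target = rule(obj)
            if target is not None:
                obj.resource = target            # physical location changes ...
                moved.append(obj.logical_path)   # ... the logical path does not
                break
    return moved


# Hypothetical catalog: one object manually tagged for archiving, one untagged.
catalog = [
    DataObject("/uppnex/projA/run1.fastq", "cache", {"archive": "yes"}),
    DataObject("/uppnex/projA/run2.fastq", "cache"),
]
moved = apply_rules(catalog, [archive_rule])
print(moved)
```

The point of the sketch is that a manual tag triggers the migration, while the path users refer to stays the same no matter which storage actually holds the file.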

The Sanger iRODS slides (March 2011) were very good indeed (they describe much the same problems that we face). There also seem to be some slides (April 2011) from a corresponding initiative at the Broad Institute.

Installing iRODS on Ubuntu mainly involves running an automated shell script and takes only a few minutes, so I expect to be diving into iRODS more or less full time from now on.

Some iRODS Links:

iRODS Website
iRODS Fact sheet
iRODS Primer - Comprehensive guide to (seemingly) all you need to know
Update: See also a very interesting article in Bio-IT World

Samuel's Tech Blog
