I have a new blog!

New majority voted twitter hashtag for NextGen Sequencing: #deepseq

As I concluded in a question on Biostar, there has been no real consensus on a short, non-hijacked hashtag to use for "High-Throughput Sequencing" / "Next Generation Sequencing" on social media sites such as Twitter and identi.ca.

After some community voting, a winner emerged: #deepseq (click for Twitter feed)

So, do spread the word, and start using it!

Grepping SQL dumps with endless lines? Use the fold command!

Grepping for stuff in MySQL dumps is not that nice, with miles-wide lines. You could send the grep output to a command such as "cut -c 1-200", but that would still not be guaranteed to give you the actual matched content.

Enter the "fold" command, which formats output into lines with a max count of chars:

grep "stuff" sqldump.sql | fold -w 200 | grep -C 1 "stuff"

... will give you a much better view of the context of the match!

(The first grep gets the (mile-wide) line that has the match, then fold will split the mile-wide line into 200 char long lines, and "grep -C 1" will show only the one 200 char wide line where the match is + 1 line of context before and after).
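For the curious, the same idea can be spelled out in a few lines of Java. This is only a rough sketch of what the pipeline does, not how fold or grep are implemented: the class and method names are made up, plain substring search stands in for grep's pattern matching, and characters are counted instead of display columns.

```java
import java.util.ArrayList;
import java.util.List;

class FoldContext {
    /** Splits a line into chunks of at most width characters, roughly like "fold -w width". */
    static List<String> fold(String line, int width) {
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < line.length(); start += width) {
            chunks.add(line.substring(start, Math.min(line.length(), start + width)));
        }
        return chunks;
    }

    /** Returns every chunk containing needle, plus `context` chunks before and after, roughly like "grep -C". */
    static List<String> grepWithContext(List<String> chunks, String needle, int context) {
        List<String> shown = new ArrayList<>();
        int lastShown = -1; // index of the last chunk already added, to avoid duplicates
        for (int i = 0; i < chunks.size(); i++) {
            if (!chunks.get(i).contains(needle)) {
                continue;
            }
            int from = Math.max(Math.max(0, i - context), lastShown + 1);
            int to = Math.min(chunks.size() - 1, i + context);
            for (int j = from; j <= to; j++) {
                shown.add(chunks.get(j));
            }
            lastShown = Math.max(lastShown, to);
        }
        return shown;
    }

    public static void main(String[] args) {
        // Made-up stand-in for one mile-wide line of an SQL dump
        String line = "aaaaaaaaaaXXstuffXXbbbbbbbbbb";
        for (String chunk : grepWithContext(fold(line, 10), "stuff", 1)) {
            System.out.println(chunk);
        }
    }
}
```

One thing the sketch makes visible: if the match happens to straddle a chunk boundary, the second search will not find it, and the same goes for the shell pipeline. Trying a slightly different width is a cheap workaround.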


My SMWCon Fall 2011 Talk

After doing my GSoC project for the Wikimedia Foundation / Semantic MediaWiki in 2010, resulting in the RDFIO Extension, I could finally make it to the Semantic MediaWiki Conference, which was in Berlin this week.

While I write up a longer review of the many interesting talks, you can in the meantime find the slides from my talk below, on "hooking up Semantic MediaWiki with external tools" (such as Bioclipse and R):


  • For the SMW/Bioclipse Hookup there is a status update on my blog.
  • ... with a demo screencast.
  • More info on the RDFIO extension is available on the Extension page
  • Code for the Bioclipse SMW module is available at github
  • Bioclipse website is at bioclipse.net
  • ... and the (SMW) Bioclipse wiki
  • The SMW/R hookup is not yet published in any journal, but this is what is available:
    • Egon Willighagen, who did it, has blogged it.
    • Also, the rrdf package he wrote is available on CRAN, and there's a PDF available, describing it.

Essential screen flags and shortcuts

GNU Screen is a nice little program that lets you have "terminals" that you can detach into the background. That way you can start long batch jobs that output stuff to stdout, for example, without having to be afraid of closing your terminal by accident etc.

Unfortunately screen has, IMO, quite an awkward syntax, but I managed to learn 3 flag combinations and two keyboard combinations (from inside screen) that seem to be what I need for basic usage of screen:


Start a new named screen session (detached, in the background):

screen -dmS ASessionName

List all screen sessions (attached as well as detached):
screen -ls

Re-attach a named screen session:
screen -r ASessionName


Detach the current session in background:

Ctrl + a, Ctrl + d

Close current screen session:
Ctrl + d


HPC Client Screencast: Experimental Job Config Wizard

My work at UPPMAX on the Bioclipse based HPC Client is progressing, slowly but steadily. I just screencasted an experimental version of the job configuration wizard, which loads command line tool definitions from the Galaxy workbench and uses them to generate a GUI for configuring the parameters to the command line tool in question, as well as the parameters for the Slurm resource manager (used at UPPMAX). Have a look if you want :) :

The Wizard obviously has quite some rough edges still. My current TODO is as follows:

  • Set sensible default values in widgets (i.e. when there is just 1 alt)
  • Use checkboxes and radiobuttons for select fields with few options
  • Use a progress bar between wizard pages that take time to load
  • Decide how to take care of the Cheetah #if / #else / #end if syntax, available in some Galaxy tool config files.
  • Add validation
  • Use a time widget for the job time
  • Add a custom view with just a "connect" button, and showing only remote files for the configured host.
  • More modular loading of modules (hierarchical etc.)
  • More advanced parsing of options (e.g. allowing to omit params, rather than just saying "no" on them).

Etc ... More suggestions? :)

(E)BNF parser for (parts of) the Galaxy ToolConfigs with ANTLR

As blogged earlier, I'm currently into parsing the syntax of some definitions for the parameters and stuff of command line tools. As said in the linked blog post, I was pondering whether to use the Galaxy ToolConfig format or the DocBook CmdSynopsis format. It turned out, though, that cmdsynopsis lacks the option to specify a list of valid choices for a parameter, as is possible in the Galaxy ToolConfig format (see here). Such a list can be used to generate drop-down lists in wizards etc., which is basically what I want to do ... so now I'm going with the Galaxy format after all.

Enter the Galaxy format then. Look at an example code snippet:

<tool id="sam_to_bam" name="SAM-to-BAM" version="1.1.1">
  <description>converts SAM format to BAM format</description>
    <requirement type="package">samtools</requirement>
  <command interpreter="python">
    sam_to_bam.py --input1=$source.input1
      --dbkey=${input1.metadata.dbkey}
      #if $source.index_source == "history":
        --ref_file=$source.ref_file
      #else
        --ref_file="None"
      #end if
      --output1=$output1
      --index_dir=${GALAXY_DATA_INDEX_DIR}
  </command>
  <inputs>
    <conditional name="source">
      <param name="index_source" type="select" label="Choose the source for the reference list">
        <option value="cached">Locally cached</option>
        <option value="history">History</option>
      </param>
      <when value="cached">
      ... cont ...

Here I've got some challenges. XML parsing is easy, even in Java (I use the Java XPath libs for that). But look inside the <command> tag ... that's some really non-xml stuff, no? (it is instructions for a python based template library, used in galaxy). I have to parse this though, in order to replicate the logic of it ... so what to do? ... well, I turned to the ANTLR Parser Generator.
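To back up the "XML parsing is easy" part: pulling the valid choices of a select param out of a tool config takes only a few lines with the XPath classes that ship with Java. The following is a minimal sketch and not the HPC client's actual code; the class name SelectOptions, the method optionValues and the trimmed-down XML string are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SelectOptions {
    /** Returns the value attribute of every option under the select param with the given name. */
    static List<String> optionValues(String toolConfigXml, String paramName) {
        // Naive string concatenation: fine for a sketch, not for untrusted param names
        String expr = "//param[@type='select'][@name='" + paramName + "']/option/@value";
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    expr, new InputSource(new StringReader(toolConfigXml)), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate " + expr, e);
        }
    }

    public static void main(String[] args) {
        // Trimmed-down tool config, containing only the select param
        String xml = "<tool id=\"sam_to_bam\"><inputs><conditional name=\"source\">"
                + "<param name=\"index_source\" type=\"select\" label=\"Choose the source for the reference list\">"
                + "<option value=\"cached\">Locally cached</option>"
                + "<option value=\"history\">History</option>"
                + "</param></conditional></inputs></tool>";
        System.out.println(optionValues(xml, "index_source")); // [cached, history]
    }
}
```

The returned list is the kind of thing that could feed a drop-down in a wizard page. The content of the <command> tag is a different story, since it is not XML at all.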

ANTLRWorks works nicely out of the box

I had heard a lot of good things about ANTLR, like that it is more easily debugged than typical BNF parsers etc., so the choice wasn't that hard. I tried ANTLR for Eclipse, but though it looks nice, it was quite buggy, and I couldn't get it to work properly in either Eclipse 3.5 or 3.6. So, finally I went with the easy option and developed my EBNF grammar in ANTLRWorks, which is an integrated Java app with the correct ANTLR lib already installed etc. Turned out to work really well!

The grammar I came up with so far (only for the syntax inside the <command> tag, though!) is available on GitHub ... and below (in condensed syntax to save some space), for your convenience :)

grammar GalaxyToolConfig;
options {output=AST;}
command    : binary (ifstatement param+ (ELSE param+)? ENDIF | param)*;
binary     : WORD;
WORD    : ('a'..'z'|'A'..'Z')('a'..'z'|'A'..'Z'|'.'|'_'|'0'..'9')*;
VARIABLE: '$'('{')?WORD('}')?;
STRING  : '"'('a'..'z'|'A'..'Z')+'"';
IF      : '#if';
ELSE    : '#else';
ENDIF   : '#end if';
EQ      : '=';
EQTEST  : '==';
DBLDASH : '--';
COLON   : ':';
WS      : (' '|'\t'|'\r'|'\n') {$channel=HIDDEN;};

Suggestions for improvements? :) ... Then go ahead and mail me (samuel dot lampa at gmail dot com).

Also, see a little screenshot from ANTLRWorks below:

ANTLRWorks Screenshot

As you can see in the screenshot, the different parts have correctly been identified as "param", "if statement" and so forth. You can also see how I can click in the test syntax to see where in the parse tree that part appears.

When done, I just exported the resulting parser code in ANTLRWorks with "Generate > Generate Code", copied the code from the "output" folder into my Eclipse project, added the antlr-3.3 jar into the build path of it, and then ran the __Test__.java file that comes with the output.

I wanted to do a little more parsing in my test though, so I ended up with this little test code:

package net.bioclipse.uppmax.galaxytoolconfigparser;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;
import org.antlr.runtime.tree.CommonTree;
import org.antlr.runtime.tree.DOTTreeGenerator;
import org.antlr.runtime.tree.Tree;
import org.antlr.stringtemplate.StringTemplate;
public class ParseTest {
    // Generated stuff from ANTLR, which I can use to recognize token types
    public static final int EOF=-1;
    public static final int ELSE=4;
    public static final int ENDIF=5;
    public static final int WORD=6;
    public static final int IF=7;
    public static final int STRING=8;
    public static final int VARIABLE=9;
    public static final int EQTEST=10;
    public static final int COLON=11;
    public static final int DBLDASH=12;
    public static final int EQ=13;
    public static final int WS=14;
    public static void main(String[] args) throws RecognitionException {
        String testString = "    sam_to_bam.py"
                + "      --input1=$source.input1\n"
                + "      --dbkey=${input1.metadata.dbkey}\n"
                + "      #if $source.index_source == \"history\":\n"
                + "        --ref_file=$source.ref_file\n"
                + "      #else\n"
                + "        --ref_file=\"None\"\n"
                + "      #end if\n"
                + "      --output1=$output1\n"
                + "      --index_dir=${GALAXY_DATA_INDEX_DIR}\n";
        // Lexer and parser classes are the ones generated by ANTLRWorks
        CharStream charStream = new ANTLRStringStream(testString);
        GalaxyToolConfigLexer lexer = new GalaxyToolConfigLexer(charStream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        GalaxyToolConfigParser parser = new GalaxyToolConfigParser(tokenStream, null);
        System.out.println("Starting ...");
        // GalaxyToolConfigParser.command_return command = parser.command();
        CommonTree tree = (CommonTree)parser.command().getTree();
        System.out.println("Done executing command ...");
        // Print the text and token type of each child of the root node
        for (int i = 0; i < tree.getChildCount(); i++) {
            Tree subTree = tree.getChild(i);
            System.out.println("Subtree text: " + subTree.getText() + ", (Token type: " + subTree.getType() + ")");
        }
        // Generate DOT Syntax tree
        //DOTTreeGenerator gen = new DOTTreeGenerator();
        //StringTemplate st = gen.toDOT(tree);
        //System.out.println("Tree: \n" + st);
    }
}

... generating this output:

Starting ...
Done executing command ...
Subtree text: sam_to_bam.py, (Token type: 6)
Subtree text: --, (Token type: 12)
Subtree text: input1, (Token type: 6)
Subtree text: =, (Token type: 13)
Subtree text: $source.input1, (Token type: 9)
Subtree text: --, (Token type: 12)
Subtree text: dbkey, (Token type: 6)
Subtree text: =, (Token type: 13)
Subtree text: ${input1.metadata.dbkey}, (Token type: 9)
Subtree text: #if, (Token type: 7)
Subtree text: $source.index_source, (Token type: 9)
Subtree text: ==, (Token type: 10)
Subtree text: "history", (Token type: 8)
Subtree text: :, (Token type: 11)
Subtree text: --, (Token type: 12)
Subtree text: ref_file, (Token type: 6)
Subtree text: =, (Token type: 13)
Subtree text: $source.ref_file, (Token type: 9)
Subtree text: #else, (Token type: 4)
Subtree text: --, (Token type: 12)
Subtree text: ref_file, (Token type: 6)
Subtree text: =, (Token type: 13)
Subtree text: "None", (Token type: 8)
Subtree text: #end if, (Token type: 5)
Subtree text: --, (Token type: 12)
Subtree text: output1, (Token type: 6)
Subtree text: =, (Token type: 13)
Subtree text: $output1, (Token type: 9)
Subtree text: --, (Token type: 12)
Subtree text: index_dir, (Token type: 6)
Subtree text: =, (Token type: 13)
Subtree text: ${GALAXY_DATA_INDEX_DIR}, (Token type: 9)

... seemingly I have the stuff I need, for doing some logic parsing now! :)
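As an illustration of what such "logic parsing" could look like, here is one possible next step, sketched under heavy assumptions and not taken from the actual project: walk the flat list of tokens and keep only those in the branch of the conditional that applies. The Tok class is a hypothetical stand-in for the ANTLR tree nodes, the class and method names are made up, and the sketch only handles the flat, non-nested #if ... #else ... #end if shape seen in the output above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CommandResolver {
    // Token type numbers, copied from the generated constants in the test class above
    static final int ELSE = 4, ENDIF = 5, IF = 7;

    /** Hypothetical stand-in for an ANTLR tree node: just a token type and its text. */
    static final class Tok {
        final int type;
        final String text;
        Tok(int type, String text) { this.type = type; this.text = text; }
    }

    /**
     * Returns the texts of the tokens that survive the conditionals, given the
     * variable values in vars. Assumes the token order
     * IF VARIABLE EQTEST STRING COLON ... (ELSE ...)? ENDIF, without nesting.
     */
    static List<String> resolve(List<Tok> toks, Map<String, String> vars) {
        List<String> kept = new ArrayList<>();
        boolean keep = true;
        int i = 0;
        while (i < toks.size()) {
            Tok t = toks.get(i);
            if (t.type == IF) {
                String variable = toks.get(i + 1).text; // e.g. $source.index_source
                String expected = toks.get(i + 3).text; // e.g. "history", quotes included
                keep = expected.equals("\"" + vars.get(variable) + "\"");
                i += 5; // skip IF VARIABLE EQTEST STRING COLON
            } else if (t.type == ELSE) {
                keep = !keep;
                i++;
            } else if (t.type == ENDIF) {
                keep = true;
                i++;
            } else {
                if (keep) {
                    kept.add(t.text);
                }
                i++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Tok> toks = Arrays.asList(
            new Tok(7, "#if"), new Tok(9, "$source.index_source"), new Tok(10, "=="),
            new Tok(8, "\"history\""), new Tok(11, ":"),
            new Tok(12, "--"), new Tok(6, "ref_file"), new Tok(13, "="), new Tok(9, "$source.ref_file"),
            new Tok(4, "#else"),
            new Tok(12, "--"), new Tok(6, "ref_file"), new Tok(13, "="), new Tok(8, "\"None\""),
            new Tok(5, "#end if"));
        Map<String, String> vars = new HashMap<>();
        vars.put("$source.index_source", "cached");
        System.out.println(String.join("", resolve(toks, vars))); // --ref_file="None"
    }
}
```

With index_source set to "cached", only the #else branch survives, which is the behaviour a wizard would need to replicate when assembling the final command line.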

Some words about BNF

ANTLR is an (E)BNF parser generator. I had heard a little about BNF before, and was more or less scared off from the topic, thinking it looked too advanced, but really, I found it isn't that hard at all!

It strikes me that BNF is pretty much RegEx with functions added, which allows for recursive pattern matching, which you'll need for anything more advanced, such as nested braces/XML tags etc. ... but as you can also see in the example above, much of the pattern matching syntax actually has big similarities to RegEx.
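To make the "RegEx with functions" point concrete, here is a small hand-written sketch in Java (not ANTLR output, and the class name is made up). The grammar rule group : '{' group* '}' ; becomes a method that calls itself, which is what lets it follow braces to any nesting depth.

```java
class NestedBraces {
    private final String input;
    private int pos = 0;

    private NestedBraces(String input) {
        this.input = input;
    }

    /** Returns true if the whole input matches: input : group* EOF ; */
    static boolean matches(String input) {
        NestedBraces parser = new NestedBraces(input);
        while (parser.group()) {
            // consume top-level groups one after another
        }
        return parser.pos == input.length();
    }

    // group : '{' group* '}' ;
    private boolean group() {
        int start = pos;
        if (!accept('{')) {
            return false;
        }
        while (group()) {
            // zero or more nested groups: the rule calls itself here
        }
        if (accept('}')) {
            return true;
        }
        pos = start; // no closing brace: undo what this rule consumed
        return false;
    }

    /** Consumes the next character if it equals c. */
    private boolean accept(char c) {
        if (pos < input.length() && input.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matches("{{}{}}")); // true
        System.out.println(matches("{{}"));    // false
    }
}
```

A parser generator such as ANTLR produces code along these lines from the grammar, so the rules read like regex fragments while the generated methods take care of the recursion.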

In terms of tutorials, for the (E)BNF/ANTLR combo at least, I'd highly recommend this set of screencasts on using ANTLR in Eclipse. Though I didn't use the Eclipse version, these screencasts quickly give you an idea of how it all works ... I watched at least a bunch of them, and I'm happy I did.

Exercise in XSLT RegEx: (Partial) Galaxy ToolConfig to DocBook CmdSynopsis conversion

As blogged about before, I was interested in knowing the difference between the Galaxy Toolconfig, and the DocBook cmdsynopsis format, for the purpose of automatically generating wizards (see an example that I screencasted here) to fill in the required parameters to command line tools. To quickly get some hands-on experience with the formats, I started creating an XSLT transformation from galaxy toolconfig format to the docbook cmdsynopsis format.

I quite quickly realized some important differences, such as that cmdsynopsis lacks the ability to specify a list of possible/valid options for a parameter, which could be used for creating drop-downs in the wizards. But apart from that, the little work on the transformation I had already done when realizing this, actually was a nice little exercise in using regex with xslt. Look at the command tag content in this excerpt of a Galaxy ToolConfig XML file:

<tool id="sam_to_bam" name="SAM-to-BAM" version="1.1.1">
  <description>converts SAM format to BAM format</description>
    <requirement type="package">samtools</requirement>
  <command interpreter="python">
    sam_to_bam.py --input1=$source.input1
      --dbkey=${input1.metadata.dbkey}
      #if $source.index_source == "history":
        --ref_file=$source.ref_file
      #else
        --ref_file="None"
      #end if
      --output1=$output1
      --index_dir=${GALAXY_DATA_INDEX_DIR}
  </command>
  <inputs>
    <conditional name="source">
      <param name="index_source" type="select" label="Choose the source for the reference list">
        <option value="cached">Locally cached</option>
        <option value="history">History</option>
      </param>
      <when value="cached">
        <param name="input1" type="data" format="sam" label="SAM File to Convert">
           <validator type="unspecified_build" />
           <validator type="dataset_metadata_in_file" filename="sam_fa_indices.loc" metadata_name="dbkey" metadata_column="1" message="Sequences are not currently available for the specified build." line_startswith="index" />
        </param>
      </when>
      <when value="history">
        <param name="input1" type="data" format="sam" label="Convert SAM file" />
        <param name="ref_file" type="data" format="fasta" label="Using reference file" />
      </when>
    </conditional>
  </inputs>
  <outputs>
    <data format="bam" name="output1" label="${tool.name} on ${on_string}: converted BAM" />
  </outputs>
</tool>

... you see that in the command tag, the actual syntax of the command is specified in a kind of "free text" format ... This might not be exactly what one would think of using XSLT transformations for, but together with the regex functionality in XSLT 2.0 you definitely have this option too. Helped by this article on xml.com, I put together this little XSLT stylesheet for parsing up the free text content of that command tag (haven't got to the more detailed config inside the inputs tag in the Galaxy format, but might not need to either, if staying with the Galaxy format anyway):

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="xml" indent="yes" encoding="UTF-8" />
    <xsl:template match="/">
            <xsl:apply-templates select="tool/command" />
    </xsl:template>
    <xsl:template match="tool/command">
            <xsl:value-of select="@interpreter" />
        <xsl:for-each select='tokenize(
                                  replace(
                                      ...,
                                      "[ ]+",
                                      " "),
                                  ...)'>
        <xsl:if test='matches(.,"\{")!=true()'>
            <arg>
                <xsl:value-of select='replace(.,"=.*","")'></xsl:value-of>
                <xsl:if test='matches(.,".*=.*")'>
                    <xsl:text> </xsl:text>
                    <replaceable>
                        <xsl:value-of select='replace(.,".*=\s*\$?","")'></xsl:value-of>
                    </replaceable>
                </xsl:if>
            </arg>
        </xsl:if>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>

... a bit crazy with all these nested regex replace function calls, no? :) ... but, I can tell you, it actually works very well! I found it easier to work with than many other regex implementations (e.g. matching newlines could be done with "\n", which I think you can't do by default in some other ones).

I can also mention that the tokenize function splits a string into an "array" of the parts between the parts that are matched by the expression given to tokenize (similar to "split" in some other languages, like Python).
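For comparison, here is a rough Java counterpart of the tokenize-then-replace approach. It is plain Java rather than XSLT, the class name is made up, and String.split only plays the role of tokenize here, so take it as an analogy rather than a translation of the stylesheet. The two replaceAll patterns are the ones used in the stylesheet.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class TokenizeDemo {
    /** Splits text on runs of whitespace and returns the parts in between. */
    static List<String> tokenize(String text) {
        String trimmed = text.trim(); // a leading separator would otherwise give an empty first token
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        String command = "  --input1=$source.input1\n"
                + "  --dbkey=${input1.metadata.dbkey}\n"
                + "  --ref_file=\"None\"\n";
        for (String token : tokenize(command)) {
            if (token.contains("{")) {
                continue; // skip tokens with braces, like the matches() test in the stylesheet
            }
            String name = token.replaceAll("=.*", "");           // keep what is before the "="
            String value = token.replaceAll(".*=\\s*\\$?", "");  // keep what is after "=", minus a leading "$"
            System.out.println(name + " -> " + value);
        }
    }
}
```

Run as is, this prints one line for --input1 (with value source.input1) and one for --ref_file (with value "None"), while the --dbkey token is skipped because of its braces.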

The result of the transformation? Here it goes:

<?xml version="1.0" encoding="UTF-8"?>
   <arg>--input1 <replaceable>source.input1</replaceable>
   <arg>--ref_file <replaceable>source.ref_file</replaceable>
   <arg>--ref_file <replaceable>"None"</replaceable>
   <arg>--output1 <replaceable>output1</replaceable>

Not perfect (there are double "--ref_file" arguments still), but at least it has parsed up the different arguments, removed some galaxy specific stuff (the parts enclosed by "{}") and the conditional statements. At least I think it shows that xslt + regex is actually an option, don't you think? :)

A caveat here though: I found out that most of the XSLT processor tools for Ubuntu (xsltproc, xalan, the one built into php5) don't accept XSLT 2.0 features such as regex, so I ended up using the java based saxon processor.

To call it for doing a transformation, you simply go (when using the open source "home edition"):

java -jar saxon9he.jar [xml-file] [xslt-file] > [output-file]

Works well! (does a good job of formatting the XML too).

FIMS Project Status update - Thinking about CLI wrapper XML formats

Time for a little more "overview" like status update of the Bioclipse HPC Client part of the FIMS Project I'm working on for UPPNEX at UPPMAX.

Right now I'm hacking away on the batch job config wizard (just fixed a "select remote file" dialog, for file path parameters, which I actually even screencasted :=)).

Otherwise, I'm getting to the stage where I need to make a decision about command line tool wrapper formats.

So far I've tried to use the Galaxy (bioinformatics portal) XML format, hoping to take advantage of the vast number of already wrapped bioinformatics tools (actually, I'm using the format now - that is what drives the wizard in the Bioclipse screencast above).

Unfortunately though, I figured out that most (if not all?) tool configs in Galaxy do not wrap the command line tool itself, but rather an accompanying script file (python/bash/perl), that does some additional stuff (different for each tool), which makes it hard to use the tool configs right away outside the galaxy platform.

So, realizing this, I'm considering whether we should go for something even more general, for this light-weight batch config wizard (which is not trying to be a complete replacement for Galaxy).

I actually just got to know about another such format (via a question on Stack Overflow). Apparently the docbook package contains such a format! So, in case it turns out that there are lots of ready-made docbook definitions for a bunch of CLI tools already, this could be quite interesting ... so that's something I'll look into now. Otherwise, I might as well stick with the Galaxy format (have to admit though that the docbook one feels like the more general choice ... or what do you think?).

Then one could of course also have converters between the DocBook and Galaxy xml formats too ... should be pretty straight-forward with XSLT.

Ok, so that's where I am, and what I'm thinking about right now! Feel free to drop a line of feedback!

Got "select remote file" to work in the Cluster Job Config Wizard

As hinted about in this blog post, I'm working on a cluster batch job configuration wizard for Bioclipse, making heavy use of the TM / RSE components for Eclipse.

In the wizard, I of course wanted to be able to fire up a file selection dialog for filling in fields for file paths on the cluster. Only problem was that the RSE API is (as usual for Java projects) not of the simplest kind, and as a newcomer I found it a bit challenging to find where to start.

As pointed out in this short blog post, I finally found the (simple) solution, and here we go (the final code needed some additions though, but in principle it was simple):

Some more info can be found at the project's wiki page.

Opening a remote file selection dialog with the RSE for Eclipse

This was easier than expected. Helped by the RSE File UI API Docs and this forum post, I figured out how to do it:

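SystemRemoteFileDialog dialog = new SystemRemoteFileDialog(SystemBasePlugin.getActiveWorkbenchShell());
dialog.open(); // show the dialog and wait until the user confirms or cancels
IRemoteFile file = (IRemoteFile) dialog.getSelectedObject();
if (file != null) { // guard for the case where nothing was selected
    System.out.println("Selected file's absolute path: " + file.getAbsolutePath());
}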

Now also committed!

Update: Using proper interface for dealing with remote files (commit).