DocBook

Exercise in XSLT RegEx: (Partial) Galaxy ToolConfig to DocBook CmdSynopsis conversion

As blogged about before, I was interested in knowing the difference between the Galaxy Toolconfig, and the DocBook cmdsynopsis format, for the purpose of automatically generating wizards (see an example that I screencasted here) to fill in the required parameters to command line tools. To quickly get some hands-on experience with the formats, I started creating an XSLT transformation from galaxy toolconfig format to the docbook cmdsynopsis format.

I quite quickly realized some important differences, such as that cmdsynopsis lacks the ability to specify a list of possible/valid options for a parameter, which could be used for creating drop-downs in the wizards. But apart from that, the little work on the transformation I had already done when realizing this, actually was a nice little exercise in using regex with xslt. Look at the command tag content in this excerpt of a Galaxy ToolConfig XML file:

<tool id="sam_to_bam" name="SAM-to-BAM" version="1.1.1">
  <description>converts SAM format to BAM format</description>
  <requirements>
    <requirement type="package">samtools</requirement>
  </requirements>
  <command interpreter="python">
    sam_to_bam.py
      --input1=$source.input1
      --dbkey=${input1.metadata.dbkey} 
      #if $source.index_source == "history":
        --ref_file=$source.ref_file
      #else
        --ref_file="None"
      #end if
      --output1=$output1
      --index_dir=${GALAXY_DATA_INDEX_DIR}
  </command>
  <inputs>
    <conditional name="source">
      <param name="index_source" type="select" label="Choose the source for the reference list">
        <option value="cached">Locally cached</option>
        <option value="history">History</option>
      </param>
      <when value="cached">
        <param name="input1" type="data" format="sam" label="SAM File to Convert">
           <validator type="unspecified_build" />
           <validator type="dataset_metadata_in_file" filename="sam_fa_indices.loc" metadata_name="dbkey" metadata_column="1" message="Sequences are not currently available for the specified build." line_startswith="index" />
        </param>
      </when>
      <when value="history">
        <param name="input1" type="data" format="sam" label="Convert SAM file" />
        <param name="ref_file" type="data" format="fasta" label="Using reference file" />
      </when>
    </conditional>
  </inputs>
  <outputs>
    <data format="bam" name="output1" label="${tool.name} on ${on_string}: converted BAM" />
  </outputs>
</xml>

... you see that in the command tag, the actual syntax of the command is specified in a kind of "free text" format ... This might not be exactly what one might think to use XSLT transformations for, but together with the regex functionality in XSLT 2.0 you definitely has this option too. Helped by this article on xml.com, I put together this little XSLT stylesheet for parsing up the free text content of that command tag (haven't got to the more detailed config inside the inputs-tag in the galaxy format, but might not need either, if staying with the galaxy format anyway):

<?xml version="1.0"?>
 
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
 
    <xsl:output method="xml" indent="yes" encoding="UTF-8" />
 
    <xsl:template match="/">
        <cmdsynopsis>
            <xsl:apply-templates select="tool/command" />
        </cmdsynopsis>
    </xsl:template>
 
    <xsl:template match="tool/command">
        <command>
            <xsl:value-of select="@interpreter" />
        </command>
        <xsl:for-each select='tokenize(
                                    replace(
                                        replace(
                                            replace(
                                                replace(
                                                    .,
                                                    "[ ]+",
                                                    ""),
                                                "\n#[^\s]+",
                                                ""),
                                            "\n+",
                                            " "),
                                        "(^\s+|\s+$)",
                                        ""),
                                    "\s")'>
        <xsl:if test='matches(.,"\{")!=true()'>
            <arg>
                <xsl:value-of select='replace(.,"=.*","")'></xsl:value-of>
                <xsl:if test='matches(.,".*=.*")'>
                    <xsl:text> </xsl:text>
                    <replaceable>
                        <xsl:value-of select='replace(.,".*=\s*\$?","")'></xsl:value-of>
                    </replaceable>
                </xsl:if>
            </arg>
        </xsl:if>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>

... a bit crazy with all these nested regex replace function calls, no? :) ... but, I can tell you, it actually works very good! Found it easier to work with than many other regex implementations (i.e. matching newlines could be done with "\n", which I think you can't do by default in some other ones).

I can also mention that the tokenize function splits a string into an "array" of the parts between the parts that is matched by the expression given to tokenize (similar to "split" in some other languages, like python).

The result of the transoformation? Here it goes:

<?xml version="1.0" encoding="UTF-8"?>
<cmdsynopsis>
   <command>python</command>
   <arg>sam_to_bam.py</arg>
   <arg>--input1 <replaceable>source.input1</replaceable>
   </arg>
   <arg>--ref_file <replaceable>source.ref_file</replaceable>
   </arg>
   <arg>--ref_file <replaceable>"None"</replaceable>
   </arg>
   <arg>--output1 <replaceable>output1</replaceable>
   </arg>
</cmdsynopsis>

Not perfect (there are double "--ref_file" arguments still), but at least it has parsed up the different arguments, removed some galaxy specific stuff (the parts enclosed by "{}") and the conditional statements. At least I think it shows that xslt + regex is actually an option, don't you think? :)

A caveat here though: I found out that most of the XSLT processor tools for Ubuntu (xsltproc, xalan, the one built into php5) don't accept XSLT 2.0 features such as regex, so I ended up using the java based saxon processor.

To call it for doing a transformation, you simply go (when using the open source "home edition"):

java -jar saxon9he.jar [xml-file] [xslt-file] > [output-file]

Works good! (does a good job of formatting the XML too).

FIMS Project Status update - Thinking about CLI wrapper XML formats

Time for a little more "overview" like status update of the Bioclipse HPC Client part of the FIMS Project I'm working on for UPPNEX at UPPMAX.

Right now I'm hacking away on the batch job config wizard (just fixed a "select remote file" dialog, for file path parameters, which I actually even screencasted :=)).

Otherwise, I start coming to the stage when I need to do a decision about command line tool wrapper formats.

So far I've tried to use the Galaxy (bioinformatics portal) XML format, hoping to take advantage of the vast number of already wrapped bioinformatics tools (Actually, I'm using the format now - that is what drives the wizard in the Bioclipsescreencast above).

Unfortunately though, I figured out that most (if not all?) tool configs in Galaxy do not wrap the command line tool itself, but rather an accompanying script file (python/bash/perl), that does some additional stuff (different for each tool), which makes it hard to use the tool configs right away outside the galaxy platform.

So, realizing this, I'm considering whether we should go for something even more general, for this light-weight batch config wizard (which is not trying to be a complete replacement for Galaxy).

I actually just got to know about another such format (via a question on stack overflow). Apparently the docbook-package contains such a format! So, in case one would find that there are lots of ready-made such docbook-definitions for a bunch of cli tools already, then this could be quite interesting. ... so that's something I'll check in now. Otherwise, I maybe might as well stick with the Galaxy format (Have to admit though that the docbook one feels like a more general choice, in the general sense ... or what do you think?).

Then one could of course also have converters between the DocBook and Galaxy xml formats too ... should be pretty straight-forward with XSLT.

Ok, so that's where I am, and what I'm thinking about right now! Feel free to drop a line of feedback!