XML

Using the Chemical Translation Service with Python to get InChIs from chemical names

I needed to convert a bunch of chemical compound names into International Chemical Identifiers (InChIs), to make it easy to create links to various web services and databases that take InChIs as input, such as ChEMBL.

I found the very useful Chemical Translation Service, which has nice GUIs for doing this manually. To do it in a more automated fashion for many compounds, though, I realized I'd have to script it up a bit (in Python, of course).

I decided to make use of the XML format of the translation service. I have had mixed experiences with both messing with URLs and parsing XML in Python before, so I was very happy to get to know two Python packages that focus on providing a straightforward API that is "usable to humans": requests and xmltodict.

They turned out to be a great combination, and IMO the conversion becomes a quite readable bunch of code lines:

import requests
import xmltodict

# Name of the compound to look up (example value; substitute your own)
query_compound_name = "caffeine"

# Base URL of the Chemical Translation Service
base_url = "http://cts.fiehnlab.ucdavis.edu/transform/transform"

# Create a dictionary with the query parameters
query_params = {
    "format": "xml",
    "extension": "xml",
    "to": "inchikey",
    "idValue": query_compound_name,
    "from": "name",
}

# Execute the query and fail early on an HTTP error
response = requests.get(base_url, params=query_params)
response.raise_for_status()

# Parse the XML into a nested Python dict structure
xmldict = xmltodict.parse(response.text)

# Extract the InChIKey (the "to" parameter above asks for "inchikey")
chem_data = xmldict['compoundResultSets']['compoundResultSet']
inchi_key = chem_data['inchiHashKey']

And, why not make it complete with command line flags and stuff:
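A minimal sketch of such a command-line version could look like the following. The function names (`build_query_params`, `fetch_inchi_key`, `build_parser`) and the `--timeout` and `--delimiter` flags are my own choices for illustration, not something the service prescribes. The URL, query parameters and response keys are the ones from the snippet above, and the sketch assumes the service still answers in that shape, with a single result per name.

```python
import argparse

# Same endpoint as in the snippet above
BASE_URL = "http://cts.fiehnlab.ucdavis.edu/transform/transform"


def build_query_params(name):
    """Build the query parameters for translating one compound name."""
    return {
        "format": "xml",
        "extension": "xml",
        "to": "inchikey",
        "idValue": name,
        "from": "name",
    }


def fetch_inchi_key(name, timeout=30.0):
    """Query the service for one name and return its InChIKey."""
    # Third-party packages, imported here so that --help works without them
    import requests
    import xmltodict

    response = requests.get(BASE_URL, params=build_query_params(name),
                            timeout=timeout)
    response.raise_for_status()
    xmldict = xmltodict.parse(response.text)
    # Assumes the response layout used in the snippet above
    chem_data = xmldict["compoundResultSets"]["compoundResultSet"]
    return chem_data["inchiHashKey"]


def build_parser():
    """Define the (hypothetical) command-line interface."""
    parser = argparse.ArgumentParser(
        description="Look up InChIKeys for chemical names.")
    parser.add_argument("names", nargs="*",
                        help="compound names to translate")
    parser.add_argument("--timeout", type=float, default=30.0,
                        help="seconds to wait for each request")
    parser.add_argument("--delimiter", default="\t",
                        help="separator between name and key in the output")
    return parser


def main(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.names:
        # Nothing to look up: show the usage text instead
        parser.print_help()
        return 0
    for name in args.names:
        key = fetch_inchi_key(name, timeout=args.timeout)
        print(name, key, sep=args.delimiter)
    return 0


if __name__ == "__main__":
    main()
```

Saved under a name of your choice, say name2inchikey.py, it would be called with one or more names as arguments, for example: python name2inchikey.py caffeine "acetylsalicylic acid". Names containing spaces need to be quoted in the shell.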

Exercise in XSLT RegEx: (Partial) Galaxy ToolConfig to DocBook CmdSynopsis conversion

As blogged about before, I was interested in the differences between the Galaxy ToolConfig format and the DocBook cmdsynopsis format, for the purpose of automatically generating wizards (see an example that I screencasted here) for filling in the required parameters of command-line tools. To quickly get some hands-on experience with the two formats, I started creating an XSLT transformation from the Galaxy ToolConfig format to the DocBook cmdsynopsis format.

I quite quickly realized some important differences, such as that cmdsynopsis lacks the ability to specify a list of possible/valid options for a parameter, which could be used for creating drop-downs in the wizards. But apart from that, the little work I had already done on the transformation by then turned out to be a nice exercise in using regex with XSLT. Look at the content of the command tag in this excerpt of a Galaxy ToolConfig XML file:

<tool id="sam_to_bam" name="SAM-to-BAM" version="1.1.1">
  <description>converts SAM format to BAM format</description>
  <requirements>
    <requirement type="package">samtools</requirement>
  </requirements>
  <command interpreter="python">
    sam_to_bam.py
      --input1=$source.input1
      --dbkey=${input1.metadata.dbkey} 
      #if $source.index_source == "history":
        --ref_file=$source.ref_file
      #else
        --ref_file="None"
      #end if
      --output1=$output1
      --index_dir=${GALAXY_DATA_INDEX_DIR}
  </command>
  <inputs>
    <conditional name="source">
      <param name="index_source" type="select" label="Choose the source for the reference list">
        <option value="cached">Locally cached</option>
        <option value="history">History</option>
      </param>
      <when value="cached">
        <param name="input1" type="data" format="sam" label="SAM File to Convert">
           <validator type="unspecified_build" />
           <validator type="dataset_metadata_in_file" filename="sam_fa_indices.loc" metadata_name="dbkey" metadata_column="1" message="Sequences are not currently available for the specified build." line_startswith="index" />
        </param>
      </when>
      <when value="history">
        <param name="input1" type="data" format="sam" label="Convert SAM file" />
        <param name="ref_file" type="data" format="fasta" label="Using reference file" />
      </when>
    </conditional>
  </inputs>
  <outputs>
    <data format="bam" name="output1" label="${tool.name} on ${on_string}: converted BAM" />
  </outputs>
</tool>

... you see that in the command tag, the actual syntax of the command is specified in a kind of "free text" format ... Parsing this might not be exactly what one would think of using XSLT transformations for, but with the regex functionality in XSLT 2.0 you definitely have this option too. Helped by this article on xml.com, I put together this little XSLT stylesheet for parsing the free-text content of that command tag (I haven't got to the more detailed config inside the inputs tag of the Galaxy format, but might not need to either, if staying with the Galaxy format anyway):

<?xml version="1.0"?>
 
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
 
    <xsl:output method="xml" indent="yes" encoding="UTF-8" />
 
    <xsl:template match="/">
        <cmdsynopsis>
            <xsl:apply-templates select="tool/command" />
        </cmdsynopsis>
    </xsl:template>
 
    <xsl:template match="tool/command">
        <command>
            <xsl:value-of select="@interpreter" />
        </command>
        <xsl:for-each select='tokenize(
                                    replace(
                                        replace(
                                            replace(
                                                replace(
                                                    .,
                                                    "[ ]+",
                                                    ""),
                                                "\n#[^\s]+",
                                                ""),
                                            "\n+",
                                            " "),
                                        "(^\s+|\s+$)",
                                        ""),
                                    "\s")'>
            <xsl:if test='not(matches(., "\{"))'>
                <arg>
                    <xsl:value-of select='replace(., "=.*", "")' />
                    <xsl:if test='matches(., ".*=.*")'>
                        <xsl:text> </xsl:text>
                        <replaceable>
                            <xsl:value-of select='replace(., ".*=\s*\$?", "")' />
                        </replaceable>
                    </xsl:if>
                </arg>
            </xsl:if>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>

... a bit crazy with all these nested regex replace function calls, no? :) ... but, I can tell you, it actually works very well! I found it easier to work with than many other regex implementations (e.g. newlines can be matched with "\n", which I think you can't do by default in some other ones).

I can also mention that the tokenize function splits a string into an "array" of the parts between the substrings that are matched by the expression given to tokenize (similar to "split" in some other languages, like Python).

The result of the transformation? Here it goes:

<?xml version="1.0" encoding="UTF-8"?>
<cmdsynopsis>
   <command>python</command>
   <arg>sam_to_bam.py</arg>
   <arg>--input1 <replaceable>source.input1</replaceable>
   </arg>
   <arg>--ref_file <replaceable>source.ref_file</replaceable>
   </arg>
   <arg>--ref_file <replaceable>"None"</replaceable>
   </arg>
   <arg>--output1 <replaceable>output1</replaceable>
   </arg>
</cmdsynopsis>

Not perfect (there are still duplicate "--ref_file" arguments), but at least it has parsed out the different arguments and removed some Galaxy-specific stuff (the parts enclosed by "{}") as well as the conditional statements. At least I think it shows that XSLT + regex is actually an option, don't you think? :)
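To make the chain of nested replace calls easier to follow, here is a rough Python equivalent of the same steps using the standard re module. It is only a sketch for comparison: the function name `command_to_args` and the list-of-tuples output are my own invention, and Python's regex dialect is not identical to the XPath 2.0 one, although these particular patterns should behave the same in both.

```python
import re

# The text content of the <command> tag from the ToolConfig excerpt above
COMMAND_TEXT = """
    sam_to_bam.py
      --input1=$source.input1
      --dbkey=${input1.metadata.dbkey}
      #if $source.index_source == "history":
        --ref_file=$source.ref_file
      #else
        --ref_file="None"
      #end if
      --output1=$output1
      --index_dir=${GALAXY_DATA_INDEX_DIR}
  """


def command_to_args(command_text):
    """Mirror the nested replace()/tokenize() calls of the stylesheet."""
    text = re.sub(r"[ ]+", "", command_text)  # innermost replace: drop all spaces
    text = re.sub(r"\n#\S+", "", text)        # drop the #if / #else / #end if lines
    text = re.sub(r"\n+", " ", text)          # turn newlines into spaces
    text = re.sub(r"^\s+|\s+$", "", text)     # trim both ends
    tokens = re.split(r"\s", text)            # counterpart of tokenize(..., "\s")

    args = []
    for token in tokens:
        if "{" in token:
            # Skip the Galaxy-specific ${...} parts, as the stylesheet does
            continue
        flag = re.sub(r"=.*", "", token)
        # Only tokens containing "=" carry a value; strip a leading "$" from it
        value = re.sub(r".*=\s*\$?", "", token) if "=" in token else None
        args.append((flag, value))
    return args


for flag, value in command_to_args(COMMAND_TEXT):
    print(flag, value)
```

The order of the steps matters: because all spaces are removed first, a whole directive line such as #end if collapses into a single run of non-whitespace characters, which the second pattern can then delete in one go.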

A caveat here though: I found out that most of the XSLT processors available for Ubuntu (xsltproc, Xalan, the one built into PHP 5) don't accept XSLT 2.0 features such as regex, so I ended up using the Java-based Saxon processor.

To run a transformation with it, you simply do (when using the open source "home edition"):

java -jar saxon9he.jar [xml-file] [xslt-file] > [output-file]

Works well! (It does a good job of formatting the XML too.)
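If you want to run the transformation from a script rather than by hand, a thin wrapper around the command could look like the sketch below. It assumes that java is on the PATH and that saxon9he.jar sits in the working directory; the helper names `build_saxon_command` and `run_saxon` are hypothetical. Instead of positional arguments and shell redirection, it uses Saxon's -s:, -xsl: and -o: options to name the source, stylesheet and output files.

```python
import subprocess


def build_saxon_command(xml_file, xslt_file, output_file, jar="saxon9he.jar"):
    """Assemble the argument list for a Saxon transformation."""
    return [
        "java", "-jar", jar,
        f"-s:{xml_file}",      # source document
        f"-xsl:{xslt_file}",   # stylesheet
        f"-o:{output_file}",   # where to write the result
    ]


def run_saxon(xml_file, xslt_file, output_file, jar="saxon9he.jar"):
    """Run the transformation; raises CalledProcessError if Saxon fails."""
    command = build_saxon_command(xml_file, xslt_file, output_file, jar)
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids going through a shell, so file names with spaces need no extra quoting.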

How to format an XML document lacking line breaks and indents

Install xmlstarlet in Ubuntu:

sudo apt-get install xmlstarlet

Use the formatting command:

[code that produces some xml] | xmlstarlet fo

... or, if you are generating the XML with an XSLT stylesheet, don't forget the following line right after the opening xsl:stylesheet tag:

<xsl:output method="xml" indent="yes" encoding="UTF-8" />
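The same kind of reformatting can also be done with Python's standard library, which may be handy on a machine where xmlstarlet is not installed. This is a small sketch using xml.dom.minidom; the function name `pretty_print_xml` and the sample string are made up for illustration.

```python
from xml.dom import minidom


def pretty_print_xml(xml_text, indent="  "):
    """Return xml_text re-serialized with line breaks and indentation."""
    return minidom.parseString(xml_text).toprettyxml(indent=indent)


# A document without any line breaks or indentation
compact = ("<cmdsynopsis><command>python</command>"
           "<arg>sam_to_bam.py</arg></cmdsynopsis>")
print(pretty_print_xml(compact))
```

This works best on input that really lacks whitespace between the tags. If the document is already partly indented, minidom keeps the existing whitespace as text nodes and the output tends to get extra blank lines.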