Samuel's Tech Blog

Primary links

I have a new blog!

» Go to my new blog!

Go speedup with threading, for line by line string processing

Fri, 2013-06-21 10:54 | by Samuel Lampa

UPDATE, 2013004: With a slightly modified code, and 3 threads (see basecompl_par2.go below), the numbers are now improved: 52%, resp. 80% speedup, depending on how you count, so, almost doubling the speed!

Note: Use this G+ thread if you want to comment/discuss!

Note II: Of course, in reality, for this kind of basic bioinformatics processing, one would use a pre-existing package, like BioGo instead!

Bioinformaticians process a lot of text based files (although the situation is improving and some better formats are emerging, slowly).

The text formats in question typically need to be parsed line-by-line, for practical reasons and for getting clean, modular and understandeable code. I was thus interested to see how much speedup one could gain for such a string processing task, by using the very easy concurrency / parallelization features of the Go programming language.

As test data I used the same 58MB fasta file (uncompressed) as used in my previous explorations.

As a simple test code, I made a very "dumb" code that just computes the DNA base complement back and fourth 4 times (I only do the 4 common letters, no other logic), in order to emulate a 4-step string processing workflow.

Find the code here.

This code does not do any creation of a new string, or string copying, as part of the processing, so it would not be a good imitaion of such processing (which exists, in the form of for example DNA-to-Protein sequence conversion etc). Thus I also created a version of the code that explicitly does string copy. How dumb it might be, I included it in the test.

Find this code here.

Finally, I made a parallel version, that spawns go routines for each of the processing steps, which in turn are multiplexed upon 2 threads (set in the program), which turned out to be the optimal for my 1.6GHz Core i5 MacBook air.

Find this parallel code here.

The results are as follows:

The data in numbers (click for code):

basecompl_copy.go: 6.98s
basecompl.go: 5.84s
basecompl_par.go: 4.23s
basecompl_par2.go: 3.87s

That is: For the parallel version:

51% speedup (cutting 34% exec time) compared to the sequential one without string copy (basecompl.go)
80% speedup (cutting 45% time) compared to the one that does string copy anyway (basecompl_copy.go).

Even though one would of course dream about even better speedups, this is not too bad IMO, given that this is the 1st (or second) time in my life that I implement a parallel program (and I picked up Go just 1-2 weeks ago!) and given that this problem mostly consists of disk access and data shuffling, which is hard to parallellize anyway.

Tags:

Moar Languagez! GC content in Python, D, FPC, C++ and C (and now Go)

Fri, 2013-05-10 17:16 | by Samuel Lampa

My two previous blog posts (this one and this one) on comparing the performance of a number of languages for a simple but very typical bioinformatics problem, spurred a whole bunch of people to suggest improvements. Nice! :) Many of the comments can be read in the comments on those posts, as well as in the G+ thread, while I got improvements sent by mail from Thomas Hume (thomas.hume -at- labri.fr) and Daniel Spångberg.

Anyway, here comes an updated post, with a whole array of improvements, and a few added languages.

One suggestion I got (by Thomas Koch, I think), was applicable to all the code examples that I got sent to me, namely to count GC, and AT separately, and add to the total sum only in the end. It basically cut a few percent of all execution times, so I went through all the code variants and added it. It did improve the speed in a simipar way for all the codes.

Many people suggested to read the whole file in once instead of reading line by line though. This can cause problems with the many very large files in bioinformatics, so I have divided the results below in solutions that read line by line, and those that read more than that.

Among the code examples I got, was "our own" Daniel Spångberg of UPPMAX, who wrote a whole bunch of C variants, even including an OpenMP parallelized variant, which takes the absolute lead. Nice! :)

More credits in the explanation of the codes, under each of the results sections.

Ok, so let's go to the results:

Performance results

Reading line by line

Here are first the codes that read only line by line. I wanted to present them first, to give a fair comparison.

Python	1.692
Cython	1.684
Go (8g)	1.241
Go (gcc)	0.883
Go Tbl (gcc)	0.785
Go Tbl (8g)	0.767
PyPy	0.612
FPC (MiSchi)	0.597
D (GDC)	0.589
FPC	0.586
FPC (leledumbo)	0.567
D (DMD)	0.54
D Tbl (DMD)	0.492
D (LDC)	0.479
C++ (HA)	0.385
D Tbl (GDC)	0.356
Go Tbl (RP) (8g)	0.337
Go Tbl (RP) Ptr (8g)	0.319
D Tbl (LDC)	0.317
C (DS)	0.276
C (TH)	0.252
C (DS Tbl)	0.183

Explanation of code, with links

All of the below codes have have been modified to benefit from the tip from Thomas Koch (thomas -at- koch.ro), to cound AC and GC separately in the inner loop, and sum up only in the end.

Click the language code to see the code!

Python: My code, with improvents from lelledumbo (here on the blog).
Cython: Same as "Python", compiled to C with Cython, with statically typed integers.
Go (8g): Go code compiled with inbuilt Go compiler.
Go (gcc): Go code compiled with GCC.
Go Tbl (gcc): Go code, table optimized, and compiled with GCC.
Go Tbl (8g): Go code, table optimized, and compiled with inbuilt Go compiler.
PyPy: My python code compiled with PyPy (1.9.0 with GCC 4.7.0)
FPC (MiSchi): FreePascal version improved by MiSchi (Tonne Schindler)
D (GDC): My D code, with many improvements suggested from the folks in the G+ thread. Compiled with GDC.
FPC: My FPC code
FPC (leledumbo): FreePascal version improved by Leledumbo (leledumbo_cool -at- yahoo.co.id)
D Tbl (DMD): My table optimized D code, compiled with DMD.
D (DMD): My D code, compiled with DMD.
D Tbl (GDC): My table optimized D code, compiled with GDC.
D (LDC): My D code, compiled with LDC.
C++ (HA): Harald Aschiz's C++ code.
Go Tbl (RP) (8g): Roger Peppe's optimized Go code, omitting string conversion and using range.
Go Tbl (RP) Ptr (8g): Roger Peppe's optimized Go code, additionally using pointers for the table opt.
D Tbl (LDC): My table optimized D code, compiled with LDC.
C (DS): Daniel Spångberg's C code (See his even faster ones in the next section!)
C (TH): Thomas Hume's C code.
C (DS Tbl): Daniel Spångberg's line-by-line version, with table optimization.

Reading more lines, or whole file

Here are the codes that don't keep themselves to reading line by line, but do more than so. For example Gerdus van Zyl's pypy-optimized code reads a bunch of lines at a time (1k seems optimal), before processing them, while still allowing to process in a line-by-line fashion. Then there is Daniel Spångberg's exercise in C performance, where by reading the file in at once, and applying some smart C tricks, and even doing some OpenMP-parallelization, cuts down the execution time under 100 ms, and that is for a 1million line file, on a rather slow MacBook air. Quite impressive!

PyPy (GvZ)	PyPy	PyPy (GvZ) + Opt	C (DS Whole)	C (DS Whole Tbl)	C (DS Whole Tbl Par)	C (DS Whole Tbl 16bit)	C (DS Whole Tbl Par 16bit)	C (DS Whole Tbl Par 16bit Unroll)
0.862s	0.626s	0.551s	0.178s	0.114s	0.095s	0.084s	0.070s	0.066s

Explanation of code, with links

Note: All of the below codes have been modified to benefit from the tip from Thomas Koch (thomas -at- koch.ro), to cound AC and GC separately in the inner loop, and sum up only in the end.

PyPy (GvZ): Gerdus van Zyl:s code optimized for PyPy.
PyPy (GvZ) + Opt: Gerdus van Zyl:s code optimized for PyPy, with even more optimal number of lines per time (1k inst of 10k).
C (DS Whole): Daniel Spångberg's C code, when reading whole file.
C (DS Whole Tbl): Like "C (DS Whole), but with Daniel's next improvement: Using a table, to avoid the comparisons in the inner loop.
C (DS Whole Tbl Par): Like "C (DS Whole Tbl), but with Daniel's final improvement: Parallelizing it with OpenMP.

C (DS Whole Tbl 16bit): Like DS Whole Tbl, but optimized with reading 16bit at a time, rather than 8.
C (DS Whole Tbl Par 16bit): Like the above, but OpenMP parallelized.
C (DS Whole Tbl Par 16bit Unroll): Like the above, but adding -funroll-all-loops to the compiler.

Conclusions

I think there are a number of conclusions one can draw from this:

C is in it's own class, when it comes to speed, which is seen in both of the categories above.
C++ is also fast, but D with the right compiler is not far behind.
Generally D and FPC is very close!
Pure python is not near the compiled ones, but surprisingly (for me), by using PyPy, it becomes totally comparable with compiled languages!
If you look both at the speed, and the compactness and readability of code at the same time, I find D to be a very strong competitor in this game, but at the same time, Go is very close, and already has a [code.google.com/p/biogo/ bioinformatics package (biogo)]!

What do you think?

Update May 11: Daniel Spångberg now provided a few even more optimized versions: DS Tbl: a table optimized version of the line-by-line reading, as well as optimized versions of the ones that read the whole file at once. On the fastest one, I also tried with -funroll-all-loops to gcc, to see how far we could get. The result (for a 58MB text file): 66 milliseconds!

Update May 16: Thanks to Franzivan Bezerra's post here, I now got some Go code here as well. I modified Francivan's gode a bit, to comply with our latest optimizations here, as well as created a table optimized version (similar to Daniel Spångberg's Table optimized C code). To give a fair comparison of table optimization, I also added a table optimized D version, compiled with DMD, GDC and LDC. Find the results in the first graph below.

Update II, May 16: The link to the input file, used in the examples, got lost. It can be downloaded here!

Update III, May 16: Thanks to Roger Peppe, suggesting some improved versions in this thread on G+, we now got two more, very fast Go versions, clocking even slightly below the idiomatic C++ code. Have a look in the first graph and table below!

Update IV, May 16: Now applied equivalent optimizations like Roger's ones for Go mentioned above, to the "D Tbl" versions (re-use the "line" var as buffer to readln(), and use foreach instead of for loop). Result: D and Go gets extremely similar numbers: 0.317 and 0.319 was the best I could get for D and Go, respectively. See updated graphs below.

Update May 17: See this post on how Francivan Bezerra acheived same performance as C with D, beating the other two (Go and Java) in the comparison. (Direct link to GDoc)

Tags:

GC content continued: Python vs D vs C++ vs Free Pascal

Thu, 2013-05-09 22:57 | by Samuel Lampa

My previous blog post, comparing a simple character counting algorithm between python and D, created an interesting discussion on Google+, with many tuning tips for the D version, and finally we also had a C++ version (kindly provided by Harald Achitz).

Then I could not resist to try out something I've been thinking about for a while (comparing D with Free Pascal (FPC)), so I went away and jotted down an FPC version as well (my first FPC code ever, I think).

Anyway, I now just wanted to summarize a comparison between all of these! I have added below the python code, the faster of the two D versions in my previous post, as well as the new additions: Harald's C++ version and my FPC version.

Performance results

Python	D (DMD)	D (GDC)	D (LDC)	C++	FPC
7.904s	0.652s	0.668s	0.538s	0.399s	0.629s

Numbers are execution time in seconds, for counting the fraction of "G" and "C" characters compared to "A", "T", "C" and "T" in a 58 MB text file of nearly 1 million lines of text. And so, why not plot it as well:

My comments

Not surprising maybe that C++ is the fastest. I find it interesting though that both D and FPC perform in the same ballpark at least, with code written by a novice user (although with some minor tweaking).

For me personally, I also got a first answer on the question I've been wondering about for a while: Will D or FreePascal be faster, given that Pascal is an older and more mature language, and doesn't use a Garbage collector, while D on the other hand have better compiler support, which it can benefit from (like LLVM).

Since I've been betting the most on D lately, I find it interesting to see that still, D can perform better, at least with the right compiler. Still I'm impressed that FPC is totally on par with D, and beaten only when D is compiled with the LLVM backed LDC.

On another note though, I tend to like the compactness of the D code. It's not really that far from the python version, if omitting the closing curly brackets, no? Not bad for a statically compiled high performance language! :) (That is, while FreePascal has it's merits from using mostly natural language words, over quirky symbols, which IMO can enhance readability a bit).

The code

Python version

import re
import string
 
def main():
    file = open("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa","r")
    gcCount = 0
    totalBaseCount = 0
    for line in file:
        line = line.strip("\n")
        if not line.startswith(">"):
            gcCount += len(re.findall("[GC]", line))
            totalBaseCount += len(re.findall("[GCTA]", line))
    gcFraction = float(gcCount) / totalBaseCount
    print( gcFraction * 100 )
 
 
if __name__ == '__main__':
    main()

D version

import std.stdio;
import std.string;
import std.algorithm;
import std.regex;
 
void main() {
    File file = File("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa","r");
    int gcCount = 0;
    int totalBaseCount = 0;
    string line;
    char c;
    while (!file.eof()) {
        line = file.readln();
        if (line.length > 0 && !startsWith(line, ">")) {
            for(uint i = 0; i < line.length; i++) {
                c = line[i];
                if (c == 'C' || c == 'T' || c == 'G' || c == 'A') {
                    totalBaseCount++;
                    if ( c == 'G' || c == 'C' ) {
                        gcCount++;
                    }
                }   
            }
        }
    }
    float gcFraction = ( cast(float)gcCount / cast(float)totalBaseCount );
    writeln( gcFraction * 100 );
}

C++ version

(Kindly shared by Harald Achitz!)

#include <string>
#include <fstream>
#include <iostream>
 
int main() {
 
    std::fstream fs("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa", std::ios::in);
 
    int gcCount = 0;
    int totalBaseCount = 0;
 
    if ( fs.is_open() )  {
        std::string line;
        while (std::getline(fs, line)) {
            if(*(line.begin()) != '>') {
                for(std::string::const_iterator pos = line.begin(); pos!=line.end() ; ++pos){
                     switch (*pos){
                       case 'A' :
                         totalBaseCount++;
                         break;
 
                       case 'C' :
                         gcCount++;
                         totalBaseCount++;
                         break;
 
                       case 'G' :
                         gcCount++;
                         totalBaseCount++;
                         break;
 
                       case  'T' :
                         totalBaseCount++;
                         break;
 
                       default:
                         break;
                     }
                }
          }
    }
    //std::cout << "gcCount: " << gcCount << " / totalBaseCount: "<< totalBaseCount << " = " ;
    std::cout << ( (float)gcCount / (float)totalBaseCount ) * 100 << std::endl ;
  } else {
    std::cout << "can't open file" ;
  }
  return 0 ;
}

Free Pascal (FPC)

program gc;
{$mode objfpc} // Do not forget this ever
 
uses
    Sysutils;
 
var
    FastaFile: TextFile;
    CurrentLine: String;
    GCCount: Integer;
    TotalBaseCount: Integer;
    i: Integer;
    c: Char;
    GCFraction: Single;
 
begin
    GCCount := 0;
    TotalBaseCount := 0;
 
    AssignFile(FastaFile, 'Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa'); 
    {$I+} //use exceptions
    try  
        Reset(FastaFile);
        repeat
            Readln(FastaFile, CurrentLine); 
            i := 0;
            while i < Length(CurrentLine) do
            begin
                c := CurrentLine[i];
                if (c = 'A') or (c = 'G') or (c = 'C') or (c = 'T')  then
                begin
                    TotalBaseCount := TotalBaseCount + 1;
                    if (c = 'G') or (c = 'C') then
                        GCCount := GCCount + 1;
                end;
                i := i + 1;
            end;
        until(EOF(FastaFile)); 
        CloseFile(FastaFile);
    except
        on E: EInOutError do
        begin
            Writeln('File handling error occurred. Details: '+E.ClassName+'/'+E.Message);
        end;    
    end;
    GCFraction := GCCount / TotalBaseCount;
    Writeln(FormatFloat('00.0000', GCFraction * 100));
end.

Compile flags

I used the following compile commands / flags for the test above (from the Makefile):

d:
        dmd -ofgc_d_dmd -O -release -inline gc.d
        gdmd -ofgc_d_gdc -O -release -inline gc.d
        ldmd2 -ofgc_d_ldc -O -release -inline gc.d
cpp:
        g++ -ogc_cpp -O3 gc.cpp
 
fpc:
        fpc -ogc_fpc -Ur -O3 -TLINUX gc.pas

Let me know if there is other flags that should be used, for some of the above compilers!

Tags:

Calculating GC content in python and D - How to get 10x speedup in D

Thu, 2013-05-09 06:07 | by Samuel Lampa

Update May 16: More optimizations and languages have been added here, and here (last one includes Python, D, FreePascal, C++, C and Go).

I'm testing to implement some simple bioinformatics algos in both python and D, for comparison.

My first test, of simply calculating the GC content (fraction of DNA letters G and C, as compared to G, C, T and A), reveals some hints about where D shines the most.

Based on this very limited test, D:s strength in terms of performance, seems to be more clear when you hand-code your inner loops, than when using some of the functionality of the standard library (phobos).

When using the countchars function in std.string of D:s standard library, D is on par with, but a tiny bit slower than python. When I hand-code the inner loop more carefully though, I get a 10x speedup of the D code, over my python code.

Test data

As test data, I use a 58MB Fasta file, containing approximately 1 million (989561) lines of DNA sequence.

Python reference implementation

As a reference, I wrote this little python script. Probably there is a faster way to write it, but this is the fastest I could come up with.

import re
import string
 
def main():
    file = open("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa","r")
    gcCount = 0
    totalBaseCount = 0
    for line in file:
        line = line.strip("\n")
        if not line.startswith(">"):
            gcCount += len(re.findall("[GC]", line))
            totalBaseCount += len(re.findall("[GCTA]", line))
    gcFraction = float(gcCount) / totalBaseCount
    print( gcFraction * 100 )
 
 
if __name__ == '__main__':
    main()

This takes around 8 seconds on my machine:

[samuel gc]$ time python gc_test.py 
37.6217301394
 
real    0m7.904s
user    0m7.876s
sys     0m0.020s

Then, I tried two implementations in D. First I tried to make my life as simple as possible, by using existing standard library functionality (the countchars function in std.string):

D implementation, using countchars

import std.stdio;
import std.string;
import std.algorithm;
import std.regex;
 
void main() {
    File file = File("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa","r");
    int gcCount = 0;
    int totalBaseCount = 0;
    string line = "";
    while (!file.eof()) {
        line = chomp(file.readln());
        if (!startsWith(line, ">")) {
            gcCount += countchars(line, "[GC]");
            totalBaseCount += countchars(line, "[GCTA]");
        }
    }
    float gcFraction = ( cast(float)gcCount / totalBaseCount );
    writeln( gcFraction * 100 );
}

The results: slightly slower than python with DMD, and slightly faster with GDC and LDC:

[samuel gc]$ dmd -ofgc_slow_dmd -O -release -inline gc_slow.d
[samuel gc]$ gdmd -ofgc_slow_gdc -O -release -inline gc_slow.d
[samuel gc]$ ldmd2 -ofgc_slow_ldc -O -release -inline gc_slow.d
[samuel gc]$ for c in gc_slow_{dmd,gdc,ldc}; do echo "---"; echo "$c:"; time ./$c; done;
---
gc_slow_dmd:
37.6217
 
real    0m8.088s
user    0m8.049s
sys     0m0.032s
---
gc_slow_gdc:
37.6217
 
real    0m6.791s
user    0m6.764s
sys     0m0.020s
---
gc_slow_ldc:
37.6217
 
real    0m7.138s
user    0m7.096s
sys     0m0.036s

D implementation with hand-written inner loop

On the other hand, when I hand code the string comparison code, like this:

import std.stdio;
import std.string;
import std.algorithm;
import std.regex;
 
void main() {
    File file = File("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa","r");
    int gcCount = 0;
    int totalBaseCount = 0;
    string line;
    char c;
    while (!file.eof()) {
        line = file.readln();
        if (line.length > 0 && !startsWith(line, ">")) {
            for(uint i = 0; i < line.length; i++) {
                c = line[i];
                if (c == 'C' || c == 'T' || c == 'G' || c == 'A') {
                    totalBaseCount++;
                    if ( c == 'G' || c == 'C' ) {
                        gcCount++;
                    }
                }   
            }
        }
    }
    float gcFraction = ( cast(float)gcCount / cast(float)totalBaseCount );
    writeln( gcFraction * 100 );
}

... then I get some significant speedup over python, around 10x (13x with the fastest combination of compiler and optimization flags)!:

[samuel gc]$ dmd -ofgc_fast_dmd -O -release -inline gc_fast.d
[samuel gc]$ gdmd -ofgc_fast_gdc -O -release -inline gc_fast.d
[samuel gc]$ ldmd2 -ofgc_fast_ldc -O -release -inline gc_fast.d
[samuel gc]$ for c in gc_faster_{dmd,gdc,ldc}; do echo "---"; echo "$c:"; time ./$c; done;
---
gc_faster_dmd:
37.6217
 
real    0m0.652s
user    0m0.624s
sys     0m0.024s
---
gc_faster_gdc:
37.6217
 
real    0m0.668s
user    0m0.644s
sys     0m0.020s
---
gc_faster_ldc:
37.6217
 
real    0m0.538s
user    0m0.520s
sys     0m0.016s

Version info

Machine: MacBook air "11, with 1.7GHz Core i5 (dual core).
OS: Lubuntu 12.10 32bit
Python: 2.7.3
DMD version: 2.062
GDC(GCC?) version: 4.6.3
LDC: 0.10.0 (based on DMD 2.0.60 and LLVM 3.2)

Tags:

Commandline git diff that highlights exactly what is changed

Fri, 2013-04-19 11:28 | by Samuel Lampa

Inpired by this great post on how to use the meld diff tool to get a nice GUI view of git diffs, I started looking for a way to get a similarly good result on the command line. That is, a diff that does not just show the whole lines that are changed, but that highlights the actually changed parts of each line. Here I document what I found.

It turned out I could even get something better, than meld, in that I can view both the removed and added part in the same view (color coded with red/green) which my brain find much easier to interpret.

So, to the steps.

First I specify that git should use some external tool for it's diffs, by editing my ~/.gitconfig file and adds this to the bottom:

[diff]
    external = git-wdiff

Then, I go on and create that tool, "git-wdiff", in /usr/local/bin/git-wdiff, and put this text into it:

#!/bin/bash
wdiff -n $2 $5|colordiff|sed -r 's/(\{\+|\+\}|\[\-|\-\])//'|less

You'll need to install wdiff and colordiff (available in ubuntu repos).

What it does is:

wdiff produces a syntax where each deleted chunk of word(s) is surrounded with "[-" and "-]", and each added chunk with "{+" and "+}". (the "-n" flag is to prevent diff chunks to span over newlines, which will cause problems to colordiff).
colordiff gives nice red/green (depending on your terminal color sheme) coloring of this output
the sed part removes those [- ]- and {+ +} parts, so that you only have the color coding left.
less just adds a nice pager, for easy scrolling.
See the results below:

EDIT: [http://chem-bla-ics.blogspot.se/ Egon Willighagen] hinted me at another way to do this, where you don't even need colordiff, but can do the coloring directly with wdiff, by using it's facility for adding arbitrary start and end sequences around the changed words. Then the command in /bin/git-wdiff would become:

#!/bin/bash
diff -u $2 $5 | wdiff -n -d -w $'\033[30;31m' -x $'\033[0m' -y $'\033[30;32m' -z $'\033[0m' | less

The codes for the colors might need some tweaking to fit your terminal's color scheme. (More info about that [http://en.wikipedia.org/wiki/ANSI_escape_code#Colors here] ).

Tags:

Simple command line python "IDE" with ipython, screen and vim

Thu, 2013-04-11 12:15 | by Samuel Lampa

I often need to develop python scripts on some remote server where I can't run graphical python IDEs like spyder.

I'm too lazy to set up an advanced vim config with full blown IDE-like features (except for some basic python support).

I have found that a much simpler solution is a GNU screen session with two vertically split screens, one for the main coding in vim, and a smalelr one below, for running ipython, to run and debug the script as it is written.

I found it useful enough to figure out how to start such a setup with one command. This is how to do it:

create a file ~/.screenrc.pydev and place the following in it:

split
screen
focus
screen
resize 20
exec ipython
focus
exec vim

then add this to the bottom of your ~/.bashrc or ~/.bash_aliases:

alias pydev='screen -mS PyDev -c ~/.screenrc.pydev'

... and source the file:

source ~/.bashrc

Now, you can start your command line python environment by:

cd some-folder-with-python-files
pydev

The result:

Oh, and to jump between the screen windows, do:

[Ctrl] + [A], [Tab]

Tags:

Using the Chemical Translation Service with Python to get Inchis from chemical names

Fri, 2013-03-22 16:29 | by Samuel Lampa

I needed to convert a bunch of chemical compound name into International Chemical Identifiers (Inchis), to enable easily creating links to various web services and databases that take inchis as input, such as Chembl.

I found out the very useful Chemical Translation Service, which has nice GUIs for doing this manually. In order to do this in a more automated fashion for many compounds though, I realized I'd have to script it up a bit, (in python of course).

I decided to make use of the XML format of the translation service. I have had mixed experiences with both messing with urls, and parsing xml, in python before, so I was very happy to get to know two new python packages that focus on providing a straightforward API that is "usable to humans", requests and xmltodict.

They turned out to be great combination, and IMO the conversion becomes a quite readable bunch of code lines:

# Base URL of the Chemical Translation Service
base_url = "http://cts.fiehnlab.ucdavis.edu/transform/transform"
 
# Create a dictionary with the query parameters
query_params = { "format" : "xml",
                   "extension" : "xml",
                   "to" : "inchikey",
                   "idValue" : query_compound_name,
                   "from" : "name"}
 
# Execute the query
response = requests.get(base_url, params=query_params)
 
# Parse the XML into a python dict (array) structure
xmldict = xmltodict.parse(response.text)
 
# Extract the Inchi key from the array structure
chem_data = xmldict['compoundResultSets']['compoundResultSet']
inchi_key = chem_data['inchiHashKey']

And, why not make it complete with command line flags and stuff:

Tags:

Impressions from the first D Coders Stockholm meetup

Wed, 2013-03-20 23:51 | by Samuel Lampa

The first meetup of the D Coders Stockholm meetup.com group is just finished.(and the first meetup.com D meetup ever, as far as we know! :) ). I thought to share some impressions from the meetup.

I was pleasantly surprised that just a couple of weeks after the group started we gathered a group of five D coders for a real meetup, in relatively "small" city like Stockholm. Additionally, theoretically we could have doubled the number, based on the ten members of the D Coders Stockholm meetup group already.

So, what did we do or talk about? A good part of the meeting consisted of doing rounds and sharing each of ours experiences with D coding in different aspects.

At the end, Oscar and Daniel Brynolf (brothers) showed off some quite veru impressive work they have done on a game engine. I'll leave it to them selves to how much they want to talk about it, and not tell too much about it here though.

Anyways, the group turned out to be committed to contiuing with meetups small and large, and we were discussing various ways this could be done.

The main ideas circled around a combination of show-off meetings where we can show stuff we have done and discuss and share experiences, hackathons, as well as simpler "just meetups" over a beer/coffee" type of meetups.

In terms of interest areas, many in the group were interested in game development, but most also seemed to have also a general interest for all things D. One concrete idea for a "hackathon" subject, for example, was to create a group website in vibe.d (promising web framework in D).

Practically we concluded to try to collect ideas for meetup/hackathon topics in an "idea bucket", in the D Coders Stockholm meetup group forum, and when enough topics are gathered, turn it into a concrete meetup. So, let the ideas come! Looking forward to seeing more D in Stockholm/Sweden in the future! :)

Tags:

D
Hackathon

D is great for small utility programs

Thu, 2013-02-28 20:37 | by Samuel Lampa

Look how little code is needed to read from stdin, and do something with eash line of the output, in the D programming language:

import std.stdio;
 
void main() {
  foreach(line; stdin.byLine()) {
    writeln("Got input line: ", line);
  }
}

Compile the above code with:

dmd dtest.d -of"dtest"

... and pipe some output to the newly created binary:

ls -l | ./dtest

This should make D a perfect language for creating small utility apps for parsing text, that you can pipe to!

Tags:

Primary links

Tags

Go speedup with threading, for line by line string processing

Moar Languagez! GC content in Python, D, FPC, C++ and C (and now Go)

Performance results

Reading line by line

Explanation of code, with links

Reading more lines, or whole file

Explanation of code, with links

Conclusions

GC content continued: Python vs D vs C++ vs Free Pascal

Performance results

My comments

The code

Python version

D version

C++ version

Free Pascal (FPC)

Compile flags

Calculating GC content in python and D - How to get 10x speedup in D

Test data

Python reference implementation

D implementation, using countchars

D implementation with hand-written inner loop

Version info

Commandline git diff that highlights exactly what is changed

Simple command line python "IDE" with ipython, screen and vim

Using the Chemical Translation Service with Python to get Inchis from chemical names

Impressions from the first D Coders Stockholm meetup

Top D programming language links

D is great for small utility programs

Recent blog posts