Python is a very capable scripting language, but when I was new to it, I found it rather verbose and hard to express things in concisely. I have since learned that this is mostly a matter of finding the right library, plus a little knowledge of how to best use Python's syntactic sugar.
It has therefore turned out useful to gather a bunch of snippets, as reminders of how to go about certain typical tasks.
One such task, when working in a Unix environment, is writing text processing scripts that accept input on standard input and write their output on standard output, so that the script can easily be chained together with other tools via Unix pipes.
This is easily done, demonstrated below in its (almost) shortest possible form:
```python
from sys import stdin, stdout

for line in stdin:
    stdout.write(line)
```
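If the snippet is saved as, say, `passthrough.py` (a file name chosen just for this example), it slots straight into a pipeline between other tools, e.g. `cat somefile.txt | python passthrough.py | sort`.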
Even though Python is a great platform with tons of libraries available, there is still a whole universe of functionality available only in the Java world, so you will sometimes need to bridge the two worlds together.
The easiest solution is a simple subprocess call that invokes the Java code via a system call. This does not need any external libraries at all. The downside is the startup time of the Java Virtual Machine (JVM), which is at least around 100 milliseconds even for a simple hello-world type of program.
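As a rough sketch of that subprocess approach (the class name `Hello` and the classpath here are made up for illustration):

```python
import subprocess

# Invoke a (hypothetical) Java class named Hello and capture its output.
# Every call like this pays the JVM startup cost mentioned above.
proc = subprocess.Popen(["java", "-cp", ".", "Hello"],
                        stdout=subprocess.PIPE)
output = proc.communicate()[0]
print(output)
```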
What to do then?
Ever since I stumbled upon David Beazley's excellent tutorial on Python generator functions, I have been very fond of them.
A generator function is basically a function that, rather than returning a list of items, yields a sequence of items, one at a time, each time it is resumed.
This allows them to be chained together into long pipelines that create no temporary data in memory between the different functions; instead, one item at a time is "drawn" through the chain of (generator) functions, typically by a loop over the last function in the chain. This means you can process arbitrary amounts of data with constant RAM usage, and it typically gives a slight speedup too.
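As a small made-up illustration of such a pipeline (assuming some file `somefile.txt`), here are three generator functions chained so that each line is pulled through the whole chain, one at a time, by the final loop:

```python
def read_lines(fileobj):
    for line in fileobj:
        yield line

def strip_newlines(lines):
    for line in lines:
        yield line.rstrip("\n")

def uppercase(lines):
    for line in lines:
        yield line.upper()

# Nothing is read or processed until this loop starts pulling items
# through the chain, one line at a time.
with open("somefile.txt") as f:
    for line in uppercase(strip_newlines(read_lines(f))):
        print(line)
```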
After getting to know the Go programming language, I was happy to see that implementing generator-like functions in Go is quite easy too - and if implemented as goroutines, they are even multi-threaded by default! I want to show below how to implement them.
First, a short look at Python generator functions. They can be implemented in (at least) two ways. For simplicity, say I want a generator function that takes a string and returns its letters uppercased, one at a time. The classical way to implement this in Python is to write a normal function, but swap the return statement for a yield statement, like so:
```python
def generate_uppercase_letters(input_string):
    for letter in input_string:
        yield letter.upper()
```
Then, to use this function, you can simply loop over it:
```python
for letter_upper in generate_uppercase_letters("hej"):
    print(letter_upper)
```
Which would produce:
```
H
E
J
```
Python also has a much shorter syntax for creating a generator: the generator expression, which has no Go counterpart but is worth mentioning (it is basically Python's list comprehension with the square brackets switched to normal parentheses):
```python
(letter.upper() for letter in input_string)
```
... which makes chaining together even easier:
```python
for letter_upper in (letter.upper() for letter in "hej"):
    print(letter_upper)
```
Now let's implement the same thing in Go.
```go
package main

import (
    "fmt"
    "strings"
)

func generateUpperCaseLetters(inputString string) chan string {
    // Create a channel where to send output
    outputChannel := make(chan string)
    // Launch an (anonymous) function in another thread, that does
    // the actual processing.
    go func() {
        // Loop over the letters in inputString
        for _, letter := range inputString {
            // Send an uppercased letter to the output channel
            outputChannel <- strings.ToUpper(string(letter))
        }
        // Close the output channel, so anything that loops over it
        // will know that it is finished.
        close(outputChannel)
    }()
    return outputChannel
}

func main() {
    // Loop over the letters communicated over the channel returned
    // from generateUpperCaseLetters() and print them
    for letter := range generateUpperCaseLetters("hej") {
        fmt.Println(letter)
    }
}
```
So what does this program do? Basically, in the generateUpperCaseLetters() function, it:

- creates a channel on which to send the output,
- launches an anonymous function on a new goroutine, which loops over the input string, sends each uppercased letter on the channel, and closes the channel when done,
- and returns the channel right away.
Now, when the generateUpperCaseLetters() function is executed, what you get back is not the result - the actual body of the function has not executed yet - exactly as with Python generator functions.
Only when you start ranging over the returned channel in the main() function does this "generator-like" function start converting characters to uppercase. This is down to how (unbuffered) channels work: they synchronize and communicate at the same time, so the channel is not filled with a new value until the previous one has been picked off it by the receiver. You can read much better info about that over at golang.org.
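To see the corresponding laziness on the Python side, it is enough to call the generator function from earlier and note that nothing happens until you ask for a value:

```python
gen = generate_uppercase_letters("hej")
# At this point the function body has not run at all.
print(next(gen))  # Now it runs up to the first yield and prints "H"
print(next(gen))  # Resumes and prints "E"
```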
So, in conclusion: you can quite easily implement Python-generator-like functions in Go too ... definitely not as short and succinct as in Python, but for a threaded, compiled program, it's not at all too bad IMO!
Now, to see how this can be used in practice, to speed up a chain of line-by-line processing functions by up to 65%, have a look at this previous blog post!
My two previous blog posts (this one and this one) on comparing the performance of a number of languages for a simple but very typical bioinformatics problem, spurred a whole bunch of people to suggest improvements. Nice! :) Many of the comments can be read in the comments on those posts, as well as in the G+ thread, while I got improvements sent by mail from Thomas Hume (thomas.hume -at- labri.fr) and Daniel Spångberg.
Anyway, here comes an updated post, with a whole array of improvements, and a few added languages.
One suggestion I got (from Thomas Koch, I think), applicable to all the code examples sent to me, was to count GC and AT separately, and add them up into the total sum only at the end. It cut a few percent off the execution time in a similar way for all the codes, so I went through all the variants and added it.
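In Python terms, the change amounts to something like the following sketch (my minimal illustration of the idea, not any contributor's exact code):

```python
gcCount = 0
atCount = 0
for line in open("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa"):
    if not line.startswith(">"):
        for c in line:
            if c == 'G' or c == 'C':
                gcCount += 1
            elif c == 'A' or c == 'T':
                atCount += 1
# Sum up only at the end, instead of keeping a running total per character.
totalBaseCount = gcCount + atCount
print(float(gcCount) / totalBaseCount * 100)
```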
Many people suggested reading the whole file at once instead of line by line. That can cause problems with the many very large files in bioinformatics, so I have divided the results below into solutions that read line by line, and those that read more than that.
Among the contributed code examples were those of "our own" Daniel Spångberg of UPPMAX, who wrote a whole bunch of C variants, including an OpenMP-parallelized one, which takes the absolute lead. Nice! :)
More credits in the explanation of the codes, under each of the results sections.
Ok, so let's go to the results:
Here, first, are the codes that read only line by line, presented separately to give a fair comparison.
| Language (variant) | Time (s) |
|---|---|
| Python | 1.692 |
| Cython | 1.684 |
| Go (8g) | 1.241 |
| Go (gcc) | 0.883 |
| Go Tbl (gcc) | 0.785 |
| Go Tbl (8g) | 0.767 |
| PyPy | 0.612 |
| FPC (MiSchi) | 0.597 |
| D (GDC) | 0.589 |
| FPC | 0.586 |
| FPC (leledumbo) | 0.567 |
| D (DMD) | 0.540 |
| D Tbl (DMD) | 0.492 |
| D (LDC) | 0.479 |
| C++ (HA) | 0.385 |
| D Tbl (GDC) | 0.356 |
| Go Tbl (RP) (8g) | 0.337 |
| Go Tbl (RP) Ptr (8g) | 0.319 |
| D Tbl (LDC) | 0.317 |
| C (DS) | 0.276 |
| C (TH) | 0.252 |
| C (DS Tbl) | 0.183 |
All of the below codes have been modified to benefit from the tip from Thomas Koch (thomas -at- koch.ro), to count GC and AT separately in the inner loop, and sum up only at the end.
Click the language code to see the code!
Here are the codes that don't keep to reading line by line, but do more than that. For example, Gerdus van Zyl's PyPy-optimized code reads a bunch of lines at a time (1k seems optimal) before processing them, while still allowing line-by-line processing. Then there is Daniel Spångberg's exercise in C performance, which, by reading the whole file at once, applying some smart C tricks, and even doing some OpenMP parallelization, cuts the execution time to under 100 ms - for a 1-million-line file, on a rather slow MacBook Air. Quite impressive!
| Variant | Time (s) |
|---|---|
| PyPy (GvZ) | 0.862 |
| PyPy | 0.626 |
| PyPy (GvZ) + Opt | 0.551 |
| C (DS Whole) | 0.178 |
| C (DS Whole Tbl) | 0.114 |
| C (DS Whole Tbl Par) | 0.095 |
| C (DS Whole Tbl 16bit) | 0.084 |
| C (DS Whole Tbl Par 16bit) | 0.070 |
| C (DS Whole Tbl Par 16bit Unroll) | 0.066 |
Note: All of the below codes have been modified to benefit from the tip from Thomas Koch (thomas -at- koch.ro), to count GC and AT separately in the inner loop, and sum up only at the end.
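The chunked-reading idea mentioned above can be sketched in plain Python roughly like this (my own illustration; Gerdus van Zyl's actual PyPy-tuned code differs):

```python
from itertools import islice

def chunked_lines(fileobj, chunk_size=1000):
    # Read a batch of lines at a time, but hand them out one by one,
    # so callers can keep their line-by-line processing logic.
    while True:
        chunk = list(islice(fileobj, chunk_size))
        if not chunk:
            break
        for line in chunk:
            yield line

gcCount = 0
atCount = 0
with open("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa") as f:
    for line in chunked_lines(f):
        if not line.startswith(">"):
            gcCount += line.count("G") + line.count("C")
            atCount += line.count("A") + line.count("T")
print(float(gcCount) / (gcCount + atCount) * 100)
```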
I think there are a number of conclusions one can draw from this:
What do you think?
Update May 11: Daniel Spångberg has now provided a few even more optimized versions: DS Tbl, a table-optimized version of the line-by-line reading code, as well as optimized versions of the ones that read the whole file at once. On the fastest one, I also tried adding -funroll-all-loops to gcc, to see how far we could get. The result (for a 58MB text file): 66 milliseconds!
Update May 16: Thanks to Francivan Bezerra's post here, I now have some Go code here as well. I modified Francivan's code a bit to comply with our latest optimizations, and also created a table-optimized version (similar to Daniel Spångberg's table-optimized C code). To give a fair comparison of table optimization, I also added a table-optimized D version, compiled with DMD, GDC and LDC. Find the results in the first graph below.
Update II, May 16: The link to the input file, used in the examples, got lost. It can be downloaded here!
Update III, May 16: Thanks to Roger Peppe, who suggested some improved versions in this thread on G+, we now have two more, very fast Go versions, clocking in even slightly below the idiomatic C++ code. Have a look in the first graph and table below!
Update IV, May 16: Now applied optimizations equivalent to Roger's Go ones mentioned above to the "D Tbl" versions (re-using the "line" var as a buffer to readln(), and using foreach instead of a for loop). Result: D and Go get extremely similar numbers: 0.317 and 0.319 were the best I could get for D and Go, respectively. See the updated graphs below.
Update May 17: See this post on how Francivan Bezerra achieved the same performance as C with D, beating the other two languages (Go and Java) in the comparison. (Direct link to GDoc)
My previous blog post, comparing a simple character counting algorithm between python and D, created an interesting discussion on Google+, with many tuning tips for the D version, and finally we also had a C++ version (kindly provided by Harald Achitz).
Then I could not resist to try out something I've been thinking about for a while (comparing D with Free Pascal (FPC)), so I went away and jotted down an FPC version as well (my first FPC code ever, I think).
Anyway, I now just wanted to summarize a comparison between all of these! I have added below the python code, the faster of the two D versions in my previous post, as well as the new additions: Harald's C++ version and my FPC version.
| Python | D (DMD) | D (GDC) | D (LDC) | C++ | FPC |
|---|---|---|---|---|---|
| 7.904s | 0.652s | 0.668s | 0.538s | 0.399s | 0.629s |
Numbers are execution times in seconds, for counting the fraction of "G" and "C" characters compared to all of "A", "T", "G" and "C" in a 58 MB text file of nearly 1 million lines of text. And so, why not plot it as well:
Not surprising maybe that C++ is the fastest. I find it interesting though that both D and FPC perform in the same ballpark at least, with code written by a novice user (although with some minor tweaking).
For me personally, I also got a first answer to a question I've been wondering about for a while: will D or FreePascal be faster, given that Pascal is an older and more mature language and doesn't use a garbage collector, while D on the other hand has better compiler support (like LLVM) to benefit from?
Since I've been betting the most on D lately, I find it interesting to see that D can still perform better, at least with the right compiler. Still, I'm impressed that FPC is totally on par with D, beaten only when D is compiled with the LLVM-backed LDC.
On another note though, I tend to like the compactness of the D code. It's not really that far from the Python version, if you omit the closing curly brackets, no? Not bad for a statically compiled, high-performance language! :) (FreePascal, on the other hand, has its merits in using mostly natural-language words over quirky symbols, which IMO can enhance readability a bit.)
```python
import re

def main():
    file = open("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa", "r")
    gcCount = 0
    totalBaseCount = 0
    for line in file:
        line = line.strip("\n")
        if not line.startswith(">"):
            gcCount += len(re.findall("[GC]", line))
            totalBaseCount += len(re.findall("[GCTA]", line))
    gcFraction = float(gcCount) / totalBaseCount
    print(gcFraction * 100)

if __name__ == '__main__':
    main()
```
```d
import std.stdio;
import std.string;
import std.algorithm;
import std.regex;

void main() {
    File file = File("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa", "r");
    int gcCount = 0;
    int totalBaseCount = 0;
    string line;
    char c;
    while (!file.eof()) {
        line = file.readln();
        if (line.length > 0 && !startsWith(line, ">")) {
            for (uint i = 0; i < line.length; i++) {
                c = line[i];
                if (c == 'C' || c == 'T' || c == 'G' || c == 'A') {
                    totalBaseCount++;
                    if (c == 'G' || c == 'C') {
                        gcCount++;
                    }
                }
            }
        }
    }
    float gcFraction = (cast(float)gcCount / cast(float)totalBaseCount);
    writeln(gcFraction * 100);
}
```
(Kindly shared by Harald Achitz!)
```cpp
#include <string>
#include <fstream>
#include <iostream>

int main() {
    std::fstream fs("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa", std::ios::in);
    int gcCount = 0;
    int totalBaseCount = 0;
    if (fs.is_open()) {
        std::string line;
        while (std::getline(fs, line)) {
            if (*(line.begin()) != '>') {
                for (std::string::const_iterator pos = line.begin(); pos != line.end(); ++pos) {
                    switch (*pos) {
                        case 'A': totalBaseCount++; break;
                        case 'C': gcCount++; totalBaseCount++; break;
                        case 'G': gcCount++; totalBaseCount++; break;
                        case 'T': totalBaseCount++; break;
                        default: break;
                    }
                }
            }
        }
        //std::cout << "gcCount: " << gcCount << " / totalBaseCount: " << totalBaseCount << " = ";
        std::cout << ((float)gcCount / (float)totalBaseCount) * 100 << std::endl;
    } else {
        std::cout << "can't open file";
    }
    return 0;
}
```
```pascal
program gc;

{$mode objfpc} // Do not forget this ever

uses
  Sysutils;

var
  FastaFile: TextFile;
  CurrentLine: String;
  GCCount: Integer;
  TotalBaseCount: Integer;
  i: Integer;
  c: Char;
  GCFraction: Single;

begin
  GCCount := 0;
  TotalBaseCount := 0;

  AssignFile(FastaFile, 'Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa');

  {$I+} //use exceptions
  try
    Reset(FastaFile);
    repeat
      Readln(FastaFile, CurrentLine);
      // Note: Pascal strings are 1-indexed
      i := 1;
      while i <= Length(CurrentLine) do
      begin
        c := CurrentLine[i];
        if (c = 'A') or (c = 'G') or (c = 'C') or (c = 'T') then
        begin
          TotalBaseCount := TotalBaseCount + 1;
          if (c = 'G') or (c = 'C') then
            GCCount := GCCount + 1;
        end;
        i := i + 1;
      end;
    until (EOF(FastaFile));
    CloseFile(FastaFile);
  except
    on E: EInOutError do
    begin
      Writeln('File handling error occurred. Details: ' + E.ClassName + '/' + E.Message);
    end;
  end;

  GCFraction := GCCount / TotalBaseCount;
  Writeln(FormatFloat('00.0000', GCFraction * 100));
end.
```
I used the following compile commands / flags for the test above (from the Makefile):
```makefile
d:
	dmd -ofgc_d_dmd -O -release -inline gc.d
	gdmd -ofgc_d_gdc -O -release -inline gc.d
	ldmd2 -ofgc_d_ldc -O -release -inline gc.d

cpp:
	g++ -ogc_cpp -O3 gc.cpp

fpc:
	fpc -ogc_fpc -Ur -O3 -TLINUX gc.pas
```
Let me know if there are other flags that should be used for some of the above compilers!
Update May 16: More optimizations and languages have been added here, and here (last one includes Python, D, FreePascal, C++, C and Go).
I'm experimenting with implementing some simple bioinformatics algos in both Python and D, for comparison.
My first test, of simply calculating the GC content (fraction of DNA letters G and C, as compared to G, C, T and A), reveals some hints about where D shines the most.
Based on this very limited test, D's strength in terms of performance seems clearest when you hand-code your inner loops, rather than when using some of the functionality of the standard library (Phobos).
When using the countchars function in std.string of D's standard library, D is on par with, but a tiny bit slower than, Python. When I hand-code the inner loop more carefully though, I get a 10x speedup of the D code over my Python code.
As test data, I use a 58MB Fasta file, containing approximately 1 million (989561) lines of DNA sequence.
As a reference, I wrote this little python script. Probably there is a faster way to write it, but this is the fastest I could come up with.
```python
import re

def main():
    file = open("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa", "r")
    gcCount = 0
    totalBaseCount = 0
    for line in file:
        line = line.strip("\n")
        if not line.startswith(">"):
            gcCount += len(re.findall("[GC]", line))
            totalBaseCount += len(re.findall("[GCTA]", line))
    gcFraction = float(gcCount) / totalBaseCount
    print(gcFraction * 100)

if __name__ == '__main__':
    main()
```
This takes around 8 seconds on my machine:
```
[samuel gc]$ time python gc_test.py
37.6217301394

real	0m7.904s
user	0m7.876s
sys	0m0.020s
```
Then, I tried two implementations in D. First I tried to make my life as simple as possible, by using existing standard library functionality (the countchars function in std.string):
```d
import std.stdio;
import std.string;
import std.algorithm;
import std.regex;

void main() {
    File file = File("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa", "r");
    int gcCount = 0;
    int totalBaseCount = 0;
    string line = "";
    while (!file.eof()) {
        line = chomp(file.readln());
        if (!startsWith(line, ">")) {
            gcCount += countchars(line, "[GC]");
            totalBaseCount += countchars(line, "[GCTA]");
        }
    }
    float gcFraction = (cast(float)gcCount / totalBaseCount);
    writeln(gcFraction * 100);
}
```
The results: slightly slower than python with DMD, and slightly faster with GDC and LDC:
```
[samuel gc]$ dmd -ofgc_slow_dmd -O -release -inline gc_slow.d
[samuel gc]$ gdmd -ofgc_slow_gdc -O -release -inline gc_slow.d
[samuel gc]$ ldmd2 -ofgc_slow_ldc -O -release -inline gc_slow.d
[samuel gc]$ for c in gc_slow_{dmd,gdc,ldc}; do echo "---"; echo "$c:"; time ./$c; done;
---
gc_slow_dmd:
37.6217

real	0m8.088s
user	0m8.049s
sys	0m0.032s
---
gc_slow_gdc:
37.6217

real	0m6.791s
user	0m6.764s
sys	0m0.020s
---
gc_slow_ldc:
37.6217

real	0m7.138s
user	0m7.096s
sys	0m0.036s
```
On the other hand, when I hand code the string comparison code, like this:
```d
import std.stdio;
import std.string;
import std.algorithm;
import std.regex;

void main() {
    File file = File("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa", "r");
    int gcCount = 0;
    int totalBaseCount = 0;
    string line;
    char c;
    while (!file.eof()) {
        line = file.readln();
        if (line.length > 0 && !startsWith(line, ">")) {
            for (uint i = 0; i < line.length; i++) {
                c = line[i];
                if (c == 'C' || c == 'T' || c == 'G' || c == 'A') {
                    totalBaseCount++;
                    if (c == 'G' || c == 'C') {
                        gcCount++;
                    }
                }
            }
        }
    }
    float gcFraction = (cast(float)gcCount / cast(float)totalBaseCount);
    writeln(gcFraction * 100);
}
```
... then I get a significant speedup over Python: around 10x (13x with the fastest combination of compiler and optimization flags):
```
[samuel gc]$ dmd -ofgc_fast_dmd -O -release -inline gc_fast.d
[samuel gc]$ gdmd -ofgc_fast_gdc -O -release -inline gc_fast.d
[samuel gc]$ ldmd2 -ofgc_fast_ldc -O -release -inline gc_fast.d
[samuel gc]$ for c in gc_fast_{dmd,gdc,ldc}; do echo "---"; echo "$c:"; time ./$c; done;
---
gc_fast_dmd:
37.6217

real	0m0.652s
user	0m0.624s
sys	0m0.024s
---
gc_fast_gdc:
37.6217

real	0m0.668s
user	0m0.644s
sys	0m0.020s
---
gc_fast_ldc:
37.6217

real	0m0.538s
user	0m0.520s
sys	0m0.016s
```
I often need to develop python scripts on some remote server where I can't run graphical python IDEs like spyder.
I'm too lazy to set up an advanced vim config with full blown IDE-like features (except for some basic python support).
I have found that a much simpler solution is a GNU screen session with two vertically split regions: one for the main coding in vim, and a smaller one below for running IPython, to run and debug the script as it is written.
I found it useful enough to figure out how to start such a setup with one command. This is how to do it:
create a file ~/.screenrc.pydev and place the following in it:
```
split
screen
focus
screen
resize 20
exec ipython
focus
exec vim
```
then add this to the bottom of your ~/.bashrc or ~/.bash_aliases:
```
alias pydev='screen -mS PyDev -c ~/.screenrc.pydev'
```
... and source the file:
```
source ~/.bashrc
```
Now, you can start your command line python environment by:
```
cd some-folder-with-python-files
pydev
```
The result:
Oh, and to jump between the screen windows, do:
[Ctrl] + [A], [Tab]
I needed to convert a bunch of chemical compound names into International Chemical Identifiers (InChIs), to make it easy to create links to various web services and databases that take InChIs as input, such as Chembl.
I found the very useful Chemical Translation Service, which has nice GUIs for doing this manually. In order to do it in a more automated fashion for many compounds though, I realized I'd have to script it up a bit (in Python, of course).
I decided to make use of the XML format of the translation service. I have had mixed experiences with both messing with URLs and parsing XML in Python before, so I was very happy to get to know two new Python packages that focus on providing a straightforward API that is "usable to humans": requests and xmltodict.
They turned out to be a great combination, and IMO the conversion becomes quite a readable bunch of code lines:
```python
import requests
import xmltodict

# Example compound name to convert (defined here so the snippet runs standalone)
query_compound_name = "caffeine"

# Base URL of the Chemical Translation Service
base_url = "http://cts.fiehnlab.ucdavis.edu/transform/transform"

# Create a dictionary with the query parameters
query_params = {"format": "xml",
                "extension": "xml",
                "to": "inchikey",
                "idValue": query_compound_name,
                "from": "name"}

# Execute the query
response = requests.get(base_url, params=query_params)

# Parse the XML into a python dict structure
xmldict = xmltodict.parse(response.text)

# Extract the InChI key from the dict structure
chem_data = xmldict['compoundResultSets']['compoundResultSet']
inchi_key = chem_data['inchiHashKey']
```
And, why not make it complete with command line flags and stuff:
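A minimal sketch of what such a command-line wrapper could look like (assuming argparse; the flag names and function name are my own invention):

```python
import argparse

import requests
import xmltodict

def name_to_inchikey(compound_name):
    # Query the Chemical Translation Service and parse the XML reply,
    # exactly as in the snippet above.
    base_url = "http://cts.fiehnlab.ucdavis.edu/transform/transform"
    query_params = {"format": "xml",
                    "extension": "xml",
                    "to": "inchikey",
                    "idValue": compound_name,
                    "from": "name"}
    response = requests.get(base_url, params=query_params)
    xmldict = xmltodict.parse(response.text)
    chem_data = xmldict['compoundResultSets']['compoundResultSet']
    return chem_data['inchiHashKey']

if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description="Convert a chemical compound name to an InChI key")
    parser.add_argument("-n", "--name", required=True,
                        help="compound name to convert")
    args = parser.parse_args()
    print(name_to_inchikey(args.name))
```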
A one-liner for formatting a Unix epoch timestamp as a date string:

```python
import time

# Format the given Unix timestamp as a YYYY-MM-DD date string (in UTC)
time.strftime("%Y-%m-%d", time.gmtime(1320966056))
```
And a helper method for executing two commands connected by a pipe, using the subprocess module:

```python
def exec_piped_command(self, cmd1, cmd2, cmd1stdin=None):
    print("DEBUG: Executing piped command: %s | %s" % (cmd1, cmd2))
    try:
        if cmd1stdin is None:
            proc1 = subprocess.Popen(cmd1, stdout=subprocess.PIPE)
            proc2 = subprocess.Popen(cmd2, stdin=proc1.stdout,
                                     stdout=subprocess.PIPE)
            commandoutput = proc2.communicate()[0]
        else:
            cmd1str = " ".join(cmd1)
            cmd2str = " ".join(cmd2)
            cmdstr = "%s | %s" % (cmd1str, cmd2str)
```