Update May 16: More optimizations and languages have been added here and here (the last one includes Python, D, FreePascal, C++, C and Go).
I'm experimenting with implementing some simple bioinformatics algorithms in both Python and D, for comparison.
My first test, simply calculating the GC content (the fraction of the DNA letters G and C among all of G, C, T and A), reveals some hints about where D shines the most.
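As a quick illustration of the definition (using a made-up mini-sequence, not real data):

seq = "GATTACA"
gcCount = seq.count("G") + seq.count("C")                   # 1 G + 1 C = 2
totalBaseCount = gcCount + seq.count("T") + seq.count("A")  # 2 + 2 T + 3 A = 7
print( float(gcCount) / totalBaseCount * 100 )              # ~28.57% GC content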
Based on this very limited test, D's performance advantage seems clearer when you hand-code your inner loops than when you use some of the functionality of the standard library (Phobos).
When using the countchars function in std.string of D's standard library, D is on par with, but a tiny bit slower than, Python. When I hand-code the inner loop more carefully, though, I get a 10x speedup of the D code over my Python code.
As test data, I use a 58 MB FASTA file containing approximately one million (989,561) lines of DNA sequence.
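For anyone unfamiliar with the format: a FASTA file interleaves header lines, which start with ">", with lines of raw sequence, roughly like this (a made-up excerpt, not taken from the actual file):

>Y dna_rm:chromosome chromosome:GRCh37:Y
NNNNNNNNNNNNNNNNGATC
GGCATTAGCCATTTGACCGT

That is why all the implementations below skip lines starting with ">".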
As a reference, I wrote this little Python script. There is probably a faster way to write it, but this is the fastest I could come up with:
import re

def main():
    file = open("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa", "r")
    gcCount = 0
    totalBaseCount = 0
    for line in file:
        line = line.strip("\n")
        if not line.startswith(">"):
            gcCount += len(re.findall("[GC]", line))
            totalBaseCount += len(re.findall("[GCTA]", line))
    gcFraction = float(gcCount) / totalBaseCount
    print( gcFraction * 100 )

if __name__ == '__main__':
    main()
This takes around 8 seconds on my machine:
[samuel gc]$ time python gc_test.py
37.6217301394

real    0m7.904s
user    0m7.876s
sys     0m0.020s
Then, I tried two implementations in D. First, I tried to make my life as simple as possible by using existing standard library functionality (the countchars function in std.string):
import std.stdio;
import std.string;
import std.algorithm;

void main() {
    File file = File("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa", "r");
    int gcCount = 0;
    int totalBaseCount = 0;
    string line = "";
    while (!file.eof()) {
        line = chomp(file.readln());
        if (!startsWith(line, ">")) {
            gcCount += countchars(line, "[GC]");
            totalBaseCount += countchars(line, "[GCTA]");
        }
    }
    float gcFraction = cast(float) gcCount / totalBaseCount;
    writeln(gcFraction * 100);
}
The results: slightly slower than Python with DMD, and slightly faster with GDC and LDC:
[samuel gc]$ dmd -ofgc_slow_dmd -O -release -inline gc_slow.d
[samuel gc]$ gdmd -ofgc_slow_gdc -O -release -inline gc_slow.d
[samuel gc]$ ldmd2 -ofgc_slow_ldc -O -release -inline gc_slow.d
[samuel gc]$ for c in gc_slow_{dmd,gdc,ldc}; do echo "---"; echo "$c:"; time ./$c; done;
---
gc_slow_dmd:
37.6217

real    0m8.088s
user    0m8.049s
sys     0m0.032s
---
gc_slow_gdc:
37.6217

real    0m6.791s
user    0m6.764s
sys     0m0.020s
---
gc_slow_ldc:
37.6217

real    0m7.138s
user    0m7.096s
sys     0m0.036s
On the other hand, when I hand-code the character comparisons, like this:
import std.stdio;
import std.string;
import std.algorithm;

void main() {
    File file = File("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa", "r");
    int gcCount = 0;
    int totalBaseCount = 0;
    string line;
    char c;
    while (!file.eof()) {
        line = file.readln();
        if (line.length > 0 && !startsWith(line, ">")) {
            for (uint i = 0; i < line.length; i++) {
                c = line[i];
                if (c == 'C' || c == 'T' || c == 'G' || c == 'A') {
                    totalBaseCount++;
                    if (c == 'G' || c == 'C') {
                        gcCount++;
                    }
                }
            }
        }
    }
    float gcFraction = cast(float) gcCount / cast(float) totalBaseCount;
    writeln(gcFraction * 100);
}
... then I get a significant speedup over Python: around 10x (13x with the fastest combination of compiler and optimization flags)!
[samuel gc]$ dmd -ofgc_fast_dmd -O -release -inline gc_fast.d
[samuel gc]$ gdmd -ofgc_fast_gdc -O -release -inline gc_fast.d
[samuel gc]$ ldmd2 -ofgc_fast_ldc -O -release -inline gc_fast.d
[samuel gc]$ for c in gc_fast_{dmd,gdc,ldc}; do echo "---"; echo "$c:"; time ./$c; done;
---
gc_fast_dmd:
37.6217

real    0m0.652s
user    0m0.624s
sys     0m0.024s
---
gc_fast_gdc:
37.6217

real    0m0.668s
user    0m0.644s
sys     0m0.020s
---
gc_fast_ldc:
37.6217

real    0m0.538s
user    0m0.520s
sys     0m0.016s
Comments
python may be faster
try this python version:
import re
file = open("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa","r")
file.readline() # skip header
body = file.read()
gcCount = len(re.findall("[GC]", body))
totalBaseCount = len(re.findall("[GCTA]", body))
gcFraction = float(gcCount) / totalBaseCount
print( gcFraction * 100 )
Yeah, that cuts Python execution time to 3.326s. A FASTA file may contain many header lines though, so generality is affected.
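One middle ground that keeps the bulk read but tolerates any number of header lines could look something like this (just a sketch, I haven't benchmarked it):

import re

file = open("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa", "r")
# Drop all header lines, keep only the sequence data, then count in one go
body = "".join(line for line in file if not line.startswith(">"))
gcCount = len(re.findall("[GC]", body))
totalBaseCount = len(re.findall("[GCTA]", body))
gcFraction = float(gcCount) / totalBaseCount
print( gcFraction * 100 )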
There's a point in reading line by line though: it lets you process files of arbitrary size.
I've got some additional tips on an optimized version, using PyPy as well, so I'll need to follow up with another blog post. Stay tuned! :)
python may be faster
Yes. I hope you are doing something like this with body.count():
file = open("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa","r")
file.readline() # skip header
body = file.read()
g = body.count("G")
c = body.count("C")
t = body.count("T")
a = body.count("A")
gcCount = g+c
totalBaseCount = g+c+t+a
gcFraction = float(gcCount) / totalBaseCount
print( gcFraction * 100 )
This gives 0.25 s on my machine.
Basically, I am trading memory usage for time. If you have several comment lines inside the file, it is probably better to grep them out before doing the counting, i.e., prepare the file to speed up the process.
If the file you need to process does not fit in memory, you can still read it in chunks using file.read(size) in a loop. That lets you read, say, 1 GB of data in each iteration. The code will still be pythonically simple and quick enough.
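For example, a minimal sketch of that chunked approach (the 64 MB chunk size is an arbitrary pick; chunking is safe here because we only count single characters, so a match can never straddle a chunk boundary):

file = open("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa", "r")
file.readline()  # skip header
counts = {"G": 0, "C": 0, "T": 0, "A": 0}
while True:
    chunk = file.read(64 * 1024 * 1024)  # read 64 MB at a time
    if not chunk:
        break
    for base in counts:
        counts[base] += chunk.count(base)
gcCount = counts["G"] + counts["C"]
totalBaseCount = gcCount + counts["T"] + counts["A"]
print( float(gcCount) / totalBaseCount * 100 )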
The C++ and D versions could also be improved by using memory more efficiently, but Python is always the slickest one.