Calculating GC content in Python and D - How to get a 10x speedup in D

Update May 16: More optimizations and languages have been added here and here (the last one includes Python, D, FreePascal, C++, C and Go).

I'm experimenting with implementing some simple bioinformatics algorithms in both Python and D, for comparison.

My first test, simply calculating the GC content (the fraction of the DNA letters G and C among all of G, C, T and A), gives some hints about where D shines the most.
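As a toy example: the sequence GATTACA contains one G and one C among its seven bases, giving a GC content of 2/7 ≈ 28.6%.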

Based on this very limited test, D's strength in terms of performance shows up more clearly when you hand-code your inner loops than when you use some of the functionality of the standard library (Phobos).

When using the countchars function in std.string from D's standard library, D is on par with, or even a tiny bit slower than, Python. When I hand-code the inner loop more carefully, though, I get a 10x speedup of the D code over my Python code.

Test data

As test data, I use a 58 MB FASTA file containing approximately one million (989,561) lines of DNA sequence.
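For reference, FASTA format consists of one or more header lines starting with ">", each followed by lines of sequence. An illustrative fragment (made up, not the actual file contents):

>example_header some description
NNNNNNNNNNGATTACAGGC
CTAACGTTAGGCATGCATGC

The header lines are why the scripts below skip lines starting with ">", and since this is a repeat-masked ("dna_rm") file, masked regions appear as runs of N, which is one reason to count G, C, T and A explicitly rather than just summing line lengths.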

Python reference implementation

As a reference, I wrote this little Python script. There is probably a faster way to write it, but this is the fastest version I could come up with.

import re

def main():
    file = open("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa","r")
    gcCount = 0
    totalBaseCount = 0
    for line in file:
        line = line.strip("\n")
        if not line.startswith(">"): # skip FASTA header lines
            gcCount += len(re.findall("[GC]", line))
            totalBaseCount += len(re.findall("[GCTA]", line))
    gcFraction = float(gcCount) / totalBaseCount
    print( gcFraction * 100 )


if __name__ == '__main__':
    main()

This takes around 8 seconds on my machine:

[samuel gc]$ time python gc_test.py 
37.6217301394
 
real    0m7.904s
user    0m7.876s
sys     0m0.020s
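Most of those 8 seconds presumably go to the regex engine, which builds a list of all matches for every line. A hypothetical middle ground (not benchmarked in this post) would be to keep the same line-by-line structure but count with str.count instead, replacing the loop body in main() above with something like:

for line in file:
    if not line.startswith(">"):
        g = line.count("G")
        c = line.count("C")
        gcCount += g + c
        totalBaseCount += g + c + line.count("T") + line.count("A")

The comments below explore whole-file variants of the same idea, which bring the Python version well under a second.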

Then I tried two implementations in D. First, I tried to make my life as simple as possible by using existing standard library functionality (the countchars function in std.string):

D implementation, using countchars

import std.stdio;
import std.string;    // chomp, countchars
import std.algorithm; // startsWith

void main() {
    File file = File("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa","r");
    int gcCount = 0;
    int totalBaseCount = 0;
    string line = "";
    while (!file.eof()) {
        line = chomp(file.readln());
        if (!startsWith(line, ">")) { // skip FASTA header lines
            // Note: countchars takes a tr-style character set rather than a
            // regex, so the brackets here are counted as literal characters.
            // Since '[' and ']' never occur in the sequence lines, the result
            // is still correct (plain "GC" and "GCTA" would work too).
            gcCount += countchars(line, "[GC]");
            totalBaseCount += countchars(line, "[GCTA]");
        }
    }
    float gcFraction = ( cast(float)gcCount / totalBaseCount );
    writeln( gcFraction * 100 );
}

The results: slightly slower than Python with DMD, and slightly faster with GDC and LDC:

[samuel gc]$ dmd -ofgc_slow_dmd -O -release -inline gc_slow.d
[samuel gc]$ gdmd -ofgc_slow_gdc -O -release -inline gc_slow.d
[samuel gc]$ ldmd2 -ofgc_slow_ldc -O -release -inline gc_slow.d
[samuel gc]$ for c in gc_slow_{dmd,gdc,ldc}; do echo "---"; echo "$c:"; time ./$c; done;
---
gc_slow_dmd:
37.6217
 
real    0m8.088s
user    0m8.049s
sys     0m0.032s
---
gc_slow_gdc:
37.6217
 
real    0m6.791s
user    0m6.764s
sys     0m0.020s
---
gc_slow_ldc:
37.6217
 
real    0m7.138s
user    0m7.096s
sys     0m0.036s

D implementation with hand-written inner loop

On the other hand, when I hand-code the character comparison, like this:

import std.stdio;
import std.algorithm; // startsWith

void main() {
    File file = File("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa","r");
    int gcCount = 0;
    int totalBaseCount = 0;
    string line;
    char c;
    while (!file.eof()) {
        // readln() keeps the trailing newline, but that is harmless here,
        // since '\n' is none of the four characters we count.
        line = file.readln();
        if (line.length > 0 && !startsWith(line, ">")) {
            for(uint i = 0; i < line.length; i++) {
                c = line[i];
                if (c == 'C' || c == 'T' || c == 'G' || c == 'A') {
                    totalBaseCount++;
                    if ( c == 'G' || c == 'C' ) {
                        gcCount++;
                    }
                }
            }
        }
    }
    float gcFraction = ( cast(float)gcCount / cast(float)totalBaseCount );
    writeln( gcFraction * 100 );
}

... then I get a significant speedup over Python, around 10x (13x with the fastest combination of compiler and optimization flags):

[samuel gc]$ dmd -ofgc_faster_dmd -O -release -inline gc_fast.d
[samuel gc]$ gdmd -ofgc_faster_gdc -O -release -inline gc_fast.d
[samuel gc]$ ldmd2 -ofgc_faster_ldc -O -release -inline gc_fast.d
[samuel gc]$ for c in gc_faster_{dmd,gdc,ldc}; do echo "---"; echo "$c:"; time ./$c; done;
---
gc_faster_dmd:
37.6217
 
real    0m0.652s
user    0m0.624s
sys     0m0.024s
---
gc_faster_gdc:
37.6217
 
real    0m0.668s
user    0m0.644s
sys     0m0.020s
---
gc_faster_ldc:
37.6217
 
real    0m0.538s
user    0m0.520s
sys     0m0.016s
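The gap is perhaps not so surprising: if I understand Phobos correctly, countchars checks every character against the pattern string through a generic routine (std.string.inPattern), while the hand-written loop boils down to a few inlined character comparisons per byte, in a single pass with no intermediate allocations. My guess is that this generic per-character dispatch is what eats the time in the slow version.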

Version info

  • Machine: MacBook Air 11", with a 1.7 GHz Core i5 (dual core)
  • OS: Lubuntu 12.10 32-bit
  • Python: 2.7.3
  • DMD version: 2.062
  • GDC (GCC?) version: 4.6.3
  • LDC: 0.10.0 (based on DMD 2.060 and LLVM 3.2)

Comments

python may be faster

try this python version:

import re

file = open("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa","r")
file.readline() # skip header
body = file.read()
gcCount = len(re.findall("[GC]", body))
totalBaseCount = len(re.findall("[GCTA]", body))
gcFraction = float(gcCount) / totalBaseCount
print( gcFraction * 100 )

Yeah, cuts python execution

Yeah, that cuts Python execution time to 3.326s. A FASTA file may contain many header lines though, so generality is affected.
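A middle ground that keeps the generality would be to filter out the header lines while still building one big string to count on. A sketch (not benchmarked here):

with open("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa") as f:
    # Keep only sequence lines; handles any number of headers.
    body = "".join(line for line in f if not line.startswith(">"))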

Furthermore, if I change

len(re.findall("[GC]", body))

to

body.count("G") + body.count("C")

and

len(re.findall("[GCTA]", body))

to

body.count("G") + body.count("C") + body.count("A") + body.count("T")

... then I'm down to 0.546s.

There's a point in reading line by line though: it lets you handle files of arbitrary size without loading them into memory.

I've got some additional tips for an optimized version, using PyPy as well, so I'll need to follow up with another blog post. Stay tuned! :)

python may be faster

Yes. I hope you are doing something like this with body.count():

file = open("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa","r")
file.readline() # skip header
body = file.read()
g = body.count("G")
c = body.count("C")
t = body.count("T")
a = body.count("A")
gcCount = g+c
totalBaseCount = g+c+t+a
gcFraction = float(gcCount) / totalBaseCount
print( gcFraction * 100 )

This gives 0.25 s on my machine.

Basically, I am trading memory usage for time. If you have several comment lines inside the file, it is probably better to grep them out before doing the counting, i.e. prepare the file to speed up the process.
If the file you need to process does not fit in memory, you can still read the file in chunks using file.read(size) in a loop. That lets you read, say, 1 GB of data in each iteration. The code will still be pythonically simple and quick enough.
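A minimal sketch of that chunked approach (the chunk size is arbitrary; since single characters are counted, a chunk boundary can never split a match):

counts = {"G": 0, "C": 0, "T": 0, "A": 0}
with open("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa") as f:
    f.readline()  # skip the header line, as above
    while True:
        chunk = f.read(1 << 20)  # read about 1 MB at a time
        if not chunk:
            break
        for base in counts:
            counts[base] += chunk.count(base)
gcCount = counts["G"] + counts["C"]
totalBaseCount = gcCount + counts["T"] + counts["A"]
print( 100 * float(gcCount) / totalBaseCount )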

The C++ and D versions could also be improved by using memory more efficiently, but Python is always the slickest one.