Moar Languagez! GC content in Python, D, FPC, C++ and C (and now Go)

My two previous blog posts (this one and this one) on comparing the performance of a number of languages for a simple but very typical bioinformatics problem, spurred a whole bunch of people to suggest improvements. Nice! :) Many of the comments can be read in the comments on those posts, as well as in the G+ thread, while I got improvements sent by mail from Thomas Hume (thomas.hume -at- and Daniel Spångberg.

Anyway, here comes an updated post, with a whole array of improvements, and a few added languages.

One suggestion I got (by Thomas Koch, I think), was applicable to all the code examples that I got sent to me, namely to count GC, and AT separately, and add to the total sum only in the end. It basically cut a few percent of all execution times, so I went through all the code variants and added it. It did improve the speed in a simipar way for all the codes.

Many people suggested to read the whole file in once instead of reading line by line though. This can cause problems with the many very large files in bioinformatics, so I have divided the results below in solutions that read line by line, and those that read more than that.

Among the code examples I got, was "our own" Daniel Spångberg of UPPMAX, who wrote a whole bunch of C variants, even including an OpenMP parallelized variant, which takes the absolute lead. Nice! :)

More credits in the explanation of the codes, under each of the results sections.

Ok, so let's go to the results:

Performance results

Reading line by line

Here are first the codes that read only line by line. I wanted to present them first, to give a fair comparison.

Go (8g)1.241
Go (gcc)0.883
Go Tbl (gcc)0.785
Go Tbl (8g)0.767
FPC (MiSchi)0.597
D (GDC)0.589
FPC (leledumbo)0.567
D (DMD)0.54
D Tbl (DMD)0.492
D (LDC)0.479
C++ (HA)0.385
D Tbl (GDC)0.356
Go Tbl (RP) (8g)0.337
Go Tbl (RP) Ptr (8g)0.319
D Tbl (LDC)0.317
C (DS)0.276
C (TH)0.252
C (DS Tbl)0.183

Explanation of code, with links

All of the below codes have have been modified to benefit from the tip from Thomas Koch (thomas -at-, to cound AC and GC separately in the inner loop, and sum up only in the end.

Click the language code to see the code!

  • Python: My code, with improvents from lelledumbo (here on the blog).
  • Cython: Same as "Python", compiled to C with Cython, with statically typed integers.
  • Go (8g): Go code compiled with inbuilt Go compiler.
  • Go (gcc): Go code compiled with GCC.
  • Go Tbl (gcc): Go code, table optimized, and compiled with GCC.
  • Go Tbl (8g): Go code, table optimized, and compiled with inbuilt Go compiler.
  • PyPy: My python code compiled with PyPy (1.9.0 with GCC 4.7.0)
  • FPC (MiSchi): FreePascal version improved by MiSchi (Tonne Schindler)
  • D (GDC): My D code, with many improvements suggested from the folks in the G+ thread. Compiled with GDC.
  • FPC: My FPC code
  • FPC (leledumbo): FreePascal version improved by Leledumbo (leledumbo_cool -at-
  • D Tbl (DMD): My table optimized D code, compiled with DMD.
  • D (DMD): My D code, compiled with DMD.
  • D Tbl (GDC): My table optimized D code, compiled with GDC.
  • D (LDC): My D code, compiled with LDC.
  • C++ (HA): Harald Aschiz's C++ code.
  • Go Tbl (RP) (8g): Roger Peppe's optimized Go code, omitting string conversion and using range.
  • Go Tbl (RP) Ptr (8g): Roger Peppe's optimized Go code, additionally using pointers for the table opt.
  • D Tbl (LDC): My table optimized D code, compiled with LDC.
  • C (DS): Daniel Spångberg's C code (See his even faster ones in the next section!)
  • C (TH): Thomas Hume's C code.
  • C (DS Tbl): Daniel Spångberg's line-by-line version, with table optimization.

Reading more lines, or whole file

Here are the codes that don't keep themselves to reading line by line, but do more than so. For example Gerdus van Zyl's pypy-optimized code reads a bunch of lines at a time (1k seems optimal), before processing them, while still allowing to process in a line-by-line fashion. Then there is Daniel Spångberg's exercise in C performance, where by reading the file in at once, and applying some smart C tricks, and even doing some OpenMP-parallelization, cuts down the execution time under 100 ms, and that is for a 1million line file, on a rather slow MacBook air. Quite impressive!

PyPy (GvZ) PyPy PyPy (GvZ) + Opt C (DS Whole) C (DS Whole Tbl) C (DS Whole Tbl Par) C (DS Whole Tbl 16bit) C (DS Whole Tbl Par 16bit) C (DS Whole Tbl Par 16bit Unroll)
0.862s 0.626s 0.551s 0.178s 0.114s 0.095s 0.084s 0.070s 0.066s

Explanation of code, with links

Note: All of the below codes have been modified to benefit from the tip from Thomas Koch (thomas -at-, to cound AC and GC separately in the inner loop, and sum up only in the end.


I think there are a number of conclusions one can draw from this:

  • C is in it's own class, when it comes to speed, which is seen in both of the categories above.
  • C++ is also fast, but D with the right compiler is not far behind.
  • Generally D and FPC is very close!
  • Pure python is not near the compiled ones, but surprisingly (for me), by using PyPy, it becomes totally comparable with compiled languages!
  • If you look both at the speed, and the compactness and readability of code at the same time, I find D to be a very strong competitor in this game, but at the same time, Go is very close, and already has a [ bioinformatics package (biogo)]!

What do you think?

Update May 11: Daniel Spångberg now provided a few even more optimized versions: DS Tbl: a table optimized version of the line-by-line reading, as well as optimized versions of the ones that read the whole file at once. On the fastest one, I also tried with -funroll-all-loops to gcc, to see how far we could get. The result (for a 58MB text file): 66 milliseconds!

Update May 16: Thanks to Franzivan Bezerra's post here, I now got some Go code here as well. I modified Francivan's gode a bit, to comply with our latest optimizations here, as well as created a table optimized version (similar to Daniel Spångberg's Table optimized C code). To give a fair comparison of table optimization, I also added a table optimized D version, compiled with DMD, GDC and LDC. Find the results in the first graph below.

Update II, May 16: The link to the input file, used in the examples, got lost. It can be downloaded here!

Update III, May 16: Thanks to Roger Peppe, suggesting some improved versions in this thread on G+, we now got two more, very fast Go versions, clocking even slightly below the idiomatic C++ code. Have a look in the first graph and table below!

Update IV, May 16: Now applied equivalent optimizations like Roger's ones for Go mentioned above, to the "D Tbl" versions (re-use the "line" var as buffer to readln(), and use foreach instead of for loop). Result: D and Go gets extremely similar numbers: 0.317 and 0.319 was the best I could get for D and Go, respectively. See updated graphs below.

Update May 17: See this post on how Francivan Bezerra acheived same performance as C with D, beating the other two (Go and Java) in the comparison. (Direct link to GDoc)


I find python running with PyPy to be the best

Looking at results, I find python running with PyPy to be the strongest competitor to be honest. The cleanest code and still very good result.
I can't see how this D code:

if (line.length > 0 && !startsWith(line, ">")) {
   for(uint i = 0; i < line.length; i++) {
      c = line[i];
      if (c == 'G' || c == 'C') {
      } else if ( c == 'A' || c == 'T' ) {

maybe considered to be more readable than this python:

if not line.startswith(">"):
   g += line.count("G")
   c += line.count("C")
   a += line.count("A")
   t += line.count("T")

If you are going to process gigabytes of data, PyPy may show itself to be closure to C++ or D I suspect.

Also, it is not fair to test different algorithms saying this is a language difference.
For example, you may use the same table optimization in D and Pascal and C++.
And you may try writing python code in C/C++/D style. However, it may give worse results for python (not sure about PyPy).

Thanks for the feedback!

Thanks for the feedback! Yeah, I absolutely agree that python knocks out even D in terms of readability, although I think the D code could be prettified quite a lot still.

But I actually found that PyPy was quite picky about how I wrote the algorithm. It could produce results similar to normal python sometimes, and then sometimes it got this close to the compiled crowd, so I D seems to be more consistently fast.

Also, the main reason I started feeling unsafe staying only with python was it's lack of multithreading capabilities. With this multi-core revolution, I don't think it is reasonable to think about writing high performance tools for massive data analysis in something which lacks multi-threading (although I might be wrong).

Anyway, I'm not throwing out Python - I love it (and use it all the time in my daily job)! ... instead I was not happy to go to Java (or learn C++), in order to write more high performance code. So, it seems Python and D will both be on my favourite list for a long time.

If you really just want it to

If you really just want it to be pretty instead of fast, why not do this?

if(!line[0] != '>') {
   g += line.countchars("G");
   c += line.countchars("C");
   a += line.countchars("A");
   t += line.countchars("T");

Hell, you can always just move it all out to a function (and probably should in a proper project anyway) and then just do:

count_bases(line, g, c, a, t);