DNA

Go speedup with threading, for line by line string processing

UPDATE, 2013004: With a slightly modified code, and 3 threads (see basecompl_par2.go below), the numbers are now improved: 52%, resp. 80% speedup, depending on how you count, so, almost doubling the speed!

Note: Use this G+ thread if you want to comment/discuss!

Note II: Of course, in reality, for this kind of basic bioinformatics processing, one would use a pre-existing package, like BioGo instead!

Bioinformaticians process a lot of text based files (although the situation is improving and some better formats are emerging, slowly).

The text formats in question typically need to be parsed line-by-line, for practical reasons and for getting clean, modular and understandeable code. I was thus interested to see how much speedup one could gain for such a string processing task, by using the very easy concurrency / parallelization features of the Go programming language.

As test data I used the same 58MB fasta file (uncompressed) as used in my previous explorations.

As a simple test code, I made a very "dumb" code that just computes the DNA base complement back and fourth 4 times (I only do the 4 common letters, no other logic), in order to emulate a 4-step string processing workflow.

  • Find the code here.

This code does not do any creation of a new string, or string copying, as part of the processing, so it would not be a good imitaion of such processing (which exists, in the form of for example DNA-to-Protein sequence conversion etc). Thus I also created a version of the code that explicitly does string copy. How dumb it might be, I included it in the test.

  • Find this code here.

Finally, I made a parallel version, that spawns go routines for each of the processing steps, which in turn are multiplexed upon 2 threads (set in the program), which turned out to be the optimal for my 1.6GHz Core i5 MacBook air.

  • Find this parallel code here.

The results are as follows:

The data in numbers (click for code):

That is: For the parallel version:

  • 51% speedup (cutting 34% exec time) compared to the sequential one without string copy (basecompl.go)
  • 80% speedup (cutting 45% time) compared to the one that does string copy anyway (basecompl_copy.go).

Even though one would of course dream about even better speedups, this is not too bad IMO, given that this is the 1st (or second) time in my life that I implement a parallel program (and I picked up Go just 1-2 weeks ago!) and given that this problem mostly consists of disk access and data shuffling, which is hard to parallellize anyway.