Go speedup with threading, for line-by-line string processing

UPDATE, 2013004: With slightly modified code and 3 threads (see basecompl_par2.go below), the numbers are now improved: 52% resp. 80% speedup, depending on how you count. So, almost a doubling of the speed!

Note: Use this G+ thread if you want to comment/discuss!

Note II: Of course, in reality, for this kind of basic bioinformatics processing, one would use a pre-existing package, like BioGo instead!

Bioinformaticians process a lot of text based files (although the situation is improving and some better formats are emerging, slowly).

The text formats in question typically need to be parsed line by line, both for practical reasons and to keep the code clean, modular and understandable. I was thus interested to see how much speedup one could gain on such a string processing task by using the very easy concurrency / parallelization features of the Go programming language.

As test data I used the same 58MB fasta file (uncompressed) as used in my previous explorations.

As a simple test, I wrote a very "dumb" program that just computes the DNA base complement back and forth 4 times (I only handle the 4 common letters, with no other logic), in order to emulate a 4-step string processing workflow.

  • Find the code here.

This code does not create any new string, nor copy strings, as part of the processing, so it would not be a good imitation of processing that does (which exists, in the form of for example DNA-to-protein sequence conversion etc.). Thus I also created a version of the code that explicitly copies the string. However dumb it might be, I included it in the test.

  • Find this code here.

Finally, I made a parallel version that spawns a goroutine for each of the processing steps; these are in turn multiplexed onto 2 OS threads (set in the program), which turned out to be optimal for my 1.6 GHz Core i5 MacBook Air.

  • Find this parallel code here.

The results are as follows:

The data in numbers (click for code):

That is: For the parallel version:

  • 51% speedup (cutting 34% of the execution time) compared to the sequential version without string copy (basecompl.go)
  • 80% speedup (cutting 45% of the execution time) compared to the version that does string copy anyway (basecompl_copy.go).

Even though one would of course dream of even better speedups, this is not too bad IMO, given that this is the first (or second) time in my life that I have implemented a parallel program (and I picked up Go just 1-2 weeks ago!), and given that this problem mostly consists of disk access and data shuffling, which is hard to parallelize anyway.