CUDA vs. C++ vs. Node.js: Genomic Algorithm Performance
The algorithm itself is a variation of the longest common substring problem, but due to the nature of the variation it can't be solved using suffix trees. The next most efficient option (in big O terms) requires dynamic programming.
The two search strings are 400,000 characters and 6,000,000,000 characters long. All data is loaded into memory before the algorithm begins, so IO does not affect the performance results.
Unfortunately, performance was abysmal. Node.JS is single threaded. On my datasets, the algorithm would have taken months to complete.
Language 2: C++
Language 3: CUDA
My initial implementation ran on my laptop's NVidia GeForce GT 650M chip with the latest CUDA drivers.
Both search strings were loaded into device memory. Each thread would perform one comparison, and each invocation would fill one row in the LCSuff table and dump the result to device memory. Only two rows are ever used at one time, so only enough memory for two rows need be allocated.
>10x perf improvement over C++ running on my Intel Core i7 @ 2.70 GHz
Thanks to the generous folks at NVidia and Microway, I was given remote access to a server running an NVIDIA Tesla K40.
Out of the gate, the algorithm performed 5x faster.
Shared Memory and Intra-Warp Sharing
I now had the performance I needed. At 500x faster than Node.JS, the algorithm completed within hours. I had another day on the K40, so I figured I'd see what performance I could squeeze out of some basic optimizations.
Up until now, my algorithm was performing one character comparison per thread and then outputting the results to device memory (there was a bit more math going on, but it was fundamentally memory-bound).
The primary optimization was to take a hierarchical approach. Do multiple compares per thread, and store the intermediate results in thread Shared Memory, performing 256/512/1024 compares per thread, dumping the output in shared memory after each one, and then dumping the result to device memory after all threads have completed. Doing so requires you to 'throw out' a number of comparisons that will be invalid because the data they need won't have been computed by your local thread pool. Still a net win, however, because you save a great deal of latency. Another 30% performance improvement.
The next level of optimization involved adding another level to the hierarchy. The K40 runs threads in 16-wide "warps". By using intra-warp communication, the intermediate results computed by neighbor threads could be accessed, allowing each thread to compute up to 16 results before even having to output to shared memory. Spill to shared memory after 16 iterations, then to device memory after 16 of those. Another 50% performance improvement.
These optimizations would only run on the K40, as they significantly increased the time spent in the GPU per invocation, which caused a WDDM timeout on my laptop GPU. On the remote server, the K40 wasn't used for desktop rendering so it was more forgiving.
I'd like to thank the folks at Nvidia and Microway for providing free access to the K40 for this GPGPU adventure.
K40 System Specs:
(2) 4-core Xeon E5-2643 v2 CPUs @ 3.5GHz
(1) NVIDIA Tesla K40, (1) NVIDIA Quadro K6000, and (1) NVIDIA Quadro K600
64GB DDR3 1866MHz Memory