When we started looking into CUDA support for PreonLab, I didn’t really have any experience programming for GPUs. Getting from the first experiments in 2021 to production code has been a long learning process. For CUDA, there is no shortage of useful articles, documentation and examples (in fact, this was one of the reasons we picked CUDA as a platform). Nevertheless, I thought it might be interesting to go over the most important lessons from my perspective as a developer used to x86-based CPUs, so here we go!
CUDA threads seem to resemble traditional CPU threads. Each one appears to have its own state and to run its code independently of the others. For the most part, it is possible to treat them just like traditional threads on the CPU, and in simple examples this will even yield amazing performance. But as soon as some branching and indirection is introduced, performance often degrades quickly. It can even be incredibly bad and run 100 times slower than the good old CPU. This is because not all CUDA threads are independent of each other, and ignoring this will lead to terrible results.
Fortunately, there is a concept that CPU developers are familiar with that serves as a much better comparison. Instead of thinking of CUDA threads as threads, I find it much more helpful to compare them to SIMD on the CPU. SIMD (Single Instruction, Multiple Data) runs the same instruction on multiple variables in parallel. For example, with AVX we can apply a floating-point operation to eight 32-bit floats at once and retrieve eight results. In CUDA, 32 adjacent threads are organized into a so-called “warp”. Effectively, this warp behaves very similarly to AVX, only processing 32 instead of 8 variables in parallel. The main difference is that the threads in a warp are much more flexible because they can handle branching. If there is a branch and only one thread wants to enter it, the other threads simply wait for it and then resume execution together (side note: this is not strictly accurate since the introduction of independent thread scheduling, but in terms of performance it is still valid to think of warps this way). This way, the threads in a warp “feel” like proper threads, but good performance is only achieved if the threads spend most of their time executing the same instructions. The ability to handle branching is still extremely useful. With SIMD, working around branching manually is annoying and often produces unreadable code that is hard to maintain. On the GPU, a bit of branching is no problem if the code quickly reconverges, so it is possible to write much cleaner code that still runs fast. This kind of programming model is called SIMT (Single Instruction, Multiple Threads) and honestly, I don’t want to go back to SIMD!
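As a minimal sketch of how that divergence and reconvergence plays out (the kernel name and setup here are mine for illustration, not PreonLab code):

```cuda
// Both sides of the branch are executed by the warp one after the other,
// with the non-participating threads masked off. This is fine as long as
// the branches are short and the warp reconverges quickly.
__global__ void absValues(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    if (in[i] < 0.0f) {
        // Only the threads with a negative value execute this path...
        out[i] = -in[i];
    } else {
        // ...while the others wait, then run this path with the roles
        // reversed.
        out[i] = in[i];
    }
    // From here on, all 32 threads of the warp execute in lockstep again.
}
```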
Another difference between CUDA threads and CPU threads is how memory should be accessed. If possible, the CUDA threads in a warp should access memory in a coalesced way to maximize bandwidth: for example, if thread 1 writes an integer to an address in global memory, thread 2 should write to the integer right next to it. On the CPU, this behavior is actively discouraged because it can lead to a problem called false sharing.
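To make the pattern concrete, here is a sketch of a coalesced copy next to its strided anti-pattern (kernel names are mine):

```cuda
// Coalesced: adjacent threads touch adjacent addresses, so the warp's 32
// accesses are combined into as few memory transactions as possible.
__global__ void coalescedCopy(const int* src, int* dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];
}

// Anti-pattern: a per-thread stride scatters each warp's accesses across
// memory, wasting most of each fetched cache line.
__global__ void stridedCopy(const int* src, int* dst, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        dst[i] = src[i];
}
```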
Of course, parallelization on GPUs does not only take place within a warp. Warps are in turn organized into thread blocks, which are executed in parallel. This kind of parallelization is much more similar to what we know from CPU threads.
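For illustration, a typical launch of such a grid of blocks might look like this (the kernel name `myKernel` and the sizes are placeholders):

```cuda
// One million elements processed by blocks of 256 threads (8 warps each).
// The runtime schedules the blocks across the GPU's streaming
// multiprocessors, much like a CPU thread pool distributes tasks
// across cores.
int n = 1000000;
int blockSize = 256;
int numBlocks = (n + blockSize - 1) / blockSize;  // round up
myKernel<<<numBlocks, blockSize>>>(deviceData, n);
```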
When starting out with CUDA, the possibilities can be a bit overwhelming. What block size should my kernel use? Do I have to use shared memory, texture memory, constant memory and register shuffling? In my experience, at least on modern GPUs, it is not necessary to obsess over these things. Using flexible kernels and device lambdas, it was possible to hide the kernel launch configuration from most code with almost no performance impact. It is much more important to minimize warp divergence and to make sure that access to global memory is coalesced in performance-critical code. Because of the unstructured nature of fluid particle data sets, I found it difficult to use shared memory in an elegant way. I usually ended up programming some kind of caching mechanism, something the hardware is doing anyway behind the scenes. My suspicion is that the importance of shared memory has diminished a bit as the amount of cache in modern GPUs has increased. This does not mean that I discourage anybody from using shared memory or register shuffling. For many algorithms it’s a great fit and will achieve great speedups. But it should be benchmarked against simpler code to make sure that it’s worth it.
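A minimal sketch of what hiding the launch configuration behind a device lambda could look like (the `forEach`/`forEachKernel` names are my own, and compiling device lambdas requires nvcc’s `--extended-lambda` flag):

```cuda
// Generic kernel that applies a device lambda to every index.
template <typename F>
__global__ void forEachKernel(int n, F f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        f(i);
}

// Host-side wrapper: the block size is chosen in exactly one place, so
// calling code never has to think about launch configuration.
template <typename F>
void forEach(int n, F f)
{
    constexpr int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;
    forEachKernel<<<numBlocks, blockSize>>>(n, f);
}

// Usage: the caller just writes the per-element work.
// forEach(n, [=] __device__ (int i) { out[i] = 2.0f * in[i]; });
```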
On the CPU, effective SIMD is important for good performance. On the GPU, using SIMT properly is not just important but an absolute must. If all the threads in a warp do completely different things, GPUs can’t match the performance of good CPUs. Unfortunately, this means that some algorithms that work well on CPUs are just not suited for GPUs. In these cases, it may be necessary to switch to another algorithm that better fits the SIMT model. In PreonLab, I had to use a different neighbor search algorithm on the GPU because there was no way to adapt the existing algorithm for SIMT. The old CPU algorithm is better in a lot of ways, requiring fewer computations and less bandwidth than the GPU algorithm. The GPU algorithm is simpler and performs more work, but because of the sheer firepower of GPUs it still offers great runtimes compared to CPUs.
Debugging massively parallel applications can be hard. Fortunately, CUDA Compute Sanitizer is there to help. It’s like valgrind, but for CUDA kernels, and usually it is also much faster. It can detect not just illegal memory accesses, but also race conditions. I have used the race detection only once, but it helped me find a non-deterministic problem so nasty that I don’t know how else I could have found it.
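For reference, both checks are invoked through the same command-line front end (`./my_app` stands in for your own binary):

```shell
# Check for out-of-bounds and misaligned memory accesses (the default tool):
compute-sanitizer --tool memcheck ./my_app

# Check for shared-memory data races between threads:
compute-sanitizer --tool racecheck ./my_app
```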
Overall, bringing PreonLab to CUDA has been fun and a very rewarding experience. Running experimental code and seeing a great speed-up is always a thrilling experience (at least for me) and there has been plenty of that in the last 2 years. Could this also have been achieved with a platform-independent solution such as OpenCL or SYCL? Maybe, but the tools, libraries and articles that are available for CUDA make it a great platform for development. We still plan to explore more platforms in the future, so stay tuned for that!