The Wonderful World of Heterogeneous Computing

OpenCL once promised that any code written against the specification would run on any OpenCL-capable device. However, I was bitten by a quirk in NVIDIA’s implementation and lost almost two full days tracking down the cause of a bug in our system.

After hours of unsuccessful tinkering with our own software, I suspected the problem might be related to our heterogeneous system (heterogeneous at least in terms of GPUs: one GTX 680 and eight GTX 590s). I tried to replicate the bug with an SSCCE (a short, self-contained, correct example), which is now available on GitHub. When calling the regression tool, you can specify the range of GPUs to include in the cl_context with --first and --last.
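To illustrate how such a device range could map onto a context, here is a minimal sketch in plain C. The helper name select_device_range is hypothetical and not taken from the actual tool; it only assumes that --first and --last are inclusive, zero-based indices into the platform's device list, as the invocations in this post suggest.

```c
#include <stddef.h>

/* Hypothetical helper (not from the actual tool): given the number of
 * devices the platform reports and an inclusive [first, last] index range,
 * compute the offset and length of the slice of the device array that
 * should go into the context.
 * Returns 0 on success, -1 if the range is empty or out of bounds. */
static int select_device_range(size_t num_devices, size_t first, size_t last,
                               size_t *offset, size_t *count)
{
    if (num_devices == 0 || first > last || last >= num_devices)
        return -1;

    *offset = first;
    *count = last - first + 1;
    return 0;
}
```

With an array of cl_device_id values filled in by clGetDeviceIDs(), the slice could then be passed on as clCreateContext(NULL, count, devices + offset, NULL, NULL, &err), so that only the selected GPUs end up in the cl_context.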

So, for the GTX 680 alone, I got:

$ ./build/check --first=0 --last=0
# Platform: OpenCL 1.1 CUDA 5.0.1
# Device 0: GeForce GTX 680
Creating kernel `two_const_params`: OK
Creating kernel `three_const_params`: OK
Creating kernel `two_local_params`: OK
Creating kernel `three_local_params`: OK
Creating kernel `two_global_params`: OK
Creating kernel `three_global_params`: OK

Everything is fine. So let’s see what the eight GTX 590s do:

$ ./build/check --first=1 --last=8
# Platform: OpenCL 1.1 CUDA 5.0.1
# Device 0: GeForce GTX 590
  ...
# Device 7: GeForce GTX 590
Creating kernel `two_const_params`: OK
Creating kernel `three_const_params`: OK
Creating kernel `two_local_params`: OK
Creating kernel `three_local_params`: OK
Creating kernel `two_global_params`: OK
Creating kernel `three_global_params`: OK

Again, no problems. Now, if I combine the GTX 680 with the first GTX 590, I get this:

$ ./build/check --first=0 --last=1
# Platform: OpenCL 1.1 CUDA 5.0.1
# Device 0: GeForce GTX 680
# Device 1: GeForce GTX 590
Creating kernel `two_const_params`: OK
Creating kernel `three_const_params`: Error: CL_INVALID_KERNEL_DEFINITION
Creating kernel `two_local_params`: OK
Creating kernel `three_local_params`: Error: CL_INVALID_KERNEL_DEFINITION
Creating kernel `two_global_params`: OK
Creating kernel `three_global_params`: OK

This is the problem I experienced. As you can see, the bug is triggered by combining a GTX 590 with a GTX 680 in one context and creating a kernel with more than two __constant or __local parameters. Kernels with only __global parameters are unaffected.
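For readers who have not looked at the test program, the kernels under test might look roughly like the following. This is only a sketch: the kernel names match the tool's output, but the float element type, the empty bodies, and the identifiers kernel_source and kernel_names are assumptions, and the real SSCCE may differ.

```c
/* Hypothetical OpenCL C source for the six test kernels, embedded as a C
 * string. Only the names and the address space qualifiers are taken from
 * the tool's output; everything else is an assumption. */
static const char *kernel_source =
    "__kernel void two_const_params(__constant float *a,\n"
    "                               __constant float *b) {}\n"
    "__kernel void three_const_params(__constant float *a,\n"
    "                                 __constant float *b,\n"
    "                                 __constant float *c) {}\n"
    "__kernel void two_local_params(__local float *a,\n"
    "                               __local float *b) {}\n"
    "__kernel void three_local_params(__local float *a,\n"
    "                                 __local float *b,\n"
    "                                 __local float *c) {}\n"
    "__kernel void two_global_params(__global float *a,\n"
    "                                __global float *b) {}\n"
    "__kernel void three_global_params(__global float *a,\n"
    "                                  __global float *b,\n"
    "                                  __global float *c) {}\n";

/* Kernel names in the order in which the tool reports them. */
static const char *kernel_names[] = {
    "two_const_params",  "three_const_params",
    "two_local_params",  "three_local_params",
    "two_global_params", "three_global_params",
};

enum { NUM_KERNELS = sizeof kernel_names / sizeof kernel_names[0] };
```

A source string like this would typically be handed to clCreateProgramWithSource(context, 1, &kernel_source, NULL, &err) and built once for all devices in the context before the kernels are created by name.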

As you can imagine, the cause was tricky to find: on my own system (two GTX 580s) I never had any problems, regardless of the number of __constant parameters. There were also no problems on our development compute server, which has six GTX 580s. Moreover, the error code returned by clCreateKernel() was not helpful at all. The specification merely says that clCreateKernel() returns CL_INVALID_KERNEL_DEFINITION if

the function definition for __kernel function given by [the] kernel name such as the number of arguments, the argument types are not the same for all devices for which the program executable has been built.

This is not the case here. In fact, the error code is misleading, because we always built the kernels for all devices in one go, without changing the kernel function signature between devices.
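When chasing errors like this, it helps to report the symbolic name of the code rather than the raw number. The following sketch shows one way to do that for the codes clCreateKernel() can return. The helper names create_kernel_error_name and format_result are hypothetical; the numeric values are those defined in the Khronos CL/cl.h header and are repeated here only so the sketch compiles without the OpenCL headers.

```c
#include <stdio.h>

/* Numeric values as defined in CL/cl.h. They are repeated here only so the
 * sketch compiles without the OpenCL headers installed. */
#ifndef CL_SUCCESS
#define CL_SUCCESS                      0
#define CL_OUT_OF_RESOURCES            -5
#define CL_OUT_OF_HOST_MEMORY          -6
#define CL_INVALID_VALUE               -30
#define CL_INVALID_PROGRAM             -44
#define CL_INVALID_PROGRAM_EXECUTABLE  -45
#define CL_INVALID_KERNEL_NAME         -46
#define CL_INVALID_KERNEL_DEFINITION   -47
#endif

/* Hypothetical helper: symbolic name of an error code that
 * clCreateKernel() may return. */
static const char *create_kernel_error_name(int err)
{
    switch (err) {
    case CL_SUCCESS:                    return "CL_SUCCESS";
    case CL_OUT_OF_RESOURCES:           return "CL_OUT_OF_RESOURCES";
    case CL_OUT_OF_HOST_MEMORY:         return "CL_OUT_OF_HOST_MEMORY";
    case CL_INVALID_VALUE:              return "CL_INVALID_VALUE";
    case CL_INVALID_PROGRAM:            return "CL_INVALID_PROGRAM";
    case CL_INVALID_PROGRAM_EXECUTABLE: return "CL_INVALID_PROGRAM_EXECUTABLE";
    case CL_INVALID_KERNEL_NAME:        return "CL_INVALID_KERNEL_NAME";
    case CL_INVALID_KERNEL_DEFINITION:  return "CL_INVALID_KERNEL_DEFINITION";
    default:                            return "unknown error";
    }
}

/* Hypothetical helper: format one report line in the style of the output
 * shown in this post. Returns what snprintf() returns. */
static int format_result(char *buf, size_t size, const char *kernel, int err)
{
    if (err == CL_SUCCESS)
        return snprintf(buf, size, "Creating kernel `%s`: OK", kernel);

    return snprintf(buf, size, "Creating kernel `%s`: Error: %s",
                    kernel, create_kernel_error_name(err));
}
```

In a real check, err would be the value written by clCreateKernel(program, name, &err) for each kernel name. The name alone does not explain the failure, of course, but it at least points to the right paragraph of the specification.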

Now, the real tragedy is that one of my colleagues swapped the GPUs in our production compute server without telling me and without running any regression tests. Although this did not cause the bug, it certainly helped the bug slip through unnoticed.

At the end of the day, my advice is to never blindly trust an implementation of a specification and to keep an eye on changes in your environment, be they in hardware or in software dependencies. Nevertheless, I hope this tool will help others check their own environment.