AMD's OpenCL multi GPU bug

Implementing open standards such as OpenCL seems to be a daunting and error prone task. Here’s a problem that went unnoticed for quite some time on an AMD-based system.

To compile an OpenCL kernel for several devices, one has to use clBuildProgram that takes a program created with clCreateProgramWithSource or clCreateProgramWithBinary and a list of devices. In plain C it looks like this:

static const char *source = "__kernel void foo(void) {}";
cl_program program;
cl_device_id devices[2];

program = clCreateProgramWithSource (context, 1, &source, NULL, NULL);
clBuildProgram (program, 2, devices, "", NULL, NULL);

In a heterogeneous environment, it is possible that different devices of the same vendor happily co-exist on the same platform. At least, that’s the core idea of OpenCL. In order to write optimized kernel code, we can define pre-processor defines and compile the kernel for each device, just like this:

clBuildProgram (program, 1, &devices[0], "-DTYPE=FOO", NULL, NULL);
clBuildProgram (program, 1, &devices[1], "-DTYPE=BAR", NULL, NULL);

Inside the kernel you could write conditional code with #ifdef and switch between two alternative implementations. However, this works only on NVIDIA hardware. Once you enqueue a kernel on the first command queue (the one attached to the first device) of an AMD device, you will receive a CL_INVALID_PROGRAM_EXECUTABLE error. In my humble opinion this is a clear violation against the spec:

The function [clBuildProgram] builds (compiles & links) a program executable from the program source or binary for all the devices or a specific device(s) in the OpenCL context associated with program.

When I interpret this correctly, the standard does not restrict usage on “specific device(s)”, hence I must be able to run clBuildProgram as often as I wish. Now, to check if you are affected, you can use my regression tool. When run on an AMD system, you will see something like this:

# Platform: OpenCL 1.2 AMD-APP (1084.4)
# Device 0: Tahiti
# Device 1: Tahiti
# Device 2: Tahiti
# Device 3: Tahiti
[...]
Creating program `amd_bug`: OK
Build program for both GPU 1: OK
Build program for both GPU 2: OK
Created kernel `amd_bug': OK
Started kernel `amd_bug' on queue 0 [expect CL_INVALID_PROGRAM_EXECUTABLE]: Error: CL_INVALID_PROGRAM_EXECUTABLE
Started kernel `amd_bug' on queue 1 [expect CL_INVALID_PROGRAM_EXECUTABLE]: OK

As with NVIDIA’s bug, there is no real workaround for this issue. So, let’s hope that AMD will address this problem.