OpenCL resource management

It is common sense that for a long-running system, it’s of utmost importance that no piece of code is leaking memory or other resources. Otherwise, the system will crash sooner or later due to memory exhaustion. This is what happened with a distributed system I developed for high performance image processing.

In this system, each network node – each having at least one GPU – runs a small piece of server software that prepares an image processing graph and processes incoming data. When the data stream has been processed, the server releases almost all resources and waits for the next data processing task. To be sure, that everything was working as expected, I ran a stress test during the night. Of course, at some point the server crashed with a segmentation fault. Using nvidia-smi, I found out that the GPU has allocated more and more memory and faced an out-of-memory situation at some point.

OpenCL has a straightforward system of reference counted resources. You create a context, command queue, buffer, program or kernel with its corresponding clCreateFoo() function. If several independent software parts reference an OpenCL resource for internal use, the reference count can be increased with clRetainFoo(). When a resource is no longer of any use, it can be discarded with clReleaseFoo(). Once the reference count reaches zero, any associated resources such as memory are freed. At least, that is what I expected.

However on NVIDIA systems, a final clReleaseMemObject() will not free the memory segment in GPU memory, if not every other OpenCL object has been freed too. Consider this small snippet:

n_elements = 1024 * 1024;
mem = clCreateBuffer (context, CL_MEM_READ_WRITE,
                      n_elements * sizeof (float),
                      NULL, &errcode);

/* Launch kernel with one parameter */
clSetKernelArg (kernel, 0, sizeof (cl_mem), &mem);
clEnqueueNDRangeKernel (cmd_queue, kernel,
                        1, NULL, &n_elements, NULL,
                        0, NULL, &event);

/* Wait for end of execution and release all resources */
clWaitForEvents (1, &event));
clReleaseMemObject (mem);
clReleaseKernel (kernel);
clReleaseProgram (program);
clReleaseCommandQueue (cmd_queue);
clReleaseContext (context);

This looks innocent, however repeating this over and over again will give you a CL_MEM_OBJECT_ALLOCATION_FAILURE eventually because we did not release the event object and thus did not clean up GPU memory.

So, if you are developing with NVIDIA’s OpenCL implementation and what to ensure a stable system:

  • Check that each call to clCreateFoo() and clRetainFoo() is accompanied by a clReleaseFoo() call.
  • Beware of cl_event objects that are created implicitly by calls to the clEnqueueFoo() function family.
  • Do not assume that clReleaseContext() will release resources that are associated with the context.

The oclkit API

To check the leakage behaviour, I wrote a small OpenCL application. I did this many times before, but until now, I always copied the boiler plate from here to there. To avoid this in the future, I came up with a sanitized version of the boiler plate code called oclkit. It’s simple C99-only code that sets up a valid OpenCL context and command queues for each device. You can plug oclkit into your own (GPL-compatible) apps or just run the leak example with ./configure && make && ./leak.