Interfacing OpenCL with PCIe devices

Pushing data from a PCIe device to a GPU, or letting a GPU write directly into a PCIe device, is necessary to optimize for latency because it avoids an intermediate hop through system memory. For this purpose NVIDIA provides the GPUDirect technology, which is CUDA-only. Your only bet to achieve the same with OpenCL is to buy AMD cards and use the open DirectGMA (marketing speak) or bus-addressable memory (technical speak) OpenCL extension. However, the documentation is both sparse and out of date, which is why I want to give a quick but complete overview of how to transfer data back and forth between a PCIe device and a GPU.


Obviously, you will need the AMD OpenCL SDK in order to use the AMD extension, so get that and set up your build environment accordingly. Besides the normal <CL/cl.h> include, you have to include <CL/cl_ext.h>. The extension allows data transfers in both directions, but only pushing is possible in either direction. That means that when transferring data from the device to the GPU, the device writes to the GPU, and vice versa.

To call functions exported by an extension, you have to load them via an OpenCL mechanism because of the way OpenCL function calls are dispatched through the ICD loader. For example, to call clEnqueueMakeBuffersResidentAMD you would load it beforehand like this:

clEnqueueMakeBuffersResidentAMD_fn clEnqueueMakeBuffersResidentAMD;

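/* look the entry point up by name and cast it to its function pointer type */
clEnqueueMakeBuffersResidentAMD = (clEnqueueMakeBuffersResidentAMD_fn)
    clGetExtensionFunctionAddressForPlatform (platform, "clEnqueueMakeBuffersResidentAMD");

if (clEnqueueMakeBuffersResidentAMD == NULL) {
    fprintf (stderr, "Could not load clEnqueueMakeBuffersResidentAMD\n");
}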

For brevity, I omit error checking in the rest of this post, but you should always check every possible error.
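Before loading any entry points it is worth checking that the device advertises the extension at all. The following is a minimal sketch of such a check, assuming you have already fetched the space-separated extension string with clGetDeviceInfo and CL_DEVICE_EXTENSIONS; the helper name has_extension is my own and not part of OpenCL. It matches whole tokens only, so a longer extension name that merely starts with the same characters does not count.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: return true if `name` appears as a whole
 * space-separated token in `ext_list`, which is the format of the
 * string returned for CL_DEVICE_EXTENSIONS. */
static bool
has_extension (const char *ext_list, const char *name)
{
    assert (ext_list != NULL && name != NULL);

    size_t len = strlen (name);
    const char *p = ext_list;

    if (len == 0)
        return false;

    while ((p = strstr (p, name)) != NULL) {
        bool starts_token = (p == ext_list) || (p[-1] == ' ');
        bool ends_token = (p[len] == ' ') || (p[len] == '\0');

        if (starts_token && ends_token)
            return true;

        p += len;   /* only a partial match, keep searching */
    }

    return false;
}
```

You would call it as has_extension (extensions, "cl_amd_bus_addressable_memory") and bail out early if it returns false.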

Writing to the GPU

To write from a PCIe device to the GPU, you have to let the PCIe device know the bus address of a cl_mem object. For this, you create a buffer with the CL_MEM_BUS_ADDRESSABLE_AMD flag (and any other flags except for CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR)

cl_mem buffer;
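cl_mem_flags flags;

/* OR in further flags as needed, except the two host pointer flags */
flags = CL_MEM_BUS_ADDRESSABLE_AMD;
buffer = clCreateBuffer (context, flags, size, NULL, &error);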

and “pin” it before using it from the remote PCIe device

cl_bus_address_amd addr;

clEnqueueMakeBuffersResidentAMD (queue,
  1,            /* in this case we pin only one buffer */
  &buffer,      /* array of buffers */
  CL_TRUE,      /* block */
  &addr,        /* array of bus addresses */
  0, NULL, NULL /* event infrastructure... */);

Making the buffer resident fills in the surface_bus_address member of addr (the extension docs call it surfbusaddress instead), which you then pass to your PCIe device one way or another. The device can now write to that address in whatever way it likes, but for full speed a DMA transfer is much preferred.

Note that a bus-addressable buffer usually cannot be as large as the CL_DEVICE_MAX_MEM_ALLOC_SIZE reported by clGetDeviceInfo. For example, on a FirePro W9100 the largest bus-addressable buffer that can be allocated is about 96 MB.
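If your transfers are larger than what a single bus-addressable buffer can hold, one way around the limit is to split them into chunks. Here is a small sketch of the arithmetic, assuming you determine the usable maximum for your card yourself (for example by trying allocations); the function names are made up for this post.

```c
#include <assert.h>
#include <stddef.h>

/* Number of chunks needed to cover `total` bytes when a single
 * bus-addressable buffer can hold at most `max_chunk` bytes. */
static size_t
num_chunks (size_t total, size_t max_chunk)
{
    assert (max_chunk > 0);

    return total / max_chunk + (total % max_chunk != 0);
}

/* Size in bytes of chunk number `index` (zero-based); every chunk is
 * `max_chunk` bytes long except possibly the last one. */
static size_t
chunk_length (size_t total, size_t max_chunk, size_t index)
{
    assert (max_chunk > 0 && index < num_chunks (total, max_chunk));

    size_t remaining = total - index * max_chunk;

    return remaining < max_chunk ? remaining : max_chunk;
}
```

With a 96 MB limit, a 200 MB transfer comes out as three chunks of 96 MB, 96 MB and 8 MB.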

Writing to the device

The most important thing to figure out is the physical memory address of your device. This information might come from the device’s driver but a quick look for memory regions using lspci -vv might give you a hint …
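If you go the lspci route, the memory regions show up as lines of the form "Region 0: Memory at f7000000 (64-bit, prefetchable) [size=16M]". A rough sketch for pulling the address out of such a line could look like this. The function name is my own and the sample address is made up; which region is the right one depends on your device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: parse one "Region N: Memory at <hex> ..." line
 * from `lspci -vv` output. Returns false for anything else, such as
 * I/O port regions or regions without an assigned address. */
static bool
parse_lspci_region (const char *line, int *region, uint64_t *address)
{
    assert (line != NULL && region != NULL && address != NULL);

    unsigned long long addr = 0;
    int idx = 0;

    /* the leading space in the format skips lspci's indentation */
    if (sscanf (line, " Region %d: Memory at %llx", &idx, &addr) != 2)
        return false;

    *region = idx;
    *address = (uint64_t) addr;
    return true;
}
```

Treat the result as a hint only and cross-check it against what your device's driver reports.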

To set up a buffer for writing, we do the reverse operation of writing to the GPU: we set up a cl_bus_address_amd structure with the bus address of the PCIe device and pass that and the CL_MEM_EXTERNAL_PHYSICAL_AMD flag to clCreateBuffer:

cl_mem remote_buffer;
cl_mem_flags flags;
cl_bus_address_amd addr;

flags = CL_MEM_EXTERNAL_PHYSICAL_AMD;
addr.surface_bus_address = (cl_ulong) physical_bus_address;
addr.marker_bus_address = (cl_ulong) physical_bus_address;

remote_buffer = clCreateBuffer (context, flags, size, &addr, &error);

The surface bus address is the main address that you want to access from within a kernel or when copying between GPUs. You can now use remote_buffer in regular kernels as if it were memory on the GPU. Be aware, though, that the addresses must be page-aligned; anything else should be pretty uncommon anyway.
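A quick sanity check before calling clCreateBuffer can save some head scratching. The sketch below tests an address against a page size that you pass in; 4096 bytes is a common value but an assumption here, so use whatever applies to your system. The function name is made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: return true if `address` is a multiple of
 * `page_size`, which must be a non-zero power of two. */
static bool
is_page_aligned (uint64_t address, uint64_t page_size)
{
    assert (page_size != 0 && (page_size & (page_size - 1)) == 0);

    return (address & (page_size - 1)) == 0;
}
```

For example, is_page_aligned (physical_bus_address, 4096) should hold for both the surface and the marker address.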

The marker address is a special memory region that is used for inter-context synchronization. To denote that a transfer has finished you would enqueue a write signal command

clEnqueueWriteSignalAMD (queue, remote_buffer, value, 0, 0, NULL, &event);

where value is a monotonically increasing number. Another device listening for changes can then wait on the signal with

clEnqueueWaitSignalAMD (queue, local_buffer, value, 0, NULL, &event);
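To make the bookkeeping around value a bit more concrete, here is a tiny host-side model of the protocol. It does not touch OpenCL at all and is not how the extension is implemented; the struct, the function names and in particular the "marker is at least value" comparison are assumptions I use to illustrate why the number has to increase monotonically.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: `marker` stands in for the word at the marker
 * address, `last_issued` is the writer's counter for signal values. */
typedef struct {
    uint32_t marker;
    uint32_t last_issued;
} signal_state;

/* Hand out the next signal value; each transfer gets a larger one. */
static uint32_t
signal_next_value (signal_state *s)
{
    return ++s->last_issued;
}

/* Model of the write signal command: publish `value` in the marker. */
static void
signal_write (signal_state *s, uint32_t value)
{
    assert (value > s->marker);   /* values must never go backwards */
    s->marker = value;
}

/* Model of the wait condition (assumed): the transfer tagged with
 * `value` counts as finished once the marker has reached it. */
static bool
signal_reached (const signal_state *s, uint32_t value)
{
    return s->marker >= value;
}
```

Because the values only ever grow, a waiter cannot mistake the marker left behind by an earlier transfer for the completion of a later one.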