Pushing data from a PCIe device to a GPU, or letting a GPU write directly into a PCIe device, is necessary to minimize latency because it avoids an intermediate hop through system memory. For this purpose NVIDIA provides the GPUDirect technology, which is CUDA only. Your only bet to achieve the same with OpenCL is to buy AMD cards and use the open DirectGMA (marketing speak) or bus-addressable memory (technical speak) OpenCL extension. However, the documentation is both sparse and out of date, which is why I want to give a quick but complete overview of how to transfer data back and forth between a PCIe device and a GPU.


Obviously, you will need the AMD OpenCL SDK in order to use the AMD extension, so get that and set up your build environment accordingly. Besides the normal <CL/cl.h> include, you have to include <CL/cl_ext.h>. The extension allows data transfers in both directions; however, only pushing is possible in either direction. That means that when transferring data from the device to the GPU, the device writes to the GPU, and vice versa.

To call functions exported by an extension, you have to load them via an OpenCL mechanism because of the way OpenCL function calls are dispatched through the ICD loader. For example, to call clEnqueueMakeBuffersResidentAMD you would load it beforehand like this:

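clEnqueueMakeBuffersResidentAMD_fn clEnqueueMakeBuffersResidentAMD;

/* Look up the extension entry point by name for the given platform */
clEnqueueMakeBuffersResidentAMD = (clEnqueueMakeBuffersResidentAMD_fn)
    clGetExtensionFunctionAddressForPlatform (platform,
                                              "clEnqueueMakeBuffersResidentAMD");

if (clEnqueueMakeBuffersResidentAMD == NULL) {
    fprintf (stderr, "Could not load clEnqueueMakeBuffersResidentAMD\n");
}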

For brevity, I omit error checking in the remainder of this post, but you should always check every return code.
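As a reminder of what that checking could look like, here is a minimal sketch of a helper that reports a failed call together with its source location. The names check_cl_error and CHECK_CL are made up for this illustration and are not part of OpenCL; the only OpenCL fact used is that CL_SUCCESS is 0, which is redefined locally so the snippet compiles without the SDK.

```c
#include <stdio.h>

/* CL_SUCCESS is 0 in <CL/cl.h>; redefined here (assumption for this
 * sketch) so that the example builds without the OpenCL headers. */
#define SKETCH_CL_SUCCESS 0

/* Hypothetical helper: returns 1 on success, otherwise prints where the
 * call failed and returns 0. */
int
check_cl_error (int err, const char *what, const char *file, int line)
{
    if (err == SKETCH_CL_SUCCESS)
        return 1;

    fprintf (stderr, "%s:%d: %s failed with OpenCL error %d\n",
             file, line, what, err);
    return 0;
}

/* Convenience wrapper that records the call site automatically */
#define CHECK_CL(err, what) check_cl_error ((err), (what), __FILE__, __LINE__)
```

With the real headers you would pass the cl_int returned by (or written by) an OpenCL call, for example CHECK_CL (error, "clCreateBuffer"), and bail out if the result is 0.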

Writing to the GPU

To write from a PCIe device to the GPU, you have to let the PCIe device know the bus address of a cl_mem object. For this you create a buffer with the CL_MEM_BUS_ADDRESSABLE_AMD flag (and any other flag except for CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR):

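cl_mem buffer;
cl_mem_flags flags;

/* Other flags may be OR'ed in, except the two host-pointer flags */
flags = CL_MEM_BUS_ADDRESSABLE_AMD;
buffer = clCreateBuffer (context, flags, size, NULL, &error);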

and “pin” it before using it from the remote PCIe device:

cl_bus_address_amd addr;

clEnqueueMakeBuffersResidentAMD (queue,
  1,            /* in this case we pin only one buffer */
  &buffer,      /* array of buffers */
  CL_TRUE,      /* block */
  &addr,        /* array of bus addresses */
  0, NULL, NULL /* event infrastructure... */);

Making the buffer resident fills the surface_bus_address member of addr (the extension docs call it surfbusaddress instead), which you then pass to your PCIe device one way or another. The device can now write to that address in whatever way it likes; for full speed, however, a DMA transfer is much preferred.

Note that a bus-addressable buffer usually cannot be as large as the value that clGetDeviceInfo reports for CL_DEVICE_MAX_MEM_ALLOC_SIZE. For example, on a FirePro W9100 the largest buffer that can be allocated is about 96 MB.
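If the data you want to push is larger than that limit, one way to cope is to split the transfer into pieces that each fit into a bus-addressable buffer. The following sketch shows only the arithmetic; chunk_count and chunk_size are hypothetical helper names, and the 96 MB figure is just the W9100 observation from above, not a guaranteed limit.

```c
#include <stddef.h>

/* Number of chunks needed to move `total` bytes when a single
 * bus-addressable buffer holds at most `max_chunk` bytes. */
size_t
chunk_count (size_t total, size_t max_chunk)
{
    if (max_chunk == 0)
        return 0;

    /* Written without `total + max_chunk - 1` to avoid overflow */
    return total / max_chunk + (total % max_chunk != 0 ? 1 : 0);
}

/* Size in bytes of chunk number `index` (0-based); 0 if out of range. */
size_t
chunk_size (size_t total, size_t max_chunk, size_t index)
{
    size_t n = chunk_count (total, max_chunk);

    if (index >= n)
        return 0;

    /* Every chunk is full except possibly the last one */
    if (index + 1 < n || total % max_chunk == 0)
        return max_chunk;

    return total % max_chunk;
}
```

For a 200 MB transfer and a 96 MB limit this yields three chunks, the last of which carries the remaining 8 MB.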

Writing to the device

The most important thing to figure out is the physical memory address of your device. This information might come from the device’s driver but a quick look for memory regions using lspci -vv might give you a hint …

To set up a buffer for writing, we do the reverse operation of writing to the GPU: we set up a cl_bus_address_amd structure with the bus address of the PCIe device and pass that and the CL_MEM_EXTERNAL_PHYSICAL_AMD flag to clCreateBuffer:

cl_mem remote_buffer;
cl_mem_flags flags;
cl_bus_address_amd addr;

addr.surface_bus_address = (cl_ulong) physical_bus_address;
addr.marker_bus_address = (cl_ulong) physical_bus_address;

/* Mark the buffer as backed by external (device) memory */
flags = CL_MEM_EXTERNAL_PHYSICAL_AMD;
remote_buffer = clCreateBuffer (context, flags, size, &addr, &error);

The surface bus address is the main address that you want to access from within a kernel or when copying between GPUs. You can now use remote_buffer in regular kernels as if it were memory on the GPU. Be aware, though, that the addresses must be page-aligned; then again, anything else should be pretty uncommon.
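A quick sanity check of the alignment before calling clCreateBuffer can save some debugging time. This is a minimal sketch: is_page_aligned is a hypothetical helper, and the 4 KiB page size is an assumption that holds on typical x86-64 systems (query sysconf (_SC_PAGESIZE) on a real machine).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed page size for this sketch; not guaranteed on every system */
#define ASSUMED_PAGE_SIZE 4096u

/* Returns true if `bus_address` is a multiple of `page_size`.
 * `page_size` must be a non-zero power of two for the mask to work. */
bool
is_page_aligned (uint64_t bus_address, uint64_t page_size)
{
    return (bus_address & (page_size - 1)) == 0;
}
```

You would call it as is_page_aligned (physical_bus_address, ASSUMED_PAGE_SIZE) and refuse to create the buffer if it returns false.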

The marker address is a special memory region that is used for inter-context synchronization. To denote that a transfer has finished, you would enqueue a write signal command

clEnqueueWriteSignalAMD (queue, remote_buffer, value, 0, 0, NULL, &event);

where value is a monotonically increasing number. Another device listening for changes can then wait on the signal with

clEnqueueWaitSignalAMD (queue, local_buffer, value, 0, NULL, &event);

Although I have been a Vim user for almost ten years now, I still learn something new every once in a while that makes my editing more efficient or comfortable. This time (and I have to admit that it’s almost embarrassing) I quickly grew accustomed to the f and t motions. In essence, you type f or t followed by a character, and the cursor jumps onto or right before the next occurrence of that character. Typing ; and , repeats the motion forward and backward. Before this enlightenment, I was plodding through long lines by repeating w and e motions until I reached my final destination.

There are two reasons why I missed these character-based motions for so long. First, f and t can only reach targets on the current line, whereas programming tasks often involve crossing lines and words. Second, the non-existent visualization of the motion destinations makes it really hard for me to make sense of these motions.

The very handy vim-sneak plugin alleviates both of these problems, with the additional benefit of allowing two-character search motions using the s and S keys, thus covering the middle ground between f motions and full-blown / searches. To skip unwanted targets, vim-sneak provides a streak mode, similar to vim-easymotion, that lets you reach a target by typing a third character.

Difference between standard sneak (top) and streak (bottom) mode.

In the given example I was searching for “mo”, starting on the first character of line 6. Streak provides the shortcuts a and s to avoid having to type ; two or three times. To enable streak mode, you have to add the following to your .vimrc:

let g:sneak#streak = 1

By default, vim-sneak does not interfere with the f and t motions. To benefit from highlighting and multi-line f and t motions, you therefore have to map the corresponding keys:

nmap f <Plug>Sneak_f
nmap F <Plug>Sneak_F
nmap t <Plug>Sneak_t
nmap T <Plug>Sneak_T

By the way, I link the target colors to type and function syntax colors as follows to get rid of the pesky default pink:

hi link SneakPluginTarget Type
hi link SneakPluginScope Function
hi link SneakStreakTarget Type
hi link SneakStreakMask Function

Lately, a post on writing good Git commit messages and the subsequent Reddit discussion caught my attention. I fully agree with each and every point made and in fact always try to convince colleagues and friends to follow this model. However, what do you do if you are happily coding away, churning out commit after commit, and then end up with a large number of commits with summaries such as “Fixes this” or “Forgot to add that”? You use an interactive rebase, a Git feature that is surprisingly unknown among my peers.

As the name suggests, an interactive rebase is a rebase that you control interactively. That may not sound like much, because most people think of a rebase as taking commits from one branch and putting them on top of a commit of another branch. However, in the default workflow an interactive rebase happens on the same branch. To initiate such a rebase, simply find a suitable range of commits that you want to remove, edit or combine (e.g. the last ten commits counting back from HEAD) and type

$ git rebase -i HEAD~10

Your favorite editor will then be opened with the list of commits in chronological order. You can remove commits simply by deleting the corresponding lines, or reorder commits by moving lines up and down. Concerning the Git commit messages, you can edit a message by replacing pick with reword (short r), or with edit (short e), which stops at the commit so that you can amend it and thus also gives you the opportunity to change the author of a commit. To combine commits, you can use either squash (short s), which melds each marked commit into the commit preceding it and lets you edit the combined message, or fixup (short f), which does the same but discards the message of the marked commit and keeps that of the preceding one. These tools help to consolidate related commits into one logical commit.

Interactive rebase is a pleasant way of reorganizing the private history. In case you are using Vim, this can be an even more pleasing experience with a small plugin that I wrote some time ago.

To fetch a pull request by its ID, I have probably googled “github checkout pull request” a hundred times by now. Here is a simple Git alias, which goes into the [alias] section of your .gitconfig, to make that a bit easier:

  checkout-pr = "!f() { git fetch origin pull/$1/head:pr-$1 && git checkout pr-$1; }; f"

Now it’s just a matter of git checkout-pr 42 to check the changes of pull request #42 on my local system.