Bloerg – OpenCL 2.0 is here


22 Jul 2013

Today at SIGGRAPH ’13, Khronos released a provisional specification of OpenCL 2.0. Hooray! Compared to 1.2, this release is packed with unexpected yet cool stuff and features that users have been requesting for quite some time. The provisional specification can be found online; however, the man pages are not fully built yet.

So, let’s not waste time. Here is what’s coming up.

Kernel argument introspection

I can only imagine that the Khronos developers came up with the existing clSetKernelArg API to provide programmatic access from languages other than C. However, even with these good intentions, it was not possible to determine the types of the arguments of a cl_kernel object, which made it extremely difficult to provide a better API in C itself. After many users requested a way to introspect kernel arguments, there is now clGetKernelArgInfo (strictly speaking, it already arrived with OpenCL 1.2). Unfortunately, it is still a bit awkward to use. For example, you cannot get the size in bytes of an argument; instead, you can query the type name and infer the size from that.
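
To make the "infer it from the type name" part concrete, here is a rough sketch in plain C. It assumes you already got a string such as "float4" from clGetKernelArgInfo with CL_KERNEL_ARG_TYPE_NAME. The helper arg_size_from_type_name is my own hypothetical function, not part of OpenCL, and it only knows the built-in scalar and vector types; anything else (pointers, images, structs) yields 0.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an OpenCL API): infer the size in bytes of a
 * kernel argument from its type name, e.g. "uint" or "float4".
 * Returns 0 for anything that is not a built-in scalar or vector type. */
static size_t
arg_size_from_type_name (const char *name)
{
    static const struct { const char *base; size_t size; } scalars[] = {
        { "uchar", 1 },  { "char", 1 },
        { "ushort", 2 }, { "short", 2 }, { "half", 2 },
        { "uint", 4 },   { "int", 4 },   { "float", 4 },
        { "ulong", 8 },  { "long", 8 },  { "double", 8 },
    };
    /* A 3-component vector occupies the space of a 4-component one. */
    static const struct { const char *suffix; size_t lanes; } widths[] = {
        { "", 1 }, { "2", 2 }, { "3", 4 }, { "4", 4 }, { "8", 8 }, { "16", 16 },
    };

    assert (name != NULL);

    for (size_t i = 0; i < sizeof scalars / sizeof scalars[0]; i++) {
        size_t len = strlen (scalars[i].base);

        if (strncmp (name, scalars[i].base, len) != 0)
            continue;

        /* The remainder must be empty or a valid vector width. */
        for (size_t j = 0; j < sizeof widths / sizeof widths[0]; j++) {
            if (strcmp (name + len, widths[j].suffix) == 0)
                return scalars[i].size * widths[j].lanes;
        }
    }

    return 0;
}
```

With this, "float4" maps to 16 bytes and "uint" to 4, while a pointer argument such as "float*" gives 0, because what you pass to clSetKernelArg there is a cl_mem handle and not the pointed-to data.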

Shared Virtual Memory

Until version 1.2, you could use the same memory from host and device by using the clEnqueueMapBuffer and clEnqueueUnmapMemObject operations. However, this was restricted to contiguous memory buffers. OpenCL 2.0 provides a slew of functions dedicated to shared virtual memory (… I guess pages) that allow host and device to access arbitrary pointer-based data structures.

Pipes

I am not sure about the usefulness of the new pipe objects. Apparently, you create a new pipe on the host and pass the cl_mem object to two kernels that use a number of new built-in functions to communicate through these objects. As far as I can see, the pipes only allow buffered, asynchronous operation.

At the moment, I can hardly see any use case for these new objects. Do you?

Enqueueing kernels from within kernels

… also called “Dynamic Parallelism” by the Khronos folks. The new built-in enqueue_kernel allows a work item to schedule a new kernel. Instead of receiving a kernel object or a function pointer to a kernel function, it takes a block: the consortium decided to re-use Apple’s blocks to create enqueue-able closures. According to the specification, enqueueing looks like this:

kernel void
outer (global int *a, global int *b)
{
    /* Child kernels cover as many work items as the parent */
    ndrange_t range = ndrange_1D (get_global_size (0));

    void (^accumulate)(void) =
        ^{
            size_t id = get_global_id (0);
            b[id] += a[id];
        };

    /* Pass block as variable */
    enqueue_kernel (get_default_queue (),
                    CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                    range,
                    accumulate);

    /* Pass block directly */
    enqueue_kernel (get_default_queue (),
                    CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                    range,
                    ^{
                        size_t id = get_global_id (0);
                        b[id] = (int) sin ((float) a[id]);
                    });
}

SPIR intermediate representation

This is probably of less importance for users of the C and C++ API, but the Khronos group has also started to define an LLVM-based intermediate representation for code up to OpenCL 1.2. This will probably spawn a multitude of new OpenCL front-end languages.

Other built-ins and extensions

Tired of calculating the linear index for two- and three-dimensional data accesses? With the upcoming specification you can use get_global_linear_id and get_local_linear_id for these tasks.
There is now a whole bunch of C11-style atomic functions to change memory locations or test for conditions, with explicit memory order and scope arguments that control when the results become visible to other work items.
Sometimes, it’s a pain to get resource de-allocation right. For sloppy people like me, there is a new cl_khr_terminate_context extension that provides clTerminateContextKHR to quickly shut down an OpenCL context.
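
For reference, this is roughly the index arithmetic that get_global_linear_id saves you from writing by hand. The sketch below is plain C that emulates the calculation on the host; the function name linear_id and its parameters are my own, and inside a kernel the values would come from get_global_id, get_global_offset and get_global_size. For get_local_linear_id the same formula applies with local IDs, local sizes and zero offsets.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side emulation of the linear index of a work item.
 * dims:   number of dimensions in use (1, 2 or 3)
 * id:     ID of the work item per dimension
 * offset: offset per dimension (all zero in the local case)
 * size:   number of work items per dimension */
static size_t
linear_id (unsigned dims,
           const size_t id[3],
           const size_t offset[3],
           const size_t size[3])
{
    size_t linear = 0;

    assert (dims >= 1 && dims <= 3);

    /* Walk from the slowest-varying to the fastest-varying dimension:
     * ((z * size_y) + y) * size_x + x */
    for (unsigned d = dims; d-- > 0; )
        linear = linear * size[d] + (id[d] - offset[d]);

    return linear;
}
```

So for an 8×4 range without offsets, the work item at (3, 2) ends up at 2 * 8 + 3 = 19.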

Outlook

The specification is still in a provisional state and the Khronos group encourages you to give feedback over the next six months. After that, it will be set in stone. Probably two years after that, NVIDIA will provide an implementation of OpenCL 1.2.