Profiling an OpenCL program is essential for optimizing kernel performance.
Host-side timers have an inevitable drawback: you cannot measure asynchronous kernel
calls reliably without inserting synchronous calls such as
clFinish(). This is usually not a problem; you would just put these into your
code and guard them with
#ifdef DEBUG/#endif. You would lose the ability
to overlap computation, but this would not affect release builds.
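The guarded synchronization described above might look like the following sketch. It assumes that a valid `queue` and `kernel` already exist; the helper names `wall_clock_ms` and `timed_launch` are made up for illustration and are not taken from the benchmark. It blocks with clFinish(), which waits until all queued commands have completed, whereas clFlush() only submits them to the device. Building it requires the OpenCL headers and library, and the timer assumes a POSIX system.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>
#include <CL/cl.h>

/* Hypothetical helper: host wall-clock time in milliseconds (POSIX). */
static double wall_clock_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec * 1e-6;
}

/* Hypothetical helper: enqueue a 1D kernel over n work items.
 * In DEBUG builds only, block until the kernel has finished so that the
 * host timer covers the actual execution. Release builds stay asynchronous. */
cl_int timed_launch(cl_command_queue queue, cl_kernel kernel, size_t n)
{
    cl_int err;
#ifdef DEBUG
    double start = wall_clock_ms();
#endif

    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL,
                                 0, NULL, NULL);

#ifdef DEBUG
    if (err == CL_SUCCESS) {
        /* Serializes host and device, so no overlapping computation. */
        err = clFinish(queue);
        printf("kernel took %.3f ms\n", wall_clock_ms() - start);
    }
#endif
    return err;
}
```

Compiled without -DDEBUG, the function only enqueues the kernel and returns immediately.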
However, in some cases you might want to measure the performance of your kernels
in production runs and still ensure asynchronous kernel execution. In this case,
you have to query the profiling information associated with event objects using
clGetEventProfilingInfo() and make sure to enable profiling on the command queue by passing the
CL_QUEUE_PROFILING_ENABLE property when calling
clCreateCommandQueue(). Usually, this kind of instrumentation incurs a performance
overhead. Because I wanted to know how big this impact would be, I wrote a small
benchmark utility which measures the run time
for two small kernels and different input sizes. I ran this benchmark on my home
computer that sports an NVIDIA GTX 480 and a Core i5 3450.
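Event-based profiling as described above could be sketched as follows. This is not the benchmark's code: the function names `create_profiling_queue`, `launch_async` and `event_time_ms` are hypothetical, and a valid context, device and kernel are assumed to exist. Building it requires the OpenCL headers and library.

```c
#include <CL/cl.h>

/* Create a command queue whose commands record profiling timestamps. */
cl_command_queue create_profiling_queue(cl_context context,
                                        cl_device_id device, cl_int *err)
{
    return clCreateCommandQueue(context, device,
                                CL_QUEUE_PROFILING_ENABLE, err);
}

/* Enqueue a 1D kernel over n work items without blocking.
 * The event is handed back so it can be queried later. */
cl_int launch_async(cl_command_queue queue, cl_kernel kernel,
                    size_t n, cl_event *event)
{
    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL,
                                  0, NULL, event);
}

/* Device execution time of a command in milliseconds,
 * or a negative value if the information could not be queried. */
double event_time_ms(cl_event event)
{
    cl_ulong start = 0, end = 0;

    /* Timestamps are only valid once the command has completed. */
    if (clWaitForEvents(1, &event) != CL_SUCCESS)
        return -1.0;
    if (clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL) != CL_SUCCESS)
        return -1.0;
    if (clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL) != CL_SUCCESS)
        return -1.0;

    /* The timestamps are reported in nanoseconds. */
    return (double)(end - start) * 1e-6;
}
```

Because `event_time_ms` only waits on the one event, it can be called after other work has been enqueued, and the event should be released with clReleaseEvent() afterwards. Note that clCreateCommandQueue() is deprecated as of OpenCL 2.0 in favor of clCreateCommandQueueWithProperties().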
The measurements show that for very small input sizes (less than 512×512 pixels), enabling profiling could have a tremendous effect on the run time. However, this also depends on the kind of operation you execute. The simple kernel was merely calculating
output[tid] = input[tid] * 2.0f;
whereas the more computationally demanding kernel computed
output[tid] = cos(input[tid]) * exp(input[tid] * 2.0f);
Of course, relatively less time is spent managing the event objects when a more complicated kernel is in place. This is even more true if you increase the number of computed elements. With more than 512×512 elements, the overhead becomes negligible and is sometimes even negative (presumably measurement noise).
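To make the notion of overhead concrete, here is a minimal sketch of how a relative overhead could be computed from two run times, one measured without and one with profiling enabled. The function name and the choice of a percentage are assumptions for illustration; the benchmark itself may report the numbers differently.

```c
#include <math.h>

/* Relative overhead of the profiled run compared to the plain run,
 * in percent. Returns NAN if the baseline is not positive.
 * A negative result means the profiled run was the faster one. */
double profiling_overhead_percent(double plain_ms, double profiled_ms)
{
    if (plain_ms <= 0.0)
        return NAN;
    return (profiled_ms - plain_ms) / plain_ms * 100.0;
}
```

For example, with made-up timings of 2.0 ms without and 3.0 ms with profiling, the function returns 50.0. A profiled run that happens to be faster than the baseline gives a negative value, which is how a negative overhead can show up in the results.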
In conclusion, you can leave profiling enabled in production runs, for whatever reason you need it, as long as your kernels are computationally demanding (which they are anyway, right?).