James Price is currently completing a PhD degree at the Department of Computer Science, University of Bristol.

When developing programs that utilise GPU compute via OpenCL, we can’t use our traditional CPU development tools. This can make debugging complex OpenCL kernels challenging. As part of my PhD, funded by Imagination Technologies, I’ve developed an OpenCL device simulator called Oclgrind, which enables a raft of tools that can make debugging OpenCL kernels a much simpler task.

Overview

At its core, Oclgrind simulates how an OpenCL device executes a kernel. It does this in a manner which is independent from any specific architecture, which enables it to uncover many portability issues that arise during OpenCL development. By exposing a simple plugin interface, Oclgrind enables the creation of a wide variety of tools that can be used to analyse or debug OpenCL kernels.

Using Oclgrind is straightforward, as it implements the full OpenCL 1.2 runtime API. This means that existing OpenCL programs can be simulated without requiring modifications. An alternative interface for simulating specific kernels in isolation is also provided.

Detecting invalid memory accesses

Accessing memory locations that have not been properly allocated is a common problem that can cause headaches for developers using GPU compute. This varies between different platforms, but many GPUs don’t have the capability to detect these errors at the hardware level and can’t produce meaningful feedback about what went wrong.

Oclgrind provides a simple plugin that will check each memory access that your OpenCL kernels perform to ensure they do not access memory that they shouldn’t, and provides clear diagnostics to aid the developer in finding the problem when something does go wrong. Take the following kernel as an example – a trivial parallel vector addition.

kernel void vecadd(global float *a, global float *b, global float *c)
{
  int i = get_global_id(0);
  c[i] = a[i] + b[i];
}

Running this kernel through Oclgrind with a global work size of 1024 but using buffers that are only 4000 bytes in size produces the following error message:

Invalid write of size 4 at global memory address 0x3000000000fa0
  Kernel: vecadd
  Entity: Global(1000,0,0) Local(8,0,0) Group(31,0,0)
    store float %7, float addrspace(1)* %8, align 4, !dbg !21
At line 8 of input.cl:
  c[i] = a[i] + b[i];

This shows us that we were trying to write 4 bytes (one 32-bit float) to global memory. It gives us the specific kernel and line of code, and also tells us which work-item was responsible. In this case, it’s clear that we need to guard these memory accesses by checking that the index is within the bounds of our problem.

Detecting race conditions

Another common bug that arises when writing parallel programs is when insufficient synchronisation is provided when memory is accessed concurrently from multiple units of execution, resulting in a race condition. These defects can be difficult to spot when dealing with complex programs, and might only cause problems on certain devices.

Consider the simple kernel depicted below. Here we have a shared variable named ‘temp’ which is allocated in the local address space, and is therefore shared between work-items in the same work-group. In this kernel, the work-item with a local ID of 1 assigns a value to this variable, and then the work-item with local ID 0 tries to read and use the variable.

kernel void racy(global float *input, global float *output)
{
  local float temp;
  int lid = get_local_id(0);

  if (lid == 1)
    temp = input[lid];
  if (lid == 0)
  output[0] = temp;
}

We can enable the data-race detection plugin in Oclgrind by passing the –data-races flags. If we simulate the above kernel, Oclgrind will inform us of the potential race with a message like this:

Read-write data race at local memory address 0x1000000000000
  Kernel: racy
  First entity: Global(1,0,0) Local(1,0,0) Group(0,0,0)
    store float %5, float addrspace(3)* @racy.temp, align 4, !dbg !23
  At line 7 of input.cl:
    temp = input[lid];
  Second entity: Global(0,0,0) Local(0,0,0) Group(0,0,0)
    %7 = load float, float addrspace(3)* @racy.temp, align 4, !dbg !26
  At line 10 of input.cl:
    output[0] = temp;

This tells us that we have a data-race on a local memory address, between two work-items with IDs 0 and 1. It also shows the lines of code that were responsible for the problem. Adding a barrier(CLK_LOCAL_MEM_FENCE) between these two statements and then running the code again gives a clean bill of health.

Interactive source-line debugging

Sometimes, getting error messages after-the-fact is not enough to get to the root of a problem, and you need to really delve into a kernel to find out why it is misbehaving. Oclgrind provides a plugin that allows a developer to interactively step through kernel execution in a GDB-like command-line environment. The plugin can be enabled by passing the –interactive flag when we launch Oclgrind. When kernel execution begins, we will be provided with a prompt at which we can enter a variety of commands. Much like GDB, we can step through our kernel line-by-line, print out the values of variables or memory addresses, or instruct the plugin to break at a particular line of code (type help for a full list of commands). This plugin also interacts with the other plugins, and will automatically break into a prompt if an error is encountered.

Checking for runtime API errors

Writing OpenCL programs isn’t just about developing kernels, and there’s plenty of issues that can arise when using the various runtime API functions from the host. The error messages that these API calls return can sometimes be confusing, or they could indicate multiple possible causes. By passing the –check-api flag to Oclgrind we can get it to give us human-readable error messages that describe the exact problem that an API call is trying to report:

Oclgrind - OpenCL runtime error detected
  Function: clEnqueueNDRangeKernel
  Error:   CL_INVALID_WORK_GROUP_SIZE
  local_work_size[0]=32 does not divide global_work_size[0]=1000

Additional information

Oclgrind is an open source tool, and is provided under a BSD license. Binaries for Windows, Linux and OS X are available.

More information about using Oclgrind can be found on its Wiki.

About the author: Guest Blog

Profile photo of guestblog