Debugging SYCL Applications

29 October 2019

It's inevitable that you will need to debug your SYCL code at some point. While many of the techniques used are similar to the way this would be done in any C++ code, there are some things to keep in mind. It's not always easy to debug your code on the actual device you are executing on, for example a GPU, so SYCL offers a "host device" that can be used to emulate what happens on the target device, and makes it much easier to track down where things are going wrong. This post also outlines how you might go about debugging on your target device.

Debugging on a SYCL Host Device

Debugging kernels while they run on a device back-end requires support in the underlying platform, so debugging a SYCL kernel directly on a device such as a GPU is generally not possible.
However, developers can use standard tools such as valgrind, gdb and lldb to debug their SYCL applications on the host device instead.

The host device is an emulated back-end that executes kernels as native C++ code and emulates the SYCL execution and memory models. Every SYCL implementation is required to provide a host device, which allows developers to run a SYCL application without any of the supported back-ends (e.g. OpenCL 1.2 or above for ComputeCpp) set up. The host device is not meant to provide a fast multi-threaded execution target, but a working one that implements the core SYCL API (excluding vendor-specific extensions) using nothing but pure C++ (at least C++11).

Having clarified what exactly the host device is, let's see the purpose of having such an implementation.

  • The host device exists to be used as a fallback execution target for a SYCL application rather than as an optimal solution for parallel execution; it shouldn't be used in production code
  • It is the perfect environment for debugging SYCL code since all of the API calls and kernel code are basically standard C++, which can be compiled using any modern compiler
  • When SYCL parallel_for is invoked, the SYCL host implementation will spawn native OS threads, allowing developers to use standard multi-threaded debugging methods, as the sketch below shows
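
To make the last point concrete, here is a minimal sketch of a complete SYCL program that runs a trivial kernel on the host device, suitable for stepping through with gdb or running under valgrind. The kernel name and data size are illustrative rather than taken from the SDK samples.

#include <CL/sycl.hpp>
#include <vector>

class vec_double; // illustrative kernel name

int main() {
  std::vector<int> data(16, 1);
  // Select the host device explicitly so the kernel executes as native C++.
  cl::sycl::queue queue{cl::sycl::host_selector{}};
  {
    cl::sycl::buffer<int, 1> buf{data.data(), cl::sycl::range<1>{data.size()}};
    queue.submit([&](cl::sycl::handler& cgh) {
      auto acc = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
      cgh.parallel_for<vec_double>(cl::sycl::range<1>{data.size()},
                                   [=](cl::sycl::id<1> i) { acc[i] *= 2; });
    });
  } // the buffer destructor waits for the kernel and writes the results back
  return 0;
}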

In order to enable debugging on the host device without having to modify the program source every time, a simple option would be to set a preprocessor macro when creating a custom debug build and use it as shown below.

Here's an example, using a modified version of the reduction sample from the computecpp-sdk, of how to use this macro to select the host device.

#if DEBUG_IN_HOST_DEVICE
cl::sycl::queue queue(cl::sycl::host_selector{}, ...);
#else
cl::sycl::queue queue(cl::sycl::default_selector{}, ...);
#endif 

If the macro is set, the queue is initialized with a host device. Otherwise it uses the default selector, which will select a GPU if one is available on the hardware.
The rest of the program source remains unchanged, but the program has to be compiled with -g (for gcc), or with CMAKE_BUILD_TYPE=Debug when using CMake, so that the executable contains the extra debugging information used on the host device.
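
For example, with a CMake-based build you might configure a debug build that also defines the macro. The invocation below is only a sketch and assumes a build target named reduction:

cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_FLAGS="-DDEBUG_IN_HOST_DEVICE=1" ..
make reduction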

However, modifying code for such tasks is neither advisable nor convenient, which is why the ComputeCpp SYCL implementation provides the environment variable COMPUTECPP_TARGET. When this variable is set to host, the default_selector is forced to select the host device, eliminating the need for a custom preprocessor definition and changes in your code. It is important to note that this variable is not part of the SYCL standard and can be used only with ComputeCpp.
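
For example, assuming a Linux shell, you could run the reduction sample on the host device like this:

COMPUTECPP_TARGET=host ./reduction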

Once we are sure that the host device has been selected, we can run valgrind to debug our application (the reduction sample in this case).

Checking for memory leaks

Let's start by using memcheck, as it is valgrind's default built-in tool and detects the use of uninitialized memory, out-of-bounds read/write access to memory, and memory leaks.

To run valgrind with memcheck, type the following command:

valgrind --leak-check=full --show-leak-kinds=all ./reduction

It's now possible to extract the leak summary, which in this case is as follows.

==10853== LEAK SUMMARY:
==10853==    definitely lost: 6,377 bytes in 4 blocks
==10853==    indirectly lost: 76 bytes in 2 blocks
==10853==      possibly lost: 0 bytes in 0 blocks
==10853==    still reachable: 138,748 bytes in 1,731 blocks
==10853==         suppressed: 0 bytes in 0 blocks 

And you can also track down the origins of uninitialized values using the --track-origins=yes option.
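
For example:

valgrind --track-origins=yes ./reduction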

Interpreting cache utilization

Another tool that we can use is Cachegrind, which simulates the first-level and last-level caches (usually the last-level cache has the most influence on runtime performance).

The host machine used in these experiments has an Intel CPU with 3 levels of cache, where the first level has separate instruction and data caches, and the rest are unified.

On Linux, you can check these details using the command:

lscpu | grep "cache"

The output looks something like this.

L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K 

To run valgrind with cachegrind, use the following command:

valgrind --tool=cachegrind ./reduction

This will simulate the reduction program's interaction with the machine's cache hierarchy as visible in the output below.

==9436==
==9436== I   refs:      299,487,394
==9436== I1  misses:        103,329
==9436== LLi misses:         49,491
==9436== I1  miss rate:        0.03%
==9436== LLi miss rate:        0.02%
==9436==
==9436== D   refs:      122,361,759  (74,394,550 rd   + 47,967,209 wr)
==9436== D1  misses:      3,461,492  ( 2,527,357 rd   +    934,135 wr)
==9436== LLd misses:      1,388,245  (   722,130 rd   +    666,115 wr)
==9436== D1  miss rate:         2.8% (       3.4%     +        1.9%  )
==9436== LLd miss rate:         1.1% (       1.0%     +        1.4%  )
==9436==
==9436== LL refs:         3,564,821  ( 2,630,686 rd   +    934,135 wr)
==9436== LL misses:       1,437,736  (   771,621 rd   +    666,115 wr)
==9436== LL miss rate:          0.3% (       0.2%     +        1.4%  )

Interpreting the results, we can see an insignificant miss rate of 0.03% in the first-level instruction cache (I1), but a more significant miss rate of 2.8% in the first-level data cache (D1).
This tells us that the reduction program's data doesn't entirely fit in the L1 cache, so cache optimization approaches can be applied. However, with hundreds of millions of references going through L1, some compulsory cache misses may be unavoidable.
Furthermore, we can also see that there is a very small miss rate of 0.3% in the L3 cache (shown as LL, the last-level cache).
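
Cachegrind also writes its measurements to a file named cachegrind.out.<pid>, which can be broken down per function using the cg_annotate tool that ships with valgrind:

cg_annotate cachegrind.out.9436

Here the number matches the process ID (9436) shown in the output above.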

Detecting race conditions

You can also check for possible data races and observe the thread allocations by using the Helgrind tool:

valgrind --tool=helgrind ./reduction

In this case, however, you need to be careful when interpreting the output: Helgrind will warn about a possible data race whenever one thread reads a block of bytes that another thread writes without a lock being held.
Helgrind can't know whether the program prevents the two threads from accessing that memory simultaneously by some other means, so it flags such accesses as possible races.

Debugging by stepping-through the program

You can also use gdb or another source-line debugger (e.g. lldb) on the host device to step through your program, just as you would step through a standard C++ application.
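
For example, assuming the host device is selected as described earlier, you can launch the reduction sample under gdb, set a breakpoint and run:

gdb --args ./reduction
(gdb) break main
(gdb) run

From there, standard commands such as step, next and print work on both the host API calls and the kernel code, since on the host device both are plain C++.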

Debugging on an OpenCL Device

While debugging on the host device works for the most part, not every situation is the same. There may be cases where developers run into problems that can only be observed on a device.
If you find yourself in a situation where an issue only occurs when running on an OpenCL device, you have to debug it on the device.

Fortunately, there are existing tools that are capable of simulating an OpenCL device as well as providing validation of the OpenCL (1.2) code.

Oclgrind is the go-to tool for this purpose; it is an OpenCL device simulator and debugger that can interpret OpenCL SPIR (1.2). (You can learn more about SPIR by having a look at its specification hosted by the Khronos Group.)
At its core, Oclgrind simulates how an OpenCL device executes a kernel independently from any specific architecture. This is particularly useful when facing portability issues during development.

Setting up Oclgrind

The easiest way to install oclgrind is from an OS package manager that maintains a stable version of the project.
For example, on Ubuntu, using the APT package manager, you can simply type the following command in the Terminal:

sudo apt-get install oclgrind

This will get you the latest stable version of the tool maintained for your Ubuntu version. For Ubuntu 16/17/18 this will be oclgrind 15.5. To check the version you have installed, simply type:

oclgrind --version

In case you want to install the latest version of oclgrind (18.3 at the time of writing), which is usually not maintained by Linux package managers such as Ubuntu's APT, it is best to build directly from the project source files.
The Building and Installing section on the project's GitHub wiki gives a thorough explanation of how to proceed this way, even explaining how to run oclgrind via the Khronos OpenCL ICD loader. As a prerequisite, you will need CMake installed for your platform.
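
As a rough sketch, a from-source build follows the usual CMake flow. The repository URL below is the project's GitHub home at the time of writing; refer to the wiki for the authoritative steps:

git clone https://github.com/jrprice/Oclgrind.git
cd Oclgrind
mkdir build && cd build
cmake ..
make
sudo make install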

Using Oclgrind

Fortunately, oclgrind is not difficult to use. It implements the full OpenCL 1.2 runtime API, so developers are not required to make any manual changes to their SYCL (1.2.1) or OpenCL programs.

Here are a few examples that only scratch the surface of what's possible, but demonstrate how easy it is to get started: enabling data-race detection in kernels, or collecting kernel information such as the counts of instructions executed on the simulated OpenCL device. In addition to the above, you can check for errors in API calls, dump OpenCL SPIR and generate IR (intermediate representation) files.

Oclgrind incorporates a simple plugin for detecting invalid memory accesses, a well-known and common problem in GPU compute. The platform you are running on may not provide meaningful feedback about what went wrong, whereas oclgrind checks each memory access that your OpenCL kernels perform to ensure they do not touch memory they shouldn't. This option is enabled by default. If an error is encountered, the plugin provides a diagnostic with the failure code and an error message.
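
As an illustration (this is not one of the SDK samples), a kernel like the following, which reads one element past the end of its input accessor, is exactly the kind of defect the plugin catches. The accessor names, the kernel name and N are hypothetical:

cgh.parallel_for<class oob_kernel>(range<1>{N}, [=](id<1> i) {
  outAcc[i] = inAcc[i + 1]; // out-of-bounds read when i == N - 1
});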

Checking for runtime API errors

There are plenty of issues that can arise when using the various runtime API functions from the host. Unfortunately, the error codes returned from these API calls can each point to multiple possible causes. For detecting API errors, oclgrind provides a plugin called check-api that needs to be manually specified as a flag when running the application with oclgrind.

Consider the following attempt to launch a kernel over an nd_range where the local work size does not divide the global work size evenly. The simple-local-barrier sample from the SDK is handy for the purpose of this demonstration.

const int size = 65;

[...] // program logic.

cgh.parallel_for<example_kernel>(nd_range<1>(range<1>(size), range<1>(2)), [=](nd_item<1> item) {
  [...] // kernel logic.
});

Now we can run the program with the --check-api flag to enable error checking for the underlying OpenCL API calls:

oclgrind --check-api ./simple-local-barrier

In this scenario, dividing the global work size, which is 65, by the local work size of 2 gives a fraction (32.5), and a fraction can't be used to define the number of work-groups for execution. Instead, the division must produce a natural, whole number of work-groups.
So the API call does not return successfully, and oclgrind prints a description of the error it detected (below).

Oclgrind - OpenCL runtime error detected
    Function: clEnqueueNDRangeKernel
    Error:    CL_INVALID_WORK_GROUP_SIZE
    Dimension 0: local_work_size (2) does not divide global_work_size (65)  

In addition, another way of triggering the CL_INVALID_WORK_GROUP_SIZE error would be setting too large a local work size, which oclgrind will also report with an appropriate error message.

Detecting race conditions

Similarly to checking for races with valgrind's Helgrind tool, you can check for race conditions using oclgrind. If the behavior can only be observed when running on an OpenCL device, we can enable the data-race detection plugin in oclgrind by passing the --data-races flag.

oclgrind --data-races ./application

In case it detects a data race, the tool will output the address at which the race has been detected, followed by the memory space of that address and the failing instruction, alongside other useful debugging information where available.

An example of a race condition inside a kernel could be one where a work-item with one local id assigns a value to some variable, and a work-item with another local id tries to read and use that same variable, with no synchronization between the work-items in the work-group.
Consider the following SYCL kernel code:

auto inAcc = bufIn.get_access<access::mode::read>(cgh);
auto outAcc = bufOut.get_access<access::mode::write>(cgh);
accessor<int, 1, access::mode::read_write, access::target::local> localAcc(range<1>{1}, cgh);

cgh.parallel_for<racy_kernel>(nd_range<1>{globalRange, localRange}, [=](nd_item<1> item) {
  const auto localId = item.get_local_id(0);
  if (localId == 1) {
    localAcc[0] = inAcc[localId];
  }
  if (localId == 0) {
    outAcc[localId] = localAcc[0];
  }
});

This code implements exactly the race condition described above. Running it through oclgrind with the data-races plugin enabled gives the following output:

Read-write data race at local memory address 0x1000000000000
	Kernel: SYCL_class_racy_kernel

	First entity:  Global(1,0,0) Local(1,0,0) Group(0,0,0)
	  store i32 %7, i32 addrspace(3)* %0, align 4, !tbaa !11
    Debugging information not available.


	Second entity: Global(0,0,0) Local(0,0,0) Group(0,0,0)
	  %9 = load i32, i32 addrspace(3)* %0, align 4, !tbaa !11
    Debugging information not available. 

The fix is to add a work-group barrier with memory ordering on the local address space between the two assignments:

item.barrier(access::fence_space::local_space);

With this, the application will run without any data races. In the case of an OpenCL program, where the kernel source is separate from the host source, you would also get the line where the race condition was detected together with the faulty code, but this information is not available in our case. However, you can deduce where the race is happening in the code by interpreting the returned SPIR instructions.
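
For clarity, here is the kernel from above with the barrier in place:

cgh.parallel_for<racy_kernel>(nd_range<1>{globalRange, localRange}, [=](nd_item<1> item) {
  const auto localId = item.get_local_id(0);
  if (localId == 1) {
    localAcc[0] = inAcc[localId];
  }
  item.barrier(access::fence_space::local_space); // makes the store visible to all work-items
  if (localId == 0) {
    outAcc[localId] = localAcc[0];
  }
});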

Collecting profiling information

Another interesting feature, if you are familiar with the LLVM instruction set, is the --inst-counts option that can be passed to oclgrind. Even if you are not that familiar, the SPIR spec includes a reference table (pp. 20-21) of all the LLVM instructions that may be used in SPIR. You can use this information to understand more about your program's execution.

Let's go back to the reduction sample we used in the "Debugging on a SYCL Host Device" section, and run it with oclgrind, passing the --inst-counts flag.

oclgrind --inst-counts ./reduction 

Oclgrind outputs a histogram in the terminal showing the counts of the instructions (as well as the instructions themselves) that were executed while running the kernel. In more detail, the counts for memory loads and stores, which belong to the Memory Access & Addressing LLVM instruction family, are split per address space to show the number of bytes read or written in each space (e.g. global or local).

Device Name: Oclgrind Simulator
Platform Name: Oclgrind
Instructions executed for kernel 'SYCL_class_sycl_reduction_int_':
           2,688 - br
           2,304 - icmp
           1,024 - call _Z7barrierj()
           1,024 - phi
             896 - lshr
             512 - getelementptr
             382 - add
             255 - load local (1,020 bytes)
             255 - store local (1,020 bytes)
             128 - call _Z12get_group_idj()
             128 - call _Z12get_local_idj()
             128 - call _Z13get_global_idj()
             128 - load global (512 bytes)
             128 - ret
             128 - sdiv
             128 - select
             128 - sext
             128 - trunc
               1 - store global (4 bytes) 

This functionality is particularly helpful for optimizing kernels that involve synchronization between the work-items in a work-group. For example, in the reduction sample we have 1,024 barrier calls, which may be too many in this case, but in general the count is a helpful indicator when profiling slow kernels. The number of synchronizations (between work-items) being performed could help point towards ways to speed up the kernel execution.

Interactive-style debugging - Stepping through the kernel

In case you want to step through the kernel, you can do that by running oclgrind in interactive mode, which provides a simple gdb-style step-through debugger supporting a limited subset of gdb's commands.

Command list:
  backtrace    (bt)
  break        (b)
  continue     (c)
  delete       (d)
  gmem         (gm)
  help         (h)
  info         (i)
  list         (l)
  next         (n)
  lmem         (lm)
  pmem         (pm)
  print        (p)
  quit         (q)
  step         (s)
  workitem     (wi)

Running an application with oclgrind in interactive mode is done as follows:

oclgrind --interactive ./application # or just type -i for short

Oclgrind will automatically break at the start of each kernel invocation. Additionally, the interactive debugger plugin works together with the other oclgrind plugins, which means it can automatically break into a prompt if an error is encountered (e.g. an invalid memory access).
You can manually set breakpoints for specific source lines, just like with gdb, by using the break command. You can also switch between work-items: while oclgrind executes the work-items sequentially, you can use the workitem command to do the switch manually, as illustrated below.
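
An illustrative session might look like the following. The source line number and variable name are hypothetical, and the exact argument forms are assumptions (use the help command for the authoritative syntax):

(oclgrind) break 42
(oclgrind) continue
(oclgrind) workitem 1 0 0
(oclgrind) print localId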

Other command features include viewing a variable and inspecting regions of memory.

  • You can't watch specific variables but you can use the print command to view the contents of a specific variable.
  • To inspect regions of memory, you can use the mem commands - lmem, pmem and gmem.

Furthermore, to get a general overview of the NDRange of the kernel you are debugging, you can use the info command in the debugger.
Here's an example using the reduction kernel as a sample:

(oclgrind) info
Running kernel 'SYCL_class_sycl_reduction_int_'
-> Global work size:   (128,1,1)
-> Global work offset: (0,0,0)
-> Local work size:    (128,1,1)

The SYCL host device makes debugging much easier for developers. Since SYCL code is written entirely in standard C++, we can use common and robust tools to debug SYCL applications. It's also possible to use OpenCL debugging tools out of the box, without any code modifications, which makes debugging kernel-specific errors possible without vendor-dependent tools.

If you'd like to find out about how to profile your SYCL application, read our Optimizing Your SYCL Code Using Profiling guide.

Codeplay Software Ltd has published this article only as an opinion piece. Although every effort has been made to ensure the information contained in this post is accurate and reliable, Codeplay cannot and does not guarantee the accuracy, validity or completeness of this information. The information contained within this blog is provided "as is" without any representations or warranties, expressed or implied. Codeplay Software Ltd makes no representations or warranties in relation to the information in this post.

Georgi Mirazchiyski

Developer Relations Engineer