SYCL Tutorial 2: The Vector Addition Re-visited

13 August 2014

This blog post has been archived because ComputeCpp, the Codeplay implementation of the SYCL standard, is now available. The "Getting Started" guide for ComputeCpp on our website explains how to get set up and start using SYCL.

Articles in this series

Tutorial 1: The Vector Addition
Tutorial 2: The Vector Addition Re-visited
Tutorial 3: Integrating SYCL Into Stanford University Unstructured

1 Introduction

Continuing in the SYCL™ Tutorial series, this post uses another exercise from Simon McIntosh-Smith's tutorial: Exercise number 4. Here we also implement a vector addition, in this case by using a pair-wise addition kernel three times.

Although the implementation used in the previous post is probably more efficient in most cases, this one exercises several SYCL features that make the example worth the effort.
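To make the data flow concrete before looking at any device code, here is a minimal host-only sketch of the chained addition in plain C++. The helper names pairwise_add and chained_add are made up for illustration and appear neither in the HandsOnOpenCL code nor in SYCL; the order of the three additions follows the SYCL listing later in this post.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of SYCL or OpenCL):
// element-wise out[i] = x[i] + y[i], the job of the pair-wise kernel.
std::vector<float> pairwise_add(const std::vector<float>& x,
                                const std::vector<float>& y) {
  assert(x.size() == y.size());
  std::vector<float> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    out[i] = x[i] + y[i];
  }
  return out;
}

// Hypothetical helper: applies the pair-wise addition three times.
//   c = a + b,  d = e + c,  f = g + d
std::vector<float> chained_add(const std::vector<float>& a,
                               const std::vector<float>& b,
                               const std::vector<float>& e,
                               const std::vector<float>& g) {
  const std::vector<float> c = pairwise_add(a, b);
  const std::vector<float> d = pairwise_add(e, c);
  return pairwise_add(g, d);
}
```

Each intermediate result (c, then d) feeds the next addition, which is why the order of the three kernel launches matters in the device versions.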

2 Using the OpenCL™ C++ bindings

The code in the HandsOnOpenCL GitHub repository implements this chained vector addition using the OpenCL C++ wrappers:

Seven host-side STL vectors are defined (h_a to h_g) and their device counterparts (cl::Buffer) are declared shortly after.
From line 73 to line 84, the code initializes the devices and loads the source file for the program. Note that at this point (line 84), we define the type of the vectors to be float.
The kernel itself, in a separate file (vadd.cl) is pre-defined to work with floats.
The following lines (86 to 93) create the buffers in the device (again, pre-set to work with floats).
Then, the kernel is enqueued three times with different parameters (95, 104, 113), and, finally, data is copied back to the host (line 122).

This is actually a good example of how the C++ bindings can simplify code that would otherwise be written against the bare OpenCL C API. See the C version for comparison.

3 Using SYCL

The simplest way of using a command group, as we did in the previous post in this series, is to use a lambda inside the command group constructor to define the kernel function. Lambdas simplify writing kernel code, especially for simple kernels. However, in this sample we will use a functor to illustrate other features of SYCL. The listing below shows the definition of the vector addition kernel from within a command group functor:

template<typename TYPE>
class vadd_params
{
private:
  buffer<TYPE, 1> * m_va;
  buffer<TYPE, 1> * m_vb;
  buffer<TYPE, 1> * m_vc;
  unsigned int m_nelems;
public:
  vadd_params(buffer<TYPE, 1> * va,
              buffer<TYPE, 1> * vb,
              buffer<TYPE, 1> * vc,
              unsigned int nelems)
    : m_va(va), m_vb(vb), m_vc(vc), m_nelems(nelems) { }
  void operator()()
  {
    auto ptrA = m_va->template get_access<access::read>();
    auto ptrB = m_vb->template get_access<access::read>();
    auto ptrC = m_vc->template get_access<access::read_write>();
    parallel_for(m_nelems, kernel_lambda<class vadd_params_kernel>
                 ([=] (id<1> i) {
                   ptrC[i] = ptrA[i] + ptrB[i];
                 }
                 ));
  }
};

Note that the command group functor (vadd_params) is just a normal C++ class, so we can declare it as a template instead of specifying a fixed type. Yes, this is exactly what you were thinking: we have the implementation of all the type-variants of the vector addition for free. No need for multiple different kernels for every type, or macros to switch between float and int.
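The "one template, every type" point does not depend on SYCL at all. The following host-only sketch shows the same shape of class operating on std::vector instead of SYCL buffers; host_vadd is a hypothetical name used only here, and the plain loop stands in for the parallel_for call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-only stand-in for the command group functor.
// It holds non-owning pointers to its operands, as vadd_params does.
template <typename T>
class host_vadd {
  const std::vector<T>* m_va;
  const std::vector<T>* m_vb;
  std::vector<T>* m_vc;

 public:
  host_vadd(const std::vector<T>* va, const std::vector<T>* vb,
            std::vector<T>* vc)
      : m_va(va), m_vb(vb), m_vc(vc) {
    assert(va->size() == vb->size() && vb->size() == vc->size());
  }

  // A sequential loop replaces the device kernel in this sketch.
  void operator()() const {
    for (std::size_t i = 0; i < m_vc->size(); ++i) {
      (*m_vc)[i] = (*m_va)[i] + (*m_vb)[i];
    }
  }
};
```

Writing host_vadd<int>(&ia, &ib, &ic)() or host_vadd<float>(&fa, &fb, &fc)() instantiates the same class for two element types without touching its body, which is the property the SYCL functor inherits from ordinary C++ templates.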

The constructor of the command group functor receives pointers to the buffers that we want to add, and the number of elements in the vectors. Since they are plain pointers, no move semantics are involved and ownership is not transferred. The accessors created inside the command group tell the runtime what kind of access the kernel requires on each buffer.

The parallel_for call specifies the kind of kernel that we want to launch, and the lambda contains the kernel itself. Another advantage of using command group functors is that the kernel and its accessors are grouped in a single class. This makes it easier to build complex class hierarchies, such as compositions of kernel functors, or to do constexpr magic.

How do we use this functor? Well, we just have to replace all the humongous boilerplate code with the creation of the SYCL buffers and the default queue. In the listing below, we create a new scope (A) so that we control when the SYCL buffers are copied back to their host counterparts (in B). Inside the scope, we create the command group objects and pass them our lovely functor, like so:

const unsigned NELEMS = 1024u;
(...)
{ /* A: Create scope */
  buffer<float, 1> bufA(h_A.data(), h_A.size());
  buffer<float, 1> bufB(h_B.data(), h_B.size());
  buffer<float, 1> bufC(h_C.data(), h_C.size());
  buffer<float, 1> bufD(h_D.data(), h_D.size());
  buffer<float, 1> bufE(h_E.data(), h_E.size());
  buffer<float, 1> bufF(h_F.data(), h_F.size());
  buffer<float, 1> bufG(h_G.data(), h_G.size());
  /* The default constructor will use a default selector */
  queue myQueue;
  /* Now we create three command group objects, each enqueuing the
   * functor with different parameters.
   */
  command_group(myQueue, vadd_params<float>(&bufA, &bufB, &bufC, NELEMS));
  command_group(myQueue, vadd_params<float>(&bufE, &bufC, &bufD, NELEMS));
  command_group(myQueue, vadd_params<float>(&bufG, &bufD, &bufF, NELEMS));
} /* B: Will wait here until execution finishes */
(...)

The functor is instantiated for floats and passed to the constructor of the command group, which enqueues it on the given queue. Although we only use floats in this sample (as we are following the tutorial), we could be using any type. The compiler will take care of creating the various implementations for us.

The three command groups will then be executed in order. When execution reaches the end of the block statement at B, the destructors of the buffers trigger the copying back of the results. Note also that we do not copy data in and out for each kernel ourselves; the runtime takes care of copying the data required by each kernel.
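The scope trick between A and B is ordinary C++ RAII. As a rough host-only illustration, the sketch below uses a hypothetical writeback_buffer class (not the SYCL buffer, and not how any real runtime is required to manage memory) that works on a private copy and writes it back to the host pointer in its destructor.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a buffer: keeps a private copy of the host
// data (playing the role of device memory) and copies it back on destruction.
template <typename T>
class writeback_buffer {
  T* m_host;
  std::vector<T> m_storage;

 public:
  writeback_buffer(T* host, std::size_t count)
      : m_host(host), m_storage(host, host + count) {}
  writeback_buffer(const writeback_buffer&) = delete;
  writeback_buffer& operator=(const writeback_buffer&) = delete;
  ~writeback_buffer() {
    std::copy(m_storage.begin(), m_storage.end(), m_host);
  }

  T& operator[](std::size_t i) { return m_storage[i]; }
  std::size_t size() const { return m_storage.size(); }
};

// Fills h_C with 42 through the stand-in buffer. Returns true if the host
// vector was still unchanged while the buffer was alive.
bool demo_scope(std::vector<float>& h_C) {
  assert(!h_C.empty());
  const float before = h_C[0];
  bool untouched_inside = false;
  { /* A: Create scope */
    writeback_buffer<float> bufC(h_C.data(), h_C.size());
    for (std::size_t i = 0; i < bufC.size(); ++i) {
      bufC[i] = 42.0f;
    }
    untouched_inside = (h_C[0] == before);
  } /* B: destructor copies the data back here */
  return untouched_inside;
}
```

The host vector only sees the new values after the closing brace, which is the behaviour the SYCL listing relies on when it reads the results after B.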

4 Using bind

If the above code was too verbose for your taste, here's a different way to implement it using std::bind, as suggested by a colleague:

template<typename TYPE>
void vadd_params_function(buffer<TYPE, 1> * va, buffer<TYPE, 1> * vb,
                          buffer<TYPE, 1> * vc, unsigned int nelems)
{
  auto ptrA = va->template get_access<access::read>();
  auto ptrB = vb->template get_access<access::read>();
  auto ptrC = vc->template get_access<access::read_write>();

  parallel_for(nelems, kernel_lambda<class vadd_params_kernel>
                ([=] (id<1> i) {
                                 ptrC[i] = ptrA[i] + ptrB[i];
                               }
                ));
}
(...)

  command_group(myQueue,
                 std::bind(vadd_params_function<float>, &bufA, &bufB, &bufC, NELEMS));
  command_group(myQueue,
                 std::bind(vadd_params_function<float>, &bufE, &bufC, &bufD, NELEMS));
  command_group(myQueue,
                 std::bind(vadd_params_function<float>, &bufG, &bufD, &bufF, NELEMS));

Given an instantiation of a templated function, std::bind will create a functor with the given parameters. Because std::bind cannot deduce the template argument of a function template, it has to be named explicitly (vadd_params_function<float>). This saves you the nuisance of writing the class. This is one of the coolest characteristics of SYCL: you get all the nice C++ idioms and tricks integrated with your performance-optimal OpenCL code, all of which will work out of the box.

My colleague is neither a SYCL nor an OpenCL developer, but he is a highly skilled C++ developer, so he used his experience to simplify the code and integrate it into an existing codebase. This is standard C++, so any standard-conforming compiler will turn it into code that runs on the host, provided it has access to the SYCL headers. A SYCL-enabled compiler will also produce a device kernel and handle all the execution for you.
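The std::bind mechanics can be tried on the host without any SYCL headers. In the sketch below, host_vadd_function and run_now are hypothetical names: the first is a plain function template working on std::vector, the second merely calls whatever nullary callable it receives, standing in for the part of a command group that invokes the functor.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical host-only counterpart of the templated function above.
template <typename T>
void host_vadd_function(const std::vector<T>* va, const std::vector<T>* vb,
                        std::vector<T>* vc) {
  assert(va->size() == vb->size() && vb->size() == vc->size());
  for (std::size_t i = 0; i < vc->size(); ++i) {
    (*vc)[i] = (*va)[i] + (*vb)[i];
  }
}

// Hypothetical stand-in for the consumer of the functor:
// it accepts any callable taking no arguments and invokes it.
template <typename F>
void run_now(F f) {
  f();
}
```

A call such as run_now(std::bind(host_vadd_function<float>, &a, &b, &c)) then behaves like calling a hand-written functor class. The explicit <float> is required because std::bind receives the function as an ordinary argument and has nothing from which to deduce T.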

5 Conclusion

In this short blog post, we have implemented a vector addition kernel using SYCL, using templated command groups to reduce the work required to write kernels. As you can see, SYCL provides a great productivity boost for C++ programmers and integrates well with common C++ techniques (functors, templates, etc.). We are confident that this small example has shown several ways in which SYCL integrates with your programs, so why not write your own sample code? Even though there is no implementation available yet, we encourage you to post your ideas in the comments below or on the Khronos™ forums, as we are eager to see what SYCL will enable you to do! Alternatively, you can contact us directly.

6 Disclaimer

Please note that the above code is based on the provisional specification of SYCL, which can be found on the official Khronos webpage. The final specification is subject to change from the current draft.

Khronos and SYCL are trademarks of the Khronos Group Inc.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

Codeplay Software Ltd has published this article only as an opinion piece. Although every effort has been made to ensure the information contained in this post is accurate and reliable, Codeplay cannot and does not guarantee the accuracy, validity or completeness of this information. The information contained within this blog is provided "as is" without any representations or warranties, expressed or implied. Codeplay Software Ltd makes no representations or warranties in relation to the information in this post.
