Using the SYCL Kernel Fusion Extension - A Hands-On Introduction
21 June 2023
In a previous blog post, we introduced the SYCL extension for user-driven kernel fusion developed by Codeplay, and how it can improve SYCL application performance.
In this post, we will take a more hands-on approach to show how to set up DPC++ with kernel fusion support and compile and run a SYCL application. Using that knowledge will allow you to apply the kernel fusion extension to your own SYCL applications.
We will also get a glimpse into SYCL application performance analysis with Intel VTune, investigating the reason for performance improvement through kernel fusion.
So, let's get started!
Prerequisites
Throughout this tutorial, we will assume a Linux system with at least one SYCL device compatible with the kernel fusion extension. Concretely, this needs to be a device supported by a SYCL backend that consumes SPIR-V, for example the OpenCL backend with an Intel CPU or GPU (including integrated GPUs), or the Intel Level Zero backend with an Intel GPU (including integrated GPUs).
The installation of the necessary drivers and device backends is out of scope for this blog post; consult the corresponding manuals for OpenCL and Level Zero for installation instructions.
For the section on VTune performance analysis, we will assume an OpenCL CPU device, as this suits the selection of metrics that we will investigate. Intel VTune supports performance analysis across a wide range of platforms and devices, offering for example SYCL application performance analysis via a graphical user interface or via the command line.
The installation of VTune itself is out of scope for this blog post; see the installation manual for instructions.
As an example, the system used throughout this blog post uses an Intel Core i7-6700K CPU with the OpenCL backend, Ubuntu 20.04 LTS and Intel VTune 2023.1.0 preview.
Setup of DPC++ with kernel fusion support
If you were hoping for a long setup procedure now (who isn't? 😉), I'll have to disappoint you: current DPC++ daily releases already support kernel fusion by default.
So all we need to do is download a daily release, unpack it and set up some environment variables.
First, change to a directory of your choosing for which you have write permissions, and execute the following command to download a daily release:
wget https://github.com/intel/llvm/releases/download/sycl-nightly%2F20230408/dpcpp-compiler.tar.gz
Note that newer daily releases found here might also work.
In the next step, we'll unpack the DPC++ release and then remove the TAR file we downloaded:
tar xfz dpcpp-compiler.tar.gz
rm dpcpp-compiler.tar.gz
The last step of the setup is to set some environment variables:
export PATH=$(pwd)/dpcpp_compiler/bin:$PATH
export LD_LIBRARY_PATH=$(pwd)/dpcpp_compiler/lib:$LD_LIBRARY_PATH
Note that this step needs to be repeated each time you re-open the terminal, and it assumes your current working directory matches the directory you chose for setup.
To verify that the setup worked, use the following commands:
which clang++
which sycl-ls
In both cases, the output should point to an executable in the dpcpp_compiler/bin subdirectory of the setup directory you chose.
Also, the following command should show at least one device for either the OpenCL or Level Zero backend:
sycl-ls
An example output would be:
[opencl:cpu:0] Intel(R) OpenCL, Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz OpenCL 3.0
In case this command does not print a device, see the OpenCL or Level Zero installation manual for instructions.
Example application
For demonstration purposes, we will be using the example application from the original blog post.
You can download the complete source for the example application here.
After completing the setup above, we can compile the example application, still without kernel fusion:
clang++ -fsycl no-fusion.cpp -o no-fusion
After that, we can execute the application:
./no-fusion
The output should be similar to this; the exact runtimes are of course dependent on your hardware configuration*:
Elapsed time in microseconds: 298080
Elapsed time in microseconds: 235596
Elapsed time in microseconds: 235771
Elapsed time in microseconds: 235865
Elapsed time in microseconds: 235926
Elapsed time in microseconds: 235780
Elapsed time in microseconds: 235566
Elapsed time in microseconds: 235890
Elapsed time in microseconds: 235718
Elapsed time in microseconds: 235786
The application performs ten timed iterations of the main workload. Due to the necessary translation from SPIR-V to the device binary format and the caching of that translation, explained in the first blog post, the first iteration takes a little longer than the remaining ones*.
Fusion
In the next step, we can enable kernel fusion for the application, but still without dataflow internalization. To that end, a few modifications to the source code are necessary; you can download the resulting source file here.
We can compare the previous source file with the new one:
git diff --no-index no-fusion.cpp fusion.cpp
From the output, we can see that only minimal code changes are necessary to enable kernel fusion in the application:
diff --git a/no-fusion.cpp b/fusion.cpp
index 08595a0..4cd0702 100644
--- a/no-fusion.cpp
+++ b/fusion.cpp
@@ -28,7 +28,7 @@ int main() {
auto in4 = get_random_data();
std::vector<float> out(dataSize, -1.0);
- queue q{};
+ queue q{ext::codeplay::experimental::property::queue::enable_fusion{}};
{
buffer<float> bIn1{in1.data(), range{dataSize}};
@@ -40,10 +40,14 @@ int main() {
buffer<float> bTmp2{range{dataSize}};
buffer<float> bTmp3{range{dataSize}};
+ ext::codeplay::experimental::fusion_wrapper fw{q};
+
for (size_t i = 0; i < 10; ++i) {
auto start = std::chrono::high_resolution_clock::now();
+ fw.start_fusion();
+
// tmp1 = in1 * in2
q.submit([&](handler &cgh) {
auto accIn1 = bIn1.get_access(cgh);
@@ -80,6 +84,8 @@ int main() {
dataSize, [=](id<1> i) { accOut[i] = accTmp1[i] - accTmp3[i]; });
});
+ fw.complete_fusion();
+
q.wait();
auto stop = std::chrono::high_resolution_clock::now();
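To make the pattern easier to see outside the diff context, here is a self-contained sketch of the fusion API as used above. The single KernelCopy kernel, the buffer names and the data size are placeholders standing in for the example application's four kernels, not part of the downloadable source:
#include <sycl/sycl.hpp>
#include <vector>

using namespace sycl;

int main() {
  constexpr size_t dataSize = 512 * 1024;
  std::vector<float> in(dataSize, 1.0f);
  std::vector<float> out(dataSize, -1.0f);

  // Enable fusion on the queue via the extension's queue property.
  queue q{ext::codeplay::experimental::property::queue::enable_fusion{}};

  {
    buffer<float> bIn{in.data(), range{dataSize}};
    buffer<float> bOut{out.data(), range{dataSize}};

    // The fusion_wrapper marks the start and the end of the fusion region.
    ext::codeplay::experimental::fusion_wrapper fw{q};
    fw.start_fusion();

    // Kernels submitted between start_fusion() and complete_fusion() are
    // collected for fusion instead of being executed immediately.
    q.submit([&](handler &cgh) {
      auto accIn = bIn.get_access(cgh);
      auto accOut = bOut.get_access(cgh);
      cgh.parallel_for<class KernelCopy>(
          dataSize, [=](id<1> i) { accOut[i] = accIn[i]; });
    });

    // complete_fusion() JIT-compiles the collected kernels into a single
    // fused kernel and submits it to the device.
    fw.complete_fusion();

    q.wait();
  }
  return 0;
}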
The new source file can be compiled with this command:
clang++ -fsycl fusion.cpp -o fusion
As you can see, there are no additional compilation flags needed for kernel fusion.
There are also no additional flags needed for kernel fusion when executing the application:
./fusion
This should result in output similar to the following*:
Elapsed time in microseconds: 337596
Elapsed time in microseconds: 184143
Elapsed time in microseconds: 184681
Elapsed time in microseconds: 184205
Elapsed time in microseconds: 184065
Elapsed time in microseconds: 184574
Elapsed time in microseconds: 184253
Elapsed time in microseconds: 184524
Elapsed time in microseconds: 184426
Elapsed time in microseconds: 184191
As the numbers show, the first iteration is slightly slower than in the case without fusion, due to the additional overhead of JIT-compiling the fused kernel. On the other hand, the remaining iterations are faster than in the non-fused case*. In the later section on VTune analysis, we will explore the reason for that in more detail.
Dataflow Internalization
As discussed in the first blog post, internalization of dataflow in the fused kernel can be an important optimization technique.
We can apply dataflow internalization to the buffers bTmp1, bTmp2 and bTmp3 in our application, resulting in the modified source code that you can download from here.
Using the following command, we can make the necessary code changes for internalization visible:
git diff --no-index fusion.cpp internalization.cpp
The output shows that we mainly need to pass an additional property to the buffer definitions. As an additional optimization, we have disabled the insertion of extra work-group barriers by the fusion JIT compiler, by passing a property to complete_fusion():
diff --git a/fusion.cpp b/internalization.cpp
index 4cd0702..430c020 100644
--- a/fusion.cpp
+++ b/internalization.cpp
@@ -36,9 +36,15 @@ int main() {
buffer<float> bIn3{in3.data(), range{dataSize}};
buffer<float> bIn4{in4.data(), range{dataSize}};
buffer<float> bOut{out.data(), range{dataSize}};
- buffer<float> bTmp1{range{dataSize}};
- buffer<float> bTmp2{range{dataSize}};
- buffer<float> bTmp3{range{dataSize}};
+ buffer<float> bTmp1{
+ range{dataSize},
+ {sycl::ext::codeplay::experimental::property::promote_private{}}};
+ buffer<float> bTmp2{
+ range{dataSize},
+ {sycl::ext::codeplay::experimental::property::promote_private{}}};
+ buffer<float> bTmp3{
+ range{dataSize},
+ {sycl::ext::codeplay::experimental::property::promote_private{}}};
ext::codeplay::experimental::fusion_wrapper fw{q};
@@ -84,7 +90,7 @@ int main() {
dataSize, [=](id<1> i) { accOut[i] = accTmp1[i] - accTmp3[i]; });
});
- fw.complete_fusion();
+ fw.complete_fusion(ext::codeplay::experimental::property::no_barriers{});
q.wait();
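A note on the choice of property: promote_private fits this application because each work-item only ever accesses its own element of the temporary buffers. For intermediate results that are instead shared within a work-group, the extension specification also defines a promote_local property requesting promotion to work-group local memory. A minimal sketch, assuming that property name:
#include <sycl/sycl.hpp>

using namespace sycl;

// Sketch: request internalization of a temporary buffer into work-group
// local memory (promote_local) rather than private memory (promote_private).
// Assumes the promote_local property from the kernel fusion extension
// specification.
buffer<float> makeLocalTmp(size_t dataSize) {
  return buffer<float>{
      range{dataSize},
      {ext::codeplay::experimental::property::promote_local{}}};
}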
We can compile and run the application with the following commands:
clang++ -fsycl internalization.cpp -o internalization
./internalization
Again, no additional flags are needed for compilation or execution to enable fusion and dataflow internalization.
The output shows the performance improvement we can get from fusion with dataflow internalization*:
Elapsed time in microseconds: 266811
Elapsed time in microseconds: 89876
Elapsed time in microseconds: 89861
Elapsed time in microseconds: 89939
Elapsed time in microseconds: 89904
Elapsed time in microseconds: 89844
Elapsed time in microseconds: 90018
Elapsed time in microseconds: 89887
Elapsed time in microseconds: 89891
Elapsed time in microseconds: 89831
As in both versions above, the JIT and SPIR-V compilation overhead makes the first iteration slower than the following ones. However, the performance benefit from fusion with dataflow internalization is so notable that even the first iteration is faster than the first iteration in the case without fusion. In the remaining iterations, the version with fusion and internalization outperforms the non-fused version by more than 2x.
In the next section, we will use VTune to get a glimpse into why performance improves so significantly with fusion.
VTune Performance Analysis
The Intel VTune Profiler can provide great insight into the performance of applications (not only SYCL ones), allowing users to identify hotspots, analyze bottlenecks and guide optimizations.
We will be using it, or, more specifically, its command-line interface, to get some insight into performance differences between the three different versions of our application.
Fusion Performance
First, in an attempt to determine why fusion improves application performance even without dataflow internalization, we will investigate the cache behavior of both versions.
To analyze applications, we first need to collect some metrics with VTune. For the non-fused version, this can be achieved with this command, limiting execution to the OpenCL CPU device:
ONEAPI_DEVICE_SELECTOR=opencl:cpu vtune -collect memory-access -r report-no-fusion ./no-fusion
We can do the same for the fused version with no dataflow internalization:
ONEAPI_DEVICE_SELECTOR=opencl:cpu vtune -collect memory-access -r report-fusion ./fusion
Now that we have collected the metrics, we can generate reports for both versions, specifically focusing on the actual kernel functions. As the application performs only a few arithmetic operations for each item of data loaded, i.e., it is memory-bound, we will investigate memory metrics in particular**. For this first step of fusion, the most relevant metrics are the cache-miss stall counts, as fusion, even without internalization, can improve the cache hit rate*.
For the version without fusion, this works with the following command:
vtune -report hw-events -r report-no-fusion --column="stalls_l1d_miss,stalls_l2_miss,stalls_l3_miss"\
--filter function="main::{lambda(sycl::_V1::handler&)#1}::operator()(sycl::_V1::handler&) const::KernelOne"\
--filter function="main::{lambda(sycl::_V1::handler&)#2}::operator()(sycl::_V1::handler&) const::KernelTwo"\
--filter function="main::{lambda(sycl::_V1::handler&)#3}::operator()(sycl::_V1::handler&) const::KernelThree"\
--filter function="main::{lambda(sycl::_V1::handler&)#4}::operator()(sycl::_V1::handler&) const::KernelFour"
The output of this command should be similar to the following; the exact numbers can deviate based on your hardware configuration*:
Function Hardware Event Count:CYCLE_ACTIVITY.STALLS_L1D_MISS (M) Hardware Event Count:CYCLE_ACTIVITY.STALLS_L2_MISS (M) Hardware Event Count:CYCLE_ACTIVITY.STALLS_L3_MISS (M)
----------------------------------------------------------------------------------------- ------------------------------------------------------- ------------------------------------------------------ ---------------------------------------
main::{lambda(sycl::_V1::handler&)#1}::operator()(sycl::_V1::handler&) const::KernelOne 14,352 13,858 13,286
main::{lambda(sycl::_V1::handler&)#3}::operator()(sycl::_V1::handler&) const::KernelThree 14,196 13,598 13,208
main::{lambda(sycl::_V1::handler&)#2}::operator()(sycl::_V1::handler&) const::KernelTwo 14,716 13,832 13,494
main::{lambda(sycl::_V1::handler&)#4}::operator()(sycl::_V1::handler&) const::KernelFour 16,016 15,262 15,002
For the fused version, we only have to specify the name of the fused kernel function. Fused kernels use the fused_ prefix and are numbered consecutively. As the application only contains a single fusion, the name will be fused_0:
vtune -report hw-events -r report-fusion --column="stalls_l1d_miss,stalls_l2_miss,stalls_l3_miss" --filter function=fused_0
In this case, the output should be similar to the following*:
Function Hardware Event Count:CYCLE_ACTIVITY.STALLS_L1D_MISS (M) Hardware Event Count:CYCLE_ACTIVITY.STALLS_L2_MISS (M) Hardware Event Count:CYCLE_ACTIVITY.STALLS_L3_MISS (M)
-------- ------------------------------------------------------- ------------------------------------------------------ ------------------------------------------------------
fused_0 40,586 38,532 37,232
When comparing the sum of the L1 cache-miss stall cycles observed for the four kernels in the case without fusion (ca. 59 billion cycles) with the number for the fused version (ca. 40 billion cycles), we can see that the fused version suffers significantly fewer L1 cache misses. As the fused kernel executes all operations for each work-item in sequence within a single kernel, temporal locality is better, leading to a reduction in cache misses*.
Effect of dataflow internalization
When comparing the fused versions with and without internalization above, we saw that major performance improvements can result from dataflow internalization.
To analyze this improvement, we will investigate another VTune metric.
For the fused case, we can reuse the existing report. For the version with internalization, we can generate the report using the following command:
ONEAPI_DEVICE_SELECTOR=opencl:cpu vtune -collect memory-access -r report-internalization ./internalization
After that, we can compare the number of memory loads and stores performed by the two different versions of the fused kernel.
For the version without dataflow internalization:
vtune -report hw-events -r report-fusion --column=all_load,all_store --filter function=fused_0
This yields an output similar to the following*:
Function Hardware Event Count:MEM_INST_RETIRED.ALL_LOADS_PS (M) Hardware Event Count:MEM_INST_RETIRED.ALL_STORES_PS (M)
-------- ------------------------------------------------------ -------------------------------------------------------
fused_0 995 501
We can do the same for the version with dataflow internalization:
vtune -report hw-events -r report-internalization --column=all_load,all_store --filter function=fused_0
The resulting output will be similar to the following*:
Function Hardware Event Count:MEM_INST_RETIRED.ALL_LOADS_PS (M) Hardware Event Count:MEM_INST_RETIRED.ALL_STORES_PS (M)
-------- ------------------------------------------------------ -------------------------------------------------------
fused_0 499 124
Comparing the output, we can see that dataflow internalization reduces the number of loads by almost 500 million and the number of stores by close to 380 million, resulting in performance improvements*.
This analysis of course only scratches the surface of performance analysis and of what Intel VTune supports. If you want to learn more about these topics, have a look at the Intel VTune cookbook (in particular the SYCL sections) or the Intel VTune getting started guide.
Outlook
The SYCL extension for kernel fusion is currently an experimental feature, allowing users to experiment with the feature and us to gather early user feedback on the API and functionality.
At the same time, there is a larger effort to define a graph API for SYCL. SYCL graphs allow users to define a directed acyclic graph of dependent SYCL commands ahead of execution, which is meant to open new optimization opportunities for SYCL applications, for example for workloads that repeatedly submit a similar sequence of kernels.
Fusing the kernels in a graph into a single kernel is one of those potential optimizations. Codeplay have therefore started work on a SYCL extension for graph fusion, building on top of the SYCL graph API. The sequence of kernels to fuse is defined through the graph API, which offers two different mechanisms: a recording mode very similar to the existing kernel fusion extension, and an API for explicitly constructing a graph from kernels and their dependencies. Using fusion on top of graphs provides a number of advantages, for example more fine-grained control over when the JIT compilation for fusion actually takes place.
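To give an impression of the recording mechanism, the following is a hypothetical sketch based on the sycl_ext_oneapi_graph proposal as it stood at the time of writing; as both extensions are experimental, the exact names and the fusion-specific entry points may differ:
#include <sycl/sycl.hpp>

namespace sycl_ext = sycl::ext::oneapi::experimental;

int main() {
  sycl::queue q{};

  // Create a modifiable graph tied to the queue's context and device.
  sycl_ext::command_graph graph{q.get_context(), q.get_device()};

  // Recording mode: kernels submitted to q are captured into the graph
  // instead of being executed immediately, much like start_fusion().
  graph.begin_recording(q);
  q.submit([&](sycl::handler &cgh) {
    cgh.single_task<class KernelA>([=]() { /* ... */ });
  });
  graph.end_recording(q);

  // Finalization produces an executable graph; this is the point where a
  // JIT compiler could fuse the recorded kernels into a single kernel.
  auto execGraph = graph.finalize();

  // The whole graph is then submitted as one operation.
  q.ext_oneapi_graph(execGraph);
  q.wait();
  return 0;
}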
If you want to follow the development of the new extension, you can do so here.
Disclaimer*
Experiments performed on 10/04/2023 by Codeplay, with Intel Core i7-6700K, Ubuntu 20.04.5 LTS, Linux kernel 5.15, Intel VTune Profiler 2023.1.0 pre-release (build 625246), and OpenCL driver version 2022.14.10.0.20_160000.xmain-hotfix.
DPC++ nightly version 2023-04-08 (git commit 3d6917f) was used for measurements.
Performance varies by use, configuration and other factors. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation. Intel, the Intel logo, Codeplay and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
Metrics**
Availability and naming of hardware events are CPU-specific. To obtain a full list of available metrics for a report, a command similar to the following can be used:
vtune -report hw-events --column="?" -r ./report-internalization