Part Two - Porting AI codes from CUDA to SYCL and oneAPI, one llama at a time
13 August 2024
Prelude
In part one, we looked at converting from CUDA to SYCL with SYCLomatic, the whole-project migration tool. Now we are going to take this portable code and run it on both an NVIDIA GPU and an Intel GPU.
Building on the NVIDIA system
Now we are going to build the converted code directly using the CMake file that SYCLomatic has created for us, and then build the main binary for llama.cpp.
$ cd dpct_out && mkdir syclbuild && cd syclbuild
$ export MKLROOT=/home/ruyman/soft/mkl
$ CC=icx CXX=icpx cmake .. -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_CXX_FLAGS="-fsycl -fsycl-targets=nvptx64-nvidia-cuda -L${MKLROOT}/lib"
$ make main
Note that we are no longer building with the CUDA compiler but with the Intel SYCL compiler, so we set CC and CXX accordingly. We also manually pass the target triple (`-fsycl-targets=nvptx64-nvidia-cuda`), which tells the SYCL compiler to generate code for NVIDIA CUDA architectures (using PTX). We can now run our model using the following command:
$ ONEAPI_DEVICE_SELECTOR=cuda:gpu ./bin/main -m ../../models/ -ngl 12899 --no-mmap
The environment variable ONEAPI_DEVICE_SELECTOR allows users to override the default selection mechanism of the SYCL queue in favour of a user-defined setting. The default selection in this case would use OpenCL for the CPU, which won’t work because the binary was built explicitly for NVIDIA GPUs.
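Before launching the model, it can help to confirm that the runtime actually sees a device matching the selector. The sketch below filters the output of `sycl-ls` (the device-listing tool that ships with the oneAPI DPC++ compiler) for a given `backend:type` pair. The function name `has_sycl_device` is made up for this example, and it assumes the listing tags each entry with a label such as `[cuda:gpu]`; the label format has varied between releases, so treat the pattern as something to adjust.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a device listing read from stdin
# contains an entry tagged with the given backend:type pair.
# Assumes lines look like "[cuda:gpu][cuda:0] ...", which may differ
# between oneAPI releases.
has_sycl_device() {
    # -F treats the brackets literally, -q suppresses output.
    grep -F -q "[$1]"
}

# Typical use (needs the oneAPI environment to be set up):
#   sycl-ls | has_sycl_device cuda:gpu && echo "CUDA GPU visible"
```

If the check fails, the selector in the run command would have nothing to match, so it is worth resolving that before debugging the application itself.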
Out of the box, the converted code won’t be fast, as it doesn’t use the most optimized path for NVIDIA. It is still a good starting point: you can try your SYCL code in your existing environment before moving to a new machine with an Intel GPU, and you can re-use your CI infrastructure to test the SYCL path.
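One way to re-use existing CI for the SYCL path is a small smoke test that runs the binary on a chosen backend and fails the job on a non-zero exit code. The sketch below is only an illustration: `smoke_test` and its argument order are hypothetical, and the short prompt (`-p`) and token limit (`-n`) are choices made here to keep the run quick, not settings taken from this article.

```shell
#!/bin/sh
# Hypothetical CI smoke test: run a llama.cpp binary on one SYCL backend.
#   $1 = ONEAPI_DEVICE_SELECTOR value, e.g. cuda:gpu
#   $2 = path to the main binary
#   $3 = path to a GGUF model file
smoke_test() {
    # Generate only a few tokens so the job stays short; the exit
    # status of the binary becomes the exit status of the function.
    ONEAPI_DEVICE_SELECTOR="$1" "$2" -m "$3" -p "Hello" -n 8 --no-mmap -ngl 128
}

# Typical use in a CI step (paths are placeholders):
#   smoke_test cuda:gpu ./bin/main /path/to/model.gguf || exit 1
```

Because the backend is an argument, the same step can be pointed at a different device later without touching the rest of the pipeline.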
Running on an Intel GPU system
To prove we now have a truly portable application, let's build and run this code on an Intel GPU.
Log on to your system with the Intel Data Center Max GPU and either repeat the clone and CUDA build steps, so you can run intercept-build on the new system, or copy over the DPCT-generated project. Now, let's configure and build for Intel GPUs, using the same CMake flags we used to convert the project.
$ CC=icx CXX=icpx cmake .. -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_ARCHITECTURES=80
Yes, you still use the CUBLAS and CUDA CMake flags. The user-visible CMake flags don’t change, but the internal logic of the CMake file generated by SYCLomatic handles finding the paths for the Intel oneAPI Base Toolkit dependencies. Once it is configured, you can build:
$ make main
This builds llama.cpp for the default target, Intel GPUs (using SPIR-V binaries). To run llama on your Intel GPU, just use the Level Zero GPU backend, as shown below:
$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./bin/main -m ../../llama-2-7b-chat.Q4_K_M.gguf --no-mmap -ngl 128
Now this is the same application running on an Intel GPU with no user intervention! That means all the heavy lifting is done by the tool, and you can focus on optimization and refactoring of the generated code.
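To summarise the two builds, the only per-target differences in this walkthrough are the extra compiler flags at configure time and the device selector at run time. The sketch below captures that mapping in two small shell functions. The function names and the `nvidia`/`intel` labels are invented for illustration; the values are the ones used in the commands above, with the Intel case passing no extra flags because the generated CMake file already targets Intel GPUs by default. The MKL library path from the NVIDIA configure line is left out for brevity.

```shell
#!/bin/sh
# Hypothetical helpers mapping a vendor label to the settings used in this article.

# Extra flags to pass via CMAKE_CXX_FLAGS at configure time.
extra_cxx_flags_for() {
    case "$1" in
        nvidia) printf '%s\n' "-fsycl -fsycl-targets=nvptx64-nvidia-cuda" ;;
        intel)  printf '\n' ;;   # default target, no extra flags needed
        *)      echo "unknown vendor: $1" >&2; return 1 ;;
    esac
}

# Value for ONEAPI_DEVICE_SELECTOR at run time.
selector_for() {
    case "$1" in
        nvidia) printf '%s\n' "cuda:gpu" ;;
        intel)  printf '%s\n' "level_zero:gpu" ;;
        *)      echo "unknown vendor: $1" >&2; return 1 ;;
    esac
}
```

A wrapper script could then set `ONEAPI_DEVICE_SELECTOR="$(selector_for intel)"` before invoking the binary, keeping the vendor choice in one place.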
Conclusions
In this article we have shown a practical case of converting a CUDA C++ application for AI to SYCL, and a popular one at that! The conversion works straight out of the box, with no code changes needed. More typically, the SYCLomatic tool is there to assist you with porting applications from CUDA to SYCL: it gives you good warning messages and introduces code that you can later replace with code that better suits your application.
We have also shown that the same code works without modification on two completely different GPUs, NVIDIA and Intel, with the potential for others through the use of the open standard SYCL. Although llama.cpp already has a CUDA backend, having the SYCL backend run on both platforms means we can re-use CI infrastructure for testing and run the application on a wider set of platforms with fewer code changes.
The current SYCL backend supported in upstream llama.cpp started as a DPCT conversion, not too dissimilar to the one we just did in this article. Developers have been working on the SYCL backend to improve performance on a wide variety of platforms (NVIDIA, AMD, Intel GPUs on client and datacenter, and others including RISC-V), but we still re-use some of the original code that SYCLomatic generated for us. That original conversion saved several engineering months in getting something up and running, and allowed us to focus on the important parts of the project: performance and code quality.
If you want help porting a CUDA application to SYCL, or have questions about anything in this article, reach out to us at dev-rel@codeplay.com.