Toward Performance Portability of Highly Parametrizable TRSM Algorithm Using SYCL

17 June 2021

Presented at IWOCL and SYCLcon 2021

First published in 1979, BLAS is, to this day, the de facto standard for low-level linear algebra routines. BLAS provides essential linear algebra routines used in domains such as numerical and scientific computing, weather simulation, computational fluid dynamics and machine learning, and it has been adopted for a broad range of hardware, from HPC systems to embedded systems and AI-specialized accelerators.

BLAS routines were originally implemented for CPUs; with the emergence of GPGPU, they had to be rewritten to exploit the extensive computational power on offer. Machine learning is rapidly changing this landscape again by incentivizing the development of specialized hardware that can perform certain operations more efficiently. Given a wide range of hardware with different memory hierarchies, cache line sizes, memory access patterns, register counts and types of memory connection, achieving performance portability of BLAS routines across platforms while avoiding rewrites of existing code is a major challenge in the heterogeneous programming world.

Written in the SYCL programming language, SYCL-BLAS is an open-source BLAS library that provides performance portability across various SYCL-enabled platforms.

This paper presents the implementation of a parametric tile-based TRSM routine for SYCL-BLAS by employing a highly optimized GEMM routine provided in SYCL-BLAS.
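To make the idea of a tile-based TRSM built on GEMM more concrete, the following is a minimal host-side C++ sketch, not the SYCL-BLAS kernel itself. It assumes the simplest case (solve L·X = B for a lower-triangular, row-major L, overwriting B with X) and uses hypothetical names throughout: `gemm_sub` stands in for a tuned GEMM, `trsm_diag` handles one small diagonal tile, and the `Tile` template parameter plays the role of the per-device tuning knob. How the paper treats the diagonal tiles may well differ from the plain forward substitution used here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a tuned GEMM: C[m x n] -= A[m x k] * B[k x n].
// All matrices are row-major with explicit leading dimensions.
void gemm_sub(const double* A, std::size_t lda, const double* B, std::size_t ldb,
              double* C, std::size_t ldc,
              std::size_t m, std::size_t n, std::size_t k) {
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t p = 0; p < k; ++p) {
      const double a = A[i * lda + p];
      for (std::size_t j = 0; j < n; ++j) C[i * ldc + j] -= a * B[p * ldb + j];
    }
}

// Solve one small lower-triangular tile system L_kk * X_k = B_k in place
// by forward substitution (an assumption; other schemes are possible).
void trsm_diag(const double* L, std::size_t ldl, double* B, std::size_t ldb,
               std::size_t m, std::size_t n) {
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t p = 0; p < i; ++p)
      for (std::size_t j = 0; j < n; ++j)
        B[i * ldb + j] -= L[i * ldl + p] * B[p * ldb + j];
    for (std::size_t j = 0; j < n; ++j) B[i * ldb + j] /= L[i * ldl + i];
  }
}

// Tiled solve of L * X = B. L is n x n lower-triangular, B is n x nrhs and is
// overwritten with X. Tile is the tunable block size.
template <std::size_t Tile>
void trsm_lower_tiled(const std::vector<double>& L, std::vector<double>& B,
                      std::size_t n, std::size_t nrhs) {
  static_assert(Tile > 0, "tile size must be positive");
  for (std::size_t k = 0; k < n; k += Tile) {
    const std::size_t kb = std::min(Tile, n - k);  // last tile may be smaller
    // 1) Solve the diagonal tile for the current block row of X.
    trsm_diag(&L[k * n + k], n, &B[k * nrhs], nrhs, kb, nrhs);
    // 2) Update all remaining rows with one GEMM: B_rest -= L_panel * X_k.
    const std::size_t rest = n - (k + kb);
    if (rest > 0)
      gemm_sub(&L[(k + kb) * n + k], n, &B[k * nrhs], nrhs,
               &B[(k + kb) * nrhs], nrhs, rest, nrhs, kb);
  }
}
```

In this sketch, changing the tile size (for example `trsm_lower_tiled<2>` versus `trsm_lower_tiled<64>`) changes how much work goes through the GEMM-style update rather than the serial diagonal solve, without touching the algorithm itself.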

Our results show that, by tuning the tile size per device without reimplementing the kernel, we can achieve up to 2.6x speedup on Intel GPU, 7x on AMD GPU and 3.4x on ARM GPU compared with the highly optimized CLBlast and clBLAS libraries.

Codeplay Software Ltd has published this article only as an opinion piece. Although every effort has been made to ensure the information contained in this post is accurate and reliable, Codeplay cannot and does not guarantee the accuracy, validity or completeness of this information. The information contained within this blog is provided "as is" without any representations or warranties, expressed or implied. Codeplay Software Ltd makes no representations or warranties in relation to the information in this post.

Rod Burns

VP Ecosystem