Tracking down a register allocator bug

20 August 2025

This article describes the investigation of a bug found when running the GROMACS tests, built with SYCL using the DPC++ compiler, while targeting AMD GPUs.

This was initially reported in the intel/llvm GitHub issue tracker as intel/llvm#6209, using a specific version of GROMACS with SYCL support.

This bug will take us from the GROMACS source code all the way down to the register allocation in the compiler backend.

Description of the issue

When running one of the GROMACS tests built with DPC++ targeting AMD, the test crashes with the following output:

Memory access fault by GPU node-4 (Agent handle: 0x215c1f0) on address 0x7ff5f0d6d000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

This error indicates that the kernel is accessing invalid memory; it can happen, for example, with a simple out-of-bounds access.

It is reported that this issue appears on MI50 (gfx906 ISA) and MI200 (gfx90a ISA), but that it doesn’t fail on MI100 (gfx908 ISA).

The specific test is being run as follows:

SYCL_DEVICE_FILTER=hip:gpu ./bin/mdrun-pull-test

Finding the kernel

The GROMACS application contains a lot of kernels so the first step is to figure out exactly which kernel is causing the issue.

One easy way to do that is to use the ROCm debug environment variable AMD_LOG_LEVEL=4. We can then find the last ShaderName debug output before the crash:

:3:rocvirtual.cpp           :2738: 6536108370963 us: 24086: [tid:0x7f5e78ab4740] ShaderName : _ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE

This gives us the name of the last kernel to run on the GPU, which is almost certainly the kernel that caused the access fault. The debug output shows the kernel name in its mangled form, but using the c++filt tool we can turn it into a more readable format:

c++filt _ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE
typeinfo name for NbnxmKernel<true, true, (Nbnxm::ElecType)1, (Nbnxm::VdwType)1>

Alternatively, we can use the ROCm debugger rocgdb: simply by running the program under rocgdb, after the fault the debugger will show which kernel it happened in and also allow us to disassemble that kernel. The debugger can unfortunately only take us so far, because debug information is not supported in DPC++ for AMD targets.
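
For example, the crash can be reproduced under the debugger with an invocation along these lines (the exact command line may vary):

SYCL_DEVICE_FILTER=hip:gpu rocgdb ./bin/mdrun-pull-test

followed by run at the (gdb) prompt.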

With the debugger we get the following:

Thread 3 "mdrun-pull-test" received signal SIGBUS, Bus error.↩
[Switching to thread 3, lane 0 (AMDGPU Lane 4:13:1:1/0 (0,0,0)[0,0,0])]↩
0x00007fffe5297078 in typeinfo name for NbnxmKernel<true, true, (Nbnxm::ElecType)1, (Nbnxm::VdwType)1> () from file:///path/to/gromacs/build/lib/libgromacs.so.7    #offset=75955456&size=17664↩

Using the gdb command disas, we can take our first proper analysis step and look at the assembly dump, which shows the instruction on which the memory access fault happens.

Dump of assembler code for function _ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE:
[...]
   0x00007fffe5297064 <+1124>:   global_load_dwordx2 v[14:15], v[5:6], off offset:16
   0x00007fffe529706c <+1132>:   s_waitcnt lgkmcnt(0)
   0x00007fffe5297070 <+1136>:   v_mov_b32_e32 v10, s41
   0x00007fffe5297074 <+1140>:   s_mov_b32 s64, s80
=> 0x00007fffe5297078 <+1144>:   s_waitcnt vmcnt(0)
[...]

The arrow indicates that the error is happening on the s_waitcnt vmcnt(0). This instruction waits for all the memory operations using vector registers to be completed. Therefore the likely culprit in this case is actually the global_load_dwordx2 instruction above.

At this point we know that the issue happens in the NbnxmKernel, and from the disassembly it seems likely to be caused by a load instruction. This is consistent with the memory access fault we’re seeing: the address used for this load instruction must be incorrect in some way.

Comparing the assembly

We know that the application works on gfx908 but not on gfx906, so a first step is to compare the assembly generated for both targets and see whether any significant differences can be identified.

To obtain the assembly we can use the environment variable SYCL_DUMP_IMAGES=1. This DPC++ environment variable dumps all the available kernel images when running an application. GROMACS is a fairly large application, so we end up with 309 .bin images.
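
In practice this just means setting the variable when running the test, for example:

SYCL_DUMP_IMAGES=1 SYCL_DEVICE_FILTER=hip:gpu ./bin/mdrun-pull-test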

We can then use grep to search for the mangled kernel name we got out of the debugger:

% grep _ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE *.bin
Binary file sycl_amdgcn142.bin matches

The .bin files are HIP fat binaries; we can extract the actual device object file from them with:

clang-offload-bundler --unbundle --type=o --targets=hipv4-amdgcn-amd-amdhsa--gfx906 --input=sycl_amdgcn142.bin --output=sycl_amdgcn142.bin.o

And then disassemble the device objects with:

llvm-objdump -d sycl_amdgcn142.bin.o &> sycl_amdgcn142.s

Doing this for both architectures gives us the assembly for each, which we can then compare with a diff tool.

Unfortunately, at this point the differences between the two are very significant, and it’s hard to track down where the offending instruction ends up in the gfx908 assembly. More work will be required to make use of this comparison, and we’ll come back to it later.

Tracking down the issue in the source

Knowing the kernel name we can find the source for it in GROMACS:

src/gromacs/nbnxm/sycl/nbnxm_sycl_kernel_body.h
  • https://gitlab.com/gromacs/gromacs/-/blob/aa-hwe-release-2022-dpcpp-hip/src/gromacs/nbnxm/sycl/nbnxm_sycl_kernel_body.h#L663

Unfortunately for us this kernel is quite large and complicated: roughly 400 lines of code with loops and many conditionals. Our debugging options are also very limited; as mentioned above, source-level debugging is not available, and neither is printf! Both are available in the regular AMD toolchain but were not yet enabled in DPC++ for AMD at the time of this investigation.

So, as a way to get source-level information on where this bug is happening, we used the horrible snippet of code below:

// Dereferencing a null volatile pointer faults at address 0, which shows up
// in the error message, so this crash is easy to tell apart from the real bug.
volatile int* crash = nullptr;
*crash++;

Adding this to the body of a kernel will cause it to crash with a memory access fault nearly identical to the one caused by our bug, but since the error message prints the address of the fault, it will always show 0 when this piece of code caused it. This tells us whether *crash++ was run before or after the offending line of code, which means that by moving *crash++ around in the kernel and looking at the resulting error we can gradually narrow down which line of code is causing the issue.

After a lot of trial and error the offending line in the kernel source was identified to be:

unsigned imask = a_plistCJ4[j4].imei[imeiIdx].imask;

This matches what we were expecting, since this line performs a load from a_plistCJ4 and then another from imei, so one of these two loads must be triggering the error.

In addition, we observed that the error never happens on the first iteration of the loop this line is in; this information will become important later on.

At this stage the obvious next step is to check whether j4 or imeiIdx is out of bounds for what it’s indexing. A cursory look through the code didn’t spot anything that would suggest these indices were wrong. But by that point, even though we still didn’t have access to a proper printf, we had figured out that it was possible to use the lower-level printf primitives provided by the HIP toolchain from within the kernel, so we were able to print these indices:

// Build the printf call by hand from the ROCm device-library (OCKL) primitives.
long msg = __ockl_fprintf_stdout_begin();
msg = __ockl_fprintf_append_string_n(msg, "debug: %lu, %lu\n", 16, 0);
msg = __ockl_fprintf_append_args(msg, 2, j4, imeiIdx, 0, 0, 0, 0, 1);

Now, surprisingly, after adding this bit of code before the offending line the program started working correctly! This usually indicates that whatever we did to the code caused the compiler to take a different path and avoid triggering the issue.

Digging a little further around that idea, we found that printing just imeiIdx fixed the issue, and that marking imeiIdx as volatile also fixed it. Finally, since we knew that the first iteration is always correct, we also tried disabling loop unrolling with -fno-unroll-loops, which fixes the issue as well.

We now have a lot more information about what’s going on and multiple ways of avoiding the issue, so we’re almost ready to dive back into the assembly to see what we can figure out. But before we do, one helpful thing we can do is to surround the offending line with barriers:

itemIdx.barrier(fence_space::local_space);
unsigned imask = a_plistCJ4[j4].imei[imeiIdx].imask;
itemIdx.barrier(fence_space::local_space);

Barriers are helpful because they’re lowered to recognizable s_barrier instructions, and they limit how the compiler can re-organize the assembly around them, which makes it a lot easier to identify where this line of source code ends up in the assembly. Luckily for us, adding them doesn’t fix the issue we’re seeing, so we can use them as a handy marker.

Diving back into the assembly

Now we can look again at comparing assembly between all the different cases we’ve identified, and see if we can spot what’s incorrect:

Here is the assembly for gfx906 with the extra barriers and no other modification; the offending instruction is the global_load_dword at the bottom. The generated assembly looks different from what we originally got in rocgdb, but it still triggers the error:

s_cbranch_scc1 2000                                        // 0000000020AC: BF8507D0 <_ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE+0x23f0>
s_ashr_i32 s43, s42, 31                                    // 0000000020B0: 902B9F2A
s_lshl_b64 s[48:49], s[42:43], 5                           // 0000000020B4: 8EB0852A
s_add_u32 s16, s40, s48                                    // 0000000020B8: 80103028
v_lshlrev_b64 v[7:8], 3, v[8:9]                            // 0000000020BC: D28F0007 00021083
s_addc_u32 s17, s41, s49                                   // 0000000020C4: 82113129
s_waitcnt lgkmcnt(0)                                       // 0000000020C8: BF8CC07F
v_mov_b32_e32 v6, s17                                      // 0000000020CC: 7E0C0211
v_add_co_u32_e64 v5, s[16:17], s16, v7                     // 0000000020D0: D1191005 00020E10
v_addc_co_u32_e64 v6, s[16:17], v6, v8, s[16:17]           // 0000000020D8: D11C1006 00421106
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)                    // 0000000020E0: BF8C0000
s_barrier                                                  // 0000000020E4: BF8A0000
global_load_dword v52, v[5:6], off offset:16               // 0000000020E8: DC508010 347F0005
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)                    // 0000000020F0: BF8C0000
s_barrier

Assembly for gfx906 with imeiIdx marked as volatile (test passing):

s_cbranch_scc1 2012                                        // 000000002114: BF8507DC <_ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE+0x2488>
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
s_barrier                                                  // 00000000211C: BF8A0000
buffer_load_dword v8, off, s[0:3], s33 offset:16           // 000000002120: E0500010 21000800
s_nop 0                                                    // 000000002128: BF800000
buffer_load_dword v9, off, s[0:3], s33 offset:20           // 00000000212C: E0500014 21000900
s_ashr_i32 s47, s46, 31                                    // 000000002134: 902F9F2E
s_lshl_b64 s[52:53], s[46:47], 5                           // 000000002138: 8EB4852E
s_add_u32 s20, s44, s52                                    // 00000000213C: 8014342C
s_addc_u32 s21, s45, s53                                   // 000000002140: 8215352D
s_add_u32 s24, s20, 16                                     // 000000002144: 80189014
s_addc_u32 s25, s21, 0                                     // 000000002148: 82198015
v_mov_b32_e32 v7, s25                                      // 00000000214C: 7E0E0219
s_mov_b32 s58, s81                                         // 000000002150: BEBA0051
s_waitcnt vmcnt(0)                                         // 000000002154: BF8C0F70
flat_load_dword v11, v[8:9] glc                            // 000000002158: DC510000 0B000008
s_waitcnt vmcnt(0) lgkmcnt(0)                              // 000000002160: BF8C0070
v_lshlrev_b64 v[5:6], 3, v[11:12]                          // 000000002164: D28F0005 00021683
v_add_co_u32_e64 v5, s[20:21], s24, v5                     // 00000000216C: D1191405 00020A18
v_addc_co_u32_e64 v6, s[20:21], v7, v6, s[20:21]           // 000000002174: D11C1406 00520D07
global_load_dword v51, v[5:6], off                         // 00000000217C: DC508000 337F0005
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)                    // 000000002184: BF8C0000
s_barrier                                                  // 000000002188: BF8A0000

With volatile the code changes a fair bit; in particular, a lot of the address calculation is moved inside the two barriers. But looking at the assembly closely we can see that it performs the same operations, and that the global_load_dword at the bottom of this snippet corresponds to the offending global_load_dword in the previous one.

The interesting part about the assembly with volatile is that it can help us figure out which register corresponds to imeiIdx. Here we can see the instruction:

flat_load_dword v11, v[8:9] glc                            // 000000002158: DC510000 0B000008

It sits just before the last stages of the address computation leading into the global_load_dword, and it is not present in the snippet without volatile. In addition, looking at the ISA document for gfx906, glc stands for Globally Coherent and marks that this instruction bypasses the L1 cache. So it is quite likely that imeiIdx, our volatile variable, is being loaded by this instruction, and that in this assembly snippet it corresponds to the vector register v11.

We can then see how v11 is used:

v_lshlrev_b64 v[5:6], 3, v[11:12]                          // 000000002164: D28F0005 00021683
v_add_co_u32_e64 v5, s[20:21], s24, v5                     // 00000000216C: D1191405 00020A18
v_addc_co_u32_e64 v6, s[20:21], v7, v6, s[20:21]           // 000000002174: D11C1406 00520D07
global_load_dword v51, v[5:6], off                         // 00000000217C: DC508000 337F0005

It is shifted left by 3, as part of the register pair v[11:12], into v[5:6]; then v5 and v6 are used in the addition instructions, whose outputs are used as the address for the global_load_dword.
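
As a rough C++ sketch of the address arithmetic these instructions perform (the 32-byte and 8-byte strides and the 16-byte offset are inferred from the shift amounts and the offset:16 in the assembly, not taken from the GROMACS headers):

#include <cstdint>

// Sketch only: reconstructs the address computed by the assembly above, with
// base pointing at a_plistCJ4 and j4/imeiIdx being the indices from the source.
uint64_t imaskAddress(uint64_t base, int32_t j4, uint32_t imeiIdx) {
    uint64_t entry = base + (int64_t(j4) << 5);      // a_plistCJ4[j4], 32-byte stride (s_lshl_b64 ..., 5)
    uint64_t imei = entry + 16;                      // the .imei member (offset:16 / s_add_u32 ..., 16)
    uint64_t addr = imei + (uint64_t(imeiIdx) << 3); // .imei[imeiIdx], 8-byte stride (v_lshlrev_b64 ..., 3)
    return addr;                                     // the global_load_dword reads the 32-bit imask here
}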

We can then go back to the failing assembly and look for this address computation pattern to find which register corresponds to imeiIdx, and we easily find it:

[...]
v_lshlrev_b64 v[7:8], 3, v[8:9]                            // 0000000020BC: D28F0007 00021083
[...]
v_add_co_u32_e64 v5, s[16:17], s16, v7                     // 0000000020D0: D1191005 00020E10
v_addc_co_u32_e64 v6, s[16:17], v6, v8, s[16:17]           // 0000000020D8: D11C1006 00421106
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)                    // 0000000020E0: BF8C0000
s_barrier                                                  // 0000000020E4: BF8A0000
global_load_dword v52, v[5:6], off offset:16               // 0000000020E8: DC508010 347F0005
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)                    // 0000000020F0: BF8C0000
s_barrier

We see the shift by 3, followed by the two add instructions and then the load; therefore in the broken assembly imeiIdx is stored in v8.

Now that we know v8 is supposed to hold imeiIdx, we can scroll up in the assembly without volatile and find out how it’s calculated, knowing that v0 and v1 contain thread indices:

// v3 = v1
v_mov_b32_e32 v3, v1                                       // 000000001C90: 7E060301
[...]
//  v15 = (v3 << 3) + v0
//  v15 = idy * 8 + idx  // index flipped
v_lshl_add_u32 v15, v3, 3, v0                              // 000000001D30: D1FD000F 04010703
[...]
// v8 = v15 >> 5
// v8 = v15 / 32
v_lshrrev_b32_e32 v8, 5, v15                               // 000000001F3C: 20101E85

And noting that in the source imeiIdx is calculated by:

const unsigned tidxi = itemIdx.get_local_id(2);
const unsigned tidxj = itemIdx.get_local_id(1);
const unsigned tidx  = tidxj * c_clSize + tidxi;
const unsigned imeiIdx = tidx / prunedClusterPairSize;

With c_clSize = 8 and prunedClusterPairSize = 32.
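
As a quick sanity check, with these constants the source-level index arithmetic reduces to exactly the shift-and-add pattern annotated in the assembly above (a small sketch with the thread indices passed in as parameters):

// tidx    = tidxj * c_clSize + tidxi        -> v_lshl_add_u32 v15, v3, 3, v0
// imeiIdx = tidx / prunedClusterPairSize    -> v_lshrrev_b32_e32 v8, 5, v15
unsigned computeImeiIdx(unsigned tidxi, unsigned tidxj) {
    const unsigned tidx = tidxj * 8 + tidxi;
    return tidx / 32;
}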

So we can see that the calculation of imeiIdx before the loop is correct; this is consistent with our earlier observation that the first iteration of the loop was always correct. But we can look further into the loop, after the load instruction, to see what happens to v8.

And very quickly we can see that after the faulting instruction v8 is almost immediately re-used and its value erased; it doesn’t seem to be spilled to memory and re-loaded later, or to have its value re-calculated:

v_ashrrev_i32_e32 v8, 31, v7                               // 00000000210C: 22100E9F

At this point the suspicion is clear: v8, which is supposed to contain imeiIdx, is overwritten when it shouldn’t be. However, it’s hard to say for sure as the loop is thousands of instructions long. But we can try to confirm this by comparing further with the other cases that we know are working.

Now comparing the same assembly for gfx908 that we also know to be working, we observe the following:

v_accvgpr_read_b32 v2, a8                                  // 000000002084: D3D84002 18000107
v_add_co_u32_e64 v5, s[16:17], s16, v1                     // 00000000208C: D1191005 00020210
v_addc_co_u32_e64 v6, s[16:17], v6, v2, s[16:17]           // 000000002094: D11C1006 00420506
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)                    // 00000000209C: BF8C0000
s_barrier                                                  // 0000000020A0: BF8A0000
global_load_dword v52, v[5:6], off offset:16               // 0000000020A4: DC508010 347F0005
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)                    // 0000000020AC: BF8C0000
s_barrier 

In the gfx908 snippet, the shift by 3 that we were observing previously is not near the load; instead there is a v_accvgpr_read_b32 instruction. Looking further up in the assembly we can see that imeiIdx is calculated outside of the loop as before, but its value is then stored in the register a8 and read back just before the global_load_dword. And since a8 is not written anywhere else in the program, the value of imeiIdx stays correct for the entire loop.

Looking at the ISA documents we can see that a8 is an ACC vector register. These registers are part of the matrix multiplication unit, but here they are used for spilling the value of imeiIdx. It is important to note that gfx906 doesn’t have this matrix multiplication unit or the ACC registers, and that gfx90a (MI200), which we know also fails, does have them; however, looking through the LLVM code base we could figure out that on gfx90a these can be used as general purpose vector registers, so they are not used for spilling like they are on gfx908.

Furthermore, we can also look at the assembly for gfx906 with loop unrolling disabled. In that scenario it looks just like the regular gfx906 assembly; however, the register holding imeiIdx is not re-used in the body of the loop.

So we have a fairly strong idea of why it is failing, namely that v8 is being incorrectly re-used, but we don’t know why, or how to fix it, yet.

Diving further into the compiler

Now we need to analyze what the compiler is doing and try to understand why v8 is re-used, but to do that we first need to narrow down the compilation of the kernel so we can look at what the compiler is doing without too much noise from unrelated kernels.

So first we build with verbose output to try and identify the compilation commands for the Nbnxm kernel:

make VERBOSE=1 mdrun-pull-test -j$(nproc)

From that we see that the Nbnxm kernel is actually built in four different configurations:

[ 28%] Building CXX object src/gromacs/CMakeFiles/libgromacs.dir/nbnxm/sycl/nbnxm_sycl_kernel_body_f_prune.cpp.o
[ 28%] Building CXX object src/gromacs/CMakeFiles/libgromacs.dir/nbnxm/sycl/nbnxm_sycl_kernel_body_f_noprune.cpp.o
[ 28%] Building CXX object src/gromacs/CMakeFiles/libgromacs.dir/nbnxm/sycl/nbnxm_sycl_kernel_body_fv_noprune.cpp.o
[ 28%] Building CXX object src/gromacs/CMakeFiles/libgromacs.dir/nbnxm/sycl/nbnxm_sycl_kernel_body_fv_prune.cpp.o

Going into that build directory we can use grep again to figure out which of these contains our offending kernel:

% grep _ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE *
Binary file nbnxm_sycl_kernel_body_fv_prune.cpp.o matches

This means it’s the fv_prune configuration that is causing issues. The .o file here doesn’t actually contain assembly, but LLVM IR bitcode; with DPC++ for AMD the actual assembly is generated during the linking stage by lld.

We can extract the bitcode with the clang-offload-bundler:

clang-offload-bundler --unbundle --type=o --targets=sycl-amdgcn-amd-amdhsa-gfx906 --input=nbnxm_sycl_kernel_body_fv_prune.cpp.o --output=fv_prune.bc

Now, in theory we could simply compile this bitcode file with llc, get similar assembly, and use that to investigate the compiler. However, trying that produces assembly that is fairly different, so to manually reproduce what the compiler is doing in the regular build we need to look further.

As stated previously, on AMD the assembly is generated during the link step by lld. Looking at the output of the verbose make command, we can identify the linking stage:

/path/to/llvm/build/bin/clang++ -fPIC -O3 -DNDEBUG -shared -Wl,-soname,libgromacs.so.7 -o ../../lib/libgromacs.so.7.0.0 @CMakeFiles/libgromacs.dir/objects1.rsp -Wl,-rpath,/path/to/gromacs/build2/lib: -ffast-math -fsycl -fsycl-device-code-split=per_kernel -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx906 -lrt ../external/build-fftw/fftwBuild-prefix/lib/libfftw3f.a -lpthread -ffast-math -fsycl -fsycl-device-code-split=per_kernel -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx906 -lm ../../lib/libmuparser.so.2.3.2 -lm

The input files, including the nbnxm_sycl_kernel_body_fv_prune.cpp.o file we’re interested in, are listed in the objects1.rsp file. We can then run this command with -###, which makes the clang driver list all the underlying commands that would be executed for this specific operation.

The output of this is quite large, as GROMACS is a large application, so we’ll just show the interesting lines. Looking through it, you can find a command similar to the one we used above to extract the bitcode:

"/path/to/llvm/build/bin/clang-offload-bundler" "-type=o" "-targets=host-x86_64-unknown-linux-gnu,sycl-amdgcn-amd-amdhsa-gfx906" "-input=CMakeFiles/libgromacs.dir/nbnxm/sycl/nbnxm_sycl_kernel_body_fv_prune.cpp.o" "-output=/tmp/nbnxm_sycl_kernel_body_fv_prune-b557cf.o" "-output=/tmp/nbnxm_sycl_kernel_body_fv_prune-ec4a19/nbnxm_sycl_kernel_body_fv_prune-gfx906.o" "-unbundle" "-allow-missing-bundles"

Then you can track the bitcode file being used in a very long llvm-link command:

"/path/to/llvm/build/bin/llvm-link" [...] "/tmp/nbnxm_sycl_kernel_body_fv_prune-ec4a19/nbnxm_sycl_kernel_body_fv_prune-gfx906.o" [...] "-o" "/tmp/alignedallocator-a0f82a/alignedallocator-gfx906.bc" "--suppress-warnings"

This command links the bitcode files from all the kernels in GROMACS together into one very large bitcode file, alignedallocator-gfx906.bc; this is because with SYCL_EXPORT some symbols may be defined in different translation units.

Then sycl-post-link is used to split this large bitcode file per kernel, rather than per translation unit as the code was originally organized:

"/path/to/llvm/build/bin/sycl-post-link" "-split=kernel" "-symbols" "-emit-exported-symbols" "-lower-esimd" "-O3" "-spec-const=default" "-o" "/tmp/alignedallocator-c32144/alignedallocator-gfx906.bc" "/tmp/alignedallocator-a0f82a/alignedallocator-gfx906.bc"

Since this splits one large bitcode file into a number of other bitcode files, the following commands working on these files are wrapped in llvm-foreach. And this is where we reach the command we’re actually interested in:

"/path/to/llvm/build/bin/llvm-foreach" "--out-ext=out" "--in-file-list=/tmp/alignedallocator-7746ed/alignedallocator-gfx906.o" "--in-replace=/tmp/alignedallocator-7746ed/alignedallocator-gfx906.o" "--out-file-list=/tmp/alignedallocator-d6ebb0/alignedallocator-gfx906.out" "--out-replace=/tmp/alignedallocator-d6ebb0/alignedallocator-gfx906.out" "--" "/path/to/llvm/build/bin/lld" "-flavor" "gnu" "--no-undefined" "-shared" "-plugin-opt=-amdgpu-internalize-symbols" "-plugin-opt=mcpu=gfx906" "-plugin-opt=O3" "-o" "/tmp/alignedallocator-d6ebb0/alignedallocator-gfx906.out" "/tmp/alignedallocator-7746ed/alignedallocator-gfx906.o"

This calls lld over all of the bitcode files, and this is the step that actually generates the assembly and object files. So we can extract just the lld command and use it on the bitcode file we extracted manually earlier:

"/path/to/llvm/build/bin/lld" "-flavor" "gnu" "--no-undefined" "-shared" "-plugin-opt=-amdgpu-internalize-symbols" "-plugin-opt=mcpu=gfx906" "-plugin-opt=O3" "-o" "fv_prune.out" "fv_prune.bc"

We can then disassemble fv_prune.out with llvm-objdump, and, as opposed to the llc output, this assembly is very similar to our problematic assembly and shows the pattern we are looking for, with v8 being overwritten.

Now that we have narrowed down building the Nbnxm kernel to a fairly simple command, we can move on to using one of the most powerful LLVM debugging tools: -print-after-all. With this flag the LLVM compiler will print the IR and Machine IR after every single pass or stage of the compiler. This is extremely helpful for debugging; however, it produces huge amounts of output, which is why we couldn’t use it on the commands building the entirety of GROMACS and had to narrow things down first.

"/path/to/llvm/build/bin/lld" "-flavor" "gnu" "--no-undefined" "-shared" "-plugin-opt=-amdgpu-internalize-symbols" "-plugin-opt=mcpu=gfx906" "-plugin-opt=O3" "-o" "fv_prune.out" "fv_prune.bc" -mllvm -print-after-all -mllvm -filter-print-funcs=_ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE &> print-after-all.txt

And -filter-print-funcs narrows down the output further to only the function we’re interested in.

Now we can look through the output of this command for our problematic instructions, so again the pattern of a load between two barriers. Starting from the bottom of the file, we end up finding the following Machine IR:

renamable $sgpr57 = S_ASHR_I32 renamable $sgpr56, 31, implicit-def dead $scc
renamable $sgpr58_sgpr59 = S_LSHL_B64 renamable $sgpr56_sgpr57, 5, implicit-def dead $scc
renamable $sgpr16 = S_ADD_U32 renamable $sgpr42, renamable $sgpr58, implicit-def $scc
renamable $vgpr7_vgpr8 = V_LSHLREV_B64_e64 3, $vgpr8_vgpr9, implicit $exec
renamable $sgpr17 = S_ADDC_U32 renamable $sgpr43, renamable $sgpr59, implicit-def dead $scc, implicit killed $scc
S_WAITCNT 49279
$vgpr6 = V_MOV_B32_e32 killed $sgpr17, implicit $exec, implicit $exec
renamable $vgpr5, renamable $sgpr16_sgpr17 = V_ADD_CO_U32_e64 killed $sgpr16, killed $vgpr7, 0, implicit $exec
renamable $vgpr6, dead renamable $sgpr16_sgpr17 = V_ADDC_U32_e64 killed $vgpr6, killed $vgpr8, killed $sgpr16_sgpr17, 0, implicit $exec
S_WAITCNT 0
S_BARRIER
renamable $vgpr50 = GLOBAL_LOAD_DWORD renamable $vgpr5_vgpr6, 16, 0, implicit $exec :: (load (s32) from %ir.1262, !tbaa !84, addrspace 1)
S_WAITCNT 0
S_BARRIER

This is clearly our problematic code: we can see the two barriers, the load, the shift by 3 and the additions, and we can see vgpr8 being used. The important instructions are the following:

[...]
renamable $vgpr7_vgpr8 = V_LSHLREV_B64_e64 3, $vgpr8_vgpr9, implicit $exec
[...]
renamable $vgpr6, dead renamable $sgpr16_sgpr17 = V_ADDC_U32_e64 killed $vgpr6, killed $vgpr8, killed $sgpr16_sgpr17, 0, implicit $exec

What is very interesting here is that in the V_ADDC instruction the $vgpr8 operand is marked killed, which tells the compiler that the register is not used after this instruction and can be re-used. This is why v8 is being re-used in the body of the loop. In addition, even the shift by 3 overwrites v8, so it definitely won’t be correct in the next iteration.

So we scroll back up our print-after-all.txt file until we find the first pass that introduced these seemingly incorrect instructions, and we track it down to:

# *** IR Dump After Virtual Register Rewriter (virtregrewriter) ***:

This is the first pass that introduces this killed $vgpr8, but it is also the first pass where the Machine IR uses physical registers: it runs right after the register allocator to do the actual replacement of the virtual registers with the newly allocated physical registers.

Scrolling up further to look at the code after the greedy register allocator, we can see that the code at that point looks like:

# *** IR Dump After Greedy Register Allocator (greedy) ***:
[...]
19776B|   S_WAITCNT 0
19792B|   S_BARRIER
19824B|   renamable $sgpr57 = S_ASHR_I32 renamable $sgpr56, 31, implicit-def dead $scc
19872B|   renamable $sgpr58_sgpr59 = S_LSHL_B64 renamable $sgpr56_sgpr57, 5, implicit-def dead $scc
19888B|   renamable $sgpr16 = S_ADD_U32 renamable $sgpr42, renamable $sgpr58, implicit-def $scc
19904B|   renamable $sgpr17 = S_ADDC_U32 renamable $sgpr43, renamable $sgpr59, implicit-def dead $scc, implicit $scc
19936B|   %3340:vgpr_32 = COPY killed renamable $sgpr17
19944B|   %3963:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
19952B|   %3962:vreg_64 = COPY %3963:vreg_64
19960B|   undef %893.sub0:vreg_64, renamable $sgpr16_sgpr17 = V_ADD_CO_U32_e64 killed $sgpr16, %3962.sub0:vreg_64, 0, implicit $exec
19976B|   %893.sub1:vreg_64, dead renamable $sgpr16_sgpr17 = V_ADDC_U32_e64 %3340:vgpr_32, %3962.sub1:vreg_64, killed $sgpr16_sgpr17, 0, implicit $exec
20064B|   %3772:vgpr_32 = GLOBAL_LOAD_DWORD %893:vreg_64, 16, 0, implicit $exec :: (load (s32) from %ir.1262, !tbaa !84, addrspace 1)
20080B|   S_WAITCNT 0
20096B|   S_BARRIER

We can see the shift by 3, taking virtual register %3078 into %3963, which is then copied to %3962 and used as an operand for the additions; but as you can see, at this point it is not yet marked killed.

And looking at the dump from the pass just before the register allocation, we see the following code:

4376B|    %889:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
[...]
19776B|   S_WAITCNT 0
19792B|   S_BARRIER
19824B|   %887.sub1:sreg_64 = S_ASHR_I32 %887.sub0:sreg_64, 31, implicit-def dead $scc
19872B|   %891:sreg_64 = S_LSHL_B64 %887:sreg_64, 5, implicit-def dead $scc
19888B|   %3326:sreg_32 = S_ADD_U32 %692.sub2:sgpr_128, %891.sub0:sreg_64, implicit-def $scc
19904B|   %3327:sreg_32 = S_ADDC_U32 %692.sub3:sgpr_128, %891.sub1:sreg_64, implicit-def dead $scc, implicit $scc
19936B|   %3340:vgpr_32 = COPY %3327:sreg_32
19944B|   undef %893.sub0:vreg_64, %3334:sreg_64_xexec = V_ADD_CO_U32_e64 %3326:sreg_32, %889.sub0:vreg_64, 0, implicit $exec
19952B|   %893.sub1:vreg_64, dead %3335:sreg_64_xexec = V_ADDC_U32_e64 %3340:vgpr_32, %889.sub1:vreg_64, %3334:sreg_64_xexec, 0, implicit $exec
20064B|   %3772:vgpr_32 = GLOBAL_LOAD_DWORD %893:vreg_64, 16, 0, implicit $exec :: (load (s32) from %ir.1262, !tbaa !84, addrspace 1)
20080B|   S_WAITCNT 0
20096B|   S_BARRIER

Looking at this code, it’s important to note that the shift by 3 is actually not inside the loop; it comes before the loop starts, and its result is simply used inside the loop by the additions. This code should work absolutely fine, but it seems that the greedy register allocator moves this shift inside the loop, and the virtual register rewriter then marks the v8 operand as killed.

At this point in the investigation, we need to start debugging the actual code of these passes to understand why they are doing these transformations, and try to figure out what is going wrong. One good way to start is to use -debug-only=regalloc, which will print debugging output for the register allocation. However, it doesn’t support the -filter-print-funcs= flag like -print-after-all does, which leads to huge output, so we need to reduce our bitcode file a little further.

To do that we can use opt to remove from the bitcode all the kernels we’re not interested in:

/path/to/llvm/build/bin/opt --internalize-public-api-list=_ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE --internalize --globaldce fv_prune.bc -o fv_prune_trimmed.bc

This command marks as internal all the functions that are not listed in the --internalize-public-api-list flag, and then runs the globaldce pass, which eliminates the internal symbols that are now unused. This trims down our bitcode file quite significantly.

We can also run the compilation up until right before the greedy register allocator, so that afterwards we only need to run the passes we’re trying to debug:

"/path/to/llvm/build/bin/lld" "-flavor" "gnu" "--no-undefined" "-shared" "-plugin-opt=-amdgpu-internalize-symbols" "-plugin-opt=mcpu=gfx906" "-plugin-opt=O3" "-o" "pre_greedy.mir" "fv_prune_trimmed.bc" -mllvm -stop-before=greedy

lld is not really set up to use this stop-before parameter, so it will crash, but not before giving us the output we want in pre_greedy.mir. This file now contains the Machine IR right before the greedy register allocator, the one with the shift by 3 outside of the loop.

We can then use llc to run only the specific passes we need and get debug output from the register allocation:

/path/to/llvm/build/bin/llc -start-before=greedy -stop-after=virtregrewriter -mcpu=gfx906 -debug-only=regalloc pre_greedy.mir -o post_regalloc.mir &> regalloc.txt

This command runs just the passes between greedy and virtregrewriter, and prints debug output for the register allocation thanks to -debug-only=regalloc. The short names of the passes can be seen in the -print-after-all dumps, in parentheses next to their longer names. Note that the regalloc name itself can be found in the source of the passes mentioned above, for example in:

/path/to/llvm/llvm/lib/CodeGen/RegAllocGreedy.cpp

Under:

#define DEBUG_TYPE "regalloc"

This in turn controls the LLVM_DEBUG directives in that file and places them under the regalloc keyword.
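
For reference, the debug statements guarded by this macro look roughly like the following; this is an illustrative sketch, not a verbatim excerpt from RegAllocGreedy.cpp:

#include "llvm/Support/Debug.h"

#define DEBUG_TYPE "regalloc"

// Only printed when LLVM is built with assertions and invoked with
// -debug-only=regalloc (or -debug).
static void reportSpill(unsigned VirtRegNum) {
  LLVM_DEBUG(llvm::dbgs() << "Inline spilling %" << VirtRegNum << '\n');
}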

Now that we have the debug output from the register allocation we can look through it, and to find what we’re interested in we can simply search for the virtual registers we spotted above, such as %3078, and we find the following debug output:

Removing 1 back-copies.
Removing 16328r|%3961:vreg_64 = COPY %890:vreg_64
  blit [3440r,3488B:0): [3440r;3488B)=0(%3961)(recalc)
  blit [16192B,38336B:0): [16192B;16312r)=0(%3961)(recalc) [16312r;16336r)=1(%3962)(recalc) [16336r;38336B)=0(%3961)(recalc)
  rewr %bb.6|   3440r:0|%3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
  rewr %bb.90|  16320B:1|   undef %894.sub0:vreg_64, %3334:sreg_64_xexec = V_ADD_CO_U32_e64 %3326:sreg_32, %3962.sub0:vreg_64, 0, implicit $exec
  rewr %bb.90|  16336B:1|   %894.sub1:vreg_64, dead %3335:sreg_64_xexec = V_ADDC_U32_e64 %3340:vgpr_32, %3962.sub1:vreg_64, %3334:sreg_64_xexec, 0, implicit $exec
  rewr %bb.90|  16312B:0|   %3962:vreg_64 = COPY %3961:vreg_64
queuing new interval: %3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r  L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r  L0000000000000003 [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r  weight:3.681593e-04
Enqueuing %3961
queuing new interval: %3962 [16312r,16336r:0) 0@16312r  L000000000000000C [16312r,16336r:0) 0@16312r  L0000000000000003 [16312r,16320r:0) 0@16312r  weight:5.705815e-02
Enqueuing %3962

selectOrSplit VReg_64:%3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r  L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r  L0000000000000003 [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r  weight:3.681593e-04 w=3.681593e-04
RS_Spill Cascade 0
should evict: %677 [96r,13936r:0)[16144B,38336B:0) 0@96r  weight:6.930721e-05 w= 6.930721e-05
should evict: %677 [96r,13936r:0)[16144B,38336B:0) 0@96r  weight:6.930721e-05 w= 6.930721e-05
Inline spilling VReg_64:%3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r  L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r  L0000000000000003 [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r  weight:3.681593e-04
From original %890
|   remat:  16308r| %3963:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
|           16312e| %3962:vreg_64 = COPY killed %3963:vreg_64

All defs dead: dead %3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
Remat created 1 dead defs.
Deleting dead def 3440r|dead %3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
unassigning %3078 from $vgpr8_vgpr9: VGPR8_LO16 VGPR8_HI16 VGPR9_LO16 VGPR9_HI16
Enqueuing %3078
Shrink: %3078 [2224r,2240r:0)[2240r,3488B:1)[16192B,38336B:1) 0@2224r 1@2240r  L0000000000000003 [2224r,3440r:0) 0@2224r  L000000000000000C [2240r,3488B:0)[16192B,38336B:0) 0@2240r  weight:1.520717e-01

Now, this is quite a lot, so first let’s go back to the beginning of the debug output, where the full kernel is shown. We can see that at that point the code looks like this:

3440B|    %890:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
[...]
16320B|   undef %894.sub0:vreg_64, %3334:sreg_64_xexec = V_ADD_CO_U32_e64 %3326:sreg_32, %890.sub0:vreg_64, 0, implicit $exec
16336B|   %894.sub1:vreg_64, dead %3335:sreg_64_xexec = V_ADDC_U32_e64 %3340:vgpr_32, %890.sub1:vreg_64, %3334:sreg_64_xexec, 0, implicit $exec
16352B|   %3772:vgpr_32 = GLOBAL_LOAD_DWORD %894:vreg_64, 16, 0, implicit $exec :: (load (s32) from %ir.imask97.i, !tbaa !75, addrspace 1)
16368B|   S_WAITCNT 0
16384B|   S_BARRIER

In this snippet we can see that the shift by 3 is at slot 3440 in the code and that the loop is around 163XX, and also that at the beginning of the register allocation the shift by 3 is indeed outside of the loop, as expected. Coming back to the debug output, we first have:

Removing 1 back-copies.
Removing 16328r|%3961:vreg_64 = COPY %890:vreg_64
  blit [3440r,3488B:0): [3440r;3488B)=0(%3961)(recalc)
  blit [16192B,38336B:0): [16192B;16312r)=0(%3961)(recalc) [16312r;16336r)=1(%3962)(recalc) [16336r;38336B)=0(%3961)(recalc)
  rewr %bb.6|   3440r:0|%3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
  rewr %bb.90|  16320B:1|   undef %894.sub0:vreg_64, %3334:sreg_64_xexec = V_ADD_CO_U32_e64 %3326:sreg_32, %3962.sub0:vreg_64, 0, implicit $exec
  rewr %bb.90|  16336B:1|   %894.sub1:vreg_64, dead %3335:sreg_64_xexec = V_ADDC_U32_e64 %3340:vgpr_32, %3962.sub1:vreg_64, %3334:sreg_64_xexec, 0, implicit $exec
  rewr %bb.90|  16312B:0|   %3962:vreg_64 = COPY %3961:vreg_64

Now, this is in the middle of the register allocation’s modifications, so the code has changed a little bit but still looks roughly the same. It seems that there is now a copy of %890 into %3961 inside the loop, and that this step is trying to remove it. We can see the modified instructions in the rewr part:

  rewr %bb.6|   3440r:0|%3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
  rewr %bb.90|  16320B:1|   undef %894.sub0:vreg_64, %3334:sreg_64_xexec = V_ADD_CO_U32_e64 %3326:sreg_32, %3962.sub0:vreg_64, 0, implicit $exec
  rewr %bb.90|  16336B:1|   %894.sub1:vreg_64, dead %3335:sreg_64_xexec = V_ADDC_U32_e64 %3340:vgpr_32, %3962.sub1:vreg_64, %3334:sreg_64_xexec, 0, implicit $exec
  rewr %bb.90|  16312B:0|   %3962:vreg_64 = COPY %3961:vreg_64

So at this point the addition instructions that used to take %890 now take %3962, the shift writes to %3961 instead of %890, and %3961 is copied into %3962 before the additions.

At this stage the shift is still outside of the loop, which seems fine, so we can move on to the next part of the output, namely the inline spilling part:

Inline spilling VReg_64:%3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r  L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r  L0000000000000003 [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r  weight:3.681593e-04
From original %890
|   remat:  16308r| %3963:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
|           16312e| %3962:vreg_64 = COPY killed %3963:vreg_64

This is important because it is the first time we see the shift by 3 instruction moved into the loop, as you can see from its new slot 16308 instead of 3440.

And you can then see the compiler delete the original shift instruction:

All defs dead: dead %3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
Remat created 1 dead defs.
Deleting dead def 3440r|dead %3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec

And then un-assign %3078 and re-queue it for register allocation, since the instruction using it was moved:

unassigning %3078 from $vgpr8_vgpr9: VGPR8_LO16 VGPR8_HI16 VGPR9_LO16 VGPR9_HI16
Enqueuing %3078
Shrink: %3078 [2224r,2240r:0)[2240r,3488B:1)[16192B,38336B:1) 0@2224r 1@2240r  L0000000000000003 [2224r,3440r:0) 0@2224r  L000000000000000C [2240r,3488B:0)[16192B,38336B:0) 0@2240r  weight:1.520717e-01

Now, in practice it took a bit more debugging, but here we can already start to see where the problem is coming from, on the last line:

%3078 [2224r,2240r:0)[2240r,3488B:1)[16192B,38336B:1) 0@2224r 1@2240r  L0000000000000003 [2224r,3440r:0) 0@2224r  L000000000000000C [2240r,3488B:0)[16192B,38336B:0) 0@2240r

What this represents is the live ranges of the virtual register %3078, that is, the parts of the code where the register is considered “alive”: essentially all the code between the instruction that defines the register and its last use. Let’s break it down a bit further:

[2224r,2240r:0)[2240r,3488B:1)[16192B,38336B:1) 0@2224r 1@2240r

This first part tracks the liveness of the full register. These live intervals contain two “values” (i.e. definitions), 0@2224r and 1@2240r; indeed, if we look at the full Machine IR:

2224B|    undef %3078.sub0:vreg_64 = V_LSHRREV_B32_e32 5, %19:vgpr_32, implicit $exec
2240B|    %3078.sub1:vreg_64 = V_MOV_B32_e32 0, implicit $exec

These define the two sub-registers of %3078. So what the rest of the notation shows us is that %3078 is alive between 2224 and 2240 with value 0: [2224r,2240r:0), then alive between 2240 and 3488 with value 1: [2240r,3488B:1), and then between 16192 and 38336 with value 1: [16192B,38336B:1).

As you can probably tell, this live range notation seems a little strange because it is for the full register, while values 0 and 1 each define only part of the register. So we need to look at the rest of the live range representation, which shows information about the sub-registers:

L0000000000000003 [2224r,3440r:0) 0@2224r
L000000000000000C [2240r,3488B:0)[16192B,38336B:0) 0@2240r

The notation here is similar to the notation above, with the initial value being a lane mask representing a sub-register: in binary 0x3 is 0011 and 0xC is 1100, and each bit of the mask covers a 16-bit lane of the 64-bit register. So looking at these two lines you can see that 0x3 represents %3078.sub0 (the low half) and 0xC represents %3078.sub1 (the high half).

So now we know that %3078.sub0 is defined at 2224 and is alive between 2224 and 3440, which means it’s alive up to the shift by 3 but no further. However, %3078.sub1 is defined at 2240 and is alive between 2240 and 3488, which means it’s alive past the shift; not only that, it is also alive between 16192 and 38336, and this second interval is important because it covers the loop with our problematic load.

Now that we understand the live intervals of %3078 the problem becomes fairly obvious when looking back at the inline spilling:

Inline spilling VReg_64:%3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r  L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r  L0000000000000003 [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r  weight:3.681593e-04
From original %890
|   remat:  16308r| %3963:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
|           16312e| %3962:vreg_64 = COPY killed %3963:vreg_64

This moved the shift by 3 inside the loop, to 16308, where we know that %3078.sub0 is not alive; and %3078 is the virtual register that ultimately gets assigned to v8. So it makes sense that v8 was being re-used: as far as the compiler is aware, it is not supposed to be live at that point.

We’re almost done: we know that the instruction is being incorrectly moved into the loop, and we just need to find where this goes wrong in the compiler’s code and how to fix it. To do that we can simply search the LLVM sources for the printed keywords, such as remat: or Inline spilling, and follow the code from there. It took a little more time before reaching the following code in LiveRangeEdit::allUsesAvailableAt:

// Check that subrange is live at UseIdx.
if (MO.getSubReg()) {
  const TargetRegisterInfo *TRI = MRI.getTargetRegisterInfo();
  LaneBitmask LM = TRI->getSubRegIndexLaneMask(MO.getSubReg());
  for (LiveInterval::SubRange &SR : li.subranges()) {
    if ((SR.LaneMask & LM).none())
      continue;
    if (!SR.liveAt(UseIdx))
      return false;

    // Early exit if all used lanes are checked. No need to continue.
    LM &= ~SR.LaneMask;
    if (LM.none())
      break;
  }
}

The function this snippet is in tries to determine whether all the operands of an instruction are available at a given point, which is in turn used to decide whether it is valid to move the instruction there. It checks the sub-register live ranges, but only if the operand of the instruction uses a sub-register. That is not the case for us: our shift instruction uses the full register, but one of its sub-registers is not alive at the destination, so this code ends up not checking the sub-ranges for our instruction and the shift is incorrectly moved into the loop.

The fix is then fairly simple: we just need to always check the sub-ranges when they’re available, and tweaking the code above as follows resolved the issue:

// Check that subrange is live at UseIdx.
if (li.hasSubRanges()) {
  const TargetRegisterInfo *TRI = MRI.getTargetRegisterInfo();
  unsigned SubReg = MO.getSubReg();
  LaneBitmask LM = SubReg ? TRI->getSubRegIndexLaneMask(SubReg)
                          : MRI.getMaxLaneMaskForVReg(MO.getReg());

With this patch, instead of checking the sub-ranges only when the instruction uses a sub-register, the code checks them whenever they exist; and if the instruction uses the full register, it uses a lane mask covering all the sub-registers so that each of them is checked.

With the fix in place, the inline spilling now fails to move the shift instruction into the loop:

Inline spilling VReg_64:%3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r  L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r  L0000000000000003 [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r  weight:3.681593e-04
From original %890
|   cannot remat for 16312e|%3962:vreg_64 = COPY %3961:vreg_64

And since it can’t “inline” the spill it simply resorts to a normal spill:

Merged spilled regs: SS#2 [3440r,3488B:0)[16192B,38336B:0) 0@x  weight:0.000000e+00
spillAroundUses %3961
|   rewrite: 3440r| %3963:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec

|   spill:   3448r| SI_SPILL_V64_SAVE killed %3963:vreg_64, %stack.2, $sgpr32, 0, implicit $exec :: (store (s64) into %stack.2, align 4, addrspace 5)
Checking redundant spills for 0@16312r in %3962 [16312r,16336r:0) 0@16312r  L000000000000000C [16312r,16336r:0) 0@16312r  L0000000000000003 [16312r,16320r:0) 0@16312r  weight:5.705815e-02
Merged to stack int: SS#2 [3440r,3488B:0)[16192B,38336B:0) 0@x  weight:0.000000e+00
|   folded:   16312r|   %3962:vreg_64 = SI_SPILL_V64_RESTORE %stack.2, $sgpr32, 0, implicit $exec :: (load (s64) from %stack.2, align 4, addrspace 5)

The code generated this way is now correct, meaning that the GROMACS test now passes on gfx906.

Conclusion

Finding the root cause of this issue was an interesting journey that took us from a molecular dynamics kernel all the way down to the AMD GPU ISAs.

By describing this journey, this blog post provides some insight into a number of techniques that can be used when debugging issues in the LLVM project or in DPC++, or when working in an environment with limited debugging capabilities. It also gives a glimpse of the DPC++ and LLVM components used during the compilation of GPU kernels, which will hopefully be helpful to anyone wanting to learn more about these technologies.

And in closing, the full patch fixing this specific issue was submitted to upstream LLVM and promptly merged:

  • https://reviews.llvm.org/D131884

Nicolas Miller

Senior Software Engineer