Tracking down a register allocator bug
20 August 2025
This article describes the investigation of a bug found when running GROMACS testing using SYCL with the DPC++ compiler while targeting AMD GPUs.
The issue was initially reported in the intel/llvm GitHub repository issue tracker as intel/llvm#6209, and was observed with a specific version of GROMACS with SYCL support.
This bug will take us from the GROMACS source code all the way down to the register allocation in the compiler backend.
Description of the issue
When running one of the GROMACS tests built with DPC++ targeting AMD, the test crashes with the following output:
Memory access fault by GPU node-4 (Agent handle: 0x215c1f0) on address 0x7ff5f0d6d000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
This error indicates that the kernel is accessing invalid memory; this can happen, for example, with a simple out-of-bounds access.
The issue is reported to appear on MI50 (gfx906 ISA) and MI200 (gfx90a ISA), but not on MI100 (gfx908 ISA).
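As an aside, a minimal reproducer of this class of error could look like the sketch below; this is generic SYCL 2020 code unrelated to GROMACS, and the deliberate out-of-bounds read typically aborts with a very similar message when run on an AMD GPU:
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q{sycl::gpu_selector_v};
  constexpr size_t n = 1024;
  int *data = sycl::malloc_device<int>(n, q);
  // Deliberate out-of-bounds read past the end of the allocation.
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    data[i] = data[i + n];
  }).wait();
  sycl::free(data, q);
  return 0;
}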
The specific test is being run as follows:
SYCL_DEVICE_FILTER=hip:gpu ./bin/mdrun-pull-test
Finding the kernel
The GROMACS application contains a lot of kernels, so the first step is to figure out exactly which kernel is causing the issue. One easy way to do that is to use the ROCm debug environment variable AMD_LOG_LEVEL=4 and look for the last ShaderName debug output printed before the crash.
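For example, re-running the test with the variable set and filtering the runtime log for ShaderName entries might look like this (the grep filter is our own addition, not part of the original report):
AMD_LOG_LEVEL=4 SYCL_DEVICE_FILTER=hip:gpu ./bin/mdrun-pull-test 2>&1 | grep ShaderName
The last such entry before the crash is: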
:3:rocvirtual.cpp :2738: 6536108370963 us: 24086: [tid:0x7f5e78ab4740] ShaderName : _ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE
This gives us the name of the last kernel to run on the GPU, which is almost certainly the kernel that caused the access fault. The debug output gives us the kernel name in its mangled form, but with the c++filt tool we can easily turn it into a more readable format:
c++filt _ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE
typeinfo name for NbnxmKernel<true, true, (Nbnxm::ElecType)1, (Nbnxm::VdwType)1>
Alternatively we can use the ROCm debugger rocgdb: simply run the program under rocgdb, and after the fault the debugger will show which kernel it happened in and also let us disassemble that kernel. The debugger can unfortunately only take us so far, because debug information is not supported in DPC++ for AMD targets.
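A minimal invocation might look like the following (a sketch; depending on the ROCm version some additional setup may be required), after which typing run at the (gdb) prompt reproduces the crash under the debugger:
SYCL_DEVICE_FILTER=hip:gpu rocgdb --args ./bin/mdrun-pull-test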
With the debugger we get the following:
Thread 3 "mdrun-pull-test" received signal SIGBUS, Bus error.
[Switching to thread 3, lane 0 (AMDGPU Lane 4:13:1:1/0 (0,0,0)[0,0,0])]
0x00007fffe5297078 in typeinfo name for NbnxmKernel<true, true, (Nbnxm::ElecType)1, (Nbnxm::VdwType)1> () from file:///path/to/gromacs/build/lib/libgromacs.so.7 #offset=75955456&size=17664
Using the gdb command disas we can then take our first proper analysis step and look at the assembly dump, which shows on which instruction the memory access fault is happening.
Dump of assembler code for function _ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE:
[...]
0x00007fffe5297064 <+1124>:| global_load_dwordx2 v[14:15], v[5:6], off offset:16
0x00007fffe529706c <+1132>:| s_waitcnt lgkmcnt(0)
0x00007fffe5297070 <+1136>:| v_mov_b32_e32 v10, s41
0x00007fffe5297074 <+1140>:| s_mov_b32 s64, s80
=> 0x00007fffe5297078 <+1144>:| s_waitcnt vmcnt(0)
[...]
The arrow indicates that the error is happening on the s_waitcnt vmcnt(0) instruction, which waits for all the memory operations using vector registers to complete. Therefore the likely culprit in this case is actually the global_load_dwordx2 instruction above.
At this point we know that the issue happens in the NbnxmKernel, and from the disassembly it seems likely to be caused by a load instruction. This is consistent with the memory access fault we're seeing: the address used for this load instruction must be incorrect in some way.
Comparing the assembly
We know that the application works on gfx908
but not on gfx906
so a first
step can be to compare the assembly generated by both and see if any significant
differences can be identified.
To obtain the assembly we can use the environment variable SYCL_DUMP_IMAGES=1, a DPC++ environment variable that dumps all the available kernel images when running an application. GROMACS is a fairly large application, so we end up with 309 .bin images.
We can then use grep
to search for the mangled kernel name we got out of
the debugger:
% grep _ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE *.bin
Binary file sycl_amdgcn142.bin matches
The .bin files are HIP fat binaries; we can extract the actual device object file from them with:
clang-offload-bundler --unbundle --type=o --targets=hipv4-amdgcn-amd-amdhsa--gfx906 --input=sycl_amdgcn142.bin --output=sycl_amdgcn142.bin.o
And then disassemble the device objects with:
llvm-objdump -d sycl_amdgcn142.bin.o &> sycl_amdgcn142.s
Doing that for both architectures gives us two assembly dumps that we can compare with a diff tool. Unfortunately at this point the differences between the two are very significant, and it's very difficult to track down exactly where the offending instruction sits in the gfx908 assembly, so more work will be required to make use of this; we'll come back to it later.
Tracking down the issue in the source
Knowing the kernel name we can find the source for it in GROMACS:
src/gromacs/nbnxm/sycl/nbnxm_sycl_kernel_body.h
- https://gitlab.com/gromacs/gromacs/-/blob/aa-hwe-release-2022-dpcpp-hip/src/gromacs/nbnxm/sycl/nbnxm_sycl_kernel_body.h#L663
Unfortunately for us this kernel is quite large and complicated: roughly 400 lines of code with loops and many conditionals. And our debugging options are very limited; as mentioned above, source-level debugging is not available, and neither is printf! Both of these are available in the regular AMD toolchain but were not yet enabled in DPC++ for AMD at the time of this investigation.
So, as a way to get source-level information on where this bug is happening, we used the horrible snippet of code below:
volatile int* crash = nullptr;
*crash++;
Adding this in the body of a kernel will cause it to crash with a memory access fault nearly identical to the one caused by our bug, but since the error message prints the faulting address, it will always show 0 if this piece of code caused it. This tells us whether *crash++ was run before or after the offending line of code, which means that by moving *crash++ around in the kernel and looking at the resulting error we can eventually track down which line of code is causing the issue.
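For illustration, a single bisection step might look like the sketch below, where doSomethingSafe and doSomethingSuspect are hypothetical stand-ins for chunks of the real kernel body:
doSomethingSafe();
// If the reported fault address is 0 our marker was reached first, so the real
// bug lies after this point; otherwise the bad access happened before it.
volatile int* crash = nullptr;
*crash++;
doSomethingSuspect();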
After a lot of trial and error the offending line in the kernel source was identified to be:
unsigned imask = a_plistCJ4[j4].imei[imeiIdx].imask;
This matches what we were expecting, since this line loads from a_plistCJ4 and then from imei, so one of these loads must be triggering the error.
In addition, we also observed that the error never happens on the first iteration of the loop this line is in; this information will become important later on.
At this stage the obvious next step is to check whether j4 or imeiIdx is out of bounds for what they're indexing. A cursory look through the code didn't spot anything obvious that would suggest these indices may be wrong. But by that point, even though we still didn't have access to a proper printf, we had figured out that it was possible to use the lower-level printf primitives provided by the HIP toolchain from within the kernel, so we were able to try printing these indices:
long msg = __ockl_fprintf_stdout_begin();                              // begin a device-side printf transaction
msg = __ockl_fprintf_append_string_n(msg, "debug: %lu, %lu\n", 16, 0); // append the format string
msg = __ockl_fprintf_append_args(msg, 2, j4, imeiIdx, 0, 0, 0, 0, 1);  // append the two values and complete the message
Now, surprisingly, after adding this bit of code before the offending line the program started working correctly! This is usually indicative that whatever we did to the code caused the compiler to take a different path and avoid triggering the issue.
Digging a little further around that idea, we found that printing just imeiIdx fixed the issue, and that marking imeiIdx as volatile also fixed the problem. Finally, since we know that the first iteration is always correct, we also figured out that disabling loop unrolling with -fno-unroll-loops fixes the issue.
We now have a lot more information about what's going on and multiple ways of avoiding the issue, so we're almost ready to dive back into the assembly to see what we can figure out. But before we do, one helpful thing we can do is to surround the offending line with barriers:
itemIdx.barrier(fence_space::local_space);
unsigned imask = a_plistCJ4[j4].imei[imeiIdx].imask;
itemIdx.barrier(fence_space::local_space);
Barriers are helpful because they're lowered to recognizable s_barrier instructions and they limit how the compiler can re-organize the assembly around them, which makes it a lot easier to identify where this line of source code ends up in the assembly. And luckily for us, adding them doesn't fix the issue we're seeing, so we can use them as a handy marker.
Diving back into the assembly
Now we can look again at comparing assembly between all the different cases we’ve identified, and see if we can spot what’s incorrect:
Assembly for gfx906 with the extra barriers and no other modification; the offending instruction is the global_load_dword at the bottom. The generated assembly looks different from what we originally got in rocgdb, but it still triggers the error:
s_cbranch_scc1 2000 // 0000000020AC: BF8507D0 <_ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE+0x23f0>
s_ashr_i32 s43, s42, 31 // 0000000020B0: 902B9F2A
s_lshl_b64 s[48:49], s[42:43], 5 // 0000000020B4: 8EB0852A
s_add_u32 s16, s40, s48 // 0000000020B8: 80103028
v_lshlrev_b64 v[7:8], 3, v[8:9] // 0000000020BC: D28F0007 00021083
s_addc_u32 s17, s41, s49 // 0000000020C4: 82113129
s_waitcnt lgkmcnt(0) // 0000000020C8: BF8CC07F
v_mov_b32_e32 v6, s17 // 0000000020CC: 7E0C0211
v_add_co_u32_e64 v5, s[16:17], s16, v7 // 0000000020D0: D1191005 00020E10
v_addc_co_u32_e64 v6, s[16:17], v6, v8, s[16:17] // 0000000020D8: D11C1006 00421106
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) // 0000000020E0: BF8C0000
s_barrier // 0000000020E4: BF8A0000
global_load_dword v52, v[5:6], off offset:16 // 0000000020E8: DC508010 347F0005
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) // 0000000020F0: BF8C0000
s_barrier
Assembly for gfx906
with imeiIdx
marked as volatile
(test passing):
s_cbranch_scc1 2012 // 000000002114: BF8507DC <_ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE+0x2488>
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
s_barrier // 00000000211C: BF8A0000
buffer_load_dword v8, off, s[0:3], s33 offset:16 // 000000002120: E0500010 21000800
s_nop 0 // 000000002128: BF800000
buffer_load_dword v9, off, s[0:3], s33 offset:20 // 00000000212C: E0500014 21000900
s_ashr_i32 s47, s46, 31 // 000000002134: 902F9F2E
s_lshl_b64 s[52:53], s[46:47], 5 // 000000002138: 8EB4852E
s_add_u32 s20, s44, s52 // 00000000213C: 8014342C
s_addc_u32 s21, s45, s53 // 000000002140: 8215352D
s_add_u32 s24, s20, 16 // 000000002144: 80189014
s_addc_u32 s25, s21, 0 // 000000002148: 82198015
v_mov_b32_e32 v7, s25 // 00000000214C: 7E0E0219
s_mov_b32 s58, s81 // 000000002150: BEBA0051
s_waitcnt vmcnt(0) // 000000002154: BF8C0F70
flat_load_dword v11, v[8:9] glc // 000000002158: DC510000 0B000008
s_waitcnt vmcnt(0) lgkmcnt(0) // 000000002160: BF8C0070
v_lshlrev_b64 v[5:6], 3, v[11:12] // 000000002164: D28F0005 00021683
v_add_co_u32_e64 v5, s[20:21], s24, v5 // 00000000216C: D1191405 00020A18
v_addc_co_u32_e64 v6, s[20:21], v7, v6, s[20:21] // 000000002174: D11C1406 00520D07
global_load_dword v51, v[5:6], off // 00000000217C: DC508000 337F0005
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) // 000000002184: BF8C0000
s_barrier // 000000002188: BF8A0000
With volatile the code changes a fair bit; in particular a lot of the address calculation is moved inside the two barriers. But looking at the assembly closely we can see that it is doing the same operations, and the global_load_dword at the bottom of this snippet corresponds to the offending global_load_dword in the previous one.
The interesting part about the assembly with volatile is that it can help us figure out what corresponds to imeiIdx in the assembly. Here we can see the instruction:
flat_load_dword v11, v[8:9] glc // 000000002158: DC510000 0B000008
This load appears just before the last stages of the address computation for the global_load_dword, and it is not present in the snippet without volatile. In addition, looking at the ISA document for gfx906 you can see that glc stands for Globally Coherent and marks the instruction as bypassing the L1 cache. So it is quite likely that imeiIdx, our volatile variable, is being loaded by this instruction, and that in this assembly snippet it corresponds to the vector register v11.
We can then see how v11
is used:
v_lshlrev_b64 v[5:6], 3, v[11:12] // 000000002164: D28F0005 00021683
v_add_co_u32_e64 v5, s[20:21], s24, v5 // 00000000216C: D1191405 00020A18
v_addc_co_u32_e64 v6, s[20:21], v7, v6, s[20:21] // 000000002174: D11C1406 00520D07
global_load_dword v51, v[5:6], off // 00000000217C: DC508000 337F0005
As part of the register pair v[11:12] it is shifted left by 3 into v[5:6]; then v5 and v6 are used in the addition instructions, whose outputs are then used as the address for the global_load_dword.
We can then go back to the assembly of the case with errors and look for this
address computation pattern, to find which register corresponds to imeiIdx
,
and we can easily find:
[...]
v_lshlrev_b64 v[7:8], 3, v[8:9] // 0000000020BC: D28F0007 00021083
[...]
v_add_co_u32_e64 v5, s[16:17], s16, v7 // 0000000020D0: D1191005 00020E10
v_addc_co_u32_e64 v6, s[16:17], v6, v8, s[16:17] // 0000000020D8: D11C1006 00421106
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) // 0000000020E0: BF8C0000
s_barrier // 0000000020E4: BF8A0000
global_load_dword v52, v[5:6], off offset:16 // 0000000020E8: DC508010 347F0005
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) // 0000000020F0: BF8C0000
s_barrier
We can see the same pattern: the shift by 3, followed by the two add instructions and then the load. Therefore in the broken assembly imeiIdx is stored in v8.
So now that we know v8
is supposed to be imeiIdx
we can scroll up in the
assembly without volatile
and find out how it’s calculated, knowing that v0
and v1
contain thread indices:
// v3 = v1
v_mov_b32_e32 v3, v1 // 000000001C90: 7E060301
[...]
// v15 = (v3 << 3) + v0
// v15 = idy * 8 + idx // index flipped
v_lshl_add_u32 v15, v3, 3, v0 // 000000001D30: D1FD000F 04010703
[...]
// v8 = v15 >> 5
// v8 = v15 / 32
v_lshrrev_b32_e32 v8, 5, v15 // 000000001F3C: 20101E85
And noting that in the source imeiIdx
is calculated by:
const unsigned tidxi = itemIdx.get_local_id(2);
const unsigned tidxj = itemIdx.get_local_id(1);
const unsigned tidx = tidxj * c_clSize + tidxi;
const unsigned imeiIdx = tidx / prunedClusterPairSize;
With c_clSize = 8 and prunedClusterPairSize = 32, so imeiIdx = (tidxj * 8 + tidxi) / 32, which matches the shift left by 3 and the shift right by 5 in the assembly above.
So we can see that the calculation of imeiIdx before the loop is correct; this is consistent with our observation that the first iteration of the loop was always correct. But we can look further into the loop, after the load instruction, to see what happens to v8.
And very quickly we can see that after the offending load v8 is almost immediately re-used and its value erased; it doesn't seem to be spilled to memory and re-loaded later, or to have its value re-calculated:
v_ashrrev_i32_e32 v8, 31, v7 // 00000000210C: 22100E9F
At this point the suspicion is clear: v8, which is supposed to contain imeiIdx, is overwritten when it shouldn't be. However it's hard to say for sure, as the loop is thousands of instructions long. But we can try to confirm this by comparing further with other cases that we know are working.
Comparing now with the same assembly for gfx908, which we also know to be working, we observe the following:
v_accvgpr_read_b32 v2, a8 // 000000002084: D3D84002 18000107
v_add_co_u32_e64 v5, s[16:17], s16, v1 // 00000000208C: D1191005 00020210
v_addc_co_u32_e64 v6, s[16:17], v6, v2, s[16:17] // 000000002094: D11C1006 00420506
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) // 00000000209C: BF8C0000
s_barrier // 0000000020A0: BF8A0000
global_load_dword v52, v[5:6], off offset:16 // 0000000020A4: DC508010 347F0005
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) // 0000000020AC: BF8C0000
s_barrier
In the gfx908 snippet the shift by 3 that we were observing previously is not near the load; instead there is a v_accvgpr_read_b32 instruction. Looking further up in the assembly we can see that imeiIdx is calculated outside of the loop as before, but its value is then stored in the register a8 and loaded back just before the global_load_dword. And a8 is not written anywhere else in the program, so the value of imeiIdx will be correct for the entire loop.
Looking at the ISA documents we can see that a8 is an ACC vector register. These registers are part of the matrix multiplication unit, but here one is being used to spill the value of imeiIdx. It is important to note that gfx906 doesn't have this matrix multiplication unit or the ACC registers, and that gfx90a (MI200), which we know also fails, does have them; however, looking through the LLVM code base we could figure out that on gfx90a these can be used as general purpose vector registers, so they are not used for spilling the way they are on gfx908.
Furthermore we can also look at the assembly for gfx906 with loop unrolling disabled. In that scenario it looks just like the regular gfx906 assembly; however, the register matching imeiIdx is not re-used in the body of the loop.
So we have a fairly strong idea of why it is failing, namely that v8 is being incorrectly re-used, but we don't know why yet, or how to fix it.
Diving further into the compiler
Now we need to analyze what the compiler is doing and try to understand why v8 is re-used, but to do that we first need to narrow down the compilation of the kernels a bit, so we can look at what the compiler is doing without too much noise from unrelated kernels.
So first we build with verbose output to try and identify the compilation
commands for the Nbnxm
kernel:
make VERBOSE=1 mdrun-pull-test -j$(nproc)
From that we see that the Nbnxm
kernel is actually built in four different
configurations:
[ 28%] Building CXX object src/gromacs/CMakeFiles/libgromacs.dir/nbnxm/sycl/nbnxm_sycl_kernel_body_f_prune.cpp.o
[ 28%] Building CXX object src/gromacs/CMakeFiles/libgromacs.dir/nbnxm/sycl/nbnxm_sycl_kernel_body_f_noprune.cpp.o
[ 28%] Building CXX object src/gromacs/CMakeFiles/libgromacs.dir/nbnxm/sycl/nbnxm_sycl_kernel_body_fv_noprune.cpp.o
[ 28%] Building CXX object src/gromacs/CMakeFiles/libgromacs.dir/nbnxm/sycl/nbnxm_sycl_kernel_body_fv_prune.cpp.o
Going into that build directory we can use grep
again to figure out which of
these contain our offending kernel:
% grep _ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE *
Binary file nbnxm_sycl_kernel_body_fv_prune.cpp.o matches
Which means it’s the fv_prune
configuration that is causing issues. This .o
file here doesn’t actually contain assembly, but LLVM IR bitcode, with DPC++ for
AMD the actual assembly is generated during the linking stage by lld
.
We can extract the bitcode with the clang-offload-bundler
:
clang-offload-bundler --unbundle --type=o --targets=sycl-amdgcn-amd-amdhsa-gfx906 --input=nbnxm_sycl_kernel_body_fv_prune.cpp.o --output=fv_prune.bc
Now in theory we could simply compile this bitcode file with llc, get similar assembly, and use that to investigate the compiler. However, trying that produces assembly that is fairly different, so to manually reproduce what the compiler is doing in the regular build we need to look further.
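For reference, such an attempt would look something like the following (an illustrative command, not the path the investigation ends up taking):
llc -mcpu=gfx906 -O3 fv_prune.bc -o fv_prune_llc.s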
As stated previously, on AMD the assembly is generated during the link step by lld. Looking at the output of the verbose make command, we can identify the linking stage:
/path/to/llvm/build/bin/clang++ -fPIC -O3 -DNDEBUG -shared -Wl,-soname,libgromacs.so.7 -o ../../lib/libgromacs.so.7.0.0 @CMakeFiles/libgromacs.dir/objects1.rsp -Wl,-rpath,/path/to/gromacs/build2/lib: -ffast-math -fsycl -fsycl-device-code-split=per_kernel -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx906 -lrt ../external/build-fftw/fftwBuild-prefix/lib/libfftw3f.a -lpthread -ffast-math -fsycl -fsycl-device-code-split=per_kernel -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx906 -lm ../../lib/libmuparser.so.2.3.2 -lm
The input files, including the nbnxm_sycl_kernel_body_fv_prune.cpp.o file we're interested in, are listed in the objects1.rsp file. We can then run this command with -###, which makes the clang driver list all the underlying commands that would be executed during this specific operation.
The output of this is quite large, as GROMACS is a large application, so we'll just show the interesting lines. Looking through it, you can find a command similar to the one we used above to extract the bitcode:
"/path/to/llvm/build/bin/clang-offload-bundler" "-type=o" "-targets=host-x86_64-unknown-linux-gnu,sycl-amdgcn-amd-amdhsa-gfx906" "-input=CMakeFiles/libgromacs.dir/nbnxm/sycl/nbnxm_sycl_kernel_body_fv_prune.cpp.o" "-output=/tmp/nbnxm_sycl_kernel_body_fv_prune-b557cf.o" "-output=/tmp/nbnxm_sycl_kernel_body_fv_prune-ec4a19/nbnxm_sycl_kernel_body_fv_prune-gfx906.o" "-unbundle" "-allow-missing-bundles"
Then you can track the bitcode file being used in a very long llvm-link
command:
"/path/to/llvm/build/bin/llvm-link" [...] "/tmp/nbnxm_sycl_kernel_body_fv_prune-ec4a19/nbnxm_sycl_kernel_body_fv_prune-gfx906.o" [...] "-o" "/tmp/alignedallocator-a0f82a/alignedallocator-gfx906.bc" "--suppress-warnings"
This command links the bitcode files from all the kernels in GROMACS together into one very large bitcode file, alignedallocator-gfx906.bc; this is needed because with SYCL_EXPORT some symbols may be defined in different translation units.
Then sycl-post-link is used to split this large bitcode file per kernel, rather than per translation unit as it originally was:
"/path/to/llvm/build/bin/sycl-post-link" "-split=kernel" "-symbols" "-emit-exported-symbols" "-lower-esimd" "-O3" "-spec-const=default" "-o" "/tmp/alignedallocator-c32144/alignedallocator-gfx906.bc" "/tmp/alignedallocator-a0f82a/alignedallocator-gfx906.bc"
Since this splits one large bitcode file into a number of other bitcode files, the following commands working on these files are wrapped in llvm-foreach. And this is where we reach the command we're actually interested in:
"/path/to/llvm/build/bin/llvm-foreach" "--out-ext=out" "--in-file-list=/tmp/alignedallocator-7746ed/alignedallocator-gfx906.o" "--in-replace=/tmp/alignedallocator-7746ed/alignedallocator-gfx906.o" "--out-file-list=/tmp/alignedallocator-d6ebb0/alignedallocator-gfx906.out" "--out-replace=/tmp/alignedallocator-d6ebb0/alignedallocator-gfx906.out" "--" "/path/to/llvm/build/bin/lld" "-flavor" "gnu" "--no-undefined" "-shared" "-plugin-opt=-amdgpu-internalize-symbols" "-plugin-opt=mcpu=gfx906" "-plugin-opt=O3" "-o" "/tmp/alignedallocator-d6ebb0/alignedallocator-gfx906.out" "/tmp/alignedallocator-7746ed/alignedallocator-gfx906.o"
This calls lld over all of the bitcode files, and this is the step that actually ends up generating the assembly and object files. So we can extract just the lld command and use it on the bitcode file we extracted manually earlier:
"/path/to/llvm/build/bin/lld" "-flavor" "gnu" "--no-undefined" "-shared" "-plugin-opt=-amdgpu-internalize-symbols" "-plugin-opt=mcpu=gfx906" "-plugin-opt=O3" "-o" "fv_prune.out" "fv_prune.bc"
We can then disassemble fv_prune.out with llvm-objdump, and unlike the llc output this assembly is actually very similar to our problematic assembly and shows the pattern we are looking for, with v8 being overwritten.
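The disassembly command mirrors the earlier one, for example:
llvm-objdump -d fv_prune.out &> fv_prune.s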
Now that we have narrowed down building the Nbnxm kernel to a fairly simple command, we can move on to using one of the most powerful LLVM debugging tools: -print-after-all. With this flag the LLVM compiler prints the IR and Machine IR after every single pass or stage of the compiler. This is extremely helpful for debugging, but it does produce huge amounts of output, which is why we couldn't use it on the commands building the entirety of GROMACS and had to narrow things down first.
"/path/to/llvm/build/bin/lld" "-flavor" "gnu" "--no-undefined" "-shared" "-plugin-opt=-amdgpu-internalize-symbols" "-plugin-opt=mcpu=gfx906" "-plugin-opt=O3" "-o" "fv_prune.out" "fv_prune.bc" -mllvm -print-after-all -mllvm -filter-print-funcs=_ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE &> print-after-all.txt
And -filter-print-funcs
narrows down the output further to only the function
we’re interested in.
Now we can look through the output of this command for our problematic instructions, again using the pattern of a load between two barriers; starting from the bottom of the file we end up finding the following Machine IR:
renamable $sgpr57 = S_ASHR_I32 renamable $sgpr56, 31, implicit-def dead $scc
renamable $sgpr58_sgpr59 = S_LSHL_B64 renamable $sgpr56_sgpr57, 5, implicit-def dead $scc
renamable $sgpr16 = S_ADD_U32 renamable $sgpr42, renamable $sgpr58, implicit-def $scc
renamable $vgpr7_vgpr8 = V_LSHLREV_B64_e64 3, $vgpr8_vgpr9, implicit $exec
renamable $sgpr17 = S_ADDC_U32 renamable $sgpr43, renamable $sgpr59, implicit-def dead $scc, implicit killed $scc
S_WAITCNT 49279
$vgpr6 = V_MOV_B32_e32 killed $sgpr17, implicit $exec, implicit $exec
renamable $vgpr5, renamable $sgpr16_sgpr17 = V_ADD_CO_U32_e64 killed $sgpr16, killed $vgpr7, 0, implicit $exec
renamable $vgpr6, dead renamable $sgpr16_sgpr17 = V_ADDC_U32_e64 killed $vgpr6, killed $vgpr8, killed $sgpr16_sgpr17, 0, implicit $exec
S_WAITCNT 0
S_BARRIER
renamable $vgpr50 = GLOBAL_LOAD_DWORD renamable $vgpr5_vgpr6, 16, 0, implicit $exec :: (load (s32) from %ir.1262, !tbaa !84, addrspace 1)
S_WAITCNT 0
S_BARRIER
This is clearly our problematic code: we can see the two barriers, the load, the shift by 3 and the additions, and we can see vgpr8 being used. The important instructions are the following:
[...]
renamable $vgpr7_vgpr8 = V_LSHLREV_B64_e64 3, $vgpr8_vgpr9, implicit $exec
[...]
renamable $vgpr6, dead renamable $sgpr16_sgpr17 = V_ADDC_U32_e64 killed $vgpr6, killed $vgpr8, killed $sgpr16_sgpr17, 0, implicit $exec
What is very interesting here is that in the V_ADDC instruction the $vgpr8 operand is marked killed, which tells the compiler that the register is not used after this instruction and can be re-used. This is why v8 is being re-used in the body of the loop. In addition, even the shift by 3 overwrites v8, so it definitely won't be correct in the next iteration.
So we scroll back up our print-after-all.txt file until we find the first pass that introduced these seemingly incorrect instructions, and we track it down to:
# *** IR Dump After Virtual Register Rewriter (virtregrewriter) ***:
This is the first pass that introduces the killed $vgpr8, but it is also the first pass where the Machine IR uses physical registers: it runs right after the register allocator to do the actual replacement of the virtual registers with the newly allocated physical registers.
Scrolling up further to the dump after the greedy register allocator, we can see that the code at that point looks like this:
# *** IR Dump After Greedy Register Allocator (greedy) ***:
[...]
19776B| S_WAITCNT 0
19792B| S_BARRIER
19824B| renamable $sgpr57 = S_ASHR_I32 renamable $sgpr56, 31, implicit-def dead $scc
19872B| renamable $sgpr58_sgpr59 = S_LSHL_B64 renamable $sgpr56_sgpr57, 5, implicit-def dead $scc
19888B| renamable $sgpr16 = S_ADD_U32 renamable $sgpr42, renamable $sgpr58, implicit-def $scc
19904B| renamable $sgpr17 = S_ADDC_U32 renamable $sgpr43, renamable $sgpr59, implicit-def dead $scc, implicit $scc
19936B| %3340:vgpr_32 = COPY killed renamable $sgpr17
19944B| %3963:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
19952B| %3962:vreg_64 = COPY %3963:vreg_64
19960B| undef %893.sub0:vreg_64, renamable $sgpr16_sgpr17 = V_ADD_CO_U32_e64 killed $sgpr16, %3962.sub0:vreg_64, 0, implicit $exec
19976B| %893.sub1:vreg_64, dead renamable $sgpr16_sgpr17 = V_ADDC_U32_e64 %3340:vgpr_32, %3962.sub1:vreg_64, killed $sgpr16_sgpr17, 0, implicit $exec
20064B| %3772:vgpr_32 = GLOBAL_LOAD_DWORD %893:vreg_64, 16, 0, implicit $exec :: (load (s32) from %ir.1262, !tbaa !84, addrspace 1)
20080B| S_WAITCNT 0
20096B| S_BARRIER
We can see the shift by 3, with virtual register %3078 going into %3963, which is then copied to %3962 and used as an operand for the additions; but as you can see, at this point the operand is not yet marked killed.
And looking at the pass before the register allocation we see the following code:
4376B| %889:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
[...]
19776B| S_WAITCNT 0
19792B| S_BARRIER
19824B| %887.sub1:sreg_64 = S_ASHR_I32 %887.sub0:sreg_64, 31, implicit-def dead $scc
19872B| %891:sreg_64 = S_LSHL_B64 %887:sreg_64, 5, implicit-def dead $scc
19888B| %3326:sreg_32 = S_ADD_U32 %692.sub2:sgpr_128, %891.sub0:sreg_64, implicit-def $scc
19904B| %3327:sreg_32 = S_ADDC_U32 %692.sub3:sgpr_128, %891.sub1:sreg_64, implicit-def dead $scc, implicit $scc
19936B| %3340:vgpr_32 = COPY %3327:sreg_32
19944B| undef %893.sub0:vreg_64, %3334:sreg_64_xexec = V_ADD_CO_U32_e64 %3326:sreg_32, %889.sub0:vreg_64, 0, implicit $exec
19952B| %893.sub1:vreg_64, dead %3335:sreg_64_xexec = V_ADDC_U32_e64 %3340:vgpr_32, %889.sub1:vreg_64, %3334:sreg_64_xexec, 0, implicit $exec
20064B| %3772:vgpr_32 = GLOBAL_LOAD_DWORD %893:vreg_64, 16, 0, implicit $exec :: (load (s32) from %ir.1262, !tbaa !84, addrspace 1)
20080B| S_WAITCNT 0
20096B| S_BARRIER
Looking at this code it's important to note that the shift by 3 is actually not inside the loop: it happens before the loop starts, and its result is simply used inside the loop by the additions. This code should work absolutely fine, but it seems that the greedy register allocator moves this shift inside the loop, and then the virtual register rewriter marks the v8 operand as killed.
At this point in our investigation we need to start debugging the actual code of these passes, to understand why they are doing these transformations and figure out what is going wrong. One good way to start is to use -debug-only=regalloc, which prints debugging output for the register allocation. However it doesn't support the -filter-print-funcs= flag like -print-after-all does, which leads to huge output, so we need to reduce our bitcode file a little further.
To do that we can use opt
to remove from the bitcode all the kernels we’re not
interested in:
/path/to/llvm/build/bin/opt --internalize-public-api-list=_ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE --internalize --globaldce fv_prune.bc -o fv_prune_trimmed.bc
This command marks as internal all the functions that are not listed in the --internalize-public-api-list flag, and then runs a globaldce pass which eliminates the unused internal symbols. This trims down our bitcode file quite significantly.
Then we can also run the compilation up until right before the greedy register allocator; this way we can later run just the passes we're trying to debug:
"/path/to/llvm/build/bin/lld" "-flavor" "gnu" "--no-undefined" "-shared" "-plugin-opt=-amdgpu-internalize-symbols" "-plugin-opt=mcpu=gfx906" "-plugin-opt=O3" "-o" "pre_greedy.mir" "fv_prune_trimmed.bc" -mllvm -stop-before=greedy
lld is not really set up to use this stop-before parameter, so it will crash, but not before giving us the output we want in pre_greedy.mir. This file now contains the Machine IR before the greedy register allocator, the one with the shift by 3 outside of the loop.
We can then use llc to run only the specific passes and get debug output from the register allocation:
/path/to/llvm/build/bin/llc -start-before=greedy -stop-after=virtregrewriter -mcpu=gfx906 -debug-only=regalloc pre_greedy.mir -o post_regalloc.mir &> regalloc.txt
This command runs just the passes between greedy and virtregrewriter and prints debug output for the register allocation (-debug-only=regalloc). The short names for the passes can be seen in the -print-after-all dumps, in parentheses next to the longer pass names. Note that this regalloc name can be found in the source of the passes mentioned above, for example in:
/path/to/llvm/llvm/lib/CodeGen/RegAllocGreedy.cpp
Under:
#define DEBUG_TYPE "regalloc"
This in turn controls the LLVM_DEBUG directives in that file and places them under the regalloc keyword.
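For illustration, the statements gated this way look roughly like the following (an illustrative sketch of the mechanism, not a verbatim quote from RegAllocGreedy.cpp; Reg and TRI stand in for whatever is in scope at the call site):
#define DEBUG_TYPE "regalloc"
[...]
// Printed only when the pass runs with -debug-only=regalloc (or -debug):
LLVM_DEBUG(dbgs() << "Enqueuing " << printReg(Reg, TRI) << '\n');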
Now that we have the debug output from the register allocation we can look through it, and to find what we're looking for we can simply search for the virtual registers we spotted above, such as %3078, and we find the following debug output:
Removing 1 back-copies.
Removing 16328r|%3961:vreg_64 = COPY %890:vreg_64
blit [3440r,3488B:0): [3440r;3488B)=0(%3961)(recalc)
blit [16192B,38336B:0): [16192B;16312r)=0(%3961)(recalc) [16312r;16336r)=1(%3962)(recalc) [16336r;38336B)=0(%3961)(recalc)
rewr %bb.6| 3440r:0|%3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
rewr %bb.90| 16320B:1| undef %894.sub0:vreg_64, %3334:sreg_64_xexec = V_ADD_CO_U32_e64 %3326:sreg_32, %3962.sub0:vreg_64, 0, implicit $exec
rewr %bb.90| 16336B:1| %894.sub1:vreg_64, dead %3335:sreg_64_xexec = V_ADDC_U32_e64 %3340:vgpr_32, %3962.sub1:vreg_64, %3334:sreg_64_xexec, 0, implicit $exec
rewr %bb.90| 16312B:0| %3962:vreg_64 = COPY %3961:vreg_64
queuing new interval: %3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r L0000000000000003 [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r weight:3.681593e-04
Enqueuing %3961
queuing new interval: %3962 [16312r,16336r:0) 0@16312r L000000000000000C [16312r,16336r:0) 0@16312r L0000000000000003 [16312r,16320r:0) 0@16312r weight:5.705815e-02
Enqueuing %3962
selectOrSplit VReg_64:%3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r L0000000000000003 [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r weight:3.681593e-04 w=3.681593e-04
RS_Spill Cascade 0
should evict: %677 [96r,13936r:0)[16144B,38336B:0) 0@96r weight:6.930721e-05 w=6.930721e-05
should evict: %677 [96r,13936r:0)[16144B,38336B:0) 0@96r weight:6.930721e-05 w=6.930721e-05
Inline spilling VReg_64:%3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r L0000000000000003 [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r weight:3.681593e-04
From original %890
| remat: 16308r| %3963:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
| 16312e| %3962:vreg_64 = COPY killed %3963:vreg_64
All defs dead: dead %3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
Remat created 1 dead defs.
Deleting dead def 3440r|dead %3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
unassigning %3078 from $vgpr8_vgpr9: VGPR8_LO16 VGPR8_HI16 VGPR9_LO16 VGPR9_HI16
Enqueuing %3078
Shrink: %3078 [2224r,2240r:0)[2240r,3488B:1)[16192B,38336B:1) 0@2224r 1@2240r L0000000000000003 [2224r,3440r:0) 0@2224r L000000000000000C [2240r,3488B:0)[16192B,38336B:0) 0@2240r weight:1.520717e-01
Now this is quite a lot, so first let's roll back to the beginning of the debug output where the full kernel is shown. We can see that at that point the code looks like this:
3440B| %890:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
[...]
16320B| undef %894.sub0:vreg_64, %3334:sreg_64_xexec = V_ADD_CO_U32_e64 %3326:sreg_32, %890.sub0:vreg_64, 0, implicit $exec
16336B| %894.sub1:vreg_64, dead %3335:sreg_64_xexec = V_ADDC_U32_e64 %3340:vgpr_32, %890.sub1:vreg_64, %3334:sreg_64_xexec, 0, implicit $exec
16352B| %3772:vgpr_32 = GLOBAL_LOAD_DWORD %894:vreg_64, 16, 0, implicit $exec :: (load (s32) from %ir.imask97.i, !tbaa !75, addrspace 1)
16368B| S_WAITCNT 0
16384B| S_BARRIER
In this snippet we can see that the shift by 3 is at 3440 in the code while the loop is around 163XX, so at the beginning of the register allocation the shift by 3 is indeed outside of the loop, as expected. Coming back to the debug output, we first have:
Removing 1 back-copies.
Removing 16328r|%3961:vreg_64 = COPY %890:vreg_64
blit [3440r,3488B:0): [3440r;3488B)=0(%3961)(recalc)
blit [16192B,38336B:0): [16192B;16312r)=0(%3961)(recalc) [16312r;16336r)=1(%3962)(recalc) [16336r;38336B)=0(%3961)(recalc)
rewr %bb.6| 3440r:0|%3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
rewr %bb.90| 16320B:1| undef %894.sub0:vreg_64, %3334:sreg_64_xexec = V_ADD_CO_U32_e64 %3326:sreg_32, %3962.sub0:vreg_64, 0, implicit $exec
rewr %bb.90| 16336B:1| %894.sub1:vreg_64, dead %3335:sreg_64_xexec = V_ADDC_U32_e64 %3340:vgpr_32, %3962.sub1:vreg_64, %3334:sreg_64_xexec, 0, implicit $exec
rewr %bb.90| 16312B:0| %3962:vreg_64 = COPY %3961:vreg_64
This is in the middle of the register allocation's modifications, so the code has changed a little bit but still looks roughly the same; it seems that there is now a copy of %890 into %3961 inside the loop and that the allocator is trying to remove it. We can see the modified instructions in the rewr part:
rewr %bb.6| 3440r:0|%3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
rewr %bb.90| 16320B:1| undef %894.sub0:vreg_64, %3334:sreg_64_xexec = V_ADD_CO_U32_e64 %3326:sreg_32, %3962.sub0:vreg_64, 0, implicit $exec
rewr %bb.90| 16336B:1| %894.sub1:vreg_64, dead %3335:sreg_64_xexec = V_ADDC_U32_e64 %3340:vgpr_32, %3962.sub1:vreg_64, %3334:sreg_64_xexec, 0, implicit $exec
rewr %bb.90| 16312B:0| %3962:vreg_64 = COPY %3961:vreg_64
So at this point the addition instructions that used to take %890 now take %3962, the shift writes to %3961 instead of %890, and %3961 is copied into %3962 before the additions.
At this stage the shift is still outside of the loop, which seems fine, so we can move on to the next part of the output, namely the inline spilling:
Inline spilling VReg_64:%3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r L0000000000000003 [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r weight:3.681593e-04
From original %890
| remat: 16308r| %3963:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
| 16312e| %3962:vreg_64 = COPY killed %3963:vreg_64
This is important because it is the first time we see the shift by 3 instruction moved into the loop, now at 16308 instead of 3440.
And you can then see the compiler delete the original shift instruction:
All defs dead: dead %3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
Remat created 1 dead defs.
Deleting dead def 3440r|dead %3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
And then restart the register allocation for it since the instruction was moved:
unassigning %3078 from $vgpr8_vgpr9: VGPR8_LO16 VGPR8_HI16 VGPR9_LO16 VGPR9_HI16
Enqueuing %3078
Shrink: %3078 [2224r,2240r:0)[2240r,3488B:1)[16192B,38336B:1) 0@2224r 1@2240r L0000000000000003 [2224r,3440r:0) 0@2224r L000000000000000C [2240r,3488B:0)[16192B,38336B:0) 0@2240r weight:1.520717e-01
Now in practice it took a bit more debugging but here we can start to see where the problem is coming from on the last line:
%3078 [2224r,2240r:0)[2240r,3488B:1)[16192B,38336B:1) 0@2224r 1@2240r L0000000000000003 [2224r,3440r:0) 0@2224r L000000000000000C [2240r,3488B:0)[16192B,38336B:0) 0@2240r
This represents the live ranges of the virtual register %3078, that is, the parts of the code where the register is considered “alive”: essentially all the code between the instruction that defines the register and its last use. Let's break it down a bit further:
[2224r,2240r:0)[2240r,3488B:1)[16192B,38336B:1) 0@2224r 1@2240r
This first part tracks the liveness of the full register. These live intervals contain two “values” (i.e. definitions), 0@2224r and 1@2240r; indeed if we look at the full Machine IR:
2224B| undef %3078.sub0:vreg_64 = V_LSHRREV_B32_e32 5, %19:vgpr_32, implicit $exec
2240B| %3078.sub1:vreg_64 = V_MOV_B32_e32 0, implicit $exec
This defines the two sub-registers of %3078. What the rest of the notation shows us is that %3078 is alive between 2224 and 2240 for value 0: [2224r,2240r:0), then alive between 2240 and 3488 for value 1: [2240r,3488B:1), and then between 16192 and 38336 for value 1: [16192B,38336B:1).
As you can probably tell, this notation of the live range seems a little strange, because it is for the full register while the values 0 and 1 only define parts of it. So we also need to look at the rest of the live range representation, which shows us information about the sub-registers:
L0000000000000003 [2224r,3440r:0) 0@2224r
L000000000000000C [2240r,3488B:0)[16192B,38336B:0) 0@2240r
The notation here is similar to the one above, with the initial value being a lane mask representing a sub-register: in binary 0x3 is 0011 and 0xC is 1100. So looking at these two lines you can see that the first mask, 0x3, represents %3078.sub0 and 0xC represents %3078.sub1.
So now we know that %3078.sub0 is defined at 2224 and is alive between 2224 and 3440, which means it's alive up to the shift by 3 but not beyond it. However %3078.sub1 is defined at 2240 and is alive between 2240 and 3488, which means it's alive past the shift; and not only that, it is also alive between 16192 and 38336. This second interval is important because it covers the loop with our problematic load.
Now that we understand the live intervals of %3078
the problem becomes fairly
obvious when looking back at the inline spilling:
Inline spilling VReg_64:%3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r L0000000000000003 [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r weight:3.681593e-04
From original %890
| remat: 16308r| %3963:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
| 16312e| %3962:vreg_64 = COPY killed %3963:vreg_64
This moved the shift by 3 inside the loop, to 16308, where we know that %3078.sub0 is not alive; and this is the sub-register that ultimately gets assigned to v8. So it makes sense that v8 was being re-used, because as far as the compiler is aware it is not supposed to be live at that point.
Thus we’re almost done, we know that the instruction is being incorrectly moved
into the loop, we just need to find how this error manifests in the code and how
to fix it. To do that we can simply search for the printed keywords such as
remat:
or Inline spilling
and then follow the code from there. It took a
little more time until reaching the following code in
LiveRangeEdit::allUsesAvailableAt
:
// Check that subrange is live at UseIdx.
if (MO.getSubReg()) {
const TargetRegisterInfo *TRI = MRI.getTargetRegisterInfo();
LaneBitmask LM = TRI->getSubRegIndexLaneMask(MO.getSubReg());
for (LiveInterval::SubRange &SR : li.subranges()) {
if ((SR.LaneMask & LM).none())
continue;
if (!SR.liveAt(UseIdx))
return false;
// Early exit if all used lanes are checked. No need to continue.
LM &= ~SR.LaneMask;
if (LM.none())
break;
}
}
The function this code is in tries to determine whether all the values used by an instruction are available at a given point, which in turn is used to decide whether it is valid to move the instruction there. It checks the sub-register live ranges, but only if the operand of the instruction uses a sub-register. That is not the case for us: our shift instruction uses the full register, but one of its sub-registers is not alive at the destination, so this code ends up not checking the sub-ranges for our instruction and incorrectly allows the shift to be moved into the loop.
The fix is then fairly simple: we just need to always check the sub-ranges when they exist, and tweaking the code above as follows resolved the issue:
// Check that subrange is live at UseIdx.
if (li.hasSubRanges()) {
const TargetRegisterInfo *TRI = MRI.getTargetRegisterInfo();
unsigned SubReg = MO.getSubReg();
LaneBitmask LM = SubReg ? TRI->getSubRegIndexLaneMask(SubReg)
: MRI.getMaxLaneMaskForVReg(MO.getReg());
With this patch, instead of checking the sub-ranges only when the instruction uses a sub-register, we check them whenever they exist; and if the instruction uses the full register, the mask covers all of its sub-registers so that every one of them is checked.
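Putting the tweak together with the unchanged loop from the original excerpt, the full check would read roughly as follows (a reconstruction from the two snippets above, not a verbatim copy of the upstream patch):
// Check that subrange is live at UseIdx.
if (li.hasSubRanges()) {
  const TargetRegisterInfo *TRI = MRI.getTargetRegisterInfo();
  unsigned SubReg = MO.getSubReg();
  // Use the operand's lane mask, or a mask covering the whole virtual
  // register when the full register is used.
  LaneBitmask LM = SubReg ? TRI->getSubRegIndexLaneMask(SubReg)
                          : MRI.getMaxLaneMaskForVReg(MO.getReg());
  for (LiveInterval::SubRange &SR : li.subranges()) {
    if ((SR.LaneMask & LM).none())
      continue;
    if (!SR.liveAt(UseIdx))
      return false;
    // Early exit if all used lanes are checked. No need to continue.
    LM &= ~SR.LaneMask;
    if (LM.none())
      break;
  }
}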
And with the fix, the inline spilling now fails to move the shift instruction into the loop:
Inline spilling VReg_64:%3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r L0000000000000003 [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r weight:3.681593e-04
From original %890
| cannot remat for 16312e|%3962:vreg_64 = COPY %3961:vreg_64
And since it can’t “inline” the spill it simply resorts to a normal spill:
Merged spilled regs: SS#2 [3440r,3488B:0)[16192B,38336B:0) 0@x weight:0.000000e+00
spillAroundUses %3961
| rewrite: 3440r| %3963:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
| spill: 3448r| SI_SPILL_V64_SAVE killed %3963:vreg_64, %stack.2, $sgpr32, 0, implicit $exec :: (store (s64) into %stack.2, align 4, addrspace 5)
Checking redundant spills for 0@16312r in %3962 [16312r,16336r:0) 0@16312r L000000000000000C [16312r,16336r:0) 0@16312r L0000000000000003 [16312r,16320r:0) 0@16312r weight:5.705815e-02
Merged to stack int: SS#2 [3440r,3488B:0)[16192B,38336B:0) 0@x weight:0.000000e+00
| folded: 16312r| %3962:vreg_64 = SI_SPILL_V64_RESTORE %stack.2, $sgpr32, 0, implicit $exec :: (load (s64) from %stack.2, align 4, addrspace 5)
And the code generated this way is now correct, meaning that the GROMACS test now passes on gfx906.
Conclusion
Finding the root cause of this issue was an interesting journey that took us from a molecular dynamics kernel all the way down to the AMD GPU ISAs.
Through describing this journey, this blog post provides some insight into a number of techniques that can be used when debugging issues in the LLVM project or in DPC++, or when working in an environment with limited debugging capabilities. It also gives a glimpse of the DPC++ and LLVM components used during the compilation of GPU kernels, which will hopefully be helpful to anyone wanting to learn more about these technologies.
And in closing, the full patch fixing this specific issue was submitted to upstream LLVM and promptly merged:
- https://reviews.llvm.org/D131884
