Bug 2457514
| Summary: | 6.19.11 on a Ryzen AI Max+ 395 (Strix Halo, gfx1151) running Fedora 43, the system locks up hard within a few hours of any GPU workload. | ||
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Adam Clater <aclater> |
| Component: | kernel | Assignee: | Justin M. Forbes <jforbes> |
| Status: | NEW --- | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 43 | CC: | acaringi, adscvr, airlied, hans, hpa, jforbes, kernel-maint, linville, masami256, mchehab, nickolasjcarr, ptalbert, sid314, steved, suraj.ghimire7 |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | --- | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | Type: | Bug | |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Attachments: | |||
|
Description
Adam Clater
2026-04-10 22:14:50 UTC
Update 2026-04-11: same hang reproduced on 6.19.10 — TLB-fences theory
falsified
I downgraded to 6.19.10-200.fc43 as the workaround and the system was stable
for ~22 hours. It then hung with a byte-identical signature to the three
6.19.11 crashes in my original report.
Crash details
- Kernel: 6.19.10-200.fc43.x86_64 (#1 SMP PREEMPT_DYNAMIC Wed Mar 25 16:09:19
UTC 2026)
- Boot start: 2026-04-10 18:34:56 EDT
- First SMU error: 2026-04-11 16:47:44 EDT (~22 hours uptime)
- Hard reset: 2026-04-11 16:49:12 EDT
- Workload at time of hang: llama.cpp running Qwen3-32B Q4_K_M, this time on
the ROCm backend (quay.io/ramalama/rocm:0.18.0,
HSA_OVERRIDE_GFX_VERSION=11.5.1) rather than the Vulkan backend used in the
original three crashes. So the backend is not the discriminator — both
Vulkan/RADV and ROCm/HIP workloads can wedge the SMU.
First fault and cascade
16:47:44 amdgpu 0000:c2:00.0: amdgpu: SMU: I'm not done with your previous
command:
SMN_C2PMSG_66:0x00000032
SMN_C2PMSG_82:0x00000000
16:47:44 amdgpu 0000:c2:00.0: amdgpu: Failed to power gate VPE!
16:47:44 [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe
failed, ret = -62
16:47:49 amdgpu: Failed to retrieve enabled ppfeatures
16:47:51 amdgpu: Dumping IP State
16:47:58 amdgpu: Failed to power gate VCN instance 0
16:48:03 amdgpu: Failed to power gate VCN instance 1
16:48:07 amdgpu: Failed to disable gfxoff
16:48:09 amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
16:48:09 amdgpu: failed to reg_write_reg_wait
Same SMN_C2PMSG_66:0x00000032 (GetEnabledSmuFeatures) wedge as the three
6.19.11 boots.
PCIe complex collapse followed the same pattern: xhci_hcd 0000:c4:00.3: HC
died, snd_hda_intel 0000:c2:00.1: CORB reset timeout, USB peripherals
disconnected, machine unresponsive, hard reset required.
What this rules out
The "drm/amdgpu: rework how we handle TLB fences" commit cannot be the sole
root cause. 6.19.10 lacks that commit and exhibits the identical failure mode
with the identical register state. 6.19.11 may accelerate the same underlying
bug (1–10 hours on 6.19.11 vs ~22 hours on 6.19.10 — same order of magnitude,
not a sharp regression), but the bug is older.
What the trigger looks like on this run
Unlike the 6.19.11 crashes where llama-vulkan was actively hammering the GPU
at the time of the wedge, on 6.19.10 the first observable failure is in the
VPE idle-work path: vpe_idle_work_handler fired,
vpe_set_powergating_state(GATE) was sent to the SMU, and the SMU never
acknowledged. The mailbox was already in state 0x32 when the next command
(GetEnabledSmuFeatures) was attempted 5 seconds later.
This points to the bug being in the smu_v14_0_0 driver/firmware interaction,
reachable any time the driver tries to power-gate a long-idle IP block (VPE,
VCN, gfxoff). It's not specific to a hot compute path.
Compute and KFD rings kept working through the SMU stall — userspace
llama-server health checks were still returning 200 OK at 16:48:02, eighteen
seconds after the first SMU error. Only the power-management mailbox path was
wedged. This is consistent with "SMU mailbox stuck mid-command, GFX/KFD rings
still draining inflight work, but no new SMU commands can land."
Not related
An earlier hibernation request at 15:57:25 was rejected by kernel lockdown at
the permission-check stage with no device callbacks invoked — checked the
journal carefully, no pm: suspend entry or freezing user space messages
followed. Not a contributor.
No MCE, no kernel panic, no Oops. Pure firmware/driver wedge.
Created attachment 2138416 [details]
kernel log extract from failed boot
Created attachment 2138417 [details]
system info and package versions
I can reproduce a very similar failure on Fedora 43 on a Strix Halo system. System: - Fedora 43 - GNOME on Wayland - Kernel: `6.19.13-200.fc43.x86_64` - GPU: AMD Strix Halo / Radeon 8060S Graphics, PCI ID `1002:1586` - Chrome: `147.0.7727.116` - Brave Flatpak: `1.89.143` - Mesa: `25.3.6` - `linux-firmware`: `20260410-1.fc43` Kernel args in use: `amdgpu.gttsize=126976 ttm.pages_limit=32505856` Observed behavior: - machine hard-freezes during Chromium-family browser video playback - physical reset is required - confirmed reproduced on `2026-04-27` Important note on browser flags: - the confirmed hard-freeze I can tie cleanly was with the normal non-safe Chrome launcher - I also have a separate Chrome `SIGTRAP` coredump from a process launched with `--use-gl=desktop` - however, at the time there were mixed Chrome/Brave instances in different launch modes, so I do not want to overstate that as proof that `--use-gl=desktop` failed as a mitigation Most relevant kernel log sequence from the previous boot: ```text Apr 27 13:28:11 amdgpu: Dumping IP State Apr 27 13:28:13 amdgpu: Failed to power gate VPE! Apr 27 13:28:18 amdgpu: Failed to disable gfxoff! Apr 27 13:28:23 amdgpu: Failed to disable gfxoff! Apr 27 13:28:28 amdgpu: Failed to disable gfxoff! Apr 27 13:28:33 amdgpu: AMDGPU device coredump file has been created Apr 27 13:28:33 amdgpu: ring sdma0 timeout, signaled seq=6289, emitted seq=6291 Apr 27 13:28:33 amdgpu: Starting sdma0 ring reset Apr 27 13:28:33 amdgpu: ring sdma0 test failed (-110) Apr 27 13:28:33 amdgpu: Ring sdma0 reset failed Apr 27 13:28:33 amdgpu: GPU reset begin!. Source: 1 ``` This looks closely related to this bug because it overlaps on: - Strix Halo / Radeon 8060S (`1002:1586`) - `Failed to power gate VPE!` - `Failed to disable gfxoff!` - hard machine lockup requiring reset I am attaching: - kernel log extract - system info / package versions Sorry, I forgot to add this on issue reproduction -- just playing YouTube videos hard-crashes in a few minutes. Reproduced on 6.19.13-300.fc44 -- bug also reachable via failed
GPU-reset cleanup, not just sustained workload
================================================================
Adding another reproduction of this SMU mailbox wedge after the
Fedora 43 -> 44 upgrade. The 6.19.13-300.fc44 kernel does NOT carry
a fix for this issue, and this incident shows a path into the wedge
that doesn't require sustained GPU load -- a userspace GPU-client
crash whose kernel-side cleanup fails is enough.
System
------
Hardware: AMD Ryzen AI Max+ 395 (Strix Halo, gfx1151),
128 GB unified memory
Kernel: 6.19.13-300.fc44.x86_64
Distro: Fedora 44
linux-firmware: 20260410-1.fc44
microcode_ctl: 2.1-74.fc44
mesa-vulkan-drivers: 26.0.3-4.fc44
HSA_OVERRIDE: HSA_OVERRIDE_GFX_VERSION=11.5.1 set globally
GPU PCI BDF: 0000:c2:00.0
Workload at trigger time
------------------------
- Primary GPU client: vLLM serving google/gemma-4-31B-it (AWQ) inside
container docker.io/kyuz0/vllm-therock-gfx1151:stable
(digest sha256:f89c8c689ade28877ade980ba0f29b3142af16c6ebb7f3f285311d38bc81a8a2,
built 2026-04-22), --attention-backend ROCM_ATTN,
gpu_memory_utilization=0.9, max_model_len=8192, ROCm THERock build
(rocm_sdk_libraries_gfx1151, hsa-runtime64.so.1).
- Concurrent GPU client: Slack flatpak (Electron) was using compute
queue comp_1.1.1 at the moment of failure.
- vLLM had been healthy and serving /v1/models at 17:50:18.
Failure sequence
----------------
The bug enters from a kernel-side cleanup failure after a userspace
GPU-client crash, not from sustained load:
17:59:16 amdgpu: ring comp_1.1.1 timeout, signaled seq=4260, emitted seq=4262
Process slack pid 7097 thread slack:cs0 pid 7255
Starting comp_1.1.1 ring reset -> reset compute queue (1:1:1)
Ring comp_1.1.1 reset failed
GPU reset begin!. Source: 1
Failed to evict queue 2
Failed to suspend process pid 110401 (vllm worker)
remove_all_kfd_queues_mes: Failed to remove queue 1 for dev 46231
traps: vllm[110088] general protection fault ip:7f8ad1055735
sp:7f88dfffec60 error:0 in libc.so.6[1735,7f8ad1054000+16f000]
(10x) sq_intr: error, detail 0x00000000, type 2, sh 0/1, priv 1, wave_id N
MODE2 reset
[drm] *ERROR* Failed to initialize parser -125!
GPU reset(1) succeeded!
[drm] device wedged, but recovered through reset
17:59:20 MES failed to respond to msg=REMOVE_QUEUE
failed to remove hardware queue from MES, doorbell=0x1002
MES might be in unrecoverable state, issue a GPU reset
Failed to remove queue 1
GPU reset begin!. Source: 3
MES failed to respond to msg=REMOVE_QUEUE doorbell=0x1000
MODE2 reset
GPU reset succeeded, trying to resume
SMU is resuming...
SMU is resumed successfully!
17:59:21 [drm:amdgpu_ring_test_helper] *ERROR* ring vpe test failed (-110)
resume of IP block <vpe_v6_1> failed -110
GPU reset end with ret = -110
17:59:26 amdgpu: SMU: I'm not done with your previous command:
SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
Failed to power gate VPE!
[drm:vpe_set_powergating_state] *ERROR* Dpm disable vpe failed, ret = -62
(then JPEG, VCN0, VCN1 same)
From 17:59:26 the SMU mailbox is wedged in the familiar 0x32 state
and remains stuck until hard reset; "Failed to retrieve enabled
ppfeatures!" repeats every ~5s for the remainder of uptime.
Cascade
-------
Same PCIe-complex collapse pattern as the prior incidents:
- 17:59:33 xhci_hcd 0000:c4:00.3: Refused to change power state
from D3hot to D0 -> Controller not ready at resume -19
-> HC died; cleaning up -> all USB downstream of c4:00.3
disconnected (5-1, 5-1.4, 6-1, 6-1.1)
- 17:59:34 snd_hda_intel 0000:c2:00.1: CORB reset timeout#2,
CORBRP = 65535 (HDMI audio function on the GPU)
- 18:00:03 Second USB hub disconnect (1-1.2, 2-1.2)
- Every udev re-probe of /dev/dri/card1 then times out at exactly
2m59s: "switcheroo-control-check-discrete-amdgpu /dev/dri/card1
timed out after 2min 59s, killing" (4 occurrences between 18:05
and 18:14).
Recovery
--------
Manual "sudo reboot now" issued at 18:14:43. Graceful shutdown could
NOT complete -- the chat-ui podman container kept logging through
18:17:42 (3 minutes after reboot now); next boot did not begin until
18:19:33, indicating a hard reset was required. This matches the
"PCIe-complex-down" state preventing systemd shutdown from finishing.
What's new vs. the 2026-04-10/11 incidents
------------------------------------------
The 2026-04-10/11 reports framed the trigger as "sustained GPU
workload" (Vulkan llama.cpp, ROCm llama.cpp). This incident shows a
THIRD path in:
1. Userspace GPU client (vLLM) takes a SIGSEGV in libc.so.6.
2. amdgpu attempts queue cleanup -> Ring comp_1.1.1 reset failed
-> Failed to evict queue 2 -> remove_all_kfd_queues_mes:
Failed to remove queue 1.
3. Driver falls back to MODE2 reset; FIRST reset reports success
("device wedged, but recovered through reset").
4. ~4s later MES still cannot respond to REMOVE_QUEUE for two
doorbells (0x1000, 0x1002) -> second MODE2 reset.
5. On the second reset, vpe_v6_1 IP-block resume returns -110
(-ETIMEDOUT).
6. From that moment SMU mailbox is wedged in 0x32 -- same end
state as the previous incidents.
So the failure is reachable not just via sustained workload but via
ANY GPU-reset cleanup that fails to resume the VPE IP block. The
relevant failure point is the vpe_v6_1 resume after MODE2,
immediately preceding the SMU 0x32 wedge.
It is also worth noting that a second GPU client (Slack Electron)
was issuing compute work on comp_1.1.1 at the same moment vLLM
crashed -- the dangling queue that the kernel tried (and failed) to
reset is Slack's, not vLLM's. The hang chain begins with a non-LLM
client's queue reset failure cascading into vLLM's KFD process
suspension also failing.
Evidence
--------
- Kernel-only excerpt of the 17:59:14-18:00:30 crash window:
~120 lines (can attach)
- Full journal 17:55-18:18 (kernel + user + container logs):
~2929 lines
- Coredump:
/var/spool/abrt/ccpp-2026-04-27-17:59:20.849710-109810
(vLLM python3.12, 4.2 GB peak) -- abrt could not finalize
("Error: No segments found in coredump")
- amdgpu device coredump:
/sys/class/drm/card1/device/devcoredump/data was created twice
(17:59:16 and 17:59:20); not preserved across reboot
Asks
----
1. Is vpe_v6_1 IP-block resume after MODE2 a known-fragile path on
Strix Halo? The -110 (-ETIMEDOUT) on "ring vpe test" is the
immediate precursor to the SMU 0x32 wedge in this incident.
2. Any update on which SMU message ID is stuck at
SMN_C2PMSG_66:0x00000032 in the current Strix Halo SMU firmware?
Knowing whether this maps to a power-gate or DPM command would
localize the firmware-side state machine bug.
3. Can the driver detect "VPE resume failed after MODE2" and avoid
issuing further SMU mailbox commands (which only stack up behind
the wedged 0x32) so that a third-stage recovery (BACO / cold
link reset) is at least attempted before declaring the device
dead?
Created attachment 2138420 [details]
0427 crash details
Created attachment 2138421 [details]
0427 crash kernel log
Created attachment 2138434 [details]
I reproduced a crash, and this time I got some artifacts.
Fourth incident on 2026-04-27 -- slack-triggered, two-stage cascade, full coredump analysis
============================================================================================
Reproduced again on 6.19.13-300.fc44.x86_64 (boot from 20:31:35 EDT, hung at
~20:55, hard-reset at 20:56:45). This is the fourth incident on this hardware
in one day. Two new findings worth recording:
1. Slack (not just chrome) is sufficient as a sole trigger. First offender
this incident was slack PID 8955 on comp_1.1.0, three minutes into a
fresh boot.
2. The userspace SIGSEGV is collateral, not a vLLM bug. Full coredump
analysis below pinpoints the faulting instruction as 'hlt' inside glibc
abort(), called by ROCm's rocr::core::Runtime::HwExceptionHandler. The
HSA runtime is reacting cleanly to the GPU disappearing; glibc's
last-resort suicide path gets escalated to SIGSEGV by the kernel.
Kernel timeline (boot -1: 20:31:35 -> 20:55)
--------------------------------------------
20:31:35 boot
20:34:22 amdgpu: ring comp_1.1.0 timeout, signaled seq=219, emitted seq=221
Process slack pid 8955 thread slack:cs0 pid 9019
Starting comp_1.1.0 ring reset -> reset compute queue (1:1:0) -> SUCCEEDED
[drm] device wedged, but recovered through reset <-- false-recovery #1
20:36:12 amdgpu: ring comp_1.1.1 timeout, signaled seq=41, emitted seq=43
Process chrome pid 8504 thread chrome:cs0 pid 8558
Starting comp_1.1.1 ring reset -> Ring comp_1.1.1 reset FAILED
20:36:13 GPU reset begin!. Source: 1
Failed to evict queue 2
Failed to suspend process pid 10570
remove_all_kfd_queues_mes: Failed to remove queue 1 for dev 46231
traps: vllm[5683] general protection fault ip:7f49a2242735
sp:7f47b53fdc60 error:0 in libc.so.6[1735,7f49a2241000+16f000]
MODE2 reset -> GPU reset succeeded, trying to resume
SMU is resumed successfully!
[drm] device wedged, but recovered through reset <-- false-recovery #2
20:36:17 MES failed to respond to msg=REMOVE_QUEUE
failed to remove hardware queue from MES, doorbell=0x1002
MES might be in unrecoverable state, issue a GPU reset
MES failed to respond to msg=REMOVE_QUEUE
failed to remove hardware queue from MES, doorbell=0x1000
GPU reset begin!. Source: 3
MODE2 reset -> GPU reset succeeded, trying to resume
SMU is resumed successfully!
[drm:amdgpu_ring_test_helper] *ERROR* ring vpe test failed (-110)
resume of IP block <vpe_v6_1> failed -110
GPU reset end with ret = -110
20:36:23+ amdgpu: SMU: I'm not done with your previous command:
SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
Failed to power gate VPE / JPEG / VCN inst 0 / VCN inst 1
Failed to retrieve enabled ppfeatures (repeats every 4-5s)
20:37:32 usb 1-1.2: USB disconnect (xhci PCIe partial collapse)
20:42:13 USB partially recovered
20:42:20 USB disconnected again
20:53:18-55 more SMU 0x32 / Failed to retrieve enabled ppfeatures
~20:55 user hard-reset (TCO never fired this time; system hung but didn't panic)
20:56:45 boot 0
This is signature-identical to the 2026-04-27 19:41 incident: comp_1.1.x
ring timeout -> reset failure -> MES REMOVE_QUEUE failure -> MODE2 reset ->
vpe_v6_1 -110 -> SMU 0x32. The MODE2 reset reports success ("device wedged,
but recovered through reset") but on this kernel, that line is not the
all-clear -- a delayed wedge can fire seconds (this incident) or tens of
minutes (19:41 incident) later via the next power-gate transition.
Coredump analysis (vLLM PID 2995, SIGSEGV @ 20:36:13)
-----------------------------------------------------
The userspace SIGSEGV traceable from the kernel "traps:" line was analyzed
against the container's libraries:
Faulting thread (LWP 68 = TID 5683):
RIP = 0x00007f49a2242735 libc.so.6 + 0x1735
RSP = 0x00007f47b53fdc60 (matches kernel sp:)
Caller: 0x00007f4992c1ca28 libhsa-runtime64.so.1 + 0x106a6c
Container glibc + 0x1735 disassembles as the 'hlt' instruction inside abort():
abort() in container glibc (Fedora 43 base):
170a: call __pthread_raise_internal ; raise(SIGABRT) -- did not terminate
...
172e: mov $0xe, %eax ; rt_sigprocmask
1733: syscall
1735: hlt ; <-- faulted (privileged insn in CPL=3)
1736: mov $0x7f, %edi
173b: call _exit
'hlt' is glibc's last-resort suicide path when normal SIGABRT delivery does
not kill the process. Userspace 'hlt' triggers a CPU #GP with error:0 --
exactly what the kernel logged. The kernel translates the #GP into SIGSEGV.
Caller in libhsa-runtime64.so.1 + 0x106a6c is in:
rocr::core::Runtime::HwExceptionHandler(long, void*)
106a62: call fprintf@plt ; logs "HW Exception by GPU node-1 reason :GPU Hang"
106a67: call abort@plt ; deliberate abort
106a6c: mov %rax, %r14 ; <-- return address (never reached)
This is the same handler that emitted "HW Exception by GPU node-1
reason :GPU Hang" in the 2026-04-27 19:41 vllm.log (attached to the previous
capture in this bug).
The other 14 vLLM threads were all idle in __syscall_cancel_arch_end
returning from syscall 0xca (futex wait) -- a sleeping worker pool. Only
the HSA exception-handler thread was running.
Causal chain
------------
amdgpu kernel: chrome comp_1.1.1 ring timeout
-> comp_1.1.1 ring reset FAILED
-> GPU Reset (Source 1) -> MES REMOVE_QUEUE failure
-> kfd queue removal failed for vLLM
HSA runtime userspace: KFD signals HW exception
-> rocr::core::Runtime::HwExceptionHandler
-> fprintf("HW Exception ... GPU Hang")
-> abort()
glibc abort(): raise(SIGABRT) failed to terminate (likely SIGABRT masked
on HSA thread)
-> fall-through to rt_sigprocmask + hlt
-> hlt #GP -> kernel delivers SIGSEGV -> core dumped
Significance
------------
The vLLM PID 2995 SIGSEGV is a downstream symptom, not the root cause.
ROCm's Runtime::HwExceptionHandler is the userspace choke-point in this bug
class -- both the 19:41 and 20:36 incidents went through it. Future
captures with "libc.so.6 + 0x1735" on the stack of a process using
/dev/kfd should be read as "GPU hung; this is HSA reacting" rather than
as a separate userspace bug.
The kernel-side wedge cause is unchanged from prior comments: vpe_v6_1
IP-block resume fails -110 after the second MODE2 reset, leaving the SMU
mailbox stuck in SMN_C2PMSG_66:0x00000032.
Mitigation taken on the affected machine
----------------------------------------
- vllm-gemma4-31b-awq.service (user quadlet) has been disabled at
auto-start while this bug is open -- the 2026-04-27 19:23 incident
showed vLLM model-load profiling alone is enough to trigger the wedge,
and auto-restart on every boot was reloading the gun.
- Recommend Slack and Chrome users on gfx1151 disable GPU compositing /
WebGPU until a fix lands. Both Electron compute queues have now been
observed as sole sufficient triggers (chrome comp_1.1.1 20:30, slack
comp_1.1.0 20:34, chrome comp_1.1.1 20:36).
Evidence attached (this incident, 20:31->20:55)
-----------------------------------------------
- boot-1-kernel.log full kernel journal of the crash boot (1691 lines)
- boot-1-gpu-events.log amdgpu/SMU/MES/PCIe-only events filtered (236 lines)
- coredump-2995-info.txt systemd-coredump info dump for vLLM SIGSEGV
- boots.txt boot index showing the four boots of the day
- current-state.txt kdump/sysctl/devcoredump status post-reboot
- coredumpctl-list.txt coredumps in the crash window
Note: no devcoredump captured this time -- /sys/class/devcoredump/ was
empty when the box came back, and 5-minute auto-expiry plus hard-reset
between hang and reboot lost it. No vmcore (kernel hung but never
panicked, TCO didn't fire). Pstore empty. Journal is the entire
kernel-side record for this incident.
Created attachment 2138436 [details]
Reproduced again on 6.19.13-300.fc44.x86_64
A potential fix has been identified. This looks like it may actually be a VAAPI issue. https://community.frame.work/t/smu-deadlock-system-freeze-on-fedora-43/81795/27?page=2 I'm patching this into my fedora kernel to test. Thanks—this is very helpful for immediate stability. That said, I’d frame it as a workaround, not a fix: it avoids the bug by keeping clocks/subsystems active instead of allowing normal power-gating. We’re sidestepping the failing SMU path rather than resolving the deadlock (no proper busy handling / retry / serialization), so the root issue remains and power management is degraded. This likely needs coordination with the silicon vendor (AMD) to address at the SMU/firmware level. Yes - that's correct, a workaround.
Update: Crupi's no_vpe_idle_pg patch (6.19.13-300.vpe1.fc44,
amdgpu.no_vpe_idle_pg=1)
does not fully fix this bug.
Reproduced 2026-04-27 22:40 EDT under pure inference load — no VAAPI, no
Electron,
no concurrent GPU clients. vLLM (TheRock ROCm image, Gemma 4 31B AWQ-INT4) hit
HW Exception on a compute queue (comp_1.1.1) ~23s after model load completed.
Cascade:
HW Exception by GPU node-1 reason :GPU Hang
[drm] *ERROR* Failed to initialize parser -125
ring vpe test failed (-110) # post-reset, collateral
Dpm disable jpeg failed, ret = -62
Failed to power gate VCN instance 0 / 1
SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000021
Failed to retrieve enabled ppfeatures # repeating every 5s
Notes:
- SMU stuck-command code is 0x21 on the patched kernel (was 0x32 unpatched).
Same wedge state, different message ID.
- No "Failed to power gate VPE!" in the cascade — consistent with VPE idle-pg
being gated off by the patch. Wedge entered via JPEG/VCN power-gate path.
- Conclusion: at least two triggers reach the same SMU mailbox wedge:
(1) VAAPI -> VPE idle power-gate [closed by Crupi patch]
(2) Inference HW exception -> MES REMOVE_QUEUE fail -> MODE2 reset
-> JPEG/VCN power-gate fail -> SMU wedge [still open]
- Clean shutdown hung ~8 hours after wedge; manual power-cycle required.
Same "hard reset only" recovery as 0x32.
Two coredumps + devcoredump captured; available on request.
|