Created attachment 2136649 [details]
Contains system identity, lspci for c2:00.0, installed kernels, last 6 boot list, etc.

## Summary

After upgrading from kernel 6.19.10 to 6.19.11 on a Ryzen AI Max+ 395 (Strix Halo, gfx1151) running Fedora 43, the system locks up hard within a few hours of any GPU workload. Three consecutive lockups in one day, each requiring a power-off. 6.19.10 ran the same workload stably for 3+ days immediately prior.

The only amdgpu code change in the Fedora 6.19.11 changelog is "drm/amdgpu: rework how we handle TLB fences" (Alex Deucher). All other amdgpu entries in the changelog are kconfig / build-flag adjustments. That patch is the suspected regression.

## Hardware

- CPU/APU: AMD Ryzen AI Max+ 395 (Strix Halo)
- iGPU: Radeon 8060S Graphics -- PCI 1002:1586 at 0000:c2:00.0
- IP blocks (from journal): smu_v14_0_0, gfx_v11_0_0, DCN 3.5.1
- Memory: 128 GB unified (UMA APU)
- BIOS VRAM carveout: 64 GB

## Software

- Fedora 43
- Bad kernel: 6.19.11-200.fc43.x86_64 (#1 SMP PREEMPT_DYNAMIC Thu Apr 2 16:55:52 UTC 2026)
- Last good: 6.19.10-200.fc43.x86_64
- Userspace workload at time of hang: llama.cpp Vulkan backend (RADV) running a Qwen3 model on the iGPU. The workload is not ROCm/HIP -- this is pure Vulkan, and it previously stressed 6.19.10 for days without incident.

## Symptom

The first kernel error in each of the three lockup boots is identical:

    amdgpu 0000:c2:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
    amdgpu 0000:c2:00.0: amdgpu: Failed to retrieve enabled ppfeatures!

The error then repeats every ~5 seconds, cycling between msg id 0x32 (GetEnabledSmuFeatures) and 0x19 (AllowGfxOff), with:

    amdgpu 0000:c2:00.0: amdgpu: Failed to enable gfxoff!
    amdgpu 0000:c2:00.0: [drm] SMU response after wait: 0, msg id = 18

Userspace wedges shortly afterward:

- llama-vulkan.service exits with SIGABRT, rootless podman coredumps
- switcheroo-control-check-discrete-amdgpu times out after 2m59s
- systemd-sleep: "Failed to freeze unit 'user.slice': Connection timed out" (this is what finally hangs the machine -- suspend cannot complete because the GPU-using cgroup will not freeze)

The machine becomes unresponsive and requires a hard power-off.

## Reproduction

1. Boot kernel 6.19.11-200.fc43 on Strix Halo (Ryzen AI Max+ 395).
2. Run any sustained Vulkan compute workload on the iGPU (llama.cpp -DGGML_VULKAN=ON against a multi-GB model is sufficient).
3. Within 1-10 hours the SMU mailbox enters the "I'm not done with your previous command" state and never recovers.

Observed in three consecutive boots on 2026-04-10:

| Boot start (EDT)    | First SMU error     | Hard reset |
|---------------------|---------------------|------------|
| 2026-04-09 23:57:49 | 2026-04-10 09:42:05 | 09:43:24   |
| 2026-04-10 09:44:39 | 2026-04-10 11:46:37 | 11:56:54   |
| 2026-04-10 11:57:48 | 2026-04-10 14:53:08 | 14:56:49   |

## Hypothesis

The 6.19.11 Fedora changelog contains exactly one amdgpu code change:

    drm/amdgpu: rework how we handle TLB fences (Alex Deucher)

The symptom -- the SMU mailbox reporting that a prior command is still in flight when the driver posts the next one -- is consistent with a TLB-fence ordering / completion-accounting bug on a UMA APU where the IOMMU TLB is shared between CPU and iGPU. gfx1151 / Strix Halo is a new IP block and a likely candidate to expose a race in the reworked path that doesn't manifest on discrete cards.

## Workaround

Will attempt 6.19.10-200.fc43 to see if it restores stability.

## Attachments

- journal excerpt from the three crashed boots (will attach)
- `lspci -nn | grep 1002:1586`
- `uname -a`
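## Appendix: illustrative model of the SMU mailbox wedge

For triagers not deep in swsmu internals, here is a toy user-space C model of the mailbox handshake as I understand it. Register roles are inferred from the error text in the journal; this is an illustration of the wedge, not the kernel code. The point: the driver refuses to post a new message while the response register still shows the previous command unanswered, which is why one stuck command turns into the same error line every ~5 seconds forever.

```c
/*
 * Toy user-space model of the swsmu mailbox handshake -- an
 * illustration of the wedge, NOT the actual kernel code. Register
 * roles are inferred from the error text in the journal.
 */
#include <stdio.h>

static unsigned C2PMSG_66; /* message id, written by the driver           */
static unsigned C2PMSG_82; /* message argument, written by the driver     */
static unsigned C2PMSG_90; /* response, written by SMU firmware; 0 = busy */

/* Placeholder: the real driver polls the response register with a
 * timeout while the firmware processes the command. */
static unsigned poll_response(void)
{
	return C2PMSG_90;
}

static int smu_send_msg(unsigned msg, unsigned arg)
{
	/* The PREVIOUS command must have completed (nonzero response)
	 * before a new one may be posted. If it never completed, the
	 * driver logs the line seen in this bug and bails out. */
	if (poll_response() == 0) {
		printf("SMU: I'm not done with your previous command: "
		       "SMN_C2PMSG_66:0x%08X SMN_C2PMSG_82:0x%08X\n",
		       C2PMSG_66, C2PMSG_82);
		return -1; /* the kernel returns -ETIME here */
	}

	/* Post the new command: clear response, write arg, write msg. */
	C2PMSG_90 = 0;
	C2PMSG_82 = arg;
	C2PMSG_66 = msg;
	return poll_response() ? 0 : -1;
}

int main(void)
{
	/* Firmware latched msg 0x32 (GetEnabledSmuFeatures) and never
	 * answered, so every later send -- e.g. 0x19 AllowGfxOff --
	 * fails with the identical log line, matching the journal. */
	C2PMSG_66 = 0x32;
	smu_send_msg(0x19, 0);
	return 0;
}
```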
Update 2026-04-11: same hang reproduced on 6.19.10 -- TLB-fences theory falsified

I downgraded to 6.19.10-200.fc43 as the workaround and the system was stable for ~22 hours. It then hung with a byte-identical signature to the three 6.19.11 crashes in my original report.

Crash details
- Kernel: 6.19.10-200.fc43.x86_64 (#1 SMP PREEMPT_DYNAMIC Wed Mar 25 16:09:19 UTC 2026)
- Boot start: 2026-04-10 18:34:56 EDT
- First SMU error: 2026-04-11 16:47:44 EDT (~22 hours uptime)
- Hard reset: 2026-04-11 16:49:12 EDT
- Workload at time of hang: llama.cpp running Qwen3-32B Q4_K_M, this time on the ROCm backend (quay.io/ramalama/rocm:0.18.0, HSA_OVERRIDE_GFX_VERSION=11.5.1) rather than the Vulkan backend used in the original three crashes. So the backend is not the discriminator -- both Vulkan/RADV and ROCm/HIP workloads can wedge the SMU.

First fault and cascade

16:47:44 amdgpu 0000:c2:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
16:47:44 amdgpu 0000:c2:00.0: amdgpu: Failed to power gate VPE!
16:47:44 [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62
16:47:49 amdgpu: Failed to retrieve enabled ppfeatures
16:47:51 amdgpu: Dumping IP State
16:47:58 amdgpu: Failed to power gate VCN instance 0
16:48:03 amdgpu: Failed to power gate VCN instance 1
16:48:07 amdgpu: Failed to disable gfxoff
16:48:09 amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
16:48:09 amdgpu: failed to reg_write_reg_wait

Same SMN_C2PMSG_66:0x00000032 (GetEnabledSmuFeatures) wedge as the three 6.19.11 boots. The PCIe-complex collapse followed the same pattern: xhci_hcd 0000:c4:00.3 "HC died", snd_hda_intel 0000:c2:00.1 "CORB reset timeout", USB peripherals disconnected, machine unresponsive, hard reset required.

What this rules out

The "drm/amdgpu: rework how we handle TLB fences" commit cannot be the sole root cause. 6.19.10 lacks that commit and exhibits the identical failure mode with the identical register state. 6.19.11 may accelerate the same underlying bug (1-10 hours on 6.19.11 vs ~22 hours on 6.19.10 -- same order of magnitude, not a sharp regression), but the bug is older.

What the trigger looks like on this run

Unlike the 6.19.11 crashes, where llama-vulkan was actively hammering the GPU at the time of the wedge, on 6.19.10 the first observable failure is in the VPE idle-work path: vpe_idle_work_handler fired, vpe_set_powergating_state(GATE) was sent to the SMU, and the SMU never acknowledged. The mailbox was already in state 0x32 when the next command (GetEnabledSmuFeatures) was attempted 5 seconds later.

This points to the bug being in the smu_v14_0_0 driver/firmware interaction, reachable any time the driver tries to power-gate a long-idle IP block (VPE, VCN, gfxoff). It is not specific to a hot compute path.

Compute and KFD rings kept working through the SMU stall -- userspace llama-server health checks were still returning 200 OK at 16:48:02, eighteen seconds after the first SMU error. Only the power-management mailbox path was wedged. This is consistent with "SMU mailbox stuck mid-command, GFX/KFD rings still draining in-flight work, but no new SMU commands can land."

Not related

An earlier hibernation request at 15:57:25 was rejected by kernel lockdown at the permission-check stage with no device callbacks invoked -- I checked the journal carefully; no "pm: suspend entry" or "freezing user space" messages followed. Not a contributor. No MCE, no kernel panic, no Oops. Pure firmware/driver wedge.
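Appendix: the idle power-gate path named above, schematically. This is paraphrased from my reading of drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c -- simplified and abridged, not the literal kernel source -- to show why a machine with no hot GPU workload still reaches the SMU mailbox:

```c
/* Paraphrase of the amdgpu VPE idle power-gate path (simplified from
 * drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c; not the literal source). */

/* When the VPE ring goes idle, the driver arms a delayed work item
 * instead of power-gating immediately. */
static void vpe_ring_end_use(struct amdgpu_ring *ring)
{
	schedule_delayed_work(&ring->adev->vpe.idle_work, VPE_IDLE_TIMEOUT);
}

/* If nothing new was submitted before the timer fires, the handler
 * asks the SMU, via the power-gating call below, to gate the block.
 * In this crash that GATE request is the command the SMU never
 * acknowledged: from then on the mailbox sits mid-command and every
 * later SMU message fails with "I'm not done with your previous
 * command". No userspace activity is required to get here -- only an
 * IP block going idle. */
static void vpe_idle_work_handler(struct work_struct *work)
{
	struct amdgpu_device *adev =
		container_of(work, struct amdgpu_device, vpe.idle_work.work);

	if (!amdgpu_fence_count_emitted(&adev->vpe.ring))
		amdgpu_device_ip_set_powergating_state(adev,
						       AMD_IP_BLOCK_TYPE_VPE,
						       AMD_PG_STATE_GATE);
}
```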
Created attachment 2138416 [details] kernel log extract from failed boot
Created attachment 2138417 [details] system info and package versions
I can reproduce a very similar failure on Fedora 43 on a Strix Halo system.

System:

- Fedora 43
- GNOME on Wayland
- Kernel: `6.19.13-200.fc43.x86_64`
- GPU: AMD Strix Halo / Radeon 8060S Graphics, PCI ID `1002:1586`
- Chrome: `147.0.7727.116`
- Brave Flatpak: `1.89.143`
- Mesa: `25.3.6`
- `linux-firmware`: `20260410-1.fc43`

Kernel args in use: `amdgpu.gttsize=126976 ttm.pages_limit=32505856`

Observed behavior:

- machine hard-freezes during Chromium-family browser video playback
- physical reset is required
- confirmed reproduced on `2026-04-27`

Important note on browser flags:

- the one hard-freeze I can tie cleanly to a launch mode happened with the normal (non-safe-mode) Chrome launcher
- I also have a separate Chrome `SIGTRAP` coredump from a process launched with `--use-gl=desktop`
- however, at the time there were mixed Chrome/Brave instances in different launch modes, so I do not want to overstate that as proof that `--use-gl=desktop` failed as a mitigation

Most relevant kernel log sequence from the previous boot:

```text
Apr 27 13:28:11 amdgpu: Dumping IP State
Apr 27 13:28:13 amdgpu: Failed to power gate VPE!
Apr 27 13:28:18 amdgpu: Failed to disable gfxoff!
Apr 27 13:28:23 amdgpu: Failed to disable gfxoff!
Apr 27 13:28:28 amdgpu: Failed to disable gfxoff!
Apr 27 13:28:33 amdgpu: AMDGPU device coredump file has been created
Apr 27 13:28:33 amdgpu: ring sdma0 timeout, signaled seq=6289, emitted seq=6291
Apr 27 13:28:33 amdgpu: Starting sdma0 ring reset
Apr 27 13:28:33 amdgpu: ring sdma0 test failed (-110)
Apr 27 13:28:33 amdgpu: Ring sdma0 reset failed
Apr 27 13:28:33 amdgpu: GPU reset begin!. Source: 1
```

This looks closely related to this bug because it overlaps on:

- Strix Halo / Radeon 8060S (`1002:1586`)
- `Failed to power gate VPE!`
- `Failed to disable gfxoff!`
- hard machine lockup requiring reset

I am attaching:

- kernel log extract
- system info / package versions
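Side note on those kernel args, in case the units matter for triage: if I read them right, `amdgpu.gttsize` is in MiB and `ttm.pages_limit` is in 4 KiB pages, so both decode to the same budget: 126976 MiB = 124 GiB, and 32505856 × 4096 B = 133143986176 B = 124 GiB. The two limits are consistent with each other (124 GiB of the 128 GB UMA pool), so a unit mismatch between them does not look like a factor here.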
Sorry, I forgot to add this about reproduction: just playing YouTube videos hard-crashes the machine within a few minutes.
Reproduced on 6.19.13-300.fc44 -- bug also reachable via failed GPU-reset cleanup, not just sustained workload ================================================================ Adding another reproduction of this SMU mailbox wedge after the Fedora 43 -> 44 upgrade. The 6.19.13-300.fc44 kernel does NOT carry a fix for this issue, and this incident shows a path into the wedge that doesn't require sustained GPU load -- a userspace GPU-client crash whose kernel-side cleanup fails is enough. System ------ Hardware: AMD Ryzen AI Max+ 395 (Strix Halo, gfx1151), 128 GB unified memory Kernel: 6.19.13-300.fc44.x86_64 Distro: Fedora 44 linux-firmware: 20260410-1.fc44 microcode_ctl: 2.1-74.fc44 mesa-vulkan-drivers: 26.0.3-4.fc44 HSA_OVERRIDE: HSA_OVERRIDE_GFX_VERSION=11.5.1 set globally GPU PCI BDF: 0000:c2:00.0 Workload at trigger time ------------------------ - Primary GPU client: vLLM serving google/gemma-4-31B-it (AWQ) inside container docker.io/kyuz0/vllm-therock-gfx1151:stable (digest sha256:f89c8c689ade28877ade980ba0f29b3142af16c6ebb7f3f285311d38bc81a8a2, built 2026-04-22), --attention-backend ROCM_ATTN, gpu_memory_utilization=0.9, max_model_len=8192, ROCm THERock build (rocm_sdk_libraries_gfx1151, hsa-runtime64.so.1). - Concurrent GPU client: Slack flatpak (Electron) was using compute queue comp_1.1.1 at the moment of failure. - vLLM had been healthy and serving /v1/models at 17:50:18. Failure sequence ---------------- The bug enters from a kernel-side cleanup failure after a userspace GPU-client crash, not from sustained load: 17:59:16 amdgpu: ring comp_1.1.1 timeout, signaled seq=4260, emitted seq=4262 Process slack pid 7097 thread slack:cs0 pid 7255 Starting comp_1.1.1 ring reset -> reset compute queue (1:1:1) Ring comp_1.1.1 reset failed GPU reset begin!. Source: 1 Failed to evict queue 2 Failed to suspend process pid 110401 (vllm worker) remove_all_kfd_queues_mes: Failed to remove queue 1 for dev 46231 traps: vllm[110088] general protection fault ip:7f8ad1055735 sp:7f88dfffec60 error:0 in libc.so.6[1735,7f8ad1054000+16f000] (10x) sq_intr: error, detail 0x00000000, type 2, sh 0/1, priv 1, wave_id N MODE2 reset [drm] *ERROR* Failed to initialize parser -125! GPU reset(1) succeeded! [drm] device wedged, but recovered through reset 17:59:20 MES failed to respond to msg=REMOVE_QUEUE failed to remove hardware queue from MES, doorbell=0x1002 MES might be in unrecoverable state, issue a GPU reset Failed to remove queue 1 GPU reset begin!. Source: 3 MES failed to respond to msg=REMOVE_QUEUE doorbell=0x1000 MODE2 reset GPU reset succeeded, trying to resume SMU is resuming... SMU is resumed successfully! 17:59:21 [drm:amdgpu_ring_test_helper] *ERROR* ring vpe test failed (-110) resume of IP block <vpe_v6_1> failed -110 GPU reset end with ret = -110 17:59:26 amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000 Failed to power gate VPE! [drm:vpe_set_powergating_state] *ERROR* Dpm disable vpe failed, ret = -62 (then JPEG, VCN0, VCN1 same) From 17:59:26 the SMU mailbox is wedged in the familiar 0x32 state and remains stuck until hard reset; "Failed to retrieve enabled ppfeatures!" repeats every ~5s for the remainder of uptime. 
Cascade
-------
Same PCIe-complex collapse pattern as the prior incidents:
- 17:59:33 xhci_hcd 0000:c4:00.3: Refused to change power state from D3hot to D0
  -> Controller not ready at resume -19
  -> HC died; cleaning up
  -> all USB downstream of c4:00.3 disconnected (5-1, 5-1.4, 6-1, 6-1.1)
- 17:59:34 snd_hda_intel 0000:c2:00.1: CORB reset timeout#2, CORBRP = 65535 (HDMI audio function on the GPU)
- 18:00:03 Second USB hub disconnect (1-1.2, 2-1.2)
- Every udev re-probe of /dev/dri/card1 then times out at exactly 2m59s: "switcheroo-control-check-discrete-amdgpu /dev/dri/card1 timed out after 2min 59s, killing" (4 occurrences between 18:05 and 18:14).

Recovery
--------
Manual "sudo reboot now" issued at 18:14:43. Graceful shutdown could NOT complete -- the chat-ui podman container kept logging through 18:17:42 (3 minutes after the reboot command); the next boot did not begin until 18:19:33, indicating a hard reset was required. This matches the "PCIe-complex-down" state preventing systemd shutdown from finishing.

What's new vs. the 2026-04-10/11 incidents
------------------------------------------
The 2026-04-10/11 reports framed the trigger as "sustained GPU workload" (Vulkan llama.cpp, ROCm llama.cpp). This incident shows a THIRD path in:

1. Userspace GPU client (vLLM) takes a SIGSEGV in libc.so.6.
2. amdgpu attempts queue cleanup -> Ring comp_1.1.1 reset failed -> Failed to evict queue 2 -> remove_all_kfd_queues_mes: Failed to remove queue 1.
3. Driver falls back to MODE2 reset; the FIRST reset reports success ("device wedged, but recovered through reset").
4. ~4s later MES still cannot respond to REMOVE_QUEUE for two doorbells (0x1000, 0x1002) -> second MODE2 reset.
5. On the second reset, vpe_v6_1 IP-block resume returns -110 (-ETIMEDOUT).
6. From that moment the SMU mailbox is wedged in 0x32 -- same end state as the previous incidents.

So the failure is reachable not just via sustained workload but via ANY GPU-reset cleanup that fails to resume the VPE IP block. The relevant failure point is the vpe_v6_1 resume after MODE2, immediately preceding the SMU 0x32 wedge.

It is also worth noting that a second GPU client (Slack Electron) was issuing compute work on comp_1.1.1 at the same moment vLLM crashed -- the dangling queue that the kernel tried (and failed) to reset is Slack's, not vLLM's. The hang chain begins with a non-LLM client's queue reset failure cascading into vLLM's KFD process suspension also failing.

Evidence
--------
- Kernel-only excerpt of the 17:59:14-18:00:30 crash window: ~120 lines (can attach)
- Full journal 17:55-18:18 (kernel + user + container logs): ~2929 lines
- Coredump: /var/spool/abrt/ccpp-2026-04-27-17:59:20.849710-109810 (vLLM python3.12, 4.2 GB peak) -- abrt could not finalize ("Error: No segments found in coredump")
- amdgpu device coredump: /sys/class/drm/card1/device/devcoredump/data was created twice (17:59:16 and 17:59:20); not preserved across reboot

Asks
----
1. Is vpe_v6_1 IP-block resume after MODE2 a known-fragile path on Strix Halo? The -110 (-ETIMEDOUT) on "ring vpe test" is the immediate precursor to the SMU 0x32 wedge in this incident.
2. Any update on which SMU message ID is stuck at SMN_C2PMSG_66:0x00000032 in the current Strix Halo SMU firmware? Knowing whether this maps to a power-gate or DPM command would localize the firmware-side state-machine bug.
3. Can the driver detect "VPE resume failed after MODE2" and avoid issuing further SMU mailbox commands (which only stack up behind the wedged 0x32), so that a third-stage recovery (BACO / cold link reset) is at least attempted before declaring the device dead? A sketch of what I mean follows below.
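To make ask 3 concrete, the guard I have in mind is sketched below. This is a stand-alone illustration, not a patch: every identifier is hypothetical and the two stubs stand in for real driver internals.

```c
/*
 * Illustration for ask 3 only -- all names hypothetical, stubs stand in
 * for driver internals. Idea: once an IP-block resume fails after
 * MODE2, mark the mailbox wedged so later power-gate attempts fail
 * fast instead of stacking behind the stuck command, and escalate to a
 * deeper reset (BACO / cold link reset) exactly once.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct gpu_dev {
	bool smu_wedged;                         /* hypothetical wedge flag */
};

static int request_deeper_reset(struct gpu_dev *dev)        /* stub */
{
	(void)dev;
	puts("escalating to third-stage recovery (BACO / cold link reset)");
	return 0;
}

static int smu_send(struct gpu_dev *dev, int msg, int arg)  /* stub */
{
	(void)dev; (void)msg; (void)arg;
	return 0;
}

/* Called from the MODE2 recovery path with the VPE resume result. */
static int post_mode2_resume(struct gpu_dev *dev, int vpe_resume_ret)
{
	if (vpe_resume_ret == -ETIMEDOUT) {
		dev->smu_wedged = true;          /* stop feeding the mailbox */
		return request_deeper_reset(dev);
	}
	return 0;
}

/* Every mailbox send is gated on the flag. */
static int smu_send_checked(struct gpu_dev *dev, int msg, int arg)
{
	if (dev->smu_wedged)
		return -ENODEV;                  /* fail fast, don't queue */
	return smu_send(dev, msg, arg);
}

int main(void)
{
	struct gpu_dev dev = { .smu_wedged = false };

	post_mode2_resume(&dev, -ETIMEDOUT);     /* the -110 seen in the log */
	/* A later power-gate attempt now fails fast instead of wedging: */
	printf("power-gate send -> %d\n", smu_send_checked(&dev, 0x32, 0));
	return 0;
}
```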
Created attachment 2138420 [details] 0427 crash details
Created attachment 2138421 [details] 0427 crash kernel log
Created attachment 2138434 [details] I reproduced a crash, and this time I got some artifacts.
Fourth incident on 2026-04-27 -- slack-triggered, two-stage cascade, full coredump analysis
============================================================================================

Reproduced again on 6.19.13-300.fc44.x86_64 (boot from 20:31:35 EDT, hung and hard-reset at ~20:55, back up at 20:56:45). This is the fourth incident on this hardware in one day. Two new findings worth recording:

1. Slack (not just Chrome) is sufficient as a sole trigger. The first offender in this incident was slack PID 8955 on comp_1.1.0, three minutes into a fresh boot.

2. The userspace SIGSEGV is collateral, not a vLLM bug. The full coredump analysis below pinpoints the faulting instruction as 'hlt' inside glibc abort(), called by ROCm's rocr::core::Runtime::HwExceptionHandler. The HSA runtime is reacting cleanly to the GPU disappearing; glibc's last-resort suicide path gets escalated to SIGSEGV by the kernel.

Kernel timeline (boot -1: 20:31:35 -> 20:55)
--------------------------------------------
20:31:35  boot
20:34:22  amdgpu: ring comp_1.1.0 timeout, signaled seq=219, emitted seq=221
          Process slack pid 8955 thread slack:cs0 pid 9019
          Starting comp_1.1.0 ring reset -> reset compute queue (1:1:0) -> SUCCEEDED
          [drm] device wedged, but recovered through reset   <-- false-recovery #1
20:36:12  amdgpu: ring comp_1.1.1 timeout, signaled seq=41, emitted seq=43
          Process chrome pid 8504 thread chrome:cs0 pid 8558
          Starting comp_1.1.1 ring reset -> Ring comp_1.1.1 reset FAILED
20:36:13  GPU reset begin!. Source: 1
          Failed to evict queue 2
          Failed to suspend process pid 10570
          remove_all_kfd_queues_mes: Failed to remove queue 1 for dev 46231
          traps: vllm[5683] general protection fault ip:7f49a2242735 sp:7f47b53fdc60 error:0 in libc.so.6[1735,7f49a2241000+16f000]
          MODE2 reset -> GPU reset succeeded, trying to resume
          SMU is resumed successfully!
          [drm] device wedged, but recovered through reset   <-- false-recovery #2
20:36:17  MES failed to respond to msg=REMOVE_QUEUE
          failed to remove hardware queue from MES, doorbell=0x1002
          MES might be in unrecoverable state, issue a GPU reset
          MES failed to respond to msg=REMOVE_QUEUE
          failed to remove hardware queue from MES, doorbell=0x1000
          GPU reset begin!. Source: 3
          MODE2 reset -> GPU reset succeeded, trying to resume
          SMU is resumed successfully!
          [drm:amdgpu_ring_test_helper] *ERROR* ring vpe test failed (-110)
          resume of IP block <vpe_v6_1> failed -110
          GPU reset end with ret = -110
20:36:23+ amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
          Failed to power gate VPE / JPEG / VCN inst 0 / VCN inst 1
          Failed to retrieve enabled ppfeatures   (repeats every 4-5s)
20:37:32  usb 1-1.2: USB disconnect (xhci PCIe partial collapse)
20:42:13  USB partially recovered
20:42:20  USB disconnected again
20:53:18-55  more SMU 0x32 / Failed to retrieve enabled ppfeatures
~20:55    user hard-reset (TCO never fired this time; system hung but didn't panic)
20:56:45  boot 0

This is signature-identical to the 2026-04-27 19:41 incident: comp_1.1.x ring timeout -> reset failure -> MES REMOVE_QUEUE failure -> MODE2 reset -> vpe_v6_1 -110 -> SMU 0x32. The MODE2 reset reports success ("device wedged, but recovered through reset"), but on this kernel that line is not the all-clear -- a delayed wedge can fire seconds (this incident) or tens of minutes (the 19:41 incident) later via the next power-gate transition.
Coredump analysis (vLLM PID 2995, SIGSEGV @ 20:36:13)
-----------------------------------------------------
The userspace SIGSEGV traceable from the kernel "traps:" line was analyzed against the container's libraries:

Faulting thread (LWP 68 = TID 5683):
  RIP = 0x00007f49a2242735   libc.so.6 + 0x1735
  RSP = 0x00007f47b53fdc60   (matches kernel sp:)
  Caller: 0x00007f4992c1ca28  libhsa-runtime64.so.1 + 0x106a6c

Container glibc + 0x1735 disassembles as the 'hlt' instruction inside abort():

abort() in container glibc (Fedora 43 base):
  170a: call __pthread_raise_internal  ; raise(SIGABRT) -- did not terminate
  ...
  172e: mov $0xe, %eax                 ; rt_sigprocmask
  1733: syscall
  1735: hlt                            ; <-- faulted (privileged insn in CPL=3)
  1736: mov $0x7f, %edi
  173b: call _exit

'hlt' is glibc's last-resort suicide path when normal SIGABRT delivery does not kill the process. A userspace 'hlt' triggers a CPU #GP with error:0 -- exactly what the kernel logged. The kernel translates the #GP into SIGSEGV.

Caller in libhsa-runtime64.so.1 + 0x106a6c is in rocr::core::Runtime::HwExceptionHandler(long, void*):
  106a62: call fprintf@plt  ; logs "HW Exception by GPU node-1 reason :GPU Hang"
  106a67: call abort@plt    ; deliberate abort
  106a6c: mov %rax, %r14    ; <-- return address (never reached)

This is the same handler that emitted "HW Exception by GPU node-1 reason :GPU Hang" in the 2026-04-27 19:41 vllm.log (attached to the previous capture in this bug). The other 14 vLLM threads were all idle in __syscall_cancel_arch_end returning from syscall 0xca (futex wait) -- a sleeping worker pool. Only the HSA exception-handler thread was running.

Causal chain
------------
amdgpu kernel:         chrome comp_1.1.1 ring timeout -> comp_1.1.1 ring reset FAILED -> GPU Reset (Source 1) -> MES REMOVE_QUEUE failure -> KFD queue removal failed for vLLM
HSA runtime userspace: KFD signals HW exception -> rocr::core::Runtime::HwExceptionHandler -> fprintf("HW Exception ... GPU Hang") -> abort()
glibc abort():         raise(SIGABRT) failed to terminate (likely SIGABRT masked on the HSA thread) -> fall-through to rt_sigprocmask + hlt -> hlt #GP -> kernel delivers SIGSEGV -> core dumped

Significance
------------
The vLLM PID 2995 SIGSEGV is a downstream symptom, not the root cause. ROCm's Runtime::HwExceptionHandler is the userspace choke-point in this bug class -- both the 19:41 and 20:36 incidents went through it. Future captures with "libc.so.6 + 0x1735" on the stack of a process using /dev/kfd should be read as "GPU hung; this is HSA reacting" rather than as a separate userspace bug. (A stand-alone user-space check of the hlt-to-SIGSEGV escalation is appended after the evidence list below.)

The kernel-side wedge cause is unchanged from prior comments: vpe_v6_1 IP-block resume fails -110 after the second MODE2 reset, leaving the SMU mailbox stuck in SMN_C2PMSG_66:0x00000032.

Mitigation taken on the affected machine
----------------------------------------
- vllm-gemma4-31b-awq.service (user quadlet) has been disabled at auto-start while this bug is open -- the 2026-04-27 19:23 incident showed vLLM model-load profiling alone is enough to trigger the wedge, and auto-restart on every boot was reloading the gun.
- Recommend Slack and Chrome users on gfx1151 disable GPU compositing / WebGPU until a fix lands. Both Chromium-engine clients (Chrome, and Slack via Electron) have now been observed as sole sufficient triggers on their compute queues (chrome comp_1.1.1 20:30, slack comp_1.1.0 20:34, chrome comp_1.1.1 20:36).
Evidence attached (this incident, 20:31->20:55) ----------------------------------------------- - boot-1-kernel.log full kernel journal of the crash boot (1691 lines) - boot-1-gpu-events.log amdgpu/SMU/MES/PCIe-only events filtered (236 lines) - coredump-2995-info.txt systemd-coredump info dump for vLLM SIGSEGV - boots.txt boot index showing the four boots of the day - current-state.txt kdump/sysctl/devcoredump status post-reboot - coredumpctl-list.txt coredumps in the crash window Note: no devcoredump captured this time -- /sys/class/devcoredump/ was empty when the box came back, and 5-minute auto-expiry plus hard-reset between hang and reboot lost it. No vmcore (kernel hung but never panicked, TCO didn't fire). Pstore empty. Journal is the entire kernel-side record for this incident.
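Appendix: stand-alone check of the hlt-to-SIGSEGV escalation
------------------------------------------------------------
This is my own test program, not taken from the incident. It verifies the claim in the coredump analysis above: on x86-64 Linux, a user-mode 'hlt' raises #GP and the kernel delivers it to the process as SIGSEGV, with no GPU involved at all.

```c
/*
 * Stand-alone check: a 'hlt' executed at CPL 3 raises #GP, which
 * Linux delivers as SIGSEGV. Build: gcc -o hltdemo hltdemo.c
 * x86-64 Linux only.
 */
#include <signal.h>
#include <unistd.h>

static void on_segv(int sig)
{
	/* write() is async-signal-safe; printf is not. */
	static const char msg[] = "got SIGSEGV from user-mode hlt\n";
	(void)sig;
	write(STDERR_FILENO, msg, sizeof(msg) - 1);
	_exit(0);
}

int main(void)
{
	signal(SIGSEGV, on_segv);
	__asm__ volatile("hlt");   /* privileged insn -> #GP -> SIGSEGV */
	return 1;                  /* never reached */
}
```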
Created attachment 2138436 [details] Reproduced again on 6.19.13-300.fc44.x86_64
A potential fix has been identified -- this looks like it may actually be a VAAPI issue:

https://community.frame.work/t/smu-deadlock-system-freeze-on-fedora-43/81795/27?page=2

I'm patching this into my Fedora kernel to test.
Thanks—this is very helpful for immediate stability. That said, I’d frame it as a workaround, not a fix: it avoids the bug by keeping clocks/subsystems active instead of allowing normal power-gating. We’re sidestepping the failing SMU path rather than resolving the deadlock (no proper busy handling / retry / serialization), so the root issue remains and power management is degraded. This likely needs coordination with the silicon vendor (AMD) to address at the SMU/firmware level.
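For concreteness, this is roughly what I mean by "proper busy handling" -- a stand-alone sketch with hypothetical names and stubbed internals, not a proposed patch: bound the wait for the previous response, retry the send a few times, then declare the mailbox wedged and escalate once, instead of reposting behind a stuck command every ~5 seconds forever.

```c
/*
 * Sketch only -- hypothetical names, stubbed internals, not a patch.
 */
#include <errno.h>
#include <stdio.h>

#define SMU_SEND_RETRIES 3

static int fw_response;                  /* stand-in for the response reg */

static int smu_wait_prev_response(void)  /* stub: 0 = previous cmd done */
{
	return fw_response ? 0 : -ETIMEDOUT;
}

static int smu_post(int msg, int arg)    /* stub for the actual send */
{
	printf("posting msg 0x%02x arg 0x%x\n", msg, arg);
	return 0;
}

static int smu_send_with_retry(int msg, int arg)
{
	for (int i = 0; i < SMU_SEND_RETRIES; i++)
		if (smu_wait_prev_response() == 0)
			return smu_post(msg, arg);

	/* Mailbox never drained: surface a hard error once so the caller
	 * can trigger recovery, rather than retrying indefinitely. */
	fprintf(stderr, "SMU mailbox wedged; escalating to reset\n");
	return -EIO;
}

int main(void)
{
	fw_response = 0;                 /* firmware stuck mid-command */
	return smu_send_with_retry(0x32, 0) == -EIO ? 0 : 1;
}
```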
Yes -- that's correct, a workaround.

Update: Crupi's no_vpe_idle_pg patch (6.19.13-300.vpe1.fc44, amdgpu.no_vpe_idle_pg=1) does not fully fix this bug. Reproduced 2026-04-27 22:40 EDT under pure inference load -- no VAAPI, no Electron, no concurrent GPU clients. vLLM (TheRock ROCm image, Gemma 4 31B AWQ-INT4) hit an HW Exception on a compute queue (comp_1.1.1) ~23s after model load completed.

Cascade:

HW Exception by GPU node-1 reason :GPU Hang
[drm] *ERROR* Failed to initialize parser -125
ring vpe test failed (-110)                       # post-reset, collateral
Dpm disable jpeg failed, ret = -62
Failed to power gate VCN instance 0 / 1
SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000021
Failed to retrieve enabled ppfeatures             # repeating every 5s

Notes:

- The SMU stuck-command code is 0x21 on the patched kernel (was 0x32 unpatched). Same wedge state, different message ID.
- No "Failed to power gate VPE!" in the cascade -- consistent with VPE idle-pg being gated off by the patch. The wedge was entered via the JPEG/VCN power-gate path.
- Conclusion: at least two triggers reach the same SMU mailbox wedge:
  (1) VAAPI -> VPE idle power-gate [closed by Crupi's patch]
  (2) Inference HW exception -> MES REMOVE_QUEUE fail -> MODE2 reset -> JPEG/VCN power-gate fail -> SMU wedge [still open]
- A clean shutdown attempted ~8 hours after the wedge hung; a manual power-cycle was required. Same "hard reset only" recovery as with 0x32.

Two coredumps + a devcoredump captured; available on request.
AMD has done further investigation here: https://gitlab.freedesktop.org/drm/amd/-/work_items?sort=created_date&state=opened&search=395&first_page_size=20&show=eyJpaWQiOiI1MTcxIiwiZnVsbF9wYXRoIjoiZHJtL2FtZCIsImlkIjoxNDk1ODl9