Bug 2457514

Summary:

With kernel 6.19.11 on a Ryzen AI Max+ 395 (Strix Halo, gfx1151) running Fedora 43, the system locks up hard within a few hours of any GPU workload.

Product:

[Fedora] Fedora

Reporter:

Adam Clater <aclater>

Component:

kernel

Assignee:

Justin M. Forbes <jforbes>

Status:

NEW ---

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

high

Priority:

unspecified

CC:

acaringi, adscvr, airlied, hans, hpa, jforbes, kernel-maint, linville, masami256, mchehab, nickolasjcarr, ptalbert, sid314, steved, suraj.ghimire7

Hardware:

x86_64

OS:

Linux

Type:

Bug

Attachments:

Description	Flags
Contains system identity, lspci for c2:00.0, installed kernels, last 6 boot list, etc.	none
kernel log extract from failed boot	none
system info and package versions	none
0427 crash details	none
0427 crash kernel log	none
I reproduced a crash, and this time I got some artifacts.	none
Reproduced again on 6.19.13-300.fc44.x86_64	none

Description Adam Clater 2026-04-10 22:14:50 UTC

Created attachment 2136649 [details]
Contains    system identity, lspci for c2:00.0, installed kernels, last 6 boot list, etc.

## Summary                                                                    
  
  After upgrading from kernel 6.19.10 to 6.19.11 on a Ryzen AI Max+ 395         
  (Strix Halo, gfx1151) running Fedora 43, the system locks up hard
  within a few hours of any GPU workload. Three consecutive lockups in          
  one day, each requiring a power-off. 6.19.10 ran the same workload            
  stable for 3+ days immediately prior.                                         
                                                                                
  The only amdgpu code change in the Fedora 6.19.11 changelog is                
  "drm/amdgpu: rework how we handle TLB fences" (Alex Deucher). All
  other amdgpu entries in the changelog are kconfig / build flag                
  adjustments. That patch is the suspected regression.                          
                                                                                
  ## Hardware                                                                   
                  
  - CPU/APU: AMD Ryzen AI Max+ 395 (Strix Halo)                                 
  - iGPU: Radeon 8060S Graphics — PCI 1002:1586 at 0000:c2:00.0
  - IP blocks (from journal): smu_v14_0_0, gfx_v11_0_0, DCN 3.5.1               
  - Memory: 128 GB unified (UMA APU)                                            
  - BIOS VRAM carveout: 64 GB                                                   
                                                                                
  ## Software                                                                   
                  
  - Fedora 43                                                                   
  - Bad kernel:  6.19.11-200.fc43.x86_64  (#1 SMP PREEMPT_DYNAMIC Thu Apr  2 16:55:52 UTC 2026)
  - Last good:   6.19.10-200.fc43.x86_64
  - Userspace workload at time of hang: llama.cpp Vulkan backend (RADV)         
    running a Qwen3 model on the iGPU. Workload is not ROCm/HIP — this          
    is pure Vulkan, and it previously stressed 6.19.10 for days without         
    incident.                                                                   
                                                                                
  ## Symptom                                                                    
                  
  First kernel error in each of the three lockup boots is identical:            
                  
      amdgpu 0000:c2:00.0: amdgpu: SMU: I'm not done with your previous command:
        SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
      amdgpu 0000:c2:00.0: amdgpu: Failed to retrieve enabled ppfeatures!       
                                                                                
  The error then repeats every ~5 seconds, cycling between msg id 0x32          
  (GetEnabledSmuFeatures) and 0x19 (AllowGfxOff), with:                         
                                                                                
      amdgpu 0000:c2:00.0: amdgpu: Failed to enable gfxoff!                     
      amdgpu 0000:c2:00.0: [drm] SMU response after wait: 0, msg id = 18        
                                                                                
  Userspace wedges shortly afterward:                                           
  - llama-vulkan.service exits with SIGABRT, rootless podman coredumps          
  - switcheroo-control-check-discrete-amdgpu times out after 2m59s              
  - systemd-sleep: "Failed to freeze unit 'user.slice': Connection timed out"   
    (this is what finally hangs the machine — suspend cannot complete           
    because the GPU-using cgroup will not freeze)                               
                                                                                
  The machine becomes unresponsive and requires a hard power-off.               
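
A minimal sketch for tallying which message ID the mailbox is stuck on across a saved journal excerpt. It assumes the register dump keeps the `SMN_C2PMSG_66:0x... SMN_C2PMSG_82:0x...` shape shown above; the function name `count_stuck_messages` is hypothetical and not part of any amdgpu tooling.

```python
import re
from collections import Counter

# Matches the register dump that follows "I'm not done with your previous command".
# \s+ tolerates the dump being wrapped onto the next line.
SMU_REGS = re.compile(
    r"SMN_C2PMSG_66:0x([0-9a-fA-F]+)\s+SMN_C2PMSG_82:0x([0-9a-fA-F]+)"
)

def count_stuck_messages(log_text):
    """Return a Counter mapping SMN_C2PMSG_66 values to occurrence counts."""
    counts = Counter()
    for match in SMU_REGS.finditer(log_text):
        counts[int(match.group(1), 16)] += 1
    return counts

# Sample copied from the excerpt above.
sample = """\
amdgpu 0000:c2:00.0: amdgpu: SMU: I'm not done with your previous command:
  SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
amdgpu 0000:c2:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
"""

for msg_id, n in count_stuck_messages(sample).items():
    print(f"msg id 0x{msg_id:02x}: {n} occurrence(s)")
```

Run against a full boot's kernel log, this would show whether the stuck value is always the same or changes between incidents.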
                                                                                
  ## Reproduction                                                               
                  
  1. Boot kernel 6.19.11-200.fc43 on Strix Halo (Ryzen AI Max+ 395).            
  2. Run any sustained Vulkan compute workload on the iGPU
     (llama.cpp -DGGML_VULKAN=ON against a multi-GB model is sufficient).       
  3. Within 1–10 hours the SMU mailbox enters the "I'm not done with            
     your previous command" state and never recovers.                           
                                                                                
  Observed in three consecutive boots on 2026-04-10:                            
                  
  | Boot start (EDT)       | First SMU error         | Hard reset      |        
  |------------------------|-------------------------|-----------------|
  | 2026-04-09 23:57:49    | 2026-04-10 09:42:05     | 09:43:24        |        
  | 2026-04-10 09:44:39    | 2026-04-10 11:46:37     | 11:56:54        |        
  | 2026-04-10 11:57:48    | 2026-04-10 14:53:08     | 14:56:49        |        
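
The uptime before the first SMU error in each boot can be derived from the table above. A small sketch using only the timestamps listed there (the helper name is hypothetical):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (boot start, first SMU error) pairs copied from the table above.
BOOTS = [
    ("2026-04-09 23:57:49", "2026-04-10 09:42:05"),
    ("2026-04-10 09:44:39", "2026-04-10 11:46:37"),
    ("2026-04-10 11:57:48", "2026-04-10 14:53:08"),
]

def uptime_before_error(boot, first_error):
    """Return the timedelta between boot start and the first SMU error."""
    return datetime.strptime(first_error, FMT) - datetime.strptime(boot, FMT)

for boot, err in BOOTS:
    print(boot, "->", uptime_before_error(boot, err))
```

This gives roughly 9h44m, 2h02m and 2h55m, consistent with the 1–10 hour window stated in step 3.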
                                                                                
  ## Hypothesis                                                                 
                                                                                
  The 6.19.11 Fedora changelog contains exactly one amdgpu code change:         
   
      drm/amdgpu: rework how we handle TLB fences (Alex Deucher)                
                  
  The symptom — SMU mailbox reporting that a prior command is still             
  in-flight when the driver posts the next one — is consistent with a
  TLB-fence ordering / completion-accounting bug on a UMA APU where the         
  IOMMU TLB is shared between CPU and iGPU. gfx1151 / Strix Halo is a           
  new IP block and a likely candidate to expose a race in the reworked          
  path that doesn't manifest on discrete cards.                                 
                                                                                
  ## Workaround                                                                 
                                                                                
  I will boot 6.19.10-200.fc43 to check whether it restores stability.
                                                                                
  ## Attachments  
                                                                                
  - journal excerpt from the three crashed boots (will attach)                  
  - `lspci -nn | grep 1002:1586`
  - `uname -a`

Comment 1 Adam Clater 2026-04-11 21:24:39 UTC

Update 2026-04-11: same hang reproduced on 6.19.10 — TLB-fences theory
  falsified

  I downgraded to 6.19.10-200.fc43 as the workaround and the system was stable
  for ~22 hours. It then hung with a byte-identical signature to the three
  6.19.11 crashes in my original report.

  Crash details

  - Kernel: 6.19.10-200.fc43.x86_64 (#1 SMP PREEMPT_DYNAMIC Wed Mar 25 16:09:19 UTC 2026)
  - Boot start: 2026-04-10 18:34:56 EDT
  - First SMU error: 2026-04-11 16:47:44 EDT (~22 hours uptime)
  - Hard reset: 2026-04-11 16:49:12 EDT
  - Workload at time of hang: llama.cpp running Qwen3-32B Q4_K_M, this time on
  the ROCm backend (quay.io/ramalama/rocm:0.18.0,
  HSA_OVERRIDE_GFX_VERSION=11.5.1) rather than the Vulkan backend used in the
  original three crashes. So the backend is not the discriminator — both
  Vulkan/RADV and ROCm/HIP workloads can wedge the SMU.

  First fault and cascade

  16:47:44 amdgpu 0000:c2:00.0: amdgpu: SMU: I'm not done with your previous command:
             SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
  16:47:44 amdgpu 0000:c2:00.0: amdgpu: Failed to power gate VPE!
  16:47:44 [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62
  16:47:49 amdgpu: Failed to retrieve enabled ppfeatures
  16:47:51 amdgpu: Dumping IP State
  16:47:58 amdgpu: Failed to power gate VCN instance 0
  16:48:03 amdgpu: Failed to power gate VCN instance 1
  16:48:07 amdgpu: Failed to disable gfxoff
  16:48:09 amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
  16:48:09 amdgpu: failed to reg_write_reg_wait

  Same SMN_C2PMSG_66:0x00000032 (GetEnabledSmuFeatures) wedge as the three
  6.19.11 boots.

  PCIe complex collapse followed the same pattern: xhci_hcd 0000:c4:00.3: HC
  died, snd_hda_intel 0000:c2:00.1: CORB reset timeout, USB peripherals
  disconnected, machine unresponsive, hard reset required.

  What this rules out

  The "drm/amdgpu: rework how we handle TLB fences" commit cannot be the sole
  root cause. 6.19.10 lacks that commit and exhibits the identical failure mode
  with the identical register state. 6.19.11 may accelerate the same underlying
  bug (1–10 hours on 6.19.11 vs ~22 hours on 6.19.10 — same order of magnitude,
  not a sharp regression), but the bug is older.

  What the trigger looks like on this run

  Unlike the 6.19.11 crashes where llama-vulkan was actively hammering the GPU
  at the time of the wedge, on 6.19.10 the first observable failure is in the
  VPE idle-work path: vpe_idle_work_handler fired,
  vpe_set_powergating_state(GATE) was sent to the SMU, and the SMU never
  acknowledged. The mailbox was already in state 0x32 when the next command
  (GetEnabledSmuFeatures) was attempted 5 seconds later.

  This points to the bug being in the smu_v14_0_0 driver/firmware interaction,  
  reachable any time the driver tries to power-gate a long-idle IP block (VPE,
  VCN, gfxoff). It's not specific to a hot compute path.                        
                                                            
  Compute and KFD rings kept working through the SMU stall — userspace          
  llama-server health checks were still returning 200 OK at 16:48:02, eighteen
  seconds after the first SMU error. Only the power-management mailbox path was 
  wedged. This is consistent with "SMU mailbox stuck mid-command, GFX/KFD rings
  still draining inflight work, but no new SMU commands can land."

  Not related                                                                   
   
  An earlier hibernation request at 15:57:25 was rejected by kernel lockdown at 
  the permission-check stage with no device callbacks invoked — checked the
  journal carefully, no pm: suspend entry or freezing user space messages       
  followed. Not a contributor.                              

  No MCE, no kernel panic, no Oops. Pure firmware/driver wedge.

Comment 2 Sid 2026-04-27 21:26:17 UTC

Created attachment 2138416 [details]
kernel log extract from failed boot

Comment 3 Sid 2026-04-27 21:27:06 UTC

Created attachment 2138417 [details]
system info and package versions

Comment 4 Sid 2026-04-27 21:27:45 UTC

I can reproduce a very similar failure on Fedora 43 on a Strix Halo system.

System:

- Fedora 43
- GNOME on Wayland
- Kernel: `6.19.13-200.fc43.x86_64`
- GPU: AMD Strix Halo / Radeon 8060S Graphics, PCI ID `1002:1586`
- Chrome: `147.0.7727.116`
- Brave Flatpak: `1.89.143`
- Mesa: `25.3.6`
- `linux-firmware`: `20260410-1.fc43`

Kernel args in use:

`amdgpu.gttsize=126976 ttm.pages_limit=32505856`
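
As a sanity check on these values: assuming `amdgpu.gttsize` is expressed in MiB and `ttm.pages_limit` counts 4 KiB pages (both are assumptions about this configuration, not stated above), the two arguments describe the same limit. A short worked calculation:

```python
GTTSIZE_MIB = 126976        # amdgpu.gttsize, assumed to be in MiB
TTM_PAGES_LIMIT = 32505856  # ttm.pages_limit, assumed to count pages
PAGE_SIZE = 4096            # assumption: 4 KiB pages on x86_64

gtt_gib = GTTSIZE_MIB / 1024
ttm_gib = TTM_PAGES_LIMIT * PAGE_SIZE / 2**30

print(f"gttsize     = {gtt_gib} GiB")
print(f"pages_limit = {ttm_gib} GiB")
```

Under those assumptions both work out to 124 GiB.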

Observed behavior:

- machine hard-freezes during Chromium-family browser video playback
- physical reset is required
- confirmed reproduced on `2026-04-27`

Important note on browser flags:

- the one hard freeze I can attribute cleanly occurred with the normal non-safe Chrome launcher
- I also have a separate Chrome `SIGTRAP` coredump from a process launched with `--use-gl=desktop`
- however, at the time there were mixed Chrome/Brave instances in different launch modes, so I do not want to overstate that as proof that `--use-gl=desktop` failed as a mitigation

Most relevant kernel log sequence from the previous boot:

```text
Apr 27 13:28:11 amdgpu: Dumping IP State
Apr 27 13:28:13 amdgpu: Failed to power gate VPE!
Apr 27 13:28:18 amdgpu: Failed to disable gfxoff!
Apr 27 13:28:23 amdgpu: Failed to disable gfxoff!
Apr 27 13:28:28 amdgpu: Failed to disable gfxoff!
Apr 27 13:28:33 amdgpu: AMDGPU device coredump file has been created
Apr 27 13:28:33 amdgpu: ring sdma0 timeout, signaled seq=6289, emitted seq=6291
Apr 27 13:28:33 amdgpu: Starting sdma0 ring reset
Apr 27 13:28:33 amdgpu: ring sdma0 test failed (-110)
Apr 27 13:28:33 amdgpu: Ring sdma0 reset failed
Apr 27 13:28:33 amdgpu: GPU reset begin!. Source: 1
```

This looks closely related to this bug because it overlaps on:

- Strix Halo / Radeon 8060S (`1002:1586`)
- `Failed to power gate VPE!`
- `Failed to disable gfxoff!`
- hard machine lockup requiring reset
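
For comparing ring timeouts across reports, a minimal sketch that extracts the ring name and the gap between emitted and signaled sequence numbers from a timeout line. It assumes the line format shown in the log above; `outstanding_fences` is a hypothetical helper name.

```python
import re

RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)

def outstanding_fences(line):
    """Return (ring, emitted - signaled) for a ring-timeout line, else None."""
    m = RING_TIMEOUT.search(line)
    if m is None:
        return None
    return m.group("ring"), int(m.group("emitted")) - int(m.group("signaled"))

# Line copied from the log excerpt above.
print(outstanding_fences(
    "Apr 27 13:28:33 amdgpu: ring sdma0 timeout, signaled seq=6289, emitted seq=6291"
))
```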

I am attaching:

- kernel log extract
- system info / package versions

Comment 5 Sid 2026-04-27 21:32:39 UTC

Sorry, I forgot to add the reproduction steps: just playing YouTube videos hard-crashes the machine within a few minutes.

Comment 6 Adam Clater 2026-04-27 22:32:52 UTC

Reproduced on 6.19.13-300.fc44 -- bug also reachable via failed
GPU-reset cleanup, not just sustained workload
================================================================

Adding another reproduction of this SMU mailbox wedge after the
Fedora 43 -> 44 upgrade. The 6.19.13-300.fc44 kernel does NOT carry
a fix for this issue, and this incident shows a path into the wedge
that doesn't require sustained GPU load -- a userspace GPU-client
crash whose kernel-side cleanup fails is enough.


System
------

  Hardware:             AMD Ryzen AI Max+ 395 (Strix Halo, gfx1151),
                        128 GB unified memory
  Kernel:               6.19.13-300.fc44.x86_64
  Distro:               Fedora 44
  linux-firmware:       20260410-1.fc44
  microcode_ctl:        2.1-74.fc44
  mesa-vulkan-drivers:  26.0.3-4.fc44
  HSA_OVERRIDE:         HSA_OVERRIDE_GFX_VERSION=11.5.1 set globally
  GPU PCI BDF:          0000:c2:00.0


Workload at trigger time
------------------------

- Primary GPU client: vLLM serving google/gemma-4-31B-it (AWQ) inside
  container docker.io/kyuz0/vllm-therock-gfx1151:stable
  (digest sha256:f89c8c689ade28877ade980ba0f29b3142af16c6ebb7f3f285311d38bc81a8a2,
  built 2026-04-22), --attention-backend ROCM_ATTN,
  gpu_memory_utilization=0.9, max_model_len=8192, ROCm THERock build
  (rocm_sdk_libraries_gfx1151, hsa-runtime64.so.1).

- Concurrent GPU client: Slack flatpak (Electron) was using compute
  queue comp_1.1.1 at the moment of failure.

- vLLM had been healthy and serving /v1/models at 17:50:18.


Failure sequence
----------------

The bug enters from a kernel-side cleanup failure after a userspace
GPU-client crash, not from sustained load:

  17:59:16  amdgpu: ring comp_1.1.1 timeout, signaled seq=4260, emitted seq=4262
            Process slack pid 7097 thread slack:cs0 pid 7255
            Starting comp_1.1.1 ring reset -> reset compute queue (1:1:1)
            Ring comp_1.1.1 reset failed
            GPU reset begin!. Source: 1
            Failed to evict queue 2
            Failed to suspend process pid 110401  (vllm worker)
            remove_all_kfd_queues_mes: Failed to remove queue 1 for dev 46231
            traps: vllm[110088] general protection fault ip:7f8ad1055735
                   sp:7f88dfffec60 error:0 in libc.so.6[1735,7f8ad1054000+16f000]
            (10x) sq_intr: error, detail 0x00000000, type 2, sh 0/1, priv 1, wave_id N
            MODE2 reset
            [drm] *ERROR* Failed to initialize parser -125!
            GPU reset(1) succeeded!
            [drm] device wedged, but recovered through reset

  17:59:20  MES failed to respond to msg=REMOVE_QUEUE
            failed to remove hardware queue from MES, doorbell=0x1002
            MES might be in unrecoverable state, issue a GPU reset
            Failed to remove queue 1
            GPU reset begin!. Source: 3
            MES failed to respond to msg=REMOVE_QUEUE   doorbell=0x1000
            MODE2 reset
            GPU reset succeeded, trying to resume
            SMU is resuming...
            SMU is resumed successfully!

  17:59:21  [drm:amdgpu_ring_test_helper] *ERROR* ring vpe test failed (-110)
            resume of IP block <vpe_v6_1> failed -110
            GPU reset end with ret = -110

  17:59:26  amdgpu: SMU: I'm not done with your previous command:
                    SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
            Failed to power gate VPE!
            [drm:vpe_set_powergating_state] *ERROR* Dpm disable vpe failed, ret = -62
            (then JPEG, VCN0, VCN1 same)

From 17:59:26 the SMU mailbox is wedged in the familiar 0x32 state
and remains stuck until hard reset; "Failed to retrieve enabled
ppfeatures!" repeats every ~5s for the remainder of uptime.


Cascade
-------

Same PCIe-complex collapse pattern as the prior incidents:

- 17:59:33  xhci_hcd 0000:c4:00.3: Refused to change power state
            from D3hot to D0 -> Controller not ready at resume -19
            -> HC died; cleaning up -> all USB downstream of c4:00.3
            disconnected (5-1, 5-1.4, 6-1, 6-1.1)

- 17:59:34  snd_hda_intel 0000:c2:00.1: CORB reset timeout#2,
            CORBRP = 65535 (HDMI audio function on the GPU)

- 18:00:03  Second USB hub disconnect (1-1.2, 2-1.2)

- Every udev re-probe of /dev/dri/card1 then times out at exactly
  2m59s: "switcheroo-control-check-discrete-amdgpu /dev/dri/card1
  timed out after 2min 59s, killing" (4 occurrences between 18:05
  and 18:14).
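
The negative return codes in the logs above (-62, -110, -125, -19) are negated Linux errno values. A small lookup sketch for decoding them; the table is hardcoded with the Linux values because Python's errno module reflects whatever OS the script runs on, and `decode_ret` is a hypothetical helper name.

```python
# Linux errno values for the return codes seen in this bug's logs.
LINUX_ERRNO = {
    19: ("ENODEV", "No such device"),
    62: ("ETIME", "Timer expired"),
    110: ("ETIMEDOUT", "Connection timed out"),
    125: ("ECANCELED", "Operation canceled"),
}

def decode_ret(ret):
    """Render a kernel return code such as -110 with its errno name."""
    name, text = LINUX_ERRNO.get(abs(ret), ("UNKNOWN", "not in table"))
    return f"{ret} = -{name} ({text})"

for ret in (-62, -110, -125, -19):
    print(decode_ret(ret))
```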


Recovery
--------

Manual "sudo reboot now" issued at 18:14:43. Graceful shutdown could
NOT complete -- the chat-ui podman container kept logging through
18:17:42 (3 minutes after reboot now); next boot did not begin until
18:19:33, indicating a hard reset was required. This matches the
"PCIe-complex-down" state preventing systemd shutdown from finishing.


What's new vs. the 2026-04-10/11 incidents
------------------------------------------

The 2026-04-10/11 reports framed the trigger as "sustained GPU
workload" (Vulkan llama.cpp, ROCm llama.cpp). This incident shows a
THIRD path in:

  1. Userspace GPU client (vLLM) takes a SIGSEGV in libc.so.6.
  2. amdgpu attempts queue cleanup -> Ring comp_1.1.1 reset failed
     -> Failed to evict queue 2 -> remove_all_kfd_queues_mes:
     Failed to remove queue 1.
  3. Driver falls back to MODE2 reset; FIRST reset reports success
     ("device wedged, but recovered through reset").
  4. ~4s later MES still cannot respond to REMOVE_QUEUE for two
     doorbells (0x1000, 0x1002) -> second MODE2 reset.
  5. On the second reset, vpe_v6_1 IP-block resume returns -110
     (-ETIMEDOUT).
  6. From that moment SMU mailbox is wedged in 0x32 -- same end
     state as the previous incidents.

So the failure is reachable not just via sustained workload but via
ANY GPU-reset cleanup that fails to resume the VPE IP block. The
relevant failure point is the vpe_v6_1 resume after MODE2,
immediately preceding the SMU 0x32 wedge.
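
To check other journals for the same chain, the steps above can be sketched as an ordered substring match. This is only an illustration: the marker strings are copied from the log lines in this comment, the stage names and `match_cascade` are hypothetical, and other kernels may word these messages differently.

```python
# Ordered stages of the cascade, each keyed by a substring from the logs above.
STAGES = [
    ("ring timeout", "timeout, signaled seq="),
    ("ring reset failed", "reset failed"),
    ("MES unresponsive", "mes failed to respond"),
    ("VPE resume failed", "resume of ip block <vpe_v6_1> failed"),
    ("SMU mailbox wedged", "i'm not done with your previous command"),
]

def match_cascade(lines):
    """Return the stage names seen in order; later stages need earlier ones first."""
    reached = []
    stage = 0
    for line in lines:
        if stage < len(STAGES) and STAGES[stage][1] in line.lower():
            reached.append(STAGES[stage][0])
            stage += 1
    return reached

# Lines copied from the failure sequence above.
sample = [
    "amdgpu: ring comp_1.1.1 timeout, signaled seq=4260, emitted seq=4262",
    "Ring comp_1.1.1 reset failed",
    "MES failed to respond to msg=REMOVE_QUEUE",
    "resume of IP block <vpe_v6_1> failed -110",
    "amdgpu: SMU: I'm not done with your previous command:",
]
print(match_cascade(sample))
```

A journal that reaches only the first few stages would suggest the reset recovered; one that reaches all five matches this incident.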

It is also worth noting that a second GPU client (Slack Electron)
was issuing compute work on comp_1.1.1 at the same moment vLLM
crashed -- the dangling queue that the kernel tried (and failed) to
reset is Slack's, not vLLM's. The hang chain begins with a non-LLM
client's queue reset failure cascading into vLLM's KFD process
suspension also failing.


Evidence
--------

- Kernel-only excerpt of the 17:59:14-18:00:30 crash window:
  ~120 lines (can attach)
- Full journal 17:55-18:18 (kernel + user + container logs):
  ~2929 lines
- Coredump:
  /var/spool/abrt/ccpp-2026-04-27-17:59:20.849710-109810
  (vLLM python3.12, 4.2 GB peak) -- abrt could not finalize
  ("Error: No segments found in coredump")
- amdgpu device coredump:
  /sys/class/drm/card1/device/devcoredump/data was created twice
  (17:59:16 and 17:59:20); not preserved across reboot


Asks
----

1. Is vpe_v6_1 IP-block resume after MODE2 a known-fragile path on
   Strix Halo? The -110 (-ETIMEDOUT) on "ring vpe test" is the
   immediate precursor to the SMU 0x32 wedge in this incident.

2. Any update on which SMU message ID is stuck at
   SMN_C2PMSG_66:0x00000032 in the current Strix Halo SMU firmware?
   Knowing whether this maps to a power-gate or DPM command would
   localize the firmware-side state machine bug.

3. Can the driver detect "VPE resume failed after MODE2" and avoid
   issuing further SMU mailbox commands (which only stack up behind
   the wedged 0x32) so that a third-stage recovery (BACO / cold
   link reset) is at least attempted before declaring the device
   dead?

Comment 7 Adam Clater 2026-04-27 22:35:26 UTC

Created attachment 2138420 [details]
0427 crash details

Comment 8 Adam Clater 2026-04-27 22:36:43 UTC

Created attachment 2138421 [details]
0427 crash kernel log

Comment 9 Adam Clater 2026-04-27 23:50:51 UTC

Created attachment 2138434 [details]
I reproduced a crash, and this time I got some artifacts.

Comment 10 Adam Clater 2026-04-28 01:15:22 UTC

Fourth incident on 2026-04-27 -- slack-triggered, two-stage cascade, full coredump analysis
============================================================================================

Reproduced again on 6.19.13-300.fc44.x86_64 (boot from 20:31:35 EDT, hung at
~20:55, hard-reset at 20:56:45). This is the fourth incident on this hardware
in one day. Two new findings worth recording:

1. Slack (not just chrome) is sufficient as a sole trigger. First offender
   this incident was slack PID 8955 on comp_1.1.0, three minutes into a
   fresh boot.

2. The userspace SIGSEGV is collateral, not a vLLM bug. Full coredump
   analysis below pinpoints the faulting instruction as 'hlt' inside glibc
   abort(), called by ROCm's rocr::core::Runtime::HwExceptionHandler. The
   HSA runtime is reacting cleanly to the GPU disappearing; glibc's
   last-resort suicide path gets escalated to SIGSEGV by the kernel.


Kernel timeline (boot -1: 20:31:35 -> 20:55)
--------------------------------------------

20:31:35  boot
20:34:22  amdgpu: ring comp_1.1.0 timeout, signaled seq=219, emitted seq=221
          Process slack pid 8955 thread slack:cs0 pid 9019
          Starting comp_1.1.0 ring reset -> reset compute queue (1:1:0) -> SUCCEEDED
          [drm] device wedged, but recovered through reset       <-- false-recovery #1

20:36:12  amdgpu: ring comp_1.1.1 timeout, signaled seq=41, emitted seq=43
          Process chrome pid 8504 thread chrome:cs0 pid 8558
          Starting comp_1.1.1 ring reset -> Ring comp_1.1.1 reset FAILED
20:36:13  GPU reset begin!. Source:  1
          Failed to evict queue 2
          Failed to suspend process pid 10570
          remove_all_kfd_queues_mes: Failed to remove queue 1 for dev 46231
          traps: vllm[5683] general protection fault ip:7f49a2242735
                 sp:7f47b53fdc60 error:0 in libc.so.6[1735,7f49a2241000+16f000]
          MODE2 reset -> GPU reset succeeded, trying to resume
          SMU is resumed successfully!
          [drm] device wedged, but recovered through reset       <-- false-recovery #2

20:36:17  MES failed to respond to msg=REMOVE_QUEUE
          failed to remove hardware queue from MES, doorbell=0x1002
          MES might be in unrecoverable state, issue a GPU reset
          MES failed to respond to msg=REMOVE_QUEUE
          failed to remove hardware queue from MES, doorbell=0x1000
          GPU reset begin!. Source:  3
          MODE2 reset -> GPU reset succeeded, trying to resume
          SMU is resumed successfully!
          [drm:amdgpu_ring_test_helper] *ERROR* ring vpe test failed (-110)
          resume of IP block <vpe_v6_1> failed -110
          GPU reset end with ret = -110

20:36:23+ amdgpu: SMU: I'm not done with your previous command:
                  SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
          Failed to power gate VPE / JPEG / VCN inst 0 / VCN inst 1
          Failed to retrieve enabled ppfeatures   (repeats every 4-5s)

20:37:32  usb 1-1.2: USB disconnect (xhci PCIe partial collapse)
20:42:13  USB partially recovered
20:42:20  USB disconnected again
20:53:18-55  more SMU 0x32 / Failed to retrieve enabled ppfeatures
~20:55    user hard-reset (TCO never fired this time; system hung but didn't panic)
20:56:45  boot 0

This is signature-identical to the 2026-04-27 19:41 incident: comp_1.1.x
ring timeout -> reset failure -> MES REMOVE_QUEUE failure -> MODE2 reset ->
vpe_v6_1 -110 -> SMU 0x32. The MODE2 reset reports success ("device wedged,
but recovered through reset") but on this kernel, that line is not the
all-clear -- a delayed wedge can fire seconds (this incident) or tens of
minutes (19:41 incident) later via the next power-gate transition.


Coredump analysis (vLLM PID 2995, SIGSEGV @ 20:36:13)
-----------------------------------------------------

The userspace SIGSEGV traceable from the kernel "traps:" line was analyzed
against the container's libraries:

  Faulting thread (LWP 68 = TID 5683):
    RIP = 0x00007f49a2242735      libc.so.6 + 0x1735
    RSP = 0x00007f47b53fdc60      (matches kernel sp:)
    Caller: 0x00007f4992c1ca28    libhsa-runtime64.so.1 + 0x106a6c

Container glibc + 0x1735 disassembles as the 'hlt' instruction inside abort():

  abort() in container glibc (Fedora 43 base):
    170a:  call __pthread_raise_internal   ; raise(SIGABRT) -- did not terminate
    ...
    172e:  mov  $0xe, %eax                 ; rt_sigprocmask
    1733:  syscall
    1735:  hlt                             ; <-- faulted (privileged insn in CPL=3)
    1736:  mov  $0x7f, %edi
    173b:  call _exit

'hlt' is glibc's last-resort suicide path when normal SIGABRT delivery does
not kill the process. Userspace 'hlt' triggers a CPU #GP with error:0 --
exactly what the kernel logged. The kernel translates the #GP into SIGSEGV.
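
The library offset can be cross-checked directly from the kernel "traps:" line. A minimal sketch, assuming the bracketed suffix has the form [offset,base+size] as it appears in this bug's logs; `trap_offset` is a hypothetical helper name.

```python
import re

TRAP = re.compile(
    r"ip:(?P<ip>[0-9a-f]+)\s+sp:(?P<sp>[0-9a-f]+)\s+error:(?P<err>[0-9a-f]+)"
    r"\s+in\s+(?P<obj>[^\s\[]+)"
    r"\[(?P<off>[0-9a-f]+),(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def trap_offset(line):
    """Return (object, ip - base, reported offset) for a traps: line, else None."""
    m = TRAP.search(line)
    if m is None:
        return None
    ip, base = int(m["ip"], 16), int(m["base"], 16)
    return m["obj"], ip - base, int(m["off"], 16)

# Line copied from the kernel timeline above (unwrapped).
line = ("traps: vllm[5683] general protection fault ip:7f49a2242735 "
        "sp:7f47b53fdc60 error:0 in libc.so.6[1735,7f49a2241000+16f000]")
obj, computed, reported = trap_offset(line)
print(obj, hex(computed), hex(reported))
```

Both values come out as 0x1735, the offset of the 'hlt' instruction in the disassembly above.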

Caller in libhsa-runtime64.so.1 + 0x106a6c is in:

  rocr::core::Runtime::HwExceptionHandler(long, void*)
    106a62:  call fprintf@plt    ; logs "HW Exception by GPU node-1 reason :GPU Hang"
    106a67:  call abort@plt      ; deliberate abort
    106a6c:  mov %rax, %r14      ; <-- return address (never reached)

This is the same handler that emitted "HW Exception by GPU node-1
reason :GPU Hang" in the 2026-04-27 19:41 vllm.log (attached to the previous
capture in this bug).

The other 14 vLLM threads were all idle in __syscall_cancel_arch_end
returning from syscall 0xca (futex wait) -- a sleeping worker pool. Only
the HSA exception-handler thread was running.


Causal chain
------------

amdgpu kernel: chrome comp_1.1.1 ring timeout
  -> comp_1.1.1 ring reset FAILED
  -> GPU Reset (Source 1) -> MES REMOVE_QUEUE failure
  -> kfd queue removal failed for vLLM
HSA runtime userspace: KFD signals HW exception
  -> rocr::core::Runtime::HwExceptionHandler
  -> fprintf("HW Exception ... GPU Hang")
  -> abort()
glibc abort(): raise(SIGABRT) failed to terminate (likely SIGABRT masked
                                                   on HSA thread)
  -> fall-through to rt_sigprocmask + hlt
  -> hlt #GP -> kernel delivers SIGSEGV -> core dumped


Significance
------------

The vLLM PID 2995 SIGSEGV is a downstream symptom, not the root cause.
ROCm's Runtime::HwExceptionHandler is the userspace choke-point in this bug
class -- both the 19:41 and 20:36 incidents went through it. Future
captures with "libc.so.6 + 0x1735" on the stack of a process using
/dev/kfd should be read as "GPU hung; this is HSA reacting" rather than
as a separate userspace bug.

The kernel-side wedge cause is unchanged from prior comments: vpe_v6_1
IP-block resume fails -110 after the second MODE2 reset, leaving the SMU
mailbox stuck in SMN_C2PMSG_66:0x00000032.


Mitigation taken on the affected machine
----------------------------------------

- vllm-gemma4-31b-awq.service (user quadlet) has been disabled at
  auto-start while this bug is open -- the 2026-04-27 19:23 incident
  showed vLLM model-load profiling alone is enough to trigger the wedge,
  and auto-restart on every boot was reloading the gun.

- Recommend that Slack and Chrome users on gfx1151 disable GPU compositing /
  WebGPU until a fix lands. Compute queues from both Chromium-based clients
  have now been observed as sole sufficient triggers (chrome comp_1.1.1
  20:30, slack comp_1.1.0 20:34, chrome comp_1.1.1 20:36).


Evidence attached (this incident, 20:31->20:55)
-----------------------------------------------

- boot-1-kernel.log       full kernel journal of the crash boot (1691 lines)
- boot-1-gpu-events.log   amdgpu/SMU/MES/PCIe-only events filtered (236 lines)
- coredump-2995-info.txt  systemd-coredump info dump for vLLM SIGSEGV
- boots.txt               boot index showing the four boots of the day
- current-state.txt       kdump/sysctl/devcoredump status post-reboot
- coredumpctl-list.txt    coredumps in the crash window

Note: no devcoredump captured this time -- /sys/class/devcoredump/ was
empty when the box came back, and 5-minute auto-expiry plus hard-reset
between hang and reboot lost it. No vmcore (kernel hung but never
panicked, TCO didn't fire). Pstore empty. Journal is the entire
kernel-side record for this incident.
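
Since the devcoredump expires on its own, one option is to copy it to persistent storage as soon as it appears. A minimal sketch, assuming the sysfs path reported earlier in this bug (the card index may differ on other systems); the function name, destination directory and file naming are all hypothetical.

```python
import shutil
import time
from pathlib import Path

# Path reported earlier in this bug; the card index may differ elsewhere.
DEVCOREDUMP = Path("/sys/class/drm/card1/device/devcoredump/data")

def preserve_devcoredump(src=DEVCOREDUMP, dest_dir=Path("/var/tmp")):
    """Copy the devcoredump to dest_dir if it exists; return the new path or None."""
    src = Path(src)
    if not src.exists():
        return None
    stamp = time.strftime("%Y%m%dT%H%M%S")
    dest = Path(dest_dir) / f"amdgpu-devcoredump-{stamp}.bin"
    with src.open("rb") as fin, dest.open("wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

Calling `preserve_devcoredump()` periodically as root (for example from a timer) would at least give the dump a chance to survive a later hard reset, provided the write reaches disk before the machine hangs.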

Comment 11 Adam Clater 2026-04-28 01:16:51 UTC

Created attachment 2138436 [details]
Reproduced again on 6.19.13-300.fc44.x86_64

Comment 12 Adam Clater 2026-04-28 01:41:09 UTC

A potential fix has been identified. This looks like it may actually be a VAAPI issue.

https://community.frame.work/t/smu-deadlock-system-freeze-on-fedora-43/81795/27?page=2

I'm patching this into my fedora kernel to test.

Comment 13 Sid 2026-04-28 03:29:20 UTC

Thanks—this is very helpful for immediate stability. That said, I’d frame it as a workaround, not a fix: it avoids the bug by keeping clocks/subsystems active instead of allowing normal power-gating. We’re sidestepping the failing SMU path rather than resolving the deadlock (no proper busy handling / retry / serialization), so the root issue remains and power management is degraded. This likely needs coordination with the silicon vendor (AMD) to address at the SMU/firmware level.

Comment 14 Adam Clater 2026-04-28 10:51:21 UTC

Yes - that's correct, a workaround.

Update: Crupi's no_vpe_idle_pg patch (6.19.13-300.vpe1.fc44,
  amdgpu.no_vpe_idle_pg=1) does not fully fix this bug.

  Reproduced 2026-04-27 22:40 EDT under pure inference load: no VAAPI, no
  Electron, no concurrent GPU clients. vLLM (TheRock ROCm image, Gemma 4 31B
  AWQ-INT4) hit a HW Exception on a compute queue (comp_1.1.1) ~23s after
  model load completed. Cascade:
                                                                                
    HW Exception by GPU node-1 reason :GPU Hang                                 
    [drm] *ERROR* Failed to initialize parser -125                
    ring vpe test failed (-110)             # post-reset, collateral            
    Dpm disable jpeg failed, ret = -62                                          
    Failed to power gate VCN instance 0 / 1                                     
    SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000021      
    Failed to retrieve enabled ppfeatures   # repeating every 5s                
                                                                                
  Notes:                                                                        
  - SMU stuck-command code is 0x21 on the patched kernel (was 0x32 unpatched).  
    Same wedge state, different message ID.                                     
  - No "Failed to power gate VPE!" in the cascade — consistent with VPE idle-pg 
    being gated off by the patch. Wedge entered via JPEG/VCN power-gate path.   
  - Conclusion: at least two triggers reach the same SMU mailbox wedge:         
      (1) VAAPI -> VPE idle power-gate         [closed by Crupi patch]          
      (2) Inference HW exception -> MES REMOVE_QUEUE fail -> MODE2 reset        
          -> JPEG/VCN power-gate fail -> SMU wedge   [still open]               
  - Clean shutdown hung ~8 hours after wedge; manual power-cycle required.      
    Same "hard reset only" recovery as 0x32.                                    
                                                                                
  Two coredumps + devcoredump captured; available on request.

Comment 15 Adam Clater 2026-04-28 11:47:49 UTC

AMD has done further investigation here https://gitlab.freedesktop.org/drm/amd/-/work_items?sort=created_date&state=opened&search=395&first_page_size=20&show=eyJpaWQiOiI1MTcxIiwiZnVsbF9wYXRoIjoiZHJtL2FtZCIsImlkIjoxNDk1ODl9