Bug 2420039
| Summary: | AMDGPU crashes with "ring gfx_0.0.0 timeout" | ||
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Marcus K <moral1tycor3> |
| Component: | kernel | Assignee: | Adam Jackson <ajax> |
| Status: | NEW --- | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 43 | CC: | acaringi, adscvr, airlied, ajanulgu, ajax, asrivats, hans, hpa, igor.raits, jexposit, jforbes, josef, j, kernel-maint, linville, luc.wastiaux, lyude, marcandre.lureau, masami256, mchehab, mpenttil, philip.wyett, postmaster, ptalbert, rstrode, steved, suraj.ghimire7, tstellar, vitezslav.zivota |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | --- | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | Type: | --- | |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Marcus K
2025-12-08 14:43:10 UTC
I think this is the same as https://bbs.archlinux.org/viewtopic.php?pid=2275770, and will be solved by version 20251125-2
You can try downgrading amd-gpu-firmware and amd-ucode-firmware and then recreate an initramfs by reinstalling kernel-core. Worked for me.

I've had similar problems since about last week, too: an amdgpu crash, followed by a gnome-shell crash. Interestingly, the crash was often triggered by VS Code, as happened to Marcus.

pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32771)
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu: Process code pid 31648 thread code:cs0 pid 31669
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu: in page starting at address 0x000000003f800000 from client 0x1b (UTCL2)
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00301430
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu: Faulty UTCL2 client ID: SQC (data) (0xa)
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu: MORE_FAULTS: 0x0
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu: WALKER_ERROR: 0x0
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu: PERMISSION_FAULTS: 0x3
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu: MAPPING_ERROR: 0x0
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu: RW: 0x0
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: Dumping IP State
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: Dumping IP State Completed
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=131625, emitted seq=131627
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: Process code pid 31648 thread code:cs0 pid 31669
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: Starting gfx_0.0.0 ring reset
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: Ring gfx_0.0.0 reset succeeded
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: [drm] device wedged, but recovered through reset

Firmware downgrade to 20251021-1 helped me.

amd-gpu-firmware.noarch      20251021-1.fc43    fedora
amd-ucode-firmware.noarch    20251021-1.fc43    fedora

(In reply to Grégoire Paris from comment #1)
> I think this is the same as
> https://bbs.archlinux.org/viewtopic.php?pid=2275770, and will be solved by
> version 20251125-2
> You can try downgrading amd-gpu-firmware and amd-ucode-firmware and then
> recreate an initramfs by reinstalling kernel-core. Worked for me.

Thank you, this seems to have worked. I downgraded both packages, pinned them to 20251021-1, and have had no crashes since.

Great! On my end I noticed that I was not using the latest Fedora, so I upgraded to Fedora 43, and so far no crashes, even with 20251125-1, so I don't understand… but maybe it will crash later.

I started getting similar GPU-related crashes after upgrading to Fedora Workstation 43. Prior to that I used Fedora Workstation 41 and 42 on that same system without ever experiencing that sort of crash.

Symptoms: while browsing YouTube or Reddit with video in Firefox, the screen will occasionally freeze, and sometimes go blank. The freeze persists for some time, then GNOME Shell crashes and I am logged out.
The system log will then contain something like:

Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: Dumping IP State
Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: Dumping IP State Completed
Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: ring sdma0 timeout, signaled seq=255087, emitted seq=255089
Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: MODE2 reset
Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume

Sometimes, the error that appears is:

Dec 10 22:20:11 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=319235, emitted seq=319236

Sometimes, it's:

Dec 09 21:21:27 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: ring sdma0 timeout, signaled seq=174916, emitted seq=174918

It looks like it can be reproduced by running the following for around a minute:

stress-ng --cpu 0 --cpu-method fft --timeout 20m

My system: Minisforum BD790ix3d (AMD Ryzen 9 7945HX3D) with AMD iGPU (Radeon 610M).

luc@linux-ws ~$ lspci -k | grep -EA3 'VGA|3D|Display'
04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raphael (rev dc)
        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Raphael
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu
luc@linux-ws ~$ glxinfo | grep -i "opengl renderer"
OpenGL renderer string: AMD Radeon 610M (radeonsi, raphael_mendocino, LLVM 21.1.5, DRM 3.64, 6.17.9-300.fc43.x86_64)
luc@linux-ws ~$ cat /etc/fedora-release
Fedora release 43 (Forty Three)

I have since upgraded to kernel 6.17.11-300, but the behavior is the same.
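For anyone comparing logs across machines, the ring timeout lines quoted in this thread can be pulled out of the kernel log with a small filter. This is only a sketch: the function name `summarize_ring_timeouts` and its output format are made up here, and the `gap` field is simply the emitted sequence number minus the signaled one, which I take to be the number of submissions still outstanding when the timeout fired.

```shell
#!/bin/sh
# Sketch: summarize amdgpu ring timeouts from kernel log text on stdin.
# The function name and the output format are illustrative choices,
# not part of any amdgpu tooling.
summarize_ring_timeouts() {
    # Keep only "ring <name> timeout" lines and extract the three fields.
    sed -nE 's/.*amdgpu: ring ([^ ]+) timeout, signaled seq=([0-9]+), emitted seq=([0-9]+).*/\1 \2 \3/p' |
    while read -r ring signaled emitted; do
        # gap = emitted - signaled (assumed to be the outstanding submissions)
        printf '%s signaled=%s emitted=%s gap=%s\n' \
            "$ring" "$signaled" "$emitted" "$((emitted - signaled))"
    done
}

# Example usage on a live system (kernel messages of the current boot):
#   journalctl -k -b --no-pager | summarize_ring_timeouts
```

Lines that are not ring timeouts (page faults, reset messages) are dropped, so the output stays short enough to paste into a comment.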
I've tried the following changes without success, based on advice from Claude and GPT-5 Pro research gathered from various threads (Arch forums, etc.):

2025-12-07: BIOS setting change: navigate to Advanced → CPU Configuration → PSS and set it to Disabled. This prevents the CPU from entering the problematic C6 sleep state.
2025-12-08: tried adding the kernel parameters amdgpu.noretry=0 amdgpu.ppfeaturemask=0xffff7fff amdgpu.sg_display=0; still got "amdgpu: ring sdma0 timeout, signaled seq=174916, emitted seq=174918"
2025-12-11: tried adding the kernel parameters amdgpu.dcdebugmask=0x10 amdgpu.gfxoff=0; still got a crash
2025-12-13: tried adding the kernel parameter amdgpu.sdma=0; still got "amdgpu 0000:04:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=220584, emitted seq=220584"
2025-12-16: tried downgrading amd-gpu-firmware and amd-ucode-firmware to 20251021-1, but was still able to reproduce the crash by running stress-ng together with a YouTube video.

luc@linux-ws ~$ sudo dnf downgrade amd-gpu-firmware-20251021-1.fc43 amd-ucode-firmware-20251021-1.fc43
Updating and loading repositories:
Repositories loaded.
Package                          Arch    Version          Repository  Size
Downgrading:
 amd-gpu-firmware                noarch  20251021-1.fc43  fedora      25.7 MiB
   replacing amd-gpu-firmware    noarch  20251125-1.fc43  <unknown>   25.7 MiB
 amd-ucode-firmware              noarch  20251021-1.fc43  fedora      419.9 KiB
   replacing amd-ucode-firmware  noarch  20251125-1.fc43  <unknown>   546.0 KiB

Though I may have misunderstood what's implied by this step: "and then recreate an initramfs" (I only ran the command line above).

I see the comment from Grégoire above: "I think this is the same as https://bbs.archlinux.org/viewtopic.php?pid=2275770, and will be solved by version 20251125-2". Does this mean I can expect a fix when https://packages.fedoraproject.org/pkgs/linux-firmware/linux-firmware/ shows "20251125-2.fc43"?
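A rough way to check whether an installed firmware package has reached the version expected to carry the fix is sketched below. It compares version strings with GNU `sort -V`, which is only an approximation of RPM's own ordering, and the helper names `version_at_least` and `check_firmware` are hypothetical. The value 20251125-2.fc43 is the version named in this thread, not a confirmed fix.

```shell
#!/bin/sh
# Sketch: compare a firmware version-release string against the version this
# thread expects to contain the fix. Helper names are illustrative.
FIXED="20251125-2.fc43"   # taken from the thread; not a confirmed fix

# Succeeds (exit 0) if version $1 sorts at or after version $2.
# Uses GNU `sort -V`, which approximates RPM version ordering.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

check_firmware() {
    if version_at_least "$1" "$FIXED"; then
        echo "at or past $FIXED"
    else
        echo "older than $FIXED"
    fi
}

# Example usage on a Fedora system:
#   check_firmware "$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' amd-gpu-firmware)"
```

Note that "older than" is the desired state while the workaround is in place: the versions reported as crashing here are 20251125-1, and 20251021-1 is the one reported as good.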
Update: after doing this:

sudo dnf downgrade amd-gpu-firmware-20251021-1.fc43 amd-ucode-firmware-20251021-1.fc43

and

sudo dracut --force

then rebooting, I ran the following stress test:

stress-ng --cpu 0 --cpu-method fft --timeout 20m

and at the same time played a high-bitrate 4K video on YouTube: https://www.youtube.com/watch?v=7PIji8OubXU

It ran for more than 5 minutes, which gives me confidence that the problem doesn't appear under that configuration. I'll update if it comes back.

luc@linux-ws ~$ rpm -q linux-firmware
linux-firmware-20251021-1.fc43.noarch
luc@linux-ws ~$ uname -a
Linux linux-ws 6.17.11-300.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Dec 8 23:20:36 UTC 2025 x86_64 GNU/Linux

> Though I may have misunderstood what's implied with this step: "and then recreate an initramfs" (I only ran the commandline above)

Well yes, that was key for me as well, although the way I did it was by reinstalling `kernel-core-blah-blah-blah`. Run `rpm -qa|grep kernel-core` to find the correct package name.

> Does this mean I can expect a fix when https://packages.fedoraproject.org/pkgs/linux-firmware/linux-firmware/ shows "20251125-2.fc43" ?

That's my expectation as well. It looks like news about this is posted here: https://bugzilla.redhat.com/show_bug.cgi?id=2420062
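The workaround reported in this thread can be collected into one script, sketched here under a few assumptions. The `dnf downgrade` and `dracut --force` commands are the ones quoted above. The `dnf versionlock add` step is my assumption about how the pinning could be done on Fedora 43 (dnf5); the thread mentions pinning but does not say which command was used. The `run` wrapper and the function name are illustrative, and by default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of the firmware downgrade workaround described in this thread.
# By default the commands are only printed; set RUN=1 to execute them
# (needs sudo on a Fedora system).
FW_VERSION="20251021-1.fc43"   # version reported as good in this thread

# Print the command, or execute it when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

downgrade_amd_firmware() {
    # Step 1: downgrade both firmware packages (command from the thread).
    run sudo dnf downgrade \
        "amd-gpu-firmware-${FW_VERSION}" \
        "amd-ucode-firmware-${FW_VERSION}"
    # Step 2: rebuild the initramfs so early boot loads the older firmware.
    run sudo dracut --force
    # Step 3 (assumption): pin the packages so updates do not undo step 1.
    run sudo dnf versionlock add amd-gpu-firmware amd-ucode-firmware
}

# Example usage:
#   downgrade_amd_firmware          # dry run, prints the three commands
#   RUN=1 downgrade_amd_firmware    # actually runs them; reboot afterwards
```

Reinstalling kernel-core, as suggested earlier in the thread, is an alternative to step 2. If the packages are pinned, the lock has to be removed again once a fixed firmware build is available.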