Since about last week, I've been experiencing GPU driver crashes with AMDGPU. After having the system running for a while, the display driver will crash (i.e. freeze and shortly after reset with a black screen, then show me the Desktop again) with the following message in the kernel logs:

amdgpu 0000:06:00.0: amdgpu: Dumping IP State
amdgpu 0000:06:00.0: amdgpu: Dumping IP State Completed
amdgpu 0000:06:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
amdgpu 0000:06:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
amdgpu 0000:06:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=85716, emitted seq=85718
amdgpu 0000:06:00.0: amdgpu: Process code pid 7967 thread code:cs0 pid 7989
amdgpu 0000:06:00.0: amdgpu: Starting gfx_0.0.0 ring reset
amdgpu 0000:06:00.0: amdgpu: Ring gfx_0.0.0 reset failed
amdgpu 0000:06:00.0: amdgpu: GPU reset begin!
amdgpu 0000:06:00.0: amdgpu: MODE2 reset
amdgpu 0000:06:00.0: amdgpu: GPU reset succeeded, trying to resume
[drm] PCIE GART of 1024M enabled (table at 0x000000F47FC00000).
amdgpu 0000:06:00.0: amdgpu: PSP is resuming...
amdgpu 0000:06:00.0: amdgpu: reserve 0xa00000 from 0xf47e000000 for PSP TMR
amdgpu 0000:06:00.0: amdgpu: RAS: optional ras ta ucode is not available
amdgpu 0000:06:00.0: amdgpu: RAP: optional rap ta ucode is not available
amdgpu 0000:06:00.0: amdgpu: SECUREDISPLAY: optional securedisplay ta ucode is not available
amdgpu 0000:06:00.0: amdgpu: SMU is resuming...
amdgpu 0000:06:00.0: amdgpu: SMU is resumed successfully!
amdgpu 0000:06:00.0: amdgpu: kiq ring mec 2 pipe 1 q 0
amdgpu 0000:06:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x05002C00
amdgpu 0000:06:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring gfx_0.1.0 uses VM inv eng 1 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 4 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 5 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 12 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring sdma0 uses VM inv eng 13 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
amdgpu 0000:06:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
amdgpu 0000:06:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
amdgpu 0000:06:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
amdgpu 0000:06:00.0: amdgpu: GPU reset(1) succeeded!
amdgpu 0000:06:00.0: [drm] device wedged, but recovered through reset

The process triggering the crash varies. Sometimes it's VSCode (like the above example), but I've seen "XWayland" and "sway" as well.

After the recovery, I can usually use the system for another while (about 30 min or so) before the display driver crashes again to a black screen, this time permanently, with the following message:

amdgpu 0000:06:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000017 SMN_C2PMSG_82:0x00000000
amdgpu 0000:06:00.0: amdgpu: Failed to disable gfxoff!

These messages repeat until I eventually force-reboot the system (via SysRq).
I can't even switch to a TTY to try to recover the system; my monitors just go into standby mode because they no longer get a video signal. This sequence has happened three times today (so far).

Reproducible: Always

Additional Information:
Kernel version: 6.17.9-300.fc43.x86_64
CPU: AMD Ryzen 9 7945HX with Radeon Graphics
iGPU: VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Raphael [1002:164e] (rev d8) (prog-if 00 [VGA controller])
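Since the triggering process varies between crashes, it can help to tally which ring timed out and which process the driver blamed. The following is only a rough sketch: the function name summarize_amdgpu_timeouts is made up for illustration, and it assumes the kernel messages look like the ones pasted above ("ring <name> timeout" and "Process <name> pid <n>").

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any tool): read kernel log text on stdin
# and print one "ring=<name>" line per amdgpu ring timeout and one
# "process=<name>" line per process the driver reports alongside it.
summarize_amdgpu_timeouts() {
    awk '
        # Matches e.g. "amdgpu: ring gfx_0.0.0 timeout, signaled seq=..."
        # but not "amdgpu: ring gfx_0.0.0 uses VM inv eng ...".
        /amdgpu: ring [^ ]+ timeout/ {
            for (i = 1; i < NF; i++)
                if ($i == "ring") { print "ring=" $(i + 1); break }
        }
        # Matches e.g. "amdgpu: Process code pid 7967 thread code:cs0 pid 7989".
        /amdgpu: Process [^ ]+ pid/ {
            for (i = 1; i < NF; i++)
                if ($i == "Process") { print "process=" $(i + 1); break }
        }
    '
}

# Possible use, assuming a persistent journal so the previous boot is kept:
#   journalctl -k -b -1 --no-pager | summarize_amdgpu_timeouts | sort | uniq -c
```

The function only parses text, so it also works on a saved dmesg dump piped into it.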
I think this is the same as https://bbs.archlinux.org/viewtopic.php?pid=2275770, and will be solved by version 20251125-2.
You can try downgrading amd-gpu-firmware and amd-ucode-firmware and then recreating the initramfs by reinstalling kernel-core. That worked for me.
I've had similar problems since about last week, too: an amdgpu crash, followed by a gnome-shell crash. Interestingly, the crash was often triggered by VS Code, as happened to Marcus.

pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32771)
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu: Process code pid 31648 thread code:cs0 pid 31669
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu: in page starting at address 0x000000003f800000 from client 0x1b (UTCL2)
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00301430
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu: Faulty UTCL2 client ID: SQC (data) (0xa)
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu: MORE_FAULTS: 0x0
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu: WALKER_ERROR: 0x0
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu: PERMISSION_FAULTS: 0x3
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu: MAPPING_ERROR: 0x0
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu: RW: 0x0
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: Dumping IP State
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: Dumping IP State Completed
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=131625, emitted seq=131627
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: Process code pid 31648 thread code:cs0 pid 31669
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: Starting gfx_0.0.0 ring reset
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: Ring gfx_0.0.0 reset succeeded
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: [drm] device wedged, but recovered through reset

Downgrading the firmware to 20251021-1 helped me.

amd-gpu-firmware.noarch    20251021-1.fc43    fedora
amd-ucode-firmware.noarch  20251021-1.fc43    fedora
(In reply to Grégoire Paris from comment #1)
> I think this is the same as
> https://bbs.archlinux.org/viewtopic.php?pid=2275770, and will be solved by
> version 20251125-2
> You can try downgrading amd-gpu-firmware and amd-ucode-firmware and then
> recreate an initramfs by reinstalling kernel-core. Worked for me.

Thank you, this seems to have worked. I downgraded both packages, pinned them to version 20251021-1, and have had no crashes since.
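For anyone who wants the workaround in one place, here is a hedged sketch of the steps discussed in this thread: downgrade both firmware packages, pin them, and regenerate the initramfs. Assumptions: the helper names run_step and rollback_firmware are invented for illustration, pinning is done with dnf's versionlock command (the thread does not say which method was used), and the initramfs is rebuilt with dracut --force rather than by reinstalling kernel-core. By default the script only prints what it would do; set RUN=1 to execute the commands through sudo.

```shell
#!/usr/bin/env bash
# Known-good firmware version reported in this thread; override if needed.
FW_VERSION="${FW_VERSION:-20251021-1.fc43}"

# Hypothetical helper: print the command, or run it via sudo when RUN=1.
run_step() {
    if [ "${RUN:-0}" = "1" ]; then
        sudo "$@"
    else
        echo "would run: $*"
    fi
}

# Hypothetical helper chaining the three steps; stops at the first failure.
rollback_firmware() {
    # 1. Downgrade both firmware packages to the known-good version.
    run_step dnf downgrade "amd-gpu-firmware-${FW_VERSION}" "amd-ucode-firmware-${FW_VERSION}" &&
    # 2. Assumption: pin the now-installed versions with dnf versionlock so a
    #    later upgrade does not pull the newer firmware back in.
    run_step dnf versionlock add amd-gpu-firmware amd-ucode-firmware &&
    # 3. Rebuild the initramfs so early boot loads the downgraded firmware.
    run_step dracut --force
}

# Dry run:  rollback_firmware
# For real: RUN=1 rollback_firmware   (then reboot)
```

Remember to remove the version lock once a fixed firmware package is available, otherwise the pinned packages will stay on the old version.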
Great! On my end I noticed that I was not using the latest Fedora, so I upgraded to Fedora 43, and so far no crashes, even with 20251125-1, so I don't understand… but maybe it will crash later.
I started getting similar GPU-related crashes after upgrading to Fedora Workstation 43. Prior to that I used Fedora Workstation 41 and 42 on the same system without ever experiencing this sort of crash.

Symptoms: while browsing YouTube or Reddit with video in Firefox, the screen occasionally freezes and sometimes goes blank. The freeze persists for some time, then GNOME Shell crashes and I am logged out. The system log then contains something like:

Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: Dumping IP State
Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: Dumping IP State Completed
Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: ring sdma0 timeout, signaled seq=255087, emitted seq=255089
Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: MODE2 reset
Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume

Sometimes, the error that appears is:
Dec 10 22:20:11 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=319235, emitted seq=319236

Sometimes, it's:
Dec 09 21:21:27 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: ring sdma0 timeout, signaled seq=174916, emitted seq=174918

It looks like it can be reproduced by running:
stress-ng --cpu 0 --cpu-method fft --timeout 20m
for around a minute.

My system: Minisforum BD790ix3d system (AMD Ryzen 9 7945HX3D) with AMD iGPU (610M).

luc@linux-ws ~$ lspci -k | grep -EA3 'VGA|3D|Display'
04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raphael (rev dc)
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Raphael
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu
luc@linux-ws ~$ glxinfo | grep -i "opengl renderer"
OpenGL renderer string: AMD Radeon 610M (radeonsi, raphael_mendocino, LLVM 21.1.5, DRM 3.64, 6.17.9-300.fc43.x86_64)
luc@linux-ws ~$ cat /etc/fedora-release
Fedora release 43 (Forty Three)

I've since upgraded to kernel 6.17.11-300, but the behavior is the same.

I've tried the following changes without success, based on advice from Claude and GPT-5 Pro research gathered from various threads (Arch forums, etc.):

2025-12-07: BIOS setting change: navigate to Advanced → CPU Configuration → PSS and set it to Disabled. This is meant to prevent the CPU from entering the problematic C6 sleep state.
2025-12-08: added kernel parameters: amdgpu.noretry=0 amdgpu.ppfeaturemask=0xffff7fff amdgpu.sg_display=0
  still got "amdgpu: ring sdma0 timeout, signaled seq=174916, emitted seq=174918"
2025-12-11: added kernel parameters: amdgpu.dcdebugmask=0x10 amdgpu.gfxoff=0
  still got a crash
2025-12-13: added kernel parameter: amdgpu.sdma=0
  still got "amdgpu 0000:04:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=220584, emitted seq=220584"
2025-12-16: tried downgrading amd-gpu-firmware and amd-ucode-firmware to 20251021-1, but was still able to reproduce the crash by running stress-ng while playing a YouTube video.

luc@linux-ws ~$ sudo dnf downgrade amd-gpu-firmware-20251021-1.fc43 amd-ucode-firmware-20251021-1.fc43
Updating and loading repositories:
Repositories loaded.
Package                          Arch    Version          Repository  Size
Downgrading:
 amd-gpu-firmware                noarch  20251021-1.fc43  fedora       25.7 MiB
   replacing amd-gpu-firmware    noarch  20251125-1.fc43  <unknown>    25.7 MiB
 amd-ucode-firmware              noarch  20251021-1.fc43  fedora      419.9 KiB
   replacing amd-ucode-firmware  noarch  20251125-1.fc43  <unknown>   546.0 KiB

Though I may have misunderstood what's implied with this step: "and then recreate an initramfs" (I only ran the commandline above)

I see the comment from Grégoire above: "I think this is the same as https://bbs.archlinux.org/viewtopic.php?pid=2275770, and will be solved by version 20251125-2".
Does this mean I can expect a fix when https://packages.fedoraproject.org/pkgs/linux-firmware/linux-firmware/ shows "20251125-2.fc43" ?
Update: after doing this:
sudo dnf downgrade amd-gpu-firmware-20251021-1.fc43 amd-ucode-firmware-20251021-1.fc43
and
sudo dracut --force
and then rebooting, I ran the following stress test:
stress-ng --cpu 0 --cpu-method fft --timeout 20m
while playing a high-bitrate 4K video on YouTube: https://www.youtube.com/watch?v=7PIji8OubXU

This seems to have run for more than 5 minutes, which gives me confidence that the problem doesn't appear under this configuration. I'll update if it comes back.

luc@linux-ws ~$ rpm -q linux-firmware
linux-firmware-20251021-1.fc43.noarch
luc@linux-ws ~$ uname -a
Linux linux-ws 6.17.11-300.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Dec 8 23:20:36 UTC 2025 x86_64 GNU/Linux
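Before running the stress test it may be worth confirming that every firmware package really is at the downgraded version. A small sketch, assuming the default rpm -q output format (name-version-release.arch); the function name check_firmware_versions is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "rpm -q" style lines on stdin and compare each
# against the expected version-release given as $1.
# Returns 0 if every line matches, 1 if any package is at another version.
check_firmware_versions() {
    local expected="$1" status=0 nvra
    while IFS= read -r nvra; do
        case "$nvra" in
            # e.g. amd-gpu-firmware-20251021-1.fc43.noarch
            *-"$expected".*) echo "ok: $nvra" ;;
            *) echo "MISMATCH: $nvra (expected $expected)"; status=1 ;;
        esac
    done
    return "$status"
}

# Possible use:
#   rpm -q amd-gpu-firmware amd-ucode-firmware linux-firmware \
#       | check_firmware_versions 20251021-1.fc43
```

This only checks the installed packages; it does not prove the initramfs was rebuilt afterwards, so the dracut step (or a kernel-core reinstall) is still needed.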
> Though I may have misunderstood what's implied with this step: "and then recreate an initramfs" (I only ran the commandline above)

Well yes, that was key for me as well, although the way I did it was by reinstalling `kernel-core-blah-blah-blah`. Use `rpm -qa|grep kernel-core` to find the correct package name.

> Does this mean I can expect a fix when https://packages.fedoraproject.org/pkgs/linux-firmware/linux-firmware/ shows "20251125-2.fc43" ?

That's my expectation as well.
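To tell whether an installed or offered firmware build is at least the one expected to carry the fix, a version-sort comparison is usually enough. This is a sketch with assumptions: version_at_least is a made-up name, it relies on sort -V (GNU coreutils) rather than rpm's own comparison rules, and whether 20251125-2 actually fixes the crash is only the expectation voiced in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if version-release $1 is at least $2.
# Uses "sort -V", which approximates but is not identical to rpm's ordering.
version_at_least() {
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Possible use, comparing the installed package with the expected fix:
#   installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' amd-gpu-firmware)
#   if version_at_least "$installed" 20251125-2.fc43; then
#       echo "installed firmware is at or past the expected fix"
#   fi
```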
It looks like news about this is posted here: https://bugzilla.redhat.com/show_bug.cgi?id=2420062