Bug 2420039

Summary:	AMDGPU crashes with "ring gfx_0.0.0 timeout"
Product:	[Fedora] Fedora	Reporter:	Marcus K <moral1tycor3>
Component:	kernel	Assignee:	Adam Jackson <ajax>
Status:	NEW ---	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	urgent	Docs Contact:
Priority:	unspecified
Version:	43	CC:	acaringi, adscvr, airlied, ajanulgu, ajax, asrivats, hans, hpa, igor.raits, jexposit, jforbes, josef, j, kernel-maint, linville, luc.wastiaux, lyude, marcandre.lureau, masami256, mchehab, mpenttil, philip.wyett, postmaster, ptalbert, rstrode, steved, suraj.ghimire7, tstellar, vitezslav.zivota
Hardware:	x86_64
OS:	Linux

Description Marcus K 2025-12-08 14:43:10 UTC

Since about last week, I've been experiencing GPU driver crashes with AMDGPU. After the system has been running for a while, the display driver crashes (the screen freezes, resets to a black screen shortly afterwards, and then shows the desktop again) with the following messages in the kernel log:

amdgpu 0000:06:00.0: amdgpu: Dumping IP State
amdgpu 0000:06:00.0: amdgpu: Dumping IP State Completed
amdgpu 0000:06:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
amdgpu 0000:06:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
amdgpu 0000:06:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=85716, emitted seq=85718
amdgpu 0000:06:00.0: amdgpu:  Process code pid 7967 thread code:cs0 pid 7989
amdgpu 0000:06:00.0: amdgpu: Starting gfx_0.0.0 ring reset
amdgpu 0000:06:00.0: amdgpu: Ring gfx_0.0.0 reset failed
amdgpu 0000:06:00.0: amdgpu: GPU reset begin!
amdgpu 0000:06:00.0: amdgpu: MODE2 reset
amdgpu 0000:06:00.0: amdgpu: GPU reset succeeded, trying to resume
[drm] PCIE GART of 1024M enabled (table at 0x000000F47FC00000).
amdgpu 0000:06:00.0: amdgpu: PSP is resuming...
amdgpu 0000:06:00.0: amdgpu: reserve 0xa00000 from 0xf47e000000 for PSP TMR
amdgpu 0000:06:00.0: amdgpu: RAS: optional ras ta ucode is not available
amdgpu 0000:06:00.0: amdgpu: RAP: optional rap ta ucode is not available
amdgpu 0000:06:00.0: amdgpu: SECUREDISPLAY: optional securedisplay ta ucode is not available
amdgpu 0000:06:00.0: amdgpu: SMU is resuming...
amdgpu 0000:06:00.0: amdgpu: SMU is resumed successfully!
amdgpu 0000:06:00.0: amdgpu: kiq ring mec 2 pipe 1 q 0
amdgpu 0000:06:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x05002C00
amdgpu 0000:06:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring gfx_0.1.0 uses VM inv eng 1 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 4 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 5 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 12 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring sdma0 uses VM inv eng 13 on hub 0
amdgpu 0000:06:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
amdgpu 0000:06:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
amdgpu 0000:06:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
amdgpu 0000:06:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
amdgpu 0000:06:00.0: amdgpu: GPU reset(1) succeeded!
amdgpu 0000:06:00.0: [drm] device wedged, but recovered through reset

The process triggering the crash varies. Sometimes it's VSCode (as in the example above), but I've seen "XWayland" and "sway" as well.
After the recovery, I can usually use the system for a while longer (about 30 minutes or so) before the display driver crashes again to a black screen, this time permanently, with the following messages:

amdgpu 0000:06:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000017 SMN_C2PMSG_82:0x00000000
amdgpu 0000:06:00.0: amdgpu: Failed to disable gfxoff!

These messages repeat until I eventually force-reboot the system (via SysRq). I can't even switch to a TTY to try to recover the system; my monitors just go into standby mode because they no longer receive a video signal. This sequence has happened three times today (so far).
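
The log above points at a devcoredump file, which could be useful to attach to a report like this one. As a minimal sketch (the function name `save_devcoredump` and the output file name are made up for illustration; the default path is the one printed in the log, and reading it usually requires root), the dump could be copied out while it still exists. As far as I know, the kernel discards devcoredump data on its own after a few minutes:

```shell
# Copy the devcoredump data to a regular file and print the destination path.
# Arguments (both optional): source path, destination path.
save_devcoredump() {
    src="${1:-/sys/class/drm/card1/device/devcoredump/data}"
    dest="${2:-amdgpu-devcoredump-$(date +%Y%m%dT%H%M%S).bin}"
    if [ ! -r "$src" ]; then
        echo "no readable devcoredump at $src" >&2
        return 1
    fi
    cat "$src" > "$dest" && echo "$dest"
}

# Usage (as root, shortly after the reset):
#   save_devcoredump
```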


Reproducible: Always

Additional Information:
Kernel version: 6.17.9-300.fc43.x86_64 
CPU: AMD Ryzen 9 7945HX with Radeon Graphics
iGPU: VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Raphael [1002:164e] (rev d8) (prog-if 00 [VGA controller])

Comment 1 Grégoire Paris 2025-12-11 20:50:49 UTC

I think this is the same as https://bbs.archlinux.org/viewtopic.php?pid=2275770, and will be solved by version 20251125-2
You can try downgrading amd-gpu-firmware and amd-ucode-firmware and then recreate an initramfs by reinstalling kernel-core. Worked for me.

Comment 2 Vitezslav Zivota 2025-12-15 14:24:09 UTC

I've had similar problems since about last week, too: an amdgpu crash, followed by a gnome-shell crash. Interestingly, the crash was often triggered by VS Code, as happened to Marcus.

pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32771)
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu:  Process code pid 31648 thread code:cs0 pid 31669
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu:   in page starting at address 0x000000003f800000 from client 0x1b (UTCL2)
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00301430
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu:          Faulty UTCL2 client ID: SQC (data) (0xa)
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu:          MORE_FAULTS: 0x0
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu:          WALKER_ERROR: 0x0
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu:          MAPPING_ERROR: 0x0
pro 14 20:27:31 holly kernel: amdgpu 0000:15:00.0: amdgpu:          RW: 0x0
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: Dumping IP State
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: Dumping IP State Completed
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=131625, emitted seq=131627
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu:  Process code pid 31648 thread code:cs0 pid 31669
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: Starting gfx_0.0.0 ring reset
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: amdgpu: Ring gfx_0.0.0 reset succeeded
pro 14 20:27:42 holly kernel: amdgpu 0000:15:00.0: [drm] device wedged, but recovered through reset

Firmware downgrade to 20251021-1 helped me.
amd-gpu-firmware.noarch                              20251021-1.fc43                     fedora
amd-ucode-firmware.noarch                            20251021-1.fc43                     fedora

Comment 3 Marcus K 2025-12-15 15:44:20 UTC

(In reply to Grégoire Paris from comment #1)
> I think this is the same as
> https://bbs.archlinux.org/viewtopic.php?pid=2275770, and will be solved by
> version 20251125-2
> You can try downgrading amd-gpu-firmware and amd-ucode-firmware and then
> recreate an initramfs by reinstalling kernel-core. Worked for me.

Thank you, this seems to have worked. I downgraded and pinned the version to 20251021-1 for both of them, and had no crashes since.
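
For reference, one way to pin the packages on Fedora 43 could look like the sketch below. It assumes the `versionlock` command of dnf5 is available; the helper names `run`, `pin_amd_firmware` and `unpin_amd_firmware` are made up for illustration, and by default the commands are only printed, not executed:

```shell
# Print each command by default; set RUN=1 in the environment to execute it.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# Lock both firmware packages; this should lock them at the installed version,
# so run it after the downgrade to 20251021-1 has been done.
pin_amd_firmware() {
    run sudo dnf versionlock add amd-gpu-firmware amd-ucode-firmware
}

# Remove the locks again once a fixed firmware package is published.
unpin_amd_firmware() {
    run sudo dnf versionlock delete amd-gpu-firmware amd-ucode-firmware
}

# Usage:
#   pin_amd_firmware          # dry run, prints the command
#   RUN=1 pin_amd_firmware    # actually adds the locks
```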

Comment 4 Grégoire Paris 2025-12-15 18:14:52 UTC

Great! On my end I noticed that I was not using the latest Fedora, so I upgraded to Fedora 43, and so far no crashes, even with 20251125-1, so I don't understand… but maybe it will crash later.

Comment 5 Luc W 2025-12-16 13:41:36 UTC

I started getting similar GPU-related crashes after upgrading to Fedora Workstation 43. Prior to that I used Fedora Workstation 41 and 42 on that same system without ever experiencing this sort of crash.

Symptoms: while browsing YouTube or Reddit with video in Firefox, the screen occasionally freezes and sometimes goes blank. The freeze persists for some time, then GNOME Shell crashes and I am logged out.
The system log will then contain something like:

Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: Dumping IP State
Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: Dumping IP State Completed
Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: ring sdma0 timeout, signaled seq=255087, emitted seq=255089
Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: MODE2 reset
Dec 07 19:56:19 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume

Sometimes, the error that appears is:
Dec 10 22:20:11 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=319235, emitted seq=319236
Sometimes, it's:
Dec 09 21:21:27 linux-ws kernel: amdgpu 0000:04:00.0: amdgpu: ring sdma0 timeout, signaled seq=174916, emitted seq=174918
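
Since the affected ring differs between crashes, a per-ring tally can help when comparing reports. A minimal sketch, assuming the log lines keep the `ring <name> timeout,` wording shown above (the function name `ring_timeout_counts` is made up for illustration):

```shell
# Read kernel log text on stdin and print "<ring> <count>" for each ring
# that reported a timeout.
ring_timeout_counts() {
    sed -nE 's/.*amdgpu: ring ([^ ]+) timeout,.*/\1/p' | sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage (assumes the systemd journal holds the kernel log; current boot only):
#   journalctl -k --no-pager | ring_timeout_counts
```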

It looks like it can be reproduced by running:
stress-ng --cpu 0 --cpu-method fft --timeout 20m
for around a minute.

My system: Minisforum BD790ix3d system (AMD Ryzen 9 7945HX3D) with AMD iGPU (610m).

luc@linux-ws ~$ lspci -k | grep -EA3 'VGA|3D|Display'
04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raphael (rev dc)
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Raphael
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu
luc@linux-ws ~$ glxinfo | grep -i "opengl renderer"
OpenGL renderer string: AMD Radeon 610M (radeonsi, raphael_mendocino, LLVM 21.1.5, DRM 3.64, 6.17.9-300.fc43.x86_64)
luc@linux-ws ~$ cat /etc/fedora-release 
Fedora release 43 (Forty Three)

I have since upgraded to kernel 6.17.11-300, but the behavior is the same.

I've tried the following changes without success, based on advice that Claude and GPT-5 Pro research gathered from various threads (Arch forums, etc.):

2025-12-07:
BIOS setting change: navigate to Advanced → CPU Configuration → PSS and set it to Disabled. This is meant to prevent the CPU from entering the problematic C6 sleep state.

2025-12-08:
tried to add kernel parameters:
amdgpu.noretry=0 amdgpu.ppfeaturemask=0xffff7fff amdgpu.sg_display=0

still got a "amdgpu: ring sdma0 timeout, signaled seq=174916, emitted seq=174918"

2025-12-11:
tried to add kernel parameters:
amdgpu.dcdebugmask=0x10 amdgpu.gfxoff=0
still got a crash

2025-12-13:
tried to add kernel parameter:
amdgpu.sdma=0
still got "amdgpu 0000:04:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=220584, emitted seq=220584"
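
When testing parameters like these, it may be worth confirming that the running kernel actually booted with them. A minimal sketch (the function name `has_kernel_param` is made up for illustration, and `grubby` in the usage notes is only one way to set parameters on Fedora, not necessarily the one used here):

```shell
# Succeed if the exact parameter (e.g. amdgpu.sg_display=0) appears on the
# kernel command line. The second argument defaults to /proc/cmdline.
has_kernel_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# Usage:
#   has_kernel_param amdgpu.sg_display=0 && echo "active" || echo "not set"
#
# One way to add or remove a parameter on Fedora (takes effect after reboot):
#   sudo grubby --update-kernel=ALL --args="amdgpu.sg_display=0"
#   sudo grubby --update-kernel=ALL --remove-args="amdgpu.sg_display=0"
```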

2025-12-16:
tried downgrading amd-gpu-firmware and amd-ucode-firmware to 20251021-1, but was still able to reproduce the crash by running stress-ng together with a YouTube video.

luc@linux-ws ~$ sudo dnf downgrade amd-gpu-firmware-20251021-1.fc43 amd-ucode-firmware-20251021-1.fc43
Updating and loading repositories:
Repositories loaded.
Package                                           Arch         Version                                           Repository                      Size
Downgrading:
 amd-gpu-firmware                                 noarch       20251021-1.fc43                                   fedora                      25.7 MiB
   replacing amd-gpu-firmware                     noarch       20251125-1.fc43                                   <unknown>                   25.7 MiB
 amd-ucode-firmware                               noarch       20251021-1.fc43                                   fedora                     419.9 KiB
   replacing amd-ucode-firmware                   noarch       20251125-1.fc43                                   <unknown>                  546.0 KiB


Though I may have misunderstood what's implied with this step: "and then recreate an initramfs" (I only ran the commandline above)
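
One way to check whether the initramfs was rebuilt after the downgrade is to compare its modification time with the install time of the firmware package. A minimal sketch, assuming GNU `stat` and the usual Fedora image path `/boot/initramfs-$(uname -r).img` (the function name `initramfs_predates` is made up for illustration):

```shell
# Succeed if the image file was last modified before the given epoch time.
initramfs_predates() {
    image="$1"
    since_epoch="$2"
    [ "$(stat -c %Y "$image")" -lt "$since_epoch" ]
}

# Usage on Fedora:
#   fw_time=$(rpm -q --qf '%{INSTALLTIME}\n' amd-gpu-firmware | sort -n | tail -n 1)
#   if initramfs_predates "/boot/initramfs-$(uname -r).img" "$fw_time"; then
#       echo "initramfs is older than the firmware package; regenerate it"
#   fi
```

If the image turns out to be older than the package, it was built before the downgrade and may still carry the newer firmware.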

I see the comment from Gregoire above "I think this is the same as https://bbs.archlinux.org/viewtopic.php?pid=2275770, and will be solved by version 20251125-2". Does this mean I can expect a fix when https://packages.fedoraproject.org/pkgs/linux-firmware/linux-firmware/ shows "20251125-2.fc43" ?

Comment 6 Luc W 2025-12-16 13:55:34 UTC

Update: after doing this:

sudo dnf downgrade amd-gpu-firmware-20251021-1.fc43 amd-ucode-firmware-20251021-1.fc43
and 
sudo dracut --force

then rebooting, I ran the following stress test:

stress-ng --cpu 0 --cpu-method fft --timeout 20m
and, at the same time, played a high-bitrate 4K video on YouTube: https://www.youtube.com/watch?v=7PIji8OubXU

This has run for more than 5 minutes without a crash, which gives me confidence that the problem doesn't appear under this configuration. I'll update if it comes back.

luc@linux-ws ~$ rpm -q linux-firmware
linux-firmware-20251021-1.fc43.noarch
luc@linux-ws ~$ uname -a
Linux linux-ws 6.17.11-300.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Dec  8 23:20:36 UTC 2025 x86_64 GNU/Linux

Comment 7 Grégoire Paris 2025-12-16 20:11:16 UTC

> Though I may have misunderstood what's implied with this step: "and then recreate an initramfs" (I only ran the commandline above)

Well yes, that was key for me as well, although the way I did it was by reinstalling `kernel-core-blah-blah-blah`. Run `rpm -qa | grep kernel-core` to find the correct package name.
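
Since the `kernel-core` package name is built from the kernel release string, the lookup can probably be skipped. A minimal sketch, assuming the running kernel is the one whose initramfs should be regenerated (the function name `kernel_core_pkg` is made up for illustration):

```shell
# Print the kernel-core package name for a kernel release string
# (defaults to the running kernel as reported by `uname -r`).
kernel_core_pkg() {
    printf 'kernel-core-%s\n' "${1:-$(uname -r)}"
}

# Usage (reinstalling the package regenerated the initramfs in my case):
#   sudo dnf reinstall "$(kernel_core_pkg)"
```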

> Does this mean I can expect a fix when https://packages.fedoraproject.org/pkgs/linux-firmware/linux-firmware/ shows "20251125-2.fc43" ?

That's my expectation as well.

Comment 8 Grégoire Paris 2026-01-01 17:19:47 UTC

It looks like news about this is being posted here: https://bugzilla.redhat.com/show_bug.cgi?id=2420062