Bug 2415143 - amdgpu: Fedora KDE amdgpu Boot-looping Crash
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: xorg-x11-drv-amdgpu
Version: 43
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Dominik 'Rathann' Mierzejewski
QA Contact: Fedora Extras Quality Assurance
Reported: 2025-11-14 23:49 UTC by Wyatt Childers
Modified: 2025-11-19 17:49 UTC (History)

Description Wyatt Childers 2025-11-14 23:49:19 UTC
Something has changed within the last couple of weeks that's resulted in Fedora KDE crashing to a black screen during the boot process.

Previously, this seemed to primarily affect reboots; shutting my computer off and doing a cold boot was enough to get a happy, well-behaved boot.

This has escalated to a boot-looping crash where I get a frame or so of the boot splash screen before the screen goes black, the system pauses for several seconds, and the boot process restarts. This now occurs with a 100% reproduction rate, even when cold booting from a powered-off state, and with every kernel I have installed.

Judging by the symptoms, this is obviously amdgpu crashing, but unfortunately there is very little (most frequently nothing) in the logs. On one boot, the logs did manage to record the following abnormal kernel GPU errors (listed newest first):

Nov 14 18:01:03 localhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [CRTC:80:crtc-0] commit wait timed out
Nov 14 18:00:59 localhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* flip_done timed out
Nov 14 18:00:39 localhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [CRTC:80:crtc-0] flip_done timed out
Nov 14 18:00:29 localhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
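
For anyone trying to collect similar messages, here is a minimal sketch of a filter that keeps only amdgpu kernel lines that look like errors. The function name amdgpu_errors and the pattern list are illustrative assumptions based on the lines quoted above, not an exhaustive list of amdgpu failure messages. Reading the previous boot also assumes the journal is persistent.

```shell
#!/bin/sh
# Keep only amdgpu kernel log lines that look like errors.
# The patterns are an assumption drawn from the messages quoted in this report.
amdgpu_errors() {
    grep -F 'amdgpu' | grep -E '\*ERROR\*|timed out|DMCUB error'
}

# On the affected machine, after a failed boot:
#   journalctl -k -b -1 --no-pager | amdgpu_errors
```

The filter reads from standard input, so it can also be run against a saved log file with a plain redirect.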

I have the following workaround for affected users who can detach their display AND have an SSH server configured:

- unplug the display
- boot
- SSH into the machine
- reattach the monitor
- run: sudo systemctl restart sddm

At this point (it might take a couple of attempts to restart sddm), the graphical session should function.
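
Since restarting sddm may need more than one attempt, the last step can be wrapped in a small retry loop. This is only a sketch: the helper name retry, the attempt count, and the delays are made up for illustration, and treating "systemctl is-active" as the success criterion is an assumption (sddm can be active while the display is still black).

```shell
#!/bin/sh
# retry ATTEMPTS DELAY COMMAND...
# Run COMMAND until it succeeds, at most ATTEMPTS times,
# sleeping DELAY seconds between failed attempts.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        # Only sleep if another attempt is coming.
        [ "$i" -le "$attempts" ] && sleep "$delay"
    done
    return 1
}

# Over SSH, after reattaching the monitor (hypothetical values):
#   retry 5 10 sh -c 'sudo systemctl restart sddm && sleep 5 && systemctl is-active --quiet sddm'
```

The function returns 0 as soon as one attempt succeeds and 1 if every attempt fails, so it can be chained with && or checked with if.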

Unfortunately, the usual options for avoiding a graphical boot (removing quiet and rhgb, adding 3 to the boot line) do not seem to have any impact on this bug. The only way I've been able to sufficiently "disable" the graphical boot to allow the system to boot without going into a crash loop is to unplug the monitor.


Reproducible: Always

Steps to Reproduce:
Boot.
Actual Results:
Crash loop.

Expected Results:
SDDM appears

Additional Information:
This is a more formal filing of my comments here: https://discussion.fedoraproject.org/t/talk-fedora-43-kde-sometimes-boots-to-a-black-screen/171219/17

Comment 1 Wyatt Childers 2025-11-14 23:53:13 UTC
These issues seem possibly related:
- https://bugzilla.redhat.com/show_bug.cgi?id=2354776
- https://bugzilla.redhat.com/show_bug.cgi?id=2359116

but as a boot loop that can only be bypassed with what I would consider "heroics", this manifestation is much more serious.

Comment 2 Dominik 'Rathann' Mierzejewski 2025-11-15 21:52:27 UTC
Are you using Xorg session with KDE?

Comment 3 Wyatt Childers 2025-11-15 21:54:54 UTC
I am not. Sorry, I forgot to include hardware info in general:

Operating System: Fedora Linux 43
KDE Plasma Version: 6.5.2
KDE Frameworks Version: 6.19.0
Qt Version: 6.10.0
Kernel Version: 6.17.7-300.fc43.x86_64 (64-bit)
Graphics Platform: Wayland
Processors: 32 × AMD Ryzen 9 7950X 16-Core Processor
Memory: 64 GiB of RAM (61.9 GiB usable)
Graphics Processor 1: AMD Radeon RX 7900 XTX
Graphics Processor 2: AMD Ryzen 9 7950X 16-Core Processor
Manufacturer: ASUS

If I can get to SDDM (to even get into my KDE session), all is fine.

Comment 4 Dominik 'Rathann' Mierzejewski 2025-11-17 11:38:26 UTC
There are a number of similar issues open in the upstream issue tracker:
https://gitlab.freedesktop.org/drm/amd/-/issues/?sort=created_date&state=opened&search=dc_dmub_srv_log_diagnostic_data&first_page_size=30

Could you check if any of them match yours?

Anyway, this is a kernel issue, so reassigning to kernel.

Comment 5 Dominik 'Rathann' Mierzejewski 2025-11-17 20:51:42 UTC
Reassigning back to xorg driver package after discussion with kernel maintainer.

Comment 6 Wyatt Childers 2025-11-18 22:38:32 UTC
It's kind of hard to say; those all look like they're happening well after login, but conceptually the same crash could be responsible for all of them (but with different triggers) or this could be novel.

So in the sense of the symptom, no, but with so little information in the logs following the crash ... I can't say definitively that this is "none of those things" or "one of those things."

Comment 7 Wyatt Childers 2025-11-18 22:40:34 UTC
Of note, I've had the system up since the 14th without any crashing during normal usage (I've played games, watched shows, done web browsing, run heavy compilation workloads, etc.).

So this really does seem to just be triggering during the boot process.

Comment 8 Wyatt Childers 2025-11-19 16:18:52 UTC
Okay, I tried again with today's updates and got some interesting new errors related to the crash:

Nov 19 11:05:12 localhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 19 11:05:16 localhost kernel: amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000029 SMN_C2PMSG_82:0x00000000
Nov 19 11:05:16 localhost kernel: amdgpu 0000:03:00.0: amdgpu: Failed to disable gfxoff!
Nov 19 11:05:18 localhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 19 11:05:24 localhost kernel: amdgpu 0000:03:00.0: amdgpu: ring_buffer_start = 00000000b23a7c87; ring_buffer_end = 00000000695ffdcc; write_frame = 0000000089fa4e2b
Nov 19 11:05:24 localhost kernel: amdgpu 0000:03:00.0: amdgpu: write_frame is pointing to address out of bounds
Nov 19 11:05:24 localhost kernel: amdgpu 0000:03:00.0: amdgpu: device lost from bus!
Nov 19 11:05:24 localhost kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:41 param:0x00000000 message:DisallowGfxOff?
Nov 19 11:05:24 localhost kernel: amdgpu 0000:03:00.0: amdgpu: Failed to disable gfxoff!
Nov 19 11:05:45 localhost kernel: amdgpu 0000:03:00.0: amdgpu: device lost from bus!
Nov 19 11:05:45 localhost kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:41 param:0x00000000 message:DisallowGfxOff?
Nov 19 11:05:45 localhost kernel: amdgpu 0000:03:00.0: amdgpu: Failed to disable gfxoff!

I also had one clean boot from power off. It seems like the transitions between grub -> plymouth -> sddm are the danger points. In particular, if I never see plymouth's "loading spinner" I seem to be golden. However, if that does render, I only get a single frame of it and the boot is going to fail.

Comment 9 Wyatt Childers 2025-11-19 17:49:16 UTC
I tried adding amdgpu.gfxoff=0 to my kernel arguments, going on the hint of "Failed to disable gfxoff":

sudo grubby --update-kernel=ALL --args="amdgpu.gfxoff=0"

However, that did not seem to improve things.
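
To rule out the argument simply not being applied, it can be checked against the running kernel's command line. The helper below is a sketch; the name has_karg is made up for illustration. It only confirms that the argument was passed at boot. Whether the amdgpu module actually accepts a parameter by that name is not confirmed in this report; "modinfo -p amdgpu" lists the parameters the installed module declares.

```shell
#!/bin/sh
# has_karg CMDLINE ARG
# Succeeds if ARG appears as a whole, exact word in CMDLINE.
has_karg() {
    # Unquoted on purpose: split the command line on whitespace.
    for arg in $1; do
        [ "$arg" = "$2" ] && return 0
    done
    return 1
}

# On the running system:
#   has_karg "$(cat /proc/cmdline)" amdgpu.gfxoff=0 && echo "argument is on the command line"
#   modinfo -p amdgpu | grep -i gfxoff   # does the module declare such a parameter?
```

Matching whole words avoids false positives from arguments that merely contain the same text as a substring.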

"device lost from bus!" is also interesting as it is somewhat suggestive of a hardware issue, but I'd find that hard to believe given the days-long flawless runtime once successfully on the desktop.

