Something changed within the last couple of weeks that causes Fedora KDE to crash to a black screen during the boot process. Previously, this seemed to primarily affect reboots; shutting my computer off and doing a cold boot was enough to get a happy, well-behaved boot. This has escalated to a boot-looping crash where I get a frame or so of the boot splash screen before the screen goes black, the system pauses for several seconds, and the boot process restarts. This now occurs with a 100% reproduction rate, even when starting from a cold, powered-off state, and with every kernel I have installed.

Judging by the symptoms, this is AMDGPU crashing, but unfortunately there is very little (most frequently nothing) in the logs. On one boot, the logs did manage to record the following abnormal kernel GPU errors:

Nov 14 18:01:03 localhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [CRTC:80:crtc-0] commit wait timed out
Nov 14 18:00:59 localhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* flip_done timed out
Nov 14 18:00:39 localhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [CRTC:80:crtc-0] flip_done timed out
Nov 14 18:00:29 localhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data

I have the following workaround for affected users who can detach their display AND have an SSH server configured:

- unplug the display
- boot
- SSH into the machine
- reattach the monitor
- sudo systemctl restart sddm

At this point (it might take a couple of attempts to restart sddm), the graphical session should function.

Unfortunately, the usual options for avoiding a graphical boot (removing quiet and rhgb, adding 3 to the boot line) do not seem to have any impact on this bug. The only way I've been able to sufficiently "disable" the graphical boot so that the system boots without going into a crash loop is to unplug the monitor.

Reproducible: Always

Steps to Reproduce:
Boot.
Actual Results:
Crash loop.

Expected Results:
SDDM appears

Additional Information:
This is a more formal filing of my comments here: https://discussion.fedoraproject.org/t/talk-fedora-43-kde-sometimes-boots-to-a-black-screen/171219/17
These issues seem possibly related:

- https://bugzilla.redhat.com/show_bug.cgi?id=2354776
- https://bugzilla.redhat.com/show_bug.cgi?id=2359116

but as a boot loop that can only be bypassed with what I would consider "heroics", this manifestation is much more serious.
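The workaround in the report notes that restarting sddm may take a couple of attempts. A minimal sketch of that retry step, assuming it is run in the SSH session after the monitor has been reattached. The `retry` helper and the attempt count and delay in the usage line are hypothetical and not from the report.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY_SECONDS COMMAND...
# Hypothetical helper: runs COMMAND until it exits 0 or ATTEMPTS runs out.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $n of $attempts failed: $*" >&2
        n=$((n + 1))
        # Sleep only if another attempt follows.
        if [ "$n" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Usage over SSH, once the monitor is plugged back in (values are assumptions):
#   retry 3 5 sudo systemctl restart sddm
```

A zero exit status from systemctl only means the unit was restarted, not that the greeter actually rendered, so a look at the screen is still needed before calling the attempt a success.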
Are you using Xorg session with KDE?
I am not; sorry, I forgot hardware info in general.

Operating System: Fedora Linux 43
KDE Plasma Version: 6.5.2
KDE Frameworks Version: 6.19.0
Qt Version: 6.10.0
Kernel Version: 6.17.7-300.fc43.x86_64 (64-bit)
Graphics Platform: Wayland
Processors: 32 × AMD Ryzen 9 7950X 16-Core Processor
Memory: 64 GiB of RAM (61.9 GiB usable)
Graphics Processor 1: AMD Radeon RX 7900 XTX
Graphics Processor 2: AMD Ryzen 9 7950X 16-Core Processor
Manufacturer: ASUS

If I can get to SDDM (to even get into my KDE session), all is fine.
There are a number of similar issues open in the upstream issue tracker: https://gitlab.freedesktop.org/drm/amd/-/issues/?sort=created_date&state=opened&search=dc_dmub_srv_log_diagnostic_data&first_page_size=30 . Could you check if any of them match yours? Anyway, this is a kernel issue, so reassigning to kernel.
Reassigning back to xorg driver package after discussion with kernel maintainer.
It's kind of hard to say; those all look like they're happening well after login, but conceptually the same crash could be responsible for all of them (with different triggers), or this could be novel. So in the sense of the symptom, no, but with so little information in the logs following the crash ... I can't say definitively that this is "none of those things" or "one of those things."
Of note, I've had the system up since the 14th without any crashing during normal usage (I've played games, watched shows, done web browsing, run heavy compilation workloads, etc.). So this really does seem to be triggered only during the boot process.
Okay, I tried again with today's updates and got some interesting new errors related to the crash:

Nov 19 11:05:12 localhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 19 11:05:16 localhost kernel: amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000029 SMN_C2PMSG_82:0x00000000
Nov 19 11:05:16 localhost kernel: amdgpu 0000:03:00.0: amdgpu: Failed to disable gfxoff!
Nov 19 11:05:18 localhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 19 11:05:24 localhost kernel: amdgpu 0000:03:00.0: amdgpu: ring_buffer_start = 00000000b23a7c87; ring_buffer_end = 00000000695ffdcc; write_frame = 0000000089fa4e2b
Nov 19 11:05:24 localhost kernel: amdgpu 0000:03:00.0: amdgpu: write_frame is pointing to address out of bounds
Nov 19 11:05:24 localhost kernel: amdgpu 0000:03:00.0: amdgpu: device lost from bus!
Nov 19 11:05:24 localhost kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:41 param:0x00000000 message:DisallowGfxOff?
Nov 19 11:05:24 localhost kernel: amdgpu 0000:03:00.0: amdgpu: Failed to disable gfxoff!
Nov 19 11:05:45 localhost kernel: amdgpu 0000:03:00.0: amdgpu: device lost from bus!
Nov 19 11:05:45 localhost kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:41 param:0x00000000 message:DisallowGfxOff?
Nov 19 11:05:45 localhost kernel: amdgpu 0000:03:00.0: amdgpu: Failed to disable gfxoff!

I also had one clean boot from power off. It seems like the transitions between grub -> plymouth -> sddm are the danger points. In particular, if I never see plymouth's "loading spinner", I seem to be golden. However, if it does render, I only get a single frame of it and the boot is going to fail.
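Because the useful messages are sparse and spread across boots, it can help to pull only the amdgpu error lines out of a previous boot's kernel log. A minimal sketch: the `amdgpu_errors` function name and its pattern list are assumptions, chosen to match the messages quoted in this report rather than any complete list of driver errors.

```shell
#!/bin/sh
# amdgpu_errors: reads kernel log lines on stdin and keeps only amdgpu
# lines that look like errors. The patterns are assumptions taken from
# the messages quoted in this report.
amdgpu_errors() {
    grep -E 'amdgpu.*(\*ERROR\*|device lost from bus|Failed to |SMU: |out of bounds)'
}

# Usage:
#   journalctl --list-boots                           # find the boot of interest
#   journalctl -k -b -1 --no-pager | amdgpu_errors    # kernel messages, previous boot
```

This only works if the journal is persistent and the messages were written to disk before the machine restarted, which may explain why most failed boots leave nothing behind.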
I tried adding amdgpu.gfxoff=0 to my kernel arguments, going on the hint of "Failed to disable gfxoff":

sudo grubby --update-kernel=ALL --args="amdgpu.gfxoff=0"

However, that did not seem to improve things. "device lost from bus!" is also interesting, as it is somewhat suggestive of a hardware issue, but I'd find that hard to believe given the days-long flawless runtime once successfully on the desktop.
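Before concluding that the argument made no difference, it may be worth confirming that it actually reached the running kernel's command line and that the driver recognizes it. A minimal sketch: `has_kernel_arg` is a hypothetical helper, and whether `gfxoff` is a parameter the amdgpu module accepts is not established anywhere in this report, so the `modinfo -p` line in the usage notes is there to check that rather than assume it.

```shell
#!/bin/sh
# has_kernel_arg ARG [CMDLINE]
# Hypothetical helper: succeeds if ARG appears as a whole word in CMDLINE.
# CMDLINE defaults to the running kernel's /proc/cmdline.
has_kernel_arg() {
    arg=$1
    cmdline=${2-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $arg "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   has_kernel_arg amdgpu.gfxoff=0 && echo "argument is on the running command line"
#   modinfo -p amdgpu | grep -i gfxoff    # prints a line only if the module defines it
#   sudo grubby --info=DEFAULT            # shows the args stored for the default kernel
#   sudo grubby --update-kernel=ALL --remove-args="amdgpu.gfxoff=0"   # undo the change
```

If `modinfo -p` prints nothing for the name, the kernel would most likely have ignored the argument, and the test says nothing about gfxoff either way.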