Ever since kernel 6.8 builds started appearing in Rawhide, they have been failing openQA tests because there is often no graphical display on the Workstation tests. This does not appear to affect KDE. The KDE tests always pass. It doesn't happen on *every* Workstation test, but it always happens a lot. Usually more than half the tests fail this way.

When the failure happens, the system boots normally with the old kernel, we then apply the update to the 6.8 kernel snapshot and reboot, and the system never reaches the GDM login screen. The Plymouth splash screen shows for a while, then the display goes to "Display output is not active."

openQA tests on qemu virtual machines with virtio graphics, without 3D acceleration enabled.

In the system logs, right around the time the screen goes to "not active", these messages are logged:

Jan 29 07:30:16 fedora gnome-session-binary[1015]: Entering running state
Jan 29 07:30:17 fedora rtkit-daemon[732]: Successfully made thread 1039 of process 1027 (/usr/bin/gnome-shell) owned by '42' high priority at nice level 0.
Jan 29 07:30:17 fedora rtkit-daemon[732]: Successfully made thread 1039 of process 1027 (/usr/bin/gnome-shell) owned by '42' RT at priority 20.
Jan 29 07:30:17 fedora gnome-shell[1027]: Page flip failed: Page flip of 35 failed, and no mode set available
Jan 29 07:30:17 fedora gnome-shell[1027]: Failed to post KMS update: Page flip of 35 failed, and no mode set available
Jan 29 07:30:17 fedora org.gnome.Shell.desktop[1329]: MESA: error: ZINK: failed to choose pdev
Jan 29 07:30:17 fedora org.gnome.Shell.desktop[1329]: glx: failed to create drisw screen
Jan 29 07:30:17 fedora org.gnome.Shell.desktop[1329]: failed to load driver: zink
Jan 29 07:30:17 fedora gnome-shell[1027]: maybe_update_cursor_plane: assertion 'crtc_state_impl' failed
Jan 29 07:30:17 fedora gnome-shell[1027]: Page flip failed: Page flip of 35 failed, and no mode set available
Jan 29 07:30:17 fedora gsd-media-keys[1136]: Failed to grab accelerator for keybinding settings:hibernate
Jan 29 07:30:17 fedora gsd-media-keys[1136]: Failed to grab accelerator for keybinding settings:playback-repeat
Jan 29 07:30:17 fedora /usr/libexec/gdm-wayland-session[1014]: dbus-daemon[1014]: [session uid=42 pid=1014] Activating service name='org.gnome.ScreenSaver' requested by ':1.25' (uid=42 pid=>
Jan 29 07:30:17 fedora gnome-shell[1027]: maybe_update_cursor_plane: assertion 'crtc_state_impl' failed
Jan 29 07:30:17 fedora gnome-shell[1027]: Page flip failed: Page flip of 35 failed, and no mode set available
Jan 29 07:30:17 fedora gnome-shell[1027]: maybe_update_cursor_plane: assertion 'crtc_state_impl' failed
Jan 29 07:30:17 fedora gnome-shell[1027]: Page flip failed: Page flip of 35 failed, and no mode set available

I've been noting this in the Bodhi updates for the kernel builds (all of which have failed gating because of this), but since it's been going on for a while, I'm filing a bug for visibility.
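For anyone checking whether a failed job hit this same problem, here is a minimal sketch of scanning saved journal text for the two gnome-shell messages quoted above. The function name `count_signature` is hypothetical, and treating these two messages as a reliable marker for this bug is an assumption, not something openQA does itself.

```python
import re

# Message patterns copied from the log excerpt in this report.
# Assumption: these two messages together are a useful marker for the bug.
PAGE_FLIP_RE = re.compile(
    r"gnome-shell\[\d+\]: Page flip failed: "
    r"Page flip of \d+ failed, and no mode set available"
)
CURSOR_ASSERT = "maybe_update_cursor_plane: assertion 'crtc_state_impl' failed"


def count_signature(journal_text):
    """Count occurrences of the failure messages in a blob of journal text.

    Works on the whole string rather than line by line, so it does not
    matter whether the log was saved with its line breaks intact.
    """
    return {
        "page_flip_failed": len(PAGE_FLIP_RE.findall(journal_text)),
        "cursor_assert": journal_text.count(CURSOR_ASSERT),
    }


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): journalctl -b | python3 this_script.py
    print(count_signature(sys.stdin.read()))
```

A non-zero count for both keys suggests the same failure, but a look at the screenshot of the failed job is still the more direct check.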
Update from jforbes: a temporary revert to work around this is building soon, and we're hoping to have a proper fix later this week or next week.
Justin also noted that this didn't get picked up sooner because openQA uses an 'unusual' video config that not many other folks are using, so nobody else flagged it. The config openQA uses is virtio-vga without 3D passthrough. We've used qxl and VGA/std in the past, but had issues with both, and virtio-vga seemed to be the 'currently best supported' option. We cannot practically use 3D passthrough on openQA, because the worker hosts are server hardware with very basic GPUs (Matrox G200, mostly) and run dozens of jobs simultaneously; I am pretty sure 3D passthrough would either not work at all, or cause more problems than it would solve, in that kind of setup. SUSE uses -device VGA. We could go back to that in Fedora, I guess, but I'm not really sure it would be an improvement. I don't recall exactly what bug we ran into last time we used it, but there definitely was one. (With qxl I think it was 'VTs sometimes get the color scheme wrong for some reason so all our console screenshots don't match', and nobody seemed interested in fixing that.)
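To make the three video options mentioned above concrete, here is a rough sketch of the qemu arguments each one corresponds to. The mapping keys and the helper name `qemu_video_args` are hypothetical, and this is not how openQA itself assembles its qemu command line; it only illustrates the device choice for a manual reproduction attempt.

```python
# Hypothetical short names mapped to the qemu display devices discussed here.
# None of these enable 3D acceleration.
VIDEO_DEVICES = {
    "virtio": "virtio-vga",  # the config described in this comment
    "std": "VGA",            # '-device VGA', as SUSE is said to use
    "qxl": "qxl-vga",        # previously tried, had console color issues
}


def qemu_video_args(kind):
    """Return qemu arguments selecting one emulated display adapter.

    '-vga none' turns off qemu's default adapter so that only the
    explicitly requested '-device' is present in the guest.
    """
    if kind not in VIDEO_DEVICES:
        raise ValueError(f"unknown video kind: {kind!r}")
    return ["-vga", "none", "-device", VIDEO_DEVICES[kind]]


if __name__ == "__main__":
    for kind in VIDEO_DEVICES:
        print(kind, " ".join(qemu_video_args(kind)))
```

Booting the same disk image with each variant would be one way to confirm that the failure is specific to virtio-vga, assuming the rest of the VM configuration is kept identical.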
Bilal mentioned that this may be a mutter bug, and that the following commit fixes it: https://gitlab.gnome.org/swick/mutter/-/commit/51bc0431079f2f5778c70b7577102d0977769b45. But that hasn't made it into a released mutter version yet.
I can test that in a couple of hours after some meetings. Thanks for the heads-up.
Well, testing it is a bit trickier because jforbes sent out a kernel with a workaround, so current Rawhide does not hit the bug any more. I'm trying to bodge up a test run that uses both a patched mutter and an older kernel now.
OK, so yeah, I think that mutter commit does fix it. I managed to force a set of tests to run with a patched mutter and older kernel - https://openqa.stg.fedoraproject.org/tests/overview?distri=fedora&version=40&groupid=2&build=Kojitask-113182813_113114834-NOREPORT . You can see on the first thumbnail of _graphical_wait_login_2 in each case - e.g. https://openqa.stg.fedoraproject.org/tests/3550812#step/_graphical_wait_login_2/1 - that it boots kernel-6.8.0-0.rc3.20240207git6d280f4d760e.28.fc40 , a kernel which failed the tests when tested alone - https://openqa.fedoraproject.org/tests/overview?distri=fedora&groupid=2&version=40&build=Update-FEDORA-2024-eb02928b46 - and each test passed this time. I'll backport that mutter patch, and once that's in, we can try a kernel build with the workaround removed, I guess?
mutter update is done - https://bodhi.fedoraproject.org/updates/FEDORA-2024-31ca0e57d3 - so let's call this fixed for now. If it somehow shows up again, I'll re-open.