Bug 2261842 - Workstation edition often shows no graphics on a qemu VM with kernel 6.8
Summary: Workstation edition often shows no graphics on a qemu VM with kernel 6.8
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: mutter
Version: rawhide
Hardware: All
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Florian Müllner
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: openqa
Depends On:
Blocks:
Reported: 2024-01-30 01:18 UTC by Adam Williamson
Modified: 2024-02-09 00:28 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-02-09 00:28:36 UTC
Type: Bug
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
GNOME Gitlab GNOME mutter merge_requests 3539 0 None merged kms/cursor-manager: Create CrtcStateImpls for all active CRTCs 2024-02-08 22:50:51 UTC

Description Adam Williamson 2024-01-30 01:18:45 UTC
Ever since kernel 6.8 builds started appearing in Rawhide, they have been failing openQA tests because there is often no graphical display on the Workstation tests.

This does not appear to affect KDE. The KDE tests always pass.

It doesn't happen on *every* Workstation test, but it happens frequently. Usually more than half the tests fail this way.

When the failure happens, the system boots normally with the old kernel, we then apply the update to the 6.8 kernel snapshot and reboot, and the system never reaches the GDM login screen. The Plymouth splash screen shows for a while, then the display goes to "Display output is not active."

openQA tests on qemu virtual machines with virtio graphics, without 3D acceleration enabled.

In the system logs, right around the time the screen goes to "not active", these messages are logged:

Jan 29 07:30:16 fedora gnome-session-binary[1015]: Entering running state
Jan 29 07:30:17 fedora rtkit-daemon[732]: Successfully made thread 1039 of process 1027 (/usr/bin/gnome-shell) owned by '42' high priority at nice level 0.
Jan 29 07:30:17 fedora rtkit-daemon[732]: Successfully made thread 1039 of process 1027 (/usr/bin/gnome-shell) owned by '42' RT at priority 20.
Jan 29 07:30:17 fedora gnome-shell[1027]: Page flip failed: Page flip of 35 failed, and no mode set available
Jan 29 07:30:17 fedora gnome-shell[1027]: Failed to post KMS update: Page flip of 35 failed, and no mode set available
Jan 29 07:30:17 fedora org.gnome.Shell.desktop[1329]: MESA: error: ZINK: failed to choose pdev
Jan 29 07:30:17 fedora org.gnome.Shell.desktop[1329]: glx: failed to create drisw screen
Jan 29 07:30:17 fedora org.gnome.Shell.desktop[1329]: failed to load driver: zink
Jan 29 07:30:17 fedora gnome-shell[1027]: maybe_update_cursor_plane: assertion 'crtc_state_impl' failed
Jan 29 07:30:17 fedora gnome-shell[1027]: Page flip failed: Page flip of 35 failed, and no mode set available
Jan 29 07:30:17 fedora gsd-media-keys[1136]: Failed to grab accelerator for keybinding settings:hibernate
Jan 29 07:30:17 fedora gsd-media-keys[1136]: Failed to grab accelerator for keybinding settings:playback-repeat
Jan 29 07:30:17 fedora /usr/libexec/gdm-wayland-session[1014]: dbus-daemon[1014]: [session uid=42 pid=1014] Activating service name='org.gnome.ScreenSaver' requested by ':1.25' (uid=42 pid=>
Jan 29 07:30:17 fedora gnome-shell[1027]: maybe_update_cursor_plane: assertion 'crtc_state_impl' failed
Jan 29 07:30:17 fedora gnome-shell[1027]: Page flip failed: Page flip of 35 failed, and no mode set available
Jan 29 07:30:17 fedora gnome-shell[1027]: maybe_update_cursor_plane: assertion 'crtc_state_impl' failed
Jan 29 07:30:17 fedora gnome-shell[1027]: Page flip failed: Page flip of 35 failed, and no mode set available

I've been noting this in the Bodhi updates for the kernel builds (all of which have failed gating because of this), but since it's been going on for a while, I'm filing a bug for visibility.
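
For anyone triaging similar failures, here is a minimal sketch of how a journal excerpt could be checked for the two gnome-shell messages quoted above. The function names and the two-signature heuristic are illustrative assumptions, not something openQA or mutter provides; only the message texts come from the log in this report.

```python
import re

# Message texts copied from the journal excerpt in this report.
# Treating these two together as "the signature" is an assumption.
FAILURE_PATTERNS = (
    re.compile(r"Page flip failed: Page flip of \d+ failed, and no mode set available"),
    re.compile(r"maybe_update_cursor_plane: assertion 'crtc_state_impl' failed"),
)


def count_failure_signatures(journal_lines):
    """Return (page_flip_failures, cursor_plane_assertions) for gnome-shell lines."""
    counts = [0] * len(FAILURE_PATTERNS)
    for line in journal_lines:
        # Only consider messages logged by the gnome-shell process itself.
        if "gnome-shell[" not in line:
            continue
        for i, pattern in enumerate(FAILURE_PATTERNS):
            if pattern.search(line):
                counts[i] += 1
    return tuple(counts)


def looks_like_this_bug(journal_lines):
    """Hypothetical heuristic: both messages appear at least once."""
    page_flips, cursor_asserts = count_failure_signatures(journal_lines)
    return page_flips > 0 and cursor_asserts > 0


SAMPLE = [
    "Jan 29 07:30:17 fedora gnome-shell[1027]: Page flip failed: Page flip of 35 failed, and no mode set available",
    "Jan 29 07:30:17 fedora gnome-shell[1027]: maybe_update_cursor_plane: assertion 'crtc_state_impl' failed",
    "Jan 29 07:30:17 fedora gsd-media-keys[1136]: Failed to grab accelerator for keybinding settings:hibernate",
]

if __name__ == "__main__":
    print(count_failure_signatures(SAMPLE))
    print(looks_like_this_bug(SAMPLE))
```

The lines could come from something like `journalctl -b` output saved to a file and split on newlines; a match only suggests the same symptom, not necessarily the same root cause.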

Comment 1 Adam Williamson 2024-02-07 19:39:30 UTC
Update from jforbes: a temporary revert to work around this is building soon, and we're hoping to have a proper fix later this week or next week.

Comment 2 Adam Williamson 2024-02-07 19:42:52 UTC
Justin also noted that this didn't get picked up sooner because openQA uses an 'unusual' video config that not many other folks are using, so nobody else flagged it.

The config openQA uses is virtio-vga without 3D passthrough. We've used qxl and VGA/std in the past, but had issues with both, and virtio-vga seemed to be the 'currently best supported' option. We cannot practically use 3D passthrough on openQA, because the worker hosts are server hardware with very basic GPUs (Matrox G200, mostly) and run dozens of jobs simultaneously; I am pretty sure 3D passthrough would either not work at all, or cause more problems than it would solve, in that kind of setup.

SUSE uses -device VGA. We could go back to that in Fedora, I guess, but I'm not really sure it's necessarily an improvement. I don't recall exactly what bug we ran into last time we used it, but there definitely was one. (With qxl I think it was 'VTs sometimes get the color scheme wrong for some reason so all our console screenshots don't match', and nobody seemed interested in fixing that).
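
To make the three video options mentioned above concrete, here is a rough sketch of how they map onto qemu `-device` arguments. The helper and the short keys are made up for illustration, and this is not openQA's actual invocation, which isn't shown in this bug; `virtio-vga`, `VGA` and `qxl-vga` are the qemu device names as far as I know.

```python
# Hypothetical short names mapped to qemu display device models.
# None of these enables 3D acceleration.
VIDEO_DEVICES = {
    "virtio": "virtio-vga",  # what Fedora openQA uses, per this comment
    "std": "VGA",            # what SUSE uses, per this comment
    "qxl": "qxl-vga",        # used in the past
}


def video_args(kind):
    """Return the qemu argument pair selecting the given video device."""
    return ["-device", VIDEO_DEVICES[kind]]


if __name__ == "__main__":
    for kind in VIDEO_DEVICES:
        print(kind, " ".join(video_args(kind)))
```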

Comment 3 Javier Martinez Canillas 2024-02-08 09:00:23 UTC
Bilal mentioned that this may be a mutter bug and that the following commit fixes it: https://gitlab.gnome.org/swick/mutter/-/commit/51bc0431079f2f5778c70b7577102d0977769b45.

But that hasn't made it into a released mutter version yet.

Comment 4 Adam Williamson 2024-02-08 15:59:23 UTC
I can test that in a couple of hours after some meetings. Thanks for the heads-up.

Comment 5 Adam Williamson 2024-02-08 22:22:05 UTC
well, testing it is a bit trickier because jforbes sent out a kernel with a workaround, so the current Rawhide does not hit the bug any more. I'm trying to bodge up a test run that uses both a patched mutter and an older kernel now.

Comment 6 Adam Williamson 2024-02-08 22:47:28 UTC
OK, so yeah, I think that mutter commit does fix it. I managed to force a set of tests to run with a patched mutter and older kernel - https://openqa.stg.fedoraproject.org/tests/overview?distri=fedora&version=40&groupid=2&build=Kojitask-113182813_113114834-NOREPORT . You can see on the first thumbnail of _graphical_wait_login_2 in each case - e.g. https://openqa.stg.fedoraproject.org/tests/3550812#step/_graphical_wait_login_2/1 - that it boots kernel-6.8.0-0.rc3.20240207git6d280f4d760e.28.fc40, a kernel which failed the tests when tested alone - https://openqa.fedoraproject.org/tests/overview?distri=fedora&groupid=2&version=40&build=Update-FEDORA-2024-eb02928b46 - and each test passed this time.

I'll backport that mutter patch, and once that's in, we can try a kernel build with the workaround removed, I guess?

Comment 7 Adam Williamson 2024-02-09 00:28:36 UTC
mutter update is done - https://bodhi.fedoraproject.org/updates/FEDORA-2024-31ca0e57d3 - so let's call this fixed for now. If it somehow shows up again, I'll re-open.

