I recently made a big change to the Fedora openQA tests. Generally speaking, they used to mostly run on BIOS, with UEFI variants of some tests. Now, they mostly run on UEFI, with BIOS variants of some tests. As part of this, all the tests that do an encrypted install switched from using BIOS to using UEFI, and they've all started failing quite often. The failure mode is that, when we reach the 'enter passphrase' screen on boot, no keystrokes are echoed back as dots, and on hitting enter after typing the passphrase, nothing appears to happen. The system just stays at the 'enter passphrase' screen, with the text box empty. There seems to be some other odd behaviour which may be related. We don't always get the graphical plymouth, sometimes we get the text one - e.g. https://openqa.fedoraproject.org/tests/2551433#step/disk_guided_encrypted_postinstall/1 , even though that was a graphical install and the runs on the previous and following composes showed the graphical prompt, that run showed the text prompt. Looking at the video, the whole plymouth phase was in 'text' mode, not graphical mode. When we do get the graphical mode, sometimes we get the keyboard layout indicator, and sometimes we don't. Notably, whenever the test fails we seem to not see the layout indicator (but there *are* cases where we don't see the indicator, but the test passes): https://openqa.fedoraproject.org/tests/2560466#step/disk_guided_encrypted_postinstall/1 - passed, indicator visible https://openqa.fedoraproject.org/tests/2563240#step/disk_guided_encrypted_postinstall/1 - failed, indicator not visible https://openqa.fedoraproject.org/tests/2549067#step/disk_guided_encrypted_postinstall/1 - passed, indicator not visible This is definitely associated with the BIOS -> UEFI shift and not anything else; I did it on the staging openQA instance some time earlier than on the production one, and the onset of the problem on each instance coincides exactly with when I did the switch on each instance.
oh, I should mention, openQA uses qemu, directly (not via libvirt). we're using the UEFI firmwares from the edk2 package, just as libvirt does - specifically, /usr/share/edk2/ovmf/OVMF_CODE.secboot.fd and /usr/share/edk2/ovmf/OVMF_VARS.secboot.fd . The openQA workers are running Fedora 39 and have edk2-ovmf-20240214-2.fc39.noarch .
Hmm, something else interesting: I think it may be specific to graphical plymouth, somehow. All the failures I can see for this test are on installs of Workstation, and I think they all are on boots where we got the graphical mode (I can't see any case where we happened to get the text mode, and it failed). Some are fresh installs (the tests for non-English languages - these are always encrypted so we can test decryption with non-US layouts). Some are the test which tests upgrades of an encrypted Workstation install. We have another test - install_simple_encrypted - which does an encrypted install with the minimal package set, so no plymouth theme is installed, so we always get the text mode of plymouth. That test does not seem to have failed one time since being switched to UEFI. Huh, now I look, something else interesting: this actually is only affecting Rawhide. I can't see a case where it's failed on F40.
Hmm, F40 and Rawhide plymouth seem to be identical. So this may actually be a kernel issue, I think. It seems F41 went to 6.9 kernels on March 12, and it seems I did the UEFI transition on openQA staging (and started seeing this bug) around March 25. So there was never a point where I was testing this UEFI path on kernel 6.8 on Rawhide, I don't think. If I get a bit of time, I'll do a kernel 6.9 scratch build for F40 and force an affected test to run on it a few times, that should confirm it's a kernel 6.9 bug somehow.
okay, yeah, that took a surprising amount of time to hack up and I got jumpscared by https://bugzilla.redhat.com/show_bug.cgi?id=2274833 in the middle of it, but it does look like this is caused by kernel 6.9. I forced the encrypted upgrade test to run a bunch of times on a kernel 6.9 scratch build for F40 and reproduced this a few times, e.g. https://openqa.stg.fedoraproject.org/tests/3791243# . I did the same on a random other F40 update (so it was running with a 6.8 kernel) and never hit this.
I'm trying to get some logs for this, but it's hard - I can't find a way to get anything useful logged to a serial console that doesn't cause plymouth to stop displaying entirely, or switch to text mode (which avoids the bug), on the monitor console :( It would technically be possible to bisect this, I guess, but it'd take quite a long time. I might try that next week if nobody has a better idea.
Annoyingly, can't reproduce this on my bare metal test box (which would've made bisecting it somewhat easier). Will now see if I can reproduce it in a VM running on that box, at least.
it's *possible* kernel-6.9.0-0.rc5.20240424git9d1ddab261f3.46.fc41 fixed this (either that or we just got very lucky with the tests in the Fedora-Rawhide-20240425.n.0 compose, which is also a possibility). I'll keep an eye on it.
sigh, nope, looks like we got unusually lucky on prod openQA but stg openQA still hit this on a couple of tests. :(
This is still happening and driving me kinda insane. I'm going to switch the relevant openQA tests to use a BIOS VM for now, if I'm right that it only affects UEFI VMs, that will avoid this constantly causing failures.
Hi, (In reply to Adam Williamson from comment #5) > I'm trying to get some logs for this, but it's hard - I can't find a way to > get anything useful logged to a serial console that doesn't cause plymouth > to stop displaying entirely, or switch to text mode (which avoids the bug), > on the monitor console :( You can put plymouth.ignore-serial-consoles on the kernel command line to make it not change behavior in the presence of serial consoles. so maybe do something like plymouth.ignore-serial-consoles plymouth.debug=stream:/dev/kmsg log_buf_len=1M drm.debug=0x1e printk.devkmsg=on debug rhgb console=ttyS0 and see if that gives any useful output
it might also be useful to see if plymouth.use-legacy-input affects things (that will make it read input from /dev/tty instead of /dev/input)
thanks a lot, I will try those and see how it goes.
well, good news is I can reproduce this quite easily in a VM running on my main system, which makes testing a bit easier. plymouth.use-legacy-input does not appear to affect things at all. I'm attaching logs from both 'good' and 'bad' cases with the debugging params requested. In both cases, I let the boot reach the passphrase entry screen, pressed a single key to see if it was echoed, then shut the machine down.
Created attachment 2035605 [details] "good" log with debugging (single typed character was echoed)
Created attachment 2035606 [details] "bad" log with debugging (single typed character was NOT echoed)
Hmm. In the "bad" log, I see: [ 2.367104] 00:00:02.284 ../src/libply-splash-core/ply-keyboard.c:376:ply_keyboard_watc: Watching for keyboard input from renderer [ 2.511072] 00:00:02.428 ../src/libply-splash-core/ply-keyboard.c:376:ply_keyboard_watc: Watching for keyboard input from renderer [ 2.521590] 00:00:02.439 ../src/libply-splash-core/ply-keyboard.c:395:ply_keyboard_stop: No longer watching for keyboard input from renderer in the "good" log, I see: [ 2.206362] 00:00:02.117 ../src/libply-splash-core/ply-device-manager.c:683:watch_for_u: watching for udev graphics device and input device add and remove events [ 2.207095] 00:00:02.117 ../src/libply-splash-core/ply-device-manager.c:472:create_devi: Not creating devices for subsystem input because there is no configure XKB layout [ 2.560566] 00:00:02.471 ../src/libply-splash-core/ply-keyboard.c:376:ply_keyboard_watc: Watching for keyboard input from renderer On another "good" log, I can see this: [ 2.194886] 00:00:02.110 ../src/libply-splash-core/ply-keyboard.c:376:ply_keyboard_watc: Watching for keyboard input from renderer [ 2.212097] 00:00:02.127 ../src/libply-splash-core/ply-keyboard.c:395:ply_keyboard_stop: No longer watching for keyboard input from renderer [ 2.277950] 00:00:02.193 ../src/libply-splash-core/ply-keyboard.c:376:ply_keyboard_watc: Watching for keyboard input from renderer And in another "bad" log: [ 3.267894] 00:00:03.159 ../src/libply-splash-core/ply-keyboard.c:376:ply_keyboard_watc: Watching for keyboard input from renderer [ 3.404570] 00:00:03.295 ../src/libply-splash-core/ply-keyboard.c:376:ply_keyboard_watc: Watching for keyboard input from renderer [ 3.409428] 00:00:03.300 ../src/libply-splash-core/ply-keyboard.c:395:ply_keyboard_stop: No longer watching for keyboard input from renderer so...it kinda looks like the occurrence of these messages can vary a bit, but when the bug happens, the *last* message of this type is the "No longer watching" one...
looks like that comes from `ply_keyboard_stop_watching_for_renderer_input`, which is called by `ply_keyboard_stop_watching_for_input`. That's only called by two things - `ply_keyboard_free` and `ply_device_manager_deactivate_keyboards` - and I don't see the "deactivating keyboards" message in the logs, which suggests we're getting it via `ply_keyboard_free`, not `ply_device_manager_deactivate_keyboards`. `ply_keyboard_free` is only called by `free_keyboards_for_renderer`, which is only called by `free_devices_from_device_path`, which is called by `on_udev_event` and `free_renderer`. On the `free_renderer` path I think we should see a message "freeing device manager", which I don't see, so I think we're in the `on_udev_event` path. So...a udev event? Here's some selective grepping from the "bad" case and the two somewhat different "good" cases: "bad" case [ 2.146152] 00:00:02.063 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got add event for device /dev/dri/card0 [ 2.367104] 00:00:02.284 ../src/libply-splash-core/ply-keyboard.c:376:ply_keyboard_watc: Watching for keyboard input from renderer [ 2.437989] 00:00:02.355 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got change event for device /dev/dri/card1 [ 2.438654] 00:00:02.356 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got add event for device /dev/dri/renderD128 [ 2.439282] 00:00:02.356 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got remove event for device /dev/dri/card0 [ 2.511072] 00:00:02.428 ../src/libply-splash-core/ply-keyboard.c:376:ply_keyboard_watc: Watching for keyboard input from renderer [ 2.521590] 00:00:02.439 ../src/libply-splash-core/ply-keyboard.c:395:ply_keyboard_stop: No longer watching for keyboard input from renderer [ 2.526490] 00:00:02.443 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got add event for device /dev/dri/card1 [ 2.527566] 00:00:02.445 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got change event for device /dev/dri/card1 [ 2.528116] 00:00:02.445 ../src/libply-splash-core/ply-device-manager.c:659:on_udev_eve: ignoring duplicate change event for device /dev/dri/card1 "good" case with only one "watching" message [ 2.308889] 00:00:02.219 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got add event for device /dev/dri/card0 [ 2.415579] 00:00:02.326 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got change event for device /dev/dri/card1 [ 2.560566] 00:00:02.471 ../src/libply-splash-core/ply-keyboard.c:376:ply_keyboard_watc: Watching for keyboard input from renderer [ 2.571420] 00:00:02.482 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got remove event for device /dev/dri/card0 [ 2.572004] 00:00:02.482 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got add event for device /dev/dri/renderD128 [ 2.572596] 00:00:02.483 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got add event for device /dev/dri/card1 [ 2.573172] 00:00:02.483 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got change event for device /dev/dri/card1 [ 2.573678] 00:00:02.484 ../src/libply-splash-core/ply-device-manager.c:659:on_udev_eve: ignoring duplicate change event for device /dev/dri/card1 "good" case with "watching", then "not watching", then "watching" again [ 2.344184] 00:00:02.255 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got add event for device /dev/dri/card0 [ 2.584829] 00:00:02.496 ../src/libply-splash-core/ply-keyboard.c:376:ply_keyboard_watc: Watching for keyboard input from renderer [ 2.594412] 00:00:02.505 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got remove event for device /dev/dri/card0 [ 2.597073] 00:00:02.508 ../src/libply-splash-core/ply-keyboard.c:395:ply_keyboard_stop: No longer watching for keyboard input from renderer [ 2.604626] 00:00:02.516 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got change event for device /dev/dri/card1 [ 2.605417] 00:00:02.516 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got add event for device /dev/dri/renderD128 [ 2.606264] 00:00:02.517 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got add event for device /dev/dri/card1 [ 2.607066] 00:00:02.518 ../src/libply-splash-core/ply-device-manager.c:659:on_udev_eve: ignoring duplicate add event for device /dev/dri/card1 [ 2.607963] 00:00:02.519 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got change event for device /dev/dri/card1 [ 2.608769] 00:00:02.520 ../src/libply-splash-core/ply-device-manager.c:659:on_udev_eve: ignoring duplicate change event for device /dev/dri/card1 [ 2.770143] 00:00:02.681 ../src/libply-splash-core/ply-keyboard.c:376:ply_keyboard_watc: Watching for keyboard input from renderer does this illuminate anything?
Ray says "it's 100% a plymouth bug", so re-assigning. It seems like maybe there's some change in 6.9 regarding the timing of these video device add/remove events on boot which happens to trigger this, but the bug fundamentally is in plymouth.
Just for reference, this is how it looks when booting a 6.8 kernel: [ 2.481750] 00:00:02.402 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got add event for device /dev/dri/card0 [ 2.506811] 00:00:02.427 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got change event for device /dev/dri/card1 [ 2.507472] 00:00:02.428 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got add event for device /dev/dri/renderD128 [ 2.741805] 00:00:02.662 ../src/libply-splash-core/ply-keyboard.c:376:ply_keyboard_watc: Watching for keyboard input from renderer [ 2.765244] 00:00:02.686 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got add event for device /dev/dri/card1 [ 2.765796] 00:00:02.686 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got change event for device /dev/dri/card1 [ 2.766365] 00:00:02.687 ../src/libply-splash-core/ply-device-manager.c:659:on_udev_eve: ignoring duplicate change event for device /dev/dri/card1 [ 2.766987] 00:00:02.688 ../src/libply-splash-core/ply-device-manager.c:642:on_udev_eve: got remove event for device /dev/dri/card0
i think this is likely fixed by two of the commits on this merge request: https://gitlab.freedesktop.org/plymouth/plymouth/-/merge_requests/323/
(In reply to Ray Strode [halfline] from comment #20) > i think this is likely fixed by two of the commits on this merge request: > > https://gitlab.freedesktop.org/plymouth/plymouth/-/merge_requests/323/ Yes, looking at the bad log this is exactly the issue which the "ply-device-manager: Remove simpledrm renderers before adding normal drm renderers" from that merge-request fixes. But what is weird here is that there is no "plymouth,use-simpledrm" on the kernel commandline. It seems that the simpledrm check in src/libply-splash-core/ply-device-manager.c is not identifying the simpledrm in the EFI virtual-machine case as being simpledrm, or maybe the check there broke in kernel 6.9 ? Either way the fixes from the plymouth merge-request should fix this and I was planning on bringing those to F40+ (with auto-karma pushing disabled for F40) anyways to fix bug 2183743. So a plymouth build with these fixes should show up in rawhide soonish. Adam, please let us know if this new build fixes things.
p.s. Note that the fixes also change the simpledrm check in plymouth's device-manager it now checks for a suffix of "simple-framebuffer.0/drm/card0" for the udev-device's syspath and the bad log shows: [ 2.133217] 00:00:02.050 ../src/libply-splash-core/ply-device-manager.c:497:create_devi: found device /sys/devices/pci0000:00/0000:00:01.0/simple-framebuffer.0/drm/card0 So the new check should work in the openQA EFI virtual-machine case from this bug too now. Which means that to fix this we don't need the fix which I mentioned, just the new check to identify simpledrm /dev/dri/card# devices should be enough.
Thanks, I will test it.
FEDORA-2024-aed7767b59 (plymouth-24.004.60-12.fc41) has been submitted as an update to Fedora 41. https://bodhi.fedoraproject.org/updates/FEDORA-2024-aed7767b59
FEDORA-2024-7d7cf3d6f4 (plymouth-24.004.60-12.fc40) has been submitted as an update to Fedora 40. https://bodhi.fedoraproject.org/updates/FEDORA-2024-7d7cf3d6f4
FEDORA-2024-aed7767b59 (plymouth-24.004.60-12.fc41) has been pushed to the Fedora 41 stable repository. If problem still persists, please make note of it in this bug report.
FEDORA-2024-7d7cf3d6f4 has been pushed to the Fedora 40 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-7d7cf3d6f4` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-7d7cf3d6f4 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
One affected test that's still on UEFI has not failed for three days, which is a good sign. I'm gonna flip several other affected tests back to UEFI to check further (I previously flipped them to BIOS to workaround this bug).
OK, yep, this definitely looks fixed. Thanks.
FEDORA-2024-7d7cf3d6f4 (plymouth-24.004.60-12.fc40) has been pushed to the Fedora 40 stable repository. If problem still persists, please make note of it in this bug report.
After applying this plymouth update that just went to stable, all the Nvidia based workstations that I have (with output directly connected to the Nvidia card) never show plymouth. This includes never reaching the password prompt for decrypting the disk), the screen just stays black. Systems are booted with "rd.driver.blacklist=nouveau modprobe.blacklist=nouveau" and with "options nvidia-drm modeset=1 fbdev=1" in the module options. Downgrading plymouth solved the issue, I can use them normally again. Do we need to reintroduce initcall_blacklist=simpledrm_platform_driver_init for the Nvidia proprietary driver?
Just some more information. The desktop workstation have a discrete Nvidia GPU and an integrated AMD one, not connected to any output but not disabled. Since I could not access Grub (wasn't the shift key to bring up the hidden Grub menu?), I attached a monitor to the motherboard HDMI port, and to my surprise the luks password prompt via plymouth was there. Long story short, after a few tests, with the previous Plymouth update, Plymouth is displayed on the connected output, with the new update, it's displayed on the stand by AMD GPU with no monitor attached, By going to the bios of a system and disabling the integrated GPU (thus leaving only the Nvidia one available), Plymouth is displayed correctly.
This is now broken on Fedora 39, because kernel 6.9 has reached stable there. This is causing openQA upgrade tests from F39 to F40 to fail, e.g. https://openqa.fedoraproject.org/tests/2697047 . Can we please have the fix backported to F39? It looks like it's not a trivial backport, unfortunately :(
(In reply to Adam Williamson from comment #33) > This is now broken on Fedora 39, because kernel 6.9 has reached stable > there. This is causing openQA upgrade tests from F39 to F40 to fail, e.g. > https://openqa.fedoraproject.org/tests/2697047 . Can we please have the fix > backported to F39? It looks like it's not a trivial backport, unfortunately > :( Yeah I think we might just as well sync the F39 plymouth to F40, the new F40 package has seen a reasonable amount of testing and a backport is involved enough that chances are that the backport might actually introduce some (other) issues. So I think that syncing the F39 packages to what is in F40 is best. @rstrode any objections against syncing the F39 plymouth to what is in F40 now?
no objection
FEDORA-2024-442827752b (plymouth-24.004.60-12.fc39) has been submitted as an update to Fedora 39. https://bodhi.fedoraproject.org/updates/FEDORA-2024-442827752b
FEDORA-2024-442827752b has been pushed to the Fedora 39 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-442827752b` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-442827752b See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
FEDORA-2024-442827752b (plymouth-24.004.60-12.fc39) has been pushed to the Fedora 39 stable repository. If problem still persists, please make note of it in this bug report.