This has been going on for a while but I only really got around to looking into it last week. On aarch64 in openQA, all tests that require boot-time decryption of an encrypted device are currently failing. The same tests pass on x86_64. From monitoring tests running live, it looks to me like every time openQA "presses a key", *two* characters are echoed to the console. We end up with twice as many asterisks as are actually in the passphrase. I've been trying to get an image built with a patched systemd that will echo the actual typed characters (not asterisks) to confirm this, but didn't get it to work yet. I *suspect* it might be ingesting the keypress once via the console displayed on the screen and once via the serial console, because I can see from the test logs that the entry prompt was displayed on both. I'm not sure yet if this is a problem in real-world deployments (hardware and/or virtual) or if it's somehow specific to openQA. It would be good if folks with aarch64 setups handy can test encrypted installs.
Note that on aarch64 we always seem to see the prompt on a kinda 'raw' text console - it's light grey on black, has a bunch of systemd status messages, then the "Please enter passphrase for disk" prompt. On x86_64 we get a fancier prompt from, I guess, Plymouth - either a graphical one (if the installed package set included the Plymouth graphical bits, I guess) or a prompt that's in bright white on a grey background with no other messages visible. I don't know why this difference happens, or whether it's significant.
Best I can tell, this broke in Fedora-Rawhide-20240105.n.1 (ugh, I know). For some reason I have no "Fedora rawhide compose report" email for that compose, which is extremely annoying. That email would tell us what changed in it. But from koji tag history, I *think* probably plymouth-24.004.60-1.fc40 and kernel-6.7.0-0.rc8.20240104gitac865f00af29.64.fc40 landed in that compose, up from plymouth-23.360.11-1.fc40 and kernel-6.7.0-0.rc8.61.fc40 . It doesn't look like the kernel, dracut or systemd changed.
Do you have the openQA link, this sounds like a case of duplicate consoles. Ex: earlycon + console= on a virtual serial port. That is at least the easy way to get double output, particularly with preserve boot console.
yes, that's what I meant by "I *suspect* it might be ingesting the keypress once via the console displayed on the screen and once via the serial console". An example of a failed test is https://openqa.fedoraproject.org/tests/3687227 , there are about a dozen in every compose (because quite a few tests use encryption for one reason or another, e.g. the language tests encrypt the disk to test decryption outside of US English works OK). In openQA we explicitly add `console=tty0` to the boot args for the installed system because of https://bugzilla.redhat.com/show_bug.cgi?id=1661288 - I found that if we don't do that, we don't see any boot messages on the display, but we want to see them on the display so we can match things like the passphrase entry prompt, greenboot status message etc.
oh, and yes, openQA sets up quite a few consoles: /usr/bin/qemu-system-aarch64 -device virtio-gpu-pci,edid=on,xres=1024,yres=768 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -device intel-hda -device hda-output,audiodev=snd0 -m 4096 -machine virt,gic-version=max -cpu host -netdev user,id=qanet0,net=172.16.2.0/24 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -device nec-usb-xhci -device usb-tablet -device usb-kbd -smp 2 -enable-kvm -no-shutdown -vnc :117,share=force-shared -device virtio-serial -chardev pipe,id=virtio_console,path=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev pipe,id=virtio_console_user,path=virtio_console_user,logfile=virtio_console_user.log,logappend=on -device virtconsole,chardev=virtio_console_user,name=org.openqa.console.virtio_console_user -chardev socket,path=qmp_socket,server=on,wait=off,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-file,filename=/var/lib/openqa/pool/27/raid/hd0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0,file=hd0-file,cache.no-flush=on,discard=unmap -device virtio-blk,id=hd0-device,drive=hd0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/pool/27/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on,discard=unmap -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0 -drive id=pflash-code-overlay0,if=pflash,file=/var/lib/openqa/pool/27/raid/pflash-code-overlay0,unit=0,readonly=on -drive id=pflash-vars-overlay0,if=pflash,file=/var/lib/openqa/pool/27/raid/pflash-vars-overlay0,unit=1 there's serial0, and then there's virtio_console and virtio_console_user . You get logs of all of these in the Logs & Assets tab - serial0.txt is the log of serial0, serial_terminal.txt is the log of virtio_console, serial_terminal_user.txt is the log of virtio_console_user (all IIRC). the files get renamed along the way from the name that's used for qemu, I don't entirely remember/know why that is.
serial0.txt logs the passphrase entry prompt, and we also see it on the screen. the other two console log files don't show any trace of it. So I think it "appears" on the screen and on one serial console. All the above details are the same on x86_64, where this bug doesn't happen. The obvious things that are *different* between the two are: 1. we don't pass console=tty0 on x86_64 2. on x86_64 we see a 'plymouth-ified' passphrase entry prompt I suppose it might be interesting to see if this bug reproduces on x86_64 if we pass an explicit console=tty0 and/or disable plymouth...
I wonder if its this: https://src.fedoraproject.org/rpms/plymouth/blob/rawhide/f/0001-Add-UseSimpledrmNoLuks-config-file-keyword.patch
(not the duplicate chars but the comments about plymouth not appearing)
I setup a local openqa instance and hit this too. Its an openQA bug, which I've hacked for the time being.
oh, thanks a lot for looking into it. How do you mean it's an openQA bug exactly? How did you hack it? Can you report it upstream to https://progress.opensuse.org/ ? Thanks!
I know the author of the original change, and have pinged him about it, but timezones. I should know more tomorrow, but at the moment my 'hack' seems to be working fine and if needed I will do a PR.
Jeremy said that his 'hack' was to entirely remove the delay in os-autoinst's VNC backend between sending a key down event and a key up event, here: https://github.com/os-autoinst/os-autoinst/blob/ac08b5a48a5624eeacf69ba31271ee4e7c9308d1/consoles/VNC.pm#L691 From my knowledge of that code, this is all essentially about implementing a 'speed limit' on openQA's typing. Long ago openQA devs discovered that if we just send key press events as fast as we can, it can result in unreliable typing - events get lost along the way, whether at the qemu level or the guest kernel level or the guest userspace, it doesn't really matter. So by default, when you 'press keys' or 'type strings' in openQA (with the VNC backend), it aims to send 30 key presses per second, no more. There are various levels of control over this. There's a TYPING_LIMIT_DEFAULT of 30 (which is divided into 1), which can be overridden by the TYPING_LIMIT variable (which allows you to control the *maximum* typing speed at the test level). Then there's the `max_interval` arg for `type_string`, which can be used to type *slower* than the limit (but not faster). The value passed is square rooted then divided into 1, and the result is *added* to the 1 / TYPING_LIMIT(_DEFAULT) calculation. So if you set TYPING_LIMIT to 15 and then call `type_string` with a `max_interval` value of 10, this calculation is done: (1 / 10) + (1 / sqrt(10)) which comes out to about 0.4. This value is called $seconds_per_keypress. The type_string function then kinda splits it into three chunks. It expects a quarter of that time to be spent "pressing the key", a quarter "releasing the key", and a half "searching the next key" - see https://github.com/os-autoinst/os-autoinst/blob/ac08b5a48a5624eeacf69ba31271ee4e7c9308d1/consoles/video_base.pm#L56 . So it calls `$self->send_key_event($letter, $seconds_per_keypress * 0.25);` , and that ultimately gets us into the `map_and_send_key` function with the `$press_release_delay` value set to this "a quarter of about 0.4" value. `map_and_send_key` then sleeps for that long after the key down event, and the same amount of time after the key up event. Now I write it out, you can kinda see what the problem likely is: when we want to type very slowly, we wind up "holding the key down" for quite a long time. In Fedora's tests we have a `type_very_safely` convenience function which sets `max_interval` to 1, so we wind up with a seconds_per_keypress of 1.03recurring, which means we're holding each key down for a quarter of a second. That's a pretty long time! The amount of time you need to hold a key down before you start getting repeated inputs is going to depend on whatever's currently in charge of turning key events into characters in user space, I think, but 0.25 seconds definitely feels like it *could* trigger it. And we do, indeed, use type_very_safely to type the encryption passphrase. I did a quick dumb test to confirm this: I changed staging to use `type_safely` instead of `type_very_safely` for entering the encryption passphrase and restarted an aarch64 encrypted install test. Lo and behold, it passed. So that strongly confirms that the variable typing delays are the problem here. I'm gonna commit that change and send it to prod too as a short term bodge; if it causes failures on x86_64 or something we can always tweak it to use different delays on x86_64 and aarch64 or whatever. Medium term I came up with a patch to hardcode the sleep after key down at 0.008 seconds, which is the delay you get with 'all default' values - TYPING_LIMIT at 30 and no max_interval - so it's been used for a whole ton of typing across all the world's openQA instances. Jeremy advocates just removing that key down delay entirely but I feel conservative about it. :D The code is tweaked so that the rest of the delay required to achieve the intended typing speed is put into the sleep after key up. I'm going to build that and deploy it on staging (and, I guess, go back to type_very_safely on staging) to see how it behaves. Optimistically speaking, it *might* even make `type_very_safely` typing *better* in other cases where we use it (like typing in anaconda) by avoiding unintended key repeats.
"So if you set TYPING_LIMIT to 15 and then call `type_string` with a `max_interval` value of 10, this calculation is done:" should read: "So if you set TYPING_LIMIT to 10 and then call `type_string` with a `max_interval` value of 10, this calculation is done:" d'oh.
Right, generally default keyboard repeat rate is configured between ~1000-250ms, so the sleep is riding past the edge of the 250ms default keyboard repeat sometimes used for consoles and should never be longer than 100ms. It possible the use of arm's lower frequency scheduler clocks/NO_HZ/event merging/whatever are pushing it over the edge a little vs x86 in openQA, but that isn't a bug in the kernel/whatever its a bug in openqa to assume that sleep's are going to be exactly the timeout requested on linux. Frankly at the delay rates here, it looks more like an x86 bug for _not_ repeating. I can't really imagine a case for VNC where the keydown->keyup should have a delay vs just delaying between keystrokes unless someone is trying to test rollover + repeat, in which case the next key should be getting pressed before the keyup event. Given that the commands are being reliably delivered nothing in the VNC+emulation+kernel should be dropping these events anywhere near these speeds, serial in UEFI+grub, maybe. In fact given how fast some people can type, we should really be uncovering bugs caused by application layers dropping events that are happening faster than 20ms. There is something further going on though, because there is a lot of variation in how fast its appears to be typing with the overall rate being ~1 CPS. I can also see how there may have been issues over serial ports when they wern't at least emulating 16550's because just about no one is doing HW flow control either so throttling the CPS in those cases is basically required. (My personal machine's keyboard repeat is set below 200ms with an extremely fast repeat because i tend to use arrow keys for navigation, and it keeps my fingers from getting lazy and sleeping on the keys).
"Repeat delay" not rate.
I have my proposed fix for this deployed on staging now (and staging reverted to use `type_very_safely` for passphrases; prod is using `type_safely` to work around the problem for now). I'm gonna leave it like this over the weekend and see how it goes.
FEDORA-2025-18b79484b7 (os-autoinst-5^20250707gitd55ec72-6.fc42) has been submitted as an update to Fedora 42. https://bodhi.fedoraproject.org/updates/FEDORA-2025-18b79484b7
FEDORA-2025-b46f25ca26 (os-autoinst-5^20250707gitd55ec72-6.fc43) has been submitted as an update to Fedora 43. https://bodhi.fedoraproject.org/updates/FEDORA-2025-b46f25ca26
FEDORA-2025-b46f25ca26 has been pushed to the Fedora 43 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2025-b46f25ca26` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2025-b46f25ca26 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
FEDORA-2025-18b79484b7 has been pushed to the Fedora 42 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2025-18b79484b7` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2025-18b79484b7 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
The proposed fix seems to have been behaving fine, so I've done an official build and sent it to testing. I'll update prod to it tomorrow.
FEDORA-2025-b46f25ca26 (os-autoinst-5^20250707gitd55ec72-6.fc43) has been pushed to the Fedora 43 stable repository. If problem still persists, please make note of it in this bug report.
FEDORA-2025-18b79484b7 (os-autoinst-5^20250707gitd55ec72-6.fc42) has been pushed to the Fedora 42 stable repository. If problem still persists, please make note of it in this bug report.