Bug 1658676 - Typing '<' more than once (with the specific keypress sequence used by os-autoinst) breaks with usb-kbd and VNC
Summary: Typing '<' more than once (with the specific keypress sequence used by os-aut...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: qemu
Version: rawhide
Hardware: All
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Fedora Virtualization Maintainers
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-12-12 16:21 UTC by Michel Normand
Modified: 2018-12-23 02:52 UTC (History)
13 users (show)

Fixed In Version: qemu-3.0.0-3.fc29
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-12-23 02:52:42 UTC


Attachments (Terms of Use)
tgtd_all_xe_journalctl_Id19667.log (185.00 KB, text/plain)
2018-12-12 16:21 UTC, Michel Normand
no flags Details
bad_openqa.conf.png (121.69 KB, image/png)
2018-12-13 10:58 UTC, Michel Normand
no flags Details

Description Michel Normand 2018-12-12 16:21:15 UTC
Created attachment 1513713 [details]
tgtd_all_xe_journalctl_Id19667.log

Description of problem:
tgtd service failed to start on f29 while rawhide openQA test

problem identified in openQA ppc64le test:
https://openqa.stg.fedoraproject.org/tests/418385#step/_support_server/21

I did a retry on my own openQA server increasing openQA timeout from 90s to 360s, with  still failure.
The extracted journalctl output:
===
Dec 12 15:27:36 support.domain.local systemd[1]: Starting tgtd iSCSI target daemon...
-- Subject: Unit tgtd.service has begun start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit tgtd.service has begun starting up.
Dec 12 15:27:36 support.domain.local tgtd[1306]: tgtd: iser_ib_init(3431) Failed to initialize RDMA; load kernel modules?
Dec 12 15:27:36 support.domain.local tgtd[1306]: tgtd: work_timer_start(146) use timer_fd based scheduler
Dec 12 15:27:36 support.domain.local tgtd[1306]: tgtd: bs_init(387) use signalfd notification
Dec 12 15:27:41 support.domain.local tgt-admin[1309]: Config::General: Block "<target>" has no EndBlock statement (level: 2, chunk 5)!
Dec 12 15:27:41 support.domain.local tgt-admin[1309]:  at /usr/sbin/tgt-admin line 165.
Dec 12 15:27:41 support.domain.local systemd[1]: tgtd.service: Control process exited, code=exited status=255
Dec 12 15:28:32 support.domain.local chronyd[983]: Selected source 85.199.214.100
Dec 12 15:29:12 support.domain.local systemd[1]: tgtd.service: State 'stop-sigterm' timed out. Killing.
Dec 12 15:29:12 support.domain.local systemd[1]: tgtd.service: Killing process 1306 (tgtd) with signal SIGKILL.
Dec 12 15:29:12 support.domain.local systemd[1]: tgtd.service: Main process exited, code=killed, status=9/KILL
Dec 12 15:29:12 support.domain.local systemd[1]: tgtd.service: Failed with result 'exit-code'.
Dec 12 15:29:12 support.domain.local systemd[1]: Failed to start tgtd iSCSI target daemon.
-- Subject: Unit tgtd.service has failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit tgtd.service has failed.
-- 
-- The result is failed.
===

Version-Release number of selected component (if applicable):
scsi-target-utils-1.0.70-4.fc28.ppc64le.rpm in f29 release.

Comment 1 Michel Normand 2018-12-12 16:34:44 UTC
the issued command and output extracted from openQA captured image is:
=== 
# systemctl restart tgtd.service ; echo ~V_Xm-$?- > /dev/hvc0 
Job for tgtd.service failed because the control process exited with error code. 
See "systemctl status tgtd.service" and "journalctl -xe" for details. 
=== 
~V_Xm-1- <= this is the echo reported screen
===

Comment 2 Michel Normand 2018-12-13 10:58:13 UTC
Created attachment 1513981 [details]
bad_openqa.conf.png

adding a cat /etc/tgt/conf.d/openqa.conf in tests/_support_server.pm confirms failure is in the generated openqa.conf file: the expected trailing </target> is reported as /target>
===
    # start up iscsi target
    assert_script_run "printf '<target iqn.2016-06.local.domain:support.target1>\n    backing-store /dev/vdb\n    incominguser test weakpassword\n</target>' > /etc/tgt/conf.d/openqa.conf";
+   assert_script_run "cat /etc/tgt/conf.d/openqa.conf";
    # open firewall port
    assert_script_run "firewall-cmd --add-service=iscsi-target";
===

So need to understand why missing '<' character

Comment 3 Michel Normand 2018-12-13 16:53:26 UTC
still not yet understood, will continue with different syntax.

Comment 4 Adam Williamson 2018-12-13 17:02:14 UTC
So, the way this is generated is pretty simple, just using printf:

assert_script_run "printf '<target iqn.2016-06.local.domain:support.target1>\n    backing-store /dev/vdb\n    incominguser test weakpassword\n</target>' > /etc/tgt/conf.d/openqa.conf";

and this is not failing on x86_64, AFAICS. Is it happening reliably, every time, on ppc64le? I'd suspect a typing error, but it seems unlikely the exact same typing error would happen every time.

I wonder what happens if you hack the target string to be *shorter*?

Comment 5 Michel Normand 2018-12-13 18:10:22 UTC
same problem with shorter string:
===
    script_run "printf '=== create /tmp/openqa.conf0 with script_run\n'";
    script_run "printf '<target iqn.2016-06.local.domain:support.target1>\n    backing-store /dev/vdb\n    incominguser test weakpassword\n</target>' > /tmp/openqa.conf0";
    script_run "cat /tmp/openqa.conf0; printf '\n===\n'";
===
    script_run "printf '=== create /tmp/openqa.conf0 with two script_run\n'";
    script_run "printf '<target iqn.2016-06.local.domain:support.target1>\n    backing-store /dev/vdb\n    incominguser test weakpassword\n' > /tmp/openqa.conf0";
    script_run "</target>' >> /tmp/openqa.conf0";
    script_run "cat /tmp/openqa.conf0; printf '\n===\n'";
===

same problem replacing script_run by type_string:
===
    script_run "printf '=== create /tmp/openqa.conf3 with type_string two times and backslash before single quote on 2nd line\n'";
    type_string "printf '<target iqn.2016-06.local.domain:support.target1>\n    backing-store /dev/vdb\n    incominguser test weakpassword\n' > /tmp/openqa.conf3\n";
    type_string "printf \'</target>\' >> /tmp/openqa.conf3\n";
    script_run "cat /tmp/openqa.conf3; printf '\n===\n'";
===

Comment 6 Adam Williamson 2018-12-13 18:13:27 UTC
Huh. Did you watch the test and check if the < is actually included when typing the printf command or not? Can you reproduce the same behaviour manually?

Comment 7 Michel Normand 2018-12-13 18:35:07 UTC
(In reply to Adam Williamson from comment #6)
> Huh. Did you watch the test and check if the < is actually included when
> typing the printf command or not? Can you reproduce the same behaviour
> manually?

The < is not included in the printf command as already shown in previous attachment https://bugzilla.redhat.com/attachment.cgi?id=1513981

I have a guest on another machine and do not reproduce problem manually typing in default console.
I will try with a guest on same machine as openQA workers. (will do that monday as off tomorow)

Comment 8 Adam Williamson 2018-12-14 00:09:39 UTC
aarch64 seems to have the same issue:

https://openqa.stg.fedoraproject.org/tests/420580#step/_support_server/23

very weird. Not sure what's going on.

Comment 9 Adam Williamson 2018-12-14 00:13:07 UTC
Note, this seems to have started going wrong on exactly the date I updated the whole staging deployment from F28 to F29 and from the old openQA / os-autoinst we were on before (from about 2018-02) to a snapshot from November...so obvious suspects are the newer os-autoinst, and newer qemu...

Comment 10 Adam Williamson 2018-12-14 00:13:41 UTC
qemu went from 2.11.2 to 3.0.0 in the upgrade.

Comment 11 Adam Williamson 2018-12-18 07:40:56 UTC
So I poked around at this a bit by hacking up the _post_network-static test on the staging instance to do wacky stuff if the arch is aarch64 (so all TAP aarch64 tests on staging are currently garbage, sorry about that). It really seems like just the act of typing a single '<' screws up the VM, somehow. If you do:

type_string "<<<<<<";

it seems like you get *one* <, then the others come out as >. I got other weird results for basically any attempt to type anything with more than one '<' in it. I'm still not sure what's going on, though. It's almost certainly not openQA. It possibly *could* be os-autoinst, but I'm having trouble seeing how that could be arch-specific. For now I'm gonna throw it at qemu just in case the qemu devs have any idea what the hell is going on.

Comment 12 Cole Robinson 2018-12-18 14:55:35 UTC
Can you provide the qemu command line?
Adam you mention aarch64 but initial comment is about ppc64, is it affecting all archs or just some?
Trying to reproduce qemu 3.1 from virt-preview may help narrow it down

Comment 13 Adam Williamson 2018-12-18 16:13:44 UTC
It's affecting aarch64 and ppc64 but *not* x86_64. Those are all the arches we currently run the affected test on, do not know the status of any others.

Here are the command lines for tests on the same Rawhide compose on the three arches.

x86_64 (UNAFFECTED): /usr/bin/qemu-system-x86_64 -vga std -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -soundhw ac97 -global isa-fdc.driveA= -m 2048 -cpu Nehalem -device virtio-rng-pci -netdev tap,id=qanet0,ifname=tap7,script=no,downscript=no -device virtio-net,netdev=qanet0,mac=52:54:00:12:00:46 -boot order=c,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 2 -enable-kvm -no-shutdown -vnc :98,share=force-shared -device virtio-serial -chardev socket,path=virtio_console,server,nowait,id=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/openqa/pool/8/raid/hd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on -device virtio-blk,id=hd0-device,drive=hd0-overlay0,bootindex=0,serial=hd0 -blockdev driver=file,node-name=hd1-file,filename=/var/lib/openqa/pool/8/raid/hd1,cache.no-flush=on -blockdev driver=qcow2,node-name=hd1,file=hd1-file,cache.no-flush=on -device virtio-blk,id=hd1-device,drive=hd1,serial=hd1 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/pool/8/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0

aarch64 (AFFECTED): /usr/bin/qemu-system-aarch64 -device virtio-gpu-pci -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -soundhw ac97 -m 3072 -machine virt -cpu host -device virtio-rng-pci -netdev tap,id=qanet0,ifname=tap1,script=no,downscript=no -device virtio-net,netdev=qanet0,mac=52:54:00:12:00:30 -device nec-usb-xhci -device usb-tablet -device usb-kbd -smp 2 -enable-kvm -no-shutdown -vnc :92,share=force-shared -device virtio-serial -chardev socket,path=virtio_console,server,nowait,id=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/openqa/pool/2/raid/hd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on -device virtio-blk,id=hd0-device,drive=hd0-overlay0,bootindex=0,serial=hd0 -blockdev driver=file,node-name=hd1-file,filename=/var/lib/openqa/pool/2/raid/hd1,cache.no-flush=on -blockdev driver=qcow2,node-name=hd1,file=hd1-file,cache.no-flush=on -device virtio-blk,id=hd1-device,drive=hd1,serial=hd1 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/pool/2/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0 -drive id=pflash-code-overlay0,if=pflash,file=/var/lib/openqa/pool/2/raid/pflash-code-overlay0,unit=0,readonly=on -drive id=pflash-vars-overlay0,if=pflash,file=/var/lib/openqa/pool/2/raid/pflash-vars-overlay0,unit=1

ppc64le (AFFECTED): /usr/bin/qemu-system-ppc64 -g 1024x768 -vga std -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -soundhw ac97 -global isa-fdc.driveA= -m 4096 -machine usb=off -cpu host -device virtio-rng-pci -netdev tap,id=qanet0,ifname=tap1,script=no,downscript=no -device virtio-net,netdev=qanet0,mac=52:54:00:12:00:1e -boot order=c,menu=on,splash-time=5000 -device nec-usb-xhci -device usb-tablet -device usb-kbd -smp 1 -enable-kvm -no-shutdown -vnc :92,share=force-shared -device virtio-serial -chardev socket,path=virtio_console,server,nowait,id=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/openqa/pool/2/raid/hd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on -device virtio-blk,id=hd0-device,drive=hd0-overlay0,bootindex=0,serial=hd0 -blockdev driver=file,node-name=hd1-file,filename=/var/lib/openqa/pool/2/raid/hd1,cache.no-flush=on -blockdev driver=qcow2,node-name=hd1,file=hd1-file,cache.no-flush=on -device virtio-blk,id=hd1-device,drive=hd1,serial=hd1 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/pool/2/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0

I notice that both aarch64 and ppc64le have '-device usb-kbd' but x86_64 does not; this is down to some code in os-autoinst that does some arch-specific stuff:

    if ($vars->{ARCH} eq 'aarch64' || $vars->{ARCH} eq 'arm') {
        my $video_device = ($vars->{QEMU_OVERRIDE_VIDEO_DEVICE_AARCH64}) ? 'VGA' : 'virtio-gpu-pci';
        sp('device', $video_device);
        $arch_supports_boot_order = 0;
        $use_usb_kbd              = 1;
    }
    elsif ($vars->{OFW}) {
        $vars->{QEMUVGA} ||= "std";
        $vars->{QEMUMACHINE} = "usb=off";
        sp('g', '1024x768');
        $use_usb_kbd = 1;
    }

$vars->{OFW} is basically a proxy for ppc64(le) - it means Open Firmware, which is always used on ppc64(le). I don't fully know the reason for this offhand, there's a commit message "Also, aarch64 need usb-kbd", for ppc64(le) the line shows up in a commit with the message "Use xhci usb on POWER". I can try and fiddle an x86_64 test to run with usb-kbd and see if that triggers the issue.

The issue seems to be very specifically linked to typing the character "<" - I haven't seen the same behaviour with any other character (the desktop_browser test on GNOME is failing on typing issues far more than it did before, but those failures seem somewhat more random, though they *do* seem to always happen on a shift+key or ctrl+key combo).

Comment 14 Adam Williamson 2018-12-18 16:35:20 UTC
Indeed, hacking things up so an x86_64 test runs with '-device usb-kbd' causes the bug to occur there too.

Comment 15 Adam Williamson 2018-12-18 16:48:30 UTC
Using qemu-2:3.1.0-1.fc29.x86_64 from virt-preview COPR does not solve the bug.

Comment 16 Daniel Berrangé 2018-12-18 16:50:44 UTC
When you say   

   type_string "<<<<<<";

what exactly is that doing in terms of libvirt/QEMU ?  Is that injecting keys via the monitor, or is it injecting via VNC/SPICE/whatever ?

Comment 17 Adam Williamson 2018-12-18 17:14:38 UTC
Hah, so, I looked that up yesterday :)

It uses VNC. Here's the whole sequence of code.

type_string: https://github.com/os-autoinst/os-autoinst/blob/master/consoles/vnc_base.pm#L109 . It basically cuts the string into characters, does a bit of mapping for 'special' characters which doesn't concern us here, then sends the characters one at a time to:

map_and_send_key: https://github.com/os-autoinst/os-autoinst/blob/master/consoles/VNC.pm#L722 . This initializes a key mapping via 'init_x11_keymap' and maps the character to a value or values (for characters that use shift+key or ctrl+key etc) from that keymap. Then it sends a 'key down' event for each  mapped value, sleeping 2ms (I think) between each event, then sends a 'key up' event for the mapped value. It uses send_key_event_down and send_key_event_up to send these events:

send_key_event_down/up: https://github.com/os-autoinst/os-autoinst/blob/master/consoles/VNC.pm#L541 . These just take the value pressed and call _send_key_event, basically.

_send_key_event: https://github.com/os-autoinst/os-autoinst/blob/master/consoles/VNC.pm#L520 . This wraps up the value with the necessary padding and flags and so on and sends it over the socket to qemu.

Per https://github.com/os-autoinst/os-autoinst/blob/master/consoles/VNC.pm#L562 ($keymap_x11) and https://github.com/os-autoinst/os-autoinst/blob/master/consoles/VNC.pm#L671 (init_x11_keymap), we find the keymap value for the shift key is 0xffe1 , and I think the value for the key that produces '<' when shifted - which is ',' - should be 44 .

So if I got all that right, this code would basically send four events via VNC to qemu, in this order, with 2ms between each event:

(
            CCnN,
            4,             # message_type
            1,             # down-flag
            0,             # padding
            0xffe1,        # key
)

(
            CCnN,
            4,             # message_type
            1,             # down-flag
            0,             # padding
            44,        # key
)

(
            CCnN,
            4,             # message_type
            0,             # down-flag
            0,             # padding
            0xffe1,        # key
)

(
            CCnN,
            4,             # message_type
            0,             # down-flag
            0,             # padding
            44,        # key
)

Comment 18 Adam Williamson 2018-12-18 17:44:11 UTC
Downgrading to qemu-2:2.11.2-4.fc28 *DOES* fix the bug.

Comment 19 Adam Williamson 2018-12-18 18:26:51 UTC
Aha! So, we've got it.

This is sort of a redux of https://bugs.launchpad.net/qemu/+bug/1738283 . That's a bug we had in openQA a year or so back, where we suddenly had issues with typing < , > or | characters. It turned out to be caused by a sort of 'stray' key definition in qemu's 'pc-bios/keymaps/en-us' keyboard map: it has an entry for key '86':

# evdev 86 (0x56), QKeyCode "less", number 0x56
less 0x56
greater 0x56 shift
bar 0x56 altgr
brokenbar 0x56 shift altgr

Only that key does not exist on any real en-us keyboard. key 86 is the key between the left Shift key and the first letter key on the same row, *on 102/105-key keyboards* (European keyboards). 101/104-key keyboards (US etc.) *do not have that key*, but the qemu-keymap tool that produces this keymap file basically makes up an entry for that key in the layouts it generates for en-us and other mappings associated with 101/104-key physical keyboards.

It seems that Cole tested and found that *that original bug* was gone with qemu 2.12, so he dropped the patch. And indeed that exact bug *is* gone - we don't have any issue typing > or | , and on x86_64 without usb-kbd we don't have any issue typing < . But it seems that 'stray' key definition *is* still causing *THIS* bug - with usb-kbd , we have this problem when typing < .

The specific nature of the bug that remains is basically that if you do this:

'shift' down
'comma' down
'shift' up
'comma' up

on a VM via vnc with usb-kbd, it gets read as this series of 'input_event_key_qcode' events:

shift down
comma down
shift up
less up

and that screws things up. If you do this:

'shift' down
'comma' down
'comma' up
'shift' up

it works fine. So this bug *only* happens with usb-kbd, this specific way of ordering shifted key events, the 'comma / less' key, and the problematic '86' entry in the en-us keymap. Take out any of those things and the bug goes away.

For now, I'll restore my patch that removes the '86' keymap entry to the Fedora package. We can't really just upstream that patch, because the file is meant to be generated by 'qemu-keymap' - the "correct" fix would be to make qemu-keymap not generate a 'key 86' entry for layouts associated with 101/104-key physical keyboards.

Comment 20 Adam Williamson 2018-12-18 18:28:07 UTC
edit - it'd also, of course, be "correct" to fix the code so the '86' entry doesn't screw it up in this case. It's just that it'd definitely *not* be the 'correct' fix to just change the 'en-us' file itself upstream, so we can't just upstream a patch that does that.

Comment 21 Fedora Update System 2018-12-19 00:33:23 UTC
qemu-3.0.0-3.fc29 has been submitted as an update to Fedora 29. https://bodhi.fedoraproject.org/updates/FEDORA-2018-ef7bc97212

Comment 22 Adam Williamson 2018-12-19 01:25:05 UTC
I've already installed the update on the openQA worker boxes, so from now on, this should be resolved for openQA tests.

Comment 23 Daniel Berrangé 2018-12-19 11:40:52 UTC
See also this upstream discussion of a long term fix https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg04631.html

Comment 24 Fedora Update System 2018-12-21 21:05:24 UTC
qemu-3.0.0-3.fc29 has been pushed to the Fedora 29 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-ef7bc97212

Comment 25 Fedora Update System 2018-12-23 02:52:42 UTC
qemu-3.0.0-3.fc29 has been pushed to the Fedora 29 stable repository. If problems still persist, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.