Created attachment 1441773 [details]
Avocado-vt results of migration which failed due to too long response on QMP (catch_monitor-avocado-vt-vm1.log)

Description of problem:
I'm running an Avocado-vt migration test which occasionally fails because the "query-status" command takes too long (>20 s) to respond.

Version-Release number of selected component (if applicable):
qemu-kvm-ma-2.10.0-21.el7_5.2
kernel-4.14.0-49.el7a.s390x

How reproducible:
3/5

Steps to Reproduce:
1. Set up "Avocado-vt" and get the RHEL.7.devel image
2. avocado --show all run --vt-guest-os RHEL.7.devel migrate.with_reboot.tcp

Actual results:
ERROR: MonitorLockError: Could not acquire exclusive lock to read QMP events

Expected results:
PASS

Additional info:
It might be possible to reproduce this without Avocado-vt. Basically, what happens is that the src machine is booted with:

MALLOC_PERTURB_=1 /usr/libexec/qemu-kvm \
    -S \
    -name 'avocado-vt-vm1' \
    -sandbox off \
    -machine s390-ccw-virtio \
    -nodefaults \
    -vga none \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_rgV7i8/monitor-qmpmonitor1-20180525-045207-GArZ3FDu,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_rgV7i8/monitor-catch_monitor-20180525-045207-GArZ3FDu,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/avocado_rgV7i8/serial-serial0-20180525-045207-GArZ3FDu,server,nowait \
    -device sclpconsole,chardev=serial_id_serial0 \
    -device virtio-scsi-ccw,id=virtio_scsi_ccw0 \
    -drive id=drive_image1,if=none,format=qcow2,file=/var/lib/libvirt/images/avocado/avocado-vt/images/rhel7devel-s390x.qcow2 \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -device virtio-net-ccw,mac=9a:cc:cd:ce:cf:d0,id=id0NoZen,netdev=idsXij0h \
    -netdev tap,id=idsXij0h,fd=18 \
    -m 2048 \
    -smp 2,maxcpus=2,cores=1,threads=1,sockets=2 \
    -cpu 'host' \
    -vnc :0 \
    -rtc base=utc,clock=host \
    -boot strict=on \
    -enable-kvm

and the dst machine with:

MALLOC_PERTURB_=1 /usr/libexec/qemu-kvm \
    -S \
    -name 'avocado-vt-vm1' \
    -sandbox off \
    -machine s390-ccw-virtio \
    -nodefaults \
    -vga none \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_rgV7i8/monitor-qmpmonitor1-20180525-045219-H2lQBVoc,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_rgV7i8/monitor-catch_monitor-20180525-045219-H2lQBVoc,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/avocado_rgV7i8/serial-serial0-20180525-045219-H2lQBVoc,server,nowait \
    -device sclpconsole,chardev=serial_id_serial0 \
    -device virtio-scsi-ccw,id=virtio_scsi_ccw0 \
    -drive id=drive_image1,if=none,format=qcow2,file=/var/lib/libvirt/images/avocado/avocado-vt/images/rhel7devel-s390x.qcow2 \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -device virtio-net-ccw,mac=9a:cc:cd:ce:cf:d0,id=idGUaE7Y,netdev=idIvDQPb \
    -netdev tap,id=idIvDQPb,fd=28 \
    -m 2048 \
    -smp 2,maxcpus=2,cores=1,threads=1,sockets=2 \
    -cpu 'host' \
    -vnc :1 \
    -rtc base=utc,clock=host \
    -boot strict=on \
    -enable-kvm \
    -incoming tcp:0:5200

You can see there are 2 QMP monitors.
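To try reproducing without Avocado-vt, something like the following sketch could be used: connect directly to one of the QMP sockets, negotiate capabilities, and time each "query-status" reply. The socket path is a placeholder (use whatever path is passed to -chardev above); this is an illustration, not the Avocado-vt code.

#!/usr/bin/env python3
# Sketch: time "query-status" replies on a raw QMP socket.
# SOCKET_PATH is a placeholder; substitute the real monitor path.
import json
import socket
import time

SOCKET_PATH = "/var/tmp/qmp-monitor.sock"  # placeholder path

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.connect(SOCKET_PATH)
f = sock.makefile("rw")

def send(cmd):
    f.write(json.dumps(cmd) + "\n")
    f.flush()

def recv_reply():
    # Skip asynchronous events; a command reply has no "event" key.
    while True:
        msg = json.loads(f.readline())
        if "event" not in msg:
            return msg

recv_reply()                              # consume the QMP greeting
send({"execute": "qmp_capabilities"})
recv_reply()                              # capabilities negotiation reply

for i in range(60):
    start = time.time()
    send({"execute": "query-status", "id": "probe-%d" % i})
    reply = recv_reply()
    elapsed = time.time() - start
    print("%6.2fs %s" % (elapsed, reply.get("return")))
    if elapsed > 20:                      # the timeout hit in this report
        print("query-status took longer than 20s!")
    time.sleep(1)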
The usual "greeting" and some "query-*" commands are executed, immediately followed by:

First monitor:
2018-05-25 05:25:34: {"execute": "query-status", "id": "Wcwn133t"}
2018-05-25 05:25:34: {"return": {"status": "inmigrate", "singlestep": false, "running": false}, "id": "Wcwn133t"}
2018-05-25 05:25:34: {"execute": "query-cpus", "id": "p4jqntiW"}
2018-05-25 05:25:34: {"return": [{"arch": "s390", "current": true, "props": {"core-id": 0}, "cpu-state": "operating", "CPU": 0, "qom_path": "/machine/unattached/device[0]", "halted": false, "thread_id": 84360}, {"arch": "s390", "current": false, "props": {"core-id": 1}, "cpu-state": "stopped", "CPU": 1, "qom_path": "/machine/unattached/device[1]", "halted": true, "thread_id": 84361}], "id": "p4jqntiW"}
2018-05-25 05:25:34: {"execute": "query-status", "id": "7NMYjJb8"}
2018-05-25 05:25:34: {"return": {"status": "inmigrate", "singlestep": false, "running": false}, "id": "7NMYjJb8"}
2018-05-25 05:28:15: {"execute": "query-status", "id": "6qNhjvpH"}
2018-05-25 05:28:15: {"return": {"status": "paused", "singlestep": false, "running": false}, "id": "6qNhjvpH"}

At the same time, the second monitor:
2018-05-25 05:25:34: {"execute": "query-status", "id": "OTLnoKT0"}
2018-05-25 05:25:34: {"return": {"status": "inmigrate", "singlestep": false, "running": false}, "id": "OTLnoKT0"}
2018-05-25 05:25:40: {"execute": "query-status", "id": "NVtTiYnf"}
2018-05-25 05:26:06: {"return": {"status": "inmigrate", "singlestep": false, "running": false}, "id": "NVtTiYnf"}
2018-05-25 05:26:13: {"execute": "human-monitor-command", "arguments": {"command-line": "info registers"}, "id": "FodBo9Df"}

In the second monitor you can see the long pause (26 s) between the NVtTiYnf "execute" and its "return", which causes the failure. The machine actually proceeds and the migration finishes fine. This only happens early on, basically just after the machine is started.
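For clarity, here is an illustrative sketch (explicitly not Avocado-vt's actual implementation) of why a slow reply on one monitor can surface as the MonitorLockError above: the thread waiting for the reply holds the monitor lock, and a concurrent caller trying to read events gives up after 20 seconds.

import threading

ACQUIRE_TIMEOUT = 20  # seconds, matching the >20s delay seen above

class MonitorLockError(Exception):
    pass

monitor_lock = threading.Lock()

def wait_for_reply():
    # Holds the lock for as long as the QMP reply takes to arrive.
    with monitor_lock:
        pass  # the blocking read of the command reply would happen here

def read_events():
    # A concurrent caller times out if the reply takes longer than 20 s.
    if not monitor_lock.acquire(timeout=ACQUIRE_TIMEOUT):
        raise MonitorLockError(
            "Could not acquire exclusive lock to read QMP events")
    try:
        pass  # read and dispatch pending QMP events here
    finally:
        monitor_lock.release()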
I haven't reproduced this bug after running the same tp-qemu test case "migrate.with_reboot.tcp" for more than 50 repeat cycles. I will update here once I get a chance to reproduce it.

qemu-kvm-ma-2.10.0-21.el7_5.2
kernel-4.14.0-49.2.1.el7a.s390x
Hi Lukáš,

I've now tried to reproduce this problem on the kerneldev5 LPAR, but so far I have also failed. I ran it 50 times with qemu-kvm 2.12, and all tests succeeded. Then I installed qemu-kvm-ma-2.10.0-21.el7_5.2.s390x and kernel 4.14.0-49.9.1.el7a.s390x and ran it again 50 times, and avocado reported "PASS" for all tests again.

Is there something special about your LPAR, maybe? I wonder whether it has shared CPUs with other LPARs and you just hit a wrong point in time when the other LPARs were heavily loaded? Can you still reproduce the problem (maybe also try with qemu v2.12 from RHEL 7.6)?
(In reply to Thomas Huth from comment #6)
> Hi Lukáš,
> I've now tried to reproduce this problem on the kerneldev5 LPAR, but so far
> I have also failed. I ran it 50 times with qemu-kvm 2.12, and all tests
> succeeded. Then I installed qemu-kvm-ma-2.10.0-21.el7_5.2.s390x and kernel
> 4.14.0-49.9.1.el7a.s390x and ran it again 50 times, and avocado reported
> "PASS" for all tests again.
> Is there something special about your LPAR, maybe? I wonder whether it has
> shared CPUs with other LPARs and you just hit a wrong point in time when
> the other LPARs were heavily loaded? Can you still reproduce the problem
> (maybe also try with qemu v2.12 from RHEL 7.6)?

Hi Thomas,

On Monday night this week I also borrowed Lukáš's LPAR (kerneldev8) and ran the test at least 30 times; all runs PASSED. Lukáš got my test results and logs; I am not sure what the difference is between my testing and his.
I haven't hit this issue for quite a while, but I noticed the system memory was at its limits, so it is possible that the process got swapped out, which could have resulted in the long response times. I'll keep an eye on this and let you know at tomorrow's meeting whether we can close this one or not.
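A quick way to check the swapping theory next time it happens (a sketch, assuming a single host running qemu-kvm processes): the VmSwap field in /proc/<pid>/status is nonzero when part of the process resides in swap.

#!/usr/bin/env python3
# Sketch: report resident and swapped-out memory of qemu processes.
import subprocess

pids = subprocess.check_output(["pgrep", "-f", "qemu-kvm"]).decode().split()
for pid in pids:
    print("PID", pid)
    with open("/proc/%s/status" % pid) as status:
        for line in status:
            # VmRSS = resident set size, VmSwap = amount swapped out
            if line.startswith(("VmRSS", "VmSwap")):
                print("  " + line.strip())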
OK, it looks like the problem really was low memory. I'm sorry for bothering you.