Bug 1582707 - "query-status" occasionally takes very long time on s390x
Summary: "query-status" occasionally takes very long time on s390x
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-ma
Version: 7.6
Hardware: s390x
OS: Unspecified
Target Milestone: rc
Assignee: Thomas Huth
QA Contact: Qunfang Zhang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-05-26 07:33 UTC by Lukáš Doktor
Modified: 2018-07-13 10:38 UTC (History)
9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-13 10:38:15 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Avocado-vt results of migration which failed due to too long response on QMP (catch_monitor-avocado-vt-vm1.log) (60.75 KB, application/zip)
2018-05-26 07:33 UTC, Lukáš Doktor
no flags Details

Description Lukáš Doktor 2018-05-26 07:33:19 UTC
Created attachment 1441773 [details]
Avocado-vt results of migration which failed due to too long response on QMP (catch_monitor-avocado-vt-vm1.log)

Description of problem:
I'm running an Avocado-vt migration test that occasionally fails because the "query-status" command takes too long to respond (>20 s).

Version-Release number of selected component (if applicable):
qemu-kvm-ma-2.10.0-21.el7_5.2
kernel-4.14.0-49.el7a.s390x

How reproducible:
3/5

Steps to Reproduce:
1. Setup "Avocado-vt" and get RHEL.7.devel image
2. avocado --show all run --vt-guest-os RHEL.7.devel migrate.with_reboot.tcp

Actual results:
ERROR: MonitorLockError: Could not acquire exclusive lock to read QMP events

Expected results:
PASS

Additional info:
It might be possible to reproduce this without Avocado-vt. Basically, the source machine is booted with:

MALLOC_PERTURB_=1  /usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -machine s390-ccw-virtio  \
    -nodefaults  \
    -vga none  \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_rgV7i8/monitor-qmpmonitor1-20180525-045207-GArZ3FDu,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control  \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_rgV7i8/monitor-catch_monitor-20180525-045207-GArZ3FDu,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control  \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/avocado_rgV7i8/serial-serial0-20180525-045207-GArZ3FDu,server,nowait \
    -device sclpconsole,chardev=serial_id_serial0 \
    -device virtio-scsi-ccw,id=virtio_scsi_ccw0 \
    -drive id=drive_image1,if=none,format=qcow2,file=/var/lib/libvirt/images/avocado/avocado-vt/images/rhel7devel-s390x.qcow2 \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -device virtio-net-ccw,mac=9a:cc:cd:ce:cf:d0,id=id0NoZen,netdev=idsXij0h  \
    -netdev tap,id=idsXij0h,fd=18 \
    -m 2048  \
    -smp 2,maxcpus=2,cores=1,threads=1,sockets=2  \
    -cpu 'host'  \
    -vnc :0  \
    -rtc base=utc,clock=host \
    -boot strict=on \
    -enable-kvm


and the destination machine with:

MALLOC_PERTURB_=1  /usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -machine s390-ccw-virtio  \
    -nodefaults  \
    -vga none  \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_rgV7i8/monitor-qmpmonitor1-20180525-045219-H2lQBVoc,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control  \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_rgV7i8/monitor-catch_monitor-20180525-045219-H2lQBVoc,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control  \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/avocado_rgV7i8/serial-serial0-20180525-045219-H2lQBVoc,server,nowait \
    -device sclpconsole,chardev=serial_id_serial0 \
    -device virtio-scsi-ccw,id=virtio_scsi_ccw0 \
    -drive id=drive_image1,if=none,format=qcow2,file=/var/lib/libvirt/images/avocado/avocado-vt/images/rhel7devel-s390x.qcow2 \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -device virtio-net-ccw,mac=9a:cc:cd:ce:cf:d0,id=idGUaE7Y,netdev=idIvDQPb  \
    -netdev tap,id=idIvDQPb,fd=28 \
    -m 2048  \
    -smp 2,maxcpus=2,cores=1,threads=1,sockets=2  \
    -cpu 'host'  \
    -vnc :1  \
    -rtc base=utc,clock=host \
    -boot strict=on \
    -enable-kvm \
    -incoming tcp:0:5200


You can see that there are two QMP monitors. The usual "greeting" and a few "query-*" commands are executed, immediately followed by:

First monitor:

2018-05-25 05:25:34: {"execute": "query-status", "id": "Wcwn133t"}
2018-05-25 05:25:34: {"return": {"status": "inmigrate", "singlestep": false, "running": false}, "id": "Wcwn133t"}
2018-05-25 05:25:34: {"execute": "query-cpus", "id": "p4jqntiW"}
2018-05-25 05:25:34: {"return": [{"arch": "s390", "current": true, "props": {"core-id": 0}, "cpu-state": "operating", "CPU": 0, "qom_path": "/machine/unattached/device[0]", "halted": false, "thread_id": 84360}, {"arch": "s390", "current": false, "props": {"core-id": 1}, "cpu-state": "stopped", "CPU": 1, "qom_path": "/machine/unattached/device[1]", "halted": true, "thread_id": 84361}], "id": "p4jqntiW"}
2018-05-25 05:25:34: {"execute": "query-status", "id": "7NMYjJb8"}
2018-05-25 05:25:34: {"return": {"status": "inmigrate", "singlestep": false, "running": false}, "id": "7NMYjJb8"}
2018-05-25 05:28:15: {"execute": "query-status", "id": "6qNhjvpH"}
2018-05-25 05:28:15: {"return": {"status": "paused", "singlestep": false, "running": false}, "id": "6qNhjvpH"}

At the same time, on the second monitor:

2018-05-25 05:25:34: {"execute": "query-status", "id": "OTLnoKT0"}
2018-05-25 05:25:34: {"return": {"status": "inmigrate", "singlestep": false, "running": false}, "id": "OTLnoKT0"}
2018-05-25 05:25:40: {"execute": "query-status", "id": "NVtTiYnf"}
2018-05-25 05:26:06: {"return": {"status": "inmigrate", "singlestep": false, "running": false}, "id": "NVtTiYnf"}
2018-05-25 05:26:13: {"execute": "human-monitor-command", "arguments": {"command-line": "info registers"}, "id": "FodBo9Df"}

In the second monitor you can see the long pause (roughly 26 seconds) between the NVtTiYnf execute and its return, which causes the failure. The machine actually proceeds and the migration finishes fine. This only happens early on, shortly after the machine is started.
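The symptom above can be measured without Avocado-vt by talking to the QMP socket directly. The sketch below is an illustration, not Avocado-vt's code: the helper sends one command and times the matching reply, and the self-contained demo replaces the real monitor socket (the `-chardev socket,path=...` from the command lines above) with a fake responder on a `socketpair`. Against a real monitor you would connect to the chardev path, read the greeting, and send `{"execute": "qmp_capabilities"}` first; the demo skips that handshake.

```python
import json
import socket
import threading
import time

def qmp_command(sock, cmd, cmd_id, timeout=20.0):
    """Send one QMP command and block until the reply with a matching
    "id" arrives.  Returns (reply, elapsed_seconds).  A socket.timeout
    here corresponds to the >20 s stall reported in this bug."""
    sock.settimeout(timeout)
    sock.sendall((json.dumps({"execute": cmd, "id": cmd_id}) + "\n").encode())
    start = time.monotonic()
    buf = b""
    while True:
        buf += sock.recv(4096)
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            msg = json.loads(line)
            if msg.get("id") == cmd_id:      # skip async events (no "id")
                return msg, time.monotonic() - start

def fake_monitor(sock, delay):
    """Stand-in for the QMP server: answer query-status after `delay`."""
    sock.recv(4096)                          # read the command
    time.sleep(delay)
    reply = {"return": {"status": "inmigrate", "singlestep": False,
                        "running": False}, "id": "demo1"}
    sock.sendall((json.dumps(reply) + "\n").encode())

# Demo: a 0.2 s artificial delay stands in for the multi-minute stall.
client, server = socket.socketpair()
threading.Thread(target=fake_monitor, args=(server, 0.2), daemon=True).start()
reply, elapsed = qmp_command(client, "query-status", "demo1")
```

Logging `elapsed` for every command in a loop would show whether the stall is specific to "query-status" or affects the whole monitor.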

Comment 2 Qunfang Zhang 2018-05-31 08:41:15 UTC
I haven't reproduced this bug after running the same tp-qemu test case "migrate.with_reboot.tcp" for more than 50 cycles.  Will update here once I get a chance to reproduce it.

qemu-kvm-ma-2.10.0-21.el7_5.2
kernel-4.14.0-49.2.1.el7a.s390x

Comment 6 Thomas Huth 2018-06-27 08:46:50 UTC
Hi Lukáš,
I've now tried to reproduce this problem on the kerneldev5 LPAR, but so far I have also failed. I ran it 50 times with qemu-kvm 2.12, and all tests succeeded. Then I installed qemu-kvm-ma-2.10.0-21.el7_5.2.s390x and kernel 4.14.0-49.9.1.el7a.s390x and ran it another 50 times, and avocado reported "PASS" again for all tests.
Is there maybe something special about your LPAR? I wonder whether it shares CPUs with other LPARs and you just hit a bad point in time when the other LPARs were heavily loaded. Can you still reproduce the problem (maybe also try with qemu v2.12 from RHEL 7.6)?

Comment 7 Qunfang Zhang 2018-06-27 08:52:11 UTC
(In reply to Thomas Huth from comment #6)
> Hi Lukáš,
> I've now tried to reproduce this problem on the kerneldev5 LPAR, but so far
> I also failed. I've ran it 50 times with qemu-kvm 2.12, and all tests
> succeeded. Then I installed qemu-kvm-ma-2.10.0-21.el7_5.2.s390x and kernel
> 4.14.0-49.9.1.el7a.s390x and ran it again for 50 times, and avocado reported
> "PASS" again for all tests.
> Is there something special with your LPAR maybe? I wonder whether it has
> shared CPUs with other LPARs and you've just hit a wrong point in time when
> the other LPARs were very loaded? Can you still reproduce the problem (maybe
> also try with qemu v2.12 from RHEL7.6)?

Hi, Thomas

On Monday night this week, I also borrowed Lukáš's LPAR (kerneldev8) and ran the test at least 30 times, all PASS... Lukáš got my test results and logs; not sure what the difference is between my testing and his.

Comment 8 Lukáš Doktor 2018-07-12 18:02:58 UTC
I haven't hit this issue for quite a while, but I noticed the system memory was at its limits, so it is possible that the process got swapped out, which could have resulted in long timeouts.
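The swapping hypothesis is easy to check from the host while the test runs. A minimal sketch (the `pgrep` pattern is an assumption; adjust it to however qemu is launched on the box):

```shell
# Show resident vs. swapped memory for each running qemu-kvm process.
# A non-zero VmSwap means part of the guest/monitor state sits in swap,
# and QMP replies can then stall on page-in.
for pid in $(pgrep -f qemu-kvm); do
    echo "== PID $pid =="
    grep -E 'VmRSS|VmSwap' "/proc/$pid/status"
done
# Overall memory and swap pressure on the host:
free -h
```

Capturing this alongside the QMP log timestamps would tie a slow reply directly to swap activity.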

I'll keep an eye on this and let you know at tomorrow's meeting whether we can close this one.

Comment 9 Lukáš Doktor 2018-07-13 10:38:15 UTC
OK, it looks like the problem was really low memory. I'm sorry for bothering you.

