Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1582707

Summary: "query-status" occasionally takes very long time on s390x
Product: Red Hat Enterprise Linux 7
Reporter: Lukáš Doktor <ldoktor>
Component: qemu-kvm-ma
Assignee: Thomas Huth <thuth>
Status: CLOSED NOTABUG
QA Contact: Qunfang Zhang <qzhang>
Severity: unspecified
Priority: unspecified
Version: 7.6
CC: areis, cohuck, dhildenb, knoel, ldoktor, michen, qzhang, xuma, yhong
Target Milestone: rc
Target Release: ---
Hardware: s390x
OS: Unspecified
Doc Type: If docs needed, set a value
Last Closed: 2018-07-13 10:38:15 UTC
Type: Bug
Regression: ---

Attachments:
Avocado-vt results of migration which failed due to too long response on QMP (catch_monitor-avocado-vt-vm1.log)

Description Lukáš Doktor 2018-05-26 07:33:19 UTC
Created attachment 1441773 [details]
Avocado-vt results of migration which failed due to too long response on QMP (catch_monitor-avocado-vt-vm1.log)

Description of problem:
I'm running an Avocado-vt migration test which occasionally fails because the "query-status" command takes too long to respond (>20 s).

Version-Release number of selected component (if applicable):
qemu-kvm-ma-2.10.0-21.el7_5.2
kernel-4.14.0-49.el7a.s390x

How reproducible:
3/5

Steps to Reproduce:
1. Set up "Avocado-vt" and get the RHEL.7.devel image
2. avocado --show all run --vt-guest-os RHEL.7.devel migrate.with_reboot.tcp

Actual results:
ERROR: MonitorLockError: Could not acquire exclusive lock to read QMP events

Expected results:
PASS

Additional info:
It might be possible to reproduce it without Avocado-vt. Basically, what happens is that the source machine is booted with:

MALLOC_PERTURB_=1  /usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -machine s390-ccw-virtio  \
    -nodefaults  \
    -vga none  \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_rgV7i8/monitor-qmpmonitor1-20180525-045207-GArZ3FDu,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control  \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_rgV7i8/monitor-catch_monitor-20180525-045207-GArZ3FDu,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control  \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/avocado_rgV7i8/serial-serial0-20180525-045207-GArZ3FDu,server,nowait \
    -device sclpconsole,chardev=serial_id_serial0 \
    -device virtio-scsi-ccw,id=virtio_scsi_ccw0 \
    -drive id=drive_image1,if=none,format=qcow2,file=/var/lib/libvirt/images/avocado/avocado-vt/images/rhel7devel-s390x.qcow2 \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -device virtio-net-ccw,mac=9a:cc:cd:ce:cf:d0,id=id0NoZen,netdev=idsXij0h  \
    -netdev tap,id=idsXij0h,fd=18 \
    -m 2048  \
    -smp 2,maxcpus=2,cores=1,threads=1,sockets=2  \
    -cpu 'host'  \
    -vnc :0  \
    -rtc base=utc,clock=host \
    -boot strict=on \
    -enable-kvm


and the destination machine with:

MALLOC_PERTURB_=1  /usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -machine s390-ccw-virtio  \
    -nodefaults  \
    -vga none  \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_rgV7i8/monitor-qmpmonitor1-20180525-045219-H2lQBVoc,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control  \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_rgV7i8/monitor-catch_monitor-20180525-045219-H2lQBVoc,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control  \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/avocado_rgV7i8/serial-serial0-20180525-045219-H2lQBVoc,server,nowait \
    -device sclpconsole,chardev=serial_id_serial0 \
    -device virtio-scsi-ccw,id=virtio_scsi_ccw0 \
    -drive id=drive_image1,if=none,format=qcow2,file=/var/lib/libvirt/images/avocado/avocado-vt/images/rhel7devel-s390x.qcow2 \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -device virtio-net-ccw,mac=9a:cc:cd:ce:cf:d0,id=idGUaE7Y,netdev=idIvDQPb  \
    -netdev tap,id=idIvDQPb,fd=28 \
    -m 2048  \
    -smp 2,maxcpus=2,cores=1,threads=1,sockets=2  \
    -cpu 'host'  \
    -vnc :1  \
    -rtc base=utc,clock=host \
    -boot strict=on \
    -enable-kvm \
    -incoming tcp:0:5200


You can see there are 2 QMP monitors. The usual "greeting" and some "query-*" commands are executed immediately, followed by:

First monitor:

2018-05-25 05:25:34: {"execute": "query-status", "id": "Wcwn133t"}
2018-05-25 05:25:34: {"return": {"status": "inmigrate", "singlestep": false, "running": false}, "id": "Wcwn133t"}
2018-05-25 05:25:34: {"execute": "query-cpus", "id": "p4jqntiW"}
2018-05-25 05:25:34: {"return": [{"arch": "s390", "current": true, "props": {"core-id": 0}, "cpu-state": "operating", "CPU": 0, "qom_path": "/machine/unattached/device[0]", "halted": false, "thread_id": 84360}, {"arch": "s390", "current": false, "props": {"core-id": 1}, "cpu-state": "stopped", "CPU": 1, "qom_path": "/machine/unattached/device[1]", "halted": true, "thread_id": 84361}], "id": "p4jqntiW"}
2018-05-25 05:25:34: {"execute": "query-status", "id": "7NMYjJb8"}
2018-05-25 05:25:34: {"return": {"status": "inmigrate", "singlestep": false, "running": false}, "id": "7NMYjJb8"}
2018-05-25 05:28:15: {"execute": "query-status", "id": "6qNhjvpH"}
2018-05-25 05:28:15: {"return": {"status": "paused", "singlestep": false, "running": false}, "id": "6qNhjvpH"}

At the same time the second monitor:

2018-05-25 05:25:34: {"execute": "query-status", "id": "OTLnoKT0"}
2018-05-25 05:25:34: {"return": {"status": "inmigrate", "singlestep": false, "running": false}, "id": "OTLnoKT0"}
2018-05-25 05:25:40: {"execute": "query-status", "id": "NVtTiYnf"}
2018-05-25 05:26:06: {"return": {"status": "inmigrate", "singlestep": false, "running": false}, "id": "NVtTiYnf"}
2018-05-25 05:26:13: {"execute": "human-monitor-command", "arguments": {"command-line": "info registers"}, "id": "FodBo9Df"}

In the second monitor you can see the long pause (~26 s) between the NVtTiYnf execute and its return, which causes the failure. The machine actually proceeds and the migration finishes successfully. This only happens early, basically just after the machine is started.
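Pauses like this can be spotted mechanically in the monitor logs. Below is a minimal sketch (not part of the original report) that pairs "execute"/"return" lines by their QMP "id" in a timestamped log of the format shown above and flags round trips slower than the test's 20 s timeout; the log format and threshold are assumptions taken from this report:

```python
import json
import re
from datetime import datetime

# "YYYY-MM-DD HH:MM:SS: {json}" as in the monitor logs above
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): (\{.*\})$")

def slow_commands(log_lines, threshold_s=20.0):
    """Return (id, command, latency_seconds) for QMP commands whose
    'return' arrived more than threshold_s after the 'execute'."""
    pending = {}  # QMP id -> (execute timestamp, command name)
    slow = []
    for line in log_lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        msg = json.loads(m.group(2))
        msg_id = msg.get("id")
        if "execute" in msg:
            pending[msg_id] = (ts, msg["execute"])
        elif "return" in msg and msg_id in pending:
            start, cmd = pending.pop(msg_id)
            latency = (ts - start).total_seconds()
            if latency > threshold_s:
                slow.append((msg_id, cmd, latency))
    return slow

# The second monitor's excerpt, trimmed to the relevant fields:
log = [
    '2018-05-25 05:25:34: {"execute": "query-status", "id": "OTLnoKT0"}',
    '2018-05-25 05:25:34: {"return": {"status": "inmigrate"}, "id": "OTLnoKT0"}',
    '2018-05-25 05:25:40: {"execute": "query-status", "id": "NVtTiYnf"}',
    '2018-05-25 05:26:06: {"return": {"status": "inmigrate"}, "id": "NVtTiYnf"}',
]
print(slow_commands(log))  # [('NVtTiYnf', 'query-status', 26.0)]
```

Running this over the full catch_monitor log would list every command that exceeded the timeout, not just the one that happened to trip the test.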

Comment 2 Qunfang Zhang 2018-05-31 08:41:15 UTC
I haven't reproduced this bug after running the same tp-qemu test case "migrate.with_reboot.tcp" more than 50 times. I will update here once I get a chance to reproduce it.

qemu-kvm-ma-2.10.0-21.el7_5.2
kernel-4.14.0-49.2.1.el7a.s390x

Comment 6 Thomas Huth 2018-06-27 08:46:50 UTC
Hi Lukáš,
I've now tried to reproduce this problem on the kerneldev5 LPAR, but so far I have also failed. I ran it 50 times with qemu-kvm 2.12, and all tests succeeded. Then I installed qemu-kvm-ma-2.10.0-21.el7_5.2.s390x and kernel 4.14.0-49.9.1.el7a.s390x and ran it again 50 times, and avocado again reported "PASS" for all tests.
Is there perhaps something special about your LPAR? I wonder whether it shares CPUs with other LPARs and you just hit a bad point in time when the other LPARs were heavily loaded. Can you still reproduce the problem (maybe also try with qemu v2.12 from RHEL 7.6)?

Comment 7 Qunfang Zhang 2018-06-27 08:52:11 UTC
(In reply to Thomas Huth from comment #6)
> Hi Lukáš,
> I've now tried to reproduce this problem on the kerneldev5 LPAR, but so far
> I also failed. I've ran it 50 times with qemu-kvm 2.12, and all tests
> succeeded. Then I installed qemu-kvm-ma-2.10.0-21.el7_5.2.s390x and kernel
> 4.14.0-49.9.1.el7a.s390x and ran it again for 50 times, and avocado reported
> "PASS" again for all tests.
> Is there something special with your LPAR maybe? I wonder whether it has
> shared CPUs with other LPARs and you've just hit a wrong point in time when
> the other LPARs were very loaded? Can you still reproduce the problem (maybe
> also try with qemu v2.12 from RHEL7.6)?

Hi, Thomas

On Monday night this week, I also borrowed Lukáš's LPAR (kerneldev8) and ran the test at least 30 times; all runs passed. Lukáš has my test results and logs; I'm not sure what the difference is between my testing and his.

Comment 8 Lukáš Doktor 2018-07-12 18:02:58 UTC
I haven't hit this issue for quite a while, but I noticed the system memory was at its limits, so it is possible that the process got swapped out, which could have resulted in long timeouts.
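Swap activity of this kind can be checked via /proc/vmstat on Linux. A hypothetical helper, not part of the report, that extracts the cumulative pswpin/pswpout counters (pages swapped in/out since boot); growth in these values between two samples suggests processes are being swapped:

```python
def swap_counters(vmstat_text):
    """Extract swap-in/out page counts from /proc/vmstat-style text."""
    counters = {}
    for line in vmstat_text.splitlines():
        # Each /proc/vmstat line is "counter_name value"
        name, _, value = line.partition(" ")
        if name in ("pswpin", "pswpout"):
            counters[name] = int(value)
    return counters

# In practice: swap_counters(open("/proc/vmstat").read())
sample = "nr_free_pages 1000\npswpin 42\npswpout 4242\nnr_dirty 7\n"
print(swap_counters(sample))  # {'pswpin': 42, 'pswpout': 4242}
```

Sampling these counters before and during a failing run would confirm or rule out swapping as the cause of the long QMP response times.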

I'll keep an eye on this and let you know at tomorrow's meeting whether we can close this one.

Comment 9 Lukáš Doktor 2018-07-13 10:38:15 UTC
OK, it looks like the problem really was low memory. Sorry for bothering you.