Bug 1549079 - Migrate during reboot occasionally hangs in firmware with virtio-scsi:setup:inquiry...
Summary: Migrate during reboot occasionally hangs in firmware with virtio-scsi:setup:inquiry...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-ma
Version: 7.5-Alt
Hardware: s390x
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Thomas Huth
QA Contact: xianwang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-02-26 11:04 UTC by Lukas Doktor
Modified: 2018-10-30 08:06 UTC
CC List: 12 users

Fixed In Version: qemu-kvm-ma-2.12.0-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-30 08:03:30 UTC


Attachments
Avocado-vt results of the migrate.with_reboot.tcp (29.96 KB, application/zip)
2018-02-26 11:04 UTC, Lukas Doktor


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:3062 None None None 2018-10-30 08:06:00 UTC
Red Hat Bugzilla 1571149 None None None Never

Internal Links: 1571149

Description Lukas Doktor 2018-02-26 11:04:52 UTC
Created attachment 1400813 [details]
Avocado-vt results of the migrate.with_reboot.tcp

Description of problem:
I'm running the `migrate.with_reboot.tcp` Avocado-vt test, and in ~50% of cases the incoming migration machine hangs in firmware.

Version-Release number of selected component (if applicable):
* qemu-kvm-ma-2.10.0-20.el7
* kernel-4.14.0-28.el7a.s390x
* I have been getting these failures for about a month now, always with the latest qemu-kvm-ma
* guest is RHEL.7.4 GA or latest nightly

How reproducible:
* ~50% of cases on RHEL.7.4
* ~10% of cases on nightly RHEL.7.5

Steps to Reproduce:
1. Install Avocado-vt
2. Run "avocado run unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads migrate.with_reboot.tcp --vt-guest-os RHEL.7.4 --vt-arch s390x --vt-machine-type s390-virtio" (which installs RHEL.7.4 guest and then runs migration)

Steps to Reproduce without Avocado:
1. Run a RHEL.7.4 guest via qemu
2. In the guest, execute "shutdown -r now"
3. Run the incoming qemu with "-incoming tcp:0:5200"
4. In the first qemu's monitor, start the migration (a scripted version of this step is sketched below): "{'execute': 'migrate', 'arguments': {'uri': 'tcp:localhost:5200', 'blk': False, 'inc': False}, 'id': 'm1LCbCYP'}"
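For reference, a rough Python sketch of step 4, sending the migrate command over a QMP socket. The socket path and the small qmp() helper are hypothetical (they assume the source qemu was started with a "-qmp unix:...,server" chardev); a robust client would also skip interleaved asynchronous events:

import json
import socket

QMP_SOCKET = "/tmp/qmp-src.sock"  # hypothetical path; match the -qmp option of the source qemu

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.connect(QMP_SOCKET)
stream = sock.makefile("rw")

def qmp(cmd):
    # Send one QMP command as a JSON line and return the parsed reply line.
    stream.write(json.dumps(cmd) + "\n")
    stream.flush()
    return json.loads(stream.readline())

json.loads(stream.readline())         # consume the QMP greeting banner
qmp({"execute": "qmp_capabilities"})  # leave capabilities negotiation mode
qmp({"execute": "migrate",
     "arguments": {"uri": "tcp:localhost:5200", "blk": False, "inc": False},
     "id": "m1LCbCYP"})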

Actual results:
In qemu serial output:
2018-02-20 12:55:09: [   14.001986] systemd-shutdown[1]: Not all DM devices detached, 1 left.
2018-02-20 12:55:09: [   14.001990] systemd-shutdown[1]: Cannot finalize remaining DM devices, continuing.
2018-02-20 12:55:09: [   14.024997] systemd-shutdown[1]: Successfully changed into root pivot.
2018-02-20 12:55:09: [   14.025001] systemd-shutdown[1]: Returning to initrd...
2018-02-20 12:55:09: [   14.056838] dracut Warning: Killing all remaining processes
2018-02-20 12:55:09: dracut Warning: Killing all remaining processes
2018-02-20 12:55:09: [   14.132683] XFS (dm-0): Unmounting Filesystem
2018-02-20 12:55:09: [   14.143876] dracut Warning: Unmounted /oldroot.
2018-02-20 12:55:09: [   14.159976] dracut: Disassembling device-mapper devices
2018-02-20 12:55:09: LOADPARM=[........]
2018-02-20 12:55:09: Using virtio-scsi.
2018-02-20 12:55:18: 
2018-02-20 12:55:18: ! virtio-scsi:setup:inquiry: response VS RESP=ff !
2018-02-20 12:56:44: (Process terminated with status 0)
2018-02-20 13:01:15: Ncat: Broken pipe.
2018-02-20 13:01:16: (Process terminated with status 1)

Expected results:
The machine should reboot back into the installed system.

Additional info:

Comment 3 Thomas Huth 2018-04-04 13:06:08 UTC
The fix that increases the timeout from 3 to 30 seconds has now been merged upstream and will be part of QEMU 2.12:
https://git.qemu.org/?p=qemu.git;a=commitdiff;h=23bf419c1c48e71c08737045f7d9c9b5e05ac4a7
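The underlying pattern of the fix is a longer deadline in the firmware's polling loop: during an in-flight migration the virtio-scsi device can stay unresponsive for longer than the old 3 seconds, so the INQUIRY retries were exhausted before the destination took over. A minimal Python sketch of that retry-until-deadline pattern (illustrative only; the real fix is in the s390-ccw firmware's C code, and poll() is a hypothetical callback):

import time

VIRTIO_SCSI_TIMEOUT = 30  # seconds; the upstream fix raised this from 3

def wait_for_inquiry(poll, timeout=VIRTIO_SCSI_TIMEOUT, interval=0.1):
    # Keep polling the hypothetical poll() callback until it yields a
    # response or the deadline passes, mirroring the firmware's retry loop.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        response = poll()
        if response is not None:
            return response
        time.sleep(interval)
    raise TimeoutError("virtio-scsi:setup:inquiry: no response before deadline")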

Comment 5 xianwang 2018-08-08 10:05:37 UTC
I have run this case 100 times and it failed 3 times. The error log is not the same as in the bug report, but the result is the same: the guest hangs.
uname:
4.14.0-27.el7a.s390x
qemu-kvm-ma-2.10.0-20.el7.s390x

# python ConfigTest.py --testcase=migrate.with_reboot.tcp --guestname=RHEL.7.5 --platform=s390x --vcpu=2 --nrepeat=100

RESULTS    : PASS 98 | ERROR 3 | FAIL 1 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB TIME   : 7092.43 s
JOB HTML   : /root/avocado/job-results/job-2018-08-08T01.23-96a50d1/results.html

log:
02:39:21 ERROR| Reproduced traceback from: /home/qzhang/kar/workspace/avocado-vt/avocado_vt/test.py:434
02:39:21 ERROR| Traceback (most recent call last):
02:39:21 ERROR|   File "/var/lib/avocado/data/avocado-vt/test-providers.d/downloads/io-github-autotest-qemu/qemu/tests/migration_with_reboot.py", line 56, in run
02:39:21 ERROR|     session = bg.join()
02:39:21 ERROR|   File "/home/qzhang/kar/workspace/avocado-vt/virttest/utils_misc.py", line 136, in join
02:39:21 ERROR|     six.reraise(*self._e)
02:39:21 ERROR|   File "/home/qzhang/kar/workspace/avocado-vt/virttest/utils_misc.py", line 110, in run
02:39:21 ERROR|     self._retval = self._target(*self._args, **self._kwargs)
02:39:21 ERROR|   File "/home/qzhang/kar/workspace/avocado-vt/virttest/error_context.py", line 135, in new_fn
02:39:21 ERROR|     return fn(*args, **kwargs)
02:39:21 ERROR|   File "/home/qzhang/kar/workspace/avocado-vt/virttest/qemu_vm.py", line 4156, in reboot
02:39:21 ERROR|     if not utils_misc.wait_for(_check_go_down, timeout=timeout):
02:39:21 ERROR|   File "/home/qzhang/kar/workspace/avocado-vt/virttest/utils_misc.py", line 685, in wait_for
02:39:21 ERROR|     output = func()
02:39:21 ERROR|   File "/home/qzhang/kar/workspace/avocado-vt/virttest/qemu_monitor.py", line 1835, in get_event
02:39:21 ERROR|     for e in self.get_events():
02:39:21 ERROR|   File "/home/qzhang/kar/workspace/avocado-vt/virttest/qemu_monitor.py", line 1823, in get_events
02:39:21 ERROR|     self._read_objects()
02:39:21 ERROR|   File "/home/qzhang/kar/workspace/avocado-vt/virttest/qemu_monitor.py", line 1511, in _read_objects
02:39:21 ERROR|     s += self._recvall()
02:39:21 ERROR|   File "/home/qzhang/kar/workspace/avocado-vt/virttest/qemu_monitor.py", line 306, in _recvall
02:39:21 ERROR|     e)
02:39:21 ERROR| MonitorSocketError: Could not receive data from monitor    ([Errno 104] Connection reset by peer)
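For context on the traceback: utils_misc.wait_for polls a predicate (here _check_go_down, which reads monitor events) until it returns a truthy value or the timeout expires, and the MonitorSocketError surfaces because the monitor connection was reset mid-poll. A rough sketch of that polling contract (not the actual avocado-vt code):

import time

def wait_for(func, timeout, step=1.0):
    # Call func() repeatedly until it returns a truthy value or the timeout
    # elapses; return None on timeout. Exceptions raised by func() (such as
    # the MonitorSocketError above) propagate to the caller.
    end = time.monotonic() + timeout
    while time.monotonic() < end:
        output = func()
        if output:
            return output
        time.sleep(step)
    return None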

Comment 6 xianwang 2018-08-08 10:14:30 UTC
(In reply to xianwang from comment #5)
> I have run this case 100 times and it failed 3 times. The error log is not
> the same as in the bug report, but the result is the same: the guest hangs.
> [...]


I have checked the serial log again; the guest does not hang in firmware, so this is not the same issue as in comment 0. I could not reproduce this bug. The serial log is as follows:
2018-08-08 02:38:11: [   14.300847] dracut: Waiting for mdraid devices to be clean.
2018-08-08 02:38:11: [   14.302507] dracut: Disassembling mdraid devices.
2018-08-08 02:38:11: Rebooting.
2018-08-08 02:38:11: LOADPARM=[........]
2018-08-08 02:38:11: Using virtio-scsi.
2018-08-08 02:38:12: Using SCSI scheme.
2018-08-08 02:38:12: ....
2018-08-08 02:38:12: .
2018-08-08 02:38:12:
2018-08-08 02:38:12: [    0.094471] Initializing cgroup subsys cpuset
2018-08-08 02:38:12: [    0.094475] Initializing cgroup subsys cpu

Comment 7 xianwang 2018-08-09 03:00:14 UTC
I ran this case 200 times on the newest package: 194 passed, 6 errored, and the error cases are the same as in comment 6. This bug is not hit on the newest version, so I will verify this bug now; if we hit it again later, we can reopen it.

Host:
4.14.0-101.el7a.s390x
qemu-kvm-rhev-2.12.0-9.el7.s390x

# python ConfigTest.py --testcase=migrate.with_reboot.tcp --guestname=RHEL.7.5 --platform=s390x --vcpu=2 --nrepeat=200

Comment 8 Thomas Huth 2018-08-09 05:48:19 UTC
OK, thanks for the verification! Are you going to open a new bug for the problem discovered in comment 6?

Comment 9 xianwang 2018-08-09 07:16:20 UTC
(In reply to Thomas Huth from comment #8)
> OK, thanks for the verification! Are you going to open a new bug for the
> problem discovered in comment 6?

After analyzing it, I think this error is an automation script issue: the serial log is fine, without errors. What's more, I have tried this scenario several times manually and it works well, so I will not open a new bug for it.

Comment 10 Thomas Huth 2018-08-09 08:44:53 UTC
(In reply to xianwang from comment #9)
> I think this error is an auto script issue after my analysis

OK, all right. I was not sure after reading the comment; if this is only a script issue, then indeed it does not make sense to open a new BZ for it.

Comment 12 errata-xmlrpc 2018-10-30 08:03:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3062

