Bug 1549079

Summary: Migrate during reboot occasionally hangs in firmware with virtio-scsi:setup:inquiry...
Product: Red Hat Enterprise Linux 7
Reporter: Lukáš Doktor <ldoktor>
Component: qemu-kvm-ma
Assignee: Thomas Huth <thuth>
Status: CLOSED ERRATA
QA Contact: xianwang <xianwang>
Severity: high
Docs Contact:
Priority: medium
Version: 7.5-Alt
CC: chayang, cohuck, coli, dhildenb, hhuang, juzhang, michen, mrezanin, qzhang, thuth, xianwang, yhong
Target Milestone: rc
Target Release: ---
Hardware: s390x
OS: Unspecified
Whiteboard:
Fixed In Version: qemu-kvm-ma-2.12.0-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-10-30 08:03:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Avocado-vt results of the migrate.with_reboot.tcp (flags: none)

Description Lukáš Doktor 2018-02-26 11:04:52 UTC
Created attachment 1400813 [details]
Avocado-vt results of the migrate.with_reboot.tcp

Description of problem:
I'm running the `migrate.with_reboot.tcp` Avocado-vt test, and in ~50% of cases the incoming migration machine hangs in firmware.

Version-Release number of selected component (if applicable):
* qemu-kvm-ma-2.10.0-20.el7
* kernel-4.14.0-28.el7a.s390x
* I have been getting these failures for about a month, always with the latest qemu-kvm-ma
* guest is RHEL.7.4 GA or latest nightly

How reproducible:
* ~50% of cases on RHEL.7.4
* ~10% of cases on nightly RHEL.7.5

Steps to Reproduce:
1. Install Avocado-vt
2. Run "avocado run unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads migrate.with_reboot.tcp --vt-guest-os RHEL.7.4 --vt-arch s390x --vt-machine-type s390-virtio" (which installs RHEL.7.4 guest and then runs migration)

Steps to Reproduce without Avocado (see the command-line sketch below):
1. Run a RHEL.7.4 guest via qemu
2. In the guest, execute "shutdown -r now"
3. Start the incoming qemu with "-incoming tcp:0:5200"
4. In the first qemu's QMP monitor, start the migration: "{'execute': 'migrate', 'arguments': {'uri': 'tcp:localhost:5200', 'blk': False, 'inc': False}, 'id': 'm1LCbCYP'}"
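A minimal command-line sketch of these steps (an illustration, not the exact Avocado-vt command line; the binary path /usr/libexec/qemu-kvm, the image path, the memory/CPU sizing, and the virtio-scsi device layout are assumptions, and the machine type is the s390x default s390-ccw-virtio):

IMG=/var/lib/libvirt/images/rhel74.qcow2

# 1. Source guest
/usr/libexec/qemu-kvm -machine s390-ccw-virtio -m 2048 -smp 2 \
    -drive file=$IMG,if=none,id=drive0,format=qcow2 \
    -device virtio-scsi-ccw,id=scsi0 \
    -device scsi-hd,drive=drive0,bus=scsi0.0 \
    -serial file:serial-src.log -monitor stdio

# 2. In the guest: shutdown -r now

# 3. Destination: the same command line plus -incoming, started before the migration
/usr/libexec/qemu-kvm -machine s390-ccw-virtio -m 2048 -smp 2 \
    -drive file=$IMG,if=none,id=drive0,format=qcow2 \
    -device virtio-scsi-ccw,id=scsi0 \
    -device scsi-hd,drive=drive0,bus=scsi0.0 \
    -serial file:serial-dst.log -incoming tcp:0:5200

# 4. On the source monitor, the HMP equivalent of the QMP command above:
#    (qemu) migrate -d tcp:localhost:5200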

Actual results:
In qemu serial output:
2018-02-20 12:55:09: [   14.001986] systemd-shutdown[1]: Not all DM devices detached, 1 left.
2018-02-20 12:55:09: [   14.001990] systemd-shutdown[1]: Cannot finalize remaining DM devices, continuing.
2018-02-20 12:55:09: [   14.024997] systemd-shutdown[1]: Successfully changed into root pivot.
2018-02-20 12:55:09: [   14.025001] systemd-shutdown[1]: Returning to initrd...
2018-02-20 12:55:09: [   14.056838] dracut Warning: Killing all remaining processes
2018-02-20 12:55:09: dracut Warning: Killing all remaining processes
2018-02-20 12:55:09: [   14.132683] XFS (dm-0): Unmounting Filesystem
2018-02-20 12:55:09: [   14.143876] dracut Warning: Unmounted /oldroot.
2018-02-20 12:55:09: [   14.159976] dracut: Disassembling device-mapper devices
2018-02-20 12:55:09: LOADPARM=[........]
2018-02-20 12:55:09: Using virtio-scsi.
2018-02-20 12:55:18: 
2018-02-20 12:55:18: ! virtio-scsi:setup:inquiry: response VS RESP=ff !
2018-02-20 12:56:44: (Process terminated with status 0)
2018-02-20 13:01:15: Ncat: Broken pipe.
2018-02-20 13:01:16: (Process terminated with status 1)

Expected results:
The machine should reboot back into the installed system.

Additional info:

Comment 3 Thomas Huth 2018-04-04 13:06:08 UTC
The fix that increases the timeout from 3 to 30 seconds has now been merged upstream and will be part of QEMU 2.12:
https://git.qemu.org/?p=qemu.git;a=commitdiff;h=23bf419c1c48e71c08737045f7d9c9b5e05ac4a7
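
To confirm that a given build carries the fix, two quick sanity checks (per the "Fixed In Version" field above, the fix ships in qemu-kvm-ma-2.12.0-1.el7 and later; the local clone path "qemu" is an assumption):

# On a RHEL host, check the installed package version:
rpm -q qemu-kvm-ma

# In an upstream QEMU checkout, verify the commit above is an ancestor of HEAD:
git -C qemu merge-base --is-ancestor 23bf419c1c48e71c08737045f7d9c9b5e05ac4a7 HEAD && echo "fix present"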

Comment 5 xianwang 2018-08-08 10:05:37 UTC
I have tried this case 100 times and it failed 3 times. The error log is not the same as in the bug report, but the result is the same: the guest hangs.
uname:
4.14.0-27.el7a.s390x
qemu-kvm-ma-2.10.0-20.el7.s390x

# python ConfigTest.py --testcase=migrate.with_reboot.tcp --guestname=RHEL.7.5 --platform=s390x --vcpu=2 --nrepeat=100

RESULTS    : PASS 98 | ERROR 3 | FAIL 1 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB TIME   : 7092.43 s
JOB HTML   : /root/avocado/job-results/job-2018-08-08T01.23-96a50d1/results.html

log:
02:39:21 ERROR| Reproduced traceback from: /home/qzhang/kar/workspace/avocado-vt/avocado_vt/test.py:434
02:39:21 ERROR| Traceback (most recent call last):
02:39:21 ERROR|   File "/var/lib/avocado/data/avocado-vt/test-providers.d/downloads/io-github-autotest-qemu/qemu/tests/migration_with_reboot.py", line 56, in run
02:39:21 ERROR|     session = bg.join()
02:39:21 ERROR|   File "/home/qzhang/kar/workspace/avocado-vt/virttest/utils_misc.py", line 136, in join
02:39:21 ERROR|     six.reraise(*self._e)
02:39:21 ERROR|   File "/home/qzhang/kar/workspace/avocado-vt/virttest/utils_misc.py", line 110, in run
02:39:21 ERROR|     self._retval = self._target(*self._args, **self._kwargs)
02:39:21 ERROR|   File "/home/qzhang/kar/workspace/avocado-vt/virttest/error_context.py", line 135, in new_fn
02:39:21 ERROR|     return fn(*args, **kwargs)
02:39:21 ERROR|   File "/home/qzhang/kar/workspace/avocado-vt/virttest/qemu_vm.py", line 4156, in reboot
02:39:21 ERROR|     if not utils_misc.wait_for(_check_go_down, timeout=timeout):
02:39:21 ERROR|   File "/home/qzhang/kar/workspace/avocado-vt/virttest/utils_misc.py", line 685, in wait_for
02:39:21 ERROR|     output = func()
02:39:21 ERROR|   File "/home/qzhang/kar/workspace/avocado-vt/virttest/qemu_monitor.py", line 1835, in get_event
02:39:21 ERROR|     for e in self.get_events():
02:39:21 ERROR|   File "/home/qzhang/kar/workspace/avocado-vt/virttest/qemu_monitor.py", line 1823, in get_events
02:39:21 ERROR|     self._read_objects()
02:39:21 ERROR|   File "/home/qzhang/kar/workspace/avocado-vt/virttest/qemu_monitor.py", line 1511, in _read_objects
02:39:21 ERROR|     s += self._recvall()
02:39:21 ERROR|   File "/home/qzhang/kar/workspace/avocado-vt/virttest/qemu_monitor.py", line 306, in _recvall
02:39:21 ERROR|     e)
02:39:21 ERROR| MonitorSocketError: Could not receive data from monitor    ([Errno 104] Connection reset by peer)

Comment 6 xianwang 2018-08-08 10:14:30 UTC
(In reply to xianwang from comment #5)
> I have tried this case 100 times and it failed 3 times. The error log is not
> the same as in the bug report, but the result is the same: the guest hangs.
> [...]
> 02:39:21 ERROR| MonitorSocketError: Could not receive data from monitor ([Errno 104] Connection reset by peer)


I have checked the serial log again; the guest does not hang in firmware, so this is not the same as comment 0. I could not reproduce this bug. The serial log is as follows:
2018-08-08 02:38:11: [   14.300847] dracut: Waiting for mdraid devices to be clean.
2018-08-08 02:38:11: [   14.302507] dracut: Disassembling mdraid devices.
2018-08-08 02:38:11: Rebooting.
2018-08-08 02:38:11: LOADPARM=[........]
2018-08-08 02:38:11: Using virtio-scsi.
2018-08-08 02:38:12: Using SCSI scheme.
2018-08-08 02:38:12: ....
2018-08-08 02:38:12: .
2018-08-08 02:38:12:
2018-08-08 02:38:12: [    0.094471] Initializing cgroup subsys cpuset
2018-08-08 02:38:12: [    0.094475] Initializing cgroup subsys cpu

Comment 7 xianwang 2018-08-09 03:00:14 UTC
I tried this case 200 times on the newest package: 194 passed and 6 errored, and the error cases are the same as in comment 6. So this bug is not hit on the newest version; I will verify this bug now, and if we hit it later we can reopen it.

Host:
4.14.0-101.el7a.s390x
qemu-kvm-rhev-2.12.0-9.el7.s390x

# python ConfigTest.py --testcase=migrate.with_reboot.tcp --guestname=RHEL.7.5 --platform=s390x --vcpu=2 --nrepeat=200

Comment 8 Thomas Huth 2018-08-09 05:48:19 UTC
ok, thanks for the verification! Are you going to open a new bug for the problem discovered in comment 6?

Comment 9 xianwang 2018-08-09 07:16:20 UTC
(In reply to Thomas Huth from comment #8)
> ok, thanks for the verification! Are you going to open a new bug for the
> problem discovered in comment 6?

I think this error is an automation script issue: after my analysis, the serial log is clean, with no errors. What's more, I have tried this scenario manually several times and it works well, so I will not open a new bug for it.

Comment 10 Thomas Huth 2018-08-09 08:44:53 UTC
(In reply to xianwang from comment #9)
> I think this error is an automation script issue [...]

OK, all right. I was not sure after reading the comment, so if this is only a script issue, then indeed it does not make sense to open a new BZ for it.

Comment 12 errata-xmlrpc 2018-10-30 08:03:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3062