Bug 1549079
Summary: | Migrate during reboot occasionally hangs in firmware with virtio-scsi:setup:inquiry... | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Lukáš Doktor <ldoktor> | ||||
Component: | qemu-kvm-ma | Assignee: | Thomas Huth <thuth> | ||||
Status: | CLOSED ERRATA | QA Contact: | xianwang <xianwang> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | medium | ||||||
Version: | 7.5-Alt | CC: | chayang, cohuck, coli, dhildenb, hhuang, juzhang, michen, mrezanin, qzhang, thuth, xianwang, yhong | ||||
Target Milestone: | rc | ||||||
Target Release: | --- | ||||||
Hardware: | s390x | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | qemu-kvm-ma-2.12.0-1.el7 | Doc Type: | If docs needed, set a value | ||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2018-10-30 08:03:30 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Attachments: |
|
The fix that increases the timeout from 3 to 30 seconds has now been merged upstream and will be part of QEMU 2.12: https://git.qemu.org/?p=qemu.git;a=commitdiff;h=23bf419c1c48e71c08737045f7d9c9b5e05ac4a7 I have tried 100 times for this case and there are 3 times failed, the error log is not same with bug report, but the result is same that guest hang. uname: 4.14.0-27.el7a.s390x qemu-kvm-ma-2.10.0-20.el7.s390x # python ConfigTest.py --testcase=migrate.with_reboot.tcp --guestname=RHEL.7.5 --platform=s390x --vcpu=2 --nrepeat=100 RESULTS : PASS 98 | ERROR 3 | FAIL 1 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0 JOB TIME : 7092.43 s JOB HTML : /root/avocado/job-results/job-2018-08-08T01.23-96a50d1/results.html log: 02:39:21 ERROR| Reproduced traceback from: /home/qzhang/kar/workspace/avocado-vt/avocado_vt/test.py:434 02:39:21 ERROR| Traceback (most recent call last): 02:39:21 ERROR| File "/var/lib/avocado/data/avocado-vt/test-providers.d/downloads/io-github-autotest-qemu/qemu/tests/migration_with_reboot.py", line 56, in run 02:39:21 ERROR| session = bg.join() 02:39:21 ERROR| File "/home/qzhang/kar/workspace/avocado-vt/virttest/utils_misc.py", line 136, in join 02:39:21 ERROR| six.reraise(*self._e) 02:39:21 ERROR| File "/home/qzhang/kar/workspace/avocado-vt/virttest/utils_misc.py", line 110, in run 02:39:21 ERROR| self._retval = self._target(*self._args, **self._kwargs) 02:39:21 ERROR| File "/home/qzhang/kar/workspace/avocado-vt/virttest/error_context.py", line 135, in new_fn 02:39:21 ERROR| return fn(*args, **kwargs) 02:39:21 ERROR| File "/home/qzhang/kar/workspace/avocado-vt/virttest/qemu_vm.py", line 4156, in reboot 02:39:21 ERROR| if not utils_misc.wait_for(_check_go_down, timeout=timeout): 02:39:21 ERROR| File "/home/qzhang/kar/workspace/avocado-vt/virttest/utils_misc.py", line 685, in wait_for 02:39:21 ERROR| output = func() 02:39:21 ERROR| File "/home/qzhang/kar/workspace/avocado-vt/virttest/qemu_monitor.py", line 1835, in get_event 02:39:21 ERROR| for e in self.get_events(): 02:39:21 ERROR| File "/home/qzhang/kar/workspace/avocado-vt/virttest/qemu_monitor.py", line 1823, in get_events 02:39:21 ERROR| self._read_objects() 02:39:21 ERROR| File "/home/qzhang/kar/workspace/avocado-vt/virttest/qemu_monitor.py", line 1511, in _read_objects 02:39:21 ERROR| s += self._recvall() 02:39:21 ERROR| File "/home/qzhang/kar/workspace/avocado-vt/virttest/qemu_monitor.py", line 306, in _recvall 02:39:21 ERROR| e) 02:39:21 ERROR| MonitorSocketError: Could not receive data from monitor ([Errno 104] Connection reset by peer) (In reply to xianwang from comment #5) > I have tried 100 times for this case and there are 3 times failed, the error > log is not same with bug report, but the result is same that guest hang. > uname: > 4.14.0-27.el7a.s390x > qemu-kvm-ma-2.10.0-20.el7.s390x > > # python ConfigTest.py --testcase=migrate.with_reboot.tcp > --guestname=RHEL.7.5 --platform=s390x --vcpu=2 --nrepeat=100 > > RESULTS : PASS 98 | ERROR 3 | FAIL 1 | SKIP 0 | WARN 0 | INTERRUPT 0 | > CANCEL 0 > JOB TIME : 7092.43 s > JOB HTML : > /root/avocado/job-results/job-2018-08-08T01.23-96a50d1/results.html > > log: > 02:39:21 ERROR| Reproduced traceback from: > /home/qzhang/kar/workspace/avocado-vt/avocado_vt/test.py:434 > 02:39:21 ERROR| Traceback (most recent call last): > 02:39:21 ERROR| File > "/var/lib/avocado/data/avocado-vt/test-providers.d/downloads/io-github- > autotest-qemu/qemu/tests/migration_with_reboot.py", line 56, in run > 02:39:21 ERROR| session = bg.join() > 02:39:21 ERROR| File > "/home/qzhang/kar/workspace/avocado-vt/virttest/utils_misc.py", line 136, in > join > 02:39:21 ERROR| six.reraise(*self._e) > 02:39:21 ERROR| File > "/home/qzhang/kar/workspace/avocado-vt/virttest/utils_misc.py", line 110, in > run > 02:39:21 ERROR| self._retval = self._target(*self._args, **self._kwargs) > 02:39:21 ERROR| File > "/home/qzhang/kar/workspace/avocado-vt/virttest/error_context.py", line 135, > in new_fn > 02:39:21 ERROR| return fn(*args, **kwargs) > 02:39:21 ERROR| File > "/home/qzhang/kar/workspace/avocado-vt/virttest/qemu_vm.py", line 4156, in > reboot > 02:39:21 ERROR| if not utils_misc.wait_for(_check_go_down, > timeout=timeout): > 02:39:21 ERROR| File > "/home/qzhang/kar/workspace/avocado-vt/virttest/utils_misc.py", line 685, in > wait_for > 02:39:21 ERROR| output = func() > 02:39:21 ERROR| File > "/home/qzhang/kar/workspace/avocado-vt/virttest/qemu_monitor.py", line 1835, > in get_event > 02:39:21 ERROR| for e in self.get_events(): > 02:39:21 ERROR| File > "/home/qzhang/kar/workspace/avocado-vt/virttest/qemu_monitor.py", line 1823, > in get_events > 02:39:21 ERROR| self._read_objects() > 02:39:21 ERROR| File > "/home/qzhang/kar/workspace/avocado-vt/virttest/qemu_monitor.py", line 1511, > in _read_objects > 02:39:21 ERROR| s += self._recvall() > 02:39:21 ERROR| File > "/home/qzhang/kar/workspace/avocado-vt/virttest/qemu_monitor.py", line 306, > in _recvall > 02:39:21 ERROR| e) > 02:39:21 ERROR| MonitorSocketError: Could not receive data from monitor > ([Errno 104] Connection reset by peer) I have checked serial log again, guest does not hang at firmware, so, it is not same with comment0, I could not reproduce this bug, serial log is as following: 2018-08-08 02:38:11: [ 14.300847] dracut: Waiting for mdraid devices to be clean. 2018-08-08 02:38:11: [ 14.302507] dracut: Disassembling mdraid devices. 2018-08-08 02:38:11: Rebooting. 2018-08-08 02:38:11: LOADPARM=[........] 2018-08-08 02:38:11: Using virtio-scsi. 2018-08-08 02:38:12: Using SCSI scheme. 2018-08-08 02:38:12: .... 2018-08-08 02:38:12: . 2018-08-08 02:38:12: 2018-08-08 02:38:12: [ 0.094471] Initializing cgroup subsys cpuset 2018-08-08 02:38:12: [ 0.094475] Initializing cgroup subsys cpu I tried 200 times with this case on newest package, pass 194, error 6, and the error cases are same with comment6, so, this bug is not hit on newest version, I will verify this bug now, and if we hit it later we could reopen it. Host: 4.14.0-101.el7a.s390x qemu-kvm-rhev-2.12.0-9.el7.s390x # python ConfigTest.py --testcase=migrate.with_reboot.tcp --guestname=RHEL.7.5 --platform=s390x --vcpu=2 --nrepeat=200 ok, thanks for the verification! Are you going to open a new bug for the problem discovered in comment 6 ? (In reply to Thomas Huth from comment #8) > ok, thanks for the verification! Are you going to open a new bug for the > problem discovered in comment 6 ? I think this error is an auto script issue after my analysis, the serial log is ok without error, what's more, I have tried this scenario for several times manually, it works well, so I will not open new bug for it. (In reply to xianwang from comment #9) > I think this error is an auto script issue after my analysis Ok, all right, I was not sure after reading the comment, so if this is only a script issue, then it does not make sense to open a new BZ for this, indeed. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:3062 |
Created attachment 1400813 [details] Avocado-vt results of the migrate.with_reboot.tcp Description of problem: I'm running `migrate.with_reboot.tcp` Avocado-vt test and in ~50% of cases the incoming migration machine hangs while in firmware. Version-Release number of selected component (if applicable): * qemu-kvm-ma-2.10.0-20.el7 * kernel-4.14.0-28.el7a.s390x * but I'm getting those failures for about 1 month always with the latest qemu-kvm-ma * guest is RHEL.7.4 GA or latest nightly How reproducible: * ~50% of cases on RHEL.7.4 * ~10% of cases on nightly RHEL.7.5 Steps to Reproduce: 1. Install Avocado-vt 2. Run "avocado run unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads migrate.with_reboot.tcp --vt-guest-os RHEL.7.4 --vt-arch s390x --vt-machine-type s390-virtio" (which installs RHEL.7.4 guest and then runs migration) Steps to Reproduce without Avocado: 1. run RHEL.7.4 guest via qemu 2. in guest execute "shutdown -r now" 3. run incoming qemu "-incomming tcp:0:5200" 4. and in first qemu monitor start migration "{'execute': 'migrate', 'arguments': {'uri': 'tcp:localhost:5200', 'blk': False, 'inc': False}, 'id': 'm1LCbCYP'}" Actual results: In qemu serial output: 2018-02-20 12:55:09: [ 14.001986] systemd-shutdown[1]: Not all DM devices detached, 1 left. 2018-02-20 12:55:09: [ 14.001990] systemd-shutdown[1]: Cannot finalize remaining DM devices, continuing. 2018-02-20 12:55:09: [ 14.024997] systemd-shutdown[1]: Successfully changed into root pivot. 2018-02-20 12:55:09: [ 14.025001] systemd-shutdown[1]: Returning to initrd... 2018-02-20 12:55:09: [ 14.056838] dracut Warning: Killing all remaining processes 2018-02-20 12:55:09: dracut Warning: Killing all remaining processes 2018-02-20 12:55:09: [ 14.132683] XFS (dm-0): Unmounting Filesystem 2018-02-20 12:55:09: [ 14.143876] dracut Warning: Unmounted /oldroot. 2018-02-20 12:55:09: [ 14.159976] dracut: Disassembling device-mapper devices 2018-02-20 12:55:09: LOADPARM=[........] 2018-02-20 12:55:09: Using virtio-scsi. 2018-02-20 12:55:18: 2018-02-20 12:55:18: ! virtio-scsi:setup:inquiry: response VS RESP=ff ! 2018-02-20 12:56:44: (Process terminated with status 0) 2018-02-20 13:01:15: Ncat: Broken pipe. 2018-02-20 13:01:16: (Process terminated with status 1) Expected results: The machine should reboot into system. Additional info: