Description of problem:
After a storage issue caused VMs to enter the paused state, and even though the domain was re-activated, resuming these VMs fails on vdsm.

Version-Release number of selected component (if applicable):

[root@puma55 libvirt]# rpm -qa |grep vdsm
vdsm-hook-openstacknet-4.20.39-1.el7ev.noarch
vdsm-python-4.20.39-1.el7ev.noarch
vdsm-4.20.39-1.el7ev.x86_64
vdsm-hook-ethtool-options-4.20.39-1.el7ev.noarch
vdsm-hook-vhostmd-4.20.39-1.el7ev.noarch
vdsm-jsonrpc-4.20.39-1.el7ev.noarch
vdsm-hook-fcoe-4.20.39-1.el7ev.noarch
vdsm-yajsonrpc-4.20.39-1.el7ev.noarch
vdsm-network-4.20.39-1.el7ev.x86_64
vdsm-hook-vmfex-dev-4.20.39-1.el7ev.noarch
vdsm-api-4.20.39-1.el7ev.noarch
vdsm-common-4.20.39-1.el7ev.noarch
vdsm-client-4.20.39-1.el7ev.noarch
vdsm-http-4.20.39-1.el7ev.noarch

[root@puma55 libvirt]# rpm -qa |grep ovirt
ovirt-host-dependencies-4.2.3-1.el7ev.x86_64
ovirt-host-deploy-1.7.4-1.el7ev.noarch
cockpit-machines-ovirt-172-2.el7.noarch
ovirt-hosted-engine-ha-2.2.16-1.el7ev.noarch
ovirt-provider-ovn-driver-1.2.14-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.26-1.el7ev.noarch
ovirt-vmconsole-host-1.0.5-4.el7ev.noarch
ovirt-imageio-daemon-1.4.4-0.el7ev.noarch
ovirt-setup-lib-1.1.5-1.el7ev.noarch
ovirt-host-4.2.3-1.el7ev.x86_64
python-ovirt-engine-sdk4-4.2.8-1.el7ev.x86_64
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
cockpit-ovirt-dashboard-0.11.33-1.el7ev.noarch
ovirt-imageio-common-1.4.4-0.el7ev.x86_64
ovirt-vmconsole-1.0.5-4.el7ev.noarch

[root@puma55 libvirt]# rpm -qa |grep libvirt
libvirt-daemon-driver-storage-disk-3.9.0-14.el7_5.7.x86_64
libvirt-client-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-storage-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-lxc-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-interface-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-storage-scsi-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-kvm-3.9.0-14.el7_5.7.x86_64
libvirt-3.9.0-14.el7_5.7.x86_64
libvirt-python-3.9.0-1.el7.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-storage-rbd-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-config-nwfilter-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-nodedev-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-storage-core-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-storage-mpath-3.9.0-14.el7_5.7.x86_64
libvirt-libs-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-nwfilter-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-secret-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-storage-logical-3.9.0-14.el7_5.7.x86_64
libvirt-lock-sanlock-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-storage-iscsi-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-network-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-config-network-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-qemu-3.9.0-14.el7_5.7.x86_64

[root@puma55 libvirt]# rpm -qa |grep qemu
qemu-kvm-common-rhev-2.10.0-21.el7_5.7.x86_64
qemu-img-rhev-2.10.0-21.el7_5.7.x86_64
qemu-kvm-tools-rhev-2.10.0-21.el7_5.7.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
qemu-kvm-rhev-2.10.0-21.el7_5.7.x86_64
libvirt-daemon-driver-qemu-3.9.0-14.el7_5.7.x86_64

[root@puma55 libvirt]# rpm -qa |grep sanlock
sanlock-lib-3.6.0-1.el7.x86_64
sanlock-3.6.0-1.el7.x86_64
libvirt-lock-sanlock-3.9.0-14.el7_5.7.x86_64
sanlock-python-3.6.0-1.el7.x86_64

[root@puma55 libvirt]# rpm -qa |grep selinux
libselinux-ruby-2.5-12.el7.x86_64
libselinux-2.5-12.el7.x86_64
libselinux-2.5-12.el7.i686
selinux-policy-3.13.1-192.el7_5.6.noarch
libselinux-utils-2.5-12.el7.x86_64
libselinux-python-2.5-12.el7.x86_64
selinux-policy-targeted-3.13.1-192.el7_5.6.noarch
openvswitch-selinux-extra-policy-1.0-7.el7fdp.noarch

[root@puma55 libvirt]# uname -a
Linux puma55.scl.lab.tlv.redhat.com 3.10.0-862.11.6.el7.x86_64 #1 SMP Fri Aug 10 16:55:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

[root@puma55 libvirt]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.5 (Maipo)

How reproducible:
In this specific situation, no VM can be resumed.

Steps to Reproduce:
1. While VMs whose disks reside on domain A are running, disconnect domain A from the hosts.
2. Wait for the VMs to enter the paused state, then re-connect the hosts to the storage.

VM 2ea5abc3-0922-47b6-baa3-699dce5040cf moves to Paused:

2018-10-02 16:12:03,089+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-12) [] VM '2ea5abc3-0922-47b6-baa3-699dce5040cf'(cfme-5.9-05) moved from 'Unknown' --> 'Paused'

Actual results:
The VMs should be resumed automatically, but that doesn't succeed:

2018-10-02 16:12:09,321+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-2) [] VM '2ea5abc3-0922-47b6-baa3-699dce5040cf'(cfme-5.9-05) moved from 'Paused' --> 'Up'
2018-10-02 16:12:09,337+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-2) [] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM cfme-5.9-05 has recovered from paused back to up.

The VM then enters the not-responding state:

2018-10-02 16:12:18,360+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-63) [] VM '2ea5abc3-0922-47b6-baa3-699dce5040cf'(cfme-5.9-05) moved from 'Up' --> 'NotResponding'

In vdsm.log:

2018-10-02 16:16:34,060+0300 WARN  (jsonrpc/4) [virt.vm] (vmId='2ea5abc3-0922-47b6-baa3-699dce5040cf') monitor became unresponsive (command timeout, age=264.27) (vm:6215)
2018-10-02 16:16:34,061+0300 INFO  (jsonrpc/4) [api.host] FINISH getAllVmStats return={'status': {'message': 'Done', 'code': 0}, 'statsList': (suppressed)} from=::ffff:10.35.68.3,55246 (api:52)
2018-10-02 16:16:34,062+0300 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.01 seconds (__init__:573)
2018-10-02 16:16:39,775+0300 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=periodic/1 running <Task <Operation action=<function <lambda> at 0x7ff974571de8> at 0x7ff974424ad0> timeout=30.0, duration=270 at 0x7ff97436c590> task#=80 at 0x7ff974581150>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 220, in __call__
  self._func()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 101, in <lambda>
  lambda: recovery.lookup_external_vms(cif),
File: "/usr/lib/python2.7/site-packages/vdsm/virt/recovery.py", line 249, in lookup_external_vms
  for vm_id in cif.pop_unknown_vm_ids():
File: "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 171, in pop_unknown_vm_ids
  with self.vmContainerLock:
File: "/usr/lib/python2.7/site-packages/pthreading.py", line 60, in __enter__
  self.acquire()
File: "/usr/lib/python2.7/site-packages/pthreading.py", line 68, in acquire
  rc = self.lock() if blocking else self.trylock()
File: "/usr/lib/python2.7/site-packages/pthread.py", line 96, in lock
  return _libpthread.pthread_mutex_lock(self._mutex) (executor:363)

Expected results:
The VM should be resumed successfully from the paused state after EIO.

Additional info:
Logs: https://drive.google.com/file/d/1APJU33Dyq2gwZHjfbzsJD_u24h5nOdX4/view?usp=sharing
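For reference, below is a minimal, self-contained sketch of the blocking pattern suggested by the traceback above: one thread holds the VM container lock while stuck in a slow call, so the periodic lookup_external_vms() worker hangs on acquire() and the executor's watchdog reports it as blocked. The names (stuck_holder, periodic_lookup, watchdog) are illustrative only and are not vdsm's actual code.

import threading
import time

vm_container_lock = threading.Lock()


def stuck_holder():
    # Stand-in for a thread that took the container lock and then blocked,
    # e.g. on an unresponsive libvirt/QEMU monitor call.
    with vm_container_lock:
        time.sleep(600)


def periodic_lookup():
    # Stand-in for the periodic recovery task; it hangs here because
    # acquire() has no timeout while the lock is held by stuck_holder().
    with vm_container_lock:
        print("looked up external VMs")


def watchdog(worker, timeout=3.0):
    # Stand-in for the executor noticing that the periodic worker exceeded
    # its timeout (30s in the log above; shortened here for the demo).
    worker.join(timeout)
    if worker.is_alive():
        print("WARN Worker blocked: periodic task exceeded %.1fs" % timeout)


if __name__ == "__main__":
    threading.Thread(target=stuck_holder, daemon=True).start()
    time.sleep(0.1)  # let the holder grab the lock first
    periodic = threading.Thread(target=periodic_lookup, daemon=True)
    periodic.start()
    watchdog(periodic)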
Please specify exactly when the storage was blocked and when exactly it was reconnected. Also, why are you testing on RHEL 7.5?
(In reply to Michal Skrivanek from comment #2)
> Please specify exactly when the storage was blocked and when exactly it was
> reconnected.

Storage got disconnected here (host1 logs):

2018-10-02 16:04:02,349+0300 WARN  (mpathlistener) [storage.mpathhealth] Path u'sdy' failed for multipath device u'20024f4005854005e', no valid paths left (mpathhealth:143)
2018-10-02 16:04:03,315+0300 ERROR (check/loop) [storage.Monitor] Error checking path /dev/24e7091b-e312-404b-93ce-3e49b4113914/metadata (monitor:498)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 496, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 391, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure: ('/dev/24e7091b-e312-404b-93ce-3e49b4113914/metadata', 1, 'Read timeout')

> Also, why are you testing on RHEL 7.5?

This is not a test environment but rather a production one we use within the QE department.
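For context on the MiscFileReadException above: conceptually, the storage monitor performs a timed, direct-I/O read of the domain metadata volume and turns a timeout into a read failure. The sketch below is only an approximation of that idea, not vdsm's actual check code; the path is the one from the log and is only read.

import subprocess

METADATA_PATH = "/dev/24e7091b-e312-404b-93ce-3e49b4113914/metadata"


def check_path(path, timeout=10.0):
    # Read a single block with O_DIRECT so the page cache cannot mask a dead
    # path; a timeout or non-zero exit is reported as a read failure, similar
    # to the "Read timeout" MiscFileReadException in the log above.
    cmd = ["dd", "if=" + path, "of=/dev/null", "bs=4096", "count=1",
           "iflag=direct"]
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        raise RuntimeError("Read timeout: %s" % path)
    if proc.returncode != 0:
        raise RuntimeError("Read failed (rc=%d): %s"
                           % (proc.returncode, proc.stderr.decode()))


if __name__ == "__main__":
    check_path(METADATA_PATH)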
Between 16:04 and 16:12 there was no connectivity between puma55 and the engine. In the meantime vdsm was restarted by someone; apparently not by fencing, as that seems to be disabled for this host. Still, there's:

2018-10-02 16:04:50,679+0300 INFO  (MainThread) [vds] Received signal 15, shutting down (vdsmd:68)

Some time in between, libvirt got into a locked state. Logs from libvirt are missing. Please add libvirt logs or a full sosreport of that host.
sos report: https://drive.google.com/file/d/1MsqQOykS09mJjYghudOI9gRqWvlexyHK/view?usp=sharing
I'm not able to get this from Google Drive. Elad, can you host it elsewhere?
This one should work: http://file.tlv.redhat.com/ebenahar/sosreport-puma55.scl.lab.tlv.redhat.com-20181015144234.tar.xz
Unfortunately, the libvirt logs included here are all empty, and they start after the event. Is this reproducible? I tried disconnecting storage last week and did not encounter it. However, the vdsm log contains a huge number of errors about a full executor queue, which looks like it may be a different bug (shouldn't vdsm drop a task from the queue if there's an exception? It seems like it should, but I've never checked).
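On the full-executor-queue question, the behaviour being discussed (discarding new tasks once a bounded queue fills up, instead of letting them pile up behind blocked workers) would look roughly like the sketch below. This is not vdsm's actual executor, just an illustration of the idea; all names are made up.

import queue


class BoundedTaskQueue:
    # Illustrative bounded dispatch queue: once full, new tasks are discarded
    # and counted rather than queued behind workers that never return.
    def __init__(self, max_tasks=100):
        self._queue = queue.Queue(maxsize=max_tasks)
        self.discarded = 0

    def dispatch(self, task):
        try:
            self._queue.put_nowait(task)
        except queue.Full:
            self.discarded += 1
            raise RuntimeError("Too many tasks, discarding %r" % (task,))

    def get_task(self):
        # Workers pull tasks from here; blocks until a task is available.
        return self._queue.get()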
Haven't seen the issue ever since its original occurrence.
*** Bug 1664032 has been marked as a duplicate of this bug. ***
Re-opened due to an upstream report. Can you please add any logs or steps to reproduce?
The description of the problem is the same: a connectivity problem between the oVirt host and the iSCSI storage caused VMs to enter the paused state. After the storage domain was recovered, the VMs remain stuck in the paused state.

Steps to Reproduce:
1. Start VMs whose disks are on the iSCSI storage on the host, then disconnect the storage from the host (interrupt the network connection between the host and the storage; see the sketch below for one way to do this).
2. After the VMs have turned to the paused state, re-connect the host to the storage.

Actual results:
The VMs should be resumed automatically, but that doesn't succeed. They either enter the not-responding state or remain paused; the behaviour differs from time to time.

Expected results:
VMs should resume successfully from the paused state after EIO.

What kind of additional logs should be collected?
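For step 1 above, one possible way to interrupt the connection between the host and the iSCSI storage is to temporarily drop outgoing traffic to the portal, for example with a small helper like the sketch below. The portal address is a placeholder and this is not part of any oVirt tooling; adjust it for the actual environment.

import subprocess
import time

PORTAL = "192.0.2.10"  # placeholder iSCSI portal address
PORT = "3260"          # default iSCSI port


def set_block(enabled):
    # Insert (or remove) an iptables rule dropping traffic to the portal.
    action = "-A" if enabled else "-D"
    subprocess.check_call(["iptables", action, "OUTPUT", "-p", "tcp",
                           "-d", PORTAL, "--dport", PORT, "-j", "DROP"])


if __name__ == "__main__":
    set_block(True)    # VMs with disks on this storage should pause with EIO
    time.sleep(600)    # keep the outage long enough for the VMs to pause
    set_block(False)   # restore connectivity; the VMs should auto-resume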
Any updates?
vdsm.log, hopefully
Created attachment 1630039 [details]
vdsm log

The vdsm.log is attached. The problem still persists.
Please attach a complete sosreport