Bug 1664032 - Failure to resume VMs from paused: TimeoutError: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainResume)
Summary: Failure to resume VMs from paused: TimeoutError: Timed out during operation: ...
Keywords:
Status: CLOSED DUPLICATE of bug 1635581
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.20.31
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Dan Kenigsberg
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-07 14:50 UTC by NIkita Katrich
Modified: 2019-01-08 11:31 UTC (History)
CC List: 3 users

Fixed In Version:
Clone Of: 1635581
Environment:
Last Closed: 2019-01-08 11:31:36 UTC
oVirt Team: Virt
Embargoed:



Description NIkita Katrich 2019-01-07 14:50:41 UTC
+++ This bug was initially created as a clone of Bug #1635581 +++

Description of problem:
After a storage issue caused VMs to enter the paused state, and even though the storage domain was re-activated, resuming these VMs fails on the vdsm side.


Version-Release number of selected component (if applicable):

[root@host02 ~]#  cat /etc/redhat-release 
CentOS Linux release 7.6.1810 (Core) 
[root@host02 ~]#  rpm -qa |grep libvirt
libvirt-daemon-driver-secret-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-qemu-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-storage-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-storage-scsi-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-nodedev-4.5.0-10.el7_6.3.x86_64
libvirt-python-4.5.0-1.el7.x86_64
libvirt-daemon-driver-network-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-storage-rbd-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-storage-mpath-4.5.0-10.el7_6.3.x86_64
libvirt-lock-sanlock-4.5.0-10.el7_6.3.x86_64
libvirt-bash-completion-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-nwfilter-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-config-nwfilter-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-config-network-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-storage-core-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-storage-iscsi-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-10.el7_6.3.x86_64
libvirt-client-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-lxc-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-storage-logical-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-kvm-4.5.0-10.el7_6.3.x86_64
libvirt-libs-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-interface-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-storage-disk-4.5.0-10.el7_6.3.x86_64
libvirt-4.5.0-10.el7_6.3.x86_64
[root@host02 ~]#  rpm -qa |grep ovirt
ovirt-imageio-common-1.4.5-0.el7.x86_64
ovirt-hosted-engine-setup-2.2.32-1.el7.noarch
cockpit-ovirt-dashboard-0.11.37-1.el7.noarch
ovirt-setup-lib-1.1.5-1.el7.noarch
ovirt-release41-4.1.8-1.el7.centos.noarch
ovirt-imageio-daemon-1.4.5-0.el7.noarch
ovirt-provider-ovn-driver-1.2.16-1.el7.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7.noarch
python-ovirt-engine-sdk4-4.2.9-2.el7.x86_64
ovirt-host-deploy-1.7.4-1.el7.noarch
ovirt-host-dependencies-4.2.3-1.el7.x86_64
ovirt-host-4.2.3-1.el7.x86_64
ovirt-vmconsole-host-1.0.5-4.el7.centos.noarch
ovirt-release42-4.2.7.1-1.el7.noarch
ovirt-hosted-engine-ha-2.2.18-1.el7.noarch
ovirt-vmconsole-1.0.5-4.el7.centos.noarch
cockpit-machines-ovirt-176-4.el7.centos.noarch
[root@host02 ~]#  rpm -qa |grep qemu
libvirt-daemon-driver-qemu-4.5.0-10.el7_6.3.x86_64
qemu-kvm-common-ev-2.12.0-18.el7_6.1.1.x86_64
qemu-kvm-tools-ev-2.12.0-18.el7_6.1.1.x86_64
qemu-img-ev-2.12.0-18.el7_6.1.1.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
qemu-kvm-ev-2.12.0-18.el7_6.1.1.x86_64
vdsm-hook-qemucmdline-4.20.43-1.el7.noarch
[root@host02 ~]#  rpm -qa |grep sanlock
sanlock-lib-3.6.0-1.el7.x86_64
sanlock-python-3.6.0-1.el7.x86_64
libvirt-lock-sanlock-4.5.0-10.el7_6.3.x86_64
sanlock-3.6.0-1.el7.x86_64
[root@host02 ~]#  rpm -qa |grep selinux
selinux-policy-targeted-3.13.1-229.el7_6.6.noarch
libselinux-python-2.5-14.1.el7.x86_64
libselinux-2.5-14.1.el7.i686
selinux-policy-3.13.1-229.el7_6.6.noarch
libselinux-utils-2.5-14.1.el7.x86_64
libselinux-2.5-14.1.el7.x86_64
[root@host02 ~]#  rpm -qa |grep vdsm
vdsm-jsonrpc-4.20.43-1.el7.noarch
vdsm-hook-vhostmd-4.20.43-1.el7.noarch
vdsm-yajsonrpc-4.20.43-1.el7.noarch
vdsm-api-4.20.43-1.el7.noarch
vdsm-client-4.20.43-1.el7.noarch
vdsm-common-4.20.43-1.el7.noarch
vdsm-python-4.20.43-1.el7.noarch
vdsm-4.20.43-1.el7.x86_64
vdsm-hook-fcoe-4.20.43-1.el7.noarch
vdsm-hook-macspoof-4.20.43-1.el7.noarch
vdsm-network-4.20.43-1.el7.x86_64
vdsm-hook-vmfex-dev-4.20.43-1.el7.noarch
vdsm-hook-ethtool-options-4.20.43-1.el7.noarch
vdsm-hook-openstacknet-4.20.43-1.el7.noarch
vdsm-http-4.20.43-1.el7.noarch
vdsm-hook-qemucmdline-4.20.43-1.el7.noarch
[root@host02 ~]#  uname -a
Linux host02 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 29 14:49:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
In this specific situation, no VM can be resumed.


Steps to Reproduce:
1. While VMs with disks residing on storage domain A are running, disconnect domain A from the hosts.
2. Wait for the VMs to enter the paused state, then reconnect the hosts to the storage.
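For illustration, a minimal Python sketch (assuming libvirt-python is installed and using the VM UUID from the log below) that waits for the VM to pause on an I/O error; this is not vdsm code, only an approximation of the state these steps lead to:

import time
import libvirt

VM_UUID = 'ae51780f-c1bf-48c3-b672-7018612c00f6'  # UUID taken from the log below

conn = libvirt.open('qemu:///system')
dom = conn.lookupByUUIDString(VM_UUID)

# Poll until the guest CPUs are stopped because of an I/O error (EIO),
# which is the state the VM enters once the storage domain is disconnected.
while True:
    state, reason = dom.state()
    if state == libvirt.VIR_DOMAIN_PAUSED and reason == libvirt.VIR_DOMAIN_PAUSED_IOERROR:
        print('VM paused on I/O error; storage can now be reconnected')
        break
    time.sleep(5)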


VM 'ae51780f-c1bf-48c3-b672-7018612c00f6' moved from 'Up' --> 'Paused'

(vmId='ae51780f-c1bf-48c3-b672-7018612c00f6') abnormal vm stop device ua-e3a12af7-bc3b-4c8a-9d74-424dba4933d7 error eio
(vmId='ae51780f-c1bf-48c3-b672-7018612c00f6') CPU stopped: onIOError


Actual results:
The VMs are expected to resume automatically, but the resume does not succeed:

VM 'ae51780f-c1bf-48c3-b672-7018612c00f6'(test261218_1) moved from 'Paused' --> 'Up'
EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM test261218_1 has recovered from paused back to up.

The VM then enters the NotResponding state:

VM 'ae51780f-c1bf-48c3-b672-7018612c00f6'(test261218_1) moved from 'Up' --> 'NotResponding'

In vdsm.log:

(vmId='ae51780f-c1bf-48c3-b672-7018612c00f6') monitor became unresponsive (command timeout, age=988.350000001)

could not run <function <lambda> at 0x7f69900d1a28> on ['ae51780f-c1bf-48c3-b672-7018612c00f6']

Worker blocked: <Worker name=jsonrpc/1 running <Task <JsonRpcTask {'params': {u'vmID': u'ae51780f-c1bf-48c3-b672-7018612c00f6'}, 'jsonrpc': '2.0', 'method': u'VM.cont', 'id': u'4f33fa5a-7ea4-4f6a-88ac-e1811432ed14'} at 0x7f6970027950> timeout=60, duration=900 at 0x7f6970027b50> task#=4 at 0x7f695876c9d0>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
  result = fn(*methodArgs)
File: "<string>", line 2, in cont
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
  ret = func(*args, **kwargs)
File: "<string>", line 2, in cont
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 167, in cont
  return self.vm.cont()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1530, in cont
  self._underlyingCont()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4272, in _underlyingCont
  self._dom.resume()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
  ret = attr(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
  ret = f(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
  return func(inst, *args, **kwargs)
File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 1976, in resume
  ret = libvirtmod.virDomainResume(self._o)
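For illustration only (this is not vdsm code), a minimal Python sketch of the blocking call at the bottom of the traceback, assuming libvirt-python is installed and using the VM UUID from the log above. dom.resume() maps to virDomainResume(); while libvirtd holds the domain's state change lock for a stuck resume, this call blocks, and further calls against the same domain fail with "cannot acquire state change lock", matching the error in the bug title:

import threading
import libvirt

VM_UUID = 'ae51780f-c1bf-48c3-b672-7018612c00f6'  # UUID taken from the log above

def resume_vm():
    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByUUIDString(VM_UUID)
    # Rough equivalent of vdsm's VM.cont -> _underlyingCont -> self._dom.resume();
    # blocks inside virDomainResume while libvirtd waits on QEMU.
    dom.resume()

worker = threading.Thread(target=resume_vm)
worker.start()
# Mirror vdsm's executor watchdog: report the worker as blocked after 60 seconds.
worker.join(timeout=60)
if worker.is_alive():
    print('Worker blocked: VM.cont did not return within 60 seconds')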

Comment 1 Ryan Barry 2019-01-07 14:56:34 UTC
Why was this cloned?

Comment 2 NIkita Katrich 2019-01-08 10:19:42 UTC
Because the original report was closed, and the problem still exists in our installation.

Comment 3 Ryan Barry 2019-01-08 11:30:57 UTC
Ok -- I closed the original because it was opened internally with no known external occurrences, there wasn't sufficient information to reproduce it or start on a patch, and our QE had not seen it in months.

I'll re-open that one and close this one.

Can you add any logs you have or steps to reproduce to the original bug?

Comment 4 Ryan Barry 2019-01-08 11:31:36 UTC

*** This bug has been marked as a duplicate of bug 1635581 ***

