Bug 1635581 - Failure to resume VMs from paused: TimeoutError: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainResume)
Summary: Failure to resume VMs from paused: TimeoutError: Timed out during operation: ...
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.20.31
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high (1 vote)
Target Milestone: ovirt-4.4.0
Target Release: ---
Assignee: Dan Kenigsberg
QA Contact: Avihai
URL:
Whiteboard:
Duplicates: 1664032 (view as bug list)
Depends On:
Blocks:
 
Reported: 2018-10-03 10:27 UTC by Elad
Modified: 2019-10-29 08:20 UTC
CC List: 5 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2019-10-07 14:40:26 UTC
oVirt Team: Virt
Embargoed:
rule-engine: ovirt-4.3+


Attachments
vdsm log (150.48 KB, application/gzip)
2019-10-29 07:56 UTC, NIkita Katrich

Description Elad 2018-10-03 10:27:39 UTC
Description of problem:
After a storage issue caused VMs to enter the Paused state, and after the affected storage domain was re-activated, resuming these VMs fails in vdsm.
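
For context, the "cannot acquire state change lock (held by remoteDispatchDomainResume)" error in the summary comes from libvirt's per-domain job lock: an earlier resume RPC is still holding the lock, typically because QEMU is wedged on the dead storage, so every later state-changing call times out. A minimal libvirt-python sketch of how this surfaces to a caller such as vdsm (illustrative only, not vdsm's actual code; the UUID is the VM from this report):

import libvirt

VM_UUID = '2ea5abc3-0922-47b6-baa3-699dce5040cf'

conn = libvirt.open('qemu:///system')
dom = conn.lookupByUUIDString(VM_UUID)

try:
    # virDomainResume() needs the per-domain job lock. If an earlier
    # resume RPC (remoteDispatchDomainResume) still holds it, libvirt
    # gives up after its job-wait timeout and raises the error quoted
    # in the summary of this bug.
    dom.resume()
except libvirt.libvirtError as e:
    # "Timed out during operation: cannot acquire state change lock
    # (held by remoteDispatchDomainResume)"
    print('resume failed: %s' % e)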


Version-Release number of selected component (if applicable):

[root@puma55 libvirt]# rpm -qa |grep vdsm
vdsm-hook-openstacknet-4.20.39-1.el7ev.noarch
vdsm-python-4.20.39-1.el7ev.noarch
vdsm-4.20.39-1.el7ev.x86_64
vdsm-hook-ethtool-options-4.20.39-1.el7ev.noarch
vdsm-hook-vhostmd-4.20.39-1.el7ev.noarch
vdsm-jsonrpc-4.20.39-1.el7ev.noarch
vdsm-hook-fcoe-4.20.39-1.el7ev.noarch
vdsm-yajsonrpc-4.20.39-1.el7ev.noarch
vdsm-network-4.20.39-1.el7ev.x86_64
vdsm-hook-vmfex-dev-4.20.39-1.el7ev.noarch
vdsm-api-4.20.39-1.el7ev.noarch
vdsm-common-4.20.39-1.el7ev.noarch
vdsm-client-4.20.39-1.el7ev.noarch
vdsm-http-4.20.39-1.el7ev.noarch
[root@puma55 libvirt]# rpm -qa |grep ovirt
ovirt-host-dependencies-4.2.3-1.el7ev.x86_64
ovirt-host-deploy-1.7.4-1.el7ev.noarch
cockpit-machines-ovirt-172-2.el7.noarch
ovirt-hosted-engine-ha-2.2.16-1.el7ev.noarch
ovirt-provider-ovn-driver-1.2.14-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.26-1.el7ev.noarch
ovirt-vmconsole-host-1.0.5-4.el7ev.noarch
ovirt-imageio-daemon-1.4.4-0.el7ev.noarch
ovirt-setup-lib-1.1.5-1.el7ev.noarch
ovirt-host-4.2.3-1.el7ev.x86_64
python-ovirt-engine-sdk4-4.2.8-1.el7ev.x86_64
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
cockpit-ovirt-dashboard-0.11.33-1.el7ev.noarch
ovirt-imageio-common-1.4.4-0.el7ev.x86_64
ovirt-vmconsole-1.0.5-4.el7ev.noarch
[root@puma55 libvirt]# rpm -qa |grep libvirt
libvirt-daemon-driver-storage-disk-3.9.0-14.el7_5.7.x86_64
libvirt-client-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-storage-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-lxc-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-interface-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-storage-scsi-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-kvm-3.9.0-14.el7_5.7.x86_64
libvirt-3.9.0-14.el7_5.7.x86_64
libvirt-python-3.9.0-1.el7.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-storage-rbd-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-config-nwfilter-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-nodedev-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-storage-core-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-storage-mpath-3.9.0-14.el7_5.7.x86_64
libvirt-libs-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-nwfilter-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-secret-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-storage-logical-3.9.0-14.el7_5.7.x86_64
libvirt-lock-sanlock-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-storage-iscsi-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-network-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-config-network-3.9.0-14.el7_5.7.x86_64
libvirt-daemon-driver-qemu-3.9.0-14.el7_5.7.x86_64
[root@puma55 libvirt]# rpm -qa |grep qemu
qemu-kvm-common-rhev-2.10.0-21.el7_5.7.x86_64
qemu-img-rhev-2.10.0-21.el7_5.7.x86_64
qemu-kvm-tools-rhev-2.10.0-21.el7_5.7.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
qemu-kvm-rhev-2.10.0-21.el7_5.7.x86_64
libvirt-daemon-driver-qemu-3.9.0-14.el7_5.7.x86_64
[root@puma55 libvirt]# rpm -qa |grep sanlock
sanlock-lib-3.6.0-1.el7.x86_64
sanlock-3.6.0-1.el7.x86_64
libvirt-lock-sanlock-3.9.0-14.el7_5.7.x86_64
sanlock-python-3.6.0-1.el7.x86_64
[root@puma55 libvirt]# rpm -qa |grep selinux
libselinux-ruby-2.5-12.el7.x86_64
libselinux-2.5-12.el7.x86_64
libselinux-2.5-12.el7.i686
selinux-policy-3.13.1-192.el7_5.6.noarch
libselinux-utils-2.5-12.el7.x86_64
libselinux-python-2.5-12.el7.x86_64
selinux-policy-targeted-3.13.1-192.el7_5.6.noarch
openvswitch-selinux-extra-policy-1.0-7.el7fdp.noarch
[root@puma55 libvirt]# uname -a
Linux puma55.scl.lab.tlv.redhat.com 3.10.0-862.11.6.el7.x86_64 #1 SMP Fri Aug 10 16:55:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@puma55 libvirt]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.5 (Maipo)


How reproducible:
In this specific situation, no VM can be resumed.


Steps to Reproduce:
1. While VMs with disks residing on storage domain A are running, disconnect domain A from the hosts.
2. Wait for the VMs to enter the Paused state, then re-connect the hosts to the storage.
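
As a side note on step 2: the pause is reported by libvirt as a lifecycle event with an I/O-error detail, and vdsm reacts to that same event. A standalone libvirt-python sketch (illustrative only, not vdsm's code) that detects the pause this scenario produces:

import libvirt

def lifecycle_cb(conn, dom, event, detail, _opaque):
    # A VM suspending with detail SUSPENDED_IOERROR is the Paused
    # state described above; vdsm consumes the same event.
    if (event == libvirt.VIR_DOMAIN_EVENT_SUSPENDED and
            detail == libvirt.VIR_DOMAIN_EVENT_SUSPENDED_IOERROR):
        print('VM %s paused on I/O error' % dom.name())

libvirt.virEventRegisterDefaultImpl()
conn = libvirt.open('qemu:///system')
conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE,
                            lifecycle_cb, None)
while True:
    libvirt.virEventRunDefaultImpl()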


VM 2ea5abc3-0922-47b6-baa3-699dce5040cf moves to Paused:


2018-10-02 16:12:03,089+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-12) [] VM '2ea5abc3-0922-47b6-baa3-699dce5040cf'(cfme-5.9-05) moved from 'Unknown' --> 'Paused'




Actual results:
The VMs should be resumed automatically, but the resume does not succeed:


2018-10-02 16:12:09,321+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-2) [] VM '2ea5abc3-0922-47b6-baa3-699dce5040cf'(cfme-5.9-05) moved from 'Paused' --> 'Up'
2018-10-02 16:12:09,337+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-2) [] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM cfme-5.9-05 has recovered from paused back to up.

The VM then enters NotResponding:

2018-10-02 16:12:18,360+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-63) [] VM '2ea5abc3-0922-47b6-baa3-699dce5040cf'(cfme-5.9-05) moved from 'Up' --> 'NotResponding'




In vdsm.log:



2018-10-02 16:16:34,060+0300 WARN  (jsonrpc/4) [virt.vm] (vmId='2ea5abc3-0922-47b6-baa3-699dce5040cf') monitor became unresponsive (command timeout, age=264.27) (vm:6215)
2018-10-02 16:16:34,061+0300 INFO  (jsonrpc/4) [api.host] FINISH getAllVmStats return={'status': {'message': 'Done', 'code': 0}, 'statsList': (suppressed)} from=::ffff:10.35.68.3,55246 (api:52)
2018-10-02 16:16:34,062+0300 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.01 seconds (__init__:573)
2018-10-02 16:16:39,775+0300 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=periodic/1 running <Task <Operation action=<function <lambda> at 0x7ff974571de8> at 0x7ff974424ad0> timeout=30.0, duration=270 at 0x7ff97436c590> task#=80 at 0x7ff974581150>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 220, in __call__
  self._func()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 101, in <lambda>
  lambda: recovery.lookup_external_vms(cif),
File: "/usr/lib/python2.7/site-packages/vdsm/virt/recovery.py", line 249, in lookup_external_vms
  for vm_id in cif.pop_unknown_vm_ids():
File: "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 171, in pop_unknown_vm_ids
  with self.vmContainerLock:
File: "/usr/lib/python2.7/site-packages/pthreading.py", line 60, in __enter__
  self.acquire()
File: "/usr/lib/python2.7/site-packages/pthreading.py", line 68, in acquire
  rc = self.lock() if blocking else self.trylock()
File: "/usr/lib/python2.7/site-packages/pthread.py", line 96, in lock
  return _libpthread.pthread_mutex_lock(self._mutex) (executor:363)
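
What the traceback shows: the periodic lookup_external_vms task is stuck acquiring clientIF.vmContainerLock, which suggests some other vdsm thread is holding that lock while itself blocked inside a libvirt call that never returns, so every task needing the container lock piles up behind it. A self-contained toy of that pattern, using plain threading instead of vdsm's pthreading wrapper (names and the 30-second stand-in delay are illustrative):

import threading
import time

vm_container_lock = threading.Lock()  # stands in for clientIF.vmContainerLock

def stuck_libvirt_call():
    # Stand-in for a libvirt RPC that does not return because QEMU is
    # wedged on the dead storage; 30 seconds here instead of "forever".
    time.sleep(30)

def resume_thread():
    # The bug pattern: holding the container lock across a blocking call.
    with vm_container_lock:
        stuck_libvirt_call()

def periodic_thread():
    # Mirrors recovery.lookup_external_vms(): it blocks on the same lock,
    # and vdsm's executor watchdog reports it as "Worker blocked".
    with vm_container_lock:
        print('acquired only after the resume thread finally returns')

threading.Thread(target=resume_thread).start()
time.sleep(0.1)  # let resume_thread take the lock first
threading.Thread(target=periodic_thread).start()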


Expected results:
The VM should resume successfully from Paused after the EIO condition clears.


Additional info:

Comment 2 Michal Skrivanek 2018-10-04 05:17:36 UTC
Please specify exactly when the storage was blocked and when exactly it was reconnected. 

Also, why are you testing on RHEL 7.5?

Comment 3 Elad 2018-10-04 05:42:31 UTC
(In reply to Michal Skrivanek from comment #2)
> Please specify exactly when the storage was blocked and when exactly it was
> reconnected. 
> 
Storage got disconnected here:

host1 logs:

2018-10-02 16:04:02,349+0300 WARN  (mpathlistener) [storage.mpathhealth] Path u'sdy' failed for multipath device u'20024f4005854005e', no valid paths left (mpathhealth:143)
2018-10-02 16:04:03,315+0300 ERROR (check/loop) [storage.Monitor] Error checking path /dev/24e7091b-e312-404b-93ce-3e49b4113914/metadata (monitor:498)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 496, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 391, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure: ('/dev/24e7091b-e312-404b-93ce-3e49b4113914/metadata', 1, 'Read timeout')
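
The traceback above is the domain monitor's periodic path check giving up on a timed read of the storage domain's metadata volume. A simplified sketch of the idea; vdsm actually performs the read through its ioprocess helper, so the dd-based version and the exception class below are only illustrative:

import subprocess
import time

class MiscFileReadException(Exception):
    """Stand-in for vdsm's storage.exception.MiscFileReadException."""

def check_path(path, timeout=10.0):
    start = time.time()
    # Read the first 4 KiB of the metadata volume; iflag=direct bypasses
    # the page cache so a dead LUN cannot be masked by cached data.
    proc = subprocess.Popen(
        ['dd', 'if=%s' % path, 'of=/dev/null',
         'bs=4096', 'count=1', 'iflag=direct'],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    while proc.poll() is None:
        if time.time() - start > timeout:
            proc.kill()
            raise MiscFileReadException(path, 1, 'Read timeout')
        time.sleep(0.1)
    if proc.returncode != 0:
        raise MiscFileReadException(path, proc.returncode,
                                    proc.stderr.read())
    return time.time() - start  # the delay the monitor records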



> Also, why are you testing on RHEL 7.5?
This is not a test environment but rather a production one we use within the QE department.

Comment 4 Michal Skrivanek 2018-10-04 07:24:54 UTC
Between 16:04 and 16:12 there was no connectivity between puma55 and the engine.

In the meantime vdsm was restarted by someone; apparently not by fencing, as that seems to be disabled for this host. Still, there's:
2018-10-02 16:04:50,679+0300 INFO  (MainThread) [vds] Received signal 15, shutting down (vdsmd:68)

Some time in between, libvirt got into a locked state. Logs from libvirt are missing. Please add libvirt logs or a full sosreport of that host.

Comment 6 Ryan Barry 2018-10-26 12:43:41 UTC
I'm not able to get this from Google Drive, Elad. Can you host it elsewhere?

Comment 8 Ryan Barry 2018-11-19 12:29:46 UTC
The libvirt logs included here are all empty, unfortunately, and they start only after the event. Is this reproducible? I tried disconnecting storage last week and did not encounter it.

But in the vdsm log there is a huge number of errors about a full executor queue, which looks like it may be a different bug (shouldn't vdsm drop a task from the queue if there's an exception? It seems like yes, but I've never checked).
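
On the full executor queue: as far as I can tell, vdsm's executor does not abort a blocked task; it logs the "Worker blocked" traceback, and once enough workers are wedged the bounded task queue fills up and further periodic tasks are rejected, which would explain the flood of errors. A toy sketch of that bounded-queue behaviour (greatly simplified relative to vdsm's executor.py, which also replaces blocked workers; this sketch omits that):

import threading

MAX_TASKS = 5
WORKERS = 2

class TaskQueueFull(Exception):
    """Stand-in for the 'executor queue full' errors in vdsm.log."""

class ToyExecutor(object):
    def __init__(self):
        self._cond = threading.Condition()
        self._tasks = []
        for _ in range(WORKERS):
            worker = threading.Thread(target=self._run)
            worker.daemon = True
            worker.start()

    def dispatch(self, func):
        with self._cond:
            # Bounded queue: when blocked workers stop draining it,
            # every further task submission is rejected.
            if len(self._tasks) >= MAX_TASKS:
                raise TaskQueueFull()
            self._tasks.append(func)
            self._cond.notify()

    def _run(self):
        while True:
            with self._cond:
                while not self._tasks:
                    self._cond.wait()
                func = self._tasks.pop(0)
            # A func that never returns permanently consumes this worker;
            # with all workers consumed, dispatch() soon raises
            # TaskQueueFull for every new task.
            func()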

Comment 9 Elad 2018-12-03 16:02:30 UTC
I haven't seen the issue since its original occurrence.

Comment 10 Ryan Barry 2019-01-08 11:31:36 UTC
*** Bug 1664032 has been marked as a duplicate of this bug. ***

Comment 11 Ryan Barry 2019-01-08 11:32:30 UTC
Re-opened due to an upstream report. Can you please add any logs or steps to reproduce?

Comment 12 NIkita Katrich 2019-01-08 14:12:40 UTC
The description of the problem is the same:
A connectivity problem between the oVirt host and the iSCSI storage caused VMs to enter the Paused state. After the storage domain recovered, the VMs remained locked in the Paused state.


Steps to Reproduce:
1. Start VMs whose disks are on the iSCSI storage on the host, then disconnect the storage from the host (interrupt the network connection between the host and the storage).
2. After the VMs turn Paused, re-connect the host to the storage.


Actual results:
The VMs should be resumed automatically, but that doesn't succeed. A VM either enters NotResponding or remains in the Paused state; the behaviour differs from one occurrence to the next.

Expected results:
The VM should resume successfully from Paused after the EIO condition clears.

What kind of additional logs should be collected?

Comment 13 NIkita Katrich 2019-02-07 12:49:36 UTC
Any updates?

Comment 14 Ryan Barry 2019-02-11 23:51:11 UTC
vdsm.log, hopefully

Comment 15 NIkita Katrich 2019-10-29 07:56:56 UTC
Created attachment 1630039 [details]
vdsm log

The vdsm.log is attached. The problem still persists.

Comment 16 Ryan Barry 2019-10-29 08:20:26 UTC
Please attach a complete sosreport

