Created attachment 1419184 [details]
VDSM and Engine logs

Description of problem:

Got an exception while trying to remove a VM using an automation test.
* Similar issue to https://bugzilla.redhat.com/show_bug.cgi?id=1079737

From the engine log:

2018-04-05 06:12:01,793+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-63) [] Failed in 'SpmStatusVDS' method
2018-04-05 06:12:01,808+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-63) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_1 command SpmStatusVDS failed: (-202, 'Sanlock resource read failure', 'IO timeout')
2018-04-05 06:12:01,809+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-63) [] Command 'SpmStatusVDSCommand(HostName = host_mixed_1, SpmStatusVDSCommandParameters:{hostId='af0ef1b8-a8eb-4021-9fe0-ea3eaa03489c', storagePoolId='07364ef0-8154-48e6-819b-5f756577cf3a'})' execution failed: VDSGenericException: VDSErrorException: Failed to SpmStatusVDS, error = (-202, 'Sanlock resource read failure', 'IO timeout'), code = 100

From the VDSM log:

2018-04-05 06:12:01,578+0300 INFO  (jsonrpc/2) [vdsm.api] FINISH getSpmStatus error=(-202, 'Sanlock resource read failure', 'IO timeout') from=::ffff:10.46.16.85,40574, task_id=a03d11df-b36b-4e5d-b442-8a8f784729ce (api:50)
2018-04-05 06:12:01,579+0300 ERROR (jsonrpc/2) [storage.TaskManager.Task] (Task='a03d11df-b36b-4e5d-b442-8a8f784729ce') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in getSpmStatus
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 636, in getSpmStatus
    status = self._getSpmStatusInfo(pool)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 630, in _getSpmStatusInfo
    (pool.spmRole,) + pool.getSpmStatus()))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 141, in getSpmStatus
    return self._backend.getSpmStatus()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/spbackends.py", line 432, in getSpmStatus
    lVer, spmId = self.masterDomain.inquireClusterLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 855, in inquireClusterLock
    return self._manifest.inquireDomainLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 526, in inquireDomainLock
    return self._domainLock.inquire(self.getDomainLease())
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 418, in inquire
    resource = sanlock.read_resource(lease.path, lease.offset)
SanlockException: (-202, 'Sanlock resource read failure', 'IO timeout')
2018-04-05 06:12:01,780+0300 INFO  (jsonrpc/2) [storage.TaskManager.Task] (Task='a03d11df-b36b-4e5d-b442-8a8f784729ce') aborting: Task is aborted: u"(-202, 'Sanlock resource read failure', 'IO timeout')" - code 100 (task:1181)
2018-04-05 06:12:01,781+0300 ERROR (jsonrpc/2) [storage.Dispatcher] FINISH getSpmStatus error=(-202, 'Sanlock resource read failure', 'IO timeout') (dispatcher:86)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 73, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper
    return m(self, *a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1189, in prepare
    raise self.error
SanlockException: (-202, 'Sanlock resource read failure', 'IO timeout')
2018-04-05 06:12:01,782+0300 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call StoragePool.getSpmStatus failed (error 100) in 10.20 seconds (__init__:573)

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.2.6-0.1.el7.noarch
vdsm-4.20.23-1.el7ev.x86_64
sanlock-3.6.0-1.el7.x86_64
qemu-img-rhev-2.10.0-21.el7_5.1.x86_64

How reproducible:
?

Steps to Reproduce:
Unknown

Actual results:
Failed to remove the VM.

Expected results:

Additional info:
This means sanlock was not able to access storage. Most likely your storage was not accessible at the time, either because you blocked access to storage, or because of some issue in your environment.

I don't see any bug in the system. Please reopen when you have data suggesting that this is a bug in the system, and when you can reproduce it.
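To illustrate what "not able to access storage" means here: sanlock reads the lease area of the storage domain, and error -202 indicates that such a read did not complete within the I/O timeout. Below is a crude, illustrative sketch (not vdsm or sanlock code; the path and offset are hypothetical) of timing a read of a lease area. Real sanlock uses O_DIRECT asynchronous I/O and aborts the read while it is still in flight, which a plain blocking read cannot do:

```python
import os
import time

SECTOR_SIZE = 512

def lease_readable(path, offset=0, timeout=10.0):
    """Return True if the sector at `offset` in `path` could be read
    within `timeout` seconds.

    Illustrative only: this times a plain blocking read after the fact,
    while sanlock enforces the deadline during the I/O itself.
    """
    start = time.monotonic()
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        os.read(fd, SECTOR_SIZE)
    finally:
        os.close(fd)
    return time.monotonic() - start < timeout
```

On healthy storage this returns in milliseconds; when the storage backend hangs, the read blocks far past the deadline, which is the situation sanlock reports as an IO timeout.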
Hi Nir,

Can you consider adding a more relevant message to the logs instead of the traceback? We would avoid bugs like this if the right ERROR were displayed stating that storage is inaccessible, so the admin can check it.

Can you consider reopening to clarify the ERROR?
(In reply to Lukas Svaty from comment #2)

> Can you consider adding a more relevant message to the logs instead of
> the traceback?

I don't think so. The traceback is required to understand the sanlock failure and we cannot hide it. We would not be able to debug the system if we started hiding tracebacks in the logs.

> We would avoid bugs like this if the right ERROR were displayed stating
> that storage is inaccessible, so the admin can check it.

Storage QE should be familiar with vdsm errors and know what should be checked when they happen. Maybe we need better documentation about common errors.

> Can you consider reopening to clarify the ERROR?

The error is very clear:

    SanlockException: (-202, 'Sanlock resource read failure', 'IO timeout')

I agree that we can add possible recovery info for some errors. For example, for error -202 it is likely that there is an issue with storage or the network, so we can recommend checking storage and network health.

But we learned a few weeks ago that this can also happen when importing images from Glance (bug 1832967), because of incorrect import code. So recommending to check storage and network can be misleading in such a case; it is pretty hard to provide a good message based on a single error.

Let's open an RFE for adding recovery info for storage errors. We can work on this for 4.5.
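A minimal sketch of the kind of recovery-info mapping such an RFE might add. The -202 value is the real sanlock AIO-timeout code seen in the logs above, but the hint strings, function name, and dictionary are hypothetical, not actual vdsm code:

```python
# Hypothetical sketch: map known sanlock error codes to admin-facing
# recovery hints. Only -202 (the AIO timeout from this bug) is shown;
# the wording is illustrative and must stay honest about ambiguity,
# since the same code can have several root causes.
RECOVERY_HINTS = {
    -202: (
        "Sanlock I/O timed out while reading the lease. This usually "
        "means storage or network trouble on the host, but can also be "
        "caused by other processes overloading the storage. Check "
        "storage and network health first."
    ),
}

def recovery_hint(code):
    """Return an admin-facing recovery hint for a sanlock error code."""
    return RECOVERY_HINTS.get(
        code, "No recovery hint available for code %d" % code)
```

Logging such a hint alongside (not instead of) the traceback would address comment #2 without losing the debugging information.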