1565083 – VDSM - "Sanlock resource read failure" exception during getSpmStatus

Bug 1565083 - VDSM - "Sanlock resource read failure" exception during getSpmStatus

Summary: VDSM - "Sanlock resource read failure" exception during getSpmStatus

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	ovirt-engine
Classification:	oVirt
Component:	BLL.Storage
Sub Component:
Version:	4.2.1.2
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	ovirt-4.2.4
Target Release:	---
Assignee:	Nir Soffer
QA Contact:	Elad
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2018-04-09 10:57 UTC by Yosi Ben Shimon
Modified:	2020-05-20 21:55 UTC (History)
CC List:	4 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2018-05-28 16:57:35 UTC
oVirt Team:	Storage
Embargoed:
Dependent Products:
Flags:	rule-engine: ovirt-4.2+ ylavi: exception+

Attachments	(Terms of Use)
VDSM and Engine logs (6.07 MB, application/x-gzip) 2018-04-09 10:57 UTC, Yosi Ben Shimon	no flags	Details
View All

Description Yosi Ben Shimon 2018-04-09 10:57:09 UTC

Created attachment 1419184 [details]
VDSM and Engine logs

Description of problem:
Got exception while trying to remove VM using automation test.
* Similar issue as https://bugzilla.redhat.com/show_bug.cgi?id=1079737

From engine log:
2018-04-05 06:12:01,793+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-63) [] Failed in 'SpmStatusVDS' method
2018-04-05 06:12:01,808+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-63) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_1 command SpmStatusVDS failed: (-202, 'Sanlock resource read failure', 'IO timeout')
2018-04-05 06:12:01,809+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-63) [] Command 'SpmStatusVDSCommand(HostName = host_mixed_1, SpmStatusVDSCommandParameters:{hostId='af0ef1b8-a8eb-4021-9fe0-ea3eaa03489c', storagePoolId='07364ef0-8154-48e6-819b-5f756577cf3a'})' execution failed: VDSGenericException: VDSErrorException: Failed to SpmStatusVDS, error = (-202, 'Sanlock resource read failure', 'IO timeout'), code = 100


From VDSM log:
2018-04-05 06:12:01,578+0300 INFO  (jsonrpc/2) [vdsm.api] FINISH getSpmStatus error=(-202, 'Sanlock resource read failure', 'IO timeout') from=::ffff:10.46.16.85,40574, task_id=a03d11df-b36b-4e5d-b442-8a8f784729ce (api:50)
2018-04-05 06:12:01,579+0300 ERROR (jsonrpc/2) [storage.TaskManager.Task] (Task='a03d11df-b36b-4e5d-b442-8a8f784729ce') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in getSpmStatus
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 636, in getSpmStatus
    status = self._getSpmStatusInfo(pool)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 630, in _getSpmStatusInfo
    (pool.spmRole,) + pool.getSpmStatus()))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 141, in getSpmStatus
    return self._backend.getSpmStatus()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/spbackends.py", line 432, in getSpmStatus
    lVer, spmId = self.masterDomain.inquireClusterLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 855, in inquireClusterLock
    return self._manifest.inquireDomainLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 526, in inquireDomainLock
    return self._domainLock.inquire(self.getDomainLease())
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 418, in inquire
    resource = sanlock.read_resource(lease.path, lease.offset)
SanlockException: (-202, 'Sanlock resource read failure', 'IO timeout')
2018-04-05 06:12:01,780+0300 INFO  (jsonrpc/2) [storage.TaskManager.Task] (Task='a03d11df-b36b-4e5d-b442-8a8f784729ce') aborting: Task is aborted: u"(-202, 'Sanlock resource read failure', 'IO timeout')" - code 100 (task:1181)
2018-04-05 06:12:01,781+0300 ERROR (jsonrpc/2) [storage.Dispatcher] FINISH getSpmStatus error=(-202, 'Sanlock resource read failure', 'IO timeout') (dispatcher:86)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 73, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper
    return m(self, *a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1189, in prepare
    raise self.error
SanlockException: (-202, 'Sanlock resource read failure', 'IO timeout')
2018-04-05 06:12:01,782+0300 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call StoragePool.getSpmStatus failed (error 100) in 10.20 seconds (__init__:573)

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.2.6-0.1.el7.noarch
vdsm-4.20.23-1.el7ev.x86_64
sanlock-3.6.0-1.el7.x86_64
qemu-img-rhev-2.10.0-21.el7_5.1.x86_64

How reproducible:
?

Steps to Reproduce:
Unknown

Actual results:
Failed to remove VM.

Expected results:


Additional info:

Comment 1 Nir Soffer 2018-05-28 16:57:35 UTC

This means sanlock was not able to access storage. Most likely your storage
was not accessible at this time either because you block access to storage,
or because of some issue in your environment.

I don't see any bug the system. Please reopen when you have data suggesting that
this is a bug in the system, and when you can reproduce this.

Comment 2 Lukas Svaty 2020-01-06 11:20:56 UTC

Hi Nir,

Can you consider adding more relevant message to logs instead of the traceback?
We would avoid bugs like this, if the right ERROR is displayed that storage is unaccesible, so Admin can check it.

Can you consider reopenning for clarifying the ERROR?

Comment 3 Nir Soffer 2020-05-20 21:55:48 UTC

(In reply to Lukas Svaty from comment #2)
> Can you consider adding more relevant message to logs instead of the
> traceback?

I don't think so. The traceback is required to understand the sanlock
failure and we cannot hide it. We would not be able to debug the 
system if we start to hide tracebacks in logs.

> We would avoid bugs like this, if the right ERROR is displayed that
> storage is unaccesible, so Admin can check it.

Storage QE should be familiar with vdsm errors and know what should
be checked when they happen. Maybe we need better documentation 
about common errors.
 
> Can you consider reopenning for clarifying the ERROR?

The error is very clear:

    SanlockException: (-202, 'Sanlock resource read failure', 'IO timeout')

I agree that we can add possible recovery info for some errors.
For example for error -202, it is likely that there an issue on storage
or network, so we can recommend to check storage and network health.

But we learned few weeks ago that this can happen when importing images
from glance (bug 1832967), because of incorrect import code.

So recommending to check storage and network is can be misleading in
such case, it is pretty hard to provide good message based on single
error.

Let's open RFE for adding recovery info for storage errors. We can work
on this for 4.5.

Note You need to log in before you can comment on or make changes to this bug.