Created attachment 877860 [details]
engine and vdsm logs

Description of problem:
The engine reported that the DC moved to the 'Non-responsive' state and back to Up after a few seconds, with a 'general exception' error message. In vdsm.log I saw the following error:

Thread-14::ERROR::2014-03-23 16:34:53,700::task::866::TaskManager.Task::(_setError) Task=`7fb06425-3553-4276-9b07-03563ae41347`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 620, in getSpmStatus
    status = self._getSpmStatusInfo(pool)
  File "/usr/share/vdsm/storage/hsm.py", line 614, in _getSpmStatusInfo
    (pool.spmRole,) + pool.getSpmStatus()))
  File "/usr/share/vdsm/storage/sp.py", line 126, in getSpmStatus
    return self._backend.getSpmStatus()
  File "/usr/share/vdsm/storage/spbackends.py", line 223, in getSpmStatus
    lVer, spmId = self.masterDomain.inquireClusterLock()
  File "/usr/share/vdsm/storage/sd.py", line 494, in inquireClusterLock
    return self._clusterLock.inquire()
  File "/usr/share/vdsm/storage/clusterlock.py", line 266, in inquire
    self.getLockDisk())
SanlockException: (-202, 'Unable to read resource owners', 'Sanlock exception')
Thread-14::DEBUG::2014-03-23 16:34:53,735::task::885::TaskManager.Task::(_run) Task=`7fb06425-3553-4276-9b07-03563ae41347`::Task._run: 7fb06425-3553-4276-9b07-03563ae41347 ('f63ec017-b5e4-42c2-a0b6-9f81d291066b',) {} failed - stopping task
Thread-14::DEBUG::2014-03-23 16:34:53,736::task::1211::TaskManager.Task::(stop) Task=`7fb06425-3553-4276-9b07-03563ae41347`::stopping in state preparing (force False)
Thread-14::DEBUG::2014-03-23 16:34:53,736::task::990::TaskManager.Task::(_decref) Task=`7fb06425-3553-4276-9b07-03563ae41347`::ref 1 aborting True
Thread-14::INFO::2014-03-23 16:34:53,736::task::1168::TaskManager.Task::(prepare) Task=`7fb06425-3553-4276-9b07-03563ae41347`::aborting: Task is aborted: u"(-202, 'Unable to read resource owners', 'Sanlock exception')" - code 100
Thread-14::DEBUG::2014-03-23 16:34:53,736::task::1173::TaskManager.Task::(prepare) Task=`7fb06425-3553-4276-9b07-03563ae41347`::Prepare: aborted: (-202, 'Unable to read resource owners', 'Sanlock exception')
Thread-14::DEBUG::2014-03-23 16:34:53,736::task::990::TaskManager.Task::(_decref) Task=`7fb06425-3553-4276-9b07-03563ae41347`::ref 0 aborting True
Thread-14::DEBUG::2014-03-23 16:34:53,737::task::925::TaskManager.Task::(_doAbort) Task=`7fb06425-3553-4276-9b07-03563ae41347`::Task._doAbort: force False
Thread-14::DEBUG::2014-03-23 16:34:53,737::resourceManager::977::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-14::DEBUG::2014-03-23 16:34:53,737::task::595::TaskManager.Task::(_updateState) Task=`7fb06425-3553-4276-9b07-03563ae41347`::moving from state preparing -> state aborting
Thread-14::DEBUG::2014-03-23 16:34:53,737::task::550::TaskManager.Task::(__state_aborting) Task=`7fb06425-3553-4276-9b07-03563ae41347`::_aborting: recover policy none
Thread-14::DEBUG::2014-03-23 16:34:53,737::task::595::TaskManager.Task::(_updateState) Task=`7fb06425-3553-4276-9b07-03563ae41347`::moving from state aborting -> state failed
Thread-14::DEBUG::2014-03-23 16:34:53,738::resourceManager::940::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-14::DEBUG::2014-03-23 16:34:53,738::resourceManager::977::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-14::ERROR::2014-03-23 16:34:53,738::dispatcher::68::Storage.Dispatcher.Protect::(run) (-202, 'Unable to read resource owners', 'Sanlock exception')
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 60, in run
    result = ctask.prepare(self.func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 103, in wrapper
    return m(self, *a, **kw)
  File "/usr/share/vdsm/storage/task.py", line 1176, in prepare
    raise self.error
SanlockException: (-202, 'Unable to read resource owners', 'Sanlock exception')

Version-Release number of selected component (if applicable):
vdsm-4.14.5-0.1.beta2.el6ev.x86_64
sanlock-2.8-1.el6.x86_64

How reproducible:
?

Steps to Reproduce:
Unknown

Actual results:
The DC moved to the 'Non-responsive' state for a few seconds.

Engine.log:
2014-03-23 16:34:43,749 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-44) Failed in SpmStatusVDS method
2014-03-23 16:34:43,751 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-44) Command SpmStatusVDSCommand(HostName = green-vdsb, HostId = 64008a1d-5898-403a-8f1a-2b36679acac4, storagePoolId = f63ec017-b5e4-42c2-a0b6-9f81d291066b) execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to SpmStatusVDS, error = (-202, 'Unable to read resource owners', 'Sanlock exception'), code = 100
2014-03-23 16:34:43,753 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-44) [4a7bd813] Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: f63ec017-b5e4-42c2-a0b6-9f81d291066b Type: StoragePool
2014-03-23 16:34:43,764 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-44) [4a7bd813] Correlation ID: 4a7bd813, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center elad1. Setting Data Center status to Non Responsive (On host green-vdsb, Error: General Exception).

Additional info:
engine and vdsm logs
Fede, can you take a look please?
There's a sanlock error but no sanlock/messages logs attached.
As far as I can see we failed to read the spm owner ("Unable to read resource owners") and it can be for any reason (read failure/timeout).

I tend to believe that the read got a timeout because the host/network/storage were busy with multiple live migrations.

Can you attach the missing logs?
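For context, the failing call in the traceback is vdsm's cluster-lock inquire path. Below is a minimal sketch of that call, assuming the sanlock python bindings; the lease name, path, and offset are illustrative placeholders, not the real values from vdsm's clusterlock.py:

import sanlock

# Illustrative values only -- the actual resource name, lease path and
# offset come from vdsm's clusterlock.py and the storage domain metadata.
SDM_LEASE_NAME = 'SDM'
LEASES_PATH = '/path/to/domain/leases'  # hypothetical lease volume path
SDM_LEASE_OFFSET = 1048576              # hypothetical offset

def inquire_spm_owner(sd_uuid):
    # read_resource_owners performs a direct-I/O (AIO) read of the lease
    # area on shared storage. Both a plain read failure and a read timeout
    # surface as the same SanlockException, so only the errno carried by
    # the exception distinguishes the two cases.
    return sanlock.read_resource_owners(sd_uuid, SDM_LEASE_NAME,
                                        [(LEASES_PATH, SDM_LEASE_OFFSET)])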
(In reply to Federico Simoncelli from comment #2)
> There's a sanlock error but no sanlock/messages logs attached.
> As far as I can see we failed to read the spm owner ("Unable to read
> resource owners") and it can be for any reason (read failure/timeout).
>
> I tend to believe that the read got a timeout because the
> host/network/storage were busy with multiple live migrations.
>
> Can you attach the missing logs?

Sorry, the host was reprovisioned since then, so I don't have the logs anymore.

According to vdsm.log, at the time of the sanlock exception, latency to the storage domains was low, as reported by repoStats:

Thread-14::INFO::2014-03-23 16:35:04,622::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response:
{'fd31e320-73d7-4d97-8485-d849d85a938c': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000687215', 'lastCheck': '3.9', 'valid': True},
 '0029843a-893d-4e97-a407-b3d8648baae6': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000686904', 'lastCheck': '4.9', 'valid': True},
 '806b3b3c-fe1e-4a83-bae9-c909f229eaa9': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000335582', 'lastCheck': '5.0', 'valid': True},
 'e065bb0d-faf4-4ef1-a0c3-28e60229e566': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000388921', 'lastCheck': '0.4', 'valid': True}}
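For reference, a small sketch of how a response like the one above can be scanned for slow domains (assuming the dict shape shown in the log; the one-second threshold is an arbitrary example):

def slow_domains(repo_stats, threshold=1.0):
    # repo_stats maps sdUUID -> stats dict as in the response above; vdsm
    # reports 'delay' (the last read latency, in seconds) as a string.
    return [sd for sd, stats in repo_stats.items()
            if float(stats['delay']) > threshold]

All four domains in the response report delays well under a millisecond, which is why plain storage latency looks unlikely here.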
(In reply to Elad from comment #3)
> (In reply to Federico Simoncelli from comment #2)
> > There's a sanlock error but no sanlock/messages logs attached.
> > As far as I can see we failed to read the spm owner ("Unable to read
> > resource owners") and it can be for any reason (read failure/timeout).
> >
> > I tend to believe that the read got a timeout because the
> > host/network/storage were busy with multiple live migrations.
> >
> > Can you attach the missing logs?
>
> Sorry, the host was reprovisioned since then, so I don't have the logs anymore.
>
> According to vdsm.log, at the time of the sanlock exception, latency to the storage domains was low, as reported by repoStats:
>
> Thread-14::INFO::2014-03-23 16:35:04,622::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response:
> {'fd31e320-73d7-4d97-8485-d849d85a938c': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000687215', 'lastCheck': '3.9', 'valid': True},
>  '0029843a-893d-4e97-a407-b3d8648baae6': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000686904', 'lastCheck': '4.9', 'valid': True},
>  '806b3b3c-fe1e-4a83-bae9-c909f229eaa9': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000335582', 'lastCheck': '5.0', 'valid': True},
>  'e065bb0d-faf4-4ef1-a0c3-28e60229e566': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000388921', 'lastCheck': '0.4', 'valid': True}}

Error 202 and these logs indicate an AIO timeout:

Thread-14::INFO::2014-03-23 16:34:43,692::logUtils::44::dispatcher::(wrapper) Run and protect: getSpmStatus(spUUID='f63ec017-b5e4-42c2-a0b6-9f81d291066b', options=None)
...
Thread-14::ERROR::2014-03-23 16:34:53,700::task::866::TaskManager.Task::(_setError) Task=`7fb06425-3553-4276-9b07-03563ae41347`::Unexpected error
SanlockException: (-202, 'Unable to read resource owners', 'Sanlock exception')

Note the roughly ten seconds between the getSpmStatus request (16:34:43) and the failure (16:34:53), consistent with sanlock's default 10-second I/O timeout.

Without the sanlock and messages logs we don't know more. Anyway, David, can you take a look?
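For illustration only (vdsm does not currently do this): a sketch of how a caller could classify this return code as a transient timeout rather than a hard read failure. The -202 constant is taken from the logs above; I believe sanlock's sources name it SANLK_AIO_TIMEOUT, but verify against sanlock_rv.h for your build.

import time
import sanlock

# Assumed constant: -202 is the value seen in the logs above.
SANLK_AIO_TIMEOUT = -202

def inquire_with_retry(inquire, attempts=3, delay=5):
    # Retry an inquire-style call when sanlock reports an AIO timeout.
    # A timeout usually means the storage was momentarily too busy to
    # complete the direct-I/O read; any other sanlock error is re-raised
    # immediately.
    for attempt in range(attempts):
        try:
            return inquire()
        except sanlock.SanlockException as e:
            if e.errno != SANLK_AIO_TIMEOUT or attempt == attempts - 1:
                raise
            time.sleep(delay)

The point is only that -202 identifies a timed-out direct-I/O read, while other negative codes indicate reads that genuinely failed.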
read_resource_owners returns the standard I/O timeout error, -202.
Elad, can you please attach sanlock logs and /var/log/messages?
Closing for now since I didn't manage to reproduce this and I don't have the relevant sanlock.log.