Created attachment 877860 [details]
engine and vdsm logs

Description of problem:
The engine reported that the DC moved to the 'Non-responsive' state and back to Up after a few seconds, with a 'general exception' error message. In vdsm.log I saw the following error:

Thread-14::ERROR::2014-03-23 16:34:53,700::task::866::TaskManager.Task::(_setError) Task=`7fb06425-3553-4276-9b07-03563ae41347`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 620, in getSpmStatus
    status = self._getSpmStatusInfo(pool)
  File "/usr/share/vdsm/storage/hsm.py", line 614, in _getSpmStatusInfo
    (pool.spmRole,) + pool.getSpmStatus()))
  File "/usr/share/vdsm/storage/sp.py", line 126, in getSpmStatus
    return self._backend.getSpmStatus()
  File "/usr/share/vdsm/storage/spbackends.py", line 223, in getSpmStatus
    lVer, spmId = self.masterDomain.inquireClusterLock()
  File "/usr/share/vdsm/storage/sd.py", line 494, in inquireClusterLock
    return self._clusterLock.inquire()
  File "/usr/share/vdsm/storage/clusterlock.py", line 266, in inquire
    self.getLockDisk())
SanlockException: (-202, 'Unable to read resource owners', 'Sanlock exception')
Thread-14::DEBUG::2014-03-23 16:34:53,735::task::885::TaskManager.Task::(_run) Task=`7fb06425-3553-4276-9b07-03563ae41347`::Task._run: 7fb06425-3553-4276-9b07-03563ae41347 ('f63ec017-b5e4-42c2-a0b6-9f81d291066b',) {} failed - stopping task
Thread-14::DEBUG::2014-03-23 16:34:53,736::task::1211::TaskManager.Task::(stop) Task=`7fb06425-3553-4276-9b07-03563ae41347`::stopping in state preparing (force False)
Thread-14::DEBUG::2014-03-23 16:34:53,736::task::990::TaskManager.Task::(_decref) Task=`7fb06425-3553-4276-9b07-03563ae41347`::ref 1 aborting True
Thread-14::INFO::2014-03-23 16:34:53,736::task::1168::TaskManager.Task::(prepare) Task=`7fb06425-3553-4276-9b07-03563ae41347`::aborting: Task is aborted: u"(-202, 'Unable to read resource owners', 'Sanlock exception')" - code 100
Thread-14::DEBUG::2014-03-23 16:34:53,736::task::1173::TaskManager.Task::(prepare) Task=`7fb06425-3553-4276-9b07-03563ae41347`::Prepare: aborted: (-202, 'Unable to read resource owners', 'Sanlock exception')
Thread-14::DEBUG::2014-03-23 16:34:53,736::task::990::TaskManager.Task::(_decref) Task=`7fb06425-3553-4276-9b07-03563ae41347`::ref 0 aborting True
Thread-14::DEBUG::2014-03-23 16:34:53,737::task::925::TaskManager.Task::(_doAbort) Task=`7fb06425-3553-4276-9b07-03563ae41347`::Task._doAbort: force False
Thread-14::DEBUG::2014-03-23 16:34:53,737::resourceManager::977::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-14::DEBUG::2014-03-23 16:34:53,737::task::595::TaskManager.Task::(_updateState) Task=`7fb06425-3553-4276-9b07-03563ae41347`::moving from state preparing -> state aborting
Thread-14::DEBUG::2014-03-23 16:34:53,737::task::550::TaskManager.Task::(__state_aborting) Task=`7fb06425-3553-4276-9b07-03563ae41347`::_aborting: recover policy none
Thread-14::DEBUG::2014-03-23 16:34:53,737::task::595::TaskManager.Task::(_updateState) Task=`7fb06425-3553-4276-9b07-03563ae41347`::moving from state aborting -> state failed
Thread-14::DEBUG::2014-03-23 16:34:53,738::resourceManager::940::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-14::DEBUG::2014-03-23 16:34:53,738::resourceManager::977::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-14::ERROR::2014-03-23 16:34:53,738::dispatcher::68::Storage.Dispatcher.Protect::(run) (-202, 'Unable to read resource owners', 'Sanlock exception')
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 60, in run
    result = ctask.prepare(self.func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 103, in wrapper
    return m(self, *a, **kw)
  File "/usr/share/vdsm/storage/task.py", line 1176, in prepare
    raise self.error
SanlockException: (-202, 'Unable to read resource owners', 'Sanlock exception')

Version-Release number of selected component (if applicable):
vdsm-4.14.5-0.1.beta2.el6ev.x86_64
sanlock-2.8-1.el6.x86_64

How reproducible:
?

Steps to Reproduce:
Unknown

Actual results:
The DC moved to the 'Non-responsive' state for a few seconds.

Engine.log:
2014-03-23 16:34:43,749 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-44) Failed in SpmStatusVDS method
2014-03-23 16:34:43,751 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-44) Command SpmStatusVDSCommand(HostName = green-vdsb, HostId = 64008a1d-5898-403a-8f1a-2b36679acac4, storagePoolId = f63ec017-b5e4-42c2-a0b6-9f81d291066b) execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to SpmStatusVDS, error = (-202, 'Unable to read resource owners', 'Sanlock exception'), code = 100
2014-03-23 16:34:43,753 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-44) [4a7bd813] Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: f63ec017-b5e4-42c2-a0b6-9f81d291066b Type: StoragePool
2014-03-23 16:34:43,764 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-44) [4a7bd813] Correlation ID: 4a7bd813, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center elad1. Setting Data Center status to Non Responsive (On host green-vdsb, Error: General Exception).

Additional info:
engine and vdsm logs
Fede, can you take a look please?
There's a sanlock error but no sanlock/messages logs attached.
As far as I can see we failed to read the spm owner ("Unable to read resource owners") and it can be for any reason (read failure/timeout).

I tend to believe that the read got a timeout because the host/network/storage were busy with multiple live migrations.

Can you attach the missing logs?
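For context, the failing call in the traceback is vdsm's cluster-lock inquire path. Below is a minimal sketch of that call, assuming the sanlock python bindings; the lease name, path, and offset are illustrative placeholders, not the real values from vdsm's clusterlock.py:

import sanlock

# Illustrative values only -- the actual resource name, lease path and
# offset come from vdsm's clusterlock.py and the storage domain metadata.
SDM_LEASE_NAME = 'SDM'
LEASES_PATH = '/path/to/domain/leases'  # hypothetical lease volume path
SDM_LEASE_OFFSET = 1048576              # hypothetical offset

def inquire_spm_owner(sd_uuid):
    # read_resource_owners performs a direct-I/O (AIO) read of the lease
    # area on shared storage. Both a plain read failure and a read timeout
    # surface as the same SanlockException, so only the errno carried by
    # the exception distinguishes the two cases.
    return sanlock.read_resource_owners(sd_uuid, SDM_LEASE_NAME,
                                        [(LEASES_PATH, SDM_LEASE_OFFSET)])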
(In reply to Federico Simoncelli from comment #2)
> There's a sanlock error but no sanlock/messages logs attached.
> As far as I can see we failed to read the spm owner ("Unable to read
> resource owners") and it can be for any reason (read failure/timeout).
>
> I tend to believe that the read got a timeout because the
> host/network/storage were busy with multiple live migrations.
>
> Can you attach the missing logs?

Sorry, the host was reprovisioned since then, so I don't have the logs anymore.

According to vdsm.log, at the time of the sanlock exception, latency to the storage domains was low, as reported by repoStats:

Thread-14::INFO::2014-03-23 16:35:04,622::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response:
{'fd31e320-73d7-4d97-8485-d849d85a938c': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000687215', 'lastCheck': '3.9', 'valid': True},
 '0029843a-893d-4e97-a407-b3d8648baae6': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000686904', 'lastCheck': '4.9', 'valid': True},
 '806b3b3c-fe1e-4a83-bae9-c909f229eaa9': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000335582', 'lastCheck': '5.0', 'valid': True},
 'e065bb0d-faf4-4ef1-a0c3-28e60229e566': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000388921', 'lastCheck': '0.4', 'valid': True}}
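For reference, a small sketch of how a response like the one above can be scanned for slow domains (assuming the dict shape shown in the log; the one-second threshold is an arbitrary example):

def slow_domains(repo_stats, threshold=1.0):
    # repo_stats maps sdUUID -> stats dict as in the response above; vdsm
    # reports 'delay' (the last read latency, in seconds) as a string.
    return [sd for sd, stats in repo_stats.items()
            if float(stats['delay']) > threshold]

All four domains in the response report delays well under a millisecond, which is why plain storage latency looks unlikely here.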
(In reply to Elad from comment #3)
> (In reply to Federico Simoncelli from comment #2)
> > There's a sanlock error but no sanlock/messages logs attached.
> > As far as I can see we failed to read the spm owner ("Unable to read
> > resource owners") and it can be for any reason (read failure/timeout).
> >
> > I tend to believe that the read got a timeout because the
> > host/network/storage were busy with multiple live migrations.
> >
> > Can you attach the missing logs?
>
> Sorry, the host was reprovisioned since then, so I don't have the logs anymore.
>
> According to vdsm.log, at the time of the sanlock exception, latency to the storage domains was low, as reported by repoStats:
>
> Thread-14::INFO::2014-03-23 16:35:04,622::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response:
> {'fd31e320-73d7-4d97-8485-d849d85a938c': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000687215', 'lastCheck': '3.9', 'valid': True},
>  '0029843a-893d-4e97-a407-b3d8648baae6': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000686904', 'lastCheck': '4.9', 'valid': True},
>  '806b3b3c-fe1e-4a83-bae9-c909f229eaa9': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000335582', 'lastCheck': '5.0', 'valid': True},
>  'e065bb0d-faf4-4ef1-a0c3-28e60229e566': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000388921', 'lastCheck': '0.4', 'valid': True}}

Error 202 and these logs indicate an AIO timeout:

Thread-14::INFO::2014-03-23 16:34:43,692::logUtils::44::dispatcher::(wrapper) Run and protect: getSpmStatus(spUUID='f63ec017-b5e4-42c2-a0b6-9f81d291066b', options=None)
...
Thread-14::ERROR::2014-03-23 16:34:53,700::task::866::TaskManager.Task::(_setError) Task=`7fb06425-3553-4276-9b07-03563ae41347`::Unexpected error
SanlockException: (-202, 'Unable to read resource owners', 'Sanlock exception')

Note the roughly ten seconds between the getSpmStatus request (16:34:43) and the failure (16:34:53), consistent with sanlock's default 10-second I/O timeout.

Without the sanlock and messages logs we don't know more. Anyway, David, can you take a look?
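For illustration only (vdsm does not currently do this): a sketch of how a caller could classify this return code as a transient timeout rather than a hard read failure. The -202 constant is taken from the logs above; I believe sanlock's sources name it SANLK_AIO_TIMEOUT, but verify against sanlock_rv.h for your build.

import time
import sanlock

# Assumed constant: -202 is the value seen in the logs above.
SANLK_AIO_TIMEOUT = -202

def inquire_with_retry(inquire, attempts=3, delay=5):
    # Retry an inquire-style call when sanlock reports an AIO timeout.
    # A timeout usually means the storage was momentarily too busy to
    # complete the direct-I/O read; any other sanlock error is re-raised
    # immediately.
    for attempt in range(attempts):
        try:
            return inquire()
        except sanlock.SanlockException as e:
            if e.errno != SANLK_AIO_TIMEOUT or attempt == attempts - 1:
                raise
            time.sleep(delay)

The point is only that -202 identifies a timed-out direct-I/O read, while other negative codes indicate reads that genuinely failed.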
read_resource_owners returns the standard I/O timeout error, -202.
Elad, can you please attach sanlock logs and /var/log/messages?
Closing for now since I didn't manage to reproduce this and I don't have the relevant sanlock.log.