Created attachment 1327995 [details]
engine.log and vdsm.log

Description of problem:

Attach storage domain fails sometimes with the following:

2017-09-19 17:21:51,812+0300 DEBUG (event/28) [storage.Event.storage.DomainMonitor.onDomainStateChange] Event emitted (misc:451)
2017-09-19 17:21:51,828+0300 ERROR (tasks/0) [storage.TaskManager.Task] Unexpected error (task:649)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 628, in _loadMetaFile
    for line in getProcPool().readLines(filename):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 328, in readLines
    return ioproc.readlines(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 562, in readlines
    return self.readfile(path, direct).splitlines()
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 550, in readfile
    "direct": direct}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 459, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 2] No such file or directory
2017-09-19 17:21:51,829+0300 ERROR (tasks/0) [storage.TaskManager] taskManager: Skipping directory: c687c9a2-25d1-4b94-8ba0-2bc9b94943ae (taskManager:223)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/taskManager.py", line 215, in loadDumpedTasks
    t = Task.loadTask(store, taskID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1158, in loadTask
    t._load(store, ext)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 748, in _load
    self._loadRecoveryMetaFile(taskDir, rn)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 702, in _loadRecoveryMetaFile
    self._loadMetaFile(taskFile, self.recoveries[n], Recovery.fields)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 650, in _loadMetaFile
    raise se.TaskMetaDataLoadError(filename)
TaskMetaDataLoadError: Can't load Task Metadata:
('/rhev/data-center/c7ac57d3-224f-44af-89c6-84338469ec40/mastersd/master/tasks/c687c9a2-25d1-4b94-8ba0-2bc9b94943ae/c687c9a2-25d1-4b94-8ba0-2bc9b94943ae.recover.0',)

2017-09-19 17:21:19,103+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-11) [4ad0a4e5-a457-48a1-be0c-63b6226fbeda] EVENT_ID: IRS_BROKER_COMMAND_FAILURE(10,803), VDSM command AttachStorageDomainVDS failed: Message timeout which can be caused by communication issues
2017-09-19 17:21:19,104+03 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (default task-11) [4ad0a4e5-a457-48a1-be0c-63b6226fbeda] ERROR, AttachStorageDomainVDSCommand( AttachStorageDomainVDSCommandParameters:{storagePoolId='c7ac57d3-224f-44af-89c6-84338469ec40', ignoreFailoverLimit='false', storageDomainId='5c1ec113-33a8-4f4e-bae9-26bd849f31a3'}), exception: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues, log id: 4a99351c
2017-09-19 17:21:19,106+03 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (default task-11) [4ad0a4e5-a457-48a1-be0c-63b6226fbeda] Exception

Version-Release number of selected component (if applicable):
vdsm-4.20.3-47.git4801123.el7.centos.x86_64
ovirt-engine-4.2.0-0.0.master.20170915203601.git96df02d.el7.centos.noarch

How reproducible:
Happens sometimes, with iSCSI

Steps to Reproduce:
1. Create an iSCSI domain

Actual results:
Attach storage domain fails with the mentioned exception.

Expected results:

Additional info:
engine.log and vdsm.log
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
The storage domain mentioned in the errors is yellow-vdsb.qa.lab.tlv.redhat.com:/Storage_NFS/export_domain_storage_local_ge11

It looks like there was a problem in the file system of this storage domain:

2017-09-19 17:21:51,551+0300 DEBUG (tasks/0) [storage.StorageDomain] creating task dir: /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_export__domain__storage__local__ge11/25b9af4c-be25-485c-b96c-d610bb2834cf/master/tasks (sd:959)
2017-09-19 17:21:51,753+0300 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='4561f359-bb1f-422a-8484-6aad18c4e289') 4561f359-bb1f-422a-8484-6aad18c4e289: load from /rhev/data-center/c7ac57d3-224f-44af-89c6-84338469ec40/mastersd/master/tasks, ext '' (task:726)

It also looks like there was a VDSM restart at 17:17:25, but the SPM status is 'SPM' only at 17:21:52:

2017-09-19 17:21:47,839+0300 DEBUG (jsonrpc/1) [storage.TaskManager.Task] (Task='6e865337-3f77-450a-af65-ae404e9616f0') finished: {'spm_st': {'spmId': -1, 'spmStatus': 'Free', 'spmLver': -1}} (task:1197)
2017-09-19 17:21:52,066+0300 INFO (jsonrpc/7) [vdsm.api] FINISH getSpmStatus return={'spm_st': {'spmId': 3, 'spmStatus': 'SPM', 'spmLver': 40L}} from=::ffff:10.35.162.10,58696, flow_id=14a4544b, task_id=fef0e486-2d69-4007-8478-688a06486d69 (api:52)

VDSM tries to upgrade the iSCSI storage domain, and the engine operation fails with a timeout exception.

Elad, does it reproduce only when there is a VDSM restart?
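For illustration, here is a minimal sketch (a hypothetical simplification, not the actual vdsm code) of how this kind of ENOENT can appear when a task directory is listed and then read while it is being cleaned up concurrently:

```python
import os
import tempfile

# Hypothetical simplification: the loader snapshots the task directory,
# then reads each recovery file. If a file is removed between the
# listing and the read (e.g. by concurrent task cleanup), the read
# fails with ENOENT, like the OSError in the traceback above.
tasks_dir = tempfile.mkdtemp()
recover = os.path.join(tasks_dir, "task.recover.0")
open(recover, "w").close()

names = os.listdir(tasks_dir)   # snapshot of directory contents
os.remove(recover)              # file vanishes before it is read

skipped = []
for name in names:
    try:
        with open(os.path.join(tasks_dir, name)) as f:
            f.read()
    except FileNotFoundError as e:
        # e.errno == 2: "No such file or directory"
        skipped.append((name, e.errno))

print(skipped)  # [('task.recover.0', 2)]
```

Whether a race like this is the actual cause here is not established; the sketch only shows one way a listed file can be missing by the time it is read.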
Created attachment 1330931 [details]
vdsm.log - 26.9

No, here is an occurrence of the issue without a VDSM restart (vdsm.log attached):

2017-09-19 11:23:37,184+0300 ERROR (tasks/2) [storage.TaskManager] taskManager: Skipping directory: c687c9a2-25d1-4b94-8ba0-2bc9b94943ae (taskManager:223)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/taskManager.py", line 215, in loadDumpedTasks
    t = Task.loadTask(store, taskID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1158, in loadTask
    t._load(store, ext)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 748, in _load
    self._loadRecoveryMetaFile(taskDir, rn)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 702, in _loadRecoveryMetaFile
    self._loadMetaFile(taskFile, self.recoveries[n], Recovery.fields)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 650, in _loadMetaFile
    raise se.TaskMetaDataLoadError(filename)
TaskMetaDataLoadError: Can't load Task Metadata:
('/rhev/data-center/c7ac57d3-224f-44af-89c6-84338469ec40/mastersd/master/tasks/c687c9a2-25d1-4b94-8ba0-2bc9b94943ae/c687c9a2-25d1-4b94-8ba0-2bc9b94943ae.recover.0',)
Maor, this is marked as a blocker for oVirt GA. What is the status here?
(In reply to Yaniv Kaul from comment #4)
> Maor, this is marked as a blocker for oVirt GA.
> What is the status here?

The upgrade to V4 was introduced back in 4.0. This shouldn't be a blocker NOW.
(In reply to Yaniv Kaul from comment #4)
> Maor, this is marked as a blocker for oVirt GA.
> What is the status here?

Sorry about the delay in the status; I'm still working on that with Elad to find the root cause.
I can't seem to reproduce this on my env. Elad, does this still reproduce?
Not in the past month. This is a pretty common scenario.
(In reply to Elad from comment #8)
> Not in the past month. This is a pretty common scenario

Closing then. If this reoccurs, please reopen and share the logs.
Hi, I see a similar problem in an automation build result on an RHHI environment, during a scenario that tries to put one of the hosts into maintenance. The engine and vdsm logs are attached.

The test tries to put host1 into maintenance. The action fails, and at the same time I see the error in the second log. Please see the trace starting at the following line in hypervisor-rhsqa-grafton6-nic2_vdsm.log:

2018-05-31 18:56:17,889+0530 ERROR (tasks/7) [storage.TaskManager.Task] Unexpected error (task:652)
Traceback (most recent call last):
Created attachment 1450838 [details] engine & vdsm logs
It seems the operation to put the host into maintenance indeed failed, but I don't think the error you see is related to it, since the engine log indicates it fails for the following reason:

VDS_CANNOT_MAINTENANCE_GLUSTER_QUORUM_CANNOT_BE_MET

Later, it seems the admin force-selected the SPM on host grafton6:

2018-05-31 18:56:10,683+05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-56) [hosts_syncAction_fa57a56d-23f8-4ecb] EVENT_ID: USER_FORCE_SELECTED_SPM(159), Host rhsqa-grafton6.lab.eng.blr.redhat.com was force selected by admin@internal-authz
I see in the VDSM log that it tries to read the tasks before the host becomes the SPM.
Nir, maybe you have an idea whether this exception is related to the bug:

2018-05-31 18:56:17,760+0530 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') moving from state running -> state preparing (task:602)
2018-05-31 18:56:17,760+0530 INFO (jsonrpc/6) [vdsm.api] START spmStart(spUUID='e6c5d8a2-5386-11e8-8885-004655214801', prevID=-1, prevLVER='-1', maxHostID=250, domVersion='4', options=None) from=::ffff:10.70.44.74,50088, flow_id=60255aba, task_id=0de40b4b-9a07-4ed4-800d-4366123b35db (api:46)
....
2018-05-31 18:56:17,761+0530 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') _resourcesAcquired: 00_storage.e6c5d8a2-5386-11e8-8885-004655214801 (exclusive) (task:834)
2018-05-31 18:56:17,761+0530 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') ref 1 aborting False (task:1002)
2018-05-31 18:56:17,761+0530 DEBUG (jsonrpc/6) [storage.TaskManager] scheduled job spmStart for task 0de40b4b-9a07-4ed4-800d-4366123b35db (taskManager:79)
2018-05-31 18:56:17,761+0530 INFO (jsonrpc/6) [vdsm.api] FINISH spmStart return=None from=::ffff:10.70.44.74,50088, flow_id=60255aba, task_id=0de40b4b-9a07-4ed4-800d-4366123b35db (api:52)
2018-05-31 18:56:17,761+0530 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') Prepare: 1 jobs exist, move to acquiring (task:1193)
2018-05-31 18:56:17,761+0530 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') moving from state running -> state acquiring (task:602)
2018-05-31 18:56:17,761+0530 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') moving from state running -> state queued (task:602)
2018-05-31 18:56:17,761+0530 DEBUG (jsonrpc/6) [storage.TaskManager] queuing task: 0de40b4b-9a07-4ed4-800d-4366123b35db (taskManager:57)
2018-05-31 18:56:17,761+0530 DEBUG (jsonrpc/6) [storage.TaskManager] task queued: 0de40b4b-9a07-4ed4-800d-4366123b35db (taskManager:65)
2018-05-31 18:56:17,761+0530 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') returning (task:1198)
2018-05-31 18:56:17,761+0530 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') ref 0 aborting False (task:1002)
2018-05-31 18:56:17,762+0530 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call StoragePool.spmStart succeeded in 0.00 seconds (__init__:573)
2018-05-31 18:56:17,762+0530 DEBUG (tasks/7) [storage.ThreadPool] Number of running tasks: 1 (threadPool:60)
2018-05-31 18:56:17,762+0530 INFO (tasks/7) [storage.ThreadPool.WorkerThread] START task 0de40b4b-9a07-4ed4-800d-4366123b35db (cmd=<bound method Task.commit of <vdsm.storage.task.Task instance at 0x7fd9ebf4ea28>>, args=None) (threadPool:208)
2018-05-31 18:56:17,762+0530 DEBUG (tasks/7) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') committing task: 0de40b4b-9a07-4ed4-800d-4366123b35db (task:1210)
2018-05-31 18:56:17,762+0530 DEBUG (tasks/7) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') moving from state running -> state running (task:602)
2018-05-31 18:56:17,762+0530 DEBUG (tasks/7) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') Task.run: running job 0: spmStart: <bound method StoragePool.startSpm of <vdsm.storage.sp.StoragePool object at 0x7fd9ebde5c10>> (args: (-1, '-1', 250, 4) kwargs: {}) (task:914)
2018-05-31 18:56:17,762+0530 DEBUG (tasks/7) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') Job.run: running spmStart: <bound method StoragePool.startSpm of <vdsm.storage.sp.StoragePool object at 0x7fd9ebde5c10>> (args: (-1, '-1', 250, 4) kwargs: {}) callback None (task:333)
......
......
The exception:

2018-05-31 18:56:17,880+0530 DEBUG (tasks/7) [storage.TaskManager.Task] (Task='60bf8af9-d4d3-4753-a40d-2a8d028d3d3c') 60bf8af9-d4d3-4753-a40d-2a8d028d3d3c: load from /rhev/data-center/e6c5d8a2-5386-11e8-8885-004655214801/mastersd/master/tasks, ext '' (task:729)
2018-05-31 18:56:17,889+0530 ERROR (tasks/7) [storage.TaskManager.Task] Unexpected error (task:652)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 631, in _loadMetaFile
    for line in getProcPool().readLines(filename):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 329, in readLines
    return ioproc.readlines(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 554, in readlines
    return self.readfile(path, direct).splitlines()
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 542, in readfile
    "direct": direct}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 451, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 2] No such file or directory
2018-05-31 18:56:17,889+0530 ERROR (tasks/7) [storage.TaskManager] taskManager: Skipping directory: 60bf8af9-d4d3-4753-a40d-2a8d028d3d3c (taskManager:223)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/taskManager.py", line 215, in loadDumpedTasks
    t = Task.loadTask(store, taskID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1162, in loadTask
    t._load(store, ext)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 751, in _load
    self._loadRecoveryMetaFile(taskDir, rn)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 705, in _loadRecoveryMetaFile
    self._loadMetaFile(taskFile, self.recoveries[n], Recovery.fields)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 653, in _loadMetaFile
    raise se.TaskMetaDataLoadError(filename)
TaskMetaDataLoadError: Can't load Task Metadata:
('/rhev/data-center/e6c5d8a2-5386-11e8-8885-004655214801/mastersd/master/tasks/60bf8af9-d4d3-4753-a40d-2a8d028d3d3c/60bf8af9-d4d3-4753-a40d-2a8d028d3d3c.recover.0',)
......
......
2018-05-31 18:56:17,922+0530 DEBUG (tasks/7) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') moving from state finished -> state finished (task:602)
....
2018-05-31 18:56:17,923+0530 DEBUG (tasks/7) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') Task.run: exit - success: result (task:925)
2018-05-31 18:56:17,923+0530 DEBUG (tasks/7) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') ref 0 aborting False (task:1002)
2018-05-31 18:56:17,923+0530 INFO (tasks/7) [storage.ThreadPool.WorkerThread] FINISH task 0de40b4b-9a07-4ed4-800d-4366123b35db (threadPool:210)
2018-05-31 18:56:17,923+0530 DEBUG (tasks/7) [storage.ThreadPool] Number of running tasks: 0 (threadPool:60)
(In reply to Maor from comment #13)
> I see in the VDSM log that it tries to read the tasks before the host become
> the SPM.
> Nir maybe do you have an idea if there is any relation of this exception to
> the bug:

I don't have any idea. If you ask a more specific question I may be able to help.
4.2.1 has already been released, please re-target.
These exceptions:

OSError: [Errno 2] No such file or directory
...
TaskMetaDataLoadError: Can't load Task Metadata:
('/rhev/data-center/e6c5d8a2-5386-11e8-8885-004655214801/mastersd/master/tasks/60bf8af9-d4d3-4753-a40d-2a8d028d3d3c/60bf8af9-d4d3-4753-a40d-2a8d028d3d3c.recover.0',)

mean that a dumped task could not be loaded because there was no such file in the task directory.

This error does not affect the SPM start process; the code tries to load dumped tasks and ignores the result of the load. So this looks like a logging issue. Moving severity to low since I don't see any real issue.

Regarding the exceptions, we have several issues:

1. Logging several exceptions for the same problem - we should log the same issue exactly once. This happens because the code uses the anti-pattern of logging an exception and then raising a new one.

2. I'm not sure why we log a traceback for an expected error like a missing dumped task file. This should be logged without a traceback.

3. I'm not sure why a missing dumped task is an error, since the SPM code ignores it. This should probably be a warning.
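To illustrate issue 1, here is a minimal sketch (assumed names, not the actual vdsm code) of the log-and-reraise anti-pattern, and a shape that reports the expected condition once as a warning:

```python
import logging

log = logging.getLogger("sketch")

class TaskMetaDataLoadError(Exception):
    pass

# Anti-pattern: the layer logs the failure with a traceback and then
# raises a new exception, which the caller logs again with its own
# traceback -- one missing file produces several ERROR tracebacks,
# as seen in the logs above.
def load_meta_file_bad(path):
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        log.exception("Unexpected error")   # traceback #1
        raise TaskMetaDataLoadError(path)   # caller logs traceback #2

# Suggested shape (issues 2 and 3): a missing dumped task file is an
# expected condition -- report it exactly once, as a warning, without
# a traceback, and let the caller skip the task.
def load_meta_file_good(path):
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        log.warning("Cannot load task metadata, skipping: %s", path)
        return None
```

This is only a sketch of the logging policy, not a drop-in fix for the task persistence code.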
This bug will be focused on the logging problem. If the host maintenance bug reproduces consistently, please open a separate bug with the reproduction steps.
I don't think this will be an easy fix.

- The code generating these exceptions is very complex
- Nobody understands this code
- There are no tests
- There is no specification
- There is no documentation

We also want to replace the implementation of task persistence, removing the need for the master mount. This will allow dropping this legacy code instead of fixing it.

So I would defer this for now.
(In reply to Nir Soffer from comment #19)
> I don't think this will be easy fix.
>
> - The code generating these exceptions is the is very complex
> - Nobody understands this code
> - There are no tests
> - There is no specification
> - There is no documentation
>
> We also want to replace the implementation of task persistence, removing the
> neeed
> for the mast mount. This will allow dropping this legacy code instead of
> fixing it.

Is there an open bug on that?

> So I would defer this for now.
(In reply to Maor from comment #20)
> > We also want to replace the implementation of task persistence, removing the
> > neeed
> > for the mast mount. This will allow dropping this legacy code instead of
> > fixing it.
>
> Is there an open bug on that?

I don't know about any.
Closing old bugs, reopen if still needed.