Bug 1493184

Summary: TaskMetaDataLoadError during attach storage domain
Product: [oVirt] vdsm
Reporter: Elad <ebenahar>
Component: Core
Assignee: Maor <mlipchuk>
Status: CLOSED WONTFIX
QA Contact: Elad <ebenahar>
Severity: low
Docs Contact:
Priority: unspecified
Version: 4.20.3
CC: amureini, bugs, ebenahar, mlipchuk, nsoffer, pagranat, ratamir, tnisan
Target Milestone: ---
Keywords: Regression, Reopened
Target Release: ---
Flags: sbonazzo: ovirt-4.3-
       sbonazzo: blocker-
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-03 08:45:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
engine.log and vdsm.log (flags: none)
vdsm.log - 26.9 (flags: none)
engine & vdsm logs (flags: none)

Description Elad 2017-09-19 14:32:06 UTC
Created attachment 1327995 [details]
engine.log and vdsm.log

Description of problem:
Attaching a storage domain sometimes fails with the following error:

2017-09-19 17:21:51,812+0300 DEBUG (event/28) [storage.Event.storage.DomainMonitor.onDomainStateChange] Event emitted (misc:451)
2017-09-19 17:21:51,828+0300 ERROR (tasks/0) [storage.TaskManager.Task] Unexpected error (task:649)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 628, in _loadMetaFile
    for line in getProcPool().readLines(filename):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 328, in readLines
    return ioproc.readlines(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 562, in readlines
    return self.readfile(path, direct).splitlines()
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 550, in readfile
    "direct": direct}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 459, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 2] No such file or directory
2017-09-19 17:21:51,829+0300 ERROR (tasks/0) [storage.TaskManager] taskManager: Skipping directory: c687c9a2-25d1-4b94-8ba0-2bc9b94943ae (taskManager:223)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/taskManager.py", line 215, in loadDumpedTasks
    t = Task.loadTask(store, taskID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1158, in loadTask
    t._load(store, ext)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 748, in _load
    self._loadRecoveryMetaFile(taskDir, rn)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 702, in _loadRecoveryMetaFile
    self._loadMetaFile(taskFile, self.recoveries[n], Recovery.fields)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 650, in _loadMetaFile
    raise se.TaskMetaDataLoadError(filename)
TaskMetaDataLoadError: Can't load Task Metadata: ('/rhev/data-center/c7ac57d3-224f-44af-89c6-84338469ec40/mastersd/master/tasks/c687c9a2-25d1-4b94-8ba0-2bc9b94943ae/c687c9a2-25d1-4b94-8ba0-2bc9b94943ae.recover.0',)
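
For orientation, here is a minimal sketch of the load path implied by the traceback
above. The names mirror the traceback frames, but this is not the actual vdsm source;
readLines() below is a plain stand-in for the ioprocess-based helper:

class TaskMetaDataLoadError(Exception):
    """Stand-in for vdsm.storage.exception.TaskMetaDataLoadError."""

def readLines(filename):
    # The real implementation goes through the out-of-process helper
    # (ioprocess); a missing file surfaces as OSError(ENOENT), exactly as
    # in the log above.
    with open(filename) as f:
        return f.readlines()

def _loadMetaFile(filename):
    try:
        for line in readLines(filename):
            pass  # parse "key = value" pairs into the task object (omitted)
    except Exception:
        # The OSError is logged with a full traceback and then wrapped in a
        # new exception, so the same missing .recover.0 file produces two
        # ERROR entries in vdsm.log.
        raise TaskMetaDataLoadError(filename)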




2017-09-19 17:21:19,103+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-11) [4ad0a4e5-a457-48a1-be0c-63b6226fbeda] EVENT_ID: IRS_BROKER_COMMAND_FAILURE(10,803), VDSM command AttachStorageDomainVDS failed: Message timeout which can be caused by communication issues
2017-09-19 17:21:19,104+03 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (default task-11) [4ad0a4e5-a457-48a1-be0c-63b6226fbeda] ERROR, AttachStorageDomainVDSCommand( AttachStorageDomainVDSCommandParameters:{storagePoolId='c7ac57d3-224f-44af-89c6-84338469ec40', ignoreFailoverLimit='false', storageDomainId='5c1ec113-33a8-4f4e-bae9-26bd849f31a3'}), exception: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues, log id: 4a99351c
2017-09-19 17:21:19,106+03 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (default task-11) [4ad0a4e5-a457-48a1-be0c-63b6226fbeda] Exception



Version-Release number of selected component (if applicable):
vdsm-4.20.3-47.git4801123.el7.centos.x86_64
ovirt-engine-4.2.0-0.0.master.20170915203601.git96df02d.el7.centos.noarch

How reproducible:
Happens sometimes, with iSCSI


Steps to Reproduce:
1. Create an iSCSI domain 


Actual results:
Attaching the storage domain fails with the exception mentioned above.

Expected results:


Additional info:
engine.log and vdsm.log

Comment 1 Red Hat Bugzilla Rules Engine 2017-09-25 09:29:30 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 2 Maor 2017-09-25 11:54:42 UTC
The storage domain mentioned in the errors is
yellow-vdsb.qa.lab.tlv.redhat.com:/Storage_NFS/export_domain_storage_local_ge11

It looks like there was a problem in the file system of this storage domain:
2017-09-19 17:21:51,551+0300 DEBUG (tasks/0) [storage.StorageDomain] creating task dir: /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_export__domain__storage__local__ge11/25b9af4c-be25-485c-b96c-d610bb2834cf/master/tasks (sd:959)


2017-09-19 17:21:51,753+0300 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='4561f359-bb1f-422a-8484-6aad18c4e289') 4561f359-bb1f-422a-8484-6aad18c4e289: load from /rhev/data-center/c7ac57d3-224f-44af-89c6-84338469ec40/mastersd/master/tasks, ext '' (task:726)


It also looks like there was a VDSM restart at 17:17:25,
but the SPM status becomes 'SPM' only at 17:21:52:

2017-09-19 17:21:47,839+0300 DEBUG (jsonrpc/1) [storage.TaskManager.Task] (Task='6e865337-3f77-450a-af65-ae404e9616f0') finished: {'spm_st': {'spmId': -1, 'spmStatus': 'Free', 'spmLver': -1}} (task:1197)


2017-09-19 17:21:52,066+0300 INFO  (jsonrpc/7) [vdsm.api] FINISH getSpmStatus return={'spm_st': {'spmId': 3, 'spmStatus': 'SPM', 'spmLver': 40L}} from=::ffff:10.35.162.10,58696, flow_id=14a4544b, task_id=fef0e486-2d69-4007-8478-688a06486d69 (api:52)


VDSM tries to upgrade the iSCSI storage domain, and the engine operation fails with a timeout exception.

Elad, does it reproduce only when there is a VDSM restart?

Comment 3 Elad 2017-09-26 09:08:54 UTC
Created attachment 1330931 [details]
vdsm.log - 26.9

No, here is an occurrence of the issue without a VDSM restart (vdsm.log attached):


2017-09-19 11:23:37,184+0300 ERROR (tasks/2) [storage.TaskManager] taskManager: Skipping directory: c687c9a2-25d1-4b94-8ba0-2bc9b94943ae (taskManager:223)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/taskManager.py", line 215, in loadDumpedTasks
    t = Task.loadTask(store, taskID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1158, in loadTask
    t._load(store, ext)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 748, in _load
    self._loadRecoveryMetaFile(taskDir, rn)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 702, in _loadRecoveryMetaFile
    self._loadMetaFile(taskFile, self.recoveries[n], Recovery.fields)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 650, in _loadMetaFile
    raise se.TaskMetaDataLoadError(filename)
TaskMetaDataLoadError: Can't load Task Metadata: ('/rhev/data-center/c7ac57d3-224f-44af-89c6-84338469ec40/mastersd/master/tasks/c687c9a2-25d1-4b94-8ba0-2bc9b94943ae/c687c9a2-25d1-4b94-8ba0-2bc9b94943ae.recover.0',)

Comment 4 Yaniv Kaul 2017-11-16 08:08:38 UTC
Maor, this is marked as a blocker for oVirt GA.
What is the status here?

Comment 5 Allon Mureinik 2017-11-16 09:44:28 UTC
(In reply to Yaniv Kaul from comment #4)
> Maor, this is marked as a blocker for oVirt GA.
> What is the status here?
The upgrade to V4 was introduced back in 4.0. This shouldn't be a blocker NOW.

Comment 6 Maor 2017-11-26 16:15:39 UTC
(In reply to Yaniv Kaul from comment #4)
> Maor, this is marked as a blocker for oVirt GA.
> What is the status here?

Sorry about the delayed status update; I'm still working on this with Elad to find the root cause.

Comment 7 Maor 2017-11-27 12:56:40 UTC
I can't seem to reproduce this in my environment. Elad, does it still reproduce?

Comment 8 Elad 2017-11-27 13:15:15 UTC
Not in the past month. This is a pretty common scenario

Comment 9 Allon Mureinik 2017-11-27 13:22:05 UTC
(In reply to Elad from comment #8)
> Not in the past month. This is a pretty common scenario

Closing then.
If this reoccurs, please reopen and share the logs.

Comment 10 Polina 2018-06-13 10:19:17 UTC
Hi, I see a similar problem in an automation build run on an RHHI environment, during a scenario that tries to put one of the hosts into maintenance. The engine and vdsm logs are attached.
The test tries to put host1 into maintenance. The action fails, and at the same time I see the error in the second log.
Please see the traceback starting at
2018-05-31 18:56:17,889+0530 ERROR (tasks/7) [storage.TaskManager.Task] Unexpected error (task:652) Traceback (most recent call last):
in hypervisor-rhsqa-grafton6-nic2_vdsm.log

Comment 11 Polina 2018-06-13 10:20:15 UTC
Created attachment 1450838 [details]
engine & vdsm logs

Comment 12 Maor 2018-06-13 14:38:20 UTC
It seems the operation to put the host into maintenance indeed failed, but I don't think the error you see is related to it, since the engine log indicates it fails for the following reason:
  VDS_CANNOT_MAINTENANCE_GLUSTER_QUORUM_CANNOT_BE_MET

Later, it seems the admin force-selected the SPM on host grafton6:

2018-05-31 18:56:10,683+05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-56) [hosts_syncAction_fa57a56d-23f8-4ecb] EVENT_ID: USER_FORCE_SELECTED_SPM(159), Host rhsqa-grafton6.lab.eng.blr.redhat.com was force selected by admin@internal-authz

Comment 13 Maor 2018-06-13 14:40:19 UTC
I see in the VDSM log that it tries to read the tasks before the host becomes the SPM.
Nir, do you have any idea whether this exception is related to the bug:

2018-05-31 18:56:17,760+0530 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') moving from state running -> state preparing (task:602)
2018-05-31 18:56:17,760+0530 INFO  (jsonrpc/6) [vdsm.api] START spmStart(spUUID='e6c5d8a2-5386-11e8-8885-004655214801', prevID=-1, prevLVER='-1', maxHostID=250, domVersion='4', options=None) from=::ffff:10.70.44.74,50088, flow_id=60255aba, task_id=0de40b4b-9a07-4ed4-800d-4366123b35db (api:46)
....
2018-05-31 18:56:17,761+0530 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') _resourcesAcquired: 00_storage.e6c5d8a2-5386-11e8-8885-004655214801 (exclusive) (task:834)
2018-05-31 18:56:17,761+0530 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') ref 1 aborting False (task:1002)
2018-05-31 18:56:17,761+0530 DEBUG (jsonrpc/6) [storage.TaskManager] scheduled job spmStart for task 0de40b4b-9a07-4ed4-800d-4366123b35db  (taskManager:79)
2018-05-31 18:56:17,761+0530 INFO  (jsonrpc/6) [vdsm.api] FINISH spmStart return=None from=::ffff:10.70.44.74,50088, flow_id=60255aba, task_id=0de40b4b-9a07-4ed4-800d-4366123b35db (api:52)
2018-05-31 18:56:17,761+0530 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') Prepare: 1 jobs exist, move to acquiring (task:1193)
2018-05-31 18:56:17,761+0530 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') moving from state running -> state acquiring (task:602)
2018-05-31 18:56:17,761+0530 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') moving from state running -> state queued (task:602)
2018-05-31 18:56:17,761+0530 DEBUG (jsonrpc/6) [storage.TaskManager] queuing task: 0de40b4b-9a07-4ed4-800d-4366123b35db (taskManager:57)
2018-05-31 18:56:17,761+0530 DEBUG (jsonrpc/6) [storage.TaskManager] task queued: 0de40b4b-9a07-4ed4-800d-4366123b35db (taskManager:65)
2018-05-31 18:56:17,761+0530 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') returning (task:1198)
2018-05-31 18:56:17,761+0530 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') ref 0 aborting False (task:1002)
2018-05-31 18:56:17,762+0530 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call StoragePool.spmStart succeeded in 0.00 seconds (__init__:573)
2018-05-31 18:56:17,762+0530 DEBUG (tasks/7) [storage.ThreadPool] Number of running tasks: 1 (threadPool:60)
2018-05-31 18:56:17,762+0530 INFO  (tasks/7) [storage.ThreadPool.WorkerThread] START task 0de40b4b-9a07-4ed4-800d-4366123b35db (cmd=<bound method Task.commit of <vdsm.storage.task.Task instance at 0x7fd9ebf4ea28>>, args=None) (threadPool:208)
2018-05-31 18:56:17,762+0530 DEBUG (tasks/7) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') committing task: 0de40b4b-9a07-4ed4-800d-4366123b35db (task:1210)
2018-05-31 18:56:17,762+0530 DEBUG (tasks/7) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') moving from state running -> state running (task:602)
2018-05-31 18:56:17,762+0530 DEBUG (tasks/7) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') Task.run: running job 0: spmStart: <bound method StoragePool.startSpm of <vdsm.storage.sp.StoragePool object at 0x7fd9ebde5c10>> (args: (-1, '-1', 250, 4) kwargs: {}) (task:914)
2018-05-31 18:56:17,762+0530 DEBUG (tasks/7) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') Job.run: running spmStart: <bound method StoragePool.startSpm of <vdsm.storage.sp.StoragePool object at 0x7fd9ebde5c10>> (args: (-1, '-1', 250, 4) kwargs: {}) callback None (task:333)
......
......
The Exception:


2018-05-31 18:56:17,880+0530 DEBUG (tasks/7) [storage.TaskManager.Task] (Task='60bf8af9-d4d3-4753-a40d-2a8d028d3d3c') 60bf8af9-d4d3-4753-a40d-2a8d028d3d3c: load from /rhev/data-center/e6c5d8a2-5386-11e8-8885-004655214801/mastersd/master/tasks, ext '' (task:729)
2018-05-31 18:56:17,889+0530 ERROR (tasks/7) [storage.TaskManager.Task] Unexpected error (task:652)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 631, in _loadMetaFile
    for line in getProcPool().readLines(filename):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 329, in readLines
    return ioproc.readlines(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 554, in readlines
    return self.readfile(path, direct).splitlines()
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 542, in readfile
    "direct": direct}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 451, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 2] No such file or directory
2018-05-31 18:56:17,889+0530 ERROR (tasks/7) [storage.TaskManager] taskManager: Skipping directory: 60bf8af9-d4d3-4753-a40d-2a8d028d3d3c (taskManager:223)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/taskManager.py", line 215, in loadDumpedTasks
    t = Task.loadTask(store, taskID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1162, in loadTask
    t._load(store, ext)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 751, in _load
    self._loadRecoveryMetaFile(taskDir, rn)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 705, in _loadRecoveryMetaFile
    self._loadMetaFile(taskFile, self.recoveries[n], Recovery.fields)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 653, in _loadMetaFile
    raise se.TaskMetaDataLoadError(filename)
TaskMetaDataLoadError: Can't load Task Metadata: ('/rhev/data-center/e6c5d8a2-5386-11e8-8885-004655214801/mastersd/master/tasks/60bf8af9-d4d3-4753-a40d-2a8d028d3d3c/60bf8af9-d4d3-4753-a40d-2a8d028d3d3c.recover.0',)
......
......

2018-05-31 18:56:17,922+0530 DEBUG (tasks/7) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') moving from state finished -> state finished (task:602)
....
2018-05-31 18:56:17,923+0530 DEBUG (tasks/7) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') Task.run: exit - success: result  (task:925)
2018-05-31 18:56:17,923+0530 DEBUG (tasks/7) [storage.TaskManager.Task] (Task='0de40b4b-9a07-4ed4-800d-4366123b35db') ref 0 aborting False (task:1002)
2018-05-31 18:56:17,923+0530 INFO  (tasks/7) [storage.ThreadPool.WorkerThread] FINISH task 0de40b4b-9a07-4ed4-800d-4366123b35db (threadPool:210)
2018-05-31 18:56:17,923+0530 DEBUG (tasks/7) [storage.ThreadPool] Number of running tasks: 0 (threadPool:60)

Comment 14 Nir Soffer 2018-06-13 14:57:18 UTC
(In reply to Maor from comment #13)
> I see in the VDSM log that it tries to read the tasks before the host become
> the SPM.
> Nir maybe do you have an idea if there is any relation of this exception to
> the bug:

I don't have any idea. If you ask a more specific question I may be able to help.

Comment 15 Sandro Bonazzola 2018-06-21 12:03:57 UTC
4.2.1 has already been released, please re-target.

Comment 17 Nir Soffer 2018-06-25 12:27:16 UTC
These exceptions:

OSError: [Errno 2] No such file or directory
...
TaskMetaDataLoadError: Can't load Task Metadata: ('/rhev/data-center/e6c5d8a2-5386-11e8-8885-004655214801/mastersd/master/tasks/60bf8af9-d4d3-4753-a40d-2a8d028d3d3c/60bf8af9-d4d3-4753-a40d-2a8d028d3d3c.recover.0',)

mean that a dumped task could not be loaded because there was no such file in the
task directory.

This error does not affect the SPM start process; the code tries to load the dumped
tasks and ignores the result of the load.

So this looks like a logging issue; moving severity to low since I don't see any real
issue.

Regarding the exceptions, we have several issues:

1. Logging several exceptions for the same problem - we should log the same issue
   exactly once. This happens because the code uses the anti-pattern of logging
   an exception and raising a new one (see the sketch after this list).

2. I'm not sure why we log a traceback for an expected error like a missing dumped
   task file. This should be logged without a traceback.

3. I'm not sure why a missing dumped task is an error, since the SPM code ignores
   it. This should probably be a warning.
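
To make the anti-pattern concrete, here is a minimal sketch of the double-logging
pattern and one possible way to treat the missing file as an expected condition.
The function and logger names are illustrative only, not the actual vdsm code:

import errno
import logging

log = logging.getLogger("storage.TaskManager.Task")

class TaskMetaDataLoadError(Exception):
    """Stand-in for vdsm.storage.exception.TaskMetaDataLoadError."""

# Anti-pattern: the loader logs the full traceback and then raises a new
# exception, so the caller (loadDumpedTasks) logs the same failure again.
def load_meta_current(filename, read_lines):
    try:
        return read_lines(filename)
    except Exception:
        log.exception("Unexpected error")       # first ERROR with traceback
        raise TaskMetaDataLoadError(filename)   # caller logs a second one

# Possible fix: treat a missing dumped-task file as an expected condition,
# log it once as a warning without a traceback, and let the caller skip it.
def load_meta_fixed(filename, read_lines):
    try:
        return read_lines(filename)
    except OSError as e:
        if e.errno == errno.ENOENT:
            log.warning("Dumped task file %s is missing, skipping", filename)
            return None
        raise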

Comment 18 Maor 2018-06-25 19:26:29 UTC
This bug will focus on the logging problem.
If the host-maintenance issue reproduces consistently, please open a separate bug with the reproduction steps.

Comment 19 Nir Soffer 2018-06-29 21:51:24 UTC
I don't think this will be an easy fix.

- The code generating these exceptions is very complex
- Nobody understands this code
- There are no tests
- There is no specification
- There is no documentation

We also want to replace the implementation of task persistence, removing the need
for the master mount. This will allow dropping this legacy code instead of fixing it.

So I would defer this for now.

Comment 20 Maor 2018-07-08 06:57:18 UTC
(In reply to Nir Soffer from comment #19)
> I don't think this will be easy fix.
> 
> - The code generating these exceptions is very complex
> - Nobody understands this code
> - There are no tests
> - There is no specification
> - There is no documentation
> 
> We also want to replace the implementation of task persistence, removing the
> need
> for the master mount. This will allow dropping this legacy code instead of
> fixing it.

Is there an open bug on that?

> 
> So I would defer this for now.

Comment 21 Nir Soffer 2018-07-10 20:02:38 UTC
(In reply to Maor from comment #20)
> > We also want to replace the implementation of task persistence, removing the
> > need
> > for the master mount. This will allow dropping this legacy code instead of
> > fixing it.
> 
> Is there an open bug on that?

I don't know about any.

Comment 22 Red Hat Bugzilla Rules Engine 2018-08-22 14:21:47 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 23 Tal Nisan 2018-09-03 08:45:43 UTC
Closing old bugs, reopen if still needed.