Created attachment 710097 [details]
logs

Description of problem:

I tried to deactivate the master storage domain but had zombie tasks on the SPM. The engine writes in the log that it is sending SpmStop, but it does not actually send it, and then continues to send disconnectStoragePool, which fails since the host is still SPM. Also, the second host (HSM) becomes non-operational, since we send DisconnectStoragePoolVDSCommand to it before sending it to the SPM.

Version-Release number of selected component (if applicable):
sf10

How reproducible:
100%

Steps to Reproduce:
1. Create zombie tasks on the SPM (fail live storage migration, for example).
2. Try to put the master storage domain into maintenance in a two-host cluster.

Actual results:
We fail to deactivate the storage domain, with an error from vdsm that the host is SPM. The HSM host becomes non-operational because it was disconnected from the pool before the SPM was.

Expected results:
We should check whether tasks exist on the SPM (not just in the db async_tasks table) and, if so, stop the command with an appropriate message. Also, if disconnectStoragePool fails, we should send connectStoragePool to the HSM (see the sketch below).

Additional info: logs
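A minimal sketch of the two fixes suggested under "Expected results", assuming hypothetical SpmClient/HsmClient interfaces (none of these names are the actual oVirt engine API):

import java.util.List;

interface SpmClient {
    List<String> getAllTasks();       // tasks actually known to vdsm on the SPM
    void disconnectStoragePool();
}

interface HsmClient {
    void disconnectStoragePool();
    void connectStoragePool();
}

class DeactivateMasterDomainSketch {
    void deactivate(SpmClient spm, HsmClient hsm) {
        // Fix 1: fail early with a clear message if the SPM still reports
        // tasks, instead of trusting only the engine's async_tasks table.
        if (!spm.getAllTasks().isEmpty()) {
            throw new IllegalStateException(
                "Cannot deactivate the master domain: tasks are still running on the SPM");
        }
        hsm.disconnectStoragePool();
        try {
            spm.disconnectStoragePool();   // fails if the host is still SPM
        } catch (RuntimeException e) {
            // Fix 2: reconnect the HSM so it does not go non-operational.
            hsm.connectStoragePool();
            throw e;
        }
    }
}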
> Steps to Reproduce:
> 1. Create zombie tasks on the SPM (fail live storage migration, for example).
> 2. Try to put the master storage domain into maintenance in a two-host cluster.
>
> Actual results:
> We fail to deactivate the storage domain, with an error from vdsm that the
> host is SPM. The HSM host becomes non-operational because it was
> disconnected from the pool before the SPM was.
>
> Expected results:
> We should check whether tasks exist on the SPM (not just in the db
> async_tasks table) and, if so, stop the command with an appropriate message.
> Also, if disconnectStoragePool fails, we should send connectStoragePool to
> the HSM.

The engine is not sending spmStop because it already knows that there are tasks running, so it shouldn't send *any* disconnectStoragePool commands to begin with.

Liron, I have a strong sense of déjà vu here; isn't this a duplicate of another bug you're working on?
Ayal, we had a kind of similar bug: if there are tasks, spmStop isn't sent. https://bugzilla.redhat.com/show_bug.cgi?id=920220

Basically this seems like a corner case in which we have zombie tasks, which shouldn't generally happen.

"we should check if tasks exist on the SPM (not just in the db async_tasks)" - this would narrow down the possibility of hitting the case, but won't prevent it. Ayal, if we want, I can add it, but in 99% of cases I'd guess this check will just be overhead (as we generally won't have zombie tasks), and in the other cases a task may be initiated immediately after the check, so we would reach the same issue (see the sketch below).

"if disconnectStoragePool fails we should send connectStoragePool to the hsm" - in the rare case in which that happens, host recovery will bring the host back; I don't think we should do recovery operations within the command.

Ayal, how do we want to proceed with this?
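To illustrate the window described above, a minimal self-contained sketch (SpmApi and all method names are made up): the check and the act are not atomic, so a task started in between still defeats the check.

interface SpmApi {
    boolean hasRunningTasks();
    void spmStop();
}

class CheckThenActRaceSketch {
    void stopSpmIfIdle(SpmApi spm) {
        if (!spm.hasRunningTasks()) {   // check
            // another flow can start a task right here,
            // between the check and the act below
            spm.spmStop();              // act: may still collide with a new task
        }
    }
}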
We need to make sure we do not disconnect the HSM. Wrt zombie tasks, there is nothing to do there; just fail the operation earlier and give the user a proper message.
Dafna, can you elaborate on what you did to end up with tasks on vdsm that are unknown to the engine in the LSM flow? (This should be a separate bug.)
Steps to Reproduce:
1. Create zombie tasks on the SPM (fail live storage migration, for example) - restarting vdsm during live storage migration should create zombie tasks.
Dafna, just to understand how we got to that situation: I tried to restart vdsm during LSM and couldn't reach a situation in which we have unknown tasks (tasks in vdsm but not in the engine). Do we have a clear reproducer? I can't find the unknown task creation in the engine/vdsm logs.
Ayal, how do we want to proceed with this?

With the current implementation of the async task mechanism, we can reach a state in which we have a task in vdsm that isn't in the engine, from different flows (as we receive the guid back from vdsm after the task has already been created), for example when the task was created but we had a network error on the way back. I guess we can do one of two things:

1. Consider only the async tasks persisted in the db in that flow.
2. Wait for infra changes to async tasks (for example, having the uuid generated by the engine and added to the db before attempting to create the task in vdsm) - see the sketch below.

I'm not really a fan of changing the flow (first disconnecting the HSMs and such); it seems fine to me as is.
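A rough sketch of option 2, assuming hypothetical TaskDao/VdsmClient types (the real engine classes and vdsm verbs differ): the engine persists the task id before asking vdsm to create the task, so a lost reply can no longer leave a task that exists in vdsm but not in the engine.

import java.util.UUID;

interface TaskDao {
    void save(UUID taskId, String status);
    void update(UUID taskId, String status);
}

interface VdsmClient {
    void createTask(UUID taskId) throws Exception;
}

class EngineGeneratedTaskIdSketch {
    void startTask(TaskDao dao, VdsmClient vdsm) {
        UUID taskId = UUID.randomUUID();
        dao.save(taskId, "CREATING");    // persisted before vdsm is involved
        try {
            vdsm.createTask(taskId);     // vdsm accepts the engine-chosen id
            dao.update(taskId, "RUNNING");
        } catch (Exception e) {
            // A lost reply no longer orphans the task: the id is already in
            // the db, so recovery can poll vdsm and resolve it either way.
            dao.update(taskId, "UNKNOWN");
        }
    }
}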
* The problem of the return value of spmStop being wrong in the case of async tasks unknown to the engine will be solved regardless - we might always have unknown async tasks currently, and therefore it's less relevant.
* Currently the solution here changes the flow: in the case of the last master domain, the HSMs are disconnected only after the SPM has been stopped and disconnected from the pool (see the sketch below).
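A minimal sketch of that ordering, assuming a hypothetical Host type (illustrative names only, not the actual engine flow code):

import java.util.List;

interface Host {
    void spmStop();
    void disconnectStoragePool();
}

class LastMasterDisconnectOrderSketch {
    void disconnectPool(Host spm, List<Host> hsms) {
        // SPM first: stop it, then disconnect it from the pool.
        spm.spmStop();
        spm.disconnectStoragePool();
        // Only then the HSMs, so a failure on the SPM side can no longer
        // leave an HSM disconnected and non-operational.
        for (Host hsm : hsms) {
            hsm.disconnectStoragePool();
        }
    }
}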
oVirt 3.4.0 alpha has been released.
Tested on 3.4.0-0.7.beta2.el6. Reproduced with a zombie task as described in the bug (restart vdsm during LSM). The HSM host stays active and does not become non-operational if the SPM wasn't actually disconnected from the pool.
Closing as part of 3.4.0