Created attachment 710097 [details]
logs

Description of problem:

I tried to deactivate the master storage domain but had zombie tasks on the SPM. The engine writes in the log that it is sending SpmStop, but it does not actually send it, and then continues to send disconnectStoragePool, which fails since the host is still SPM. Also, the second host (HSM) becomes non-operational, since we send DisconnectStoragePoolVDSCommand to it before sending it to the SPM.

Version-Release number of selected component (if applicable):
sf10

How reproducible:
100%

Steps to Reproduce:
1. Create zombie tasks on the SPM (fail live storage migration, for example).
2. Try to put the master storage domain into maintenance in a two-host cluster.

Actual results:
We fail to deactivate the storage domain, with an error from vdsm that the host is SPM. The HSM host becomes non-operational because it was disconnected from the pool before the SPM was.

Expected results:
We should check whether tasks exist on the SPM (not just in the db async_tasks table) and, if so, stop the command with an appropriate message. Also, if disconnectStoragePool fails, we should send connectStoragePool to the HSM (see the sketch below).

Additional info: logs
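A minimal sketch of the two fixes suggested under "Expected results", assuming hypothetical SpmClient/HsmClient interfaces (none of these names are the actual oVirt engine API):

import java.util.List;

interface SpmClient {
    List<String> getAllTasks();       // tasks actually known to vdsm on the SPM
    void disconnectStoragePool();
}

interface HsmClient {
    void disconnectStoragePool();
    void connectStoragePool();
}

class DeactivateMasterDomainSketch {
    void deactivate(SpmClient spm, HsmClient hsm) {
        // Fix 1: fail early with a clear message if the SPM still reports
        // tasks, instead of trusting only the engine's async_tasks table.
        if (!spm.getAllTasks().isEmpty()) {
            throw new IllegalStateException(
                "Cannot deactivate the master domain: tasks are still running on the SPM");
        }
        hsm.disconnectStoragePool();
        try {
            spm.disconnectStoragePool();   // fails if the host is still SPM
        } catch (RuntimeException e) {
            // Fix 2: reconnect the HSM so it does not go non-operational.
            hsm.connectStoragePool();
            throw e;
        }
    }
}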
> Steps to Reproduce:
> 1. Create zombie tasks on the SPM (fail live storage migration, for example).
> 2. Try to put the master storage domain into maintenance in a two-host cluster.
>
> Actual results:
> We fail to deactivate the storage domain, with an error from vdsm that the
> host is SPM. The HSM host becomes non-operational because it was
> disconnected from the pool before the SPM was.
>
> Expected results:
> We should check whether tasks exist on the SPM (not just in the db
> async_tasks table) and, if so, stop the command with an appropriate message.
> Also, if disconnectStoragePool fails, we should send connectStoragePool to
> the HSM.

The engine is not sending spmStop because it already knows that there are tasks running, so it shouldn't send *any* disconnectStoragePool commands to begin with.

Liron, I have a strong sense of déjà vu here; isn't this a duplicate of another bug you're working on?
Ayal, we had a kind of similar bug: if there are tasks, spmStop isn't sent. https://bugzilla.redhat.com/show_bug.cgi?id=920220

Basically this seems like a corner case in which we have zombie tasks, which shouldn't generally happen.

"we should check if tasks exist on the SPM (not just in the db async_tasks)" - this would narrow down the possibility of hitting the case, but won't prevent it. Ayal, if we want, I can add it, but in 99% of cases I'd guess this check will just be overhead (as we generally won't have zombie tasks), and in the other cases a task may be initiated immediately after the check, so we would reach the same issue (see the sketch below).

"if disconnectStoragePool fails we should send connectStoragePool to the hsm" - in the rare case in which that happens, host recovery will bring the host back; I don't think we should do recovery operations within the command.

Ayal, how do we want to proceed with this?
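To illustrate the window described above, a minimal self-contained sketch (SpmApi and all method names are made up): the check and the act are not atomic, so a task started in between still defeats the check.

interface SpmApi {
    boolean hasRunningTasks();
    void spmStop();
}

class CheckThenActRaceSketch {
    void stopSpmIfIdle(SpmApi spm) {
        if (!spm.hasRunningTasks()) {   // check
            // another flow can start a task right here,
            // between the check and the act below
            spm.spmStop();              // act: may still collide with a new task
        }
    }
}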
We need to make sure we do not disconnect the HSM. Wrt zombie tasks, there is nothing to do there; just fail the operation earlier and give the user a proper message.
Dafna, can you elaborate on what you did to end up with tasks on vdsm that are unknown to the engine in the LSM flow? (This should be a separate bug.)
Steps to Reproduce:
1. Create zombie tasks on the SPM (fail live storage migration, for example) - restarting vdsm during live storage migration should create zombie tasks.
Dafna, just to understand how we got to that situation: I tried to restart vdsm during LSM and couldn't reach a situation in which we have unknown tasks (tasks in vdsm but not in the engine). Do we have a clear reproducer? I can't find the unknown task creation in the engine/vdsm logs.
Ayal, how do we want to proceed with this?

With the current implementation of the async task mechanism, we can reach a state in which we have a task in vdsm that isn't in the engine, from different flows (as we receive the guid back from vdsm after the task has already been created), for example when the task was created but we had a network error on the way back. I guess we can do one of two things:

1. Consider only the async tasks persisted in the db in that flow.
2. Wait for infra changes to async tasks (for example, having the uuid generated by the engine and added to the db before attempting to create the task in vdsm) - see the sketch below.

I'm not really a fan of changing the flow (first disconnecting the HSMs and such); it seems fine to me as is.
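A rough sketch of option 2, assuming hypothetical TaskDao/VdsmClient types (the real engine classes and vdsm verbs differ): the engine persists the task id before asking vdsm to create the task, so a lost reply can no longer leave a task that exists in vdsm but not in the engine.

import java.util.UUID;

interface TaskDao {
    void save(UUID taskId, String status);
    void update(UUID taskId, String status);
}

interface VdsmClient {
    void createTask(UUID taskId) throws Exception;
}

class EngineGeneratedTaskIdSketch {
    void startTask(TaskDao dao, VdsmClient vdsm) {
        UUID taskId = UUID.randomUUID();
        dao.save(taskId, "CREATING");    // persisted before vdsm is involved
        try {
            vdsm.createTask(taskId);     // vdsm accepts the engine-chosen id
            dao.update(taskId, "RUNNING");
        } catch (Exception e) {
            // A lost reply no longer orphans the task: the id is already in
            // the db, so recovery can poll vdsm and resolve it either way.
            dao.update(taskId, "UNKNOWN");
        }
    }
}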
* The problem of the return value of spmStop being wrong in the case of async tasks unknown to the engine will be solved regardless - we might always have unknown async tasks currently, and therefore it's less relevant.
* Currently the solution here changes the flow: in the case of the last master domain, the HSMs are disconnected only after the SPM has been stopped and disconnected from the pool (see the sketch below).
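A minimal sketch of that ordering, assuming a hypothetical Host type (illustrative names only, not the actual engine flow code):

import java.util.List;

interface Host {
    void spmStop();
    void disconnectStoragePool();
}

class LastMasterDisconnectOrderSketch {
    void disconnectPool(Host spm, List<Host> hsms) {
        // SPM first: stop it, then disconnect it from the pool.
        spm.spmStop();
        spm.disconnectStoragePool();
        // Only then the HSMs, so a failure on the SPM side can no longer
        // leave an HSM disconnected and non-operational.
        for (Host hsm : hsms) {
            hsm.disconnectStoragePool();
        }
    }
}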
oVirt 3.4.0 alpha has been released.
Tested on 3.4.0-0.7.beta2.el6. Reproduced with a zombie task as described in the bug (restart vdsm during LSM). The HSM host stays active and does not become non-operational if the SPM wasn't actually disconnected from the pool.
Closing as part of 3.4.0