Bug 1408982 - Lease related tasks remain on SPM
Summary: Lease related tasks remain on SPM
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ovirt-4.1.1
Target Release: 4.1.1.2
Assignee: Nir Soffer
QA Contact: Lilach Zitnitski
URL:
Whiteboard:
Duplicates: 1420023
Depends On:
Blocks:
 
Reported: 2016-12-28 16:57 UTC by Arik
Modified: 2017-04-21 09:32 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-04-21 09:32:30 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: exception+




Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 72120 0 master MERGED core: Add commands to remove and add VM storage leases 2021-01-03 15:02:00 UTC
oVirt gerrit 72274 0 ovirt-engine-4.1 MERGED core: Add commands to remove and add VM storage leases 2021-01-03 15:01:22 UTC

Description Arik 2016-12-28 16:57:30 UTC
Description of problem:
The addition and removal of VM leases create SPM tasks.
This is unexpected since we agreed that for 4.1 we would treat these operations as if they are synchronous.
Of course, treating async operations as sync means that possible failures won't be caught, but these are quick, relatively simple operations, and since the creation of a lease should probably move to run VM in the future, there's no point in complicating the import/add/edit VM flows with polling of these tasks.
However, there is a major problem with not polling these tasks - they remain on the SPM, and therefore a host that served as the SPM while a VM lease was created or removed cannot be switched to maintenance without restarting its VDSM process.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Create a VM with a lease (a rough API sketch follows below)
2. Try to switch the SPM to maintenance
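
For reference, a minimal sketch of step 1 through the oVirt REST API. The engine URL, credentials, cluster, template and storage domain ID are placeholders, and the exact shape of the lease element should be verified against the API documentation for your version:

    # Create a VM whose lease lives on the given storage domain
    # (all names, credentials and UUIDs below are placeholders).
    curl -k -u admin@internal:PASSWORD \
        -H 'Content-Type: application/xml' \
        -X POST 'https://engine.example.com/ovirt-engine/api/vms' \
        -d '<vm>
              <name>vm_with_lease</name>
              <cluster><name>Default</name></cluster>
              <template><name>Blank</name></template>
              <lease>
                <storage_domain id="STORAGE_DOMAIN_UUID"/>
              </lease>
            </vm>'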

Actual results:
Failure to move the SPM host to maintenance, as there are uncleared SPM tasks

Expected results:
The SPM should move to maintenance

Additional info:
Either the tasks should not be created on the VDSM side at all - this is risky, though, since no one can predict when, or if, lease creation will ever move to run VM, as I believe it should - or the engine should automatically clear lease-related tasks once they are finished.

Comment 1 Nir Soffer 2016-12-29 20:53:38 UTC
(In reply to Arik from comment #0)
> Description of problem:
> The addition and removal of VM leases create SPM tasks.
> This is unexpected since we agreed that for 4.1 we would treat these
> operations as if they are synchronous.

I don't remember such an agreement. The feature page is very clear about
the behavior of these verbs:

    Lease.create(lease)

    Starts a SPM task creating a lease on the xleases volume in the
    lease storage domain. Can be used only on the SPM.

    Creates a sanlock resource on the domain xleases volume, and mapping
    from lease_id to the resource offset in the volume.

    Arguments: - lease (Lease): the lease to create

    This is an asynchronous operation, the caller must check the task
    status using vdsm tasks API's and usual SPM error handling policies.

    Lease.delete(lease)

    Starts a SPM task removing a lease on the xleases volume of lease
    storage domain. Can be used only on the SPM.

    Clear the sanlock resource allocated for lease_id, and remove the
    mapping from lease_id to resource offset in the volume.

    Arguments: - lease (Lease): the lease to delete

    This is an asynchronous operation, the caller must check the task
    status using vdsm tasks API's and usual SPM error handling policies.

The first version of the feature page talked about new storage jobs;
these are asynchronous as well.
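
For illustration only, this is roughly how these verbs could be invoked
from the command line. The Lease namespace in vdsm-client and the layout
of the lease argument are assumptions here, and the UUIDs are placeholders:

    # hypothetical direct invocation on the SPM host; the namespace name
    # and the lease dict fields (sd_id, lease_id) are assumptions
    vdsm-client Lease create lease='{"sd_id": "SD_UUID", "lease_id": "VM_UUID"}'
    vdsm-client Lease delete lease='{"sd_id": "SD_UUID", "lease_id": "VM_UUID"}'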

> Of course, treating async operations as sync mean that possible failures
> won't be caught, but these are quick operations, relatively simple and since
> the creation of a lease should probably move to run VM in the future,
> there's no point in complicating import/add/edit VM flows with polling of
> these tasks.

Creating and removing leases is a metadata operation, and the only way we
can do these operations now is on the SPM. We rely on the SPM lease to
modify the xleases volume.

The only safe way to do these operations is in an SPM task. This also
prevents the engine from stopping the SPM in the middle of a lease operation.

> However, there is a major problem with not polling these tasks - they remain
> on the SPM and therefore host that served as SPM while VM lease was created
> or removed cannot be switched to maintenance without restarting its VDSM
> process.

Sure, SPM tasks must be monitored by the engine and cleared when the task
completes.

> The tasks should either not be created on the VDSM side - this is dangerous
> though since no one can predict when and if the lease creation will ever be
> on run VM, as I believe it should. So another option is that the engine will
> automatically clear lease related tasks that are finished.

Of course, the engine must clear these tasks.

For 4.1 we must handle these tasks properly on the engine side. This should
be treated like creating a disk: adding or editing a VM can start the
creation of some disks or leases, and until all the operations are finished,
the VM cannot be modified.

If lease creation fails, the engine should show a clear error, the same way
a disk creation failure is reported, and when editing the VM, the lease
should appear as "No lease".

For a future version, we would like to move the creation and deletion of
leases out of the SPM and run them on any host, using storage jobs and a
special lease in the xleases volume. This will make it easier to consume
this API and to recover much more quickly from errors, but the APIs will
be async as well.

Tal, do you know why this is a storage bug, when the root cause is misusing
storage APIs in a virt flow?

Comment 2 Tal Nisan 2017-02-07 16:59:07 UTC
*** Bug 1420023 has been marked as a duplicate of this bug. ***

Comment 3 Evgheni Dereveanchin 2017-02-08 09:11:00 UTC
What is the workaround for this bug? The affected code made it into the oVirt 4.1.0 release, so anyone trying to use leases will end up with the SPM role stuck on a node that fails to go into Maintenance.

Is it enough to manually clean tasks on the SPM using vdsClient?

Comment 4 Evgheni Dereveanchin 2017-02-08 09:14:33 UTC
Update: I just cleared the finished lease-related tasks on the SPM, which made it possible to migrate the role:

# vdsClient -s 0 getAllTasks
62497948-ad87-4907-856c-56d26ccdb8bd :
         verb = create_lease
         code = 0
         state = finished
         tag = spm
         result = 
         message = 1 jobs completed successfully
         id = 62497948-ad87-4907-856c-56d26ccdb8bd
e6dd3ea5-c280-4e47-bd68-e0b14c962579 :
         verb = delete_lease
         code = 0
         state = finished
         tag = spm
         result = 
         message = 1 jobs completed successfully
         id = e6dd3ea5-c280-4e47-bd68-e0b14c962579



# vdsClient -s 0 clearTask 62497948-ad87-4907-856c-56d26ccdb8bd
{'status': {'message': 'OK', 'code': 0}}



# vdsClient -s 0 clearTask e6dd3ea5-c280-4e47-bd68-e0b14c962579
{'status': {'message': 'OK', 'code': 0}}



# vdsClient -s 0 getAllTasks
<empty output>

Comment 5 Nir Soffer 2017-02-08 09:18:23 UTC
(In reply to Evgheni Dereveanchin from comment #3)
> Is it enough to manually clean tasks on the SPM using vdsClient?

Yes, but you should use vdsm-client, not vdsClient (deprecated).

The best way is to do:

    vdsm-client Host getAllTasks

And then for each task you want to clear:

    vdsm-client Task clear taskID=xxxyyy
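
If there are many finished tasks, they can be cleared in one go with
something like the sketch below. It assumes that vdsm-client prints the
getAllTasks result as JSON keyed by task ID with a "state" field (matching
the vdsClient output in comment 4) and that jq is installed:

    # clear every finished task reported by the host (sketch; the JSON
    # layout assumed here should be verified on your version)
    vdsm-client Host getAllTasks \
        | jq -r 'to_entries[] | select(.value.state == "finished") | .key' \
        | while read task_id; do
              vdsm-client Task clear taskID="$task_id"
          done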

Comment 6 Yaniv Kaul 2017-02-08 21:10:19 UTC
(In reply to Nir Soffer from comment #5)
> (In reply to Evgheni Dereveanchin from comment #3)
> > Is it enough to manually clean tasks on the SPM using vdsClient?
> 
> Yes, but you should use vdsm-client, not vdsClient (depracated).

Is there a note when running vdsClient that it'll be deprecated?

> 
> The best way is to do:
> 
>     vdsm-client Host getAllTasks
> 
> And then for each task you want to clear:
> 
>     vdsm-client Task clear taskID=xxxyyy

Comment 7 Irit Goihman 2017-02-09 06:30:34 UTC
There will be a deprecation note for all modules using XML-RPC, and for vdsClient in particular.
Please see https://gerrit.ovirt.org/#/c/71811/

Comment 8 Lilach Zitnitski 2017-02-19 09:32:48 UTC
--------------------------------------
Tested with the following code:
----------------------------------------
rhevm-4.1.1.2-0.1.el7.noarch
vdsm-4.19.6-1.el7ev.x86_64

Tested with the following scenario:

Steps to Reproduce:
1. Create a VM with a lease
2. Try to switch the SPM to maintenance

Actual results:
The SPM task is cleared automatically, and once it is cleared the host can be switched to maintenance.

Expected results:

Moving to VERIFIED!

Comment 9 Simon 2017-03-14 11:27:20 UTC
I hit the same problem today. The host would not switch to maintenance mode due to uncleared tasks. After manually clearing them, the host entered maintenance mode.

OS Version:
RHEL - 7 - 3.1611.el7.centos

VDSM Version:
vdsm-4.19.4-1.el7.centos

Comment 10 Nir Soffer 2017-03-14 17:55:19 UTC
(In reply to Simon from comment #9)
> I hit same problem today ... vdsm-4.19.4-1.el7.centos

This was fixed in vdsm-4.19.6-1.el7ev.x86_64.

Until this version is available, you should clear the task manually.

