Bug 1902127 - Sanlock exception during Lease.delete leaves the lease UPDATING_FLAG set (updating state), failing the next Lease.create, and the engine updates the VM lease incorrectly.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.4.3
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.4.6
Target Release: ---
Assignee: Roman Bednář
QA Contact: Ilan Zuckerman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-27 01:57 UTC by Germano Veit Michel
Modified: 2024-03-25 17:16 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-01 13:21:11 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 5604631 0 None None None 2020-11-27 04:36:57 UTC
Red Hat Product Errata RHBA-2021:2178 0 None None None 2021-06-01 13:21:51 UTC
oVirt gerrit 113729 0 master MERGED tests: add coverage for new index flows 2021-04-23 07:36:23 UTC
oVirt gerrit 113761 0 master MERGED xlease: remove usage of updating flag during lease add/remove 2021-04-23 07:36:28 UTC
oVirt gerrit 114019 0 master MERGED xlease: redesign index rebuilding 2021-04-23 07:36:30 UTC
oVirt gerrit 114546 0 master MERGED xlease: add comment about trailing empty records in index 2021-05-12 16:44:27 UTC

Description Germano Veit Michel 2020-11-27 01:57:05 UTC
Description of problem:

There seems to be more than one problem here; the user experience is not great if a lease fails to be removed.
It's also tricky to reproduce; it probably needs a way to make sanlock time out at a specific I/O.

Basically:
* The engine updates the VM to add/remove the lease even if the lease creation/deletion failed on the SPM.
* VDSM leaves the lease in the 'updating' state if the remove fails due to a sanlock timeout, and doesn't rebuild it
gracefully on the next add attempt (as the engine thinks it was removed)

The VM was running with a lease. Now see this:

1. Removing the lease from storage fails

2020-11-22 09:54:21,856+0100 INFO  (tasks/1) [storage.xlease] Removing lease 'bd7249c4-c24d-49d0-897f-6dbb0d4da9ab' in lockspace u'0f623c51-66c8-4a5b-a12f-95ab9c6184b4' (xlease:619)

2020-11-22 09:54:31,876+0100 ERROR (tasks/1) [storage.TaskManager.Task] (Task='627ef29e-77ad-447f-8f5e-5bc347f30617') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 2083, in delete_lease
    dom.delete_lease(lease.lease_id)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 1365, in delete_lease
    vol.remove(lease_id)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/xlease.py", line 636, in remove
    sector=self._block_size)
SanlockException: (-202, 'Sanlock resource write failure', 'IO timeout')

The exception was raised here on line 636, when asking sanlock to write an empty lease.
But note that line 625 sets the record to 'updating', and that is left as-is when
sanlock raises the exception: the empty record is not written, and 'updating' is never
set back to False (which would arguably be the correct thing to do). A sketch of one
possible mitigation follows the excerpt below.


   609	    def remove(self, lease_id):
   610	        """
   611	        Remove lease from index
   612	
   613	        Raises:
   614	        - NoSuchLease if lease was not found
   615	        - OSError if I/O operation failed
   616	        - sanlock.SanlockException if sanlock operation failed.
   617	        """
   618	        log.info("Removing lease %r in lockspace %r",
   619	                 lease_id, self.lockspace)
   620	        recnum = self._index.find_record(lease_id)
   621	        if recnum == -1:
   622	            raise se.NoSuchLease(lease_id)
   623	
   624	        offset = lease_offset(recnum, self._alignment)
   625	        record = Record(lease_id, offset, updating=True)
   626	        self._write_record(recnum, record)
   627	
   628	        # There is no way to remove a resource, so we write an invalid resource
   629	        # with empty resource and lockspace values.
   630	        # TODO: Use SANLK_WRITE_CLEAR, expected in rhel 7.4.
   631	        sanlock.write_resource(
   632	            b"",
   633	            b"",
   634	            [(self._file.name, offset)],
   635	            align=self._alignment,
   636	            sector=self._block_size)
   637	
   638	        self._write_record(recnum, EMPTY_RECORD)
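(For context: per the linked gerrit 113761 summary, the merged fix removes the usage of the updating flag during lease add/remove entirely. Purely as an illustration, a minimal sketch of a local mitigation that keeps the flag but rolls it back when sanlock fails could look like the following; it reuses the helpers visible in the excerpt above and is an assumption, not vdsm's actual code.)

    def remove(self, lease_id):
        # Sketch only - not the merged fix. Record, lease_offset, EMPTY_RECORD,
        # self._index and self._write_record are the helpers shown in the
        # excerpt above.
        recnum = self._index.find_record(lease_id)
        if recnum == -1:
            raise se.NoSuchLease(lease_id)

        offset = lease_offset(recnum, self._alignment)
        previous = self._index.read_record(recnum)
        self._write_record(recnum, Record(lease_id, offset, updating=True))

        try:
            # There is no way to remove a sanlock resource, so clear it by
            # writing an empty lockspace/resource pair (same approach as above).
            sanlock.write_resource(
                b"",
                b"",
                [(self._file.name, offset)],
                align=self._alignment,
                sector=self._block_size)
        except sanlock.SanlockException:
            # Roll back the index record so a later lookup() or add() does not
            # fail with LeaseUpdating.
            self._write_record(recnum, previous)
            raise

        self._write_record(recnum, EMPTY_RECORD)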

2. For the engine, the VM is now running without a lease (the hotunplug succeeded but deleting the lease failed). This does not sound good either.

3. Now, since from the engine's point of view the delete succeeded (the hot unplug did, but deleting the lease from storage did not), we add the lease again.
   First the engine requests Lease.info, and it fails right there because of the updating flag left behind above.

2020-11-24 10:27:37,082+0100 INFO  (jsonrpc/2) [vdsm.api] START lease_info(lease={u'sd_id': u'0f623c51-66c8-4a5b-a12f-95ab9c6184b4', u'lease_id': u'bd7249c4-c24d-49d0-897f-6dbb0d4da9ab'}) from=::ffff:10.160.86.155,60366, flow_id=03537d89-caf6-41cb-955d-6b5a363f700f, task_id=82e5b162-cf68-4f9b-9e55-58db5d944ec9 (api:48)

2020-11-24 10:27:37,095+0100 ERROR (jsonrpc/2) [storage.TaskManager.Task] (Task='82e5b162-cf68-4f9b-9e55-58db5d944ec9') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in lease_info
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 3721, in lease_info
    info = dom.lease_info(lease.lease_id)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 717, in lease_info
    return vol.lookup(lease_id)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/xlease.py", line 562, in lookup
    raise LeaseUpdating(lease_id)
LeaseUpdating: Lease bd7249c4-c24d-49d0-897f-6dbb0d4da9ab is updating

   545	    def lookup(self, lease_id):
   546	        """
   547	        Lookup lease by lease_id and return LeaseInfo if found.
   548	
   549	        Raises:
   550	        - NoSuchLease if lease is not found.
   551	        - InvalidRecord if corrupted lease record is found
   552	        - OSError if io operation failed
   553	        """
   554	        log.debug("Looking up lease %r in lockspace %r",
   555	                  lease_id, self.lockspace)
   556	        recnum = self._index.find_record(lease_id)
   557	        if recnum == -1:
   558	            raise se.NoSuchLease(lease_id)
   559	
   560	        record = self._index.read_record(recnum)
   561	        if record.updating:
   562	            raise LeaseUpdating(lease_id)                         <-----------------
   563	
   564	        offset = lease_offset(recnum, self._alignment)
   565	        return LeaseInfo(self.lockspace, lease_id, self._file.name, offset)


4. The engine sees that as a failure and stops SPM on that host (is that really necessary?)

2020-11-24 10:27:37,102+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1321) [03537d89-caf6-41cb-955d-6b5a363f700f] EVENT_ID: IRS_BROKER_COMMAND_FAILURE(10,803), VDSM command GetVmLeaseInfoVDS failed: Lease bd7249c4-c24d-49d0-897f-6dbb0d4da9ab is updating
2020-11-24 10:27:37,143+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (default task-1321) [03537d89-caf6-41cb-955d-6b5a363f700f] START, SpmStopVDSCommand(HostName = <removed>, SpmStopVDSCommandParameters:{hostId='f478ded5-3dad-4011-a72d-43bb67c5e50a', storagePoolId='7886cfde-d9d8-492e-af9b-909c76f9a669'}), log id: 2334a64

5. After the SPM failover, the engine goes ahead and creates a new lease:

2020-11-24 10:27:44,492+01 INFO  [org.ovirt.engine.core.bll.storage.lease.AddVmLeaseCommand] (default task-1321) [2293ee3a] Creating new VM 'bd7249c4-c24d-49d0-897f-6dbb0d4da9ab' lease, because the VM lease doesn't exists on storage domain '0f623c51-66c8-4a5b-a12f-95ab9c6184b4'

2020-11-24 10:27:44,509+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.AddVmLeaseVDSCommand] (default task-1321) [2293ee3a] START, AddVmLeaseVDSCommand( VmLeaseVDSParameters:{storagePoolId='7886cfde-d9d8-492e-af9b-909c76f9a669', ignoreFailoverLimit='false', leaseId='bd7249c4-c24d-49d0-897f-6dbb0d4da9ab', storageDomainId='0f623c51-66c8-4a5b-a12f-95ab9c6184b4'}), log id: 60a83180

6. It fails again: same problem, different function

2020-11-24 10:28:49,129+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-20) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM x01001167 command HSMGetAllTasksStatusesVDS failed: Lease bd7249c4-c24d-49d0-897f-6dbb0d4da9ab is updating

There is a TODO here in add() to prevent this from happening; a sketch of that idea follows the excerpt below:

   567	    def add(self, lease_id):
   568	        """
   569	        Add lease to index, returning LeaseInfo.
   570	
   571	        Raises:
   572	        - LeaseExists if lease already stored for lease_id
   573	        - InvalidRecord if corrupted lease record is found
   574	        - NoSpace if all slots are allocated
   575	        - OSError if I/O operation failed
   576	        - sanlock.SanlockException if sanlock operation failed.
   577	        """
   578	        log.info("Adding lease %r in lockspace %r",
   579	                 lease_id, self.lockspace)
   580	        recnum = self._index.find_record(lease_id)
   581	        if recnum != -1:
   582	            record = self._index.read_record(recnum)
   583	            if record.updating:
   584	                # TODO: rebuild this record instead of failing
   585	                raise LeaseUpdating(lease_id)                    <----------
   586	            else:
   587	                raise LeaseExists(lease_id)
   588	
   589	        recnum = self._index.find_free_record()
   590	        if recnum == -1:
   591	            raise NoSpace(lease_id)
   592	
   593	        offset = lease_offset(recnum, self._alignment)
   594	        record = Record(lease_id, offset, updating=True)
   595	        self._write_record(recnum, record)
   596	
   597	        sanlock.write_resource(
   598	            self.lockspace.encode("utf-8"),
   599	            lease_id.encode("utf-8"),
   600	            [(self._file.name, offset)],
   601	            align=self._alignment,
   602	            sector=self._block_size)
   603	
   604	        record = Record(lease_id, offset)
   605	        self._write_record(recnum, record)
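(The "Ignoring partially created lease in updating state" warning quoted in comment 5 below hints at how the fixed add flow behaves. A rough, hypothetical sketch of this TODO - reuse a record stuck in the updating state instead of raising - kept close to the excerpt above:)

    # Sketch only - not the merged implementation. A record left in the
    # updating state is treated as reusable instead of raising LeaseUpdating.
    recnum = self._index.find_record(lease_id)
    if recnum != -1:
        record = self._index.read_record(recnum)
        if record.updating:
            log.warning(
                "Ignoring partially created lease in updating state "
                "recnum=%s resource=%s offset=%s",
                recnum, lease_id, lease_offset(recnum, self._alignment))
            # Fall through and rewrite this slot instead of failing.
        else:
            raise LeaseExists(lease_id)
    else:
        recnum = self._index.find_free_record()
        if recnum == -1:
            raise NoSpace(lease_id)

(The rest of add() - writing the updating record, calling sanlock.write_resource() and then writing the final record - would stay as shown above.)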

7. But the user reports that the VM now shows the lease as set, and it can be removed again.

8. The next remove succeeds if there is no Sanlock exception.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Add a *fake* sanlock exception in VDSM code during remove

   609	    def remove(self, lease_id):
 ...
   631	        raise sanlock.SanlockException("some error")
   632	        sanlock.write_resource(
   633	            b"",
   634	            b"",
   635	            [(self._file.name, offset)],
   636	            align=self._alignment,
   637	            sector=self._block_size)
   638	
   639	        self._write_record(recnum, EMPTY_RECORD)

2. Create VM with lease, run it
3. Remove lease

2020-11-27 11:48:59,966+1000 ERROR (tasks/7) [storage.TaskManager.Task] (Task='4cbb496f-7546-4745-8c64-31fef208b61b') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 2013, in delete_lease
    dom.delete_lease(lease.lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 1431, in delete_lease
    vol.remove(lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/xlease.py", line 631, in remove
    raise sanlock.SanlockException("some error")
sanlock.SanlockException: some error

4. Lease is removed from VM in DB

5. Add lease again from engine

6. Info fails

2020-11-27 11:49:25,718+1000 ERROR (jsonrpc/2) [storage.TaskManager.Task] (Task='c0fd02f0-5da3-4161-90f2-0a1a540d8162') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-215>", line 2, in lease_info
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 3745, in lease_info
    info = dom.lease_info(lease.lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 735, in lease_info
    return vol.lookup(lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/xlease.py", line 562, in lookup
    raise LeaseUpdating(lease_id)
vdsm.storage.xlease.LeaseUpdating: Lease 3012f7fa-fad8-4aa0-bc04-5b2749f7f3fe is updating

7. Engine proceeds with create, fails too

2020-11-27 11:49:26,048+1000 ERROR (tasks/1) [storage.TaskManager.Task] (Task='adb5df89-069e-468d-85e1-944be48028a5') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1999, in create_lease
    dom.create_lease(lease.lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 1421, in create_lease
    vol.add(lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/xlease.py", line 585, in add
    raise LeaseUpdating(lease_id)
vdsm.storage.xlease.LeaseUpdating: Lease 3012f7fa-fad8-4aa0-bc04-5b2749f7f3fe is updating

8. According to the engine, the VM has a lease. It doesn't really.

Actual results:
* The engine's status for the VM lease is out of sync with the storage point of view
* The user needs to be persistent and to understand what they are doing in order to get out of this.

Expected results:
* Better experience for the user when this fails.

Comment 1 Germano Veit Michel 2020-11-27 04:33:28 UTC
(In reply to Germano Veit Michel from comment #0)
> Version-Release number of selected component (if applicable):
> 
> 

Oops. Customer is on 4.3 and I "reproduced it" (as above) on:
rhvm-4.4.2.6-0.2.el8ev.noarch
vdsm-4.40.26.3-1.el8ev.x86_64

Comment 2 Klaas Demter 2020-11-27 07:51:25 UTC
Customer:
rhvm-4.3.11.4-0.1.el7.noarch
vdsm-4.30.50-1.el7ev.x86_64

Comment 4 RHEL Program Management 2021-03-08 07:59:37 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 5 Roman Bednář 2021-03-09 13:56:23 UTC
Verification should consist of testing the lease add and remove flows within a mixed-version setup. In one DC there should be two hosts: host A with an old version of vdsm and host B with a vdsm version that contains the bug fix.

To create an external lease, edit the VM properties in the GUI and go to High Availability, check the "Highly Available" checkbox and, for "Target Storage Domain for VM lease", select a storage domain from the dropdown menu. To remove the lease, select "No VM Lease" from the dropdown menu.
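(A possible CLI angle: comment 8 below shows how to create and inspect a lease directly with "vdsm-client -f - Lease create" / "Lease info". A matching removal call is sketched here; the "Lease delete" verb is an assumption inferred from the Lease.delete API named in the bug summary, the UUIDs are placeholders, and leases manipulated this way bypass the engine database, so the GUI flow above remains the one to use for this verification.)

# cat <<EOF | vdsm-client -f - Lease delete
{
    "lease": {
        "sd_id": "<storage-domain-uuid>",
        "lease_id": "<vm-uuid>"
    }
}
EOF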

Verification steps:

1) Prepare host B to fake a sanlock failure, similarly to what's done in the reproducer in the initial comment. Place a fake exception right before sanlock writes a resource in xleases.py - LeasesVolume.remove(). Restart vdsmd after editing the file.

    650         # There is no way to remove a resource, so we write an invalid resource
    651         # with empty resource and lockspace values.
    652         # TODO: Use SANLK_WRITE_CLEAR, expected in rhel 7.4.
    653         raise sanlock.SanlockException("some error")
    654         try:
    655             sanlock.write_resource(
    656                 b"",
    657                 b"",


2) Continue with reproducing the issue on host A using the reproducer in the initial comment. Make sure host A is SPM and that you're creating a highly available VM (described above). The lease removal operation should leave the lease in the updating state, because the removal process sets the updating flag and then fails to write the sanlock resource; this is the old behavior. Attempting to add a lease will fail, and the SPM role will move to host B, where the task gets finished regardless of the presence of the updating flag; this is the new behavior. See the vdsm log for a new warning (on host B) about a partially created lease:

2021-03-09 08:20:30,029-0500 INFO  (tasks/3) [storage.xlease] Adding lease 'd03e63ed-1bd6-46d2-a8e6-337ec10b64c7' in lockspace '40b0c8e1-e646-4e56-8420-827252743f93' (xlease:585)
2021-03-09 08:20:30,029-0500 WARN  (tasks/3) [storage.xlease] Ignoring partially created lease in updating state recnum=5 resource=d03e63ed-1bd6-46d2-a8e6-337ec10b64c7 offset=8388608 (xlease:594)
2021-03-09 08:20:30,095-0500 INFO  (jsonrpc/7) [vdsm.api] FINISH getStoragePoolInfo return={'info': {'domains': '40b0c8e1-e646-4e56-8420-827252743f93:Active', 'isoprefix': '', 'lver': 54, 'maste

3) Verify that the SPM role moved to host B and that the VM now has a lease. To do so, check sanlock status on the host running the VM; you should see an xleases entry there:

# sanlock status | grep xleases
r 40b0c8e1-e646-4e56-8420-827252743f93:d03e63ed-1bd6-46d2-a8e6-337ec10b64c7:/dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases:8388608:1 p 9009

4) Using the lease ID, verify that lookup works as well:

# echo '{"lease":{"sd_id":"40b0c8e1-e646-4e56-8420-827252743f93","lease_id":"d03e63ed-1bd6-46d2-a8e6-337ec10b64c7"}}' | vdsm-client -f - Lease info
{
    "lease_id": "d03e63ed-1bd6-46d2-a8e6-337ec10b64c7",
    "offset": 8388608,
    "path": "/dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases",
    "sd_id": "40b0c8e1-e646-4e56-8420-827252743f93"
}


5) Remove the lease using the engine GUI while host B is still SPM; this tests the new behavior of lease removal.

6) Verify that the lease is now gone (use sanlock status like in step 3).
 
7) Add the lease again; this tests adding a lease with the fixed version (make sure host B is still SPM while doing so). The GUI is now consistent with the lease info in vdsm (in contrast to step 8 of the reproducer in the initial comment, where an inconsistency occurred).

Comment 6 Ilan Zuckerman 2021-04-08 09:41:05 UTC
Hi Roman, could you please specify the file paths which need to be modified?
If possible, specify what exactly needs to be modified.
I have searched in those files [1] on rhv-release-4.4.5-11 with vdsm-4.40.50.10-1 but couldn't find the right place.

Also, could you please specify which 'old' release version the 'old' host should have, and which release version should be installed on the upgraded host.
Should the engine release version be aligned with the 'old' host or the 'new' one?

The reproduction steps are now split into two separate comments (yours from comment #5 and the initial description by Germano).
Could you please align them into one flow that would be easier to follow? I find this a bit confusing to verify.

[1]:
/usr/lib/python3.6/site-packages/vdsm/tool/xleases.py
/usr/lib/python3.6/site-packages/vdsm/vdsmd.py

Comment 7 Roman Bednář 2021-04-12 08:40:29 UTC
(In reply to Ilan Zuckerman from comment #6)
> Hi Roman, could you please specify the file paths which need to be modified?
> If possible, specify what exactly needs to be modified.
> I have searched in those files [1] on rhv-release-4.4.5-11 with
> vdsm-4.40.50.10-1 but couldn't find the right place.

Sorry, I wrote xleases.py while it should be xlease.py. Path should be: "/usr/lib/python3.6/site-packages/vdsm/storage/xlease.py"

The modification is just to add a line raising an exception before the sanlock write, nothing else. It's the same thing as shown in the initial comment or in Comment 5, step 1, line 653.

> 
> Also, could you please specify which 'old' release version the 'old' host
> should have, and which release version should be installed on the upgraded
> host.

"Old version" just means any version before this patch was included in the vdsm package; to be more specific, you can use the one for which this bug was reproduced as the "old" one - Comment 1 - vdsm-4.40.26.3-1.el8ev.x86_64

> Should the engine release version be aligned with the 'old' host or the 'new'
> one?

Good question; I don't think it matters, though, for the sake of verification. The patch affects only vdsm, not the engine. However, I'm not sure whether we have any recommended procedure for updating an oVirt cluster, but I think you should end up in a situation where the engine version is aligned with the "new" version on one host, plus one host with the "old" version which is not updated yet. In other words, the engine gets updated first and then the hosts one by one. Any failover between hosts at a point where there are mixed versions in the cluster must not fail.

> 
> The reproduction steps are now split into two separate comments (yours
> from comment #5 and the initial description by Germano).
> Could you please align them into one flow that would be easier to follow?
> I find this a bit confusing to verify.

The initial description does not consider the mixed-version scenario and lacks some details IMHO; that's the reason I rewrote it to keep it simple. I'd suggest you follow the procedure in Comment 5; there is no need to combine it with the initial one.

Comment 8 Roman Bednář 2021-04-12 09:52:04 UTC
Due to the nature of the patch, the feature for rebuilding the index had to be redesigned as well. Sanlock resources used to be the source of truth; now it is the index of the xleases volume.

There are several cases which can occur and should be verified. To see all of them, check the docstring of rebuild_index() in xlease.py.

The verification procedure shown below describes only one case as an example, in which we recover from a missing/deleted sanlock resource using the index (now the source of truth).
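A rough sketch of that recovery idea - walk the index (the new source of truth) and rewrite the sanlock resource for every used record - is shown below. It is an illustration only, not the merged code (see gerrit 114019): sanlock.write_resource() and the lockspace/offset handling mirror the excerpts in the initial comment, while the record-iteration API is assumed.

    import sanlock

    def rebuild_from_index(lockspace, path, records, alignment, block_size):
        """Recreate the sanlock resource for every lease recorded in the index.

        'records' is assumed to be an iterable of index records exposing
        .resource (the lease id, empty for free slots) and .offset.
        """
        for record in records:
            if not record.resource:
                # Free slot in the index - nothing to recreate.
                continue
            sanlock.write_resource(
                lockspace.encode("utf-8"),
                record.resource.encode("utf-8"),
                [(path, record.offset)],
                align=alignment,
                sector=block_size)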


1) create a lease:

# cat <<EOF | vdsm-client -f - Lease create
{
    "lease": {
        "sd_id": "40b0c8e1-e646-4e56-8420-827252743f93",
        "lease_id": "b3f6fa00-b315-4ad4-8108-f73da817b5c5"
    }
}
EOF


2) find lease offset:

# cat <<EOF | vdsm-client -f - Lease info
{
    "lease": {
        "sd_id": "40b0c8e1-e646-4e56-8420-827252743f93",
        "lease_id": "b3f6fa00-b315-4ad4-8108-f73da817b5c5"
    }
}
EOF

{
    "lease_id": "b3f6fa00-b315-4ad4-8108-f73da817b5c5",
    "offset": 10485760,
    "path": "/dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases",
    "sd_id": "40b0c8e1-e646-4e56-8420-827252743f93"
}

3) check sanlock resource created successfully:

# sanlock direct dump /dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases:10485760
  offset                            lockspace                                         resource  timestamp  own  gen lver
10485760 40b0c8e1-e646-4e56-8420-827252743f93             b3f6fa00-b315-4ad4-8108-f73da817b5c5 0000000000 0000 0000 0
11534336                                                                                       0000000000 0000 0000 0

4) The lease is now present in the index of the xleases logical volume; now delete the sanlock resource directly from storage:

# sanlock direct format -x "40b0c8e1-e646-4e56-8420-827252743f93:/dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases:10485760"
format done 0

(the value of -x argument is: lockspace_name:path:offset - see "man 8 sanlock" and look for RINDEX for more details)

5) check sanlock resource is gone:

# sanlock direct dump /dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases:10485760
  offset                            lockspace                                         resource  timestamp  own  gen lver
10485760 40b0c8e1-e646-4e56-8420-827252743f93 rindex_header 0x1 512 4096 10485760
11534336 40b0c8e1-e646-4e56-8420-827252743f93

6) Check that the lease can still be found; this is because we still have a record in the index:

 # cat <<EOF | vdsm-client -f - Lease info
{
    "lease": {
        "sd_id": "40b0c8e1-e646-4e56-8420-827252743f93",
        "lease_id": "b3f6fa00-b315-4ad4-8108-f73da817b5c5"
    }
}
EOF

{
    "lease_id": "b3f6fa00-b315-4ad4-8108-f73da817b5c5",
    "offset": 10485760,
    "path": "/dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases",
    "sd_id": "40b0c8e1-e646-4e56-8420-827252743f93"
}

7) OPTIONAL: enable debug logging for storage and check vdsm.log after running the rebuild to see the detailed flow:

# vdsm-client Host setLogLevel level=DEBUG name=storage
true

8) We can now rebuild the leases, which recreates the missing sanlock resource from the index record:

# vdsm-client Lease rebuild_leases sd_id=40b0c8e1-e646-4e56-8420-827252743f93

9) Check sanlock - the resource should be present again:

# sanlock direct dump /dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases:10485760
  offset                            lockspace                                         resource  timestamp  own  gen lver
10485760 40b0c8e1-e646-4e56-8420-827252743f93             b3f6fa00-b315-4ad4-8108-f73da817b5c5 0000000000 0000 0000 0
11534336                                                                                       0000000000 0000 0000 0

Comment 13 Ilan Zuckerman 2021-05-09 08:35:16 UTC
I wasn't able to get the expected results by following the steps from comment #5.

I used an engine with Host A (older vdsm) and Host B (newer vdsm):

engine:
rhv-release-4.4.5-11

Host A:
vdsm-4.40.38-1.el8ev.x86_64

Host B:
vdsm-4.40.60.6-1.el8ev.x86_64

I did exactly the same manipulations, but at step 2) I wasn't able to get the same kind of behavior.
As soon as I removed the lease, the SPM role didn't switch to Host B, and no warning was seen in the vdsm log as described.
I also tried to move the HA VM between the hosts (as I wasn't sure on which host it should run at first), but either way I got the same result.

Comment 16 Ilan Zuckerman 2021-05-10 15:26:27 UTC
Verified on the following setup, according to the steps below. Big thanks to Roman for assisting with this!

engine:
rhv-release-4.4.5-11

Host A:
vdsm-4.40.38-1.el8ev.x86_64

Host B:
vdsm-4.40.60.6-1.el8ev.x86_64


1) Select Host A as Storage Pool Manager and migrate VM to Host A

2) Create external lease:

Edit Virtual Machine > High Availability > Target Storage Domain for VM Lease == <SD_ID>

3) Host A - xlease added successfully:

[root@storage-ge15-vdsm1 ~]# sanlock status | grep xleases
r bfe05dca-0832-44d7-93c2-eb816e05a15e:3df73ff2-5a63-4b2e-a76f-ee7fdbf003d7:/dev/bfe05dca-0832-44d7-93c2-eb816e05a15e/xleases:3145728:1 p 956752

4) Host A - insert the fake sanlock error:

[root@storage-ge15-vdsm1 ~]# grep -nC 5 "some error" /usr/lib/python3.6/site-packages/vdsm/storage/xlease.py
625-        self._write_record(recnum, record)
626-
627-        # There is no way to remove a resource, so we write an invalid resource
628-        # with empty resource and lockspace values.
629-        # TODO: Use SANLK_WRITE_CLEAR, expected in rhel 7.4.
630:        raise sanlock.SanlockException("some error")
631-        sanlock.write_resource(
632-            b"",
633-            b"",
634-            [(self._file.name, offset)],
635-            align=self._alignment,

5) Host A - restart vdsmd

# systemctl restart vdsmd

6) oVirt GUI - switch the SPM role back to Host A

7) Host A - fails to remove the lease

Edit Virtual Machine > High Availability > Target Storage Domain for VM Lease == No VM Lease


2021-05-10 13:52:59,680+0300 INFO  (tasks/3) [storage.xlease] Removing lease '3df73ff2-5a63-4b2e-a76f-ee7fdbf003d7' in lockspace 'bfe05dca-0832-44d7-93c2-eb816e05a15e' (xlease:618)
2021-05-10 13:52:59,681+0300 DEBUG (tasks/3) [storage.xlease] Closing index for lockspace 'bfe05dca-0832-44d7-93c2-eb816e05a15e' (xlease:671)
2021-05-10 13:52:59,681+0300 ERROR (tasks/3) [storage.TaskManager.Task] (Task='66b93ad2-4ae2-4570-83a1-80c89a78b34b') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 2036, in delete_lease
    dom.delete_lease(lease.lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 1454, in delete_lease
    vol.remove(lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/xlease.py", line 630, in remove
    raise sanlock.SanlockException("some error")
sanlock.SanlockException: some error
2021-05-10 13:52:59,681+0300 DEBUG (tasks/3) [storage.TaskManager.Task] (Task='66b93ad2-4ae2-4570-83a1-80c89a78b34b') Task._run: 66b93ad2-4ae2-4570-83a1-80c89a78b34b () {} failed - stopping task (task:899)
2021-05-10 13:52:59,681+0300 DEBUG (tasks/3) [storage.TaskManager.Task] (Task='66b93ad2-4ae2-4570-83a1-80c89a78b34b') stopping in state aborting (force False) (task:1265)
2021-05-10 13:52:59,681+0300 DEBUG (tasks/3) [storage.TaskManager.Task] (Task='66b93ad2-4ae2-4570-83a1-80c89a78b34b') ref 1 aborting True (task:1008)
2021-05-10 13:52:59,681+0300 DEBUG (tasks/3) [storage.TaskManager.Task] (Task='66b93ad2-4ae2-4570-83a1-80c89a78b34b') aborting: Task is aborted: 'value=some error abortedcode=100' (task:935)


8) Host A - attempt to add the lease again:

2021-05-10 15:53:24,566+0300 INFO  (tasks/2) [storage.xlease] Adding lease '3df73ff2-5a63-4b2e-a76f-ee7fdbf003d7' in lockspace 'bfe05dca-0832-44d7-93c2-eb816e05a15e' (xlease:579)
2021-05-10 15:53:24,566+0300 DEBUG (tasks/2) [storage.xlease] Closing index for lockspace 'bfe05dca-0832-44d7-93c2-eb816e05a15e' (xlease:671)
2021-05-10 15:53:24,566+0300 ERROR (tasks/2) [storage.TaskManager.Task] (Task='b3adf7e0-7221-40be-80ef-40df80064a40') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 2022, in create_lease
    dom.create_lease(lease.lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 1444, in create_lease
    vol.add(lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/xlease.py", line 585, in add
    raise LeaseUpdating(lease_id)
vdsm.storage.xlease.LeaseUpdating: Lease 3df73ff2-5a63-4b2e-a76f-ee7fdbf003d7 is updating

9) In the oVirt GUI it looks like the VM still has a lease, but there isn't one; this is the behavior before the fix.

10) Select Host B as the Storage Pool Manager and migrate the VM to Host B

11) Remove the lease:

Edit Virtual Machine > High Availability > Target Storage Domain for VM Lease == No VM Lease

12) Add the lease again and check its presence on Host B:

[root@storage-ge15-vdsm2 ~]# sanlock status | grep xleases
r bfe05dca-0832-44d7-93c2-eb816e05a15e:3df73ff2-5a63-4b2e-a76f-ee7fdbf003d7:/dev/bfe05dca-0832-44d7-93c2-eb816e05a15e/xleases:3145728:1 p 161153

Comment 20 errata-xmlrpc 2021-06-01 13:21:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) [ovirt-4.4.6]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2178

Comment 21 meital avital 2022-08-03 07:30:21 UTC
Due to QE capacity, we are not going to cover this issue in our automation

