Description of problem:

There seems to be more than one problem here; the user experience is not great when a lease fails to be removed. It is also tricky to reproduce, as it probably needs a way to make sanlock time out at a specific I/O. In short:

* The engine updates the VM to add/remove the lease even if the lease creation/deletion failed on the SPM.
* VDSM leaves the lease in 'updating' state if the removal fails due to a sanlock timeout, and does not rebuild it gracefully on the next add attempt (while the engine thinks it was removed).

The VM was running with a lease. Now see this:

1. Removing the lease from storage fails:

2020-11-22 09:54:21,856+0100 INFO (tasks/1) [storage.xlease] Removing lease 'bd7249c4-c24d-49d0-897f-6dbb0d4da9ab' in lockspace u'0f623c51-66c8-4a5b-a12f-95ab9c6184b4' (xlease:619)
2020-11-22 09:54:31,876+0100 ERROR (tasks/1) [storage.TaskManager.Task] (Task='627ef29e-77ad-447f-8f5e-5bc347f30617') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 2083, in delete_lease
    dom.delete_lease(lease.lease_id)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 1365, in delete_lease
    vol.remove(lease_id)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/xlease.py", line 636, in remove
    sector=self._block_size)
SanlockException: (-202, 'Sanlock resource write failure', 'IO timeout')

The exception was raised here on line 636, when asking sanlock to write an empty lease.
But note that remove() sets the record to 'updating' on line 625, and that state is left as is when sanlock raises the exception: the empty record is not written, and 'updating' is not set back to False (which would arguably be the correct recovery?).

609     def remove(self, lease_id):
610         """
611         Remove lease from index
612
613         Raises:
614         - NoSuchLease if lease was not found
615         - OSError if I/O operation failed
616         - sanlock.SanlockException if sanlock operation failed.
617         """
618         log.info("Removing lease %r in lockspace %r",
619                  lease_id, self.lockspace)
620         recnum = self._index.find_record(lease_id)
621         if recnum == -1:
622             raise se.NoSuchLease(lease_id)
623
624         offset = lease_offset(recnum, self._alignment)
625         record = Record(lease_id, offset, updating=True)
626         self._write_record(recnum, record)
627
628         # There is no way to remove a resource, so we write an invalid resource
629         # with empty resource and lockspace values.
630         # TODO: Use SANLK_WRITE_CLEAR, expected in rhel 7.4.
631         sanlock.write_resource(
632             b"",
633             b"",
634             [(self._file.name, offset)],
635             align=self._alignment,
636             sector=self._block_size)
637
638         self._write_record(recnum, EMPTY_RECORD)

2. From the engine's point of view the VM is now running without a lease (the hot unplug succeeded but the lease deletion failed). This does not sound good either.

3. Now, since in the engine's eyes the delete succeeded (the hot unplug did, but the deletion from storage did not), we add the lease again. First the engine requests Lease.info, and it fails right there because of the 'updating' flag left above.
2020-11-24 10:27:37,082+0100 INFO (jsonrpc/2) [vdsm.api] START lease_info(lease={u'sd_id': u'0f623c51-66c8-4a5b-a12f-95ab9c6184b4', u'lease_id': u'bd7249c4-c24d-49d0-897f-6dbb0d4da9ab'}) from=::ffff:10.160.86.155,60366, flow_id=03537d89-caf6-41cb-955d-6b5a363f700f, task_id=82e5b162-cf68-4f9b-9e55-58db5d944ec9 (api:48)
2020-11-24 10:27:37,095+0100 ERROR (jsonrpc/2) [storage.TaskManager.Task] (Task='82e5b162-cf68-4f9b-9e55-58db5d944ec9') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in lease_info
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 3721, in lease_info
    info = dom.lease_info(lease.lease_id)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 717, in lease_info
    return vol.lookup(lease_id)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/xlease.py", line 562, in lookup
    raise LeaseUpdating(lease_id)
LeaseUpdating: Lease bd7249c4-c24d-49d0-897f-6dbb0d4da9ab is updating

545     def lookup(self, lease_id):
546         """
547         Lookup lease by lease_id and return LeaseInfo if found.
548
549         Raises:
550         - NoSuchLease if lease is not found.
551         - InvalidRecord if corrupted lease record is found
552         - OSError if io operation failed
553         """
554         log.debug("Looking up lease %r in lockspace %r",
555                   lease_id, self.lockspace)
556         recnum = self._index.find_record(lease_id)
557         if recnum == -1:
558             raise se.NoSuchLease(lease_id)
559
560         record = self._index.read_record(recnum)
561         if record.updating:
562             raise LeaseUpdating(lease_id)    <-----------------
563
564         offset = lease_offset(recnum, self._alignment)
565         return LeaseInfo(self.lockspace, lease_id, self._file.name, offset)

4. The engine sees that as a failure and stops the SPM on that host (is that really necessary?)
2020-11-24 10:27:37,102+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1321) [03537d89-caf6-41cb-955d-6b5a363f700f] EVENT_ID: IRS_BROKER_COMMAND_FAILURE(10,803), VDSM command GetVmLeaseInfoVDS failed: Lease bd7249c4-c24d-49d0-897f-6dbb0d4da9ab is updating
2020-11-24 10:27:37,143+01 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (default task-1321) [03537d89-caf6-41cb-955d-6b5a363f700f] START, SpmStopVDSCommand(HostName = <removed>, SpmStopVDSCommandParameters:{hostId='f478ded5-3dad-4011-a72d-43bb67c5e50a', storagePoolId='7886cfde-d9d8-492e-af9b-909c76f9a669'}), log id: 2334a64

5. After the SPM failover the engine goes ahead with creating a new lease:

2020-11-24 10:27:44,492+01 INFO [org.ovirt.engine.core.bll.storage.lease.AddVmLeaseCommand] (default task-1321) [2293ee3a] Creating new VM 'bd7249c4-c24d-49d0-897f-6dbb0d4da9ab' lease, because the VM lease doesn't exists on storage domain '0f623c51-66c8-4a5b-a12f-95ab9c6184b4'
2020-11-24 10:27:44,509+01 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.AddVmLeaseVDSCommand] (default task-1321) [2293ee3a] START, AddVmLeaseVDSCommand( VmLeaseVDSParameters:{storagePoolId='7886cfde-d9d8-492e-af9b-909c76f9a669', ignoreFailoverLimit='false', leaseId='bd7249c4-c24d-49d0-897f-6dbb0d4da9ab', storageDomainId='0f623c51-66c8-4a5b-a12f-95ab9c6184b4'}), log id: 60a83180

6. Fails again, same problem, different function:

2020-11-24 10:28:49,129+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-20) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM x01001167 command HSMGetAllTasksStatusesVDS failed: Lease bd7249c4-c24d-49d0-897f-6dbb0d4da9ab is updating

There is already a TODO in add() about preventing this from happening:

567     def add(self, lease_id):
568         """
569         Add lease to index, returning LeaseInfo.
570
571         Raises:
572         - LeaseExists if lease already stored for lease_id
573         - InvalidRecord if corrupted lease record is found
574         - NoSpace if all slots are allocated
575         - OSError if I/O operation failed
576         - sanlock.SanlockException if sanlock operation failed.
577         """
578         log.info("Adding lease %r in lockspace %r",
579                  lease_id, self.lockspace)
580         recnum = self._index.find_record(lease_id)
581         if recnum != -1:
582             record = self._index.read_record(recnum)
583             if record.updating:
584                 # TODO: rebuild this record instead of failing
585                 raise LeaseUpdating(lease_id)    <----------
586             else:
587                 raise LeaseExists(lease_id)
588
589         recnum = self._index.find_free_record()
590         if recnum == -1:
591             raise NoSpace(lease_id)
592
593         offset = lease_offset(recnum, self._alignment)
594         record = Record(lease_id, offset, updating=True)
595         self._write_record(recnum, record)
596
597         sanlock.write_resource(
598             self.lockspace.encode("utf-8"),
599             lease_id.encode("utf-8"),
600             [(self._file.name, offset)],
601             align=self._alignment,
602             sector=self._block_size)
603
604         record = Record(lease_id, offset)
605         self._write_record(recnum, record)

7. But the user reports the VM now has the lease set, and it can be removed again.

8. The next remove succeeds if there is no sanlock exception.

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Add a *fake* sanlock exception in the VDSM code during remove:

609     def remove(self, lease_id):
...
631         raise sanlock.SanlockException("some error")
632         sanlock.write_resource(
633             b"",
634             b"",
635             [(self._file.name, offset)],
636             align=self._alignment,
637             sector=self._block_size)
638
639         self._write_record(recnum, EMPTY_RECORD)

2. Create a VM with a lease and run it.
3.
Remove the lease; the fake exception fires:

2020-11-27 11:48:59,966+1000 ERROR (tasks/7) [storage.TaskManager.Task] (Task='4cbb496f-7546-4745-8c64-31fef208b61b') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 2013, in delete_lease
    dom.delete_lease(lease.lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 1431, in delete_lease
    vol.remove(lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/xlease.py", line 631, in remove
    raise sanlock.SanlockException("some error")
sanlock.SanlockException: some error

4. The lease is removed from the VM in the DB.
5. Add the lease again from the engine.
6. Lease.info fails:

2020-11-27 11:49:25,718+1000 ERROR (jsonrpc/2) [storage.TaskManager.Task] (Task='c0fd02f0-5da3-4161-90f2-0a1a540d8162') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-215>", line 2, in lease_info
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 3745, in lease_info
    info = dom.lease_info(lease.lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 735, in lease_info
    return vol.lookup(lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/xlease.py", line 562, in lookup
    raise LeaseUpdating(lease_id)
vdsm.storage.xlease.LeaseUpdating: Lease 3012f7fa-fad8-4aa0-bc04-5b2749f7f3fe is updating

7.
The engine proceeds with the create, which fails too:

2020-11-27 11:49:26,048+1000 ERROR (tasks/1) [storage.TaskManager.Task] (Task='adb5df89-069e-468d-85e1-944be48028a5') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1999, in create_lease
    dom.create_lease(lease.lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 1421, in create_lease
    vol.add(lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/xlease.py", line 585, in add
    raise LeaseUpdating(lease_id)
vdsm.storage.xlease.LeaseUpdating: Lease 3012f7fa-fad8-4aa0-bc04-5b2749f7f3fe is updating

8. According to the engine the VM has a lease. It doesn't really.

Actual results:
* The engine's status for the VM lease is out of sync with the storage point of view.
* The user needs to persist, and to understand what is going on, to be able to get out of this state.

Expected results:
* A better experience for the user when this fails.
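The 'updating' flag mechanics above can be boiled down to a few lines. Below is a minimal, self-contained sketch (not the actual VDSM code; Record, LeaseIndex and FakeSanlockError are illustrative stand-ins) showing why a sanlock failure strands the record, and one possible mitigation: rolling the record back to updating=False when the sanlock write fails, which is the recovery questioned in the description above.

```python
# Sketch of the remove() failure mode, with an illustrative rollback.
# All names here are stand-ins, not VDSM's real classes.
from collections import namedtuple

Record = namedtuple("Record", "lease_id offset updating")

class FakeSanlockError(Exception):
    """Stands in for sanlock.SanlockException in this sketch."""

class LeaseIndex:
    def __init__(self):
        self._records = {}  # recnum -> Record

    def add(self, recnum, lease_id, offset):
        self._records[recnum] = Record(lease_id, offset, updating=False)

    def remove(self, recnum, clear_resource):
        record = self._records[recnum]
        # Mark the record "updating" before touching storage, as VDSM does.
        self._records[recnum] = record._replace(updating=True)
        try:
            # Stands in for sanlock.write_resource(b"", b"", ...).
            clear_resource()
        except FakeSanlockError:
            # Without this rollback the record stays "updating" forever,
            # which is exactly the stuck state described in this bug.
            self._records[recnum] = record._replace(updating=False)
            raise
        del self._records[recnum]

    def is_updating(self, recnum):
        return self._records[recnum].updating
```

The fix actually taken went further than a rollback: newer VDSM tolerates records left in the updating state (see the "Ignoring partially created lease in updating state" warning quoted later in this bug).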
(In reply to Germano Veit Michel from comment #0)
> Version-Release number of selected component (if applicable):
>

Oops. Customer is on 4.3 and I "reproduced it" (as above) on:

rhvm-4.4.2.6-0.2.el8ev.noarch
vdsm-4.40.26.3-1.el8ev.x86_64
Customer:

rhvm-4.3.11.4-0.1.el7.noarch
vdsm-4.30.50-1.el7ev.x86_64
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.
Verification should consist of testing the lease add and remove flows in a mixed-version setup. In one DC there should be two hosts: host A with an old version of vdsm, and host B with a vdsm version that contains the bug fix.

To create an external lease, edit the VM properties in the GUI, go to High Availability, check the "Highly Available" checkbox, and for "Target Storage Domain for VM lease" select a storage domain from the dropdown menu. To remove the lease, select "No VM Lease" from the dropdown menu.

Verification steps:

1) Prepare host B to fake a sanlock failure, similarly to what's done in the reproducer in the initial comment. Place a fake exception right before sanlock writes a resource in xleases.py - LeasesVolume.remove(). Restart vdsmd after editing the file.

650         # There is no way to remove a resource, so we write an invalid resource
651         # with empty resource and lockspace values.
652         # TODO: Use SANLK_WRITE_CLEAR, expected in rhel 7.4.
653         raise sanlock.SanlockException("some error")
654         try:
655             sanlock.write_resource(
656                 b"",
657                 b"",

2) Continue with reproducing the issue on host A using the reproducer in the initial comment. Make sure host A is SPM and that you're creating a highly available VM (described above). The lease removal operation should leave the lease in updating state, because the removal process sets the updating flag and then fails to write the sanlock resource; this is the old behavior. Attempting to add a lease will then fail, and the SPM role will move to host B, where the task gets finished regardless of the presence of the updating flag; this is the new behavior.
See the vdsm log (host B) for a new warning about a partially created lease:

2021-03-09 08:20:30,029-0500 INFO (tasks/3) [storage.xlease] Adding lease 'd03e63ed-1bd6-46d2-a8e6-337ec10b64c7' in lockspace '40b0c8e1-e646-4e56-8420-827252743f93' (xlease:585)
2021-03-09 08:20:30,029-0500 WARN (tasks/3) [storage.xlease] Ignoring partially created lease in updating state recnum=5 resource=d03e63ed-1bd6-46d2-a8e6-337ec10b64c7 offset=8388608 (xlease:594)
2021-03-09 08:20:30,095-0500 INFO (jsonrpc/7) [vdsm.api] FINISH getStoragePoolInfo return={'info': {'domains': '40b0c8e1-e646-4e56-8420-827252743f93:Active', 'isoprefix': '', 'lver': 54, 'maste

3) Verify that the SPM role moved to host B and that the VM now has a lease. To do so, check sanlock status on the host running the VM; you should see an xleases entry there:

# sanlock status | grep xleases
r 40b0c8e1-e646-4e56-8420-827252743f93:d03e63ed-1bd6-46d2-a8e6-337ec10b64c7:/dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases:8388608:1 p 9009

4) Using the lease id, verify that lookup works as well:

# echo '{"lease":{"sd_id":"40b0c8e1-e646-4e56-8420-827252743f93","lease_id":"d03e63ed-1bd6-46d2-a8e6-337ec10b64c7"}}' | vdsm-client -f - Lease info
{
    "lease_id": "d03e63ed-1bd6-46d2-a8e6-337ec10b64c7",
    "offset": 8388608,
    "path": "/dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases",
    "sd_id": "40b0c8e1-e646-4e56-8420-827252743f93"
}

5) Remove the lease using the engine GUI while host B is still SPM; this tests the new behavior of lease removal.

6) Verify that the lease is now gone (use sanlock status as in step 3).

7) Add the lease again; this tests adding a lease with the fixed version (make sure host B is still SPM while doing so). The GUI is now consistent with the lease info in vdsm (in contrast to step 8 of the reproducer in the initial comment, where an inconsistency occurred).
Hi Roman, could you please specify the file paths which need to be modified? If possible, specify what exactly needs to be modified. I have searched in those files [1] on rhv-release-4.4.5-11 with vdsm-4.40.50.10-1 but couldn't find the right place.

Also, could you please specify which 'old' release version the 'old' host should have, and which release version should be installed on the upgraded host? Should the engine release version be aligned with the 'old' host or the 'new' one?

The reproduction steps are now broken into two separate comments (yours from comment #5 and the initial description by Germano). Could you please align them into one flow which would be easier to follow? I find this a bit confusing to verify.

[1]:
/usr/lib/python3.6/site-packages/vdsm/tool/xleases.py
/usr/lib/python3.6/site-packages/vdsm/vdsmd.py
(In reply to Ilan Zuckerman from comment #6)
> Hi Roman, could you please specify the file paths which needs to be modified?
> If possible, specify what exactly needs to be modified.
> I have searched in those files [1] on rhv-release-4.4.5-11 with
> vdsm-4.40.50.10-1 but couldnt find the right place.

Sorry, I wrote xleases.py while it should be xlease.py. The path should be:
"/usr/lib/python3.6/site-packages/vdsm/storage/xlease.py"

The modification is just to add a line that raises an exception before the sanlock write, nothing else. It's the same thing as shown in the initial comment or in comment 5, step 1, line 653.

> Also, could you please specify which 'old' release version should the 'old'
> host have, and which release version should be installed on the upgraded
> host.

Old version just means any version before this patch was included in the vdsm package; or, to be more specific, you can use the one for which this bug was reproduced as the "old" one - comment 1 - vdsm-4.40.26.3-1.el8ev.x86_64.

> The engine release version should be aligned with 'old' host or the 'new'
> one?

Good question. I don't think it matters, though, for the sake of verification: the patch affects only vdsm, not the engine. I'm not sure if we have any recommended procedure for updating an oVirt cluster, but I think you should end up in a situation where the engine version is aligned with the "new" version on one host, plus one host with the "old" version which is not updated yet. In other words, the engine gets updated first and then the hosts one by one. Any failover between hosts at a point where there are mixed versions in the cluster must not fail.

> The reproduction steps are now broken into two separated comments (yours
> from comment #5 and the initial description by Germano).
> Could you please align them to a one flow which would be easier to follow?
> As I find this a bit confusing to verify.
The initial description does not consider the mixed-version scenario and lacks some details, IMHO; that's the reason I rewrote it to keep it simple. I'd suggest you follow the procedure in comment 5; there's no need to combine it with the initial one.
Due to the nature of the patch, the feature for rebuilding the index had to be redesigned as well. Sanlock resources used to be the source of truth; now it's the index of the xleases volume. There are several cases which can occur and should be verified; to see all of them, check the docstring of rebuild_index() in xlease.py.

The verification procedure shown below describes only one case as an example, in which we recover from a missing/deleted sanlock resource using the index (now the source of truth).

1) Create a lease:

# cat <<EOF | vdsm-client -f - Lease create
{
    "lease": {
        "sd_id": "40b0c8e1-e646-4e56-8420-827252743f93",
        "lease_id": "b3f6fa00-b315-4ad4-8108-f73da817b5c5"
    }
}
EOF

2) Find the lease offset:

# cat <<EOF | vdsm-client -f - Lease info
{
    "lease": {
        "sd_id": "40b0c8e1-e646-4e56-8420-827252743f93",
        "lease_id": "b3f6fa00-b315-4ad4-8108-f73da817b5c5"
    }
}
EOF
{
    "lease_id": "b3f6fa00-b315-4ad4-8108-f73da817b5c5",
    "offset": 10485760,
    "path": "/dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases",
    "sd_id": "40b0c8e1-e646-4e56-8420-827252743f93"
}

3) Check the sanlock resource was created successfully:

# sanlock direct dump /dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases:10485760
  offset  lockspace                             resource                              timestamp  own  gen  lver
10485760  40b0c8e1-e646-4e56-8420-827252743f93  b3f6fa00-b315-4ad4-8108-f73da817b5c5  0000000000 0000 0000 0
11534336                                                                              0000000000 0000 0000 0

4) The lease is now present in the index of the xleases logical volume. Now delete the sanlock resource directly from storage:

# sanlock direct format -x "40b0c8e1-e646-4e56-8420-827252743f93:/dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases:10485760"
format done 0

(The value of the -x argument is lockspace_name:path:offset - see "man 8 sanlock" and look for RINDEX for more details.)

5) Check the sanlock resource is gone:

# sanlock direct dump /dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases:10485760
  offset  lockspace                             resource
10485760  40b0c8e1-e646-4e56-8420-827252743f93  rindex_header 0x1 512 4096 10485760
11534336  40b0c8e1-e646-4e56-8420-827252743f93

6) Check the lease can still be found; this is because we still have a record in the index:

# cat <<EOF | vdsm-client -f - Lease info
{
    "lease": {
        "sd_id": "40b0c8e1-e646-4e56-8420-827252743f93",
        "lease_id": "b3f6fa00-b315-4ad4-8108-f73da817b5c5"
    }
}
EOF
{
    "lease_id": "b3f6fa00-b315-4ad4-8108-f73da817b5c5",
    "offset": 10485760,
    "path": "/dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases",
    "sd_id": "40b0c8e1-e646-4e56-8420-827252743f93"
}

7) OPTIONAL: enable debug logging for storage and check vdsm.log after running the rebuild to see the detailed flow:

# vdsm-client Host setLogLevel level=DEBUG name=storage
true

8) We can now rebuild the leases, which recreates the missing sanlock resource from the index record:

# vdsm-client Lease rebuild_leases sd_id=40b0c8e1-e646-4e56-8420-827252743f93

9) Check sanlock - the resource should be present again:

# sanlock direct dump /dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases:10485760
  offset  lockspace                             resource                              timestamp  own  gen  lver
10485760  40b0c8e1-e646-4e56-8420-827252743f93  b3f6fa00-b315-4ad4-8108-f73da817b5c5  0000000000 0000 0000 0
11534336                                                                              0000000000 0000 0000 0
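The redesigned rebuild described above boils down to: walk the index (the new source of truth) and rewrite any sanlock resource that does not match its index record. Here is a tiny sketch under those assumptions; the function name and the read/write callbacks are illustrative, not the real rebuild_index() signature.

```python
# Sketch of index-driven rebuild: the index is authoritative, so any
# sanlock resource that is missing or stale on disk is recreated from it.
def rebuild_leases(index_records, read_resource, write_resource):
    """index_records: {offset: lease_id} from the xleases index.
    read_resource(offset) returns the lease_id stored on disk at that
    offset, or None if the resource is missing/cleared.
    write_resource(offset, lease_id) rewrites the resource on disk."""
    repaired = []
    for offset, lease_id in index_records.items():
        on_disk = read_resource(offset)
        if on_disk != lease_id:
            # Resource missing or stale: recreate it from the index record,
            # which is what step 8 above does for the deleted resource.
            write_resource(offset, lease_id)
            repaired.append(offset)
    return repaired
```

In the verification flow above, the record at offset 10485760 survives in the index after the resource is formatted away, so a rebuild rewrites exactly that one resource.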
I wasn't able to get the expected results by following the steps from comment #5. I used an engine with host A (older vdsm) and host B (newer vdsm):

engine: rhv-release-4.4.5-11
Host A: vdsm-4.40.38-1.el8ev.x86_64
Host B: vdsm-4.40.60.6-1.el8ev.x86_64

I did exactly the same manipulations, but at step 2) I wasn't able to get the same kind of behavior. As soon as I removed the lease, the SPM role didn't switch to host B, and no warning was seen in the vdsm log as described. I also tried to move the HA VM between the hosts (as I wasn't sure on which host it should run at first), but either way I got the same result.
Verified on the following setup, according to these steps. Big thanks to Roman for assisting with this!

engine: rhv-release-4.4.5-11
Host A: vdsm-4.40.38-1.el8ev.x86_64
Host B: vdsm-4.40.60.6-1.el8ev.x86_64

1) Select Host A as Storage Pool Manager and migrate the VM to Host A.

2) Create an external lease:
Edit Virtual Machine > High Availability > Target Storage Domain for VM Lease == <SD_ID>

3) Host A - xlease added successfully:

[root@storage-ge15-vdsm1 ~]# sanlock status | grep xleases
r bfe05dca-0832-44d7-93c2-eb816e05a15e:3df73ff2-5a63-4b2e-a76f-ee7fdbf003d7:/dev/bfe05dca-0832-44d7-93c2-eb816e05a15e/xleases:3145728:1 p 956752

4) Host A - insert the fake sanlock error:

[root@storage-ge15-vdsm1 ~]# grep -nC 5 "some error" /usr/lib/python3.6/site-packages/vdsm/storage/xlease.py
625-        self._write_record(recnum, record)
626-
627-        # There is no way to remove a resource, so we write an invalid resource
628-        # with empty resource and lockspace values.
629-        # TODO: Use SANLK_WRITE_CLEAR, expected in rhel 7.4.
630:        raise sanlock.SanlockException("some error")
631-        sanlock.write_resource(
632-            b"",
633-            b"",
634-            [(self._file.name, offset)],
635-            align=self._alignment,

5) Host A - restart vdsmd:

# systemctl restart vdsmd

6) oVirt GUI - switch the SPM role back to Host A.

7) Host A - fails to remove the lease:
Edit Virtual Machine > High Availability > Target Storage Domain for VM Lease == No VM Lease

2021-05-10 13:52:59,680+0300 INFO (tasks/3) [storage.xlease] Removing lease '3df73ff2-5a63-4b2e-a76f-ee7fdbf003d7' in lockspace 'bfe05dca-0832-44d7-93c2-eb816e05a15e' (xlease:618)
2021-05-10 13:52:59,681+0300 DEBUG (tasks/3) [storage.xlease] Closing index for lockspace 'bfe05dca-0832-44d7-93c2-eb816e05a15e' (xlease:671)
2021-05-10 13:52:59,681+0300 ERROR (tasks/3) [storage.TaskManager.Task] (Task='66b93ad2-4ae2-4570-83a1-80c89a78b34b') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 2036, in delete_lease
    dom.delete_lease(lease.lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 1454, in delete_lease
    vol.remove(lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/xlease.py", line 630, in remove
    raise sanlock.SanlockException("some error")
sanlock.SanlockException: some error
2021-05-10 13:52:59,681+0300 DEBUG (tasks/3) [storage.TaskManager.Task] (Task='66b93ad2-4ae2-4570-83a1-80c89a78b34b') Task._run: 66b93ad2-4ae2-4570-83a1-80c89a78b34b () {} failed - stopping task (task:899)
2021-05-10 13:52:59,681+0300 DEBUG (tasks/3) [storage.TaskManager.Task] (Task='66b93ad2-4ae2-4570-83a1-80c89a78b34b') stopping in state aborting (force False) (task:1265)
2021-05-10 13:52:59,681+0300 DEBUG (tasks/3) [storage.TaskManager.Task] (Task='66b93ad2-4ae2-4570-83a1-80c89a78b34b') ref 1 aborting True (task:1008)
2021-05-10 13:52:59,681+0300 DEBUG (tasks/3) [storage.TaskManager.Task] (Task='66b93ad2-4ae2-4570-83a1-80c89a78b34b') aborting: Task is aborted: 'value=some error abortedcode=100' (task:935)

8) Host A - attempt to add the lease again:

2021-05-10 15:53:24,566+0300 INFO (tasks/2) [storage.xlease] Adding lease '3df73ff2-5a63-4b2e-a76f-ee7fdbf003d7' in lockspace 'bfe05dca-0832-44d7-93c2-eb816e05a15e' (xlease:579)
2021-05-10 15:53:24,566+0300 DEBUG (tasks/2) [storage.xlease] Closing index for lockspace 'bfe05dca-0832-44d7-93c2-eb816e05a15e' (xlease:671)
2021-05-10 15:53:24,566+0300 ERROR (tasks/2) [storage.TaskManager.Task] (Task='b3adf7e0-7221-40be-80ef-40df80064a40') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 2022, in create_lease
    dom.create_lease(lease.lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 1444, in create_lease
    vol.add(lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/xlease.py", line 585, in add
    raise LeaseUpdating(lease_id)
vdsm.storage.xlease.LeaseUpdating: Lease 3df73ff2-5a63-4b2e-a76f-ee7fdbf003d7 is updating

9) In the oVirt GUI it looks like the VM still has a lease, but there is none - this is the behavior before the fix.

10) Select Host B as Storage Pool Manager and migrate the VM to Host B.

11) Remove the lease:
Edit Virtual Machine > High Availability > Target Storage Domain for VM Lease == No VM Lease

12) Add the lease again and check its presence on Host B:

[root@storage-ge15-vdsm2 ~]# sanlock status | grep xleases
r bfe05dca-0832-44d7-93c2-eb816e05a15e:3df73ff2-5a63-4b2e-a76f-ee7fdbf003d7:/dev/bfe05dca-0832-44d7-93c2-eb816e05a15e/xleases:3145728:1 p 161153
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV RHEL Host (ovirt-host) [ovirt-4.4.6]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2178
Due to QE capacity, we are not going to cover this issue in our automation