Bug 1902127
Summary: | Sanlock exception during Lease.delete leaves lease UPDATING_FLAG in updating state, failing the next Lease.create and engine updates VM lease incorrectly. | |
---|---|---|---
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic>
Component: | vdsm | Assignee: | Roman Bednář <rbednar>
Status: | CLOSED ERRATA | QA Contact: | Ilan Zuckerman <izuckerm>
Severity: | medium | Docs Contact: |
Priority: | unspecified | |
Version: | 4.4.3 | CC: | dfodor, eshenitz, klaas, lsurette, mavital, rbednar, sfishbai, srevivo, tnisan, ycui
Target Milestone: | ovirt-4.4.6 | Keywords: | ZStream
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-06-01 13:21:11 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Germano Veit Michel
2020-11-27 01:57:05 UTC
(In reply to Germano Veit Michel from comment #0)
> Version-Release number of selected component (if applicable):
>

Oops. The customer is on 4.3 and I "reproduced it" (as above) on:
rhvm-4.4.2.6-0.2.el8ev.noarch
vdsm-4.40.26.3-1.el8ev.x86_64

Customer:
rhvm-4.3.11.4-0.1.el7.noarch
vdsm-4.30.50-1.el7ev.x86_64

The documentation text flag should only be set after the 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Verification should consist of testing the lease add and remove flows in a mixed-version setup. One DC should contain two hosts: host A with an old version of vdsm and host B with a vdsm version that contains the bug fix.

To create an external lease, edit the VM properties in the GUI, go to High Availability, check the "Highly Available" checkbox and select a storage domain from the "Target Storage Domain for VM Lease" dropdown menu. To remove the lease, select "No VM Lease" from the dropdown menu.

Verification steps:

1) Prepare host B to fake a sanlock failure, similarly to what is done in the reproducer in the initial comment. Place a fake exception right before sanlock writes a resource in xlease.py - LeasesVolume.remove(). Restart vdsmd after editing the file.

650         # There is no way to remove a resource, so we write an invalid resource
651         # with empty resource and lockspace values.
652         # TODO: Use SANLK_WRITE_CLEAR, expected in rhel 7.4.
653         raise sanlock.SanlockException("some error")
654         try:
655             sanlock.write_resource(
656                 b"",
657                 b"",

2) Continue with reproducing the issue on host A using the reproducer in the initial comment. Make sure host A is the SPM and that you are creating a highly available VM (described above). The lease removal operation should leave the lease in updating state, because the removal process set the updating flag and then failed to write the sanlock resource; this is the old behavior. Attempting to add a lease will fail and the SPM role will move to host B, where the task gets finished regardless of the updating flag being present; this is the new behavior (see the sketch after this procedure). See the vdsm log for a new warning (host B) about a partially created lease:

2021-03-09 08:20:30,029-0500 INFO (tasks/3) [storage.xlease] Adding lease 'd03e63ed-1bd6-46d2-a8e6-337ec10b64c7' in lockspace '40b0c8e1-e646-4e56-8420-827252743f93' (xlease:585)
2021-03-09 08:20:30,029-0500 WARN (tasks/3) [storage.xlease] Ignoring partially created lease in updating state recnum=5 resource=d03e63ed-1bd6-46d2-a8e6-337ec10b64c7 offset=8388608 (xlease:594)
2021-03-09 08:20:30,095-0500 INFO (jsonrpc/7) [vdsm.api] FINISH getStoragePoolInfo return={'info': {'domains': '40b0c8e1-e646-4e56-8420-827252743f93:Active', 'isoprefix': '', 'lver': 54, 'maste

3) Verify that the SPM role moved to host B and that the VM now has a lease. To do so, check sanlock status on the host running the VM; you should see an xleases entry there:

# sanlock status | grep xleases
r 40b0c8e1-e646-4e56-8420-827252743f93:d03e63ed-1bd6-46d2-a8e6-337ec10b64c7:/dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases:8388608:1 p 9009

4) Using the lease id, verify that lookup works as well:

# echo '{"lease":{"sd_id":"40b0c8e1-e646-4e56-8420-827252743f93","lease_id":"d03e63ed-1bd6-46d2-a8e6-337ec10b64c7"}}' | vdsm-client -f - Lease info
{
    "lease_id": "d03e63ed-1bd6-46d2-a8e6-337ec10b64c7",
    "offset": 8388608,
    "path": "/dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases",
    "sd_id": "40b0c8e1-e646-4e56-8420-827252743f93"
}

5) Remove the lease using the engine GUI while host B is still the SPM; this tests the new behavior of lease removal.

6) Verify that the lease is now gone (use sanlock status as in step 3).

7) Add the lease again (make sure host B is still the SPM while doing so); this tests the new behavior of adding a lease with the fixed version. The GUI is now consistent with the lease info in vdsm (in contrast to step 8 of the reproducer in the initial comment, where an inconsistency occurred).
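The sketch below (referenced from step 2) only illustrates the behavior behind the "Ignoring partially created lease in updating state" warning; it is not the actual vdsm code, and the Record/FLAG_UPDATING/reusable names are made up for the example. The point is that a record left flagged as updating by a failed removal no longer blocks the next add (which previously failed with LeaseUpdating); it is logged and treated as reusable, so whichever host holds the SPM role can finish the operation.

import logging
from dataclasses import dataclass

log = logging.getLogger("storage.xlease")

# Illustrative flag value; the real on-disk record layout lives in
# /usr/lib/python3.6/site-packages/vdsm/storage/xlease.py.
FLAG_UPDATING = 0x1


@dataclass
class Record:
    recnum: int
    resource: str   # lease id, empty if the slot is free
    offset: int
    flags: int = 0


def reusable(rec):
    """Decide whether an add may (re)use this index slot.

    Old behavior (sketch): an add that found the record flagged as updating
    failed with LeaseUpdating.  New behavior (sketch): the partially
    created/removed lease is logged and the slot is treated as free.
    """
    if rec.flags & FLAG_UPDATING:
        log.warning(
            "Ignoring partially created lease in updating state "
            "recnum=%s resource=%s offset=%s",
            rec.recnum, rec.resource, rec.offset)
        return True
    return not rec.resource


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    leftover = Record(5, "d03e63ed-1bd6-46d2-a8e6-337ec10b64c7", 8388608,
                      flags=FLAG_UPDATING)
    print(reusable(leftover))   # True - the add can proceed on this slot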
Hi Roman, could you please specify the file paths which need to be modified? If possible, specify what exactly needs to be modified. I have searched in those files [1] on rhv-release-4.4.5-11 with vdsm-4.40.50.10-1 but couldn't find the right place.

Also, could you please specify which 'old' release version the 'old' host should have, and which release version should be installed on the upgraded host. Should the engine release version be aligned with the 'old' host or the 'new' one?

The reproduction steps are now broken into two separate comments (yours from comment #5 and the initial description by Germano). Could you please align them into one flow which would be easier to follow? I find this a bit confusing to verify.

[1]:
/usr/lib/python3.6/site-packages/vdsm/tool/xleases.py
/usr/lib/python3.6/site-packages/vdsm/vdsmd.py

(In reply to Ilan Zuckerman from comment #6)
> Hi Roman, could you please specify the file paths which need to be modified?
> If possible, specify what exactly needs to be modified.
> I have searched in those files [1] on rhv-release-4.4.5-11 with
> vdsm-4.40.50.10-1 but couldn't find the right place.

Sorry, I wrote xleases.py while it should be xlease.py. The path should be "/usr/lib/python3.6/site-packages/vdsm/storage/xlease.py". The modification is just to add a line that raises an exception before the sanlock write, nothing else. It is the same thing as shown in the initial comment or in comment 5, step 1, line 653.

> Also, could you please specify which 'old' release version the 'old'
> host should have, and which release version should be installed on the
> upgraded host.

"Old version" just means any version before this patch was included in the vdsm package; or, to be more specific, you can use the one for which this bug was reproduced as the "old" one - comment 1 - vdsm-4.40.26.3-1.el8ev.x86_64.

> Should the engine release version be aligned with the 'old' host or the
> 'new' one?

Good question. I don't think it matters for the sake of verification, though: the patch affects only vdsm, not the engine. I am not sure we have a recommended procedure for updating an oVirt cluster, but you should end up in a situation where the engine version is aligned with the "new" version on one host, plus one host with the "old" version which is not updated yet. In other words, the engine gets updated first and then the hosts one by one. Any failover between hosts at a point where there are mixed versions in the cluster must not fail.

> The reproduction steps are now broken into two separate comments (yours
> from comment #5 and the initial description by Germano).
> Could you please align them into one flow which would be easier to follow?
> I find this a bit confusing to verify.

The initial description does not consider the mixed-version scenario and lacks some details, in my opinion; that is why I rewrote it to keep it simple. I'd suggest you follow the procedure in comment 5, no need to combine it with the initial one.

Due to the nature of the patch, the feature for rebuilding the index had to be redesigned as well. Sanlock resources used to be the source of truth; now it is the index of the xleases volume.
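As a rough picture of that redesign (an illustrative sketch only, not the actual rebuild_index() implementation; IndexRecord, write_resource and clear_resource are placeholder names, and the real sanlock call signatures differ): the rebuild walks the index and makes the sanlock resource area match it, instead of rebuilding the index from the sanlock resources.

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class IndexRecord:
    # Placeholder shape of an index record, not the on-disk format.
    resource: str    # lease id, empty string if the slot is free
    offset: int      # offset of the lease area in the xleases volume
    updating: bool   # set while an add/remove is in flight


def rebuild(records: Iterable[IndexRecord],
            write_resource: Callable[[str, int], None],
            clear_resource: Callable[[int], None]) -> None:
    """Make the sanlock resources match the index (the source of truth)."""
    for rec in records:
        if rec.resource and not rec.updating:
            # The index says a lease lives here: (re)write its sanlock
            # resource, recovering from a deleted or corrupted resource such
            # as the one removed with "sanlock direct format -x" below.
            write_resource(rec.resource, rec.offset)
        else:
            # Free or half-updated slot: make sure no stale resource remains.
            clear_resource(rec.offset)


if __name__ == "__main__":
    index = [
        IndexRecord("b3f6fa00-b315-4ad4-8108-f73da817b5c5", 10485760, False),
        IndexRecord("", 11534336, False),
    ]
    rebuild(index,
            write_resource=lambda lease, off: print("write", lease, off),
            clear_resource=lambda off: print("clear", off))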
There are several cases which can occur and should be verified. To see all of them, check the docstring of rebuild_index() in xlease.py. The verification procedure shown below describes only one case as an example, in which we recover from a missing/deleted sanlock resource using the index (now the source of truth).

1) Create a lease:

# cat <<EOF | vdsm-client -f - Lease create
{
    "lease": {
        "sd_id": "40b0c8e1-e646-4e56-8420-827252743f93",
        "lease_id": "b3f6fa00-b315-4ad4-8108-f73da817b5c5"
    }
}
EOF

2) Find the lease offset (a scripted form of this lookup is sketched further below):

# cat <<EOF | vdsm-client -f - Lease info
{
    "lease": {
        "sd_id": "40b0c8e1-e646-4e56-8420-827252743f93",
        "lease_id": "b3f6fa00-b315-4ad4-8108-f73da817b5c5"
    }
}
EOF
{
    "lease_id": "b3f6fa00-b315-4ad4-8108-f73da817b5c5",
    "offset": 10485760,
    "path": "/dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases",
    "sd_id": "40b0c8e1-e646-4e56-8420-827252743f93"
}

3) Check that the sanlock resource was created successfully:

# sanlock direct dump /dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases:10485760
  offset                   lockspace                                   resource  timestamp  own  gen lver
10485760 40b0c8e1-e646-4e56-8420-827252743f93 b3f6fa00-b315-4ad4-8108-f73da817b5c5 0000000000 0000 0000 0
11534336                                                                          0000000000 0000 0000 0

4) The lease is now present in the index of the xleases logical volume. Now delete the sanlock resource directly from storage:

# sanlock direct format -x "40b0c8e1-e646-4e56-8420-827252743f93:/dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases:10485760"
format done 0

(the value of the -x argument is lockspace_name:path:offset - see "man 8 sanlock" and look for RINDEX for more details)

5) Check that the sanlock resource is gone:

# sanlock direct dump /dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases:10485760
  offset                   lockspace                                   resource  timestamp  own  gen lver
10485760 40b0c8e1-e646-4e56-8420-827252743f93 rindex_header 0x1 512 4096 10485760
11534336 40b0c8e1-e646-4e56-8420-827252743f93

6) Check that the lease can still be found; this is because we still have a record in the index:

# cat <<EOF | vdsm-client -f - Lease info
{
    "lease": {
        "sd_id": "40b0c8e1-e646-4e56-8420-827252743f93",
        "lease_id": "b3f6fa00-b315-4ad4-8108-f73da817b5c5"
    }
}
EOF
{
    "lease_id": "b3f6fa00-b315-4ad4-8108-f73da817b5c5",
    "offset": 10485760,
    "path": "/dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases",
    "sd_id": "40b0c8e1-e646-4e56-8420-827252743f93"
}

7) OPTIONAL: enable debug logging for storage and check vdsm.log after running the rebuild to see the detailed flow:

# vdsm-client Host setLogLevel level=DEBUG name=storage
true

8) We can now rebuild the leases, which recreates the missing sanlock resource from the index record:

# vdsm-client Lease rebuild_leases sd_id=40b0c8e1-e646-4e56-8420-827252743f93

9) Check sanlock - the resource should be present again:

# sanlock direct dump /dev/40b0c8e1-e646-4e56-8420-827252743f93/xleases:10485760
  offset                   lockspace                                   resource  timestamp  own  gen lver
10485760 40b0c8e1-e646-4e56-8420-827252743f93 b3f6fa00-b315-4ad4-8108-f73da817b5c5 0000000000 0000 0000 0
11534336                                                                          0000000000 0000 0000 0

I wasn't able to get the expected results by following the steps from comment #5. I used an engine with Host A (older vdsm) and Host B (newer vdsm):

engine: rhv-release-4.4.5-11
Host A: vdsm-4.40.38-1.el8ev.x86_64
Host B: vdsm-4.40.60.6-1.el8ev.x86_64

I did exactly the same manipulations, but at step 2 I wasn't able to get the same kind of behavior. As soon as I removed the lease, the SPM role didn't switch to Host B, and no warning was seen in the vdsm log as described. I also tried to move the HA VM between the hosts (as I wasn't sure on which host it should run at first), but either way I got the same result.
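For reference, the Lease info lookups used in the procedures above can be scripted around the same vdsm-client CLI. This is only a minimal sketch; the lease_info helper is hypothetical, and it assumes it runs on a host where vdsm-client is installed.

import json
import subprocess


def lease_info(sd_id, lease_id):
    """Run 'vdsm-client -f - Lease info' (as shown above) and parse the JSON
    reply into a dict with lease_id, offset, path and sd_id."""
    payload = json.dumps({"lease": {"sd_id": sd_id, "lease_id": lease_id}})
    result = subprocess.run(
        ["vdsm-client", "-f", "-", "Lease", "info"],
        input=payload, stdout=subprocess.PIPE,
        universal_newlines=True, check=True)
    return json.loads(result.stdout)


# Example with the IDs from the rebuild procedure above:
# info = lease_info("40b0c8e1-e646-4e56-8420-827252743f93",
#                   "b3f6fa00-b315-4ad4-8108-f73da817b5c5")
# print(info["offset"], info["path"])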
Verified on the following setup, according to the steps below. Big thanks to Roman for assisting with this!

engine: rhv-release-4.4.5-11
Host A: vdsm-4.40.38-1.el8ev.x86_64
Host B: vdsm-4.40.60.6-1.el8ev.x86_64

1) Select Host A as Storage Pool Manager and migrate the VM to Host A.

2) Create an external lease: Edit Virtual Machine > High Availability > Target Storage Domain for VM Lease == <SD_ID>

3) Host A - xlease added successfully:

[root@storage-ge15-vdsm1 ~]# sanlock status | grep xleases
r bfe05dca-0832-44d7-93c2-eb816e05a15e:3df73ff2-5a63-4b2e-a76f-ee7fdbf003d7:/dev/bfe05dca-0832-44d7-93c2-eb816e05a15e/xleases:3145728:1 p 956752

4) Host A - insert the fake sanlock error:

[root@storage-ge15-vdsm1 ~]# grep -nC 5 "some error" /usr/lib/python3.6/site-packages/vdsm/storage/xlease.py
625-        self._write_record(recnum, record)
626-
627-        # There is no way to remove a resource, so we write an invalid resource
628-        # with empty resource and lockspace values.
629-        # TODO: Use SANLK_WRITE_CLEAR, expected in rhel 7.4.
630:        raise sanlock.SanlockException("some error")
631-        sanlock.write_resource(
632-            b"",
633-            b"",
634-            [(self._file.name, offset)],
635-            align=self._alignment,

5) Host A - restart vdsmd:

# systemctl restart vdsmd

6) oVirt GUI - switch the SPM role back to Host A.

7) Host A - fails to remove the lease: Edit Virtual Machine > High Availability > Target Storage Domain for VM Lease == No VM Lease

2021-05-10 13:52:59,680+0300 INFO (tasks/3) [storage.xlease] Removing lease '3df73ff2-5a63-4b2e-a76f-ee7fdbf003d7' in lockspace 'bfe05dca-0832-44d7-93c2-eb816e05a15e' (xlease:618)
2021-05-10 13:52:59,681+0300 DEBUG (tasks/3) [storage.xlease] Closing index for lockspace 'bfe05dca-0832-44d7-93c2-eb816e05a15e' (xlease:671)
2021-05-10 13:52:59,681+0300 ERROR (tasks/3) [storage.TaskManager.Task] (Task='66b93ad2-4ae2-4570-83a1-80c89a78b34b') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 2036, in delete_lease
    dom.delete_lease(lease.lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 1454, in delete_lease
    vol.remove(lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/xlease.py", line 630, in remove
    raise sanlock.SanlockException("some error")
sanlock.SanlockException: some error
2021-05-10 13:52:59,681+0300 DEBUG (tasks/3) [storage.TaskManager.Task] (Task='66b93ad2-4ae2-4570-83a1-80c89a78b34b') Task._run: 66b93ad2-4ae2-4570-83a1-80c89a78b34b () {} failed - stopping task (task:899)
2021-05-10 13:52:59,681+0300 DEBUG (tasks/3) [storage.TaskManager.Task] (Task='66b93ad2-4ae2-4570-83a1-80c89a78b34b') stopping in state aborting (force False) (task:1265)
2021-05-10 13:52:59,681+0300 DEBUG (tasks/3) [storage.TaskManager.Task] (Task='66b93ad2-4ae2-4570-83a1-80c89a78b34b') ref 1 aborting True (task:1008)
2021-05-10 13:52:59,681+0300 DEBUG (tasks/3) [storage.TaskManager.Task] (Task='66b93ad2-4ae2-4570-83a1-80c89a78b34b') aborting: Task is aborted: 'value=some error abortedcode=100' (task:935)

8) Host A - attempt to add the lease again:

2021-05-10 15:53:24,566+0300 INFO (tasks/2) [storage.xlease] Adding lease '3df73ff2-5a63-4b2e-a76f-ee7fdbf003d7' in lockspace 'bfe05dca-0832-44d7-93c2-eb816e05a15e' (xlease:579)
2021-05-10 15:53:24,566+0300 DEBUG (tasks/2) [storage.xlease] Closing index for lockspace 'bfe05dca-0832-44d7-93c2-eb816e05a15e' (xlease:671)
2021-05-10 15:53:24,566+0300 ERROR (tasks/2) [storage.TaskManager.Task] (Task='b3adf7e0-7221-40be-80ef-40df80064a40') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 2022, in create_lease
    dom.create_lease(lease.lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 1444, in create_lease
    vol.add(lease_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/xlease.py", line 585, in add
    raise LeaseUpdating(lease_id)
vdsm.storage.xlease.LeaseUpdating: Lease 3df73ff2-5a63-4b2e-a76f-ee7fdbf003d7 is updating

9) In the oVirt GUI it looks like the VM still has a lease, but there is not any - this is the behavior before the fix.

10) Select Host B as Storage Pool Manager and migrate the VM to Host B.

11) Remove the lease: Edit Virtual Machine > High Availability > Target Storage Domain for VM Lease == No VM Lease

12) Add the lease again and check its presence on Host B (see the sketch at the end of this report):

[root@storage-ge15-vdsm2 ~]# sanlock status | grep xleases
r bfe05dca-0832-44d7-93c2-eb816e05a15e:3df73ff2-5a63-4b2e-a76f-ee7fdbf003d7:/dev/bfe05dca-0832-44d7-93c2-eb816e05a15e/xleases:3145728:1 p 161153

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) [ovirt-4.4.6]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2178

Due to QE capacity, we are not going to cover this issue in our automation.
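For completeness, the presence check used in steps 3 and 12 above (sanlock status | grep xleases) can be scripted as follows. This is only a minimal sketch with a hypothetical xlease_present helper; it assumes it runs as root on the host in question.

import subprocess


def xlease_present(lease_id):
    """Return True if 'sanlock status' lists an xleases resource for the
    given lease id, mirroring the manual check in steps 3 and 12 above."""
    status = subprocess.run(
        ["sanlock", "status"],
        stdout=subprocess.PIPE, universal_newlines=True, check=True).stdout
    return any("xleases" in line and lease_id in line
               for line in status.splitlines())


# Example:
# print(xlease_present("3df73ff2-5a63-4b2e-a76f-ee7fdbf003d7"))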