Description of problem:

In bug 1952345 we had 2 hosts running as SPM for 3 months. This led to the expected corruption of LVM metadata, since both hosts were trying to modify the same VG. The root cause was incorrect usage of the sanlock connection socket on the vdsm side, triggering a bug in sanlock that closes the socket and releases the leases when sanlock receives an invalid message over the socket.

The vdsm issue was fixed and will be released in 4.4.7. The sanlock issue is fixed upstream and may be released in RHEL 8.4.z.

We don't have a way to ensure that similar issues do not exist in sanlock and that a future bug will not lead to a similar disaster. To avoid such failures, vdsm needs to monitor the SPM lease actively and detect the (impossible) case when the SPM lease was lost. If this impossible case happens, vdsm should panic. Vdsm will lose the SPM role, and engine will select a new SPM.

To implement this we depend on the sanlock.inquire() API, which is not yet exposed in the sanlock python binding. Adding this API is tracked in bug 1961748.
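For illustration only, a minimal sketch of the planned check (not vdsm code): verify via sanlock.inquire() that the process still holds the SPM lease. The function name and parameters are placeholders; inquire() returns lockspace/resource names as bytes, as shown in the reproduction below.

import sanlock

def spm_lease_held(vdsm_pid, lockspace, resource=b"SDM"):
    """Return True if the vdsm process still holds the SPM ("SDM") lease.

    Sketch of the planned check, assuming the sanlock.inquire() binding
    tracked in bug 1961748 is available.
    """
    for res in sanlock.inquire(pid=vdsm_pid):
        if res["lockspace"] == lockspace and res["resource"] == resource:
            return True
    return False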
Suggesting this for 4.4.7, but since it depends on sanlock, we may need to defer it until the sanlock fix tracked in bug 1961748 is released.
An easy way to reproduce this issue is to release the SPM lease behind vdsm's back. It turns out that any process that can access the sanlock socket can release leases owned by other processes.

To reproduce the issue:

1. Connect to the SPM host

2. Find the vdsm pid (239533 in this example)

3. Start a python3 shell

# python3
Python 3.6.8 (default, Mar 18 2021, 08:58:41)
[GCC 8.4.1 20200928 (Red Hat 8.4.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.

4. Find the SPM lease details

>>> import sanlock
>>> import pprint
>>> spm_lease = sanlock.inquire(pid=239533)[0]
>>> pprint.pprint(spm_lease)
{'disks': [('/dev/8ef0cc64-77c5-49e0-a0f5-03cee9bbee20/leases', 1048576)],
 'flags': 1,
 'lockspace': b'8ef0cc64-77c5-49e0-a0f5-03cee9bbee20',
 'resource': b'SDM',
 'version': 32}

At this point vdsm logs:

2021-05-19 22:37:30,920+0300 INFO (jsonrpc/1) [vdsm.api] FINISH getSpmStatus return={'spm_st': {'spmStatus': 'SPM', 'spmLver': 32, 'spmId': 2}} from=::ffff:192.168.122.11,49764, task_id=dd58caed-a3f0-448d-986c-d3a3c42b6433 (api:54)

5. Release the SPM lease - behind vdsm's back

>>> sanlock.release(spm_lease["lockspace"], spm_lease["resource"], spm_lease["disks"], pid=239533)
>>> sanlock.inquire(pid=239533)
[]

Vdsm lost the SPM lease, but it does not know about this. After the SPM lease was lost, vdsm logs:

2021-05-19 22:37:41,008+0300 INFO (jsonrpc/4) [vdsm.api] FINISH getSpmStatus return={'spm_st': {'spmStatus': 'SPM', 'spmLver': -1, 'spmId': -1}} from=::ffff:192.168.122.11,49764, task_id=8c306117-933f-4b02-98c4-2b5f6e28bd02 (api:54)

But vdsm continues to report the SPM role and to serve as the SPM (for example trying to extend volumes, or creating new volumes).

When this bug is fixed, vdsm will detect this condition and panic, aborting the SPM role, and engine will select a new SPM.
Created attachment 1785695 [details]
Script for releasing SPM lease

This script requires sanlock-3.8.3-2 for using sanlock.inquire().

To release the SPM lease, run this on the SPM host:

    python3 release-spm-lease.py
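The attachment itself is not reproduced here. Below is a rough sketch of what such a script might look like, using only the sanlock calls shown in the reproduction above; the pgrep-based PID lookup is an assumption, and the real attachment may find the vdsm pid differently.

#!/usr/bin/python3
# Sketch of a release-spm-lease.py style script (assumption: not the
# actual attachment). Requires sanlock >= 3.8.3 for sanlock.inquire().
import subprocess
import sanlock

print("Looking up vdsm pid")
# Assumption: a pgrep pattern is good enough to find the vdsm process.
vdsm_pid = int(subprocess.check_output(["pgrep", "-f", "vdsmd"]).split()[0])

print("Inquiring process %d resources" % vdsm_pid)
resources = sanlock.inquire(pid=vdsm_pid)

# The SPM lease uses the well-known "SDM" resource name.
for res in resources:
    if res["resource"] == b"SDM":
        print("Releasing SPM lease %s" % res)
        sanlock.release(
            res["lockspace"], res["resource"], res["disks"], pid=vdsm_pid)
        break
else:
    print("No SPM lease found")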
## How SPM lease is monitored

When SPM is started, a new thread is started, watching the SPM lease status. We see this new log:

2021-06-01 21:34:15,426+0300 INFO (jsonrpc/6) [storage.spwd] Start watching cluster lock Lease(lockspace='71702e1e-82d2-41d2-93df-678f2415da2a', resource='SDM', disk=('/dev/71702e1e-82d2-41d2-93df-678f2415da2a/leases', 1048576)) (spwd:79)

The watchdog thread (spwd) checks the lease status every spm:watchdog_interval seconds (20 seconds by default). If DEBUG level is enabled for the "storage.spwd" logger, we will see this log each time we check the lease:

2021-06-01 21:34:27,456+0300 DEBUG (spwd) [storage.spwd] Found cluster lock {'lockspace': '71702e1e-82d2-41d2-93df-678f2415da2a', 'resource': 'SDM', 'flags': 1, 'version': 2, 'disks': [('/dev/71702e1e-82d2-41d2-93df-678f2415da2a/leases', 1048576)]} (spwd:148)

When SPM is stopped, the watchdog is stopped, showing this log:

2021-06-01 21:35:02,057+0300 INFO (jsonrpc/2) [storage.spwd] Stop watching cluster lock Lease(lockspace='71702e1e-82d2-41d2-93df-678f2415da2a', resource='SDM', disk=('/dev/71702e1e-82d2-41d2-93df-678f2415da2a/leases', 1048576)) (spwd:88)

## Panic

Issues with the SPM lease status are *not* expected. Every issue is likely a bug in sanlock or in vdsm.

The watchdog panics in these conditions:

1. SPM lease is not found
2. SPM lease has the wrong disk
3. Checking the lease status failed with a temporary error (EBUSY) more than 3 times (lease status unknown for spm:watchdog_interval * 4 seconds)
4. Checking the lease status fails with an unexpected error

(A simplified sketch of this check appears after the test steps below.)

When the watchdog panics, we will see an error like:

2021-06-01 21:23:25,987+0300 ERROR (spwd) [root] Panic: Cluster lock Lease(lockspace='f53d7919-d275-4ad4-ba57-f280000de397', resource='SDM', disk=('/dev/f53d7919-d275-4ad4-ba57-f280000de397/leases', 1048576)) was lost (panic:31)

Once this log is flushed to storage, vdsm kills the entire process group. The next log should be the vdsm start log:

2021-06-01 21:23:32,881+0300 INFO (MainThread) [vds] (PID: 11627) I am the actual vdsm 4.40.70.1.11.gite15891f17 host3 (4.18.0-305.el8.x86_64) (vdsmd:163)

When vdsm starts up, it is no longer the SPM. Engine will select a new SPM, or start the SPM again on this host if it is the only host that can serve as the SPM.

## Configuration

New "spm" section added:

[spm]

# If enabled, monitor the SPM lease status and panic if the lease
# status is not as expected. The SPM host will lose the SPM role, and
# engine will select a new SPM host. (default true)
# watchdog_enable = true

# Watchdog check interval in seconds. The recommended value is
# sanlock:io_timeout * 2. (default 20)
# watchdog_interval = 20

Disabling the watchdog is not recommended, but it may be needed if the watchdog causes trouble in production.

For stress testing, it is recommended to use a very low watchdog_interval, to increase the chance of triggering races and unexpected conditions, both in sanlock and in vdsm.

To change the configuration, add this drop-in configuration file on all hosts:

$ cat /etc/vdsm/vdsm.conf.d/99-local.conf
[spm]
watchdog_enable = true
# Use very low interval for testing.
watchdog_interval = 1

## How to test

0. Install sanlock >= 3.8.3-4. When using an older sanlock version, the watchdog is not used since sanlock does not support the inquire() API.

1. Start SPM on a host

2. Release the SPM lease outside of vdsm using the script from attachment 1785695 [details].
# python3 release-spm-lease.py
Looking up vdsm pid
Inquiring process 14407 resoruces
Releasing SPM lease {'lockspace': b'2b5b232f-4a59-4ace-9e18-3841c63505de', 'resource': b'SDM', 'flags': 1, 'version': 4, 'disks': [('/rhev/data-center/mnt/storage:_export_16/2b5b232f-4a59-4ace-9e18-3841c63505de/dom_md/leases', 1048576)]}

3. Vdsm should panic and restart:

2021-06-01 22:34:18,111+0300 ERROR (spwd) [root] Panic: Cluster lock Lease(lockspace='2b5b232f-4a59-4ace-9e18-3841c63505de', resource='SDM', disk=('/rhev/data-center/mnt/storage:_export_16/2b5b232f-4a59-4ace-9e18-3841c63505de/dom_md/leases', 1048576)) was lost (panic:31)
NoneType: None

2021-06-01 22:34:32,547+0300 INFO (MainThread) [vds] (PID: 109880) I am the actual vdsm 4.40.70.1.11.gite15891f17 host3 (4.18.0-305.el8.x86_64) (vdsmd:163)

4. Engine should select a new SPM, and the host should move to UP state.

5. Disable the watchdog in the configuration and restart vdsm. When the SPM runs on this host, the watchdog should not be started (check the logs).
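For reference, here is the simplified sketch of the watchdog logic mentioned in the Panic section above. It is not the actual spwd module: panic(), the constants, and the function parameters are placeholder names, and the EBUSY handling assumes sanlock.SanlockException exposes an errno attribute.

import errno
import logging
import os
import signal
import time

import sanlock

WATCHDOG_INTERVAL = 20   # spm:watchdog_interval (seconds)
MAX_ERRORS = 3           # tolerated consecutive temporary errors (EBUSY)

log = logging.getLogger("storage.spwd")


def panic(msg):
    # Sketch: log the error and kill the entire process group, so the
    # host loses the SPM role and engine selects a new SPM.
    log.error("Panic: %s", msg)
    os.killpg(0, signal.SIGKILL)


def watch_spm_lease(vdsm_pid, lockspace, resource=b"SDM", disks=None):
    """Check the SPM lease every WATCHDOG_INTERVAL seconds and panic if
    it was lost, has the wrong disk, or its status cannot be determined.
    Simplified sketch only - not the actual spwd implementation.
    """
    errors = 0
    while True:
        time.sleep(WATCHDOG_INTERVAL)
        try:
            resources = sanlock.inquire(pid=vdsm_pid)
        except sanlock.SanlockException as e:
            if e.errno == errno.EBUSY:
                # Temporary error - lease status unknown for now.
                errors += 1
                if errors > MAX_ERRORS:
                    panic("Cluster lock status unknown for too long")
                continue
            # Unexpected error while checking the lease status.
            panic("Checking cluster lock failed: %s" % e)
        errors = 0

        for res in resources:
            if res["lockspace"] == lockspace and res["resource"] == resource:
                if disks is not None and res["disks"] != disks:
                    panic("Cluster lock has wrong disk: %s" % res)
                break
        else:
            panic("Cluster lock was lost")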
The new monitor was merged and can be tested in vdsm-4.40.70.3. The monitor is used when running with sanlock >= 3.8.3-3.el8_4; with an older sanlock version, the monitor will not be started.
Verified on:
rhv-4.4.7-3
vdsm-4.40.70.3-1.el8ev.x86_64
sanlock-3.8.3-3.el8_4.x86_64

- Execute:

[root@storage-ge5-vdsm1 ~]# python3 release-spm-lease.py
Looking up vdsm pid
Inquiring process 25772 resoruces
Releasing SPM lease {'lockspace': b'12c8a4ce-1fd8-413a-9d5a-bb41fcf005b7', 'resource': b'SDM', 'flags': 1, 'version': 4, 'disks': [('/rhev/data-center/mnt/mantis-nfs-xxx.com:_nas01_ge__storage5__nfs__1/12c8a4ce-1fd8-413a-9d5a-bb41fcf005b7/dom_md/leases', 1048576)]}

- VDSM panics:

2021-06-14 11:43:56,775+0300 ERROR (spwd) [root] Panic: Cluster lock Lease(lockspace='12c8a4ce-1fd8-413a-9d5a-bb41fcf005b7', resource='SDM', disk=('/rhev/data-center/mnt/mantis-nfs-xxx.com:_nas01_ge__storage5__nfs__1/12c8a4ce-1fd8-413a-9d5a-bb41fcf005b7/dom_md/leases', 1048576)) was lost (panic:31)

- Another host gets the SPM role:

2021-06-14 11:44:17,163+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-36) [32386c3] FINISH, SpmStartVDSCommand, return: SpmStatusResult:{SPM Id='3', SPM LVER='5', SPM Status='SPM'}, log id: 131aeed5

- Host is in 'up' state
This bugzilla is included in oVirt 4.4.7 release, published on July 6th 2021. Since the problem described in this bug report should be resolved in oVirt 4.4.7 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.