Description of problem:

In bug 1952345 we had 2 hosts running as SPM for 3 months. This led to the expected corruption of LVM metadata, since both hosts were trying to modify the same VG. The root cause was incorrect usage of the sanlock connection socket on the vdsm side, triggering a bug in sanlock that closes the socket and releases the leases when sanlock receives an invalid message over the socket.

The vdsm issue was fixed and will be released in 4.4.7. The sanlock issue is fixed upstream and may be released in RHEL 8.4.z.

We don't have a way to ensure that similar issues do not exist in sanlock and that a future bug will not lead to a similar disaster. To avoid such failures, vdsm needs to monitor the SPM lease actively and detect the (impossible) case when the SPM lease was lost. If this impossible case happens, vdsm should panic. Vdsm will lose the SPM role, and engine will select a new SPM.

To implement this we depend on the sanlock.inquire() API, which is not yet exposed in the sanlock python binding. Adding this API is tracked in bug 1961748.
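For illustration only, a minimal sketch of the planned check (not vdsm code): verify via sanlock.inquire() that the process still holds the SPM lease. The function name and parameters are placeholders; inquire() returns lockspace/resource names as bytes, as shown in the reproduction below.

import sanlock

def spm_lease_held(vdsm_pid, lockspace, resource=b"SDM"):
    """Return True if the vdsm process still holds the SPM ("SDM") lease.

    Sketch of the planned check, assuming the sanlock.inquire() binding
    tracked in bug 1961748 is available.
    """
    for res in sanlock.inquire(pid=vdsm_pid):
        if res["lockspace"] == lockspace and res["resource"] == resource:
            return True
    return False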
Suggesting this for 4.4.7, but since it depends on sanlock, we may need to defer it until the sanlock fix tracked in bug 1961748 is released.
An easy way to reproduce this issue is to release the SPM lease behind vdsm's back. It turns out that any process that can access the sanlock socket can release leases owned by other processes.

To reproduce the issue:

1. Connect to the SPM host

2. Find the vdsm pid (239533 in this example)

3. Start a python3 shell

# python3
Python 3.6.8 (default, Mar 18 2021, 08:58:41)
[GCC 8.4.1 20200928 (Red Hat 8.4.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.

4. Find the SPM lease details

>>> import sanlock
>>> import pprint
>>> spm_lease = sanlock.inquire(pid=239533)[0]
>>> pprint.pprint(spm_lease)
{'disks': [('/dev/8ef0cc64-77c5-49e0-a0f5-03cee9bbee20/leases', 1048576)],
 'flags': 1,
 'lockspace': b'8ef0cc64-77c5-49e0-a0f5-03cee9bbee20',
 'resource': b'SDM',
 'version': 32}

At this point vdsm logs:

2021-05-19 22:37:30,920+0300 INFO (jsonrpc/1) [vdsm.api] FINISH getSpmStatus return={'spm_st': {'spmStatus': 'SPM', 'spmLver': 32, 'spmId': 2}} from=::ffff:192.168.122.11,49764, task_id=dd58caed-a3f0-448d-986c-d3a3c42b6433 (api:54)

5. Release the SPM lease - behind vdsm's back

>>> sanlock.release(spm_lease["lockspace"], spm_lease["resource"], spm_lease["disks"], pid=239533)
>>> sanlock.inquire(pid=239533)
[]

Vdsm lost the SPM lease, but it does not know about this. After the SPM lease was lost, vdsm logs:

2021-05-19 22:37:41,008+0300 INFO (jsonrpc/4) [vdsm.api] FINISH getSpmStatus return={'spm_st': {'spmStatus': 'SPM', 'spmLver': -1, 'spmId': -1}} from=::ffff:192.168.122.11,49764, task_id=8c306117-933f-4b02-98c4-2b5f6e28bd02 (api:54)

But vdsm continues to report the SPM role and to serve as the SPM (for example trying to extend volumes, or creating new volumes).

When this bug is fixed, vdsm will detect this condition and panic, aborting the SPM role, and engine will select a new SPM.
Created attachment 1785695 [details]
Script for releasing SPM lease

This script requires sanlock-3.8.3-2 for using sanlock.inquire().

To release the SPM lease, run this on the SPM host:

    python3 release-spm-lease.py
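The attachment itself is not reproduced here. Below is a rough sketch of what such a script might look like, using only the sanlock calls shown in the reproduction above; the pgrep-based PID lookup is an assumption, and the real attachment may find the vdsm pid differently.

#!/usr/bin/python3
# Sketch of a release-spm-lease.py style script (assumption: not the
# actual attachment). Requires sanlock >= 3.8.3 for sanlock.inquire().
import subprocess
import sanlock

print("Looking up vdsm pid")
# Assumption: a pgrep pattern is good enough to find the vdsm process.
vdsm_pid = int(subprocess.check_output(["pgrep", "-f", "vdsmd"]).split()[0])

print("Inquiring process %d resources" % vdsm_pid)
resources = sanlock.inquire(pid=vdsm_pid)

# The SPM lease uses the well-known "SDM" resource name.
for res in resources:
    if res["resource"] == b"SDM":
        print("Releasing SPM lease %s" % res)
        sanlock.release(
            res["lockspace"], res["resource"], res["disks"], pid=vdsm_pid)
        break
else:
    print("No SPM lease found")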
## How SPM lease is monitored

When SPM is started, a new thread is started, watching the SPM lease status. We see this new log:

2021-06-01 21:34:15,426+0300 INFO (jsonrpc/6) [storage.spwd] Start watching cluster lock Lease(lockspace='71702e1e-82d2-41d2-93df-678f2415da2a', resource='SDM', disk=('/dev/71702e1e-82d2-41d2-93df-678f2415da2a/leases', 1048576)) (spwd:79)

The watchdog thread (spwd) checks the lease status every spm:watchdog_interval seconds (20 seconds by default). If DEBUG level is enabled for the "storage.spwd" logger, we will see this log each time we check the lease:

2021-06-01 21:34:27,456+0300 DEBUG (spwd) [storage.spwd] Found cluster lock {'lockspace': '71702e1e-82d2-41d2-93df-678f2415da2a', 'resource': 'SDM', 'flags': 1, 'version': 2, 'disks': [('/dev/71702e1e-82d2-41d2-93df-678f2415da2a/leases', 1048576)]} (spwd:148)

When SPM is stopped, the watchdog is stopped, showing this log:

2021-06-01 21:35:02,057+0300 INFO (jsonrpc/2) [storage.spwd] Stop watching cluster lock Lease(lockspace='71702e1e-82d2-41d2-93df-678f2415da2a', resource='SDM', disk=('/dev/71702e1e-82d2-41d2-93df-678f2415da2a/leases', 1048576)) (spwd:88)

## Panic

Issues with the SPM lease status are *not* expected. Every issue is likely a bug in sanlock or in vdsm.

The watchdog panics in these conditions:

1. SPM lease is not found
2. SPM lease has the wrong disk
3. Checking the lease status failed with a temporary error (EBUSY) more than 3 times (lease status unknown for spm:watchdog_interval * 4 seconds)
4. Checking the lease status fails with an unexpected error

(A simplified sketch of this check appears after the test steps below.)

When the watchdog panics, we will see an error like:

2021-06-01 21:23:25,987+0300 ERROR (spwd) [root] Panic: Cluster lock Lease(lockspace='f53d7919-d275-4ad4-ba57-f280000de397', resource='SDM', disk=('/dev/f53d7919-d275-4ad4-ba57-f280000de397/leases', 1048576)) was lost (panic:31)

Once this log is flushed to storage, vdsm kills the entire process group. The next log should be the vdsm start log:

2021-06-01 21:23:32,881+0300 INFO (MainThread) [vds] (PID: 11627) I am the actual vdsm 4.40.70.1.11.gite15891f17 host3 (4.18.0-305.el8.x86_64) (vdsmd:163)

When vdsm starts up, it is no longer the SPM. Engine will select a new SPM, or start the SPM again on this host if it is the only host that can serve as the SPM.

## Configuration

New "spm" section added:

[spm]

# If enabled, monitor the SPM lease status and panic if the lease
# status is not as expected. The SPM host will lose the SPM role, and
# engine will select a new SPM host. (default true)
# watchdog_enable = true

# Watchdog check interval in seconds. The recommended value is
# sanlock:io_timeout * 2. (default 20)
# watchdog_interval = 20

Disabling the watchdog is not recommended, but it may be needed if the watchdog causes trouble in production.

For stress testing, it is recommended to use a very low watchdog_interval, to increase the chance of triggering races and unexpected conditions, both in sanlock and in vdsm.

To change the configuration, add this drop-in configuration file on all hosts:

$ cat /etc/vdsm/vdsm.conf.d/99-local.conf
[spm]
watchdog_enable = true
# Use very low interval for testing.
watchdog_interval = 1

## How to test

0. Install sanlock >= 3.8.3-4. When using an older sanlock version, the watchdog is not used since sanlock does not support the inquire() API.

1. Start SPM on a host

2. Release the SPM lease outside of vdsm using the script from attachment 1785695 [details].
# python3 release-spm-lease.py
Looking up vdsm pid
Inquiring process 14407 resoruces
Releasing SPM lease {'lockspace': b'2b5b232f-4a59-4ace-9e18-3841c63505de', 'resource': b'SDM', 'flags': 1, 'version': 4, 'disks': [('/rhev/data-center/mnt/storage:_export_16/2b5b232f-4a59-4ace-9e18-3841c63505de/dom_md/leases', 1048576)]}

3. Vdsm should panic and restart:

2021-06-01 22:34:18,111+0300 ERROR (spwd) [root] Panic: Cluster lock Lease(lockspace='2b5b232f-4a59-4ace-9e18-3841c63505de', resource='SDM', disk=('/rhev/data-center/mnt/storage:_export_16/2b5b232f-4a59-4ace-9e18-3841c63505de/dom_md/leases', 1048576)) was lost (panic:31)
NoneType: None

2021-06-01 22:34:32,547+0300 INFO (MainThread) [vds] (PID: 109880) I am the actual vdsm 4.40.70.1.11.gite15891f17 host3 (4.18.0-305.el8.x86_64) (vdsmd:163)

4. Engine should select a new SPM, and the host should move to UP state.

5. Disable the watchdog in the configuration and restart vdsm. When the SPM runs on this host, the watchdog should not be started (check the logs).
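For reference, here is the simplified sketch of the watchdog logic mentioned in the Panic section above. It is not the actual spwd module: panic(), the constants, and the function parameters are placeholder names, and the EBUSY handling assumes sanlock.SanlockException exposes an errno attribute.

import errno
import logging
import os
import signal
import time

import sanlock

WATCHDOG_INTERVAL = 20   # spm:watchdog_interval (seconds)
MAX_ERRORS = 3           # tolerated consecutive temporary errors (EBUSY)

log = logging.getLogger("storage.spwd")


def panic(msg):
    # Sketch: log the error and kill the entire process group, so the
    # host loses the SPM role and engine selects a new SPM.
    log.error("Panic: %s", msg)
    os.killpg(0, signal.SIGKILL)


def watch_spm_lease(vdsm_pid, lockspace, resource=b"SDM", disks=None):
    """Check the SPM lease every WATCHDOG_INTERVAL seconds and panic if
    it was lost, has the wrong disk, or its status cannot be determined.
    Simplified sketch only - not the actual spwd implementation.
    """
    errors = 0
    while True:
        time.sleep(WATCHDOG_INTERVAL)
        try:
            resources = sanlock.inquire(pid=vdsm_pid)
        except sanlock.SanlockException as e:
            if e.errno == errno.EBUSY:
                # Temporary error - lease status unknown for now.
                errors += 1
                if errors > MAX_ERRORS:
                    panic("Cluster lock status unknown for too long")
                continue
            # Unexpected error while checking the lease status.
            panic("Checking cluster lock failed: %s" % e)
        errors = 0

        for res in resources:
            if res["lockspace"] == lockspace and res["resource"] == resource:
                if disks is not None and res["disks"] != disks:
                    panic("Cluster lock has wrong disk: %s" % res)
                break
        else:
            panic("Cluster lock was lost")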
The new monitor was merged and can be tested in vdsm-4.40.70.3. The monitor is used when running with sanlock >= 3.8.3-3.el8_4; with an older sanlock version, the monitor will not be started.
Verified on:
rhv-4.4.7-3
vdsm-4.40.70.3-1.el8ev.x86_64
sanlock-3.8.3-3.el8_4.x86_64

- Execute:

[root@storage-ge5-vdsm1 ~]# python3 release-spm-lease.py
Looking up vdsm pid
Inquiring process 25772 resoruces
Releasing SPM lease {'lockspace': b'12c8a4ce-1fd8-413a-9d5a-bb41fcf005b7', 'resource': b'SDM', 'flags': 1, 'version': 4, 'disks': [('/rhev/data-center/mnt/mantis-nfs-xxx.com:_nas01_ge__storage5__nfs__1/12c8a4ce-1fd8-413a-9d5a-bb41fcf005b7/dom_md/leases', 1048576)]}

- VDSM panics:

2021-06-14 11:43:56,775+0300 ERROR (spwd) [root] Panic: Cluster lock Lease(lockspace='12c8a4ce-1fd8-413a-9d5a-bb41fcf005b7', resource='SDM', disk=('/rhev/data-center/mnt/mantis-nfs-xxx.com:_nas01_ge__storage5__nfs__1/12c8a4ce-1fd8-413a-9d5a-bb41fcf005b7/dom_md/leases', 1048576)) was lost (panic:31)

- Another host gets the SPM role:

2021-06-14 11:44:17,163+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-36) [32386c3] FINISH, SpmStartVDSCommand, return: SpmStatusResult:{SPM Id='3', SPM LVER='5', SPM Status='SPM'}, log id: 131aeed5

- Host is in 'up' state
This bugzilla is included in oVirt 4.4.7 release, published on July 6th 2021. Since the problem described in this bug report should be resolved in oVirt 4.4.7 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.