+++ This bug was initially created as a clone of Bug #1901688 +++
Description of problem:
When an LVM-activate resource manages a VG backed by an iSCSI-attached disk, it fails to stop during a graceful reboot or shutdown. This causes fencing by default.
Nov 24 17:47:25 fastvm-rhel-8-0-24 systemd[1]: Stopped Logout off all iSCSI sessions on shutdown.
Nov 24 17:47:25 fastvm-rhel-8-0-24 LVM-activate(lvm)[1989]: ERROR: Volume group "test_vg1" not found. Cannot process volume group test_vg1
Nov 24 17:47:25 fastvm-rhel-8-0-24 LVM-activate(lvm)[1989]: ERROR: test_vg1: failed to deactivate.
Nov 24 17:47:25 fastvm-rhel-8-0-24 pacemaker-execd[1875]: notice: lvm_stop_0[1989] error output [ ocf-exit-reason:test_vg1: failed to deactivate. ]
Nov 24 17:47:25 fastvm-rhel-8-0-24 pacemaker-controld[1878]: notice: Result of stop operation for lvm on node2: error
Two user-visible effects come to mind, though there are surely others in less common cases:
1. If the user intends to shut down the node, they will find that it rebooted.
2. This basically breaks shutdown-lock=true. When the node is fenced, the shutdown-lock is cleared, and resources get recovered on surviving nodes.
I propose that we ship an LVM-activate.conf file in /etc/systemd/system/resource-agents-deps.target.d/ that contains an `After=iscsi-shutdown.service` directive.
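A minimal sketch of such a drop-in (the filename is illustrative; the directive is the one proposed above). Ordering resource-agents-deps.target after iscsi-shutdown.service means that, on shutdown, Pacemaker (and thus the LVM-activate stop operation) runs before the iSCSI sessions are logged out:

~~~
# /etc/systemd/system/resource-agents-deps.target.d/LVM-activate.conf
[Unit]
After=iscsi-shutdown.service
~~~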
We can of course ask users to configure this themselves, but I would like to proactively avoid this issue. The average user is unlikely to find out they need to configure the resource-agents-deps.target.d file until they've already encountered an issue. This is a resource agent that we ship, and this dependency exists for a relatively common use case (LVM PVs presented via iSCSI).
The ExecStart for iscsi-shutdown.service is /bin/true, and the ExecStop logs out of all iSCSI sessions if any exist. Furthermore, iscsi-shutdown.service is enabled by default. So I can't think of any downsides here.
-----
Version-Release number of selected component (if applicable):
resource-agents-4.1.1-68.el8.x86_64
-----
How reproducible:
Most of the time.
-----
Steps to Reproduce:
1. Create a shared volume group whose backing PV is an iSCSI-attached disk.
2. Create an LVM-activate resource to manage this VG.
3. Perform a graceful reboot or shutdown (e.g., `systemctl reboot`) on the node where the LVM-activate resource is running.
-----
Actual results:
The LVM-activate resource fails to stop during pacemaker.service shutdown, and the node is fenced instead of rebooted gracefully.
-----
Expected results:
LVM-activate resource stops successfully, and the node reboots without issue.
Oyvind suggested installing a systemd drop-in with an "After=blk-availability.service" directive for resource-agents-deps.target during the LVM-activate start operation, as the legacy LVM agent did:
- https://github.com/ClusterLabs/resource-agents/blob/8f7e35455453e8cb355fccf895d7e07b7c64eb30/heartbeat/LVM#L232-L234
This sounded like a good idea, considering the definition of blk-availability.service, and I think it puts us on the right track. The unit's After= line includes iscsi-shutdown.service as well as some other storage-presentation services, so it should be more versatile in preventing issues like the one reported in this bug.
~~~
[Unit]
Description=Availability of block devices
Before=shutdown.target
After=lvm2-activation.service iscsi-shutdown.service iscsi.service iscsid.service fcoe.service rbdmap.service
DefaultDependencies=no
Conflicts=shutdown.target
[Service]
Type=oneshot
ExecStart=/usr/bin/true
ExecStop=/usr/sbin/blkdeactivate -u -l wholevg -m disablequeueing -r wait
RemainAfterExit=yes
~~~
Surprisingly, it didn't work. This is because blk-availability.service is not enabled by default and thus does not start automatically. So it doesn't get stopped during shutdown.
[root@fastvm-rhel-8-0-24 ~]# systemctl status blk-availability.service
● blk-availability.service - Availability of block devices
Loaded: loaded (/usr/lib/systemd/system/blk-availability.service; disabled; vendor preset: disabled)
Active: inactive (dead)
I checked my RHEL 7 system and found that blk-availability.service was active there despite being disabled.
[root@fastvm-rhel-7-6-22 ~]# systemctl status blk-availability.service
● blk-availability.service - Availability of block devices
Loaded: loaded (/usr/lib/systemd/system/blk-availability.service; disabled; vendor preset: disabled)
Active: active (exited) since Fri 2020-11-27 16:40:31 PST; 16s ago
...
As it turns out, this is because multipathd.service is enabled on my RHEL 7 system:
[root@fastvm-rhel-7-6-22 ~]# systemctl show blk-availability | egrep '(Wanted|Required)By='
WantedBy=multipathd.service
[root@fastvm-rhel-7-6-22 ~]# systemctl is-enabled multipathd
enabled
So I suspect that the systemd_drop_in() approach for blk-availability works on RHEL 7 **only if** blk-availability.service already gets started as a dependency of another service like multipathd.service. In other words, **this is probably a bug in the legacy LVM resource agent**, but probably not one that's worth fixing at this point.
Note that adding a "Wants=blk-availability.service" directive as shown below also doesn't work.
~~~
if systemd_is_running; then
	systemd_drop_in "99-LVM-activate-after" "After" \
		"blk-availability.service"
	systemd_drop_in "99-LVM-activate-wants" "Wants" \
		"blk-availability.service"
fi
~~~
This is because the directive is only added **after** pacemaker.service has already started, when the LVM-activate resource's start operation runs. systemd resolves Wants= dependencies when it builds the startup transaction, so a Wants= added later (even with a daemon-reload) does not retroactively start blk-availability.service.
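For context, the systemd_drop_in helper writes a runtime drop-in for resource-agents-deps.target, roughly of this shape (path and exact contents approximate):

~~~
# /run/systemd/system/resource-agents-deps.target.d/99-LVM-activate-wants.conf
# (path and filename approximate)
[Unit]
Wants=blk-availability.service
~~~

By the time this file exists, the boot transaction that would have honored Wants= is already complete, which is why the service stays inactive.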
I think we have two pretty straightforward options here:
(1) Configure an "After=/Wants=" dependency on blk-availability.service **before** pacemaker.service gets started (and thus before the LVM-activate resource agent runs). We could ship an /etc/systemd/system/resource-agents-deps.target.d/99-LVM-activate.conf that includes these directives.
(2) Run `systemctl start blk-availability.service` from within the resource agent at the time when we install the dependencies.
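Option (2) could look roughly like this in the agent's start path (a sketch only; the helper names are the ones used in the snippet above):

~~~
if systemd_is_running; then
	systemd_drop_in "99-LVM-activate" "After" \
		"blk-availability.service"
	# Start the service now so that its ExecStop (blkdeactivate)
	# actually runs at shutdown; adding Wants= at this point is
	# too late to pull it into the already-completed boot transaction.
	systemctl start blk-availability.service
fi
~~~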
@Oyvind: If this sounds good to you, let me know which approach you prefer (I'm guessing #2, to keep it within the agent). One of us can submit the PR next week.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (resource-agents bug fix and enhancement update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2021:1736
Comment 11, Oyvind Albrigtsen, 2021-06-15 10:19:50 UTC
*** Bug 1972035 has been marked as a duplicate of this bug. ***