+++ This bug was initially created as a clone of Bug #1901688 +++
Description of problem:
When an LVM-activate resource manages a VG backed by an iSCSI-attached disk, it fails to stop during a graceful reboot or shutdown. This causes fencing by default.
Nov 24 17:47:25 fastvm-rhel-8-0-24 systemd[1]: Stopped Logout off all iSCSI sessions on shutdown.
Nov 24 17:47:25 fastvm-rhel-8-0-24 LVM-activate(lvm)[1989]: ERROR: Volume group "test_vg1" not found. Cannot process volume group test_vg1
Nov 24 17:47:25 fastvm-rhel-8-0-24 LVM-activate(lvm)[1989]: ERROR: test_vg1: failed to deactivate.
Nov 24 17:47:25 fastvm-rhel-8-0-24 pacemaker-execd[1875]: notice: lvm_stop_0[1989] error output [ ocf-exit-reason:test_vg1: failed to deactivate. ]
Nov 24 17:47:25 fastvm-rhel-8-0-24 pacemaker-controld[1878]: notice: Result of stop operation for lvm on node2: error
Two user-visible effects come to mind, though there are surely others in less common cases:
1. If the user intends to shut down the node, they will find that it rebooted.
2. This basically breaks shutdown-lock=true. When the node is fenced, the shutdown-lock is cleared, and resources get recovered on surviving nodes.
I propose that we ship an LVM-activate.conf file in /etc/systemd/system/resource-agents-deps.target.d/ that contains an `After=iscsi-shutdown.service` directive.
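A minimal sketch of such a drop-in (the filename is illustrative; the directive is the one proposed above). Ordering resource-agents-deps.target after iscsi-shutdown.service means that, on shutdown, Pacemaker (and thus the LVM-activate stop operation) runs before the iSCSI sessions are logged out:

~~~
# /etc/systemd/system/resource-agents-deps.target.d/LVM-activate.conf
[Unit]
After=iscsi-shutdown.service
~~~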
We can of course ask users to configure this themselves, but I would like to proactively avoid this issue. The average user is unlikely to find out they need to configure the resource-agents-deps.target.d file until they've already encountered an issue. This is a resource agent that we ship, and this dependency exists for a relatively common use case (LVM PVs presented via iSCSI).
The ExecStart for iscsi-shutdown.service is /bin/true, and the ExecStop logs out of all iSCSI sessions if any exist. Furthermore, iscsi-shutdown.service is enabled by default. So I can't think of any downsides here.
-----
Version-Release number of selected component (if applicable):
resource-agents-4.1.1-68.el8.x86_64
-----
How reproducible:
Most of the time.
-----
Steps to Reproduce:
1. Create a shared volume group whose backing PV is an iSCSI-attached disk.
2. Create an LVM-activate resource to manage this VG.
3. Perform a graceful reboot or shutdown (e.g., `systemctl reboot`) on the node where the LVM-activate resource is running.
-----
Actual results:
The LVM-activate resource fails to stop during pacemaker.service shutdown, and the node is fenced instead of rebooted gracefully.
-----
Expected results:
LVM-activate resource stops successfully, and the node reboots without issue.
Oyvind suggested installing a systemd drop-in with an "After=blk-availability.service" directive for resource-agents-deps.target during the LVM-activate start operation, as the legacy LVM agent did:
- https://github.com/ClusterLabs/resource-agents/blob/8f7e35455453e8cb355fccf895d7e07b7c64eb30/heartbeat/LVM#L232-L234
This sounded like a good idea, considering the definition of blk-availability.service, and I think it puts us on the right track. The unit's After= line includes iscsi-shutdown.service as well as some other storage-presentation services, so it should be more versatile in preventing issues like the one reported in this bug.
~~~
[Unit]
Description=Availability of block devices
Before=shutdown.target
After=lvm2-activation.service iscsi-shutdown.service iscsi.service iscsid.service fcoe.service rbdmap.service
DefaultDependencies=no
Conflicts=shutdown.target
[Service]
Type=oneshot
ExecStart=/usr/bin/true
ExecStop=/usr/sbin/blkdeactivate -u -l wholevg -m disablequeueing -r wait
RemainAfterExit=yes
~~~
Surprisingly, it didn't work. This is because blk-availability.service is not enabled by default and thus does not start automatically. So it doesn't get stopped during shutdown.
[root@fastvm-rhel-8-0-24 ~]# systemctl status blk-availability.service
● blk-availability.service - Availability of block devices
Loaded: loaded (/usr/lib/systemd/system/blk-availability.service; disabled; vendor preset: disabled)
Active: inactive (dead)
I checked my RHEL 7 system and found that blk-availability.service was active there despite being disabled.
[root@fastvm-rhel-7-6-22 ~]# systemctl status blk-availability.service
● blk-availability.service - Availability of block devices
Loaded: loaded (/usr/lib/systemd/system/blk-availability.service; disabled; vendor preset: disabled)
Active: active (exited) since Fri 2020-11-27 16:40:31 PST; 16s ago
...
As it turns out, this is because multipathd.service is enabled on my RHEL 7 system:
[root@fastvm-rhel-7-6-22 ~]# systemctl show blk-availability | egrep '(Wanted|Required)By='
WantedBy=multipathd.service
[root@fastvm-rhel-7-6-22 ~]# systemctl is-enabled multipathd
enabled
So I suspect that the systemd_drop_in() approach for blk-availability works on RHEL 7 **only if** blk-availability.service already gets started as a dependency of another service like multipathd.service. In other words, **this is probably a bug in the legacy LVM resource agent**, but probably not one that's worth fixing at this point.
Note that adding a "Wants=blk-availability.service" directive as shown below also doesn't work.
~~~
if systemd_is_running; then
	systemd_drop_in "99-LVM-activate-after" "After" \
		"blk-availability.service"
	systemd_drop_in "99-LVM-activate-wants" "Wants" \
		"blk-availability.service"
fi
~~~
This is because the directive is only added **after** pacemaker.service has already started, when the LVM-activate resource's start operation runs. systemd resolves Wants= dependencies when it builds the startup transaction, so a Wants= added later (even with a daemon-reload) does not retroactively start blk-availability.service.
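For context, the systemd_drop_in helper writes a runtime drop-in for resource-agents-deps.target, roughly of this shape (path and exact contents approximate):

~~~
# /run/systemd/system/resource-agents-deps.target.d/99-LVM-activate-wants.conf
# (path and filename approximate)
[Unit]
Wants=blk-availability.service
~~~

By the time this file exists, the boot transaction that would have honored Wants= is already complete, which is why the service stays inactive.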
I think we have two pretty straightforward options here:
(1) Configure an "After=/Wants=" dependency on blk-availability.service **before** pacemaker.service gets started (and thus before the LVM-activate resource agent runs). We could ship an /etc/systemd/system/resource-agents-deps.target.d/99-LVM-activate.conf that includes these directives.
(2) Run `systemctl start blk-availability.service` from within the resource agent at the time when we install the dependencies.
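Option (2) could look roughly like this in the agent's start path (a sketch only; the helper names are the ones used in the snippet above):

~~~
if systemd_is_running; then
	systemd_drop_in "99-LVM-activate" "After" \
		"blk-availability.service"
	# Start the service now so that its ExecStop (blkdeactivate)
	# actually runs at shutdown; adding Wants= at this point is
	# too late to pull it into the already-completed boot transaction.
	systemctl start blk-availability.service
fi
~~~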
@Oyvind: If this sounds good to you, let me know which approach you prefer (I'm guessing #2, to keep it within the agent). One of us can submit the PR next week.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (resource-agents bug fix and enhancement update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2021:1736
Comment 11, Oyvind Albrigtsen, 2021-06-15 10:19:50 UTC
*** Bug 1972035 has been marked as a duplicate of this bug. ***