Bug 1972035 - LVM-activate: Start operation always recreates drop-in file and runs systemctl daemon-reload
Summary: LVM-activate: Start operation always recreates drop-in file and runs systemctl daemon-reload
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: resource-agents
Version: 8.4
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: beta
Target Release: ---
Assignee: Oyvind Albrigtsen
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1972236
 
Reported: 2021-06-15 06:13 UTC by Reid Wahl
Modified: 2021-11-09 18:25 UTC
CC List: 6 users

Fixed In Version: resource-agents-4.1.1-96.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1972236 (view as bug list)
Environment:
Last Closed: 2021-11-09 17:27:32 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments: none


Links:
  * GitHub: ClusterLabs resource-agents pull 1652 (closed) - "LVM-activate: fix drop-in check to avoid re-creating drop-in file when it already exists" (last updated 2021-06-15 23:41:44 UTC)
  * Red Hat Knowledge Base (Solution) 6119121 (last updated 2021-06-15 10:31:49 UTC)
  * Red Hat Product Errata RHSA-2021:4139 (last updated 2021-11-09 17:28:00 UTC)

Description Reid Wahl 2021-06-15 06:13:16 UTC
Description of problem:

Due to a typo in upstream commit 79fb4b2d [1], LVM-activate calls systemd_drop_in every time it starts, which recreates the 99-LVM-activate.conf file and runs `systemctl daemon-reload`. See issue #1647 [2] and pull #1652 [3].
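
For context, systemd_drop_in is a helper in the resource-agents shell library that writes a small unit drop-in and then runs `systemctl daemon-reload`; the intended behavior is to do that only when the drop-in isn't already in place. Below is a minimal illustrative sketch of such an idempotent check, not the exact upstream code (the actual fix in pull #1652 adjusts the agent's existing check instead, and the drop-in path shown here is an assumption):

  # Sketch only: create the drop-in and reload systemd only when it's missing.
  # Path and file name are illustrative, following the 99-LVM-activate naming.
  drop_in=/etc/systemd/system/resource-agents-deps.target.d/99-LVM-activate.conf

  if [ ! -e "$drop_in" ]; then
      mkdir -p "$(dirname "$drop_in")"
      printf '[Unit]\nAfter=blk-availability.service\n' > "$drop_in"
      systemctl daemon-reload
  fi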

Somehow (this behavior continues to baffle me), this breaks the nfsserver resource's failover.

Consider the usual NFS cluster stack. (Note: the export seems to be required in order to reproduce this issue.)

  * Resource Group: nfsgroup:
    * vg	(ocf::heartbeat:LVM-activate):	 Started node2
    * fs	(ocf::heartbeat:Filesystem):	 Started node2
    * share_fs	(ocf::heartbeat:Filesystem):	 Started node2
    * nfsdaemon	(ocf::heartbeat:nfsserver):	 Started node2
    * export	(ocf::heartbeat:exportfs):	 Started node2
    * vip	(ocf::heartbeat:IPaddr2):	 Started node2

where "fs" is the nfs_shared_infodir filesystem.


If I move nfsgroup from node2 to node1 and then back to node2, the nfsserver start operation times out the **second** time it attempts to start on node2.

This problem disappears if I incorporate commit 34e75a57 [4] and reboot both nodes.


I haven't been able to figure out **why** the call to systemd_drop_in during every LVM-activate startup causes a problem for the nfsserver resource. But somehow it's affecting the rpc_pipefs mount and/or the var-lib-nfs-rpc_pipefs.mount systemd unit. Here are logs from a second startup on node2, which timed out.

Jun 14 22:47:33 fastvm-rhel-8-0-24 pacemaker-controld[1699]: notice: Requesting local execution of start operation for nfsdaemon on node2
Jun 14 22:47:33 fastvm-rhel-8-0-24 pacemaker-controld[1699]: notice: Result of monitor operation for share_fs on node2: ok
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: ERROR: nfs-mountd is not running
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: /var/lib/nfs/rpc_pipefs already mounted. Unmounting in preparation to bind mount nfs dir
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Stopped target rpc_pipefs.target.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Unmounting RPC Pipe File System...
Jun 14 22:47:33 fastvm-rhel-8-0-24 umount[3586]: umount: /var/lib/nfs/rpc_pipefs: not mounted.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to set up mount unit: Device or resource busy
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to set up mount unit: Device or resource busy
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Mount process exited, code=exited status=32
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Failed with result 'exit-code'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Unmounted RPC Pipe File System.
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Starting NFS server ...
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: rpcbind i: 1
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Starting Preprocess NFS configuration convertion...
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: nfs-convert.service: Succeeded.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Started Preprocess NFS configuration convertion.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Starting NFS status monitor for NFSv2/3 locking....
Jun 14 22:47:33 fastvm-rhel-8-0-24 rpc.statd[3641]: Version 2.3.3 starting
Jun 14 22:47:33 fastvm-rhel-8-0-24 rpc.statd[3641]: Flags: TI-RPC
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Started NFS status monitor for NFSv2/3 locking..
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Directory /var/lib/nfs/rpc_pipefs to mount over is not empty, mounting anyway.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Mounting RPC Pipe File System...
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Starting NFS Mount Daemon...
Jun 14 22:47:33 fastvm-rhel-8-0-24 mount[3645]: mount: /var/lib/nfs/rpc_pipefs: sunrpc already mounted on /var/lib/nfs/rpc_pipefs.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to set up mount unit: Device or resource busy
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to set up mount unit: Device or resource busy
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Mount process exited, code=exited status=32
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Failed with result 'exit-code'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to mount RPC Pipe File System.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for rpc_pipefs.target.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for NFSv4 ID-name mapping service.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: nfs-idmapd.service: Job nfs-idmapd.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for NFSv4 Client Tracking Daemon.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: nfsdcld.service: Job nfsdcld.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for RPC security service for NFS client and server.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: rpc-gssd.service: Job rpc-gssd.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: rpc_pipefs.target: Job rpc_pipefs.target/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 rpc.mountd[3648]: Version 2.3.3 starting
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Started NFS Mount Daemon.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Starting NFS server and services...
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-mountd i: 1
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Directory /var/lib/nfs/rpc_pipefs to mount over is not empty, mounting anyway.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Mounting RPC Pipe File System...
Jun 14 22:47:33 fastvm-rhel-8-0-24 mount[3664]: mount: /var/lib/nfs/rpc_pipefs: sunrpc already mounted on /var/lib/nfs/rpc_pipefs.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Mount process exited, code=exited status=32
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Failed with result 'exit-code'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to mount RPC Pipe File System.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for rpc_pipefs.target.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for NFSv4 ID-name mapping service.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: nfs-idmapd.service: Job nfs-idmapd.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for NFSv4 Client Tracking Daemon.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: nfsdcld.service: Job nfsdcld.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for RPC security service for NFS client and server.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: rpc-gssd.service: Job rpc-gssd.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: rpc_pipefs.target: Job rpc_pipefs.target/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-idmapd i: 1
Jun 14 22:47:34 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-idmapd i: 2
Jun 14 22:47:34 fastvm-rhel-8-0-24 kernel: NFSD: Using UMH upcall client tracking operations.
Jun 14 22:47:34 fastvm-rhel-8-0-24 kernel: NFSD: starting 90-second grace period (net f0000098)
Jun 14 22:47:34 fastvm-rhel-8-0-24 systemd[1]: Reloading GSSAPI Proxy Daemon.
Jun 14 22:47:34 fastvm-rhel-8-0-24 systemd[1]: Reloaded GSSAPI Proxy Daemon.
Jun 14 22:47:34 fastvm-rhel-8-0-24 systemd[1]: Started NFS server and services.
Jun 14 22:47:35 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-idmapd i: 3
Jun 14 22:47:36 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-idmapd i: 4
Jun 14 22:47:37 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-idmapd i: 5
Jun 14 22:47:38 fastvm-rhel-8-0-24 pacemaker-execd[1696]: warning: nfsdaemon_start_0 process (PID 3481) timed out
Jun 14 22:47:38 fastvm-rhel-8-0-24 pacemaker-execd[1696]: warning: nfsdaemon_start_0[3481] timed out after 5000ms


Note the "Failed to set up mount unit: Device or resource busy" message. This seems to be where the trouble begins. This doesn't happen if I use the updated LVM-activate with the fix.

I've done some investigation into what systemd units are in what state at what time, and what is mounted at any given time -- but none of it has been particularly enlightening. I figure it doesn't really matter why it's happening, as long as we know how to fix it.
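
In case it helps anyone digging further, these are standard commands for correlating the systemd unit state with the kernel's view of the mount at the point of failure (nothing here is specific to this bug):

  systemctl status var-lib-nfs-rpc_pipefs.mount    # unit state and last start/stop result
  systemctl list-jobs                              # any queued or running mount jobs
  findmnt /var/lib/nfs/rpc_pipefs                  # what the kernel actually has mounted there
  journalctl -b -u var-lib-nfs-rpc_pipefs.mount    # per-unit log history for this boot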

[1] LVM-activate: Stop before blk-availability.service (https://github.com/ClusterLabs/resource-agents/commit/79fb4b2d)
[2] [Question] About confirmation of "resource-agents-deps.target" at the start of LVM-activate resource. (https://github.com/ClusterLabs/resource-agents/issues/1647)
[3] LVM-activate: fix drop-in check to avoid re-creating drop-in file when it already exists (https://github.com/ClusterLabs/resource-agents/pull/1652)
[4] Merge pull request #1652 from oalbrigt/LVM-activate-fix-drop-in-check (https://github.com/ClusterLabs/resource-agents/commit/34e75a57)

-----

Version-Release number of selected component (if applicable):

resource-agents-4.1.1-90.el8.x86_64

-----

How reproducible:

Seemingly always

-----

Steps to Reproduce:
1. Disable and stop nfs-server.service on all nodes.
2. Configure an NFS server resource group with an export. (Operations omitted below; a sketch of equivalent commands follows these steps.)

Resources:
 Group: nfsgroup
  Resource: vg (class=ocf provider=heartbeat type=LVM-activate)
   Attributes: activation_mode=exclusive vg_access_mode=system_id vgname=cluster_vg
  Resource: fs (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/cluster_vg/cluster_lv1 directory=/mnt/nfsinfo fstype=ext4
  Resource: share_fs (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/cluster_vg/cluster_lv2 directory=/mnt/share fstype=ext4
  Resource: nfsdaemon (class=ocf provider=heartbeat type=nfsserver)
   Attributes: nfs_shared_infodir=/mnt/nfsinfo
  Resource: export (class=ocf provider=heartbeat type=exportfs)
   Attributes: clientspec=* directory=/mnt/share fsid=1
  Resource: vip (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=192.168.22.61

3. Move the group from its initial node to the other node and then back to its initial node. If this doesn't reproduce the issue immediately, try a couple more moves. If it still doesn't reproduce for some reason, reboot both nodes and try again. Testing was finicky for me at times, but I think that was due to some other things that were going on.
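
For reference, a sketch of commands that should produce roughly the configuration above. Resource names and attributes are taken from the pcs output; operation settings are left at their defaults, and device paths and the IP obviously need adjusting to the environment:

  # Step 1: keep nfs-server out of systemd's hands (run on all nodes).
  systemctl disable --now nfs-server.service

  # Step 2: build the resource group (operation options omitted).
  pcs resource create vg ocf:heartbeat:LVM-activate vgname=cluster_vg \
      vg_access_mode=system_id activation_mode=exclusive --group nfsgroup
  pcs resource create fs ocf:heartbeat:Filesystem device=/dev/cluster_vg/cluster_lv1 \
      directory=/mnt/nfsinfo fstype=ext4 --group nfsgroup
  pcs resource create share_fs ocf:heartbeat:Filesystem device=/dev/cluster_vg/cluster_lv2 \
      directory=/mnt/share fstype=ext4 --group nfsgroup
  pcs resource create nfsdaemon ocf:heartbeat:nfsserver nfs_shared_infodir=/mnt/nfsinfo \
      --group nfsgroup
  pcs resource create export ocf:heartbeat:exportfs clientspec='*' directory=/mnt/share \
      fsid=1 --group nfsgroup
  pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.168.22.61 --group nfsgroup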

-----

Actual results:

[root@fastvm-rhel-8-0-23 ~]# pcs resource status
  * Resource Group: nfsgroup:
    * vg	(ocf::heartbeat:LVM-activate):	 Started node2
    * fs	(ocf::heartbeat:Filesystem):	 Started node2
    * share_fs	(ocf::heartbeat:Filesystem):	 Started node2
    * nfsdaemon	(ocf::heartbeat:nfsserver):	 Started node2
    * export	(ocf::heartbeat:exportfs):	 Started node2
    * vip	(ocf::heartbeat:IPaddr2):	 Started node2

[root@fastvm-rhel-8-0-23 ~]# pcs resource clear nfsgroup && pcs resource move nfsgroup

[root@fastvm-rhel-8-0-23 ~]# pcs resource status
  * Resource Group: nfsgroup:
    * vg	(ocf::heartbeat:LVM-activate):	 Started node1
    * fs	(ocf::heartbeat:Filesystem):	 Started node1
    * share_fs	(ocf::heartbeat:Filesystem):	 Started node1
    * nfsdaemon	(ocf::heartbeat:nfsserver):	 Started node1
    * export	(ocf::heartbeat:exportfs):	 Started node1
    * vip	(ocf::heartbeat:IPaddr2):	 Started node1

[root@fastvm-rhel-8-0-23 ~]# pcs resource clear nfsgroup && pcs resource move nfsgroup

[root@fastvm-rhel-8-0-23 ~]# pcs resource status
  ...
  * Resource Group: nfsgroup:
    * vg	(ocf::heartbeat:LVM-activate):	 Started node2
    * fs	(ocf::heartbeat:Filesystem):	 Started node2
    * share_fs	(ocf::heartbeat:Filesystem):	 Started node2
    * nfsdaemon	(ocf::heartbeat:nfsserver):	 Stopped
    * export	(ocf::heartbeat:exportfs):	 Stopped
    * vip	(ocf::heartbeat:IPaddr2):	 Stopped

Failed Resource Actions:
  * nfsdaemon_start_0 on node2 'error' (1): call=59, status='Timed Out', exitreason='', last-rc-change='2021-06-14 22:56:08 -07:00', queued=0ms, exec=5002ms

-----

Expected results:

Same as above, except end state is 

  * Resource Group: nfsgroup:
    * vg	(ocf::heartbeat:LVM-activate):	 Started node2
    * fs	(ocf::heartbeat:Filesystem):	 Started node2
    * share_fs	(ocf::heartbeat:Filesystem):	 Started node2
    * nfsdaemon	(ocf::heartbeat:nfsserver):	 Started node2
    * export	(ocf::heartbeat:exportfs):	 Started node2
    * vip	(ocf::heartbeat:IPaddr2):	 Started node2

-----

Additional info:

We're going to need 8.4.z on this.


Comment 2 Reid Wahl 2021-06-15 09:53:40 UTC
I edited the description because I mistakenly said that it happens when **only** NFSv3 is enabled. I think NFSv3 does need to be enabled (so that RPC is used), but it doesn't have to be **only** NFSv3. I must have gotten confused during testing.

Comment 3 Oyvind Albrigtsen 2021-06-15 10:19:51 UTC

*** This bug has been marked as a duplicate of bug 1902208 ***

Comment 13 errata-xmlrpc 2021-11-09 17:27:32 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: resource-agents security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4139

