Bug 1972035

Summary: LVM-activate: Start operation always recreates drop-in file and runs systemctl daemon-reload
Product: Red Hat Enterprise Linux 8
Reporter: Reid Wahl <nwahl>
Component: resource-agents
Assignee: Oyvind Albrigtsen <oalbrigt>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: high
Priority: high
Version: 8.4
CC: agk, cfeist, cluster-maint, fdinitto, mjuricek, phagara
Target Milestone: beta
Keywords: Regression, Reopened, Triaged, ZStream
Target Release: ---
Hardware: All
OS: Linux
Fixed In Version: resource-agents-4.1.1-96.el8
Doc Type: If docs needed, set a value
Clones: 1972236 (view as bug list)
Bug Blocks: 1972236
Type: Bug
Last Closed: 2021-11-09 17:27:32 UTC

Description Reid Wahl 2021-06-15 06:13:16 UTC
Description of problem:

Due to a typo in upstream commit 79fb4b2d [1], LVM-activate calls systemd_drop_in every time it starts, which recreates the 99-LVM-activate.conf drop-in file and runs `systemctl daemon-reload`. See issue #1647 [2] and pull request #1652 [3].
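
For illustration, the guard in the agent's start path looks roughly like this (paraphrased from the upstream code, not an exact quote). As I read the diff in [3], the stray ".d" makes `systemctl show` query a nonexistent unit name, so $after is always empty, the "already configured" branch never matches, and the drop-in is rewritten (with a daemon-reload) on every start:

  if systemd_is_running; then
      # Typo: "resource-agents-deps.target.d" names the drop-in
      # *directory*, not the target unit, so this query returns nothing.
      after=$(systemctl show resource-agents-deps.target.d -p After | cut -d'=' -f2)

      case "$after" in
          *blk-availability.service*)
              # Drop-in already present; nothing to do.
              ;;
          *)
              # Always taken, because $after is empty above: rewrites
              # 99-LVM-activate.conf and runs daemon-reload. The fix in
              # [3] removes the stray ".d" from the systemctl query.
              systemd_drop_in "99-LVM-activate" "After" "blk-availability.service"
              ;;
      esac
  fi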

Somehow (this behavior continues to baffle me), this breaks the nfsserver resource's failover.

Consider the usual NFS cluster stack. (Note: the export seems to be required in order to reproduce this issue.)

  * Resource Group: nfsgroup:
    * vg	(ocf::heartbeat:LVM-activate):	 Started node2
    * fs	(ocf::heartbeat:Filesystem):	 Started node2
    * share_fs	(ocf::heartbeat:Filesystem):	 Started node2
    * nfsdaemon	(ocf::heartbeat:nfsserver):	 Started node2
    * export	(ocf::heartbeat:exportfs):	 Started node2
    * vip	(ocf::heartbeat:IPaddr2):	 Started node2

where "fs" is the nfs_shared_infodir filesystem.


If I move nfsgroup from node2 to node1 and then back to node2, the nfsserver start operation times out the **second** time it attempts to start on node2.

This problem disappears if I incorporate commit 34e75a57 [4] and reboot both nodes.
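
One way to incorporate that commit for testing is something like the following (a sketch only; it assumes GitHub's patch rendering of pull #1652 applies cleanly to the packaged agent under /usr/lib/ocf/resource.d):

  # On each node, patch the installed agent, then reboot as noted above.
  curl -sL https://github.com/ClusterLabs/resource-agents/pull/1652.patch \
      -o /tmp/lvm-activate-fix.patch
  patch -d /usr/lib/ocf/resource.d -p1 < /tmp/lvm-activate-fix.patch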


I haven't been able to figure out **why** the call to systemd_drop_in during every LVM-activate startup causes a problem for the nfsserver resource. But somehow it's affecting the rpc_pipefs mount and/or the var-lib-nfs-rpc_pipefs.mount systemd unit. Here are logs from a second startup on node2, which timed out.

Jun 14 22:47:33 fastvm-rhel-8-0-24 pacemaker-controld[1699]: notice: Requesting local execution of start operation for nfsdaemon on node2
Jun 14 22:47:33 fastvm-rhel-8-0-24 pacemaker-controld[1699]: notice: Result of monitor operation for share_fs on node2: ok
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: ERROR: nfs-mountd is not running
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: /var/lib/nfs/rpc_pipefs already mounted. Unmounting in preparation to bind mount nfs dir
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Stopped target rpc_pipefs.target.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Unmounting RPC Pipe File System...
Jun 14 22:47:33 fastvm-rhel-8-0-24 umount[3586]: umount: /var/lib/nfs/rpc_pipefs: not mounted.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to set up mount unit: Device or resource busy
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to set up mount unit: Device or resource busy
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Mount process exited, code=exited status=32
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Failed with result 'exit-code'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Unmounted RPC Pipe File System.
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Starting NFS server ...
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: rpcbind i: 1
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Starting Preprocess NFS configuration convertion...
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: nfs-convert.service: Succeeded.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Started Preprocess NFS configuration convertion.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Starting NFS status monitor for NFSv2/3 locking....
Jun 14 22:47:33 fastvm-rhel-8-0-24 rpc.statd[3641]: Version 2.3.3 starting
Jun 14 22:47:33 fastvm-rhel-8-0-24 rpc.statd[3641]: Flags: TI-RPC
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Started NFS status monitor for NFSv2/3 locking..
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Directory /var/lib/nfs/rpc_pipefs to mount over is not empty, mounting anyway.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Mounting RPC Pipe File System...
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Starting NFS Mount Daemon...
Jun 14 22:47:33 fastvm-rhel-8-0-24 mount[3645]: mount: /var/lib/nfs/rpc_pipefs: sunrpc already mounted on /var/lib/nfs/rpc_pipefs.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to set up mount unit: Device or resource busy
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to set up mount unit: Device or resource busy
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Mount process exited, code=exited status=32
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Failed with result 'exit-code'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to mount RPC Pipe File System.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for rpc_pipefs.target.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for NFSv4 ID-name mapping service.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: nfs-idmapd.service: Job nfs-idmapd.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for NFSv4 Client Tracking Daemon.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: nfsdcld.service: Job nfsdcld.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for RPC security service for NFS client and server.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: rpc-gssd.service: Job rpc-gssd.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: rpc_pipefs.target: Job rpc_pipefs.target/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 rpc.mountd[3648]: Version 2.3.3 starting
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Started NFS Mount Daemon.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Starting NFS server and services...
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-mountd i: 1
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Directory /var/lib/nfs/rpc_pipefs to mount over is not empty, mounting anyway.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Mounting RPC Pipe File System...
Jun 14 22:47:33 fastvm-rhel-8-0-24 mount[3664]: mount: /var/lib/nfs/rpc_pipefs: sunrpc already mounted on /var/lib/nfs/rpc_pipefs.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Mount process exited, code=exited status=32
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Failed with result 'exit-code'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to mount RPC Pipe File System.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for rpc_pipefs.target.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for NFSv4 ID-name mapping service.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: nfs-idmapd.service: Job nfs-idmapd.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for NFSv4 Client Tracking Daemon.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: nfsdcld.service: Job nfsdcld.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for RPC security service for NFS client and server.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: rpc-gssd.service: Job rpc-gssd.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: rpc_pipefs.target: Job rpc_pipefs.target/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-idmapd i: 1
Jun 14 22:47:34 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-idmapd i: 2
Jun 14 22:47:34 fastvm-rhel-8-0-24 kernel: NFSD: Using UMH upcall client tracking operations.
Jun 14 22:47:34 fastvm-rhel-8-0-24 kernel: NFSD: starting 90-second grace period (net f0000098)
Jun 14 22:47:34 fastvm-rhel-8-0-24 systemd[1]: Reloading GSSAPI Proxy Daemon.
Jun 14 22:47:34 fastvm-rhel-8-0-24 systemd[1]: Reloaded GSSAPI Proxy Daemon.
Jun 14 22:47:34 fastvm-rhel-8-0-24 systemd[1]: Started NFS server and services.
Jun 14 22:47:35 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-idmapd i: 3
Jun 14 22:47:36 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-idmapd i: 4
Jun 14 22:47:37 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-idmapd i: 5
Jun 14 22:47:38 fastvm-rhel-8-0-24 pacemaker-execd[1696]: warning: nfsdaemon_start_0 process (PID 3481) timed out
Jun 14 22:47:38 fastvm-rhel-8-0-24 pacemaker-execd[1696]: warning: nfsdaemon_start_0[3481] timed out after 5000ms


Note the "Failed to set up mount unit: Device or resource busy" message. This seems to be where the trouble begins, and it doesn't appear when I use the updated LVM-activate with the fix.

I've done some investigation into what systemd units are in what state at what time, and what is mounted at any given time -- but none of it has been particularly enlightening. I figure it doesn't really matter why it's happening, as long as we know how to fix it.
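
For anyone retracing this, the state I was poking at can be gathered with something like the following (a diagnostic sketch, run on the node where the start times out):

  # Is the drop-in rewritten on each LVM-activate start? Watch its mtime.
  stat -c '%y %n' /run/systemd/system/resource-agents-deps.target.d/99-LVM-activate.conf

  # systemd logs "Reloading." from PID 1 whenever daemon-reload runs.
  journalctl -b _PID=1 | grep -i reloading

  # Compare systemd's view of the mount unit with what is actually mounted.
  systemctl status var-lib-nfs-rpc_pipefs.mount
  findmnt /var/lib/nfs/rpc_pipefs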

[1] LVM-activate: Stop before blk-availability.service (https://github.com/ClusterLabs/resource-agents/commit/79fb4b2d)
[2] [Question] About confirmation of "resource-agents-deps.target" at the start of LVM-activate resource. (https://github.com/ClusterLabs/resource-agents/issues/1647)
[3] LVM-activate: fix drop-in check to avoid re-creating drop-in file when it already exists (https://github.com/ClusterLabs/resource-agents/pull/1652)
[4] Merge pull request #1652 from oalbrigt/LVM-activate-fix-drop-in-check (https://github.com/ClusterLabs/resource-agents/commit/34e75a57)

-----

Version-Release number of selected component (if applicable):

resource-agents-4.1.1-90.el8.x86_64

-----

How reproducible:

Seemingly always

-----

Steps to Reproduce:
1. Disable and stop nfs-server.service on all nodes.
2. Configure an NFS server resource group with an export. (Operations omitted below; a pcs sketch of this configuration follows step 3.)

Resources:
 Group: nfsgroup
  Resource: vg (class=ocf provider=heartbeat type=LVM-activate)
   Attributes: activation_mode=exclusive vg_access_mode=system_id vgname=cluster_vg
  Resource: fs (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/cluster_vg/cluster_lv1 directory=/mnt/nfsinfo fstype=ext4
  Resource: share_fs (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/cluster_vg/cluster_lv2 directory=/mnt/share fstype=ext4
  Resource: nfsdaemon (class=ocf provider=heartbeat type=nfsserver)
   Attributes: nfs_shared_infodir=/mnt/nfsinfo
  Resource: export (class=ocf provider=heartbeat type=exportfs)
   Attributes: clientspec=* directory=/mnt/share fsid=1
  Resource: vip (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=192.168.22.61

3. Move the group from its initial node to the other node and then back to its initial node. If this doesn't reproduce the issue immediately, try a couple more moves. If it still doesn't reproduce for some reason, reboot both nodes and try again. Testing was finicky for me at times, but I believe that was due to unrelated factors in my environment.
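
For completeness, steps 1 and 2 can be reproduced with commands along these lines (a sketch; operation options are left at their defaults, and the device paths, mountpoints, and VIP address match the listing above):

  # Step 1, on every node:
  systemctl disable --now nfs-server.service

  # Step 2, from any one node:
  pcs resource create vg ocf:heartbeat:LVM-activate \
      vgname=cluster_vg vg_access_mode=system_id activation_mode=exclusive \
      --group nfsgroup
  pcs resource create fs ocf:heartbeat:Filesystem \
      device=/dev/cluster_vg/cluster_lv1 directory=/mnt/nfsinfo fstype=ext4 \
      --group nfsgroup
  pcs resource create share_fs ocf:heartbeat:Filesystem \
      device=/dev/cluster_vg/cluster_lv2 directory=/mnt/share fstype=ext4 \
      --group nfsgroup
  pcs resource create nfsdaemon ocf:heartbeat:nfsserver \
      nfs_shared_infodir=/mnt/nfsinfo --group nfsgroup
  pcs resource create export ocf:heartbeat:exportfs \
      clientspec='*' directory=/mnt/share fsid=1 --group nfsgroup
  pcs resource create vip ocf:heartbeat:IPaddr2 \
      ip=192.168.22.61 --group nfsgroup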

-----

Actual results:

[root@fastvm-rhel-8-0-23 ~]# pcs resource status
  * Resource Group: nfsgroup:
    * vg	(ocf::heartbeat:LVM-activate):	 Started node2
    * fs	(ocf::heartbeat:Filesystem):	 Started node2
    * share_fs	(ocf::heartbeat:Filesystem):	 Started node2
    * nfsdaemon	(ocf::heartbeat:nfsserver):	 Started node2
    * export	(ocf::heartbeat:exportfs):	 Started node2
    * vip	(ocf::heartbeat:IPaddr2):	 Started node2

[root@fastvm-rhel-8-0-23 ~]# pcs resource clear nfsgroup && pcs resource move nfsgroup

[root@fastvm-rhel-8-0-23 ~]# pcs resource status
  * Resource Group: nfsgroup:
    * vg	(ocf::heartbeat:LVM-activate):	 Started node1
    * fs	(ocf::heartbeat:Filesystem):	 Started node1
    * share_fs	(ocf::heartbeat:Filesystem):	 Started node1
    * nfsdaemon	(ocf::heartbeat:nfsserver):	 Started node1
    * export	(ocf::heartbeat:exportfs):	 Started node1
    * vip	(ocf::heartbeat:IPaddr2):	 Started node1

[root@fastvm-rhel-8-0-23 ~]# pcs resource clear nfsgroup && pcs resource move nfsgroup

[root@fastvm-rhel-8-0-23 ~]# pcs resource status
  ...
  * Resource Group: nfsgroup:
    * vg	(ocf::heartbeat:LVM-activate):	 Started node2
    * fs	(ocf::heartbeat:Filesystem):	 Started node2
    * share_fs	(ocf::heartbeat:Filesystem):	 Started node2
    * nfsdaemon	(ocf::heartbeat:nfsserver):	 Stopped
    * export	(ocf::heartbeat:exportfs):	 Stopped
    * vip	(ocf::heartbeat:IPaddr2):	 Stopped

Failed Resource Actions:
  * nfsdaemon_start_0 on node2 'error' (1): call=59, status='Timed Out', exitreason='', last-rc-change='2021-06-14 22:56:08 -07:00', queued=0ms, exec=5002ms

-----

Expected results:

Same as above, except that the end state is:

  * Resource Group: nfsgroup:
    * vg	(ocf::heartbeat:LVM-activate):	 Started node2
    * fs	(ocf::heartbeat:Filesystem):	 Started node2
    * share_fs	(ocf::heartbeat:Filesystem):	 Started node2
    * nfsdaemon	(ocf::heartbeat:nfsserver):	 Started node2
    * export	(ocf::heartbeat:exportfs):	 Started node2
    * vip	(ocf::heartbeat:IPaddr2):	 Started node2

-----

Additional info:

We're going to need 8.4.z on this.


Comment 2 Reid Wahl 2021-06-15 09:53:40 UTC
I edited the description because I mistakenly said that it happens when **only** NFSv3 is enabled. I think NFSv3 does need to be enabled (so that RPC is used), but it doesn't have to be **only** NFSv3. I must have gotten confused during testing.

Comment 3 Oyvind Albrigtsen 2021-06-15 10:19:51 UTC

*** This bug has been marked as a duplicate of bug 1902208 ***

Comment 13 errata-xmlrpc 2021-11-09 17:27:32 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: resource-agents security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4139