Bug 1972035

Summary: LVM-activate: Start operation always recreates drop-in file and runs systemctl daemon-reload
Product: Red Hat Enterprise Linux 8
Reporter: Reid Wahl <nwahl>
Component: resource-agents
Assignee: Oyvind Albrigtsen <oalbrigt>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: high
Priority: high
Version: 8.4
CC: agk, cfeist, cluster-maint, fdinitto, mjuricek, phagara
Target Milestone: beta
Keywords: Regression, Reopened, Triaged, ZStream
Target Release: ---
Hardware: All
OS: Linux
Fixed In Version: resource-agents-4.1.1-96.el8
Doc Type: If docs needed, set a value
Clones: 1972236 (view as bug list)
Bug Blocks: 1972236
Type: Bug
Last Closed: 2021-11-09 17:27:32 UTC

Description Reid Wahl 2021-06-15 06:13:16 UTC
Description of problem:

Due to a typo in upstream commit 79fb4b2d [1], LVM-activate calls systemd_drop_in every time it starts, which recreates the 99-LVM-activate.conf drop-in file and runs `systemctl daemon-reload`. See issue #1647 [2] and pull request #1652 [3].
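
For illustration, the guard in the agent's start path looks roughly like this (paraphrased from the upstream code, not an exact quote). As I read the diff in [3], the stray ".d" makes `systemctl show` query a nonexistent unit name, so $after is always empty, the "already configured" branch never matches, and the drop-in is rewritten (with a daemon-reload) on every start:

  if systemd_is_running; then
      # Typo: "resource-agents-deps.target.d" names the drop-in
      # *directory*, not the target unit, so this query returns nothing.
      after=$(systemctl show resource-agents-deps.target.d -p After | cut -d'=' -f2)

      case "$after" in
          *blk-availability.service*)
              # Drop-in already present; nothing to do.
              ;;
          *)
              # Always taken, because $after is empty above: rewrites
              # 99-LVM-activate.conf and runs daemon-reload. The fix in
              # [3] removes the stray ".d" from the systemctl query.
              systemd_drop_in "99-LVM-activate" "After" "blk-availability.service"
              ;;
      esac
  fi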

Somehow (this behavior continues to baffle me), this breaks the nfsserver resource's failover.

Consider the usual NFS cluster stack. (Note: the export seems to be required in order to reproduce this issue.)

  * Resource Group: nfsgroup:
    * vg	(ocf::heartbeat:LVM-activate):	 Started node2
    * fs	(ocf::heartbeat:Filesystem):	 Started node2
    * share_fs	(ocf::heartbeat:Filesystem):	 Started node2
    * nfsdaemon	(ocf::heartbeat:nfsserver):	 Started node2
    * export	(ocf::heartbeat:exportfs):	 Started node2
    * vip	(ocf::heartbeat:IPaddr2):	 Started node2

where "fs" is the nfs_shared_infodir filesystem.


If I move nfsgroup from node2 to node1 and then back to node2, the nfsserver start operation times out the **second** time it attempts to start on node2.

This problem disappears if I incorporate commit 34e75a57 [4] and reboot both nodes.
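
One way to incorporate that commit for testing is something like the following (a sketch only; it assumes GitHub's patch rendering of pull #1652 applies cleanly to the packaged agent under /usr/lib/ocf/resource.d):

  # On each node, patch the installed agent, then reboot as noted above.
  curl -sL https://github.com/ClusterLabs/resource-agents/pull/1652.patch \
      -o /tmp/lvm-activate-fix.patch
  patch -d /usr/lib/ocf/resource.d -p1 < /tmp/lvm-activate-fix.patch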


I haven't been able to figure out **why** the call to systemd_drop_in during every LVM-activate startup causes a problem for the nfsserver resource. But somehow it's affecting the rpc_pipefs mount and/or the var-lib-nfs-rpc_pipefs.mount systemd unit. Here are logs from a second startup on node2, which timed out.

Jun 14 22:47:33 fastvm-rhel-8-0-24 pacemaker-controld[1699]: notice: Requesting local execution of start operation for nfsdaemon on node2
Jun 14 22:47:33 fastvm-rhel-8-0-24 pacemaker-controld[1699]: notice: Result of monitor operation for share_fs on node2: ok
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: ERROR: nfs-mountd is not running
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: /var/lib/nfs/rpc_pipefs already mounted. Unmounting in preparation to bind mount nfs dir
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Stopped target rpc_pipefs.target.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Unmounting RPC Pipe File System...
Jun 14 22:47:33 fastvm-rhel-8-0-24 umount[3586]: umount: /var/lib/nfs/rpc_pipefs: not mounted.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to set up mount unit: Device or resource busy
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to set up mount unit: Device or resource busy
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Mount process exited, code=exited status=32
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Failed with result 'exit-code'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Unmounted RPC Pipe File System.
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Starting NFS server ...
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: rpcbind i: 1
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Starting Preprocess NFS configuration convertion...
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: nfs-convert.service: Succeeded.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Started Preprocess NFS configuration convertion.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Starting NFS status monitor for NFSv2/3 locking....
Jun 14 22:47:33 fastvm-rhel-8-0-24 rpc.statd[3641]: Version 2.3.3 starting
Jun 14 22:47:33 fastvm-rhel-8-0-24 rpc.statd[3641]: Flags: TI-RPC
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Started NFS status monitor for NFSv2/3 locking..
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Directory /var/lib/nfs/rpc_pipefs to mount over is not empty, mounting anyway.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Mounting RPC Pipe File System...
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Starting NFS Mount Daemon...
Jun 14 22:47:33 fastvm-rhel-8-0-24 mount[3645]: mount: /var/lib/nfs/rpc_pipefs: sunrpc already mounted on /var/lib/nfs/rpc_pipefs.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to set up mount unit: Device or resource busy
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to set up mount unit: Device or resource busy
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Mount process exited, code=exited status=32
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Failed with result 'exit-code'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to mount RPC Pipe File System.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for rpc_pipefs.target.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for NFSv4 ID-name mapping service.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: nfs-idmapd.service: Job nfs-idmapd.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for NFSv4 Client Tracking Daemon.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: nfsdcld.service: Job nfsdcld.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for RPC security service for NFS client and server.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: rpc-gssd.service: Job rpc-gssd.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: rpc_pipefs.target: Job rpc_pipefs.target/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 rpc.mountd[3648]: Version 2.3.3 starting
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Started NFS Mount Daemon.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Starting NFS server and services...
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-mountd i: 1
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Directory /var/lib/nfs/rpc_pipefs to mount over is not empty, mounting anyway.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Mounting RPC Pipe File System...
Jun 14 22:47:33 fastvm-rhel-8-0-24 mount[3664]: mount: /var/lib/nfs/rpc_pipefs: sunrpc already mounted on /var/lib/nfs/rpc_pipefs.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Mount process exited, code=exited status=32
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Failed with result 'exit-code'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to mount RPC Pipe File System.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for rpc_pipefs.target.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for NFSv4 ID-name mapping service.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: nfs-idmapd.service: Job nfs-idmapd.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for NFSv4 Client Tracking Daemon.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: nfsdcld.service: Job nfsdcld.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for RPC security service for NFS client and server.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: rpc-gssd.service: Job rpc-gssd.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: rpc_pipefs.target: Job rpc_pipefs.target/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-idmapd i: 1
Jun 14 22:47:34 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-idmapd i: 2
Jun 14 22:47:34 fastvm-rhel-8-0-24 kernel: NFSD: Using UMH upcall client tracking operations.
Jun 14 22:47:34 fastvm-rhel-8-0-24 kernel: NFSD: starting 90-second grace period (net f0000098)
Jun 14 22:47:34 fastvm-rhel-8-0-24 systemd[1]: Reloading GSSAPI Proxy Daemon.
Jun 14 22:47:34 fastvm-rhel-8-0-24 systemd[1]: Reloaded GSSAPI Proxy Daemon.
Jun 14 22:47:34 fastvm-rhel-8-0-24 systemd[1]: Started NFS server and services.
Jun 14 22:47:35 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-idmapd i: 3
Jun 14 22:47:36 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-idmapd i: 4
Jun 14 22:47:37 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-idmapd i: 5
Jun 14 22:47:38 fastvm-rhel-8-0-24 pacemaker-execd[1696]: warning: nfsdaemon_start_0 process (PID 3481) timed out
Jun 14 22:47:38 fastvm-rhel-8-0-24 pacemaker-execd[1696]: warning: nfsdaemon_start_0[3481] timed out after 5000ms


Note the "Failed to set up mount unit: Device or resource busy" message. This seems to be where the trouble begins, and it doesn't appear when I use the updated LVM-activate with the fix.

I've done some investigation into what systemd units are in what state at what time, and what is mounted at any given time -- but none of it has been particularly enlightening. I figure it doesn't really matter why it's happening, as long as we know how to fix it.
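
For anyone retracing this, the state I was poking at can be gathered with something like the following (a diagnostic sketch, run on the node where the start times out):

  # Is the drop-in rewritten on each LVM-activate start? Watch its mtime.
  stat -c '%y %n' /run/systemd/system/resource-agents-deps.target.d/99-LVM-activate.conf

  # systemd logs "Reloading." from PID 1 whenever daemon-reload runs.
  journalctl -b _PID=1 | grep -i reloading

  # Compare systemd's view of the mount unit with what is actually mounted.
  systemctl status var-lib-nfs-rpc_pipefs.mount
  findmnt /var/lib/nfs/rpc_pipefs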

[1] LVM-activate: Stop before blk-availability.service (https://github.com/ClusterLabs/resource-agents/commit/79fb4b2d)
[2] [Question] About confirmation of "resource-agents-deps.target" at the start of LVM-activate resource. (https://github.com/ClusterLabs/resource-agents/issues/1647)
[3] LVM-activate: fix drop-in check to avoid re-creating drop-in file when it already exists (https://github.com/ClusterLabs/resource-agents/pull/1652)
[4] Merge pull request #1652 from oalbrigt/LVM-activate-fix-drop-in-check (https://github.com/ClusterLabs/resource-agents/commit/34e75a57)

-----

Version-Release number of selected component (if applicable):

resource-agents-4.1.1-90.el8.x86_64

-----

How reproducible:

Seemingly always

-----

Steps to Reproduce:
1. Disable and stop nfs-server.service on all nodes.
2. Configure an NFS server resource group with an export. (Operations omitted below; a pcs sketch of this configuration follows step 3.)

Resources:
 Group: nfsgroup
  Resource: vg (class=ocf provider=heartbeat type=LVM-activate)
   Attributes: activation_mode=exclusive vg_access_mode=system_id vgname=cluster_vg
  Resource: fs (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/cluster_vg/cluster_lv1 directory=/mnt/nfsinfo fstype=ext4
  Resource: share_fs (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/cluster_vg/cluster_lv2 directory=/mnt/share fstype=ext4
  Resource: nfsdaemon (class=ocf provider=heartbeat type=nfsserver)
   Attributes: nfs_shared_infodir=/mnt/nfsinfo
  Resource: export (class=ocf provider=heartbeat type=exportfs)
   Attributes: clientspec=* directory=/mnt/share fsid=1
  Resource: vip (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=192.168.22.61

3. Move the group from its initial node to the other node and then back to its initial node. If this doesn't reproduce the issue immediately, try a couple more moves. If it still doesn't reproduce for some reason, reboot both nodes and try again. Testing was finicky for me at times, but I believe that was due to unrelated factors in my environment.
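
For completeness, steps 1 and 2 can be reproduced with commands along these lines (a sketch; operation options are left at their defaults, and the device paths, mountpoints, and VIP address match the listing above):

  # Step 1, on every node:
  systemctl disable --now nfs-server.service

  # Step 2, from any one node:
  pcs resource create vg ocf:heartbeat:LVM-activate \
      vgname=cluster_vg vg_access_mode=system_id activation_mode=exclusive \
      --group nfsgroup
  pcs resource create fs ocf:heartbeat:Filesystem \
      device=/dev/cluster_vg/cluster_lv1 directory=/mnt/nfsinfo fstype=ext4 \
      --group nfsgroup
  pcs resource create share_fs ocf:heartbeat:Filesystem \
      device=/dev/cluster_vg/cluster_lv2 directory=/mnt/share fstype=ext4 \
      --group nfsgroup
  pcs resource create nfsdaemon ocf:heartbeat:nfsserver \
      nfs_shared_infodir=/mnt/nfsinfo --group nfsgroup
  pcs resource create export ocf:heartbeat:exportfs \
      clientspec='*' directory=/mnt/share fsid=1 --group nfsgroup
  pcs resource create vip ocf:heartbeat:IPaddr2 \
      ip=192.168.22.61 --group nfsgroup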

-----

Actual results:

[root@fastvm-rhel-8-0-23 ~]# pcs resource status
  * Resource Group: nfsgroup:
    * vg	(ocf::heartbeat:LVM-activate):	 Started node2
    * fs	(ocf::heartbeat:Filesystem):	 Started node2
    * share_fs	(ocf::heartbeat:Filesystem):	 Started node2
    * nfsdaemon	(ocf::heartbeat:nfsserver):	 Started node2
    * export	(ocf::heartbeat:exportfs):	 Started node2
    * vip	(ocf::heartbeat:IPaddr2):	 Started node2

[root@fastvm-rhel-8-0-23 ~]# pcs resource clear nfsgroup && pcs resource move nfsgroup

[root@fastvm-rhel-8-0-23 ~]# pcs resource status
  * Resource Group: nfsgroup:
    * vg	(ocf::heartbeat:LVM-activate):	 Started node1
    * fs	(ocf::heartbeat:Filesystem):	 Started node1
    * share_fs	(ocf::heartbeat:Filesystem):	 Started node1
    * nfsdaemon	(ocf::heartbeat:nfsserver):	 Started node1
    * export	(ocf::heartbeat:exportfs):	 Started node1
    * vip	(ocf::heartbeat:IPaddr2):	 Started node1

[root@fastvm-rhel-8-0-23 ~]# pcs resource clear nfsgroup && pcs resource move nfsgroup

[root@fastvm-rhel-8-0-23 ~]# pcs resource status
  ...
  * Resource Group: nfsgroup:
    * vg	(ocf::heartbeat:LVM-activate):	 Started node2
    * fs	(ocf::heartbeat:Filesystem):	 Started node2
    * share_fs	(ocf::heartbeat:Filesystem):	 Started node2
    * nfsdaemon	(ocf::heartbeat:nfsserver):	 Stopped
    * export	(ocf::heartbeat:exportfs):	 Stopped
    * vip	(ocf::heartbeat:IPaddr2):	 Stopped

Failed Resource Actions:
  * nfsdaemon_start_0 on node2 'error' (1): call=59, status='Timed Out', exitreason='', last-rc-change='2021-06-14 22:56:08 -07:00', queued=0ms, exec=5002ms

-----

Expected results:

Same as above, except that the end state is:

  * Resource Group: nfsgroup:
    * vg	(ocf::heartbeat:LVM-activate):	 Started node2
    * fs	(ocf::heartbeat:Filesystem):	 Started node2
    * share_fs	(ocf::heartbeat:Filesystem):	 Started node2
    * nfsdaemon	(ocf::heartbeat:nfsserver):	 Started node2
    * export	(ocf::heartbeat:exportfs):	 Started node2
    * vip	(ocf::heartbeat:IPaddr2):	 Started node2

-----

Additional info:

We're going to need 8.4.z on this.


Comment 2 Reid Wahl 2021-06-15 09:53:40 UTC
I edited the description because I mistakenly said that it happens when **only** NFSv3 is enabled. I think NFSv3 does need to be enabled (so that RPC is used), but it doesn't have to be **only** NFSv3. I must have gotten confused during testing.

Comment 3 Oyvind Albrigtsen 2021-06-15 10:19:51 UTC

*** This bug has been marked as a duplicate of bug 1902208 ***

Comment 13 errata-xmlrpc 2021-11-09 17:27:32 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: resource-agents security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4139