Bug 1972035
| Summary: | LVM-activate: Start operation always recreates drop-in file and runs systemctl daemon-reload | | | |
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Reid Wahl <nwahl> | |
| Component: | resource-agents | Assignee: | Oyvind Albrigtsen <oalbrigt> | |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> | |
| Severity: | high | Docs Contact: | ||
| Priority: | high | |||
| Version: | 8.4 | CC: | agk, cfeist, cluster-maint, fdinitto, mjuricek, phagara | |
| Target Milestone: | beta | Keywords: | Regression, Reopened, Triaged, ZStream | |
| Target Release: | --- | |||
| Hardware: | All | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | resource-agents-4.1.1-96.el8 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1972236 | Environment: | |
| Last Closed: | 2021-11-09 17:27:32 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1972236 | |||
I edited the description because I mistakenly said that it happens when **only** NFSv3 is enabled. I think NFSv3 does need to be enabled (so that RPC is used), but it doesn't have to be **only** NFSv3. I must have gotten confused during testing.

*** This bug has been marked as a duplicate of bug 1902208 ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: resource-agents security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4139
Description of problem:

Due to a typo in upstream commit 79fb4b2d [1], LVM-activate calls systemd_drop_in every time it starts, which recreates the 99-LVM-activate.conf file and runs `systemctl daemon-reload`. See issue #1647 [2] and pull request #1652 [3].

Somehow (this behavior continues to baffle me), this breaks the nfsserver resource's failover. Consider the usual NFS cluster stack. (Note: the export seems to be required in order to reproduce this issue.)

```
  * Resource Group: nfsgroup:
    * vg          (ocf::heartbeat:LVM-activate):   Started node2
    * fs          (ocf::heartbeat:Filesystem):     Started node2
    * share_fs    (ocf::heartbeat:Filesystem):     Started node2
    * nfsdaemon   (ocf::heartbeat:nfsserver):      Started node2
    * export      (ocf::heartbeat:exportfs):       Started node2
    * vip         (ocf::heartbeat:IPaddr2):        Started node2
```

where "fs" is the nfs_shared_infodir filesystem.

If I move nfsgroup from node2 to node1 and then back to node2, the nfsserver start operation times out the **second** time it attempts to start on node2. This problem disappears if I incorporate commit 34e75a57 [4] and reboot both nodes.

I haven't been able to figure out **why** the call to systemd_drop_in during every LVM-activate startup causes a problem for the nfsserver resource. But somehow it's affecting the rpc_pipefs mount and/or the var-lib-nfs-rpc_pipefs.mount systemd unit.

Here are logs from a second startup on node2, which timed out.

```
Jun 14 22:47:33 fastvm-rhel-8-0-24 pacemaker-controld[1699]: notice: Requesting local execution of start operation for nfsdaemon on node2
Jun 14 22:47:33 fastvm-rhel-8-0-24 pacemaker-controld[1699]: notice: Result of monitor operation for share_fs on node2: ok
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: ERROR: nfs-mountd is not running
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: /var/lib/nfs/rpc_pipefs already mounted. Unmounting in preparation to bind mount nfs dir
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Stopped target rpc_pipefs.target.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Unmounting RPC Pipe File System...
Jun 14 22:47:33 fastvm-rhel-8-0-24 umount[3586]: umount: /var/lib/nfs/rpc_pipefs: not mounted.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to set up mount unit: Device or resource busy
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to set up mount unit: Device or resource busy
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Mount process exited, code=exited status=32
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Failed with result 'exit-code'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Unmounted RPC Pipe File System.
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Starting NFS server ...
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: rpcbind i: 1
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Starting Preprocess NFS configuration convertion...
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: nfs-convert.service: Succeeded.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Started Preprocess NFS configuration convertion.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Starting NFS status monitor for NFSv2/3 locking....
Jun 14 22:47:33 fastvm-rhel-8-0-24 rpc.statd[3641]: Version 2.3.3 starting
Jun 14 22:47:33 fastvm-rhel-8-0-24 rpc.statd[3641]: Flags: TI-RPC
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Started NFS status monitor for NFSv2/3 locking..
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Directory /var/lib/nfs/rpc_pipefs to mount over is not empty, mounting anyway.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Mounting RPC Pipe File System...
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Starting NFS Mount Daemon...
Jun 14 22:47:33 fastvm-rhel-8-0-24 mount[3645]: mount: /var/lib/nfs/rpc_pipefs: sunrpc already mounted on /var/lib/nfs/rpc_pipefs.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to set up mount unit: Device or resource busy
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to set up mount unit: Device or resource busy
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Mount process exited, code=exited status=32
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Failed with result 'exit-code'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to mount RPC Pipe File System.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for rpc_pipefs.target.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for NFSv4 ID-name mapping service.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: nfs-idmapd.service: Job nfs-idmapd.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for NFSv4 Client Tracking Daemon.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: nfsdcld.service: Job nfsdcld.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for RPC security service for NFS client and server.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: rpc-gssd.service: Job rpc-gssd.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: rpc_pipefs.target: Job rpc_pipefs.target/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 rpc.mountd[3648]: Version 2.3.3 starting
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Started NFS Mount Daemon.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Starting NFS server and services...
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-mountd i: 1
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Directory /var/lib/nfs/rpc_pipefs to mount over is not empty, mounting anyway.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Mounting RPC Pipe File System...
Jun 14 22:47:33 fastvm-rhel-8-0-24 mount[3664]: mount: /var/lib/nfs/rpc_pipefs: sunrpc already mounted on /var/lib/nfs/rpc_pipefs.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Mount process exited, code=exited status=32
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: var-lib-nfs-rpc_pipefs.mount: Failed with result 'exit-code'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Failed to mount RPC Pipe File System.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for rpc_pipefs.target.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for NFSv4 ID-name mapping service.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: nfs-idmapd.service: Job nfs-idmapd.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for NFSv4 Client Tracking Daemon.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: nfsdcld.service: Job nfsdcld.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: Dependency failed for RPC security service for NFS client and server.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: rpc-gssd.service: Job rpc-gssd.service/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 systemd[1]: rpc_pipefs.target: Job rpc_pipefs.target/start failed with result 'dependency'.
Jun 14 22:47:33 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-idmapd i: 1
Jun 14 22:47:34 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-idmapd i: 2
Jun 14 22:47:34 fastvm-rhel-8-0-24 kernel: NFSD: Using UMH upcall client tracking operations.
Jun 14 22:47:34 fastvm-rhel-8-0-24 kernel: NFSD: starting 90-second grace period (net f0000098)
Jun 14 22:47:34 fastvm-rhel-8-0-24 systemd[1]: Reloading GSSAPI Proxy Daemon.
Jun 14 22:47:34 fastvm-rhel-8-0-24 systemd[1]: Reloaded GSSAPI Proxy Daemon.
Jun 14 22:47:34 fastvm-rhel-8-0-24 systemd[1]: Started NFS server and services.
Jun 14 22:47:35 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-idmapd i: 3
Jun 14 22:47:36 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-idmapd i: 4
Jun 14 22:47:37 fastvm-rhel-8-0-24 nfsserver(nfsdaemon)[3481]: INFO: Start: nfs-idmapd i: 5
Jun 14 22:47:38 fastvm-rhel-8-0-24 pacemaker-execd[1696]: warning: nfsdaemon_start_0 process (PID 3481) timed out
Jun 14 22:47:38 fastvm-rhel-8-0-24 pacemaker-execd[1696]: warning: nfsdaemon_start_0[3481] timed out after 5000ms
```

Note the "Failed to set up mount unit: Device or resource busy" message. This seems to be where the trouble begins. This doesn't happen if I use the updated LVM-activate with the fix.

I've done some investigation into what systemd units are in what state at what time, and what is mounted at any given time -- but none of it has been particularly enlightening. I figure it doesn't really matter why it's happening, as long as we know how to fix it.

[1] LVM-activate: Stop before blk-availability.service (https://github.com/ClusterLabs/resource-agents/commit/79fb4b2d)
[2] [Question] About confirmation of "resource-agents-deps.target" at the start of LVM-activate resource. (https://github.com/ClusterLabs/resource-agents/issues/1647)
[3] LVM-activate: fix drop-in check to avoid re-creating drop-in file when it already exists (https://github.com/ClusterLabs/resource-agents/pull/1652)
[4] Merge pull request #1652 from oalbrigt/LVM-activate-fix-drop-in-check (https://github.com/ClusterLabs/resource-agents/commit/34e75a57)
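For reference, here is a minimal sketch of the guard idea in pull request #1652 [3]: only write the drop-in (and pay for the `systemctl daemon-reload` it triggers) when resource-agents-deps.target is not already ordered after blk-availability.service. This is an illustration assembled from the report, not the verbatim upstream diff; `systemd_drop_in` is the resource-agents helper named in the description, and `systemd_is_running` is its companion guard from the same shell library.

```sh
# Sketch only -- not the verbatim upstream change.
if systemd_is_running; then
	# Ask systemd what the target is currently ordered After.
	after=$(systemctl show resource-agents-deps.target \
	        --property=After | cut -d'=' -f2)

	case " $after " in
	*" blk-availability.service "*)
		# Ordering already in effect: skip rewriting 99-LVM-activate.conf
		# and skip the daemon-reload that systemd_drop_in would run.
		;;
	*)
		systemd_drop_in "99-LVM-activate" "After" "blk-availability.service"
		;;
	esac
fi
```

Presumably the typo meant the skip branch never matched, so every start rewrote the file and reloaded systemd.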
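The investigation mentioned above ("what systemd units are in what state, and what is mounted") boils down to spot checks like the following; these are standard systemd/util-linux commands offered as suggestions, not a recorded transcript:

```sh
# systemd's view of the mount unit vs. what the kernel actually has mounted.
systemctl status var-lib-nfs-rpc_pipefs.mount --no-pager
findmnt /var/lib/nfs/rpc_pipefs

# Recent history of the mount unit within the current boot.
journalctl -b -u var-lib-nfs-rpc_pipefs.mount --no-pager | tail -n 20
```

A mismatch between the two views (the unit "failed" while sunrpc is still mounted, as in the logs above) is the "Device or resource busy" state that the nfsserver agent then trips over.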
-----

Version-Release number of selected component (if applicable):

resource-agents-4.1.1-90.el8.x86_64

-----

How reproducible:

Seemingly always

-----

Steps to Reproduce:

1. Disable and stop nfs-server.service on all nodes.

2. Configure an NFS server resource group with an export. (Operations omitted below; a sketch of the matching `pcs` commands appears at the end of this report.)

```
 Resources:
 Group: nfsgroup
  Resource: vg (class=ocf provider=heartbeat type=LVM-activate)
   Attributes: activation_mode=exclusive vg_access_mode=system_id vgname=cluster_vg
  Resource: fs (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/cluster_vg/cluster_lv1 directory=/mnt/nfsinfo fstype=ext4
  Resource: share_fs (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/cluster_vg/cluster_lv2 directory=/mnt/share fstype=ext4
  Resource: nfsdaemon (class=ocf provider=heartbeat type=nfsserver)
   Attributes: nfs_shared_infodir=/mnt/nfsinfo
  Resource: export (class=ocf provider=heartbeat type=exportfs)
   Attributes: clientspec=* directory=/mnt/share fsid=1
  Resource: vip (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=192.168.22.61
```

3. Move the group from its initial node to the other node and then back to its initial node. If this doesn't reproduce the issue immediately, try a couple more moves. If it still doesn't reproduce for some reason, reboot both nodes and try again. Testing was finicky for me at times, but I think that was due to some other things that were going on.

-----

Actual results:

```
[root@fastvm-rhel-8-0-23 ~]# pcs resource status
  * Resource Group: nfsgroup:
    * vg          (ocf::heartbeat:LVM-activate):   Started node2
    * fs          (ocf::heartbeat:Filesystem):     Started node2
    * share_fs    (ocf::heartbeat:Filesystem):     Started node2
    * nfsdaemon   (ocf::heartbeat:nfsserver):      Started node2
    * export      (ocf::heartbeat:exportfs):       Started node2
    * vip         (ocf::heartbeat:IPaddr2):        Started node2
[root@fastvm-rhel-8-0-23 ~]# pcs resource clear nfsgroup && pcs resource move nfsgroup
[root@fastvm-rhel-8-0-23 ~]# pcs resource status
  * Resource Group: nfsgroup:
    * vg          (ocf::heartbeat:LVM-activate):   Started node1
    * fs          (ocf::heartbeat:Filesystem):     Started node1
    * share_fs    (ocf::heartbeat:Filesystem):     Started node1
    * nfsdaemon   (ocf::heartbeat:nfsserver):      Started node1
    * export      (ocf::heartbeat:exportfs):       Started node1
    * vip         (ocf::heartbeat:IPaddr2):        Started node1
[root@fastvm-rhel-8-0-23 ~]# pcs resource clear nfsgroup && pcs resource move nfsgroup
[root@fastvm-rhel-8-0-23 ~]# pcs resource status
...
  * Resource Group: nfsgroup:
    * vg          (ocf::heartbeat:LVM-activate):   Started node2
    * fs          (ocf::heartbeat:Filesystem):     Started node2
    * share_fs    (ocf::heartbeat:Filesystem):     Started node2
    * nfsdaemon   (ocf::heartbeat:nfsserver):      Stopped
    * export      (ocf::heartbeat:exportfs):       Stopped
    * vip         (ocf::heartbeat:IPaddr2):        Stopped

Failed Resource Actions:
  * nfsdaemon_start_0 on node2 'error' (1): call=59, status='Timed Out', exitreason='', last-rc-change='2021-06-14 22:56:08 -07:00', queued=0ms, exec=5002ms
```

-----

Expected results:

Same as above, except end state is

```
  * Resource Group: nfsgroup:
    * vg          (ocf::heartbeat:LVM-activate):   Started node2
    * fs          (ocf::heartbeat:Filesystem):     Started node2
    * share_fs    (ocf::heartbeat:Filesystem):     Started node2
    * nfsdaemon   (ocf::heartbeat:nfsserver):      Started node2
    * export      (ocf::heartbeat:exportfs):       Started node2
    * vip         (ocf::heartbeat:IPaddr2):        Started node2
```

-----

Additional info:

We're going to need 8.4.z on this.
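As promised in step 2 above, here is a sketch of how that configuration maps onto `pcs` commands (operation settings omitted, as in the original listing; attribute values are taken verbatim from it):

```sh
# Hedged sketch of the setup in step 2; appending with --group nfsgroup in
# this order reproduces the group ordering shown in the status output.
pcs resource create vg ocf:heartbeat:LVM-activate \
    vgname=cluster_vg vg_access_mode=system_id activation_mode=exclusive \
    --group nfsgroup
pcs resource create fs ocf:heartbeat:Filesystem \
    device=/dev/cluster_vg/cluster_lv1 directory=/mnt/nfsinfo fstype=ext4 \
    --group nfsgroup
pcs resource create share_fs ocf:heartbeat:Filesystem \
    device=/dev/cluster_vg/cluster_lv2 directory=/mnt/share fstype=ext4 \
    --group nfsgroup
pcs resource create nfsdaemon ocf:heartbeat:nfsserver \
    nfs_shared_infodir=/mnt/nfsinfo --group nfsgroup
pcs resource create export ocf:heartbeat:exportfs \
    clientspec='*' directory=/mnt/share fsid=1 --group nfsgroup
pcs resource create vip ocf:heartbeat:IPaddr2 \
    ip=192.168.22.61 --group nfsgroup
```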
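One way to check whether a given build still recreates the drop-in on every start: on a fixed build, the drop-in's mtime should not change across repeated starts of the vg resource, and systemd should not log a fresh daemon-reload per start. (Assumption: the runtime drop-in path below is the one systemd_drop_in uses for resource-agents-deps.target; adjust if your build places it elsewhere.)

```sh
# mtime of the drop-in: constant across resource moves on a fixed build,
# updated on every start on a broken one.
stat -c '%y %n' \
    /run/systemd/system/resource-agents-deps.target.d/99-LVM-activate.conf

# Count of daemon-reloads logged by PID 1 this boot ("Reloading." lines);
# the count should not grow by one per LVM-activate start once fixed.
journalctl -b _PID=1 | grep -c 'Reloading\.$'
```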