Bug 1924363

Summary:	nfsserver: Failure to unmount /var/lib/nfs doesn't cause stop failure
Product:	Red Hat Enterprise Linux 8	Reporter:	Reid Wahl <nwahl>
Component:	resource-agents	Assignee:	Oyvind Albrigtsen <oalbrigt>
Status:	CLOSED ERRATA	QA Contact:	cluster-qe <cluster-qe>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	8.5	CC:	agk, cluster-maint, fdinitto, mjuricek, pbhoite, phagara
Target Milestone:	rc	Keywords:	Triaged
Target Release:	8.5
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	resource-agents-4.1.1-91.el8	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-11-09 17:26:02 UTC	Type:	Bug

Description Reid Wahl 2021-02-03 01:42:55 UTC

Description of problem:

If the unbind_tree() function fails to unmount /var/lib/nfs and the rest of the stop operation succeeds, the stop operation as a whole is still declared a success, because nothing checks whether the umount calls in unbind_tree() succeeded. The operation should fail if the RA fails to unmount /var/lib/nfs.

The stop operation should arguably also fail if unbind_tree() fails to unmount the rpcpipefs_dir.
~~~
unbind_tree ()
{
        local i=1
        while `mount | grep -q " on $OCF_RESKEY_rpcpipefs_dir "` && [ "$i" -le 10 ]; do
                ocf_log info "Stop: umount ($i/10 attempts)"
                umount -t rpc_pipefs $OCF_RESKEY_rpcpipefs_dir
                sleep 1
                i=$((i + 1))
        done
        # # Insert error check here, probably
        if is_bound /var/lib/nfs; then
                umount /var/lib/nfs
        fi
        # # Insert another error check here
}
~~~
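A minimal sketch of unbind_tree() with the two error checks filled in. This is not necessarily how the shipped fix does it. It assumes the RA's own helpers ocf_log and is_bound and the OCF_SUCCESS/OCF_ERR_GENERIC return codes; the fallback definitions at the top are hypothetical stand-ins so the function can be exercised outside the RA. It also replaces the backtick-wrapped `mount | grep` with a plain pipeline, which yields the same exit status.

~~~
#!/bin/sh
# Hypothetical stand-ins, used only when running outside the nfsserver RA,
# which normally provides these via ocf-shellfuncs and its own is_bound().
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
if ! command -v ocf_log >/dev/null 2>&1; then
	ocf_log () { shift; echo "$*" >&2; }
fi
if ! command -v is_bound >/dev/null 2>&1; then
	is_bound () { mount | grep -q " on $1 type "; }
fi

unbind_tree ()
{
	local i=1
	# Retry unmounting rpc_pipefs up to 10 times, as the original code does.
	while mount | grep -q " on $OCF_RESKEY_rpcpipefs_dir " && [ "$i" -le 10 ]; do
		ocf_log info "Stop: umount ($i/10 attempts)"
		umount -t rpc_pipefs "$OCF_RESKEY_rpcpipefs_dir"
		sleep 1
		i=$((i + 1))
	done
	# Error check 1: rpc_pipefs is still mounted after all retries.
	if mount | grep -q " on $OCF_RESKEY_rpcpipefs_dir "; then
		ocf_log err "Stop: failed to unmount $OCF_RESKEY_rpcpipefs_dir"
		return $OCF_ERR_GENERIC
	fi
	# Error check 2: the shared infodir is still mounted on /var/lib/nfs.
	if is_bound /var/lib/nfs; then
		if ! umount /var/lib/nfs; then
			ocf_log err "Stop: failed to unmount /var/lib/nfs"
			return $OCF_ERR_GENERIC
		fi
	fi
	return $OCF_SUCCESS
}
~~~

The stop action that calls unbind_tree() would also need to return this status rather than falling through to success; that part is not shown.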

Since the shared infodir is mounted on /var/lib/nfs and should reside on shared, cluster-managed storage, a resource closer to the base of the resource group is likely to fail to stop during recovery instead. For example, an LVM-activate resource may fail to stop because the LV is still in use (it is still mounted on /var/lib/nfs).

In practical terms, this nfsserver RA misbehavior is unlikely to cause any additional impact. The desired behavior for nfsserver is a stop failure. If the stop failure doesn't occur in the nfsserver resource, then it's likely to occur later in the stop sequence, in a resource closer to the base of the group.

-----

Version-Release number of selected component (if applicable):

resource-agents-4.1.1-68.el8

-----

How reproducible:

Always

-----

Steps to Reproduce:

With /var/lib/nfs:

1. Create an ocf:heartbeat:nfsserver resource.

 Resource: nfs-daemon (class=ocf provider=heartbeat type=nfsserver)
  Attributes: nfs_shared_infodir=/mnt/nfs_shared_infodir
  Operations: monitor interval=10s timeout=20s (nfs-daemon-monitor-interval-10s)
              start interval=0s timeout=40s (nfs-daemon-start-interval-0s)
              stop interval=0s timeout=20s (nfs-daemon-stop-interval-0s)

2. Hold a file open under /var/lib/nfs.

    # touch /var/lib/nfs/testfile
    # exec 3>/var/lib/nfs/testfile

3. Stop the resource.

    # pcs resource debug-stop nfs-daemon
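Steps 2 and 3 can be scripted roughly as follows. repro_busy_stop is a hypothetical helper, not part of pcs or resource-agents. It assumes a running cluster with the nfs-daemon resource from step 1, and it flags the bug when the debug-stop output reports `returned: 'ok'` (as in the actual results below) while the infodir is still mounted. Checking the mount alone is not enough, because a fixed RA also leaves the busy mount in place; it just reports the failure.

~~~
#!/bin/sh
# Hypothetical helper: reproduce the bug and report whether the stop
# claimed success while the infodir mount was left behind.
repro_busy_stop ()
{
	local rsc="$1"
	local dir="${2:-/var/lib/nfs}"
	local out

	# Step 2: hold a file open so that umount reports "target is busy".
	touch "$dir/testfile" || return 2
	exec 3>"$dir/testfile"

	# Step 3: run the stop action directly and capture its report.
	out=$(pcs resource debug-stop "$rsc" 2>&1)
	echo "$out"

	# Release the file descriptor so the mount can be cleaned up afterwards.
	exec 3>&-

	if echo "$out" | grep -q "returned: 'ok'" && mount | grep -q " on $dir type "; then
		echo "BUG: stop returned ok but $dir is still mounted"
		return 1
	fi
	return 0
}

# Example usage on an affected node:
#   repro_busy_stop nfs-daemon /var/lib/nfs
~~~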


With rpcpipefs_dir:

As noted in the description, IMO the resource also should fail to stop if it fails to unmount the rpcpipefs_dir. However, I can't get the resource to start with a custom rpcpipefs_dir at all. It times out during start because nfs-idmapd tries to use /var/lib/nfs/rpc_pipefs instead of the directory specified in OCF_RESKEY_rpcpipefs_dir.

I'm not spending further time trying to get this particular configuration to work right now.

-----

Actual results:

# pcs resource debug-stop nfs-daemon
Operation stop for nfs-daemon (ocf:heartbeat:nfsserver) returned: 'ok' (0)
Feb 02 17:30:13 INFO: Stopping NFS server ...
Feb 02 17:30:13 INFO: Stop: threads
Feb 02 17:30:13 INFO: Stop: rpc-statd
Feb 02 17:30:13 INFO: Stop: nfs-idmapd
Feb 02 17:30:13 INFO: Stop: nfs-mountd
Feb 02 17:30:13 INFO: Stop: nfsdcld
Feb 02 17:30:13 INFO: Stop: rpc-gssd
Feb 02 17:30:13 INFO: Stop: umount (1/10 attempts)
umount: /var/lib/nfs: target is busy.
Feb 02 17:30:14 INFO: NFS server stopped

# mount | grep /var/lib/nfs
/dev/mapper/cluster_vg-cluster_lv1 on /var/lib/nfs type ext4 (rw,relatime,seclabel)

-----

Expected results:

The resource fails to stop.

Comment 3 Dean Jansa 2021-04-08 17:06:11 UTC

ON_QA bug without Verified:Tested should be in the MODIFIED state.

Comment 5 Dean Jansa 2021-04-15 08:00:55 UTC

ON_QA bug without Verified:Tested should be in the MODIFIED state.

Comment 9 errata-xmlrpc 2021-11-09 17:26:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: resource-agents security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4139