Bug 2100395
| Summary: | rpcbind.socket dies upon try-restart, causing related services (e.g. ypbind) to become unavailable | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Renaud Métrich <rmetrich> |
| Component: | nfs-utils | Assignee: | Steve Dickson <steved> |
| Status: | CLOSED ERRATA | QA Contact: | Yongcheng Yang <yoyang> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 8.6 | CC: | anrussel, bcodding, dostwal, dtardon, dwysocha, systemd-maint-list, toneata, xzhou |
| Target Milestone: | rc | Keywords: | EasyFix, Patch, Triaged |
| Target Release: | --- | Flags: | pm-rhel:
mirror+
|
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | nfs-utils-2.3.3-57.el8 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-11-08 10:53:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Renaud Métrich
2022-06-23 09:16:14 UTC
This is https://github.com/systemd/systemd/issues/23581, which is itself a duplicate of https://github.com/systemd/systemd/issues/8102. It doesn't look like it's going to be implemented in foreseeable future (and it would likely be quite hard to backport to RHEL-8 anyway), hence I'm moving this to RHEL-9. Meanwhile, rpcbind should switch the order of the units in the try-restart call. That seems to work reliably (or the race window is a lot smaller that way). I seem to have just run into this issue on a RHEL-8.6 to RHEL-8.7 upgrade, though the specifics are a bit different: rpcbind upgraded to rpcbind-1.2.5-9.el8.x86_64, and the upgrade seemed to stop making progress at: Running scriptlet: rpcbind-1.2.5-8.el8.x86_64 340/632 Process tree looked like this: root 5228 915 0 15:01 ? 00:00:00 \_ sshd: root [priv] root 5246 5228 0 15:01 ? 00:00:09 | \_ sshd: root@pts/0 root 5247 5246 0 15:01 pts/0 00:00:00 | \_ -bash root 5595 5247 25 15:56 pts/0 00:02:02 | \_ /usr/libexec/platform-python /usr/bin/dnf update root 26924 5595 0 16:04 pts/0 00:00:00 | \_ /bin/sh /var/tmp/rpm-tmp.kSNBph 2 2 root 26950 26924 0 16:04 pts/0 00:00:00 | \_ /bin/systemctl restart rpcbind.socket .. and the contents of /var/tmp/rpm-tmp.kSNBph was: -------------------------------------------------------------------- if systemctl -q is-enabled rpcbind.socket then /bin/systemctl reenable rpcbind.socket >/dev/null 2>&1 || : /bin/systemctl restart rpcbind.socket >/dev/null 2>&1 || : fi if [ ! -d /var/lib/rpcbind ] then mkdir /var/lib/rpcbind chown rpc:rpc /var/lib/rpcbind [ -x /usr/sbin/restorecon ] && \ /usr/sbin/restorecon /var/lib/rpcbind fi -------------------------------------------------------------------- If I try to run systemctl restart rpcbind.socket directly: [root@bcodding-rhel8 ~]# time systemctl restart rpcbind.socket real 1m30.176s user 0m0.007s sys 0m0.020s .. which produces the logs: Jul 18 16:56:08 bcodding-rhel8 systemd[1]: Stopping NFS status monitor for NFSv2/3 locking.... Jul 18 16:56:08 bcodding-rhel8 systemd[1]: Stopping RPC Bind... Jul 18 16:56:08 bcodding-rhel8 systemd[1]: rpcbind.service: Succeeded. Jul 18 16:56:08 bcodding-rhel8 systemd[1]: Stopped RPC Bind. Jul 18 16:57:38 bcodding-rhel8 systemd[1]: rpc-statd.service: State 'stop-sigterm' timed out. Killing. Jul 18 16:57:38 bcodding-rhel8 systemd[1]: rpc-statd.service: Killing process 101687 (rpc.statd) with signal SIGKILL. Jul 18 16:57:38 bcodding-rhel8 systemd[1]: rpc-statd.service: Main process exited, code=killed, status=9/KILL Jul 18 16:57:38 bcodding-rhel8 systemd[1]: rpc-statd.service: Failed with result 'timeout'. Jul 18 16:57:38 bcodding-rhel8 systemd[1]: Stopped NFS status monitor for NFSv2/3 locking.. Jul 18 16:57:38 bcodding-rhel8 systemd[1]: rpcbind.socket: Succeeded. Jul 18 16:57:38 bcodding-rhel8 systemd[1]: Closed RPCbind Server Activation Socket. Jul 18 16:57:38 bcodding-rhel8 systemd[1]: Stopping RPCbind Server Activation Socket. Jul 18 16:57:38 bcodding-rhel8 systemd[1]: Listening on RPCbind Server Activation Socket. Jul 18 16:57:38 bcodding-rhel8 systemd[1]: Starting RPC Bind... Jul 18 16:57:38 bcodding-rhel8 systemd[1]: Starting NFS status monitor for NFSv2/3 locking.... Jul 18 16:57:38 bcodding-rhel8 systemd[1]: Started RPC Bind. Jul 18 16:57:38 bcodding-rhel8 rpc.statd[101697]: Version 2.3.3 starting Jul 18 16:57:38 bcodding-rhel8 rpc.statd[101697]: Flags: TI-RPC Jul 18 16:57:38 bcodding-rhel8 systemd[1]: Started NFS status monitor for NFSv2/3 locking.. It seems from the logs that rpc-statd is ignoring SIGTERM, but I'm out of time looking at this today. Will try to figure out more and report back. The issue I'm seeing is that when `systemctl stop rpcbind.socket` is run, the dependency means that systemd first sends SIGTERM to rpcbind, then sigterm to rpc.statd. On SIGTERM, rpcbind tears down /var/run/rpcbind.sock. However, rpc-statd on SIGTERM attempts to unregister from rpcbind, thus: [root@bcodding-rhel8 ~]# gdb -p $(pidof rpc.statd) (gdb) break connect Breakpoint 1 at 0x7ffff684dc20 (2 locations) (gdb) continue Continuing. Program received signal SIGTERM, Terminated. 0x00007ffff74dd6eb in select () from /lib64/libc.so.6 (gdb) bt #0 0x00007ffff74dd6eb in select () from /lib64/libc.so.6 #1 0x000055555555ad06 in my_svc_run (sockfd=7) at svc_run.c:110 #2 0x0000555555558d90 in main (argc=1, argv=0x7fffffffe318) at statd.c:527 (gdb) continue Continuing. Breakpoint 1, 0x00007ffff74e6480 in connect () from /lib64/libc.so.6 (gdb) bt #0 0x00007ffff74e6480 in connect () from /lib64/libc.so.6 #1 0x00007ffff79a4150 in clnt_vc_create () from /lib64/libtirpc.so.3 #2 0x00007ffff79a8a3d in local_rpcb () from /lib64/libtirpc.so.3 #3 0x00007ffff79a97c4 in rpcb_unset () from /lib64/libtirpc.so.3 #4 0x000055555555fca5 in nfs_svc_unregister (program=100024, version=1) at svc_create.c:483 #5 0x000055555555a876 in statd_unregister () at statd.c:105 #6 killer (sig=15) at statd.c:105 #7 <signal handler called> #8 0x00007ffff74dd6eb in select () from /lib64/libc.so.6 #9 0x000055555555ad06 in my_svc_run (sockfd=7) at svc_run.c:110 #10 0x0000555555558d90 in main (argc=1, argv=0x7fffffffe318) at statd.c:527 (gdb) continue Continuing. Breakpoint 1, 0x00007ffff74e6480 in connect () from /lib64/libc.so.6 (gdb) bt #0 0x00007ffff74e6480 in connect () from /lib64/libc.so.6 #1 0x00007ffff79a4150 in clnt_vc_create () from /lib64/libtirpc.so.3 #2 0x00007ffff79a8a3d in local_rpcb () from /lib64/libtirpc.so.3 #3 0x00007ffff79a97c4 in rpcb_unset () from /lib64/libtirpc.so.3 #4 0x000055555555fca5 in nfs_svc_unregister (program=100024, version=1) at svc_create.c:483 #5 0x00007ffff740723c in __run_exit_handlers () from /lib64/libc.so.6 #6 0x00007ffff7407370 in exit () from /lib64/libc.so.6 #7 0x000055555555a892 in killer (sig=15) at statd.c:107 #8 <signal handler called> #9 0x00007ffff74dd6eb in select () from /lib64/libc.so.6 #10 0x000055555555ad06 in my_svc_run (sockfd=7) at svc_run.c:110 #11 0x0000555555558d90 in main (argc=1, argv=0x7fffffffe318) at statd.c:527 (gdb) continue Continuing. [Inferior 1 (process 4897) exited normally] We need a way for systemd to wait for rpc.statd to exit before sending SIGTERM to rpcbind. Ok, here's the likely fix: diff --git a/systemd/rpc-statd.service b/systemd/rpc-statd.service index f41ae2084f3c..2eeada60993a 100644 --- a/systemd/rpc-statd.service +++ b/systemd/rpc-statd.service @@ -4,7 +4,7 @@ DefaultDependencies=no Conflicts=umount.target Requires=nss-lookup.target rpcbind.socket Wants=network-online.target -After=network-online.target nss-lookup.target rpcbind.socket +After=network-online.target nss-lookup.target rpcbind.service PartOf=nfs-utils.service Apparently using After=rpcbind.socket isn't sufficient to get the inverse order correct when shutting down, and systemd sends SIGTERM to rpcbind /before/ rpc.statd. After this patch, the ordering is correct and rpc.statd shuts down cleanly: [root@bcodding-rhel8 ~]# time systemctl restart rpcbind.socket real 0m0.026s user 0m0.009s sys 0m0.017s commit 8a835cebb149ba2a54b6518722c79019cf8e3da4
Author: Benjamin Coddington <bcodding>
Date: Mon Aug 1 13:19:04 2022 -0400
rpc-statd.service: Stop rpcbind and rpc.stat in an exit race
Verified with nfs-utils-2.3.3-57.el8 Reproducer in comment #0 ~~~~~~~~~~~~~~~~~~~~~~~~ [root@hp-dl385pg8-01 ~]# rpm -q nfs-utils rpcbind nfs-utils-2.3.3-57.el8.x86_64 rpcbind-1.2.5-9.el8.x86_64 [root@hp-dl385pg8-01 ~]# time systemctl try-restart rpcbind.service rpcbind.socket real 0m0.021s user 0m0.008s sys 0m0.014s [root@hp-dl385pg8-01 ~]# time systemctl try-restart rpcbind.service rpcbind.socket real 0m0.021s user 0m0.009s sys 0m0.012s [root@hp-dl385pg8-01 ~]# time systemctl try-restart rpcbind.service rpcbind.socket real 0m0.021s user 0m0.011s sys 0m0.011s [root@hp-dl385pg8-01 ~]# time systemctl try-restart rpcbind.service rpcbind.socket real 0m0.020s user 0m0.009s sys 0m0.012s [root@hp-dl385pg8-01 ~]# time systemctl try-restart rpcbind.service rpcbind.socket real 0m0.021s user 0m0.005s sys 0m0.016s [root@hp-dl385pg8-01 ~]# time systemctl try-restart rpcbind.service rpcbind.socket real 0m0.020s user 0m0.007s sys 0m0.014s [root@hp-dl385pg8-01 ~]# time systemctl try-restart rpcbind.service rpcbind.socket real 0m0.021s user 0m0.008s sys 0m0.013s [root@hp-dl385pg8-01 ~]# Reproducer in comment #7 ~~~~~~~~~~~~~~~~~~~~~~~~ [root@hp-dl385pg8-01 ~]# time systemctl restart rpcbind.socket real 0m0.022s user 0m0.008s sys 0m0.014s [root@hp-dl385pg8-01 ~]# time systemctl restart rpcbind.socket real 0m0.021s user 0m0.009s sys 0m0.013s [root@hp-dl385pg8-01 ~]# time systemctl restart rpcbind.socket real 0m0.022s user 0m0.008s sys 0m0.014s [root@hp-dl385pg8-01 ~]# time systemctl restart rpcbind.socket real 0m0.023s user 0m0.008s sys 0m0.016s [root@hp-dl385pg8-01 ~]# time systemctl restart rpcbind.socket Job for rpcbind.socket failed. See "systemctl status rpcbind.socket" and "journalctl -xe" for details. real 0m0.021s user 0m0.007s sys 0m0.014s [root@hp-dl385pg8-01 ~]# systemctl status rpcbind.socket ● rpcbind.socket - RPCbind Server Activation Socket Loaded: loaded (/usr/lib/systemd/system/rpcbind.socket; enabled; vendor preset: enabled) Active: failed (Result: start-limit-hit) since Fri 2022-08-05 09:30:08 EDT; 5s ago Listen: /run/rpcbind.sock (Stream) 0.0.0.0:111 (Stream) 0.0.0.0:111 (Datagram) [::]:111 (Stream) [::]:111 (Datagram) Aug 05 09:30:07 hp-dl385pg8-01.rhts.eng.pek2.redhat.com systemd[1]: Listening on RPCbind Server Activation Socket. Aug 05 09:30:08 hp-dl385pg8-01.rhts.eng.pek2.redhat.com systemd[1]: rpcbind.socket: Succeeded. Aug 05 09:30:08 hp-dl385pg8-01.rhts.eng.pek2.redhat.com systemd[1]: Closed RPCbind Server Activation Socket. Aug 05 09:30:08 hp-dl385pg8-01.rhts.eng.pek2.redhat.com systemd[1]: Stopping RPCbind Server Activation Socket. Aug 05 09:30:08 hp-dl385pg8-01.rhts.eng.pek2.redhat.com systemd[1]: rpcbind.socket: Start request repeated too quickly. <<<<< Aug 05 09:30:08 hp-dl385pg8-01.rhts.eng.pek2.redhat.com systemd[1]: rpcbind.socket: Failed with result 'start-limit-hit'. Aug 05 09:30:08 hp-dl385pg8-01.rhts.eng.pek2.redhat.com systemd[1]: Failed to listen on RPCbind Server Activation Socket. [root@hp-dl385pg8-01 ~]# Compared with nfs-utils-2.3.3-56.el8: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [root@hp-dl385pg8-01 ~]# rpm -q nfs-utils rpcbind nfs-utils-2.3.3-56.el8.x86_64 rpcbind-1.2.5-9.el8.x86_64 [root@hp-dl385pg8-01 ~]# time systemctl restart rpcbind.socket real 1m30.276s user 0m0.018s sys 0m0.016s [root@hp-dl385pg8-01 ~]# time systemctl try-restart rpcbind.service rpcbind.socket real 1m30.226s user 0m0.016s sys 0m0.017s [root@hp-dl385pg8-01 ~]# Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (nfs-utils bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:7768 |