Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 2100395

Summary:	rpcbind.socket dies upon try-restart, causing related services (e.g. ypbind) to become unavailable
Product:	Red Hat Enterprise Linux 8	Reporter:	Renaud Métrich <rmetrich>
Component:	nfs-utils	Assignee:	Steve Dickson <steved>
Status:	CLOSED ERRATA	QA Contact:	Yongcheng Yang <yoyang>
Severity:	high	Docs Contact:
Priority:	high
Version:	8.6	CC:	anrussel, bcodding, dostwal, dtardon, dwysocha, systemd-maint-list, toneata, xzhou
Target Milestone:	rc	Keywords:	EasyFix, Patch, Triaged
Target Release:	---	Flags:	pm-rhel: mirror+
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	nfs-utils-2.3.3-57.el8	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-11-08 10:53:52 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Renaud Métrich 2022-06-23 09:16:14 UTC

Description of problem:

Upon rpcbind package update the following command is issued from rpm scriptlet:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
systemctl try-restart rpcbind.service rpcbind.socket
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

This ends up having the rpcbind.socket fail to start, because rpcbind.service is already active and also hijacked the sockets.

The journal shows some kind of race condition between the service and the socket:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
# journalctl -b -u rpcbind.service -u rpcbind.socket -o short-precise
-- Logs begin at Thu 2022-06-23 10:59:56 CEST, end at Thu 2022-06-23 11:01:01 CEST. --
Jun 23 11:00:00.387955 vm-rhel8 systemd[1]: Starting RPC Bind...
Jun 23 11:00:00.454849 vm-rhel8 systemd[1]: Started RPC Bind.
Jun 23 11:00:35.378595 vm-rhel8 systemd[1]: Stopping RPC Bind...
Jun 23 11:00:35.380285 vm-rhel8 systemd[1]: rpcbind.service: Succeeded.
Jun 23 11:00:35.380591 vm-rhel8 systemd[1]: Stopped RPC Bind.

Jun 23 11:00:35.381691 vm-rhel8 systemd[1]: Starting RPC Bind...

--> This is rpcbind.service restart

Jun 23 11:00:35.384161 vm-rhel8 systemd[1]: rpcbind.socket: Succeeded.
Jun 23 11:00:35.384504 vm-rhel8 systemd[1]: Closed RPCbind Server Activation Socket.
Jun 23 11:00:35.384525 vm-rhel8 systemd[1]: Stopping RPCbind Server Activation Socket.

--> This is rpcbind.socket restart, only the stop happens due to rpcbind.service hijacking the sockets

Jun 23 11:00:35.384595 vm-rhel8 systemd[1]: rpcbind.socket: Socket service rpcbind.service already active, refusing.
Jun 23 11:00:35.384618 vm-rhel8 systemd[1]: Failed to listen on RPCbind Server Activation Socket.
Jun 23 11:00:35.384656 vm-rhel8 systemd[1]: Dependency failed for RPC Bind.
Jun 23 11:00:35.384673 vm-rhel8 systemd[1]: rpcbind.service: Job rpcbind.service/start failed with result 'dependency>

# netstat -anlp | egrep -w "rpcbind|111"
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      1559/rpcbind        
tcp6       0      0 :::111                  :::*                    LISTEN      1559/rpcbind        
udp        0      0 0.0.0.0:111             0.0.0.0:*                           1559/rpcbind        
udp6       0      0 :::111                  :::*                                1559/rpcbind        
unix  2      [ ACC ]     STREAM     LISTENING     20232    1559/rpcbind         /run/rpcbind.sock

--> everything held by "rpcbind" instead of "systemd"
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Interestingly, when stracing *systemd*, I'm not able to reproduce, which seems to indicate there is some race.

Version-Release number of selected component (if applicable):

Up to rpcbind-1.2.5-8.el8.x86_64 included

How reproducible:

Always

Steps to Reproduce:
1. Execute "systemctl try-restart rpcbind.service rpcbind.socket"

Actual results:

Job for rpcbind.socket failed.
See "systemctl status rpcbind.socket" and "journalctl -xe" for details.
A dependency job for rpcbind.service failed. See 'journalctl -xe' for details.

Expected results:

No error

Comment 1 David Tardon 2022-06-29 13:38:07 UTC

This is https://github.com/systemd/systemd/issues/23581, which is itself a duplicate of https://github.com/systemd/systemd/issues/8102. It doesn't look like it's going to be implemented in foreseeable future (and it would likely be quite hard to backport to RHEL-8 anyway), hence I'm moving this to RHEL-9.

Meanwhile, rpcbind should switch the order of the units in the try-restart call. That seems to work reliably (or the race window is a lot smaller that way).

Comment 7 Benjamin Coddington 2022-07-18 21:09:03 UTC

I seem to have just run into this issue on a RHEL-8.6 to RHEL-8.7 upgrade, though the specifics are a bit different:

rpcbind upgraded to rpcbind-1.2.5-9.el8.x86_64, and the upgrade seemed to stop making progress at:

  Running scriptlet: rpcbind-1.2.5-8.el8.x86_64                             340/632 

Process tree looked like this:

root        5228     915  0 15:01 ?        00:00:00  \_ sshd: root [priv]
root        5246    5228  0 15:01 ?        00:00:09  |   \_ sshd: root@pts/0
root        5247    5246  0 15:01 pts/0    00:00:00  |       \_ -bash
root        5595    5247 25 15:56 pts/0    00:02:02  |           \_ /usr/libexec/platform-python /usr/bin/dnf update
root       26924    5595  0 16:04 pts/0    00:00:00  |               \_ /bin/sh /var/tmp/rpm-tmp.kSNBph 2 2 
root       26950   26924  0 16:04 pts/0    00:00:00  |                   \_ /bin/systemctl restart rpcbind.socket

.. and the contents of /var/tmp/rpm-tmp.kSNBph was:
--------------------------------------------------------------------
if systemctl -q is-enabled rpcbind.socket
then
	/bin/systemctl reenable rpcbind.socket  >/dev/null 2>&1 || :
	/bin/systemctl restart rpcbind.socket >/dev/null 2>&1 || :
fi
if [ ! -d /var/lib/rpcbind ]
then
	mkdir /var/lib/rpcbind
	chown rpc:rpc /var/lib/rpcbind
	[ -x /usr/sbin/restorecon ] && \
		/usr/sbin/restorecon /var/lib/rpcbind
fi
--------------------------------------------------------------------

If I try to run systemctl restart rpcbind.socket directly:
[root@bcodding-rhel8 ~]# time systemctl restart rpcbind.socket

real	1m30.176s
user	0m0.007s
sys	0m0.020s

.. which produces the logs:
Jul 18 16:56:08 bcodding-rhel8 systemd[1]: Stopping NFS status monitor for NFSv2/3 locking....
Jul 18 16:56:08 bcodding-rhel8 systemd[1]: Stopping RPC Bind...
Jul 18 16:56:08 bcodding-rhel8 systemd[1]: rpcbind.service: Succeeded.
Jul 18 16:56:08 bcodding-rhel8 systemd[1]: Stopped RPC Bind.
Jul 18 16:57:38 bcodding-rhel8 systemd[1]: rpc-statd.service: State 'stop-sigterm' timed out. Killing.
Jul 18 16:57:38 bcodding-rhel8 systemd[1]: rpc-statd.service: Killing process 101687 (rpc.statd) with signal SIGKILL.
Jul 18 16:57:38 bcodding-rhel8 systemd[1]: rpc-statd.service: Main process exited, code=killed, status=9/KILL
Jul 18 16:57:38 bcodding-rhel8 systemd[1]: rpc-statd.service: Failed with result 'timeout'.
Jul 18 16:57:38 bcodding-rhel8 systemd[1]: Stopped NFS status monitor for NFSv2/3 locking..
Jul 18 16:57:38 bcodding-rhel8 systemd[1]: rpcbind.socket: Succeeded.
Jul 18 16:57:38 bcodding-rhel8 systemd[1]: Closed RPCbind Server Activation Socket.
Jul 18 16:57:38 bcodding-rhel8 systemd[1]: Stopping RPCbind Server Activation Socket.
Jul 18 16:57:38 bcodding-rhel8 systemd[1]: Listening on RPCbind Server Activation Socket.
Jul 18 16:57:38 bcodding-rhel8 systemd[1]: Starting RPC Bind...
Jul 18 16:57:38 bcodding-rhel8 systemd[1]: Starting NFS status monitor for NFSv2/3 locking....
Jul 18 16:57:38 bcodding-rhel8 systemd[1]: Started RPC Bind.
Jul 18 16:57:38 bcodding-rhel8 rpc.statd[101697]: Version 2.3.3 starting
Jul 18 16:57:38 bcodding-rhel8 rpc.statd[101697]: Flags: TI-RPC
Jul 18 16:57:38 bcodding-rhel8 systemd[1]: Started NFS status monitor for NFSv2/3 locking..

It seems from the logs that rpc-statd is ignoring SIGTERM, but I'm out of time looking at this today.  Will try to figure out more and report back.

Comment 8 Benjamin Coddington 2022-07-26 15:48:08 UTC

The issue I'm seeing is that when `systemctl stop rpcbind.socket` is run, the dependency means that systemd first sends SIGTERM to rpcbind, then sigterm to rpc.statd.  On SIGTERM, rpcbind tears down /var/run/rpcbind.sock.  However, rpc-statd on SIGTERM attempts to unregister from rpcbind, thus:

[root@bcodding-rhel8 ~]# gdb -p $(pidof rpc.statd)
(gdb) break connect
Breakpoint 1 at 0x7ffff684dc20 (2 locations)
(gdb) continue
Continuing.

Program received signal SIGTERM, Terminated.
0x00007ffff74dd6eb in select () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff74dd6eb in select () from /lib64/libc.so.6
#1  0x000055555555ad06 in my_svc_run (sockfd=7) at svc_run.c:110
#2  0x0000555555558d90 in main (argc=1, argv=0x7fffffffe318) at statd.c:527
(gdb) continue
Continuing.

Breakpoint 1, 0x00007ffff74e6480 in connect () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff74e6480 in connect () from /lib64/libc.so.6
#1  0x00007ffff79a4150 in clnt_vc_create () from /lib64/libtirpc.so.3
#2  0x00007ffff79a8a3d in local_rpcb () from /lib64/libtirpc.so.3
#3  0x00007ffff79a97c4 in rpcb_unset () from /lib64/libtirpc.so.3
#4  0x000055555555fca5 in nfs_svc_unregister (program=100024, version=1) at svc_create.c:483
#5  0x000055555555a876 in statd_unregister () at statd.c:105
#6  killer (sig=15) at statd.c:105
#7  <signal handler called>
#8  0x00007ffff74dd6eb in select () from /lib64/libc.so.6
#9  0x000055555555ad06 in my_svc_run (sockfd=7) at svc_run.c:110
#10 0x0000555555558d90 in main (argc=1, argv=0x7fffffffe318) at statd.c:527
(gdb) continue
Continuing.

Breakpoint 1, 0x00007ffff74e6480 in connect () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff74e6480 in connect () from /lib64/libc.so.6
#1  0x00007ffff79a4150 in clnt_vc_create () from /lib64/libtirpc.so.3
#2  0x00007ffff79a8a3d in local_rpcb () from /lib64/libtirpc.so.3
#3  0x00007ffff79a97c4 in rpcb_unset () from /lib64/libtirpc.so.3
#4  0x000055555555fca5 in nfs_svc_unregister (program=100024, version=1) at svc_create.c:483
#5  0x00007ffff740723c in __run_exit_handlers () from /lib64/libc.so.6
#6  0x00007ffff7407370 in exit () from /lib64/libc.so.6
#7  0x000055555555a892 in killer (sig=15) at statd.c:107
#8  <signal handler called>
#9  0x00007ffff74dd6eb in select () from /lib64/libc.so.6
#10 0x000055555555ad06 in my_svc_run (sockfd=7) at svc_run.c:110
#11 0x0000555555558d90 in main (argc=1, argv=0x7fffffffe318) at statd.c:527
(gdb) continue
Continuing.
[Inferior 1 (process 4897) exited normally]

We need a way for systemd to wait for rpc.statd to exit before sending SIGTERM to rpcbind.

Comment 9 Benjamin Coddington 2022-07-26 18:03:47 UTC

Ok, here's the likely fix:

diff --git a/systemd/rpc-statd.service b/systemd/rpc-statd.service
index f41ae2084f3c..2eeada60993a 100644
--- a/systemd/rpc-statd.service
+++ b/systemd/rpc-statd.service
@@ -4,7 +4,7 @@ DefaultDependencies=no
 Conflicts=umount.target
 Requires=nss-lookup.target rpcbind.socket
 Wants=network-online.target
-After=network-online.target nss-lookup.target rpcbind.socket
+After=network-online.target nss-lookup.target rpcbind.service
 
 PartOf=nfs-utils.service

Apparently using After=rpcbind.socket isn't sufficient to get the inverse order correct when shutting down, and systemd sends SIGTERM to rpcbind /before/ rpc.statd.  After this patch, the ordering is correct and rpc.statd shuts down cleanly:

[root@bcodding-rhel8 ~]# time systemctl restart rpcbind.socket

real	0m0.026s
user	0m0.009s
sys	0m0.017s

Comment 14 Steve Dickson 2022-08-01 17:21:24 UTC

commit 8a835cebb149ba2a54b6518722c79019cf8e3da4
Author: Benjamin Coddington <bcodding>
Date:   Mon Aug 1 13:19:04 2022 -0400

    rpc-statd.service: Stop rpcbind and rpc.stat in an exit race

Comment 15 Yongcheng Yang 2022-08-05 13:38:48 UTC

Verified with nfs-utils-2.3.3-57.el8

Reproducer in comment #0
~~~~~~~~~~~~~~~~~~~~~~~~
[root@hp-dl385pg8-01 ~]# rpm -q nfs-utils rpcbind
nfs-utils-2.3.3-57.el8.x86_64
rpcbind-1.2.5-9.el8.x86_64
[root@hp-dl385pg8-01 ~]# time systemctl try-restart rpcbind.service rpcbind.socket

real    0m0.021s
user    0m0.008s
sys     0m0.014s
[root@hp-dl385pg8-01 ~]# time systemctl try-restart rpcbind.service rpcbind.socket

real    0m0.021s
user    0m0.009s
sys     0m0.012s
[root@hp-dl385pg8-01 ~]# time systemctl try-restart rpcbind.service rpcbind.socket

real    0m0.021s
user    0m0.011s
sys     0m0.011s
[root@hp-dl385pg8-01 ~]# time systemctl try-restart rpcbind.service rpcbind.socket

real    0m0.020s
user    0m0.009s
sys     0m0.012s
[root@hp-dl385pg8-01 ~]# time systemctl try-restart rpcbind.service rpcbind.socket

real    0m0.021s
user    0m0.005s
sys     0m0.016s
[root@hp-dl385pg8-01 ~]# time systemctl try-restart rpcbind.service rpcbind.socket

real    0m0.020s
user    0m0.007s
sys     0m0.014s
[root@hp-dl385pg8-01 ~]# time systemctl try-restart rpcbind.service rpcbind.socket

real    0m0.021s
user    0m0.008s
sys     0m0.013s
[root@hp-dl385pg8-01 ~]#

Reproducer in comment #7
~~~~~~~~~~~~~~~~~~~~~~~~
[root@hp-dl385pg8-01 ~]# time systemctl restart rpcbind.socket

real    0m0.022s
user    0m0.008s
sys     0m0.014s
[root@hp-dl385pg8-01 ~]# time systemctl restart rpcbind.socket

real    0m0.021s
user    0m0.009s
sys     0m0.013s
[root@hp-dl385pg8-01 ~]# time systemctl restart rpcbind.socket

real    0m0.022s
user    0m0.008s
sys     0m0.014s
[root@hp-dl385pg8-01 ~]# time systemctl restart rpcbind.socket

real    0m0.023s
user    0m0.008s
sys     0m0.016s
[root@hp-dl385pg8-01 ~]# time systemctl restart rpcbind.socket
Job for rpcbind.socket failed.
See "systemctl status rpcbind.socket" and "journalctl -xe" for details.

real    0m0.021s
user    0m0.007s
sys     0m0.014s
[root@hp-dl385pg8-01 ~]# systemctl status rpcbind.socket
● rpcbind.socket - RPCbind Server Activation Socket
   Loaded: loaded (/usr/lib/systemd/system/rpcbind.socket; enabled; vendor preset: enabled)
   Active: failed (Result: start-limit-hit) since Fri 2022-08-05 09:30:08 EDT; 5s ago
   Listen: /run/rpcbind.sock (Stream)
           0.0.0.0:111 (Stream)
           0.0.0.0:111 (Datagram)
           [::]:111 (Stream)
           [::]:111 (Datagram)

Aug 05 09:30:07 hp-dl385pg8-01.rhts.eng.pek2.redhat.com systemd[1]: Listening on RPCbind Server Activation Socket.
Aug 05 09:30:08 hp-dl385pg8-01.rhts.eng.pek2.redhat.com systemd[1]: rpcbind.socket: Succeeded.
Aug 05 09:30:08 hp-dl385pg8-01.rhts.eng.pek2.redhat.com systemd[1]: Closed RPCbind Server Activation Socket.
Aug 05 09:30:08 hp-dl385pg8-01.rhts.eng.pek2.redhat.com systemd[1]: Stopping RPCbind Server Activation Socket.
Aug 05 09:30:08 hp-dl385pg8-01.rhts.eng.pek2.redhat.com systemd[1]: rpcbind.socket: Start request repeated too quickly.   <<<<<
Aug 05 09:30:08 hp-dl385pg8-01.rhts.eng.pek2.redhat.com systemd[1]: rpcbind.socket: Failed with result 'start-limit-hit'.
Aug 05 09:30:08 hp-dl385pg8-01.rhts.eng.pek2.redhat.com systemd[1]: Failed to listen on RPCbind Server Activation Socket.
[root@hp-dl385pg8-01 ~]#


Compared with nfs-utils-2.3.3-56.el8:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@hp-dl385pg8-01 ~]# rpm -q nfs-utils rpcbind
nfs-utils-2.3.3-56.el8.x86_64
rpcbind-1.2.5-9.el8.x86_64
[root@hp-dl385pg8-01 ~]# time systemctl restart rpcbind.socket

real    1m30.276s
user    0m0.018s
sys     0m0.016s
[root@hp-dl385pg8-01 ~]# time systemctl try-restart rpcbind.service rpcbind.socket

real    1m30.226s
user    0m0.016s
sys     0m0.017s
[root@hp-dl385pg8-01 ~]#

Comment 19 errata-xmlrpc 2022-11-08 10:53:52 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (nfs-utils bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7768