Bug 1181779

Summary: rpcbind prevents Gluster/NFS from registering itself after a restart/reboot
Product: Red Hat Enterprise Linux 7
Reporter: Marcelo Barbosa "firemanxbr" <marcelo.barbosa>
Component: rpcbind
Assignee: Steve Dickson <steved>
Status: CLOSED ERRATA
QA Contact: Yongcheng Yang <yoyang>
Severity: unspecified
Priority: unspecified
Version: 7.0
CC: asmarre, eguan, fs-qe, jiyin, joe, ndevos, smayhew, steved, yoyang
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: rpcbind-0.2.0-27.el7
Doc Type: Bug Fix
Last Closed: 2015-11-19 05:32:10 UTC
Type: Bug

Description Marcelo Barbosa "firemanxbr" 2015-01-13 18:07:06 UTC
I'm using RHEL 7.0 + GlusterFS with packages:

glusterfs-libs-3.6.1-1.el7.x86_64
glusterfs-fuse-3.6.1-1.el7.x86_64
vdsm-gluster-4.16.10-0.el7.noarch
glusterfs-cli-3.6.1-1.el7.x86_64
glusterfs-server-3.6.1-1.el7.x86_64
glusterfs-api-3.6.1-1.el7.x86_64
glusterfs-geo-replication-3.6.1-1.el7.x86_64
glusterfs-3.6.1-1.el7.x86_64
glusterfs-rdma-3.6.1-1.el7.x86_64
rpcbind-0.2.0-23.el7.x86_64

My error is:

# systemctl status glusterd.service
glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled)
   Active: active (running) since Tue 2015-01-13 14:34:29 BRST; 5min ago
  Process: 20445 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid (code=exited, status=0/SUCCESS)
 Main PID: 20446 (glusterd)
   CGroup: /system.slice/glusterd.service
           ├─ 3426 /usr/sbin/glusterfsd -s ped-dc02.datacom --volfile-id vol-ctdb.ped-dc02.datacom.gluster-ctdb02 -p /var/lib/glusterd/vols/vol-ctdb/ru...
           ├─ 3432 /usr/sbin/glusterfsd -s ped-dc02.datacom --volfile-id vol-data.ped-dc02.datacom.gluster-data02 -p /var/lib/glusterd/vols/vol-data/ru...
           ├─ 3440 /usr/sbin/glusterfsd -s ped-dc02.datacom --volfile-id vol-export.ped-dc02.datacom.gluster-export02 -p /var/lib/glusterd/vols/vol-exp...
           ├─ 3445 /usr/sbin/glusterfsd -s ped-dc02.datacom --volfile-id vol-iso.ped-dc02.datacom.gluster-iso02 -p /var/lib/glusterd/vols/vol-iso/run/p...
           ├─ 3450 /usr/sbin/glusterfsd -s ped-dc02.datacom --volfile-id vol-unguarded.ped-dc02.datacom.gluster-unguarded02 -p /var/lib/glusterd/vols/v...
           ├─ 3457 /usr/sbin/glusterfsd -s ped-dc02.datacom --volfile-id vol-vm.ped-dc02.datacom.gluster-vm02 -p /var/lib/glusterd/vols/vol-vm/run/ped-...
           ├─20446 /usr/sbin/glusterd -p /var/run/glusterd.pid
           ├─20689 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glus...
           └─20695 /sbin/rpc.statd
 
Jan 13 14:34:38 ped-dc02.datacom nfs[20679]: backtrace 1
Jan 13 14:34:38 ped-dc02.datacom nfs[20679]: dlfcn 1
Jan 13 14:34:38 ped-dc02.datacom nfs[20679]: libpthread 1
Jan 13 14:34:38 ped-dc02.datacom nfs[20679]: llistxattr 1
Jan 13 14:34:38 ped-dc02.datacom nfs[20679]: setfsid 1
Jan 13 14:34:38 ped-dc02.datacom nfs[20679]: spinlock 1
Jan 13 14:34:38 ped-dc02.datacom nfs[20679]: epoll.h 1
Jan 13 14:34:38 ped-dc02.datacom nfs[20679]: xattr.h 1
Jan 13 14:34:38 ped-dc02.datacom nfs[20679]: st_atim.tv_nsec 1
Jan 13 14:34:38 ped-dc02.datacom nfs[20679]: package-string: glusterfs 3.6.1

log:

# tail -f /var/log/glusterfs/nfs.log
[2015-01-13 16:11:40.961035] W [glusterfsd.c:1194:cleanup_and_exit] (--> 0-: received signum (0), shutting down
[2015-01-13 16:27:31.683546] I [MSGID: 100030] [glusterfsd.c:2018:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.6.1 (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/d90c692c9430f00aafeb7d6741c1a54b.socket)
[2015-01-13 16:27:32.730510] I [rpcsvc.c:2142:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 16
[2015-01-13 16:27:32.804293] E [rpcsvc.c:1303:rpcsvc_program_register_portmap] 0-rpc-service: Could not register with portmap 100021 4 38468
[2015-01-13 16:27:32.804314] E [nfs.c:331:nfs_init_versions] 0-nfs: Program  NLM4 registration failed
[2015-01-13 16:27:32.804321] E [nfs.c:1341:init] 0-nfs: Failed to initialize protocols
[2015-01-13 16:27:32.804328] E [xlator.c:425:xlator_init] 0-nfs-server: Initialization of volume 'nfs-server' failed, review your volfile again
[2015-01-13 16:27:32.804334] E [graph.c:322:glusterfs_graph_init] 0-nfs-server: initializing translator failed
[2015-01-13 16:27:32.804340] E [graph.c:525:glusterfs_graph_activate] 0-graph: init failed
[2015-01-13 16:27:32.804626] W [glusterfsd.c:1194:cleanup_and_exit] (--> 0-: received signum (0), shutting down
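
For reference, the failing registration can be pulled out of nfs.log with a short filter. This is a minimal sketch: the function name failed_registrations is hypothetical, and it assumes the three trailing numbers in the "Could not register with portmap" message are program, version, and port, in that order.

```shell
# failed_registrations [FILE...] (hypothetical helper)
# Prints one line per failed portmap registration found in the given
# log files, or on stdin when no file is given.
failed_registrations() {
    awk '/Could not register with portmap/ {
        # Assumption: the last three fields are program, version, port.
        print "program=" $(NF-2), "version=" $(NF-1), "port=" $NF
    }' "$@"
}
```

Run as `failed_registrations /var/log/glusterfs/nfs.log`, this would report program=100021 version=4 port=38468 for the log above, which matches the NLM4 failure on the following log line.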

Solution (removes the -w option from a local copy of rpcbind.service, then restarts both services):

# sed "s/ -w//" /usr/lib/systemd/system/rpcbind.service > /etc/systemd/system/rpcbind.service
# systemctl daemon-reload
# systemctl restart rpcbind
# systemctl restart glusterd

Comment 1 Niels de Vos 2015-01-13 18:13:46 UTC
The problem was that rpcbind always starts with the -w option. This prevents the Gluster/NFS server from registering itself with rpcbind after a reboot.

With the -w option removed from rpcbind.service, rpcbind does not do a warm-restart on boot, and all RPC programs should be able to register themselves without problems.

This might be related to the fact that upon reboot the Gluster/NFS service does not (always) unregister itself from rpcbind.

I think the rpcbind.service should only add the -w option on reload, not on (re)start.
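
One way to check for such a leftover registration is to look for the program number in the rpcinfo -p listing. A rough sketch, assuming the listing has the program number in the first column; the helper name has_program is made up for illustration, and 100021 is the program number from the log in the description.

```shell
# has_program PROGNUM (hypothetical helper)
# Reads an "rpcinfo -p" style listing on stdin and succeeds if PROGNUM
# appears in the first column.
has_program() {
    awk -v prog="$1" '$1 == prog { found = 1 } END { exit !found }'
}

# On the affected host (not run here):
#   rpcinfo -p | has_program 100021 && echo "NLM is still registered"
# A stale entry can then be removed by root with:
#   rpcinfo -d 100021 4
```

Whether a stale entry is actually what blocks the new registration would still need to be confirmed on the affected system.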

Comment 3 Steve Dickson 2015-01-15 14:02:09 UTC
(In reply to Niels de Vos from comment #1)
> The problem was that rpcbind always starts with the -w option. This prevents
> the Gluster/NFS server from registering itself with rpcbind after a reboot.
> 
> With the -w option removed from rpcbind.service, rpcbind does not do a
> warm-restart on boot, and all RPC programs should be able to register
> themselves without problems.
> 
> This might be related to the fact that upon reboot the Gluster/NFS service
> does not (always) unregister itself from rpcbind.
> 
> I think the rpcbind.service should only add the -w option on reload, not on
> (re)start.

How do you do this with systemd scripts???

Comment 4 Niels de Vos 2015-01-16 17:48:00 UTC
(In reply to Steve Dickson from comment #3)
> How do you do this with systemd scripts???

Uh, yeah, well, that does not seem as trivial as I thought it would be.

This simple configuration doesn't work, probably because the PID changes:

[Service]
Type=forking
EnvironmentFile=/etc/sysconfig/rpcbind
ExecStart=/sbin/rpcbind ${RPCBIND_ARGS}
ExecReload=-/bin/kill ${MAINPID} ; /sbin/rpcbind -w ${RPCBIND_ARGS}


So, trying to fake a rpcbind.pid in the hope it would do something more (line breaks added in this comment, commands should be on one line):

[Service]
Type=forking
PIDFile=/run/rpcbind.pid
EnvironmentFile=/etc/sysconfig/rpcbind
ExecStart=/bin/sh -c "/sbin/rpcbind ${RPCBIND_ARGS} ; sleep 1 ; \
                      /usr/sbin/pidof rpcbind > /run/rpcbind.pid"
ExecReload=-/bin/kill ${MAINPID} ; /sbin/rpcbind -w ${RPCBIND_ARGS} ; \
           -/usr/sbin/pidof rpcbind > /run/rpcbind.pid


But no luck :-/

The problem occurs because the Gluster/NFS server is not stopped on a shutdown/reboot. It therefore does not unregister from rpcbind. I guess a similar problem would happen when an RPC service crashes.

Do you have a recommendation on how to handle this kind of issue? I need to look into cleanly stopping the Gluster/NFS service in systemd environments, but crashes and unclean de-registrations seem to be unhandled.

(Idea: maybe move the -w status file to /var/run which is cleared upon reboot?)

Comment 5 Steve Dickson 2015-02-26 18:24:26 UTC
(In reply to Niels de Vos from comment #4)
> (Idea: maybe move the -w status file to /var/run which is cleared upon
> reboot?)
Working with Anand at this year's Connectathon, I see what the problem is.
I think moving the warm-start file to /var/run is a good idea because
rpcbind needs to remember servers across restarts, not reboots!

Comment 9 Steve Dickson 2015-05-04 12:38:40 UTC
(In reply to Steve Dickson from comment #5)
> (In reply to Niels de Vos from comment #4)
> > (Idea: maybe move the -w status file to /var/run which is cleared upon
> > reboot?)
> Working with Anand at this year's Connectathon, I see what the problem is.
> I think moving the warm-start file to /var/run is a good idea because
> rpcbind needs to remember servers across restarts, not reboots!

I was playing around with this in Fedora and it turns out that just moving
the rpcbind directory to /var/run didn't work because the
/var/run/rpcbind was being created on reboot... But I just
stumbled over systemd-tmpfiles, which appears to create
directories during boot, which is exactly what is needed
(I think! ;-) )
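
A tmpfiles.d entry along these lines would have systemd-tmpfiles create the directory at boot. This is only a sketch of the idea, not the change that shipped: the file name rpcbind.conf, the path /run/rpcbind, the 0700 mode, and the rpc user and group are all assumptions, as is the helper name write_rpcbind_tmpfiles.

```shell
# write_rpcbind_tmpfiles DIR (hypothetical helper)
# Writes a tmpfiles.d fragment into DIR. The "d" line asks systemd-tmpfiles
# to create the directory with the given mode and owner if it is missing.
write_rpcbind_tmpfiles() {
    cat > "$1/rpcbind.conf" <<'EOF'
d /run/rpcbind 0700 rpc rpc -
EOF
}

# On a real system (as root, not run here):
#   write_rpcbind_tmpfiles /etc/tmpfiles.d
#   systemd-tmpfiles --create /etc/tmpfiles.d/rpcbind.conf
```

Taking the target directory as an argument keeps the helper easy to try in a scratch directory before touching /etc.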

Comment 10 Steve Dickson 2015-05-04 15:01:15 UTC
*** Bug 1184661 has been marked as a duplicate of this bug. ***

Comment 21 Steve Dickson 2015-09-24 15:42:40 UTC
Hello,

Some changes were made to the latest rpcbind package.
Would you mind retesting with
   http://people.redhat.com/steved/.bz1240817/rpcbind-0.2.0-30.el7.x86_64.rpm

to ensure there are no regressions?

Comment 24 errata-xmlrpc 2015-11-19 05:32:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2205.html