Bug 1546192 - start_nfs.sh is ignoring ganesha.nfsd exit code
Summary: start_nfs.sh is ignoring ganesha.nfsd exit code
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Container
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z2
Target Release: 3.0
Assignee: Sébastien Han
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks: 1450164 1475544 1489934 1548353
 
Reported: 2018-02-16 14:48 UTC by Giulio Fidente
Modified: 2022-02-21 18:19 UTC (History)

Fixed In Version: rhceph:ceph-3.0-rhel-7-docker-candidate-19609-20180315205847
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-26 18:31:07 UTC
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | ceph ceph-container pull 910 | 0 | None | closed | Exit CephNFS with error if ganesha daemon fails | 2021-01-18 10:48:40 UTC
Red Hat Product Errata | RHBA-2018:1260 | 0 | None | None | None | 2018-04-26 18:31:40 UTC

Description Giulio Fidente 2018-02-16 14:48:50 UTC
Description of problem:
The start_nfs.sh [1] script inside the container image ignores the ganesha.nfsd exit code. Consequently, systemd reports success even when the daemon inside the container failed to start.

1. https://github.com/ceph/ceph-container/blob/master/ceph-releases/luminous/ubuntu/16.04/daemon/start_nfs.sh#L35
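The failure mode can be sketched with stand-ins for the daemon. This is only an illustration, not the real script: fake_daemon, fake_daemon_early_failure and old_start are hypothetical names, and the real script ends by exec'ing a log tail instead of returning.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a daemonizing ganesha.nfsd: the parent returns 0
# immediately while the forked child fails afterwards.
fake_daemon() {
  ( exit 2 ) &
  return 0
}

# Hypothetical stand-in for a daemon that fails before it ever forks.
fake_daemon_early_failure() {
  return 2
}

# Same shape as the old start_nfs: "|| return 0" turns a failure into success.
old_start() {
  "$1" || return 0
  # The real script continues with a log tail at this point, which keeps
  # the container alive regardless of what happened to the daemon.
  return 0
}

old_start fake_daemon
late_status=$?
old_start fake_daemon_early_failure
early_status=$?
wait  # reap the background child of fake_daemon

echo "late failure reported as:  $late_status"
echo "early failure reported as: $early_status"
```

In both cases the caller sees 0, which is what systemd ends up reporting.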

Comment 4 Tom Barron 2018-02-16 22:26:15 UTC
I tried running ganesha in the foreground (with -F) with a '| tee -a /etc/ganesha/ganesha.log', thinking that would give us a true return code (rather than one that always reports success) and put the log on stdout.  Unfortunately, while '-F' keeps the daemon from forking off a child and immediately returning success, it doesn't send the log to stdout.

Maybe we need to get the ganesha folks to give us a way to log to stdout and then we can remove the 'tail -f'?

The other ceph daemons are not running 'tail -f' in this way and this commit

https://github.com/ceph/ceph-container/commit/51d17cf7285917b997300f45dea94811915ce85d

tells the story.

Comment 5 Tom Barron 2018-02-16 23:03:25 UTC
The ganesha developers suggested running with '-F -L STDOUT'. Sure enough, from inside the current ceph-nfs container, if I kill off the ganesha daemon and run '/usr/bin/ganesha.nfsd -F -L STDOUT', I see the logging on the console, and when I look later the same log events are in /var/log/ganesha/ganesha.log inside the container.  This suggests that we can remove the 'tail -f' and let a failing ganesha daemon cause the container to exit, which in turn would allow external orchestrators (pacemaker, docker restart, kubernetes, etc.) to detect that the container and its service have failed and attempt to restart the container.
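A minimal sketch of why running the daemon in the foreground under exec helps, assuming a stand-in command in place of ganesha.nfsd (new_start and the FATAL message are hypothetical): exec replaces the shell with the daemon, so the daemon's output and exit status become those of the entrypoint.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for "ganesha.nfsd -F -L STDOUT" failing at startup.
new_start() {
  exec sh -c 'echo "FATAL: cannot bind"; exit 2'
}

# The command substitution runs new_start in a subshell, so exec replaces
# only that subshell; it plays the role of the container's main process.
output=$(new_start)
status=$?

echo "captured log:  $output"
echo "exit status:   $status"
```

Whatever supervises the entrypoint now sees the stand-in's non-zero status directly, with no log tail keeping the process alive.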

Comment 6 Tom Barron 2018-02-18 15:11:08 UTC
I made the following change to start_nfs.sh:

[stack@undercloud daemon]$ git diff
diff --git a/ceph-releases/luminous/ubuntu/16.04/daemon/start_nfs.sh b/ceph-releases/luminous/ubuntu/16.04/daemon/start_nfs.sh
index 835f702..bb19f5b 100755
--- a/ceph-releases/luminous/ubuntu/16.04/daemon/start_nfs.sh
+++ b/ceph-releases/luminous/ubuntu/16.04/daemon/start_nfs.sh
@@ -31,7 +31,6 @@ function start_nfs {
   fi
 
   log "SUCCESS"
-  # start ganesha
-  /usr/bin/ganesha.nfsd "${GANESHA_OPTIONS[@]}" -L /var/log/ganesha/ganesha.log "${GANESHA_EPOCH}" || return 0
-  exec tailf /var/log/ganesha/ganesha.log
+  # start ganesha, logging both to STDOUT and to the location specified in the ganesha config file
+  exec /usr/bin/ganesha.nfsd "${GANESHA_OPTIONS[@]}" -F -L STDOUT "${GANESHA_EPOCH}" || return 0
 }

I then rebuilt the docker container, pushed it to a local docker repository, pulled it to an OpenStack controller node, and ran it there with a configuration known to cause the ganesha daemon to exit.  The container runs and exits with an error code, as one would like, and the logs go to STDOUT and are visible in journalctl afterwards:

[root@overcloud-controller-0 ~]# /usr/bin/docker run --rm --net=host -v /var/lib/ceph:/var/lib/ceph -v /etc/ceph:/etc/ceph -v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha -v /etc/ganesha:/etc/ganesha --privileged -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket -v /etc/localtime:/etc/localtime:ro -e CLUSTER=ceph -e CEPH_DAEMON=NFS --name=ceph-nfs-pacemaker 192.168.24.1:8787/ceph/daemon:latest
2018-02-18 15:04:56  /entrypoint.sh: static: does not generate config
2018-02-18 15:04:56  /entrypoint.sh: SUCCESS
exec: PID 94: spawning /usr/bin/ganesha.nfsd  -F -L STDOUT 
18/02/2018 15:04:56 : epoch 5a899618 : overcloud-controller-0 : ganesha.nfsd-94[main] main :MAIN :EVENT :ganesha.nfsd Starting: Ganesha Version /builddir/build/BUILD/nfs-ganesha-2.5.5-0.1.1-Source, built at Feb 12 2018 22:29:08 on 
18/02/2018 15:04:56 : epoch 5a899618 : overcloud-controller-0 : ganesha.nfsd-94[main] nfs_set_param_from_conf :NFS STARTUP :EVENT :Configuration file successfully parsed
18/02/2018 15:04:56 : epoch 5a899618 : overcloud-controller-0 : ganesha.nfsd-94[main] init_server_pkgs :NFS STARTUP :EVENT :Initializing ID Mapper.
18/02/2018 15:04:56 : epoch 5a899618 : overcloud-controller-0 : ganesha.nfsd-94[main] init_server_pkgs :NFS STARTUP :EVENT :ID Mapper successfully initialized.
18/02/2018 15:04:56 : epoch 5a899618 : overcloud-controller-0 : ganesha.nfsd-94[main] main :NFS STARTUP :WARN :No export entries found in configuration file !!!
18/02/2018 15:04:56 : epoch 5a899618 : overcloud-controller-0 : ganesha.nfsd-94[main] lower_my_caps :NFS STARTUP :EVENT :CAP_SYS_RESOURCE was successfully removed for proper quota management in FSAL
18/02/2018 15:04:56 : epoch 5a899618 : overcloud-controller-0 : ganesha.nfsd-94[main] lower_my_caps :NFS STARTUP :EVENT :currenty set capabilities are: = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap+eip
18/02/2018 15:04:56 : epoch 5a899618 : overcloud-controller-0 : ganesha.nfsd-94[main] Bind_sockets_V6 :DISP :WARN :Cannot bind NFS udp6 socket, error 99 (Cannot assign requested address)
18/02/2018 15:04:56 : epoch 5a899618 : overcloud-controller-0 : ganesha.nfsd-94[main] Bind_sockets :DISP :FATAL :Error binding to V6 interface. Cannot continue.
exec: PID 94: exit 2
[root@overcloud-controller-0 ~]#
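The modified line still ends in '|| return 0'. A small sketch, using a stand-in command rather than ganesha.nfsd (start_with_fallback is a hypothetical name), suggests why this does not mask the failure seen above: once exec has replaced the shell, nothing after it on the line runs, so the fallback could at most matter if exec itself fails to launch the binary.

```shell
#!/usr/bin/env bash
# Same shape as the modified start_nfs line, with a stand-in that exits 2
# like the ganesha.nfsd run in the log above.
start_with_fallback() {
  exec sh -c 'exit 2' || return 0
}

# Run in a subshell so that exec replaces only the subshell.
( start_with_fallback )
status=$?

echo "exit status: $status"
```

The status that reaches the caller is the stand-in's own, matching the 'exit 2' reported for PID 94.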

Comment 8 Sébastien Han 2018-02-23 11:43:55 UTC
Tom, it's available upstream and downstream already. tag-build-master-luminous-ubuntu-16.04 or tag-build-master-luminous-centos-7, same for jewel.

Comment 9 Tom Barron 2018-02-23 12:21:25 UTC
(In reply to leseb from comment #8)
> Tom, it's available upstream and downstream already.
> tag-build-master-luminous-ubuntu-16.04 or
> tag-build-master-luminous-centos-7, same for jewel.

Terrific - Thanks, Seb!

Comment 10 Tom Barron 2018-02-23 14:27:11 UTC
(In reply to leseb from comment #8)
> Tom, it's available upstream and downstream already.
> tag-build-master-luminous-ubuntu-16.04 or
> tag-build-master-luminous-centos-7, same for jewel.

Actually, upstream OpenStack (RDO and TripleO) uses tag-stable-3.0-luminous-centos-7 (which is two weeks old), and downstream (in RHOSP) I think we won't be using a centos-7 container but one built for RHCS, right?

Comment 13 Sébastien Han 2018-03-08 17:21:01 UTC
If this is upstream you will use tag-stable-3.0-luminous-centos-7.

Comment 23 Yogev Rabl 2018-04-13 12:16:17 UTC
Verified by Tom Barron

Comment 26 errata-xmlrpc 2018-04-26 18:31:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1260

