Description of problem:
Ceph can be deployed alongside an OpenStack deployment with TripleO (Director) and ceph-ansible on dedicated "ceph" storage nodes. Some Ceph processes (MDS, nfs-ganesha) are deployed on OpenStack controller nodes. The sosreport tooling currently does not capture any of the Ceph log files in such environments.

Version-Release number of selected component (if applicable):
Version 3.7 of the sos package was in use when this deficiency was discovered:

$ rpm -q sos
sos-3.7-5.el7.noarch

OSP 13, a long-life release, is deployed on RHEL 7. OSP 16 (upcoming) will be deployed on RHEL 8, so any fix for this issue is applicable to both platforms (RHEL 7 and RHEL 8).

How reproducible:
Always

Steps to Reproduce:
On a RHEL OSP controller (or ceph) node, execute:

$ sosreport --all-logs

Examine the sosreport logs; no Ceph logs are included.

When Ceph is deployed alongside OSP, it is installed via ceph-ansible in containers. These containers can be controlled with systemd, and their log files, while kept in the local container filesystems, are also written to the system journal. They can be read from the (OpenStack overcloud controller) host as follows:

$ sudo journalctl CONTAINER_NAME=ceph-mds-controller-0
$ sudo journalctl CONTAINER_NAME=ceph-mon-controller-0
$ sudo journalctl CONTAINER_NAME=ceph-mgr-controller-0
$ sudo journalctl CONTAINER_NAME=ceph-nfs-pacemaker

For example:

[heat-admin@controller-0 ~]$ sudo journalctl CONTAINER_NAME=ceph-nfs-pacemaker
-- Logs begin at Mon 2019-08-19 22:06:21 UTC, end at Mon 2019-08-19 23:16:19 UTC. --
Aug 19 23:01:22 controller-0 dockerd-current[20176]: 2019-08-19 23:01:22 /entrypoint.sh: static: does not generate config
Aug 19 23:01:22 controller-0 dockerd-current[20176]: HEALTH_OK
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 2019-08-19 23:01:23 /entrypoint.sh: SUCCESS
Aug 19 23:01:23 controller-0 dockerd-current[20176]: exec: PID 138: spawning /usr/bin/ganesha.nfsd -F -L STDOUT
Aug 19 23:01:23 controller-0 dockerd-current[20176]: exec: Waiting 138 to quit
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] main :MAIN :EVENT :ganesha.nfsd Starting: Ganesha Version 2.7.1
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_set_param_from_conf :NFS STARTUP :EVENT :Configuration file successfully p
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] init_server_pkgs :NFS STARTUP :EVENT :Initializing ID Mapper.
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] init_server_pkgs :NFS STARTUP :EVENT :ID Mapper successfully initialized.
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] rados_kv_init :CLIENT ID :EVENT :Rados kv store init done
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 90
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] main :NFS STARTUP :WARN :No export entries found in configuration file !!!
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] config_errs_to_log :CONFIG :WARN :Config File (/etc/ganesha/ganesha.conf:14):
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] config_errs_to_log :CONFIG :WARN :Config File (/etc/ganesha/ganesha.conf:17):
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] lower_my_caps :NFS STARTUP :EVENT :CAP_SYS_RESOURCE was successfully removed f
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] lower_my_caps :NFS STARTUP :EVENT :currenty set capabilities are: = cap_chown,
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_Init_svc :DISP :CRIT :Cannot acquire credentials for principal nfs
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_Init_admin_thread :NFS CB :EVENT :Admin thread initialized
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_rpc_cb_init_ccache :NFS STARTUP :EVENT :Callback creds directory (/var/run
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_rpc_cb_init_ccache :NFS STARTUP :WARN :gssd_refresh_krb5_machine_credentia
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_Start_threads :THREAD :EVENT :Starting delayed executor.
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_Start_threads :THREAD :EVENT :9P/TCP dispatcher thread was started success
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_Start_threads :THREAD :EVENT :gsh_dbusthread was started successfully
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_Start_threads :THREAD :EVENT :admin thread was started successfully
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_Start_threads :THREAD :EVENT :reaper thread was started successfully
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_Start_threads :THREAD :EVENT :General fridge was started successfully
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[_9p_disp] _9p_dispatcher_thread :9P DISP :EVENT :9P dispatcher started
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] rpc :TIRPC :EVENT :svc_rqst_hook_events: 0x5624f7bd54a0 fd 1024 xp_refcnt 1 sr
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nsm_connect :NLM :CRIT :connect to statd failed: RPC: Unknown protocol
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nsm_unmonitor_all :NLM :CRIT :Unmonitor all nsm_connect failed
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_start :NFS STARTUP :EVENT :-----------------------------------------------
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_start :NFS STARTUP :EVENT : NFS SERVER INITIALIZED
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_start :NFS STARTUP :EVENT :-----------------------------------------------
Aug 19 23:02:53 controller-0 dockerd-current[20176]: 19/08/2019 23:02:53 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[reaper] nfs_lift_grace_locked :STATE :EVENT :NFS Server Now NOT IN GRACE
Here's a related bug on the OpenStack side to persist the logs onto the host ceph/controller nodes: https://bugs.launchpad.net/tripleo/+bug/1721841

By default, system journal logs are not persistent, which makes things harder if the logs have been purged between an incident and the sos report collection.
As I don't speak the Ceph/OpenStack/.. language, could you please clarify:

- what logs are missing to be collected? Cf. "add_copy_spec" in https://github.com/sosreport/sos/blob/master/sos/plugins/ceph.py .
- what should trigger collection of those logs (but does not do so now)? Given that "some ceph processes (MDS, nfs-ganesha) are deployed on OpenStack controller nodes", is there some package whose presence can be used as the trigger?

(sosreport consists of plugins like ceph. Each plugin collects its data independently of other plugins, and a plugin is automatically triggered by 1) presence of a file, 2) presence of a package, or 3) a loaded kernel module.)
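The three trigger mechanisms described above can be modeled with a small stand-in. This is a toy illustration, not the real sos Plugin API; the class name, trigger file, package, and kernel module used here are all hypothetical:

```python
# Toy model of sos plugin triggering as described above: a plugin is
# enabled by 1) presence of a file, 2) presence of a package, or
# 3) a loaded kernel module. Not the real sos API; names are illustrative.
class ToyCephPlugin:
    files = ("/etc/ceph/ceph.conf",)   # hypothetical trigger file
    packages = ("ceph-common",)        # hypothetical trigger package
    kernel_mods = ("libceph",)         # hypothetical trigger module

    def check_enabled(self, present_files, installed_pkgs, loaded_mods):
        # Any one of the three conditions is enough to enable the plugin.
        return (any(f in present_files for f in self.files)
                or any(p in installed_pkgs for p in self.packages)
                or any(m in loaded_mods for m in self.kernel_mods))

plugin = ToyCephPlugin()
# Enabled on a node that only has the package installed:
print(plugin.check_enabled(set(), {"ceph-common"}, set()))  # → True
```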
Hi Pavel,

(In reply to Pavel Moravec from comment #3)
> As I dont speak Ceph/OpenStack/.. language, could you please clarify:
>
> - what logs are missing to be collected? Cf. with "add_copy_spec" from
> https://github.com/sosreport/sos/blob/master/sos/plugins/ceph.py .

Ceph processes deployed with ceph-ansible do not yet persist log files in local storage (please see https://bugs.launchpad.net/tripleo/+bug/1721841). I think we'll need to "add_journal" and grab the logs for now, since they're being written to the host's journal:

$ sudo journalctl CONTAINER_NAME=ceph-mds-$HOSTNAME
$ sudo journalctl CONTAINER_NAME=ceph-mon-$HOSTNAME
$ sudo journalctl CONTAINER_NAME=ceph-mgr-$HOSTNAME
$ sudo journalctl CONTAINER_NAME=ceph-nfs-pacemaker

> - what should trigger collection of those logs (but does not do so now)? Is
> there some package on "Some ceph processes (MDS, nfs-ganesha) are deployed
> on OpenStack controller nodes." that's presence can be used as the trigger?
> (sosreport consists of plugins like ceph. Each plugin collects independent
> data on other plugins and the plugin is automatically triggered by 1)
> presence of a file, 2) presence of a package, 3) kernel module loaded)

Yes, I think the trigger would be testing the presence of the systemd units:

$ sudo systemctl status ceph-nfs@pacemaker
$ sudo systemctl status ceph-mds@$HOSTNAME
$ sudo systemctl status ceph-mon@$HOSTNAME
$ sudo systemctl status ceph-mgr@$HOSTNAME
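The naming pattern above (per-host names for the mds/mon/mgr containers, a fixed name for the nfs container) can be sketched as follows. "controller-0" is a stand-in for the real hostname:

```python
# Build the journalctl filters for the containerized Ceph daemons named
# above. "controller-0" is an example hostname, not a fixed value.
hostname = "controller-0"

daemons = ["mds", "mon", "mgr"]                        # named per host
containers = ["ceph-{}-{}".format(d, hostname) for d in daemons]
containers.append("ceph-nfs-pacemaker")                # fixed name

journal_cmds = ["journalctl CONTAINER_NAME=" + c for c in containers]
for cmd in journal_cmds:
    print(cmd)
```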
Thanks for the prompt feedback. So to confirm the change, the ceph plugin will newly:

- be enabled (i.e. automatically run when ..) _also_ by the presence of _any_ of the services ceph-nfs@pacemaker, ceph-mds@$HOSTNAME, ceph-mon@$HOSTNAME or ceph-mgr@$HOSTNAME
- collect those four journalctl commands

Also: is the command "journalctl CONTAINER_NAME=ceph-nfs-pacemaker" equivalent to "journalctl --unit CONTAINER_NAME=ceph-nfs-pacemaker" or similar? Can the CONTAINER_NAME=.. be expressed via one of the options --unit / --boot / --since / --until / --lines / --output / --identifier (these are what the add_journal method supports as arguments)?
(In reply to Pavel Moravec from comment #5)
> Thanks for prompt feedback. So to confirm the change: ceph plugin will newly:
>
> - be enabled (i.e. automatically run when ..) _also_ by presence of _either_
> of the service ceph-nfs@pacemaker or ceph-mds@$HOSTNAME or
> ceph-mon@$HOSTNAME or ceph-mgr@$HOSTNAME
> - collect those four journalctl commands
> - is the command "journalctl CONTAINER_NAME=ceph-nfs-pacemaker" equivalent
> to "journalctl --unit CONTAINER_NAME=ceph-nfs-pacemaker" or similar? Can be
> the CONTAINER_NAME=.. rephrased by either option --unit / --boot / --since /
> --until / --lines / --output / --identifier (this is what add_journal method
> supports as arguments) ?

The systemd units are however named *slightly* differently; for --unit, we'll need to use:

$ sudo journalctl --unit ceph-nfs@pacemaker
$ sudo journalctl --unit ceph-mds@$HOSTNAME
$ sudo journalctl --unit ceph-mon@$HOSTNAME
$ sudo journalctl --unit ceph-mgr@$HOSTNAME
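The container-name to systemd-unit mapping noted above can be sketched as follows, again with "controller-0" standing in for the real hostname:

```python
# Map each container name to its systemd template-instance unit, per the
# comment above: the units are named slightly differently, so --unit must
# be given e.g. ceph-mds@<hostname>, not the CONTAINER_NAME value.
hostname = "controller-0"  # example hostname

unit_for_container = {
    "ceph-mds-" + hostname: "ceph-mds@" + hostname,
    "ceph-mon-" + hostname: "ceph-mon@" + hostname,
    "ceph-mgr-" + hostname: "ceph-mgr@" + hostname,
    "ceph-nfs-pacemaker": "ceph-nfs@pacemaker",
}

for unit in unit_for_container.values():
    print("journalctl --unit " + unit)
```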
You might want to look at bug 1710548
Upstream PR raised; tentatively scheduled for 7.9.
Hello, thanks for your availability to test the fix. Please use the repository/package below:

A yum repository for the build of sos-3.9-2.el7 (task 28860092) is available at:
http://brew-task-repos.usersys.redhat.com/repos/official/sos/3.9/2.el7/

You can install the rpms locally by putting this .repo file in your /etc/yum.repos.d/ directory:
http://brew-task-repos.usersys.redhat.com/repos/official/sos/3.9/2.el7/sos-3.9-2.el7.repo

RPMs and build logs can be found in the following location:
http://brew-task-repos.usersys.redhat.com/repos/official/sos/3.9/2.el7/noarch/

The full list of available rpms is:
http://brew-task-repos.usersys.redhat.com/repos/official/sos/3.9/2.el7/noarch/sos-3.9-2.el7.src.rpm
http://brew-task-repos.usersys.redhat.com/repos/official/sos/3.9/2.el7/noarch/sos-3.9-2.el7.noarch.rpm

The repository will be available for the next 60 days. Scratch build output will be deleted earlier, based on the Brew scratch build retention policy.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (sos bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:4034