Bug 1743425 - Ceph logs not captured in an OpenStack deployment
Summary: Ceph logs not captured in an OpenStack deployment
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: sos
Version: 7.7
Hardware: All
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Pavel Moravec
QA Contact: Miroslav Hradílek
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-08-19 23:19 UTC by Goutham Pacha Ravi
Modified: 2020-09-29 20:55 UTC
CC List: 13 users

Fixed In Version: sos-3.9-2.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-29 20:55:10 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID | Priority | Status | Summary | Last Updated
GitHub sosreport/sos pull 1776 | None | closed | [ceph] collect journal logs for ceph services | 2020-08-04 07:34:18 UTC
Red Hat Bugzilla 1710548 | high | CLOSED | ceph container logging to respective daemon log file | 2024-03-25 15:17:35 UTC
Red Hat Product Errata RHEA-2020:4034 | None | None | None | 2020-09-29 20:55:50 UTC

Description Goutham Pacha Ravi 2019-08-19 23:19:48 UTC
Description of problem:

Ceph can be deployed alongside OpenStack with TripleO (Director) and ceph-ansible, on dedicated "ceph" storage nodes. Some Ceph processes (MDS, nfs-ganesha) are deployed on the OpenStack controller nodes. The sosreport tooling currently does not capture any of the Ceph log files in such environments.


Version-Release number of selected component (if applicable): Version 3.7 of the sos package was being used when this deficiency was discovered:

$ rpm -q sos
sos-3.7-5.el7.noarch

OSP 13, a long-life release, is deployed with RHEL 7. OSP 16 (upcoming) will be deployed with RHEL 8, so any fix for this issue is applicable to both platforms (RHEL 7 and RHEL 8).

How reproducible: Always

Steps to Reproduce:

On an RHEL OSP controller (or ceph) node, execute:

$ sosreport --all-logs

Examine the generated sosreport; no Ceph logs are included.

When Ceph is deployed alongside OSP, it is installed via ceph-ansible in containers. These containers can be controlled with systemd, and their log files, while persisted in the local container filesystems, are also written to the system journal. They can be read from the (OpenStack overcloud controller) host as follows:

$ sudo journalctl CONTAINER_NAME=ceph-mds-controller-0
$ sudo journalctl CONTAINER_NAME=ceph-mon-controller-0
$ sudo journalctl CONTAINER_NAME=ceph-mgr-controller-0
$ sudo journalctl CONTAINER_NAME=ceph-nfs-pacemaker


For example:
[heat-admin@controller-0 ~]$ sudo journalctl CONTAINER_NAME=ceph-nfs-pacemaker
-- Logs begin at Mon 2019-08-19 22:06:21 UTC, end at Mon 2019-08-19 23:16:19 UTC. --
Aug 19 23:01:22 controller-0 dockerd-current[20176]: 2019-08-19 23:01:22  /entrypoint.sh: static: does not generate config
Aug 19 23:01:22 controller-0 dockerd-current[20176]: HEALTH_OK
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 2019-08-19 23:01:23  /entrypoint.sh: SUCCESS
Aug 19 23:01:23 controller-0 dockerd-current[20176]: exec: PID 138: spawning /usr/bin/ganesha.nfsd  -F -L STDOUT
Aug 19 23:01:23 controller-0 dockerd-current[20176]: exec: Waiting 138 to quit
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] main :MAIN :EVENT :ganesha.nfsd Starting: Ganesha Version 2.7.1
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_set_param_from_conf :NFS STARTUP :EVENT :Configuration file successfully p
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] init_server_pkgs :NFS STARTUP :EVENT :Initializing ID Mapper.
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] init_server_pkgs :NFS STARTUP :EVENT :ID Mapper successfully initialized.
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] rados_kv_init :CLIENT ID :EVENT :Rados kv store init done
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 90
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] main :NFS STARTUP :WARN :No export entries found in configuration file !!!
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] config_errs_to_log :CONFIG :WARN :Config File (/etc/ganesha/ganesha.conf:14):
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] config_errs_to_log :CONFIG :WARN :Config File (/etc/ganesha/ganesha.conf:17):
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] lower_my_caps :NFS STARTUP :EVENT :CAP_SYS_RESOURCE was successfully removed f
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] lower_my_caps :NFS STARTUP :EVENT :currenty set capabilities are: = cap_chown,
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_Init_svc :DISP :CRIT :Cannot acquire credentials for principal nfs
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_Init_admin_thread :NFS CB :EVENT :Admin thread initialized
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_rpc_cb_init_ccache :NFS STARTUP :EVENT :Callback creds directory (/var/run
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_rpc_cb_init_ccache :NFS STARTUP :WARN :gssd_refresh_krb5_machine_credentia
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_Start_threads :THREAD :EVENT :Starting delayed executor.
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_Start_threads :THREAD :EVENT :9P/TCP dispatcher thread was started success
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_Start_threads :THREAD :EVENT :gsh_dbusthread was started successfully
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_Start_threads :THREAD :EVENT :admin thread was started successfully
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_Start_threads :THREAD :EVENT :reaper thread was started successfully
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_Start_threads :THREAD :EVENT :General fridge was started successfully
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[_9p_disp] _9p_dispatcher_thread :9P DISP :EVENT :9P dispatcher started
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] rpc :TIRPC :EVENT :svc_rqst_hook_events: 0x5624f7bd54a0 fd 1024 xp_refcnt 1 sr
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nsm_connect :NLM :CRIT :connect to statd failed: RPC: Unknown protocol
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nsm_unmonitor_all :NLM :CRIT :Unmonitor all nsm_connect failed
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_start :NFS STARTUP :EVENT :-----------------------------------------------
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
Aug 19 23:01:23 controller-0 dockerd-current[20176]: 19/08/2019 23:01:23 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[main] nfs_start :NFS STARTUP :EVENT :-----------------------------------------------
Aug 19 23:02:53 controller-0 dockerd-current[20176]: 19/08/2019 23:02:53 : epoch 5d5b2a43 : controller-0 : ganesha.nfsd-138[reaper] nfs_lift_grace_locked :STATE :EVENT :NFS Server Now NOT IN GRACE

Comment 2 Goutham Pacha Ravi 2019-08-19 23:29:42 UTC
Here's a related bug on the OpenStack side to persist logs onto the host ceph/controller nodes: https://bugs.launchpad.net/tripleo/+bug/1721841 

By default, system journal logs are not persistent, which makes things harder if the logs have been purged between an incident and the sos report collection.

Comment 3 Pavel Moravec 2019-08-21 13:44:54 UTC
As I don't speak the Ceph/OpenStack/... language, could you please clarify:

- what logs are missing from collection? Cf. the "add_copy_spec" calls in https://github.com/sosreport/sos/blob/master/sos/plugins/ceph.py .

- what should trigger collection of those logs (but does not do so now)? Referring to "Some ceph processes (MDS, nfs-ganesha) are deployed on OpenStack controller nodes.", is there some package on those nodes whose presence can be used as the trigger? (sosreport consists of plugins like ceph. Each plugin collects data independently of the other plugins, and a plugin is triggered automatically by 1) presence of a file, 2) presence of a package, 3) a kernel module being loaded.)
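For readers less familiar with sos internals, here is a minimal sketch of the plugin structure described above, assuming the sos 3.x plugin API; the trigger lists and paths are illustrative and only paraphrase the upstream ceph.py, they are not the eventual RHEL patch:

# Sketch only: how an sos 3.x plugin declares its enablement triggers
# and what it copies into the report archive.
from sos.plugins import Plugin, RedHatPlugin

class Ceph(Plugin, RedHatPlugin):
    """CEPH distributed storage"""
    plugin_name = 'ceph'
    profiles = ('storage',)

    # The plugin runs automatically when any trigger matches:
    files = ('/var/lib/ceph/', '/var/run/ceph/')     # 1) a file or directory exists
    packages = ('ceph', 'ceph-mds', 'ceph-common')   # 2) a package is installed
    # 3) a loaded kernel module can also serve as a trigger

    def setup(self):
        # add_copy_spec() copies the matching files/globs into the archive
        self.add_copy_spec([
            '/etc/ceph/',
            '/var/log/ceph/*.log',
        ])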

Comment 4 Goutham Pacha Ravi 2019-08-21 21:21:34 UTC
Hi Pavel,

(In reply to Pavel Moravec from comment #3)
> As I don't speak the Ceph/OpenStack/... language, could you please clarify:
> 
> - what logs are missing from collection? Cf. the "add_copy_spec" calls in
> https://github.com/sosreport/sos/blob/master/sos/plugins/ceph.py .

Ceph processes deployed with ceph-ansible do not persist log files in local storage (yet, please see https://bugs.launchpad.net/tripleo/+bug/1721841). 

I think we'll need to "add_journal" and grab the logs for now, since they're being written to the host's journal:

 $ sudo journalctl CONTAINER_NAME=ceph-mds-$HOSTNAME
 $ sudo journalctl CONTAINER_NAME=ceph-mon-$HOSTNAME
 $ sudo journalctl CONTAINER_NAME=ceph-mgr-$HOSTNAME
 $ sudo journalctl CONTAINER_NAME=ceph-nfs-pacemaker


> - what should trigger collection of those logs (but does not do so now)?
> Referring to "Some ceph processes (MDS, nfs-ganesha) are deployed on
> OpenStack controller nodes.", is there some package on those nodes whose
> presence can be used as the trigger? (sosreport consists of plugins like
> ceph. Each plugin collects data independently of the other plugins, and a
> plugin is triggered automatically by 1) presence of a file, 2) presence of
> a package, 3) a kernel module being loaded.)

Yes, I think the trigger would be testing the presence of the systemd units:

 $ sudo systemctl status ceph-nfs@pacemaker
 $ sudo systemctl status ceph-mds@$HOSTNAME
 $ sudo systemctl status ceph-mon@$HOSTNAME
 $ sudo systemctl status ceph-mgr@$HOSTNAME
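One possible way to capture the journal entries proposed above, without first resolving the unit-name question discussed in the next comments, would be to record the raw journalctl commands verbatim. This is a rough sketch only, assuming the sos 3.x add_cmd_output() helper; the hostname lookup and container names are illustrative, and this is not what eventually landed upstream:

# Rough sketch only; hostname handling and container names are illustrative.
import socket
from sos.plugins import Plugin, RedHatPlugin

class Ceph(Plugin, RedHatPlugin):
    plugin_name = 'ceph'

    def setup(self):
        host = socket.gethostname().split('.')[0]   # e.g. "controller-0"
        # add_cmd_output() runs each command and stores its output in the archive
        self.add_cmd_output([
            "journalctl --no-pager CONTAINER_NAME=ceph-mds-%s" % host,
            "journalctl --no-pager CONTAINER_NAME=ceph-mon-%s" % host,
            "journalctl --no-pager CONTAINER_NAME=ceph-mgr-%s" % host,
            "journalctl --no-pager CONTAINER_NAME=ceph-nfs-pacemaker",
        ])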

Comment 5 Pavel Moravec 2019-08-22 11:23:36 UTC
Thanks for the prompt feedback. So to confirm the change, the ceph plugin will newly:

- be enabled (i.e. automatically run) _also_ by the presence of _any_ of the services ceph-nfs@pacemaker, ceph-mds@$HOSTNAME, ceph-mon@$HOSTNAME, or ceph-mgr@$HOSTNAME
- collect the output of those four journalctl commands
  - is the command "journalctl CONTAINER_NAME=ceph-nfs-pacemaker" equivalent to "journalctl --unit CONTAINER_NAME=ceph-nfs-pacemaker" or similar? Can the CONTAINER_NAME=.. match be rephrased using one of the options --unit / --boot / --since / --until / --lines / --output / --identifier (these are the arguments the add_journal method supports)?

Comment 6 Goutham Pacha Ravi 2019-08-23 18:23:11 UTC
(In reply to Pavel Moravec from comment #5)
> Thanks for the prompt feedback. So to confirm the change, the ceph plugin will newly:
> 
> - be enabled (i.e. automatically run) _also_ by the presence of _any_ of the
> services ceph-nfs@pacemaker, ceph-mds@$HOSTNAME, ceph-mon@$HOSTNAME, or
> ceph-mgr@$HOSTNAME
> - collect the output of those four journalctl commands
>   - is the command "journalctl CONTAINER_NAME=ceph-nfs-pacemaker" equivalent
> to "journalctl --unit CONTAINER_NAME=ceph-nfs-pacemaker" or similar? Can the
> CONTAINER_NAME=.. match be rephrased using one of the options --unit / --boot /
> --since / --until / --lines / --output / --identifier (these are the arguments
> the add_journal method supports)?

The systemd units are, however, named *slightly* differently; for --unit, we'll need to use:

 $ sudo journalctl --unit ceph-nfs@pacemaker
 $ sudo journalctl --unit ceph-mds@$HOSTNAME
 $ sudo journalctl --unit ceph-mon@$HOSTNAME
 $ sudo journalctl --unit ceph-mgr@$HOSTNAME
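Putting comments 4 through 6 together, here is a hedged sketch of how the plugin could collect these journals via add_journal(), which issues "journalctl --unit <name>" for the unit it is given. The hostname expansion is illustrative; the actual change landed upstream via the pull request linked above and shipped in sos-3.9-2.el7:

# Illustrative sketch only; see the linked upstream pull request for the real change.
import socket
from sos.plugins import Plugin, RedHatPlugin

class Ceph(Plugin, RedHatPlugin):
    plugin_name = 'ceph'

    def setup(self):
        host = socket.gethostname().split('.')[0]   # e.g. "controller-0"
        # add_journal(units=...) collects "journalctl --unit <name>" output,
        # so the systemd unit names (not the CONTAINER_NAME= values) are what matter.
        for unit in ("ceph-nfs@pacemaker",
                     "ceph-mds@%s" % host,
                     "ceph-mon@%s" % host,
                     "ceph-mgr@%s" % host):
            self.add_journal(units=unit)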

Comment 7 John Fulton 2019-08-28 19:20:08 UTC
You might want to look at bug 1710548

Comment 10 Pavel Moravec 2019-09-05 15:14:00 UTC
Upstream PR raised; tentatively scheduled for 7.9.

Comment 13 Jan Jansky 2020-06-01 14:13:16 UTC
Hello,
thanks for being available to test the fix. Please use the repository / package below:


A yum repository for the build of sos-3.9-2.el7 (task 28860092) is available at:

http://brew-task-repos.usersys.redhat.com/repos/official/sos/3.9/2.el7/

You can install the rpms locally by putting this .repo file in your /etc/yum.repos.d/ directory:

http://brew-task-repos.usersys.redhat.com/repos/official/sos/3.9/2.el7/sos-3.9-2.el7.repo

RPMs and build logs can be found in the following locations:
http://brew-task-repos.usersys.redhat.com/repos/official/sos/3.9/2.el7/noarch/

The full list of available rpms is:
http://brew-task-repos.usersys.redhat.com/repos/official/sos/3.9/2.el7/noarch/sos-3.9-2.el7.src.rpm
http://brew-task-repos.usersys.redhat.com/repos/official/sos/3.9/2.el7/noarch/sos-3.9-2.el7.noarch.rpm

The repository will be available for the next 60 days. Scratch build output will be deleted
earlier, based on the Brew scratch build retention policy.

Comment 36 errata-xmlrpc 2020-09-29 20:55:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (sos bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4034

