Bug 2216473 - Ceph cluster logging is incomprehensible [NEEDINFO]
Summary: Ceph cluster logging is incomprehensible
Keywords:
Status: NEW
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 5.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 7.1
Assignee: Radoslaw Zarzynski
QA Contact: Pawan
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-06-21 14:37 UTC by Greg Farnum
Modified: 2023-07-11 19:56 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
gfarnum: needinfo? (rzarzyns)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-6902 0 None None None 2023-06-21 17:40:49 UTC

Description Greg Farnum 2023-06-21 14:37:14 UTC
Description of problem: Under cephadm, Ceph daemons are configured to dump everything to stdout, where it is captured by journald and read back with journalctl.
While in many ways this makes sense, it breaks down horribly when we commingle multiple logs, as is the case for the monitor and the central log: we no longer have a single file we can look at to see the cluster log messages, nor a good way to extract them from existing logs.

Moreover, we know that journald will happily throw out log messages that it deems to be redundant, and that can render logs useless for our purposes. :(

We need to identify and ship a more sensible solution that enables the use of the central log by users and our support org, which is the whole point of having it and logging these centralized messages in the first place.

These defaults were set by Sage to make things more "container-y" but do not seem to have received much thought or attention at the time: https://github.com/ceph/ceph/pull/32641

I see two potential approaches:
1) Just stop doing this — identify an appropriate location to write a central log file, and make sure it is gathered by sosreports and must-gather. (This may work already, since it was the way the world used to be.)

2) Change Ceph code so that these logs are dumped in a way that makes it easy to extract them from the unified journald log via journalctl. I have no idea how this works, as I ran journalctl for the first time this week (while working on the bug that prompted this: https://bugzilla.redhat.com/show_bug.cgi?id=2215168).
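
As a rough illustration of approach 2, the sketch below filters `journalctl -o json` output (one JSON object per line, with the standard `MESSAGE` and `__REALTIME_TIMESTAMP` fields) down to cluster log entries. The `cluster [LVL]` marker it matches on is an assumption about how cluster log lines appear in the monitor's output, and `extract_cluster_log` is a hypothetical helper name; neither is confirmed by this bug, so treat it as a starting point rather than a working extractor.

```python
import json
import re

# ASSUMPTION: cluster log entries can be recognised by a "cluster [LVL]" marker
# in the message text. Verify against real monitor output before relying on it.
CLUSTER_MARKER = re.compile(r"\bcluster \[(DBG|INF|SEC|WRN|ERR)\]")


def extract_cluster_log(json_lines):
    """Yield (timestamp_usec, message) for journal entries that look like cluster log lines.

    json_lines is any iterable of strings as produced by `journalctl -o json`.
    """
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON journal record
        message = entry.get("MESSAGE")
        if not isinstance(message, str):
            continue  # journald emits non-UTF-8 payloads as byte arrays
        if CLUSTER_MARKER.search(message):
            yield entry.get("__REALTIME_TIMESTAMP"), message
```

It could be fed from a pipe, for example by iterating `extract_cluster_log(sys.stdin)` in a small script invoked as `journalctl -u <systemd unit> -o json | python3 <script>`. Note that this only helps if journald has kept the messages in the first place.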

While working on this, we should comprehensively evaluate our logging strategy within cephadm and ODF — there have been a number of changes since older RHCS releases and it's not clear they are understood by either the development or support teams.

Comment 9 Harald Klein 2023-06-22 07:45:53 UTC
1) journal vs plain log files

While I see the benefits of the journal when working live on the system (e.g. the example from Adam with `-eu <systemd unit>`), the sosreport usually contains a complete journal dump in a text file, so support again ends up using grep etc. to filter the journalctl output. Here I strongly prefer that the customer has log-to-file enabled, which gives per-daemon log files in /var/log/ceph. (In the very early days of containerization the /var/log path was not mapped into the container, but that has been fixed since some RHCS3 version.)
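
For reference, a minimal sketch of what enabling file logging could look like when scripted. It assumes the `log_to_file` and `mon_cluster_log_to_file` options described in the upstream cephadm logging documentation apply unchanged to the release in question; `build_commands` and `apply_commands` are hypothetical helper names, and the default is a dry run that only prints the commands.

```python
import subprocess

# ASSUMPTION: these option names match the release in use; check with
# `ceph config help <option>` before applying them to a real cluster.
FILE_LOGGING_OPTIONS = {
    "log_to_file": "true",              # per-daemon log files
    "mon_cluster_log_to_file": "true",  # central cluster log file
}


def build_commands(options=FILE_LOGGING_OPTIONS, section="global"):
    """Return one `ceph config set` argument list per option."""
    return [
        ["ceph", "config", "set", section, name, value]
        for name, value in options.items()
    ]


def apply_commands(commands, dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Calling `apply_commands(build_commands())` prints the two `ceph config set global ...` lines without touching the cluster.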

