Bug 2216473

Summary:	Ceph cluster logging is incomprehensible
Product:	[Red Hat Storage] Red Hat Ceph Storage	Reporter:	Greg Farnum <gfarnum>
Component:	RADOS	Assignee:	Vikhyat Umrao <vumrao>
Status:	NEW ---	QA Contact:	Pawan <pdhiran>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	5.3	CC:	adking, bhubbard, bkunal, ceph-eng-bugs, cephqe-warriors, hklein, idryomov, jdurgin, nojha, nravinas, pdonnell, rsachere, rzarzyns, vshankar, vumrao
Target Milestone:	---	Flags:	rzarzyns: needinfo? (vumrao)
Target Release:	9.1
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:		Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Greg Farnum 2023-06-21 14:37:14 UTC

Description of problem: Under cephadm, ceph daemons are configured to dump everything to stdout, where it is handled by journald and journalctl.
While in many ways this makes sense, it breaks down horribly when we commingle multiple logs, as is the case for the monitor and the central log: we no longer have a single file we can look at to see the cluster log messages, nor a good way to extract them from existing logs.

Moreover, we know that journald will happily throw out log messages that it deems to be redundant, and that can render logs useless for our purposes. :(

We need to identify and ship a more sensible solution that enables the use of the central log by users and our support org, which is the whole point of having it and logging these centralized messages in the first place.

These defaults were set by Sage to make things more "container-y" but do not seem to have received much thought or attention at the time: https://github.com/ceph/ceph/pull/32641

I see two potential approaches:
1) Just stop doing this — identify an appropriate location to write a central log file, and make sure it is gathered by sosreports and must-gather. (This may work already, since it was the way the world used to be.)

2) Change Ceph code so that these logs are dumped in a way that makes it easy to extract them from the unified journald log via journalctl. I have no idea how this works, as I ran journalctl for the first time this week (while working on the bug that prompted this: https://bugzilla.redhat.com/show_bug.cgi?id=2215168).
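As a rough illustration of what approach 2 would need to enable, here is a minimal Python sketch that pulls cluster-log entries out of a text journal dump. It assumes that cluster-log lines in the journal carry a `<channel> [<LEVEL>]` marker such as `cluster [WRN]`; that format is an assumption rather than something confirmed in this bug, and the function name `extract_cluster_log` is hypothetical.

```python
import re
from typing import Iterable, Iterator, Sequence

# ASSUMPTION: cluster-log entries contain "<channel> [<LEVEL>]",
# e.g. "cluster [WRN]". Adjust the pattern to the format actually
# seen in the journal; this is not taken from the bug report.
CLOG_MARKER = re.compile(
    r"\b(?P<channel>cluster|audit) \[(?P<level>DBG|INF|SEC|WRN|ERR)\]"
)


def extract_cluster_log(
    lines: Iterable[str], channels: Sequence[str] = ("cluster",)
) -> Iterator[str]:
    """Yield only the journal lines that look like central-log entries.

    `lines` is any iterable of text lines, for example a journal dump
    from a sosreport or the piped output of journalctl.
    """
    for line in lines:
        match = CLOG_MARKER.search(line)
        if match and match.group("channel") in channels:
            yield line.rstrip("\n")
```

One possible use is to pipe `journalctl -u <systemd unit>` into a small script that calls `extract_cluster_log(sys.stdin)` and prints each result. A filter like this only recovers messages journald actually kept, so it does not address the dropped-message concern above.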

While working on this, we should comprehensively evaluate our logging strategy within cephadm and ODF — there have been a number of changes since older RHCS releases and it's not clear they are understood by either the development or support teams.

Comment 9 Harald Klein 2023-06-22 07:45:53 UTC

1) journal vs plain log files

While I see benefits of the journal when working live on the system (e.g. the example from Adam with `-eu <systemd unit>`), the sosreport usually contains a complete journal dump in a text file, so support again ends up using grep etc. to filter the journalctl output. Here I strongly prefer that the customer has log to file enabled, giving per-daemon log files in /var/log/ceph. (In the very early days of containerization the /var/log path was not mapped into the container, but that has been fixed since some RHCS3 version.)