Description of problem:
[17.2.0] Large cephadm health detail log message causing MONs to lose quorum

Version-Release number of selected component (if applicable):
Upstream Quincy

In a recent occurrence on one of the Ceph upstream clusters (the LRC), the cluster went into DU mode due to a bug: https://tracker.ceph.com/issues/54132

Issues caused by this problem:
- SSH errors logged massive hexdump output (a cephadm bug that has already been fixed)
- These logs were also stored as part of 'ceph health detail'
- 'ceph health detail' was periodically dumped to the cluster log (which goes through Paxos and into each mon's DB), causing the mons to lose quorum and become unresponsive

WORKAROUND:
Set the option mon_health_detail_to_clog to false.

If the MONs are in quorum:
ceph config set mon mon_health_detail_to_clog false

If the MONs are not in quorum:
The option needs to be set in /var/lib/ceph/$fsid/mon.$hostname/config on each mon host, and the monitor service restarted.
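For the out-of-quorum case, a rough sketch of the manual change is below. It assumes a cephadm-managed deployment where the mon's config file is a ceph.conf-style fragment and the daemon runs under a per-fsid systemd unit; the exact file contents, section name, and unit name may differ on your system, so verify them before applying.

# On each mon host, append the option to the mon's config file
# (substitute your cluster fsid and the mon's hostname)
cat >> /var/lib/ceph/$fsid/mon.$hostname/config <<'EOF'
[mon]
mon_health_detail_to_clog = false
EOF

# Restart the mon daemon so it picks up the new config
# (unit name follows the usual cephadm pattern; confirm with 'systemctl list-units | grep mon')
systemctl restart ceph-$fsid@mon.$hostname.service

# Once quorum is restored, persist the setting in the cluster config store
ceph config set mon mon_health_detail_to_clog false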
Quincy backport: https://github.com/ceph/ceph/pull/46055
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 6.0 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:1360