Bug 2083074 - [Tracker for Ceph BZ #2086419] Two Ceph mons crashed in ceph-16.2.7/src/mon/PaxosService.cc: 193: FAILED ceph_assert(have_pending)
Summary: [Tracker for Ceph BZ #2086419] Two Ceph mons crashed in ceph-16.2.7/src/mon/P...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ODF 4.11.0
Assignee: Neha Ojha
QA Contact: Daniel Horák
URL:
Whiteboard:
Depends On: 2086419
Blocks:
 
Reported: 2022-05-09 09:03 UTC by Daniel Horák
Modified: 2023-08-09 16:37 UTC
CC List: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-24 13:51:29 UTC
Embargoed:



Description Daniel Horák 2022-05-09 09:03:53 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
  On one vSphere UPI ENCRYPTION 1AZ RHCOS VSAN LSO VMDK 3M 3W cluster, two Ceph mons crashed:

  HEALTH_WARN 2 daemons have recently crashed
[WRN] RECENT_CRASH: 2 daemons have recently crashed
    mon.a crashed on host rook-ceph-mon-a-76df6b948c-fdlpj at 2022-05-08T08:47:27.674276Z
    mon.b crashed on host rook-ceph-mon-b-57c6466c5d-zvp5w at 2022-05-08T08:47:42.688845Z
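
  For reference, the details behind a RECENT_CRASH warning can be inspected
  from the rook-ceph toolbox (a minimal sketch, assuming a toolbox deployment
  named "rook-ceph-tools" is enabled in the openshift-storage namespace;
  the crash ID is a placeholder):

    # open a shell in the toolbox pod
    oc -n openshift-storage rsh deploy/rook-ceph-tools

    # list recent crashes and show the backtrace of a specific one
    ceph crash ls
    ceph crash info <crash-id>

    # after triage, archive the crashes to clear the HEALTH_WARN
    ceph crash archive-all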

Version of all relevant components (if applicable):
  OCP 4.10.0-0.nightly-2022-05-07-205137
  ODF: ocs-registry:4.10.1-5

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
  


Is this issue reproducible?
  We saw this issue only once; the re-triggered job passed, so we are not sure
  about the reproducibility.


Can this issue be reproduced from the UI?
  N/A


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy OCP cluster.
2. Deploy ODF on top of OCP.
3. Check the Ceph health status (see the command sketch below).
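
  A minimal sketch of step 3, assuming the rook-ceph toolbox deployment
  "rook-ceph-tools" is enabled in the openshift-storage namespace:

    # overall cluster state; RECENT_CRASH in the output indicates crashed daemons
    oc -n openshift-storage rsh deploy/rook-ceph-tools ceph status
    oc -n openshift-storage rsh deploy/rook-ceph-tools ceph health detail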


Actual results:
  Ceph cluster health is not OK. Health: HEALTH_WARN 2 daemons have recently crashed


Expected results:
  Ceph health will be HEALTH_OK and no daemon will crash.


Additional info:
  I'll post links to the job and must-gather logs in a following comment.

Comment 9 Daniel Horák 2022-08-22 08:22:50 UTC
Verifying based on multiple recent passed executions and also on the fact that the main bug 2086419 was fixed in ceph 16.2.8-49.el8cp and verified against 16.2.8-50.el8cp.

The ODF CI executions were triggered against the following versions (this is only a quick selection; a version-check sketch follows the list):
* ODF: 4.11.0-105, Ceph: 16.2.8-59.el8cp
* ODF: 4.11.0-109, Ceph: 16.2.8-59.el8cp
* ODF: 4.11.0-111, Ceph: 16.2.8-65.el8cp
* ODF: 4.11.0-113, Ceph: 16.2.8-65.el8cp
* ODF: 4.11.0-127, Ceph: 16.2.8-80.el8cp
* ODF: 4.11.0-129, Ceph: 16.2.8-80.el8cp
* ODF: 4.11.0-131, Ceph: 16.2.8-84.el8cp
* ODF: 4.11.0-137, Ceph: 16.2.8-84.el8cp
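
For completeness, a sketch of how the running Ceph build can be checked
against the versions above (same toolbox assumption as in the description):

  # report the ceph version in use by each daemon type (mon, mgr, osd, ...)
  oc -n openshift-storage rsh deploy/rook-ceph-tools ceph versions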

>> VERIFIED

Comment 11 errata-xmlrpc 2022-08-24 13:51:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6156

