Bug 2083074

Summary: [Tracker for Ceph BZ #2086419] Two Ceph mons crashed in ceph-16.2.7/src/mon/PaxosService.cc: 193: FAILED ceph_assert(have_pending)
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Daniel Horák <dahorak>
Component: ceph
Assignee: Neha Ojha <nojha>
ceph sub component: RADOS
QA Contact: Daniel Horák <dahorak>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: unspecified
CC: bniver, ebenahar, madam, muagarwa, nojha, ocs-bugs, odf-bz-bot, pdhiran, vumrao
Version: 4.10
Target Milestone: ---
Target Release: ODF 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-08-24 13:51:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2086419    
Bug Blocks:    

Description Daniel Horák 2022-05-09 09:03:53 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
  On one vSphere UPI ENCRYPTION 1AZ RHCOS VSAN LSO VMDK 3M 3W cluster, two Ceph mons crashed:

  HEALTH_WARN 2 daemons have recently crashed
[WRN] RECENT_CRASH: 2 daemons have recently crashed
    mon.a crashed on host rook-ceph-mon-a-76df6b948c-fdlpj at 2022-05-08T08:47:27.674276Z
    mon.b crashed on host rook-ceph-mon-b-57c6466c5d-zvp5w at 2022-05-08T08:47:42.688845Z
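
  The crash metadata for the affected mons can be inspected from the Rook
  toolbox. A minimal sketch, assuming the rook-ceph-tools deployment is
  available in the openshift-storage namespace (these resource names are an
  assumption, not taken from this report):

    # list recent daemon crashes recorded by the Ceph crash module
    oc -n openshift-storage rsh deploy/rook-ceph-tools ceph crash ls

    # show backtrace and metadata for one crash (ID taken from the list above)
    oc -n openshift-storage rsh deploy/rook-ceph-tools ceph crash info <crash-id>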

Version of all relevant components (if applicable):
  OCP 4.10.0-0.nightly-2022-05-07-205137
  ODF: ocs-registry:4.10.1-5

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
  


Is this issue reproducible?
  We saw this issue only once; the re-triggered job passed, so we are not sure
  about its reproducibility.


Can this issue be reproduced from the UI?
  N/A


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy an OCP cluster.
2. Deploy ODF on top of OCP.
3. Check the Ceph health status (see the example below).
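
A minimal sketch of step 3, assuming the Rook toolbox (rook-ceph-tools
deployment) is available in the openshift-storage namespace; exact resource
names are deployment-specific:

  # overall health, including any RECENT_CRASH warnings
  oc -n openshift-storage rsh deploy/rook-ceph-tools ceph health detail

  # broader status overview (mons, mgr, OSDs, PGs)
  oc -n openshift-storage rsh deploy/rook-ceph-tools ceph -s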


Actual results:
  Ceph cluster health is not OK. Health: HEALTH_WARN 2 daemons have recently crashed


Expected results:
  Ceph health should be HEALTH_OK and no daemon should crash.


Additional info:
  I'll post links to the job and must-gather logs in a following comment.

Comment 9 Daniel Horák 2022-08-22 08:22:50 UTC
Verifying based on multiple recent passing executions and on the fact that the main bug 2086419 was fixed in ceph 16.2.8-49.el8cp and verified against 16.2.8-50.el8cp.

The ODF CI executions were triggered against the following versions (this is only a quick selection; see the version check example after the list):
* ODF: 4.11.0-105, Ceph: 16.2.8-59.el8cp
* ODF: 4.11.0-109, Ceph: 16.2.8-59.el8cp
* ODF: 4.11.0-111, Ceph: 16.2.8-65.el8cp
* ODF: 4.11.0-113, Ceph: 16.2.8-65.el8cp
* ODF: 4.11.0-127, Ceph: 16.2.8-80.el8cp
* ODF: 4.11.0-129, Ceph: 16.2.8-80.el8cp
* ODF: 4.11.0-131, Ceph: 16.2.8-84.el8cp
* ODF: 4.11.0-137, Ceph: 16.2.8-84.el8cp
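
The Ceph build running in a given ODF cluster can be confirmed from the Rook
toolbox; a sketch assuming the rook-ceph-tools deployment in the
openshift-storage namespace:

  # report which Ceph release each daemon type is running
  oc -n openshift-storage rsh deploy/rook-ceph-tools ceph versions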

>> VERIFIED

Comment 11 errata-xmlrpc 2022-08-24 13:51:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6156