Bug 2248825

Summary: [cee/sd][cephfs] mds pods are crashing with ceph_assert(state == LOCK_XLOCK || state == LOCK_XLOCKDONE || state == LOCK_XLOCKSNAP || state == LOCK_LOCK_XLOCK || state == LOCK_LOCK || is_locallock())
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Lijo Stephen Thomas <lithomas>
Component: CephFS
Assignee: Xiubo Li <xiubli>
Status: CLOSED ERRATA
QA Contact: Hemanth Kumar <hyelloji>
Severity: high
Docs Contact: Ranjini M N <rmandyam>
Priority: unspecified
Version: 6.1
CC: amark, bniver, ceph-eng-bugs, cephqe-warriors, etamir, hyelloji, jansingh, mcaldeir, muagarwa, pratshar, rmandyam, sburke, smulay, sostapov, tserlin, vereddy, vshankar, xiubli
Target Milestone: ---   
Target Release: 5.3z6   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: ceph-16.2.10-219.el8cp
Doc Type: Bug Fix
Doc Text:
.The MDS no longer crashes when the journal logs are flushed
Previously, when the journal logs were successfully flushed, the lockers' state could be set to `LOCK_SYNC` or `LOCK_PREXLOCK` while the `xlock` count was non-zero. The MDS did not allow that and would crash. With this fix, the MDS allows the lockers' state to be set to `LOCK_SYNC` or `LOCK_PREXLOCK` while the `xlock` count is non-zero, and the MDS no longer crashes (see the sketch after the metadata fields below).
Story Points: ---
Clone Of:
Environment:
Last Closed: 2024-02-08 16:56:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
Embargoed:
Bug Depends On:    
Bug Blocks: 2258797    
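
The Doc Text above describes the fix only in prose. The following is a minimal C++ sketch of the kind of transition check it describes, under the assumption that the relevant state lives in a per-lock object with an xlock counter; every name here (`LockModel`, `num_xlocks`, `set_state`, `allow_with_xlocks`) is hypothetical, and this is not the upstream Ceph MDS code.

    // Hypothetical model of the behaviour change described in the Doc Text above.
    // NOT the upstream Ceph MDS source; names and structure are illustrative only.
    #include <cassert>

    enum LockState { LOCK_SYNC, LOCK_LOCK, LOCK_PREXLOCK, LOCK_XLOCK };

    struct LockModel {
        LockState state = LOCK_LOCK;
        int num_xlocks = 0;  // outstanding exclusive locks on this metadata object

        // Pre-fix behaviour (conceptually): moving the lock to SYNC or PREXLOCK
        // while xlocks were still counted tripped an assertion and killed the MDS.
        // Post-fix behaviour: the transition is tolerated.
        void set_state(LockState next, bool allow_with_xlocks) {
            if ((next == LOCK_SYNC || next == LOCK_PREXLOCK) && num_xlocks > 0)
                assert(allow_with_xlocks && "pre-fix model: abort here");
            state = next;
        }
    };

    int main() {
        LockModel l;
        l.num_xlocks = 1;              // e.g. a journal flush completes while an xlock is held
        l.set_state(LOCK_SYNC, true);  // post-fix model: allowed, the MDS keeps running
        // l.set_state(LOCK_SYNC, false);  // pre-fix model: would assert, mirroring the crash
        return 0;
    }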

Description Lijo Stephen Thomas 2023-11-09 08:10:56 UTC
Description of problem (please be as detailed as possible and provide log snippets):
---------------------------------------------------------------------------------
MDS pods are crashing frequently with ceph_assert(state == LOCK_XLOCK || state == LOCK_XLOCKDONE || state == LOCK_XLOCKSNAP || state == LOCK_LOCK_XLOCK || state == LOCK_LOCK || is_locallock()), and the crashes have been observed every day since Oct 16, 2023.

This BZ is created to track this issue downstream.
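
For orientation, the `ceph_assert` quoted above is a guard on the MDS lock state machine that fires when an xlock is released from an unexpected state. The sketch below is a stand-alone model of that guard, not the upstream `SimpleLock` implementation; `SimpleLockModel`, `get_xlock`, `put_xlock`, and `num_xlocks` are assumed names used only for illustration.

    // Stand-alone model of the release-side guard that the crashing MDS pods hit.
    // Illustrative only; NOT the upstream Ceph SimpleLock implementation.
    #include <cassert>

    enum LockState {
        LOCK_SYNC, LOCK_LOCK, LOCK_PREXLOCK, LOCK_XLOCK,
        LOCK_XLOCKDONE, LOCK_XLOCKSNAP, LOCK_LOCK_XLOCK
    };

    struct SimpleLockModel {
        LockState state = LOCK_LOCK;
        int num_xlocks = 0;
        bool is_locallock() const { return false; }  // placeholder for the real predicate

        void get_xlock() { ++num_xlocks; state = LOCK_XLOCK; }

        // Releasing an xlock is only legal from a small set of states; anything
        // else aborts the process, which is the daily crash reported above.
        void put_xlock() {
            assert(state == LOCK_XLOCK || state == LOCK_XLOCKDONE ||
                   state == LOCK_XLOCKSNAP || state == LOCK_LOCK_XLOCK ||
                   state == LOCK_LOCK || is_locallock());
            --num_xlocks;
        }
    };

    int main() {
        SimpleLockModel l;
        l.get_xlock();
        // l.state = LOCK_SYNC;  // flipping the state while an xlock is still counted
                                 // would trip the guard in put_xlock(), mirroring the crash
        l.put_xlock();           // legal here: state is LOCK_XLOCK
        return 0;
    }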

Version of all relevant components (if applicable):
---------------------------------------------------
RHCS - 16.2.10-187.el8cp  / RHODF 4.11


Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
----------------------------------------------------------------------------------------------------------------------------

Is there any workaround available to the best of your knowledge?
----------------------------------------------------------------
N/A

Is this issue reproducible?
----------------------------
No, but it is present in the customer environment.


Can this issue be reproduced from the UI?
-------------------------------------
No

Additional info:
----------------
Upstream trackers: https://tracker.ceph.com/issues/44565
Backport trackers:
quincy - https://tracker.ceph.com/issues/62522
pacific - https://tracker.ceph.com/issues/62523

Comment 25 Greg Farnum 2023-12-20 04:05:01 UTC
*** Bug 2228251 has been marked as a duplicate of this bug. ***

Comment 34 errata-xmlrpc 2024-02-08 16:56:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.3 Security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:0745