Bug 2248825

Summary: [cee/sd][cephfs] mds pods are crashing with ceph_assert(state == LOCK_XLOCK || state == LOCK_XLOCKDONE || state == LOCK_XLOCKSNAP || state == LOCK_LOCK_XLOCK || state == LOCK_LOCK || is_locallock())
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Lijo Stephen Thomas <lithomas>
Component: CephFS
Assignee: Xiubo Li <xiubli>
Status: CLOSED ERRATA
QA Contact: Hemanth Kumar <hyelloji>
Severity: high
Docs Contact: Ranjini M N <rmandyam>
Priority: unspecified
Version: 6.1
CC: amark, bniver, ceph-eng-bugs, cephqe-warriors, etamir, hyelloji, jansingh, mcaldeir, muagarwa, pratshar, rmandyam, sburke, smulay, sostapov, tserlin, vereddy, vshankar, xiubli
Target Milestone: ---   
Target Release: 5.3z6   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: ceph-16.2.10-219.el8cp
Doc Type: Bug Fix
Doc Text:
.The MDS no longer crashes when the journal logs are flushed
Previously, when the journal logs were successfully flushed, the lockers' state could be set to `LOCK_SYNC` or `LOCK_PREXLOCK` while the `xlock` count was non-zero. The MDS did not allow that and would crash. With this fix, the MDS allows the lockers' state to be set to `LOCK_SYNC` or `LOCK_PREXLOCK` while the `xlock` count is non-zero, and the MDS no longer crashes (see the sketch after the metadata fields below).
Story Points: ---
Clone Of:
Environment:
Last Closed: 2024-02-08 16:56:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
Embargoed:
Bug Depends On:    
Bug Blocks: 2258797    
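
The Doc Text above describes the fix only in prose. The following is a minimal C++ sketch of the kind of transition check it describes, under the assumption that the relevant state lives in a per-lock object with an xlock counter; every name here (`LockModel`, `num_xlocks`, `set_state`, `allow_with_xlocks`) is hypothetical, and this is not the upstream Ceph MDS code.

    // Hypothetical model of the behaviour change described in the Doc Text above.
    // NOT the upstream Ceph MDS source; names and structure are illustrative only.
    #include <cassert>

    enum LockState { LOCK_SYNC, LOCK_LOCK, LOCK_PREXLOCK, LOCK_XLOCK };

    struct LockModel {
        LockState state = LOCK_LOCK;
        int num_xlocks = 0;  // outstanding exclusive locks on this metadata object

        // Pre-fix behaviour (conceptually): moving the lock to SYNC or PREXLOCK
        // while xlocks were still counted tripped an assertion and killed the MDS.
        // Post-fix behaviour: the transition is tolerated.
        void set_state(LockState next, bool allow_with_xlocks) {
            if ((next == LOCK_SYNC || next == LOCK_PREXLOCK) && num_xlocks > 0)
                assert(allow_with_xlocks && "pre-fix model: abort here");
            state = next;
        }
    };

    int main() {
        LockModel l;
        l.num_xlocks = 1;              // e.g. a journal flush completes while an xlock is held
        l.set_state(LOCK_SYNC, true);  // post-fix model: allowed, the MDS keeps running
        // l.set_state(LOCK_SYNC, false);  // pre-fix model: would assert, mirroring the crash
        return 0;
    }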

Description Lijo Stephen Thomas 2023-11-09 08:10:56 UTC
Description of problem (please be as detailed as possible and provide log snippets):
---------------------------------------------------------------------------------
MDS pods are crashing frequently with ceph_assert(state == LOCK_XLOCK || state == LOCK_XLOCKDONE || state == LOCK_XLOCKSNAP || state == LOCK_LOCK_XLOCK || state == LOCK_LOCK || is_locallock()), and the crashes have been observed every day since Oct 16, 2023.

This BZ is created to track this issue downstream.
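
For orientation, the `ceph_assert` quoted above is a guard on the MDS lock state machine that fires when an xlock is released from an unexpected state. The sketch below is a stand-alone model of that guard, not the upstream `SimpleLock` implementation; `SimpleLockModel`, `get_xlock`, `put_xlock`, and `num_xlocks` are assumed names used only for illustration.

    // Stand-alone model of the release-side guard that the crashing MDS pods hit.
    // Illustrative only; NOT the upstream Ceph SimpleLock implementation.
    #include <cassert>

    enum LockState {
        LOCK_SYNC, LOCK_LOCK, LOCK_PREXLOCK, LOCK_XLOCK,
        LOCK_XLOCKDONE, LOCK_XLOCKSNAP, LOCK_LOCK_XLOCK
    };

    struct SimpleLockModel {
        LockState state = LOCK_LOCK;
        int num_xlocks = 0;
        bool is_locallock() const { return false; }  // placeholder for the real predicate

        void get_xlock() { ++num_xlocks; state = LOCK_XLOCK; }

        // Releasing an xlock is only legal from a small set of states; anything
        // else aborts the process, which is the daily crash reported above.
        void put_xlock() {
            assert(state == LOCK_XLOCK || state == LOCK_XLOCKDONE ||
                   state == LOCK_XLOCKSNAP || state == LOCK_LOCK_XLOCK ||
                   state == LOCK_LOCK || is_locallock());
            --num_xlocks;
        }
    };

    int main() {
        SimpleLockModel l;
        l.get_xlock();
        // l.state = LOCK_SYNC;  // flipping the state while an xlock is still counted
                                 // would trip the guard in put_xlock(), mirroring the crash
        l.put_xlock();           // legal here: state is LOCK_XLOCK
        return 0;
    }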

Version of all relevant components (if applicable):
---------------------------------------------------
RHCS - 16.2.10-187.el8cp  / RHODF 4.11


Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
----------------------------------------------------------------------------------------------------------------------------

Is there any workaround available to the best of your knowledge?
----------------------------------------------------------------
N/A

Is this issue reproducible?
----------------------------
No, but it is present in the customer environment.


Can this issue be reproduced from the UI?
-------------------------------------
No

Additional info:
----------------
Upstream trackers: https://tracker.ceph.com/issues/44565
Backport trackers:
quincy - https://tracker.ceph.com/issues/62522
pacific - https://tracker.ceph.com/issues/62523

Comment 25 Greg Farnum 2023-12-20 04:05:01 UTC
*** Bug 2228251 has been marked as a duplicate of this bug. ***

Comment 34 errata-xmlrpc 2024-02-08 16:56:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.3 Security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:0745