Bug 2130090

Summary: [cee/sd][cephFS] clients failing to advance oldest client/flush tid
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Janmejay Singh <jansingh>
Component: CephFS
Assignee: Venky Shankar <vshankar>
Status: CLOSED DEFERRED
QA Contact: Hemanth Kumar <hyelloji>
Severity: urgent
Priority: unspecified
Version: 4.2
CC: assingh, ceph-eng-bugs, cephqe-warriors, gfarnum, gjose, jcoscia, jcrumple, kjosy, lithomas, mcaldeir, mmuench, peli, rrajaram, rraja, sbaldwin, srengan, vshankar, vumrao
Target Milestone: ---
Target Release: 6.1
Hardware: x86_64
OS: Linux
Last Closed: 2023-02-13 09:28:23 UTC
Type: Bug
Bug Depends On: 2134709

Comment 2 Venky Shankar 2022-09-27 09:52:12 UTC
It does look like this warning shows up when the MDS is behind on trimming (which also shows up in the cluster logs and in `ceph status`).
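
For reference, a quick way to confirm this from the monitor side (a rough sketch; the exact health code and wording may vary by release, but the trim backlog typically surfaces as MDS_TRIM and the stuck clients as MDS_CLIENT_OLDEST_TID):

    # Check overall cluster health and the detailed warning text
    ceph status
    ceph health detail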

Looking into this now...

Comment 3 Venky Shankar 2022-09-27 12:28:02 UTC
The MDS does not seem to be trimming client completed requests. For client.10472234, `completed_requests` is 434149, which exceeds the threshold and thereby generates this warning.
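
For anyone following along, the per-session counters can be inspected on the active MDS (a sketch; the daemon name is a placeholder and field names can differ slightly between releases). The threshold in question is governed by the MDS option mds_max_completed_requests (100000 by default, as far as I recall):

    # List client sessions and check each session's completed_requests counter
    ceph tell mds.<active-mds> session ls

    # Show the configured trim threshold
    ceph config get mds mds_max_completed_requests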

I cannot find any MDS log in the sosreport. Could someone point me at it? Also, it would help if MDS debug logs could be captured (debug_mds = 20, for ~5-10 minutes) and shared, with the log level reset once the capture is done.
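
In case it helps whoever gathers this, one way to toggle the level centrally through the config database (a sketch; keep an eye on disk usage, since level-20 MDS logs grow very quickly):

    # Raise MDS debug logging
    ceph config set mds debug_mds 20

    # ... let it run for ~5-10 minutes while the warning is present ...

    # Reset to the default afterwards
    ceph config rm mds debug_mds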

Comment 21 Greg Farnum 2022-09-29 00:08:31 UTC
It looks like the customer still hasn't prevented the SELinux relabeling from occurring, right?
That's still a good guess for what may be causing issues.

But if not, the other thing we can have them do is get manager logs of the embedded CephFS client (debug_client = 20, debug_ms = 10) and MDS logs (debug_mds = 20, debug_ms = 10) while the incident is happening. (I'm not sure how to collect those through OpenShift/Rook.) The logs will be large and this may be intrusive, so we should give Venky a chance to find out more first, and see whether things improve once they turn off the SELinux relabeling. But at least it sounds like the problem reliably occurs whenever the manager fails over.
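
If the config database route is reachable through Rook, the settings could be applied roughly like this (a sketch only; the daemon targets are assumptions on my part and I have not verified how Rook exposes this):

    # Embedded CephFS client inside ceph-mgr
    ceph config set mgr debug_client 20
    ceph config set mgr debug_ms 10

    # MDS side
    ceph config set mds debug_mds 20
    ceph config set mds debug_ms 10

    # Reset everything once the logs are captured
    ceph config rm mgr debug_client
    ceph config rm mgr debug_ms
    ceph config rm mds debug_mds
    ceph config rm mds debug_ms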