Bug 2263219 - [7.0z1 backport] [GSS] "MDS_CLIENT_OLDEST_TID" Errors
Summary: [7.0z1 backport] [GSS] "MDS_CLIENT_OLDEST_TID" Errors
Keywords:
Status: CLOSED DUPLICATE of bug 2239951
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 7.0z3
Assignee: Xiubo Li
QA Contact: Hemanth Kumar
URL:
Whiteboard:
Depends On: 2239951
Blocks: 2257733
Reported: 2024-02-07 16:26 UTC by Bipin Kunal
Modified: 2024-09-19 08:33 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-09-19 08:33:14 UTC
Embargoed:




Links:
Red Hat Issue Tracker RHCEPH-8276 (last updated 2024-02-07 16:26:42 UTC)
Red Hat Knowledge Base (Solution) 6971376 (last updated 2024-04-30 10:38:27 UTC)

Description Bipin Kunal 2024-02-07 16:26:16 UTC
This bug was initially created as a copy of Bug #2239951

I am copying this bug because: 



Description of problem:

- The attached case was initially opened because the customer was experiencing the following error message:

"MDS_CLIENT_OLDEST_TID: 15 clients failing to advance oldest client/flush tid"

- The customer then took action on the cluster. In the customer's own words (the restart commands they mention are sketched after the quote):
~~~
Around 3:40am Eastern today (09/15) some of our mds's failed and standbys took over and we were left with insufficient standbys.
The missing standbys were identified and systemctl restarted.
After which the "failing to advance TID" errors went away.

This cluster has mds issues,  since upgrading to 6.1 it has got worse.
if you manually fail 1 mds it will take multiple others with it. (This was happening under 5.x as well)
Under 6.1 when this happens the standbys take over and the failed mds stay in some kind of hung state.
These hung mds will not show up as standbys until they are bounced via systemctl. (ceph orch daemon restart mds.instance  <-- this doesn't work in this state, only systemctl restart works)
~~~
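
For reference, the two restart paths the customer contrasts above would look roughly as follows. This is a sketch only: the fsid is a placeholder and the daemon name is just one example taken from this cluster's MDS list.

~~~
# Orchestrator-managed restart -- reportedly ineffective while the daemon is hung
ceph orch daemon restart mds.root.host10.fckajv

# Direct restart of the cephadm systemd unit on the MDS host --
# the path the customer reports actually recovers a hung daemon
systemctl restart ceph-<fsid>@mds.root.host10.fckajv.service
~~~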

- It's important to note that this cluster has another bz/hotfix pending resolution/deployment. That bz is located at: https://bugzilla.redhat.com/show_bug.cgi?id=2228635

- Please also note that on 09/18 client jobs were failing due to MDS issues, but most of the errors at that time pertained to MDSes behind on trimming, blocked ops, and clients failing to release caps, and they saw the following in mds.5: "failed to authpin, dir is being fragmented".

- The primary issue for this particular bug is to address the following errors (a sketch for correlating the reported client id follows the output):

HEALTH_WARN 2 clients failing to advance oldest client/flush tid
[WRN] MDS_CLIENT_OLDEST_TID: 2 clients failing to advance oldest client/flush tid
    mds.root.host11.emjjsf(mds.1): Client client75:pthnrt failing to advance its oldest client/flush tid.  client_id: 69394526
    mds.root.host10.fckajv(mds.4): Client client75:pthnrt failing to advance its oldest client/flush tid.  client_id: 69394526
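
The warning identifies the offending session by client id, so one common way to correlate that id with a specific client mount is to dump the sessions on the MDS ranks that report it. This is a rough sketch; the daemon names come from the output above, and the exact fields in the session dump vary by release.

~~~
# Show full health detail, including the per-MDS client ids
ceph health detail

# Dump client sessions on the reporting ranks and look for id 69394526
ceph tell mds.root.host11.emjjsf session ls
ceph tell mds.root.host10.fckajv session ls
~~~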

- I'm fully aware that this could indeed be related to https://bugzilla.redhat.com/show_bug.cgi?id=2228635, but I'm requesting confirmation of that from engineering on this bz, given that this customer is experiencing frequent problems with mds.

Version-Release number of selected component (if applicable):


How reproducible:
Only reproducible in customer's environment

Steps to Reproduce:
1.
2.
3.

Actual results:
"MDS_CLIENT_OLDEST_TID" occur

Expected results:
"MDS_CLIENT_OLDEST_TID" do not occur

Additional info:
- The customer has uploaded sosreports from the failing clients and the Ceph mon node. They've also uploaded mds debug logs at attachment "0100-ceph-mds.debug.logs.all.9.mds.instances.tar.gz". Here's the mds layout before mds.4 was failed over and debug mds logs were collected (a sketch of how such debug logs are typically enabled follows the table):

====
RANK  STATE              MDS                 ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active  root.host13.kpbxhd  Reqs:  133 /s   426k   424k  10.6k   142k
 1    active  root.host11.emjjsf  Reqs:   47 /s  8899k  8897k  10.9k  1668k
 2    active  root.host12.fzjadk  Reqs:   37 /s  19.4M  19.3M  35.8k  1067k
 3    active  root.host6.gxtzai  Reqs:   42 /s  19.1M  19.1M  93.7k   960k
 4    active  root.host10.fckajv  Reqs: 1257 /s  17.7M  17.7M  2173k  6024k
 5    active  root.host7.vainnz  Reqs:  289 /s  14.2M  14.2M  64.7k  2506k
====
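
For context, the layout above matches what "ceph fs status" reports. The debug mds logs referenced earlier are typically captured by raising the MDS debug levels before the failover and dropping the overrides afterwards; this is only a sketch, since the exact levels and scope used in this case are an assumption.

~~~
# Raise MDS debug logging (can also be scoped to a single daemon, e.g. mds.root.host10.fckajv)
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1

# ...reproduce / fail over the rank and collect the logs, then remove the overrides
ceph config rm mds debug_mds
ceph config rm mds debug_ms
~~~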

- Please let me know if you require any further information.

Thanks

-Brandon

Comment 1 RHEL Program Management 2024-02-07 16:26:33 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

