Bug 2263219 - [7.0z1 backport] [GSS] "MDS_CLIENT_OLDEST_TID" Errors
Summary: [7.0z1 backport] [GSS] "MDS_CLIENT_OLDEST_TID" Errors
Keywords:
Status: CLOSED DUPLICATE of bug 2239951
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 7.0z3
Assignee: Xiubo Li
QA Contact: Hemanth Kumar
URL:
Whiteboard:
Depends On: 2239951
Blocks: 2257733
Reported: 2024-02-07 16:26 UTC by Bipin Kunal
Modified: 2024-09-19 08:33 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-09-19 08:33:14 UTC
Embargoed:




Links:
Red Hat Issue Tracker RHCEPH-8276 (last updated 2024-02-07 16:26:42 UTC)
Red Hat Knowledge Base (Solution) 6971376 (last updated 2024-04-30 10:38:27 UTC)

Description Bipin Kunal 2024-02-07 16:26:16 UTC
This bug was initially created as a copy of Bug #2239951

I am copying this bug because: 



Description of problem:

- The attached case was initially opened because the customer was experiencing the following error message:

"MDS_CLIENT_OLDEST_TID: 15 clients failing to advance oldest client/flush tid"

- The customer then took action on the cluster. In the customer's own words (the restart commands they mention are sketched after the quote):
~~~
Around 3:40am Eastern today (09/15) some of our mds's failed and standbys took over and we were left with insufficient standbys.
The missing standbys were identified and systemctl restarted.
After which the "failing to advance TID" errors went away.

This cluster has mds issues,  since upgrading to 6.1 it has got worse.
if you manually fail 1 mds it will take multiple others with it. (This was happening under 5.x as well)
Under 6.1 when this happens the standbys take over and the failed mds stay in some kind of hung state.
These hung mds will not show up as standbys until they are bounced via systemctl. (ceph orch daemon restart mds.instance  <-- this doesn't work in this state, only systemctl restart works)
~~~
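
For reference, the two restart paths the customer contrasts above would look roughly as follows. This is a sketch only: the fsid is a placeholder and the daemon name is just one example taken from this cluster's MDS list.

~~~
# Orchestrator-managed restart -- reportedly ineffective while the daemon is hung
ceph orch daemon restart mds.root.host10.fckajv

# Direct restart of the cephadm systemd unit on the MDS host --
# the path the customer reports actually recovers a hung daemon
systemctl restart ceph-<fsid>@mds.root.host10.fckajv.service
~~~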

- It's important to note that this cluster has another bz/hotfix pending resolution/deployment. That bz is located at: https://bugzilla.redhat.com/show_bug.cgi?id=2228635

- Please also note that on 09/18 client jobs were failing due to MDS issues, but most of the errors at that time pertained to MDSes behind on trimming, blocked ops, and clients failing to release caps, and they saw the following in mds.5: "failed to authpin, dir is being fragmented".

- The primary issue for this particular bug is to address the following errors (a sketch for correlating the reported client id follows the output):

HEALTH_WARN 2 clients failing to advance oldest client/flush tid
[WRN] MDS_CLIENT_OLDEST_TID: 2 clients failing to advance oldest client/flush tid
    mds.root.host11.emjjsf(mds.1): Client client75:pthnrt failing to advance its oldest client/flush tid.  client_id: 69394526
    mds.root.host10.fckajv(mds.4): Client client75:pthnrt failing to advance its oldest client/flush tid.  client_id: 69394526
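
The warning identifies the offending session by client id, so one common way to correlate that id with a specific client mount is to dump the sessions on the MDS ranks that report it. This is a rough sketch; the daemon names come from the output above, and the exact fields in the session dump vary by release.

~~~
# Show full health detail, including the per-MDS client ids
ceph health detail

# Dump client sessions on the reporting ranks and look for id 69394526
ceph tell mds.root.host11.emjjsf session ls
ceph tell mds.root.host10.fckajv session ls
~~~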

- I'm fully aware that this could indeed be related to https://bugzilla.redhat.com/show_bug.cgi?id=2228635, but I'm requesting confirmation of that from engineering on this bz, given that this customer is experiencing frequent problems with mds.

Version-Release number of selected component (if applicable):


How reproducible:
Only reproducible in customer's environment

Steps to Reproduce:
1.
2.
3.

Actual results:
"MDS_CLIENT_OLDEST_TID" occur

Expected results:
"MDS_CLIENT_OLDEST_TID" do not occur

Additional info:
- The customer has uploaded sosreports from the failing clients and the Ceph mon node. They've also uploaded mds debug logs at attachment "0100-ceph-mds.debug.logs.all.9.mds.instances.tar.gz". Here's the mds layout before mds.4 was failed over and debug mds logs were collected (a sketch of how such debug logs are typically enabled follows the table):

====
RANK  STATE              MDS                 ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active  root.host13.kpbxhd  Reqs:  133 /s   426k   424k  10.6k   142k
 1    active  root.host11.emjjsf  Reqs:   47 /s  8899k  8897k  10.9k  1668k
 2    active  root.host12.fzjadk  Reqs:   37 /s  19.4M  19.3M  35.8k  1067k
 3    active  root.host6.gxtzai  Reqs:   42 /s  19.1M  19.1M  93.7k   960k
 4    active  root.host10.fckajv  Reqs: 1257 /s  17.7M  17.7M  2173k  6024k
 5    active  root.host7.vainnz  Reqs:  289 /s  14.2M  14.2M  64.7k  2506k
====
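
For context, the layout above matches what "ceph fs status" reports. The debug mds logs referenced earlier are typically captured by raising the MDS debug levels before the failover and dropping the overrides afterwards; this is only a sketch, since the exact levels and scope used in this case are an assumption.

~~~
# Raise MDS debug logging (can also be scoped to a single daemon, e.g. mds.root.host10.fckajv)
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1

# ...reproduce / fail over the rank and collect the logs, then remove the overrides
ceph config rm mds debug_mds
ceph config rm mds debug_ms
~~~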

- Please let me know if you require any further information.

Thanks

-Brandon

Comment 1 RHEL Program Management 2024-02-07 16:26:33 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

