This bug was initially created as a copy of Bug #2239951.

I am copying this bug because:

Description of problem:

- Initially the attached case was submitted because the customer was seeing the following error:

  "MDS_CLIENT_OLDEST_TID: 15 clients failing to advance oldest client/flush tid"

- The customer then took action to update the cluster. In the customer's own words:

~~~
Around 3:40am Eastern today (09/15) some of our MDSs failed, standbys took over, and we were left with insufficient standbys. The missing standbys were identified and restarted via systemctl, after which the "failing to advance TID" errors went away.

This cluster has MDS issues, and they have gotten worse since upgrading to 6.1. If you manually fail one MDS it will take multiple others with it (this was happening under 5.x as well). Under 6.1, when this happens the standbys take over and the failed MDSs stay in some kind of hung state. These hung MDSs will not show up as standbys until they are bounced via systemctl ("ceph orch daemon restart mds.instance" does not work in this state; only "systemctl restart" works).
~~~

- It's important to note that this cluster has another BZ/hotfix pending resolution/deployment:
  https://bugzilla.redhat.com/show_bug.cgi?id=2228635

- Please also note that on 09/18 client jobs were failing due to MDS issues, but most of the errors at that time concerned being behind on trimming, blocked ops, and clients failing to release caps; mds.5 also reported "failed to authpin, dir is being fragmented".

- The primary issue for this particular bug is to address the following errors (a hedged command sketch for identifying the flagged client follows at the end of this comment):

HEALTH_WARN 2 clients failing to advance oldest client/flush tid
[WRN] MDS_CLIENT_OLDEST_TID: 2 clients failing to advance oldest client/flush tid
    mds.root.host11.emjjsf(mds.1): Client client75:pthnrt failing to advance its oldest client/flush tid. client_id: 69394526
    mds.root.host10.fckajv(mds.4): Client client75:pthnrt failing to advance its oldest client/flush tid. client_id: 69394526

- I'm fully aware that this could indeed be related to https://bugzilla.redhat.com/show_bug.cgi?id=2228635, but I'm requesting confirmation of that from engineering on this BZ, given that the customer is experiencing frequent problems with the MDS.

Version-Release number of selected component (if applicable):

How reproducible:
Only reproducible in the customer's environment

Steps to Reproduce:
1.
2.
3.

Actual results:
"MDS_CLIENT_OLDEST_TID" warnings occur

Expected results:
"MDS_CLIENT_OLDEST_TID" warnings do not occur

Additional info:

- The customer has uploaded sosreports from the failing clients and the Ceph mon node. They have also uploaded MDS debug logs as attachment "0100-ceph-mds.debug.logs.all.9.mds.instances.tar.gz".

- Here's the MDS layout before mds.4 was failed over and the debug mds logs were collected:
====
RANK  STATE   MDS                 ACTIVITY        DNS    INOS   DIRS   CAPS
 0    active  root.host13.kpbxhd  Reqs:  133 /s   426k   424k  10.6k   142k
 1    active  root.host11.emjjsf  Reqs:   47 /s  8899k  8897k  10.9k  1668k
 2    active  root.host12.fzjadk  Reqs:   37 /s  19.4M  19.3M  35.8k  1067k
 3    active  root.host6.gxtzai   Reqs:   42 /s  19.1M  19.1M  93.7k   960k
 4    active  root.host10.fckajv  Reqs: 1257 /s  17.7M  17.7M  2173k  6024k
 5    active  root.host7.vainnz   Reqs:  289 /s  14.2M  14.2M  64.7k  2506k

- Please let me know if you require any further information.

Thanks,
Brandon
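For triage convenience, here is a minimal sketch of commands that can be used to confirm which client session is behind the MDS_CLIENT_OLDEST_TID warning. This is an illustrative example, not what the customer ran: it assumes the filesystem is named "root" (inferred from the mds.root.* daemon names) and reuses the MDS name and client id from the health output above; exact session-dump field names can vary by Ceph release.

~~~
# Full health detail, including the per-MDS MDS_CLIENT_OLDEST_TID lines
ceph health detail

# MDS layout (source of the RANK/STATE table above); "root" is the assumed fs name
ceph fs status root

# Dump sessions on the MDS that reported the warning and locate the flagged
# client id (69394526); a session whose completed-request count keeps growing
# is one that is not advancing its oldest client/flush tid
ceph tell mds.root.host11.emjjsf session ls

# Optional and disruptive: evict the offending client session if it cannot be
# remounted/fixed on the client side
ceph tell mds.root.host11.emjjsf client evict id=69394526
~~~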
Please specify the severity of this bug. Severity is defined here: https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.