Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
This project is now read-only. As of Monday, February 2, please use https://ibm-ceph.atlassian.net/ for all bug tracking.

Bug 2071085

Summary: RHCS5 - MDS_CLIENT_RECALL: clients failing to respond to cache pressure
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: CephFS
Version: 5.0
Target Release: 6.1
Hardware: Unspecified
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: unspecified
Reporter: George Law <glaw>
Assignee: Kotresh HR <khiremat>
QA Contact: Yogesh Mane <ymane>
CC: assingh, ceph-eng-bugs, gfarnum, gjose, khiremat, nojha, pdonnell, rfriedma, sbaldwin, vereddy, vshankar, xiubli
Flags: khiremat: needinfo-
Fixed In Version:
Last Closed: 2023-01-03 15:49:51 UTC
Type: Bug
Bug Depends On: 2108656
Bug Blocks:

Comment 2 Venky Shankar 2022-04-04 14:37:17 UTC
Ramana - please take a look.

Comment 56 George Law 2022-05-17 13:36:00 UTC
Kotresh,

I'll ask for the full sosreports from both of these nodes.

Note: the event the CU mentioned was at 23:31, not 3:50 AM.

May 15 23:31:18 lwtxe04kpapd1i ceph-1f483d8e-8469-11ec-8e59-d0bf9cf275c8-mds-root-lwtxe04kpapd1i-dqrnxn[3712153]: debug 2022-05-16T04:31:18.286+0000 7f84fb568700  1 mds.root.lwtxe04kpapd1i.dqrnxn Map removed me [mds.root.lwtxe04kpapd1i.dqrnxn{0:1133202} state up:standby-replay seq 1 join_fscid=1 addr [v2:171.176.38.198:6800/3401903807,v1:171.176.38.198:6801/3401903807] compat {c=[1],r=[1],i=[7ff]}] from cluster; respawning! See cluster/monitor logs for details.
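The journald prefix on these lines is in the host's local time, while the embedded ceph debug timestamp is UTC, which is why the same event appears as both 23:31 and 04:31. A minimal sketch, assuming the host is in US Central Daylight Time (UTC-5, consistent with the 5-hour gap above), confirms the two stamps describe the same instant:

```python
from datetime import datetime, timedelta, timezone

# journald prefix: host local time (assumed US Central Daylight Time, UTC-5)
local = datetime(2022, 5, 15, 23, 31, 18, tzinfo=timezone(timedelta(hours=-5)))
# ceph debug timestamp from the same log line (UTC)
utc = datetime(2022, 5, 16, 4, 31, 18, tzinfo=timezone.utc)

# timezone-aware datetimes compare by instant, not wall-clock value
print(local == utc)  # True: 23:31 local and 04:31 UTC are the same event
```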


May 15 23:31:17 lwtxe04kpapd1i ceph-1f483d8e-8469-11ec-8e59-d0bf9cf275c8-mon-lwtxe04kpapd1i[124408]: cluster 2022-05-16T04:31:16.215532+0000 mgr.lwtxe04hpapd1i.qifige (mgr.1044249) 79907 : cluster [DBG] pgmap v80007: 4161 pgs: 23 active+clean+scrubbing+deep, 4138 active+clean; 22 TiB data, 67 TiB used, 117 TiB / 183 TiB avail; 8.6 MiB/s rd, 982 KiB/s wr, 298 op/s
May 15 23:31:17 lwtxe04kpapd1i ceph-1f483d8e-8469-11ec-8e59-d0bf9cf275c8-mon-lwtxe04kpapd1i[124408]: debug 2022-05-16T04:31:17.616+0000 7f35ac1ff700  1 mon.lwtxe04kpapd1i@3(peon).osd e22029 e22029: 168 total, 168 up, 168 in
May 15 23:31:17 lwtxe04kpapd1i ceph-1f483d8e-8469-11ec-8e59-d0bf9cf275c8-mon-lwtxe04kpapd1i[124408]: debug 2022-05-16T04:31:17.617+0000 7f35ada02700  1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7f35ada02700' had timed out after 0.000000000s
May 15 23:31:17 lwtxe04kpapd1i ceph-1f483d8e-8469-11ec-8e59-d0bf9cf275c8-mon-lwtxe04kpapd1i[124408]: debug 2022-05-16T04:31:17.617+0000 7f35ad201700  1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7f35ad201700' had timed out after 0.000000000s
May 15 23:31:17 lwtxe04kpapd1i ceph-1f483d8e-8469-11ec-8e59-d0bf9cf275c8-mon-lwtxe04kpapd1i[124408]: debug 2022-05-16T04:31:17.647+0000 7f35ae203700  1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7f35ae203700' had timed out after 0.000000000s

Note the timeouts reported after 0.000000000s, and that 23 PGs were deep scrubbing at that time.
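The mgr pgmap line above encodes the per-state PG counts inline. A small sketch (parsing the summary string quoted from the log, not live cluster output) pulls those counts out, which is a quick way to verify the 23 deep-scrubbing PGs against the 4161 total:

```python
import re

# pgmap summary copied from the mon log excerpt above
line = ("cluster [DBG] pgmap v80007: 4161 pgs: "
        "23 active+clean+scrubbing+deep, 4138 active+clean; "
        "22 TiB data, 67 TiB used, 117 TiB / 183 TiB avail")

# isolate the "<count> <state>, ..." segment between "pgs:" and the first ";"
segment = line.split("pgs:")[1].split(";")[0]

# each entry is "<count> <state-flags>", states joined with "+"
states = {s: int(n) for n, s in re.findall(r"(\d+) ([a-z+]+)", segment)}

print(states)
# {'active+clean+scrubbing+deep': 23, 'active+clean': 4138}
```

The per-state counts sum to the 4161 PGs reported at the head of the line, which is a useful sanity check when a pgmap line has been wrapped or truncated in a pasted log.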