Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
This project is now read-only. As of Monday, February 2, please use https://ibm-ceph.atlassian.net/ for all bug tracking.

Bug 2071085

Summary: RHCS5 - MDS_CLIENT_RECALL: clients failing to respond to cache pressure
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: CephFS
Version: 5.0
Target Release: 6.1
Hardware: Unspecified
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: unspecified
Reporter: George Law <glaw>
Assignee: Kotresh HR <khiremat>
QA Contact: Yogesh Mane <ymane>
CC: assingh, ceph-eng-bugs, gfarnum, gjose, khiremat, nojha, pdonnell, rfriedma, sbaldwin, vereddy, vshankar, xiubli
Flags: khiremat: needinfo-
Fixed In Version:
Last Closed: 2023-01-03 15:49:51 UTC
Type: Bug
Bug Depends On: 2108656
Bug Blocks:

Comment 2 Venky Shankar 2022-04-04 14:37:17 UTC
Ramana - please take a look.

Comment 56 George Law 2022-05-17 13:36:00 UTC
Kotresh,

I'll ask for the full sosreports from both of these nodes.

Note: the event the CU mentioned was at 23:31, not 3:50 AM.

May 15 23:31:18 lwtxe04kpapd1i ceph-1f483d8e-8469-11ec-8e59-d0bf9cf275c8-mds-root-lwtxe04kpapd1i-dqrnxn[3712153]: debug 2022-05-16T04:31:18.286+0000 7f84fb568700  1 mds.root.lwtxe04kpapd1i.dqrnxn Map removed me [mds.root.lwtxe04kpapd1i.dqrnxn{0:1133202} state up:standby-replay seq 1 join_fscid=1 addr [v2:171.176.38.198:6800/3401903807,v1:171.176.38.198:6801/3401903807] compat {c=[1],r=[1],i=[7ff]}] from cluster; respawning! See cluster/monitor logs for details.
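The journald prefix on these lines is in the host's local time, while the embedded ceph debug timestamp is UTC, which is why the same event appears as both 23:31 and 04:31. A minimal sketch, assuming the host is in US Central Daylight Time (UTC-5, consistent with the 5-hour gap above), confirms the two stamps describe the same instant:

```python
from datetime import datetime, timedelta, timezone

# journald prefix: host local time (assumed US Central Daylight Time, UTC-5)
local = datetime(2022, 5, 15, 23, 31, 18, tzinfo=timezone(timedelta(hours=-5)))
# ceph debug timestamp from the same log line (UTC)
utc = datetime(2022, 5, 16, 4, 31, 18, tzinfo=timezone.utc)

# timezone-aware datetimes compare by instant, not wall-clock value
print(local == utc)  # True: 23:31 local and 04:31 UTC are the same event
```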


May 15 23:31:17 lwtxe04kpapd1i ceph-1f483d8e-8469-11ec-8e59-d0bf9cf275c8-mon-lwtxe04kpapd1i[124408]: cluster 2022-05-16T04:31:16.215532+0000 mgr.lwtxe04hpapd1i.qifige (mgr.1044249) 79907 : cluster [DBG] pgmap v80007: 4161 pgs: 23 active+clean+scrubbing+deep, 4138 active+clean; 22 TiB data, 67 TiB used, 117 TiB / 183 TiB avail; 8.6 MiB/s rd, 982 KiB/s wr, 298 op/s
May 15 23:31:17 lwtxe04kpapd1i ceph-1f483d8e-8469-11ec-8e59-d0bf9cf275c8-mon-lwtxe04kpapd1i[124408]: debug 2022-05-16T04:31:17.616+0000 7f35ac1ff700  1 mon.lwtxe04kpapd1i@3(peon).osd e22029 e22029: 168 total, 168 up, 168 in
May 15 23:31:17 lwtxe04kpapd1i ceph-1f483d8e-8469-11ec-8e59-d0bf9cf275c8-mon-lwtxe04kpapd1i[124408]: debug 2022-05-16T04:31:17.617+0000 7f35ada02700  1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7f35ada02700' had timed out after 0.000000000s
May 15 23:31:17 lwtxe04kpapd1i ceph-1f483d8e-8469-11ec-8e59-d0bf9cf275c8-mon-lwtxe04kpapd1i[124408]: debug 2022-05-16T04:31:17.617+0000 7f35ad201700  1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7f35ad201700' had timed out after 0.000000000s
May 15 23:31:17 lwtxe04kpapd1i ceph-1f483d8e-8469-11ec-8e59-d0bf9cf275c8-mon-lwtxe04kpapd1i[124408]: debug 2022-05-16T04:31:17.647+0000 7f35ae203700  1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7f35ae203700' had timed out after 0.000000000s

Note the timeouts reported after 0.000000000s, and that 23 PGs were deep scrubbing at that time.
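The mgr pgmap line above encodes the per-state PG counts inline. A small sketch (parsing the summary string quoted from the log, not live cluster output) pulls those counts out, which is a quick way to verify the 23 deep-scrubbing PGs against the 4161 total:

```python
import re

# pgmap summary copied from the mon log excerpt above
line = ("cluster [DBG] pgmap v80007: 4161 pgs: "
        "23 active+clean+scrubbing+deep, 4138 active+clean; "
        "22 TiB data, 67 TiB used, 117 TiB / 183 TiB avail")

# isolate the "<count> <state>, ..." segment between "pgs:" and the first ";"
segment = line.split("pgs:")[1].split(";")[0]

# each entry is "<count> <state-flags>", states joined with "+"
states = {s: int(n) for n, s in re.findall(r"(\d+) ([a-z+]+)", segment)}

print(states)
# {'active+clean+scrubbing+deep': 23, 'active+clean': 4138}
```

The per-state counts sum to the 4161 PGs reported at the head of the line, which is a useful sanity check when a pgmap line has been wrapped or truncated in a pasted log.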