Bug 2317236 - ceph-mds process reported crash on ODF 4.17 cluster [NEEDINFO]
Summary: ceph-mds process reported crash on ODF 4.17 cluster
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.17
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Milind Changire
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-10-08 14:28 UTC by Parag Kamble
Modified: 2024-10-18 15:21 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
vshankar: needinfo? (mchangir)
mchangir: needinfo? (pakamble)
mchangir: needinfo? (pakamble)
mchangir: needinfo? (pakamble)




Links
Red Hat Issue Tracker OCSBZM-9345 (last updated: 2024-10-08 14:29:42 UTC)

Description Parag Kamble 2024-10-08 14:28:59 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
The ceph-mds process generated the following crash on a 4.17.0-114 cluster, and the CephCluster is showing a HEALTH_WARN state.
Since the crash happened in libc.so.6, the backtrace does not provide function names or source locations beyond generic memory addresses.


sh-5.1$ ceph crash info 2024-10-05T06:10:52.621730Z_c7f22b45-d236-43ea-86a4-aa19b31c380a
{
    "backtrace": [
        "/lib64/libc.so.6(+0x3e6f0) [0x7f66b5bea6f0]",
        "[0x5579e7405330]"
    ],
    "ceph_version": "18.2.1-229.el9cp",
    "crash_id": "2024-10-05T06:10:52.621730Z_c7f22b45-d236-43ea-86a4-aa19b31c380a",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.4 (Plow)",
    "os_version_id": "9.4",
    "process_name": "ceph-mds",
    "stack_sig": "12c4f060cf8b59a0ebac25da63a7f5b2a2cf5b99f12a288248409824102b5615",
    "timestamp": "2024-10-05T06:10:52.621730Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6d77bb68ccw6s",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.37.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Fri Sep 13 12:41:50 EDT 2024"
}
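
For context, a daemon crash like this is normally surfaced by the cluster as a RECENT_CRASH health warning. A minimal sketch of the commands used to inspect it from a Ceph shell (as in the sh-5.1$ session above), assuming standard Ceph CLI behavior:

# List all recorded crashes; "ls-new" shows only the unarchived ones
ceph crash ls
ceph crash ls-new

# Show which warning(s) are driving HEALTH_WARN; a recent daemon crash
# typically shows up here as RECENT_CRASH
ceph health detail

# Full metadata for a specific crash, as captured above
ceph crash info 2024-10-05T06:10:52.621730Z_c7f22b45-d236-43ea-86a4-aa19b31c380a

# Archiving silences the RECENT_CRASH warning but does not address the
# underlying crash, so it is only appropriate after data collection:
# ceph crash archive-all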


rook-ceph-mds logs
-=-=-=-=-=-=-=-=-=-=
❯ ocs logs rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6d77bb68ccw6s
Defaulted container "mds" out of: mds, log-collector, chown-container-data-dir (init)
debug 2024-10-05T06:10:52.959+0000 7f528f474ac0  0 set uid:gid to 167:167 (ceph:ceph)
debug 2024-10-05T06:10:52.959+0000 7f528f474ac0  0 ceph version 18.2.1-229.el9cp (ef652b206f2487adfc86613646a4cac946f6b4e0) reef (stable), process ceph-mds, pid 151
debug 2024-10-05T06:10:52.959+0000 7f528f474ac0  1 main not setting numa affinity
debug 2024-10-05T06:10:52.959+0000 7f528f474ac0  0 pidfile_write: ignore empty --pid-file
starting mds.ocs-storagecluster-cephfilesystem-b at
debug 2024-10-05T06:10:52.970+0000 7f528ac08640  1 mds.ocs-storagecluster-cephfilesystem-b Updating MDS map to version 35 from mon.0
debug 2024-10-05T06:10:52.991+0000 7f528ac08640  1 mds.ocs-storagecluster-cephfilesystem-b Updating MDS map to version 36 from mon.0
debug 2024-10-05T06:10:52.991+0000 7f528ac08640  1 mds.ocs-storagecluster-cephfilesystem-b Monitors have assigned me to become a standby.
debug 2024-10-05T06:11:42.402+0000 7f528ac08640  1 mds.ocs-storagecluster-cephfilesystem-b Updating MDS map to version 41 from mon.0
debug 2024-10-05T06:11:42.403+0000 7f528ac08640  1 mds.0.0 handle_mds_map i am now mds.74241.0 replaying mds.0.0
debug 2024-10-05T06:11:42.403+0000 7f528ac08640  1 mds.0.0 handle_mds_map state change up:standby --> up:standby-replay
debug 2024-10-05T06:11:42.403+0000 7f528ac08640  1 mds.0.0 replay_start
debug 2024-10-05T06:11:42.403+0000 7f528ac08640  1 mds.0.0  waiting for osdmap 127 (which blocklists prior instance)
debug 2024-10-05T06:11:42.451+0000 7f5284bfc640  0 mds.0.cache creating system inode with ino:0x100
debug 2024-10-05T06:11:42.451+0000 7f5284bfc640  0 mds.0.cache creating system inode with ino:0x1
debug 2024-10-06T00:07:31.076+0000 7f528c40b640 -1 received  signal: Hangup from  (PID: 39092) UID: 0
debug 2024-10-06T00:07:31.080+0000 7f528c40b640 -1 received  signal: Hangup from  (PID: 39093) UID: 0
debug 2024-10-07T00:07:31.508+0000 7f528c40b640 -1 Fail to open '/proc/91114/cmdline' error = (2) No such file or directory
debug 2024-10-07T00:07:31.508+0000 7f528c40b640 -1 received  signal: Hangup from <unknown> (PID: 91114) UID: 0
debug 2024-10-07T00:07:31.511+0000 7f528c40b640 -1 received  signal: Hangup from  (PID: 91115) UID: 0
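
Assuming `ocs` above is a shell alias for `oc -n openshift-storage`, the equivalent plain oc invocations would be roughly the following; `--previous` is included because the crash timestamp (06:10:52.621Z) slightly predates the first line of the log shown (06:10:52.959Z), so the crashing instance may be the container's prior run:

# List the MDS pods (Rook typically labels them app=rook-ceph-mds)
oc -n openshift-storage get pods -l app=rook-ceph-mds

# Current mds container log
oc -n openshift-storage logs rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6d77bb68ccw6s -c mds

# Log of the previous container instance, if the pod restarted after the crash
oc -n openshift-storage logs rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6d77bb68ccw6s -c mds --previous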

Cephcluster is in HEALTH_WARN state
-=-=-=--=-=-=-=-=-=-=
❯ ocs get cephclusters.ceph.rook.io
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE     PHASE   MESSAGE                        HEALTH        EXTERNAL   FSID
ocs-storagecluster-cephcluster   /var/lib/rook     3          3d21h   Ready   Cluster created successfully   HEALTH_WARN              6b3f9622-7cbd-44b0-9991-4c75c6f9cf39
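
For completeness, a minimal sketch of how the Ceph-level health can be checked, assuming the rook-ceph-tools toolbox deployment is enabled in the openshift-storage namespace:

# Open a shell in the toolbox pod
oc -n openshift-storage rsh deploy/rook-ceph-tools

# Inside the toolbox: overall state and the reason for HEALTH_WARN
ceph status
ceph health detail

# MDS ranks and standby/standby-replay daemons for the filesystem
ceph fs status ocs-storagecluster-cephfilesystem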




Version of all relevant components (if applicable): 4.17


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)? Y


Is there any workaround available to the best of your knowledge? N


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)? 


Is this issue reproducible? N


Can this issue be reproduced from the UI? N


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy an ODF 4.17.0-114 cluster.
2. Create 4 PVCs with the CephFS interface.
3. Attach each PVC to a pod and start an FIO workload from each pod (see the fio sketch below).
4. Wait 3-4 minutes.
5. Power off one worker node from vCenter and wait 120 seconds.
6. Power on the same worker node and wait until the node rejoins the cluster.
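
The exact fio job parameters are not captured in this report; a representative sketch of the kind of workload started in step 3 (mount path, sizes, and runtime are all hypothetical) is:

# Hypothetical fio job run inside each pod against the CephFS-backed PVC mount
fio --name=cephfs-test \
    --directory=/mnt/cephfs \
    --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --size=1g \
    --numjobs=4 --time_based --runtime=300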

Actual results: After power-on, the worker node rejoined the cluster, but the CephCluster is showing a HEALTH_WARN state and the ceph-mds process has generated a crash.


Expected results: When the node rejoins the cluster, all operations are expected to work
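
A short verification sketch for the expected post-recovery state, assuming toolbox access as above:

# Expect HEALTH_OK and no new (unarchived) crash reports
ceph status
ceph crash ls-new

# Expect the filesystem to report its active MDS plus the standby-replay daemon
ceph fs status ocs-storagecluster-cephfilesystem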


Additional info:

Must Gather logs : https://ibm.box.com/s/vxanlqhr461m82gafl3984a3awtsrlso

Comment 3 Sunil Kumar Acharya 2024-10-10 13:16:15 UTC
Moving the non-blocker BZs out of ODF-4.17.0. If this is a blocker BZ, please update the flag appropriately and propose it back to ODF-4.17.0 with a justification note.

