Description of problem (please be as detailed as possible and provide log snippets):

The ceph-mds process generated the following crash on an ODF 4.17.0-114 cluster, and the CephCluster is in a HEALTH_WARN state. The crash happened in libc.so.6, so the backtrace does not provide detailed function names or locations beyond generic memory addresses.

sh-5.1$ ceph crash info 2024-10-05T06:10:52.621730Z_c7f22b45-d236-43ea-86a4-aa19b31c380a
{
    "backtrace": [
        "/lib64/libc.so.6(+0x3e6f0) [0x7f66b5bea6f0]",
        "[0x5579e7405330]"
    ],
    "ceph_version": "18.2.1-229.el9cp",
    "crash_id": "2024-10-05T06:10:52.621730Z_c7f22b45-d236-43ea-86a4-aa19b31c380a",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.4 (Plow)",
    "os_version_id": "9.4",
    "process_name": "ceph-mds",
    "stack_sig": "12c4f060cf8b59a0ebac25da63a7f5b2a2cf5b99f12a288248409824102b5615",
    "timestamp": "2024-10-05T06:10:52.621730Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6d77bb68ccw6s",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.37.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Fri Sep 13 12:41:50 EDT 2024"
}
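For reference, the commands below are a minimal sketch of how the crash record and the reason behind the HEALTH_WARN state can be inspected from the Rook toolbox; the openshift-storage namespace and the rook-ceph-tools deployment name are the ODF defaults and are assumptions here, not taken from the must-gather.

❯ oc -n openshift-storage rsh deployment/rook-ceph-tools
sh-5.1$ ceph health detail      # shows which warning is driving HEALTH_WARN (e.g. RECENT_CRASH)
sh-5.1$ ceph crash ls           # lists recorded crashes and their crash IDs
sh-5.1$ ceph crash info 2024-10-05T06:10:52.621730Z_c7f22b45-d236-43ea-86a4-aa19b31c380a
sh-5.1$ ceph crash archive-all  # acknowledges recorded crashes once they have been triaged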
rook-ceph-mds logs
-=-=-=-=-=-=-=-=-=-=
❯ ocs logs rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6d77bb68ccw6s
Defaulted container "mds" out of: mds, log-collector, chown-container-data-dir (init)
debug 2024-10-05T06:10:52.959+0000 7f528f474ac0 0 set uid:gid to 167:167 (ceph:ceph)
debug 2024-10-05T06:10:52.959+0000 7f528f474ac0 0 ceph version 18.2.1-229.el9cp (ef652b206f2487adfc86613646a4cac946f6b4e0) reef (stable), process ceph-mds, pid 151
debug 2024-10-05T06:10:52.959+0000 7f528f474ac0 1 main not setting numa affinity
debug 2024-10-05T06:10:52.959+0000 7f528f474ac0 0 pidfile_write: ignore empty --pid-file
starting mds.ocs-storagecluster-cephfilesystem-b at
debug 2024-10-05T06:10:52.970+0000 7f528ac08640 1 mds.ocs-storagecluster-cephfilesystem-b Updating MDS map to version 35 from mon.0
debug 2024-10-05T06:10:52.991+0000 7f528ac08640 1 mds.ocs-storagecluster-cephfilesystem-b Updating MDS map to version 36 from mon.0
debug 2024-10-05T06:10:52.991+0000 7f528ac08640 1 mds.ocs-storagecluster-cephfilesystem-b Monitors have assigned me to become a standby.
debug 2024-10-05T06:11:42.402+0000 7f528ac08640 1 mds.ocs-storagecluster-cephfilesystem-b Updating MDS map to version 41 from mon.0
debug 2024-10-05T06:11:42.403+0000 7f528ac08640 1 mds.0.0 handle_mds_map i am now mds.74241.0 replaying mds.0.0
debug 2024-10-05T06:11:42.403+0000 7f528ac08640 1 mds.0.0 handle_mds_map state change up:standby --> up:standby-replay
debug 2024-10-05T06:11:42.403+0000 7f528ac08640 1 mds.0.0 replay_start
debug 2024-10-05T06:11:42.403+0000 7f528ac08640 1 mds.0.0 waiting for osdmap 127 (which blocklists prior instance)
debug 2024-10-05T06:11:42.451+0000 7f5284bfc640 0 mds.0.cache creating system inode with ino:0x100
debug 2024-10-05T06:11:42.451+0000 7f5284bfc640 0 mds.0.cache creating system inode with ino:0x1
debug 2024-10-06T00:07:31.076+0000 7f528c40b640 -1 received signal: Hangup from (PID: 39092) UID: 0
debug 2024-10-06T00:07:31.080+0000 7f528c40b640 -1 received signal: Hangup from (PID: 39093) UID: 0
debug 2024-10-07T00:07:31.508+0000 7f528c40b640 -1 Fail to open '/proc/91114/cmdline' error = (2) No such file or directory
debug 2024-10-07T00:07:31.508+0000 7f528c40b640 -1 received signal: Hangup from <unknown> (PID: 91114) UID: 0
debug 2024-10-07T00:07:31.511+0000 7f528c40b640 -1 received signal: Hangup from (PID: 91115) UID: 0

Cephcluster is in HEALTH_WARN state
-=-=-=-=-=-=-=-=-=-=
❯ ocs get cephclusters.ceph.rook.io
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE     PHASE   MESSAGE                        HEALTH        EXTERNAL   FSID
ocs-storagecluster-cephcluster   /var/lib/rook     3          3d21h   Ready   Cluster created successfully   HEALTH_WARN              6b3f9622-7cbd-44b0-9991-4c75c6f9cf39

Version of all relevant components (if applicable): 4.17

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)? Y

Is there any workaround available to the best of your knowledge? N

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible? N

Can this issue be reproduced from the UI? N

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy an ODF 4.17.0-114 cluster.
2. Create 4 PVCs with the CephFS interface.
3. Attach each PVC to a pod and start an FIO workload from each pod (see the example sketch at the end of this report).
4. Wait 3-4 minutes.
5. Power off one worker node from vCenter and wait 120 seconds.
6. Power on the same worker node and wait until the node rejoins the cluster.

Actual results:
After power-on the worker node rejoined, but the CephCluster is showing a HEALTH_WARN state and the ceph-mds process has generated a crash.

Expected results:
When the node rejoins the cluster, all operations are expected to work.

Additional info:
Must Gather logs: https://ibm.box.com/s/vxanlqhr461m82gafl3984a3awtsrlso
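As a reference for steps 2-3 above, below is a hedged sketch of the kind of PVC and FIO invocation used in the reproduction; the storage class name (ocs-storagecluster-cephfs), PVC size, mount path, and FIO parameters are illustrative assumptions, not the exact workload from this test run.

❯ for i in 1 2 3 4; do cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-pvc-$i
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: ocs-storagecluster-cephfs   # assumed default ODF CephFS storage class
EOF
done

# Inside each pod that mounts one of the PVCs (the mount path is an assumption):
sh-5.1$ fio --name=cephfs-io --directory=/mnt/cephfs --rw=randrw --bs=4k \
            --size=1G --numjobs=4 --runtime=300 --time_based --ioengine=libaio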
Moving the non-blocker BZs out of ODF-4.17.0. If this is a blocker BZ, please update the flag appropriately and propose it back to ODF-4.17.0 with a justification note.