Good afternoon, I have an odd situation here, the first I've seen where everything looks the way it does. The customer's ODF seems healthy: all deployments and pods are up, all PVs/PVCs are bound, and BOTH mds pods are up and running. Basically, ODF looks almost perfect, however...

Description of problem (please be as detailed as possible and provide log snippets):

Ceph is currently in HEALTH_ERR, with ceph status showing the following:

sh-4.4$ ceph -s
  cluster:
    id:     <omitted>
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum d,e,f (age 12d)
    mgr: a(active, since 29h)
    mds: 0/1 daemons up, 2 standby
    osd: 3 osds: 3 up (since 12d), 3 in (since 12d)

  data:
    volumes: 0/1 healthy, 1 recovering; 1 damaged
    pools:   11 pools, 177 pgs
    objects: 58.35k objects, 85 GiB
    usage:   260 GiB used, 1.2 TiB / 1.5 TiB avail
    pgs:     177 active+clean

sh-4.4$ ceph health detail
HEALTH_ERR 1 filesystem is degraded; 1 filesystem is offline; 1 mds daemon damaged; 1 daemons have recently crashed
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ocs-storagecluster-cephfilesystem is degraded
[ERR] MDS_ALL_DOWN: 1 filesystem is offline
    fs ocs-storagecluster-cephfilesystem is offline because no MDS is active for it.
[ERR] MDS_DAMAGE: 1 mds daemon damaged
    fs ocs-storagecluster-cephfilesystem mds.0 is damaged
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    mds.ocs-storagecluster-cephfilesystem-b crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5896fbb5k7r8z at 2023-02-10T03:40:44.301123Z

Version of all relevant components (if applicable):

OCP:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.45   True        False         21d     Cluster version is 4.10.45

ODF:
NAME                              DISPLAY                       VERSION   REPLACES                          PHASE
mcg-operator.v4.10.9              NooBaa Operator               4.10.9    mcg-operator.v4.9.13              Succeeded
ocs-operator.v4.10.9              OpenShift Container Storage   4.10.9    ocs-operator.v4.9.13              Succeeded
odf-csi-addons-operator.v4.10.9   CSI Addons                    4.10.9    odf-csi-addons-operator.v4.10.8   Succeeded
odf-operator.v4.10.9              OpenShift Data Foundation     4.10.9    odf-operator.v4.9.13              Succeeded

$ ceph versions
{
    "mon": {
        "ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 3
    },
    "mds": {
        "ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 2
    },
    "overall": {
        "ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 9
    }
}

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, the customer cannot access any PVCs backed by CephFS.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
4

Is this issue reproducible?
No

Can this issue be reproduced from the UI?
No
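For anyone picking this up, the damaged rank and the recent crash above can be re-checked from the rook-ceph-tools pod with something like the following (a sketch; the crash ID is a placeholder for whatever ceph crash ls returns):

sh-4.4$ ceph fs status ocs-storagecluster-cephfilesystem   # shows rank 0 marked damaged and the two standbys
sh-4.4$ ceph crash ls                                      # lists the crash IDs known to the cluster
sh-4.4$ ceph crash info <crash_id>                         # full metadata and backtrace for a given crash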
Additional info:

We've tried the following:
- Restarted ceph-mgr
- Marked the filesystem repaired
- Collected debug logs (since the mds pods are not crashing, nothing of importance was in them)
- Restarted the mds pods
- Scaled the ocs/rook-ceph operators along with the mds deployments down, then back up (see the sketch after this list)
- Ran the following command:
  $ ceph fs set ocs-storagecluster-cephfilesystem max_mds 1
- Collected the output of dump_ops_in_flight, which only yielded this error:
  ERROR: (38) Function not implemented

I spoke with Michael Kidd and Greg Farnum, who agreed it is odd that the mds pods aren't crashing and are in Running status, yet Ceph is in this state. I asked the customer for a detailed description of what preceded this issue (upgrade, machine outage, etc.). I will upload the logs along with the customer's response soon.
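For completeness, here is a sketch of the scale-down/repair/scale-up sequence referenced above, assuming the default openshift-storage namespace and the usual ODF deployment names; the "marked filesystem repaired" step is assumed to be ceph mds repaired on rank 0, and the exact commands actually run may have differed:

$ oc -n openshift-storage scale deployment ocs-operator --replicas=0
$ oc -n openshift-storage scale deployment rook-ceph-operator --replicas=0
$ oc -n openshift-storage scale deployment rook-ceph-mds-ocs-storagecluster-cephfilesystem-a --replicas=0
$ oc -n openshift-storage scale deployment rook-ceph-mds-ocs-storagecluster-cephfilesystem-b --replicas=0

# from the rook-ceph-tools pod: clear the damaged flag on rank 0 and keep a single active rank
sh-4.4$ ceph mds repaired ocs-storagecluster-cephfilesystem:0
sh-4.4$ ceph fs set ocs-storagecluster-cephfilesystem max_mds 1

# scale the mds deployments and the operators back up and let them reconcile
$ oc -n openshift-storage scale deployment rook-ceph-mds-ocs-storagecluster-cephfilesystem-a --replicas=1
$ oc -n openshift-storage scale deployment rook-ceph-mds-ocs-storagecluster-cephfilesystem-b --replicas=1
$ oc -n openshift-storage scale deployment rook-ceph-operator --replicas=1
$ oc -n openshift-storage scale deployment ocs-operator --replicas=1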
This is the crash backtrace (at least one of the crashes):

{
    "crash_id": "2023-02-01T05:44:04.993341Z_ff9d8815-10e1-4304-90fd-32d91f7bbcdb",
    "timestamp": "2023-02-01T05:44:04.993341Z",
    "process_name": "ceph-mon",
    "entity_name": "mon.b",
    "ceph_version": "16.2.0-152.el8cp",
    "utsname_hostname": "rook-ceph-mon-b-549b6df65f-2s5dw",
    "utsname_sysname": "Linux",
    "utsname_release": "4.18.0-305.72.1.el8_4.x86_64",
    "utsname_version": "#1 SMP Thu Nov 17 09:15:11 EST 2022",
    "utsname_machine": "x86_64",
    "os_name": "Red Hat Enterprise Linux",
    "os_id": "rhel",
    "os_version_id": "8.5",
    "os_version": "8.5 (Ootpa)",
    "assert_condition": "fs->mds_map.compat.compare(compat) == 0",
    "assert_func": "void FSMap::sanity() const",
    "assert_file": "/builddir/build/BUILD/ceph-16.2.0/src/mds/FSMap.cc",
    "assert_line": 857,
    "assert_thread_name": "ceph-mon",
    "assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/mds/FSMap.cc: In function 'void FSMap::sanity() const' thread 7f11298e8700 time 2023-02-01T05:44:04.989175+0000\n/builddir/build/BUILD/ceph-16.2.0/src/mds/FSMap.cc: 857: FAILED ceph_assert(fs->mds_map.compat.compare(compat) == 0)\n",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12c20) [0x7f111e577c20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f1120a7abb1]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x274d7a) [0x7f1120a7ad7a]",
        "(FSMap::sanity() const+0xcd) [0x7f1120fbd9dd]",
        "(MDSMonitor::update_from_paxos(bool*)+0x378) [0x55c6dacfe838]",
        "(PaxosService::refresh(bool*)+0x10e) [0x55c6dac1fc9e]",
        "(Monitor::refresh_from_paxos(bool*)+0x18c) [0x55c6daad147c]",
        "(Monitor::init_paxos()+0x10c) [0x55c6daad178c]",
        "(Monitor::preinit()+0xd30) [0x55c6daafec40]",
        "main()",
        "__libc_start_main()",
        "_start()"
    ]
}

This does ring a bell; I've seen it before (probably during an upgrade). Checking...
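Two things stand out in that crash: it comes from ceph-mon (mon.b) hitting the FSMap::sanity() compat assert, not from an MDS, and it was recorded on 2023-02-01 with ceph_version 16.2.0-152.el8cp while the cluster now reports 16.2.7-126.el8cp, which fits the upgrade theory. To compare the compat flags the FSMap carries against what the daemons report, something like the following from the toolbox should help (a sketch; the grep is only there to narrow the output, and exact command availability can vary by build):

sh-4.4$ ceph fs dump | grep -i compat      # compat/incompat flags recorded in the FSMap for each filesystem
sh-4.4$ ceph mds compat show               # default mdsmap compat flags
sh-4.4$ ceph mds metadata                  # per-daemon metadata (ceph_version, hostname) for the standbys
sh-4.4$ ceph versions                      # confirm every daemon is now on the same build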