Versions:
mcg-operator.v4.13.0-rhodf
odf-operator.v4.13.0-rhodf
ocs-operator.v4.13.0-rhodf
OCP: 4.13.0

The cluster had been running for 51 days with no issues. Today I was checking the API performance dashboard with a 2-week period selected... apparently that is a resource-intensive operation.

oc get pod -A|grep -v Run|grep -v Comple
NAMESPACE           NAME                              READY   STATUS             RESTARTS         AGE
openshift-storage   rook-ceph-osd-1-88fc6f54d-xxfzt   1/2     CrashLoopBackOff   20 (4m43s ago)   85m

oc logs -n openshift-storage rook-ceph-osd-1-88fc6f54d-xxfzt|grep FAIL
Defaulted container "osd" out of: osd, log-collector, blkdevmapper (init), activate (init), expand-bluefs (init), chown-container-data-dir (init)
/builddir/build/BUILD/ceph-17.2.6/src/osd/osd_types.h: 4882: FAILED ceph_assert(it != missing.end())
/builddir/build/BUILD/ceph-17.2.6/src/osd/osd_types.h: 4882: FAILED ceph_assert(it != missing.end())
/builddir/build/BUILD/ceph-17.2.6/src/osd/osd_types.h: 4882: FAILED ceph_assert(it != missing.end())
/builddir/build/BUILD/ceph-17.2.6/src/osd/osd_types.h: 4882: FAILED ceph_assert(it != missing.end())
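For anyone hitting the same assert before the OSD is restarted, this is roughly how more context could be pulled out of Ceph itself. This is a sketch, not a verified procedure: it assumes the Ceph tools pod can be enabled in openshift-storage (the ocsinitialization patch is the documented ODF approach, but check the docs for your exact version), and <crash-id> is a placeholder for an id taken from "ceph crash ls".

# Enable the Ceph toolbox pod (ODF; verify the patch against your version's docs)
oc patch ocsinitialization ocsinit -n openshift-storage --type json \
  --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'

# Shell into the tools pod
oc rsh -n openshift-storage $(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name | head -n1)

# Inside the tools pod: overall health plus any recorded crash backtraces
ceph status
ceph health detail
ceph crash ls                # recent daemon crashes, should include the OSD assert
ceph crash info <crash-id>   # <crash-id> is a placeholder from "ceph crash ls"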
Note: After I rebooted all 3 nodes in this compact cluster (3 control plane nodes, 0 workers), the issue did not reproduce.
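For anyone verifying the same thing after a reboot, a quick sanity check could look like this (again a sketch, assuming the tools pod mentioned above):

oc get pods -n openshift-storage | grep rook-ceph-osd   # all OSD pods should show 2/2 Running with no new restarts

# Inside the tools pod:
ceph osd tree    # every OSD reported as "up"
ceph status      # expect HEALTH_OK with all PGs active+clean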
Since the issue has been resolved, I don't think this is urgent. It still seems wise to leave this open until someone is available to look at the must-gather for any clear error indications. This may simply have been a random case of a memory block becoming corrupt in RAM or on disk.
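When someone does get to the must-gather, the assert should be straightforward to locate by grepping the collected OSD logs. The paths below are illustrative only; the actual directory layout depends on the must-gather image and version used:

# Illustrative: search the collected logs for the assert seen in the OSD pod
grep -rn "FAILED ceph_assert(it != missing.end())" must-gather.local.*/
grep -rn "osd_types.h: 4882" must-gather.local.*/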