Description of problem (please be as detailed as possible and provide log snippets):
When upgrading an "AWS IPI FIPS ENCRYPTION 3AZ RHCOS 3M 3W 3I" cluster from 4.12 to 4.13, one rook-ceph-mgr pod crashed, which also led to a Ceph health warning.

Version of all relevant components (if applicable):
OCP 4.12.0, ODF 4.12.2 before the upgrade. OCP 4.13.0, ODF 4.13.0-169 after the upgrade.

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes. Ceph health is not OK.

Is there any workaround available to the best of your knowledge?
Yes. The Ceph crash warning can probably be silenced by archiving the crash from the toolbox pod (see the command sketch after "Expected results" below).

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes. Run an upgrade from 4.12 to 4.13 for both OCP and ODF.

Can this issue be reproduced from the UI?
I think so.

If this is a regression, please provide more details to justify this:
Yes. Previous similar upgrade tests passed, for example this one: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/493/10648/469343/469358/469360/log?logParams=history%3D433269%26page.page%3D1.

Steps to Reproduce:
1. Run an upgrade from 4.12 to 4.13 for OCP and ODF.
2. Check for a rook-ceph-mgr pod crash (using the command "ceph crash ls"), and check the Ceph health.

Actual results:
One rook-ceph-mgr pod crashed and the Ceph health is not OK.

Expected results:
No rook-ceph pod crash should appear, and Ceph health should be OK.
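A minimal sketch of the check and the possible workaround, run from the Rook/Ceph toolbox pod. This assumes the rook-ceph-tools deployment is enabled in the openshift-storage namespace and carries the usual app=rook-ceph-tools label; <crash-id> is a placeholder taken from the "ceph crash ls" output, not a value from this cluster.

    # Open a shell in the toolbox pod (assumes rook-ceph-tools is deployed)
    TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
    oc -n openshift-storage rsh $TOOLS_POD

    # Inside the toolbox: list recorded crashes and check cluster health
    ceph crash ls
    ceph health detail

    # Possible workaround: archive the crash so the RECENT_CRASH warning clears
    ceph crash info <crash-id>
    ceph crash archive <crash-id>
    # or archive every recorded crash at once
    ceph crash archive-all

Archiving only silences the health warning; it does not address the underlying mgr crash.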
Additional info:
RP link: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/493/10648/469343/469358/469360/log?logParams=history%3D433269%26page.page%3D1
Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/7616/
Link to the rook-ceph-mgr crash log: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-007aife3c333-uba/j-007aife3c333-uba_20230420T185929/logs/failed_testcase_ocs_logs_1682021026/test_crush_map_unchanged_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-79f522ddb035becf5878305c4af24de6d83610b42e849505b5159ab20b8bb5fa/ceph/must_gather_commands/ceph_crash_ls
Link to the pod logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-007aife3c333-uba/j-007aife3c333-uba_20230420T185929/logs/failed_testcase_ocs_logs_1682021026/test_crush_map_unchanged_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-79f522ddb035becf5878305c4af24de6d83610b42e849505b5159ab20b8bb5fa/namespaces/openshift-storage/oc_output/all_-o_wide
I didn't see the error above with other deployments, so I think it's a problem only with the specific deployment above.
In the mgr log [1], the following is seen. After the pod restarted, the current log [2] does not show any issues.

    2023-04-20T23:26:29.905955523Z   what():  End of buffer
    2023-04-20T23:26:29.905988328Z *** Caught signal (Aborted) **
    2023-04-20T23:26:29.905988328Z  in thread 7fc3946cb640 thread_name:ms_dispatch
    2023-04-20T23:26:29.906669365Z  ceph version 17.2.6-10.el9cp (19b8858bfb3d0d1b84ec6f0d3fd7c6148831f7c8) quincy (stable)
    2023-04-20T23:26:29.906669365Z  1: /lib64/libc.so.6(+0x54d90) [0x7fc3dd867d90]
    2023-04-20T23:26:29.906669365Z  2: /lib64/libc.so.6(+0xa154c) [0x7fc3dd8b454c]
    2023-04-20T23:26:29.906669365Z  3: raise()
    2023-04-20T23:26:29.906669365Z  4: abort()
    2023-04-20T23:26:29.906669365Z  5: /lib64/libstdc++.so.6(+0xa1a21) [0x7fc3ddbb3a21]
    2023-04-20T23:26:29.906669365Z  6: /lib64/libstdc++.so.6(+0xad39c) [0x7fc3ddbbf39c]
    2023-04-20T23:26:29.906669365Z  7: /lib64/libstdc++.so.6(+0xad407) [0x7fc3ddbbf407]
    2023-04-20T23:26:29.906669365Z  8: /lib64/libstdc++.so.6(+0xad669) [0x7fc3ddbbf669]
    2023-04-20T23:26:29.906669365Z  9: /usr/lib64/ceph/libceph-common.so.2(+0x170c95) [0x7fc3ddeabc95]
    2023-04-20T23:26:29.906669365Z  10: (SnapRealmInfo::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x3f) [0x7fc3ddfddfef]
    2023-04-20T23:26:29.906669365Z  11: /lib64/libcephfs.so.2(+0xa1007) [0x7fc3d243e007]
    2023-04-20T23:26:29.906669365Z  12: /lib64/libcephfs.so.2(+0xab969) [0x7fc3d2448969]
    2023-04-20T23:26:29.906669365Z  13: /lib64/libcephfs.so.2(+0xacc20) [0x7fc3d2449c20]
    2023-04-20T23:26:29.906669365Z  14: /lib64/libcephfs.so.2(+0x939d8) [0x7fc3d24309d8]
    2023-04-20T23:26:29.906669365Z  15: (DispatchQueue::entry()+0x53a) [0x7fc3de0693ca]
    2023-04-20T23:26:29.906669365Z  16: /usr/lib64/ceph/libceph-common.so.2(+0x3b9ed1) [0x7fc3de0f4ed1]
    2023-04-20T23:26:29.906669365Z  17: /lib64/libc.so.6(+0x9f802) [0x7fc3dd8b2802]
    2023-04-20T23:26:29.906669365Z  18: /lib64/libc.so.6(+0x3f450) [0x7fc3dd852450]
    2023-04-20T23:26:29.906904618Z debug 2023-04-20T23:26:29.905+0000 7fc3946cb640 -1 *** Caught signal (Aborted) **
    2023-04-20T23:26:29.906904618Z  in thread 7fc3946cb640 thread_name:ms_dispatch

[1] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-007aife3c333-uba/j-007aife3c333-uba_20230420T185929/logs/failed_testcase_ocs_logs_1682021026/test_crush_map_unchanged_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-79f522ddb035becf5878305c4af24de6d83610b42e849505b5159ab20b8bb5fa/namespaces/openshift-storage/pods/rook-ceph-mgr-a-848474c4d9-cxngw/mgr/mgr/logs/previous.log
[2] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-007aife3c333-uba/j-007aife3c333-uba_20230420T185929/logs/failed_testcase_ocs_logs_1682021026/test_crush_map_unchanged_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-79f522ddb035becf5878305c4af24de6d83610b42e849505b5159ab20b8bb5fa/namespaces/openshift-storage/pods/rook-ceph-mgr-a-848474c4d9-cxngw/mgr/mgr/logs/current.log
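For reference, a sketch of how the same backtrace can be pulled from a live cluster rather than the must-gather, assuming the mgr pod name above and the "mgr" container name that Rook normally uses; <crash-id> is again a placeholder from "ceph crash ls":

    # Log of the previous (crashed) mgr container instance
    oc -n openshift-storage logs rook-ceph-mgr-a-848474c4d9-cxngw -c mgr --previous

    # Full recorded backtrace from the Ceph crash module (run inside the toolbox pod)
    ceph crash info <crash-id>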