Bug 2196020
| Summary: | When upgrading an "AWS IPI 3M 3W 3I" cluster from 4.12 to 4.13, one rook-ceph-mgr pod crashed, which also led to a Ceph health warning | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Itzhak <ikave> |
| Component: | ceph | Assignee: | Radoslaw Zarzynski <rzarzyns> |
| ceph sub component: | Ceph-MGR | QA Contact: | Elad <ebenahar> |
| Status: | CLOSED WORKSFORME | Docs Contact: | |
| Severity: | unspecified | ||
| Priority: | unspecified | CC: | bniver, muagarwa, nojha, odf-bz-bot, sostapov |
| Version: | 4.13 | ||
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-08-08 13:06:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Itzhak
2023-05-07 15:39:54 UTC
Additional info:

RP link: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/493/10648/469343/469358/469360/log?logParams=history%3D433269%26page.page%3D1

Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/7616/

Link to the rook-ceph-mgr crash log: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-007aife3c333-uba/j-007aife3c333-uba_20230420T185929/logs/failed_testcase_ocs_logs_1682021026/test_crush_map_unchanged_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-79f522ddb035becf5878305c4af24de6d83610b42e849505b5159ab20b8bb5fa/ceph/must_gather_commands/ceph_crash_ls

Link to the pod logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-007aife3c333-uba/j-007aife3c333-uba_20230420T185929/logs/failed_testcase_ocs_logs_1682021026/test_crush_map_unchanged_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-79f522ddb035becf5878305c4af24de6d83610b42e849505b5159ab20b8bb5fa/namespaces/openshift-storage/oc_output/all_-o_wide

I didn't see the error above with other deployments. So I think it's a problem only with the specific deployment above.

In the mgr log [1], the following is seen. After the pod restarted, the current log [2] does not show any issues.

2023-04-20T23:26:29.905955523Z what(): End of buffer
2023-04-20T23:26:29.905988328Z *** Caught signal (Aborted) **
2023-04-20T23:26:29.905988328Z in thread 7fc3946cb640 thread_name:ms_dispatch
2023-04-20T23:26:29.906669365Z ceph version 17.2.6-10.el9cp (19b8858bfb3d0d1b84ec6f0d3fd7c6148831f7c8) quincy (stable)
2023-04-20T23:26:29.906669365Z 1: /lib64/libc.so.6(+0x54d90) [0x7fc3dd867d90]
2023-04-20T23:26:29.906669365Z 2: /lib64/libc.so.6(+0xa154c) [0x7fc3dd8b454c]
2023-04-20T23:26:29.906669365Z 3: raise()
2023-04-20T23:26:29.906669365Z 4: abort()
2023-04-20T23:26:29.906669365Z 5: /lib64/libstdc++.so.6(+0xa1a21) [0x7fc3ddbb3a21]
2023-04-20T23:26:29.906669365Z 6: /lib64/libstdc++.so.6(+0xad39c) [0x7fc3ddbbf39c]
2023-04-20T23:26:29.906669365Z 7: /lib64/libstdc++.so.6(+0xad407) [0x7fc3ddbbf407]
2023-04-20T23:26:29.906669365Z 8: /lib64/libstdc++.so.6(+0xad669) [0x7fc3ddbbf669]
2023-04-20T23:26:29.906669365Z 9: /usr/lib64/ceph/libceph-common.so.2(+0x170c95) [0x7fc3ddeabc95]
2023-04-20T23:26:29.906669365Z 10: (SnapRealmInfo::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x3f) [0x7fc3ddfddfef]
2023-04-20T23:26:29.906669365Z 11: /lib64/libcephfs.so.2(+0xa1007) [0x7fc3d243e007]
2023-04-20T23:26:29.906669365Z 12: /lib64/libcephfs.so.2(+0xab969) [0x7fc3d2448969]
2023-04-20T23:26:29.906669365Z 13: /lib64/libcephfs.so.2(+0xacc20) [0x7fc3d2449c20]
2023-04-20T23:26:29.906669365Z 14: /lib64/libcephfs.so.2(+0x939d8) [0x7fc3d24309d8]
2023-04-20T23:26:29.906669365Z 15: (DispatchQueue::entry()+0x53a) [0x7fc3de0693ca]
2023-04-20T23:26:29.906669365Z 16: /usr/lib64/ceph/libceph-common.so.2(+0x3b9ed1) [0x7fc3de0f4ed1]
2023-04-20T23:26:29.906669365Z 17: /lib64/libc.so.6(+0x9f802) [0x7fc3dd8b2802]
2023-04-20T23:26:29.906669365Z 18: /lib64/libc.so.6(+0x3f450) [0x7fc3dd852450]
2023-04-20T23:26:29.906904618Z debug 2023-04-20T23:26:29.905+0000 7fc3946cb640 -1 *** Caught signal (Aborted) **
2023-04-20T23:26:29.906904618Z in thread 7fc3946cb640 thread_name:ms_dispatch

[1] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-007aife3c333-uba/j-007aife3c333-uba_20230420T185929/logs/failed_testcase_ocs_logs_1682021026/test_crush_map_unchanged_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-79f522ddb035becf5878305c4af24de6d83610b42e849505b5159ab20b8bb5fa/namespaces/openshift-storage/pods/rook-ceph-mgr-a-848474c4d9-cxngw/mgr/mgr/logs/previous.log
[2] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-007aife3c333-uba/j-007aife3c333-uba_20230420T185929/logs/failed_testcase_ocs_logs_1682021026/test_crush_map_unchanged_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-79f522ddb035becf5878305c4af24de6d83610b42e849505b5159ab20b8bb5fa/namespaces/openshift-storage/pods/rook-ceph-mgr-a-848474c4d9-cxngw/mgr/mgr/logs/current.log |
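
For triage, it can help to pull out only the symbolized frames from a backtrace like the one in the mgr log, since those show which code path raised the "End of buffer" abort. The following is a minimal sketch, not part of the original report: the helper names `parse_frames` and `symbolized_frames` are hypothetical, the regular expression assumes the frame layout seen in this log (timestamp, frame index, frame body, optional bracketed address), and `SAMPLE` holds a few lines copied from the trace.

```python
import re

# A few lines copied from the mgr backtrace in this report.
SAMPLE = """\
2023-04-20T23:26:29.905955523Z what(): End of buffer
2023-04-20T23:26:29.906669365Z 3: raise()
2023-04-20T23:26:29.906669365Z 9: /usr/lib64/ceph/libceph-common.so.2(+0x170c95) [0x7fc3ddeabc95]
2023-04-20T23:26:29.906669365Z 10: (SnapRealmInfo::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x3f) [0x7fc3ddfddfef]
2023-04-20T23:26:29.906669365Z 15: (DispatchQueue::entry()+0x53a) [0x7fc3de0693ca]
"""

# Assumed layout of one frame line: "<timestamp>Z <index>: <body> [<address>]",
# where the bracketed address is optional (e.g. "3: raise()" has none).
FRAME_RE = re.compile(
    r"^(?P<ts>\S+Z)\s+(?P<idx>\d+): (?P<body>.+?)(?: \[(?P<addr>0x[0-9a-f]+)\])?$"
)


def parse_frames(log_text):
    """Return a list of dicts, one per backtrace frame found in log_text."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # not a frame line (e.g. the "what():" or "ceph version" lines)
        body = match.group("body")
        frames.append(
            {
                "index": int(match.group("idx")),
                "body": body,
                "address": match.group("addr"),
                # Unsymbolized frames start with a library path such as /lib64/...
                "symbolized": not body.startswith("/"),
            }
        )
    return frames


def symbolized_frames(log_text):
    """Return only the frame bodies that carry a function name."""
    return [f["body"] for f in parse_frames(log_text) if f["symbolized"]]


if __name__ == "__main__":
    for body in symbolized_frames(SAMPLE):
        print(body)
```

Run against the full trace, this would list `raise()`, `abort()`, the `SnapRealmInfo::decode` frame, and the `DispatchQueue::entry` frame, which are the only frames in this log that carry function names.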