Bug 2196020 - When upgrading an "AWS IPI 3M 3W 3I" cluster from 4.12 to 4.13, one rook-ceph-mgr pod crashed, which also led to a Ceph health warning
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Radoslaw Zarzynski
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-05-07 15:39 UTC by Itzhak
Modified: 2023-08-09 16:37 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-08-08 13:06:16 UTC
Embargoed:



Description Itzhak 2023-05-07 15:39:54 UTC
Description of problem (please be as detailed as possible and provide log snippets):
When upgrading an "AWS IPI FIPS ENCRYPTION 3AZ RHCOS 3M 3W 3I" cluster from 4.12 to 4.13, one rook-ceph-mgr pod crashed, which also led to a Ceph health warning.


Version of all relevant components (if applicable):
OCP 4.12.0, ODF 4.12.2 before the upgrade. OCP 4.13.0, ODF 4.13.0-169 after the upgrade.

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes. Ceph Health is not OK.

Is there any workaround available to the best of your knowledge?
Yes. The Ceph crash warning can likely be silenced by archiving the crash reports.
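
A minimal sketch of that workaround, assuming the rook-ceph toolbox pod is enabled in the openshift-storage namespace (run inside the toolbox):

  ceph crash ls                # list crash entries with their IDs
  ceph crash info <crash-id>   # optionally inspect a specific entry first
  ceph crash archive-all       # archive all entries; clears the RECENT_CRASH health warning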

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes. Run an upgrade from 4.12 to 4.13 for both OCP and ODF.
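
A sketch of the upgrade itself, assuming a connected cluster and the default ODF operator subscription name (odf-operator); the channel and version values below are illustrative:

  # OCP: move the cluster to a 4.13 release
  oc adm upgrade --to=4.13.0
  # ODF: switch the operator subscription to the 4.13 channel
  oc -n openshift-storage patch subscription.operators.coreos.com odf-operator \
    --type merge -p '{"spec":{"channel":"stable-4.13"}}'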

Can this issue be reproduced from the UI?
I think so.

If this is a regression, please provide more details to justify this:
Yes. In previous similar upgrade tests it was fine (for example, this one: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/493/10648/469343/469358/469360/log?logParams=history%3D433269%26page.page%3D1).

Steps to Reproduce:
1. Run an upgrade from 4.12 to 4.13 for OCP and ODF.
2. Check for a rook-ceph-mgr pod crash (using the command "ceph crash ls"), and check the Ceph health (see the sketch below).
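
For step 2, a check sketch, assuming the rook-ceph toolbox pod is deployed (the label selector below is the one Rook sets on the toolbox):

  TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n 1)
  oc -n openshift-storage rsh "$TOOLS" ceph crash ls
  oc -n openshift-storage rsh "$TOOLS" ceph health detail
  # also check the mgr pods for restarts
  oc -n openshift-storage get pods -l app=rook-ceph-mgr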


Actual results:
One rook-ceph-mgr pod crashed, and the Ceph health is not OK.

Expected results:
No rook-ceph pod crash should appear, and Ceph health should be OK.

Additional info:
RP link: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/493/10648/469343/469358/469360/log?logParams=history%3D433269%26page.page%3D1
Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/7616/
Link to the rook-ceph-mgr crash log: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-007aife3c333-uba/j-007aife3c333-uba_20230420T185929/logs/failed_testcase_ocs_logs_1682021026/test_crush_map_unchanged_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-79f522ddb035becf5878305c4af24de6d83610b42e849505b5159ab20b8bb5fa/ceph/must_gather_commands/ceph_crash_ls
Link to the pod logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-007aife3c333-uba/j-007aife3c333-uba_20230420T185929/logs/failed_testcase_ocs_logs_1682021026/test_crush_map_unchanged_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-79f522ddb035becf5878305c4af24de6d83610b42e849505b5159ab20b8bb5fa/namespaces/openshift-storage/oc_output/all_-o_wide
I didn't see the error above with other deployments, so I think the problem is specific to the deployment configuration above.

Comment 3 Travis Nielsen 2023-05-08 17:54:46 UTC
In the mgr log [1], the following is seen. After the pod restarted, the current log [2] does not show any issues.

2023-04-20T23:26:29.905955523Z   what():  End of buffer
2023-04-20T23:26:29.905988328Z *** Caught signal (Aborted) **
2023-04-20T23:26:29.905988328Z  in thread 7fc3946cb640 thread_name:ms_dispatch
2023-04-20T23:26:29.906669365Z  ceph version 17.2.6-10.el9cp (19b8858bfb3d0d1b84ec6f0d3fd7c6148831f7c8) quincy (stable)
2023-04-20T23:26:29.906669365Z  1: /lib64/libc.so.6(+0x54d90) [0x7fc3dd867d90]
2023-04-20T23:26:29.906669365Z  2: /lib64/libc.so.6(+0xa154c) [0x7fc3dd8b454c]
2023-04-20T23:26:29.906669365Z  3: raise()
2023-04-20T23:26:29.906669365Z  4: abort()
2023-04-20T23:26:29.906669365Z  5: /lib64/libstdc++.so.6(+0xa1a21) [0x7fc3ddbb3a21]
2023-04-20T23:26:29.906669365Z  6: /lib64/libstdc++.so.6(+0xad39c) [0x7fc3ddbbf39c]
2023-04-20T23:26:29.906669365Z  7: /lib64/libstdc++.so.6(+0xad407) [0x7fc3ddbbf407]
2023-04-20T23:26:29.906669365Z  8: /lib64/libstdc++.so.6(+0xad669) [0x7fc3ddbbf669]
2023-04-20T23:26:29.906669365Z  9: /usr/lib64/ceph/libceph-common.so.2(+0x170c95) [0x7fc3ddeabc95]
2023-04-20T23:26:29.906669365Z  10: (SnapRealmInfo::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x3f) [0x7fc3ddfddfef]
2023-04-20T23:26:29.906669365Z  11: /lib64/libcephfs.so.2(+0xa1007) [0x7fc3d243e007]
2023-04-20T23:26:29.906669365Z  12: /lib64/libcephfs.so.2(+0xab969) [0x7fc3d2448969]
2023-04-20T23:26:29.906669365Z  13: /lib64/libcephfs.so.2(+0xacc20) [0x7fc3d2449c20]
2023-04-20T23:26:29.906669365Z  14: /lib64/libcephfs.so.2(+0x939d8) [0x7fc3d24309d8]
2023-04-20T23:26:29.906669365Z  15: (DispatchQueue::entry()+0x53a) [0x7fc3de0693ca]
2023-04-20T23:26:29.906669365Z  16: /usr/lib64/ceph/libceph-common.so.2(+0x3b9ed1) [0x7fc3de0f4ed1]
2023-04-20T23:26:29.906669365Z  17: /lib64/libc.so.6(+0x9f802) [0x7fc3dd8b2802]
2023-04-20T23:26:29.906669365Z  18: /lib64/libc.so.6(+0x3f450) [0x7fc3dd852450]
2023-04-20T23:26:29.906904618Z debug 2023-04-20T23:26:29.905+0000 7fc3946cb640 -1 *** Caught signal (Aborted) **
2023-04-20T23:26:29.906904618Z  in thread 7fc3946cb640 thread_name:ms_dispatch


[1] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-007aife3c333-uba/j-007aife3c333-uba_20230420T185929/logs/failed_testcase_ocs_logs_1682021026/test_crush_map_unchanged_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-79f522ddb035becf5878305c4af24de6d83610b42e849505b5159ab20b8bb5fa/namespaces/openshift-storage/pods/rook-ceph-mgr-a-848474c4d9-cxngw/mgr/mgr/logs/previous.log

[2] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-007aife3c333-uba/j-007aife3c333-uba_20230420T185929/logs/failed_testcase_ocs_logs_1682021026/test_crush_map_unchanged_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-79f522ddb035becf5878305c4af24de6d83610b42e849505b5159ab20b8bb5fa/namespaces/openshift-storage/pods/rook-ceph-mgr-a-848474c4d9-cxngw/mgr/mgr/logs/current.log
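
On a live cluster, the pre-restart log referenced in [1] can also be pulled directly with oc; a sketch, assuming the pod has not been rescheduled since the crash (the pod name matches the must-gather paths above, and the mgr container in Rook mgr pods is named "mgr"):

  oc -n openshift-storage logs rook-ceph-mgr-a-848474c4d9-cxngw -c mgr --previous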

