Bug 2253185

Summary: [GSS] During installation of ODF 4.14, 'rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a' pod stuck in CrashLoopBackOff state
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Karun Josy <kjosy>
Component: rook
Assignee: Parth Arora <paarora>
Status: CLOSED ERRATA
QA Contact: Uday kurundwade <ukurundw>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.14
CC: akestert, cgaynor, ebenahar, jthottan, kbg, kelwhite, kurathod, muagarwa, nigoyal, odf-bz-bot, paarora, sapillai, sheggodu, tdesala, tnielsen
Target Milestone: ---
Flags: kjosy: needinfo-
Target Release: ODF 4.15.0
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: 4.15.0-123
Doc Type: Rebase: Bug Fixes and Enhancements
Doc Text:
.'rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a' pod stuck in CrashLoopBackOff state
Previously, the `rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a` pod was stuck in the `CrashLoopBackOff` state because the RADOS Gateway (RGW) multisite zonegroup was not getting created and fetched, and the error handling reported the wrong text. With this release, the error-handling bug in the multisite configuration is fixed, and fetching the zonegroup is improved by fetching it for the particular rgw-realm that was created earlier. As a result, the multisite configuration and the `rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a` pod are created successfully.
Story Points: ---
Clone Of:
Clones: 2254547 (view as bug list)
Environment:
Last Closed: 2024-03-19 15:29:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2246375, 2254547

Description Karun Josy 2023-12-06 12:55:43 UTC
* Description of problem (please be as detailed as possible and provide log snippets):

ODF 4.14 installation does not complete because the `rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a` pod is stuck in the CrashLoopBackOff state (a command sketch for inspecting the pod follows at the end of this comment).


* Version of all relevant components (if applicable):
ODF 4.14
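
For reference, a minimal sketch of how the failing pod can be inspected, assuming the default `openshift-storage` namespace (the exact pod name suffix is cluster-specific and shown here as a placeholder):

# Check the RGW pod status and restart count
oc get pods -n openshift-storage | grep rgw

# Look at recent events and the previous container's log for the crash reason
oc describe pod -n openshift-storage rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-<suffix>
oc logs -n openshift-storage rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-<suffix> --previous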

Comment 8 Santosh Pillai 2023-12-13 04:07:18 UTC
@karun A few questions:

- Does restarting the rook operator fix the issue completely, and is the cluster usable after that? (Trying to understand whether there is a race condition like you mentioned above; a restart sketch follows below.)
- In the logs you shared, the ceph status seems to be `HEALTH_OK` and no OSDs are down. Was the must-gather taken after the operator was restarted and the StorageCluster was successful?
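
For reference, a minimal sketch of the operator restart and the follow-up checks, assuming the default `openshift-storage` namespace and the standard `rook-ceph-operator` deployment name and label:

# Restart the Rook operator by rolling its deployment
oc rollout restart deployment/rook-ceph-operator -n openshift-storage

# Confirm the operator comes back and the StorageCluster reconciles
oc get pods -n openshift-storage -l app=rook-ceph-operator
oc get storagecluster -n openshift-storage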

Comment 10 Santosh Pillai 2023-12-13 04:31:32 UTC
(In reply to Santosh Pillai from comment #8)

- Is it possible to get the output of `radosgw-admin zonegroup list` on a cluster with this error vs. on a cluster that is working fine?
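
For whoever collects this, a minimal sketch assuming the Rook toolbox (`rook-ceph-tools`) deployment is enabled in the `openshift-storage` namespace; the realm name below assumes it matches the CephObjectStore name, as in a default ODF install:

# Open a shell in the toolbox pod
oc rsh -n openshift-storage deploy/rook-ceph-tools

# Inside the toolbox: list zonegroups, optionally scoped to the realm Rook created for the object store
radosgw-admin zonegroup list
radosgw-admin zonegroup list --rgw-realm=ocs-storagecluster-cephobjectstore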

Comment 13 Santosh Pillai 2023-12-14 03:48:43 UTC
Thanks, Karun, for the details. This should be fixed by Parth's PR https://github.com/rook/rook/pull/12817, which correctly handles the timeout errors and increases the retry count.

Comment 39 errata-xmlrpc 2024-03-19 15:29:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383