Bug 2253185

Summary: [GSS] During installation of ODF 4.14, 'rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a' pod stuck in CrashLoopBackOff state
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Karun Josy <kjosy>
Component: rook
Assignee: Parth Arora <paarora>
Status: CLOSED ERRATA
QA Contact: Uday kurundwade <ukurundw>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.14
CC: akestert, cgaynor, ebenahar, jthottan, kbg, kelwhite, kurathod, muagarwa, nigoyal, odf-bz-bot, paarora, sapillai, sheggodu, tdesala, tnielsen
Target Milestone: ---
Flags: kjosy: needinfo-
Target Release: ODF 4.15.0
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: 4.15.0-123
Doc Type: Rebase: Bug Fixes and Enhancements
Doc Text:
.'rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a' pod stuck in CrashLoopBackOff state
Previously, the `rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a` pod was stuck in the `CrashLoopBackOff` state because the RADOS Gateway (RGW) multisite zonegroup was not getting created and fetched, and the error handling reported the wrong text. With this release, the error-handling bug in the multisite configuration is fixed, and fetching the zonegroup is improved by fetching it for the particular rgw-realm that was created earlier. As a result, the multisite configuration and the `rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a` pod are created successfully.
Story Points: ---
Clone Of:
Clones: 2254547 (view as bug list)
Environment:
Last Closed: 2024-03-19 15:29:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2246375, 2254547

Description Karun Josy 2023-12-06 12:55:43 UTC
* Description of problem (please be as detailed as possible and provide log snippets):

ODF 4.14 installation does not complete because the `rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a` pod is stuck in the CrashLoopBackOff state (a command sketch for inspecting the pod follows at the end of this comment).


* Version of all relevant components (if applicable):
ODF 4.14
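
For reference, a minimal sketch of how the failing pod can be inspected, assuming the default `openshift-storage` namespace (the exact pod name suffix is cluster-specific and shown here as a placeholder):

# Check the RGW pod status and restart count
oc get pods -n openshift-storage | grep rgw

# Look at recent events and the previous container's log for the crash reason
oc describe pod -n openshift-storage rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-<suffix>
oc logs -n openshift-storage rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-<suffix> --previous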

Comment 8 Santosh Pillai 2023-12-13 04:07:18 UTC
@karun A few questions:

- Does restarting the rook operator fix the issue completely, and is the cluster usable after that? (Trying to understand whether there is a race condition like you mentioned above; a restart sketch follows below.)
- In the logs you shared, the ceph status seems to be `HEALTH_OK` and no OSDs are down. Was the must-gather taken after the operator was restarted and the StorageCluster was successful?
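
For reference, a minimal sketch of the operator restart and the follow-up checks, assuming the default `openshift-storage` namespace and the standard `rook-ceph-operator` deployment name and label:

# Restart the Rook operator by rolling its deployment
oc rollout restart deployment/rook-ceph-operator -n openshift-storage

# Confirm the operator comes back and the StorageCluster reconciles
oc get pods -n openshift-storage -l app=rook-ceph-operator
oc get storagecluster -n openshift-storage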

Comment 10 Santosh Pillai 2023-12-13 04:31:32 UTC
(In reply to Santosh Pillai from comment #8)

- Is it possible to get the output of `radosgw-admin zonegroup list` on a cluster with this error vs. on a cluster that is working fine?
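
For whoever collects this, a minimal sketch assuming the Rook toolbox (`rook-ceph-tools`) deployment is enabled in the `openshift-storage` namespace; the realm name below assumes it matches the CephObjectStore name, as in a default ODF install:

# Open a shell in the toolbox pod
oc rsh -n openshift-storage deploy/rook-ceph-tools

# Inside the toolbox: list zonegroups, optionally scoped to the realm Rook created for the object store
radosgw-admin zonegroup list
radosgw-admin zonegroup list --rgw-realm=ocs-storagecluster-cephobjectstore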

Comment 13 Santosh Pillai 2023-12-14 03:48:43 UTC
Thanks, Karun, for the details. This should be fixed by Parth's PR https://github.com/rook/rook/pull/12817, which correctly handles the timeout errors and increases the retry count.

Comment 39 errata-xmlrpc 2024-03-19 15:29:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383