Bug 2253185 - [GSS] During installation of ODF 4.14 'rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod' stuck in CrashLoopBackOff state
Summary: [GSS] During installation of ODF 4.14 'rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod' stuck in CrashLoopBackOff state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.14
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: Parth Arora
QA Contact: Uday kurundwade
URL:
Whiteboard:
Depends On:
Blocks: 2246375 2254547
 
Reported: 2023-12-06 12:55 UTC by Karun Josy
Modified: 2024-05-30 07:28 UTC (History)
CC List: 15 users

Fixed In Version: 4.15.0-123
Doc Type: Rebase: Bug Fixes and Enhancements
Doc Text:
.'rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod' stuck in CrashLoopBackOff state
Previously, the `rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a` pod was stuck in the `CrashLoopBackOff` state because the RADOS Gateway (RGW) multisite zonegroup was not being created and fetched, and the error handling reported the wrong text. With this release, the error handling bug in the multisite configuration is fixed, and fetching the zonegroup is improved by fetching it for the particular rgw-realm that was created earlier. As a result, the multisite configuration and the `rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a` pod are created successfully.
Clone Of:
: 2254547 (view as bug list)
Environment:
Last Closed: 2024-03-19 15:29:24 UTC
Embargoed:
kjosy: needinfo-


Attachments


Links
- Github rook/rook pull 12817 (Merged): object: improve the error handling for multisite objs (last updated 2023-12-14 03:47:10 UTC)
- Red Hat Knowledge Base Solution 7049176 (last updated 2023-12-14 09:55:34 UTC)
- Red Hat Product Errata RHSA-2024:1383 (last updated 2024-03-19 15:29:43 UTC)

Description Karun Josy 2023-12-06 12:55:43 UTC
* Description of problem (please be as detailed as possible and provide log snippets):

ODF 4.14 installation does not complete because the 'rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a' pod is stuck in the CrashLoopBackOff state.


* Version of all relevant components (if applicable):
ODF 4.14
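
For reference, a quick way to confirm the symptom (a sketch assuming the default openshift-storage namespace; the label and names below are the usual Rook/ODF defaults and may differ per cluster):

    # List the RGW pods and check for CrashLoopBackOff / restart counts
    oc -n openshift-storage get pods -l app=rook-ceph-rgw
    # Inspect events and the previous container log of the crashing pod
    oc -n openshift-storage describe pod <rgw-pod-name>
    oc -n openshift-storage logs <rgw-pod-name> --previous

(<rgw-pod-name> is a placeholder for the actual pod name returned by the first command.)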

Comment 8 Santosh Pillai 2023-12-13 04:07:18 UTC
@karun A few questions:

- Does restarting the rook operator fix the issue completely, and is the cluster usable after that? (Trying to understand if there is a race condition like you mentioned above.)
- In the logs that you have shared, the ceph status seems to be `HEALTH_OK` and no OSDs are down. Was the must-gather taken after the operator was restarted and the StorageCluster was successful?
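
If a fresh must-gather is needed after restarting the operator, it can be collected with the ODF must-gather image for the installed release (the image reference below is a placeholder; use the one documented for ODF 4.14):

    oc adm must-gather --image=<odf-must-gather-image>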

Comment 10 Santosh Pillai 2023-12-13 04:31:32 UTC
(In reply to Santosh Pillai from comment #8)
> @karun A few questions:
> 
> - Does restarting the rook operator fix the issue completely, and is the
> cluster usable after that? (Trying to understand if there is a race
> condition like you mentioned above.)
> - In the logs that you have shared, the ceph status seems to be `HEALTH_OK`
> and no OSDs are down. Was the must-gather taken after the operator was
> restarted and the StorageCluster was successful?

- Is it possible to get the output of `radosgw-admin zonegroup list` on a cluster with this error versus on a cluster that is working fine?
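
For example, assuming the rook-ceph toolbox pod is enabled in the openshift-storage namespace (the label and namespace below are the usual ODF defaults), the output can be collected with:

    # Run radosgw-admin inside the toolbox pod
    TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
    oc -n openshift-storage rsh $TOOLS_POD radosgw-admin zonegroup list
    # Optionally scope the query to the realm of the affected object store
    oc -n openshift-storage rsh $TOOLS_POD radosgw-admin zonegroup list --rgw-realm=<realm-name>

(<realm-name> is a placeholder; `radosgw-admin realm list` run in the same pod shows the realms present.)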

Comment 13 Santosh Pillai 2023-12-14 03:48:43 UTC
Thanks, Karun, for the details. This should be fixed by Parth's PR https://github.com/rook/rook/pull/12817, which correctly handles the timeout errors and increases the retry count.
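
One rough way to watch the reconcile behavior on a running cluster (assuming the default openshift-storage namespace and the standard rook-ceph-operator deployment name) is to follow the operator log and filter for the multisite/zonegroup messages:

    oc -n openshift-storage logs deploy/rook-ceph-operator -f | grep -i -e zonegroup -e multisite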

Comment 39 errata-xmlrpc 2024-03-19 15:29:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

