Bug 2253185 - [GSS] During installation of ODF 4.14 'rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod' stuck in CrashLoopBackOff state
Summary: [GSS] During installation of ODF 4.14 'rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod' stuck in CrashLoopBackOff state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.14
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: Parth Arora
QA Contact: Uday kurundwade
URL:
Whiteboard:
Depends On:
Blocks: 2246375 2254547
 
Reported: 2023-12-06 12:55 UTC by Karun Josy
Modified: 2024-05-30 07:28 UTC (History)
CC List: 15 users

Fixed In Version: 4.15.0-123
Doc Type: Rebase: Bug Fixes and Enhancements
Doc Text:
.'rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod' stuck in CrashLoopBackOff state
Previously, the `rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a` pod was stuck in the `CrashLoopBackOff` state because the RADOS Gateway (RGW) multisite zonegroup was not being created and fetched, and the error handling reported the wrong text. With this release, the error handling bug in the multisite configuration is fixed, and fetching the zonegroup is improved by fetching it for the particular rgw-realm that was created earlier. As a result, the multisite configuration and the `rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a` pod are created successfully.
Clone Of:
: 2254547 (view as bug list)
Environment:
Last Closed: 2024-03-19 15:29:24 UTC
Embargoed:
kjosy: needinfo-


Attachments


Links
- Github rook/rook pull 12817 (Merged): object: improve the error handling for multisite objs (last updated 2023-12-14 03:47:10 UTC)
- Red Hat Knowledge Base Solution 7049176 (last updated 2023-12-14 09:55:34 UTC)
- Red Hat Product Errata RHSA-2024:1383 (last updated 2024-03-19 15:29:43 UTC)

Description Karun Josy 2023-12-06 12:55:43 UTC
* Description of problem (please be as detailed as possible and provide log snippets):

ODF 4.14 installation does not complete because the 'rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a' pod is stuck in the CrashLoopBackOff state.


* Version of all relevant components (if applicable):
ODF 4.14
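
For reference, a quick way to confirm the symptom (a sketch assuming the default openshift-storage namespace; the label and names below are the usual Rook/ODF defaults and may differ per cluster):

    # List the RGW pods and check for CrashLoopBackOff / restart counts
    oc -n openshift-storage get pods -l app=rook-ceph-rgw
    # Inspect events and the previous container log of the crashing pod
    oc -n openshift-storage describe pod <rgw-pod-name>
    oc -n openshift-storage logs <rgw-pod-name> --previous

(<rgw-pod-name> is a placeholder for the actual pod name returned by the first command.)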

Comment 8 Santosh Pillai 2023-12-13 04:07:18 UTC
@karun A few questions:

- Does restarting the rook operator fix the issue completely, and is the cluster usable after that? (Trying to understand if there is a race condition like you mentioned above.)
- In the logs that you have shared, the ceph status seems to be `HEALTH_OK` and no OSDs are down. Was the must-gather taken after the operator was restarted and the StorageCluster was successful?
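
If a fresh must-gather is needed after restarting the operator, it can be collected with the ODF must-gather image for the installed release (the image reference below is a placeholder; use the one documented for ODF 4.14):

    oc adm must-gather --image=<odf-must-gather-image>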

Comment 10 Santosh Pillai 2023-12-13 04:31:32 UTC
(In reply to Santosh Pillai from comment #8)
> @karun A few questions:
> 
> - Does restarting the rook operator fix the issue completely, and is the
> cluster usable after that? (Trying to understand if there is a race
> condition like you mentioned above.)
> - In the logs that you have shared, the ceph status seems to be `HEALTH_OK`
> and no OSDs are down. Was the must-gather taken after the operator was
> restarted and the StorageCluster was successful?

- Is it possible to get the output of `radosgw-admin zonegroup list` on a cluster with this error versus on a cluster that is working fine?
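
For example, assuming the rook-ceph toolbox pod is enabled in the openshift-storage namespace (the label and namespace below are the usual ODF defaults), the output can be collected with:

    # Run radosgw-admin inside the toolbox pod
    TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
    oc -n openshift-storage rsh $TOOLS_POD radosgw-admin zonegroup list
    # Optionally scope the query to the realm of the affected object store
    oc -n openshift-storage rsh $TOOLS_POD radosgw-admin zonegroup list --rgw-realm=<realm-name>

(<realm-name> is a placeholder; `radosgw-admin realm list` run in the same pod shows the realms present.)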

Comment 13 Santosh Pillai 2023-12-14 03:48:43 UTC
Thanks, Karun, for the details. This should be fixed by Parth's PR https://github.com/rook/rook/pull/12817, which correctly handles the timeout errors and increases the retry count.
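
One rough way to watch the reconcile behavior on a running cluster (assuming the default openshift-storage namespace and the standard rook-ceph-operator deployment name) is to follow the operator log and filter for the multisite/zonegroup messages:

    oc -n openshift-storage logs deploy/rook-ceph-operator -f | grep -i -e zonegroup -e multisite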

Comment 39 errata-xmlrpc 2024-03-19 15:29:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

