Bug 2253185
| Summary | [GSS] During installation of ODF 4.14 'rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod' stuck in CrashLoopBackOff state |
|---|---|
| Product | [Red Hat Storage] Red Hat OpenShift Data Foundation |
| Reporter | Karun Josy <kjosy> |
| Component | rook |
| Assignee | Parth Arora <paarora> |
| Status | CLOSED ERRATA |
| QA Contact | Uday kurundwade <ukurundw> |
| Severity | high |
| Priority | unspecified |
| Version | 4.14 |
| CC | akestert, cgaynor, ebenahar, jthottan, kbg, kelwhite, kurathod, muagarwa, nigoyal, odf-bz-bot, paarora, sapillai, sheggodu, tdesala, tnielsen |
| Flags | kjosy: needinfo- |
| Target Release | ODF 4.15.0 |
| Hardware | All |
| OS | Linux |
| Fixed In Version | 4.15.0-123 |
| Doc Type | Rebase: Bug Fixes and Enhancements |
| Doc Text | 'rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod' stuck in CrashLoopBackOff state: Previously, the `rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a` pod could get stuck in the `CrashLoopBackOff` state because the RADOS Gateway (RGW) multisite zonegroup was not created and fetched correctly, and the error handling reported a misleading message. With this release, the error-handling bug in the multisite configuration is fixed, and the zonegroup is now fetched for the particular rgw-realm that was created earlier. As a result, the multisite configuration and the `rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a` pod are created successfully. (A manual zonegroup check is sketched after this table.) |
| Cloned to | 2254547 (view as bug list) |
| Last Closed | 2024-03-19 15:29:24 UTC |
| Type | Bug |
| Bug Blocks | 2246375, 2254547 |
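The Doc Text above says the fixed operator fetches the zonegroup for the particular rgw-realm it created, rather than relying on an unscoped lookup. A minimal manual check along the same lines is sketched below; it assumes the rook-ceph-tools (toolbox) deployment is running in the `openshift-storage` namespace and that the default ODF object-store naming (`ocs-storagecluster-cephobjectstore`) is in use, so adjust the names for other setups.

```shell
# Sketch only: names below assume the default ODF object store
# ("ocs-storagecluster-cephobjectstore") and a deployed toolbox pod.
NS=openshift-storage
TOOLS=deploy/rook-ceph-tools
REALM=ocs-storagecluster-cephobjectstore

# Fetch the zonegroup scoped to the realm that Rook created,
# roughly mirroring the realm-scoped lookup described in the Doc Text.
oc -n "$NS" exec "$TOOLS" -- \
  radosgw-admin zonegroup get --rgw-realm="$REALM" --rgw-zonegroup="$REALM"
```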
Description (Karun Josy, 2023-12-06 12:55:43 UTC)
@karun A few questions:

- Does restarting the rook operator fix the issue completely, and is the cluster usable after that? (Trying to understand if there is a race condition like you mentioned above.)
- In the logs that you have shared, the ceph status seems to be `HEALTH_OK` and no OSDs are down. Was the must-gather taken after the operator was restarted and the StorageCluster was successful?

(In reply to Santosh Pillai from comment #8)
> @karun A few questions:
>
> - Does restarting the rook operator fix the issue completely, and is the cluster usable after that? (Trying to understand if there is a race condition like you mentioned above.)
> - In the logs that you have shared, the ceph status seems to be `HEALTH_OK` and no OSDs are down. Was the must-gather taken after the operator was restarted and the StorageCluster was successful?

- Is it possible to get the output of `radosgw-admin zonegroup list` on a cluster with this error vs. on a cluster that is working fine?

Thanks Karun for the details. This should be fixed by Parth's PR https://github.com/rook/rook/pull/12817, which correctly handles the timeout errors and increases the retry count.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383
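For anyone comparing clusters as asked in the comments above, the `radosgw-admin zonegroup list` output can be gathered roughly as follows. This is only a sketch and assumes the rook-ceph-tools toolbox deployment is available in the `openshift-storage` namespace.

```shell
# Sketch: compare this output between the affected cluster and a healthy one.
# On an affected install the expected zonegroup entry may be missing.
oc -n openshift-storage exec deploy/rook-ceph-tools -- radosgw-admin realm list
oc -n openshift-storage exec deploy/rook-ceph-tools -- radosgw-admin zonegroup list
```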