Bug 2247313 - rook-ceph-osd-prepare pods in CLBO during Installation
Summary: rook-ceph-osd-prepare pods in CLBO during Installation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: Malay Kumar parida
QA Contact: Aviad Polak
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-10-31 20:05 UTC by Malay Kumar parida
Modified: 2024-03-19 15:28 UTC

Fixed In Version: 4.15.0-103
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-03-19 15:28:13 UTC
Embargoed:


Attachments:
Logs of one of the failing osd prepare pods (29.65 KB, text/plain), 2023-10-31 20:05 UTC, Malay Kumar parida
ocs-mustgather (8.88 MB, application/gzip), 2023-10-31 20:07 UTC, Malay Kumar parida


Links:
Github red-hat-storage ocs-operator pull 2245 (Merged): Use exact ceph image version v17.2.6 instead of v17 (last updated 2023-11-02 10:29:54 UTC)
Red Hat Product Errata RHSA-2024:1383 (last updated 2024-03-19 15:28:15 UTC)

Description Malay Kumar parida 2023-10-31 20:05:08 UTC
Created attachment 1996428 [details]
Logs of one of the failing osd prepare pods

Description of problem (please be as detailed as possible and provide log snippets):

Recently I noticed that the e2e tests in the upstream ocs-operator CI started failing. Upon investigation, I found that during installation the osd-prepare pods go into CLBO state, which is why the CI was failing. I then went on to recreate this on my own cluster and was able to reproduce it. I am attaching the logs of a failing osd-prepare pod and also the OCS must-gather I collected. Considering nothing else changed, I suspect something has changed on the Ceph side, as we use the Ceph image quay.io/ceph/ceph:v17 in our upstream ocs-operator.

It is blocking all PRs in the ocs-operator repo, as the e2e tests are not working.


Is there any workaround available to the best of your knowledge?
No


Steps to Reproduce:
1. Install upstream ocs operator
2. Create a storagecluster
3. Observe that the osd-prepare pods go into CLBO (see the example commands below)
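
For example, a couple of commands to confirm the failure (the openshift-storage namespace is an assumption here; adjust it for your deployment):

# List the OSD prepare pods; in the failing state they show CrashLoopBackOff.
oc -n openshift-storage get pods | grep rook-ceph-osd-prepare

# Hypothetical pod name; inspect the logs of one of the failing prepare pods.
oc -n openshift-storage logs rook-ceph-osd-prepare-<pvc-name>-<suffix>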

Comment 2 Malay Kumar parida 2023-10-31 20:07:11 UTC
Created attachment 1996429 [details]
ocs-mustgather

Adding the must-gather

Comment 3 Travis Nielsen 2023-11-01 20:14:58 UTC
The tests are using the latest Ceph Quincy image quay.io/ceph/ceph:v17. As of four days ago this tag resolves to v17.2.7, which introduced this issue.

From the Rook operator log:

2023-10-31T14:38:45.593344167Z 2023-10-31 14:38:45.593319 I | ceph-spec: detecting the ceph image version for image quay.io/ceph/ceph:v17...
2023-10-31T14:38:47.307103937Z 2023-10-31 14:38:47.307067 I | ceph-spec: detected ceph image version: "17.2.7-0 quincy"
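
To confirm what the moving v17 tag resolved to on a given cluster, the same "detected ceph image version" line can be grepped from the Rook operator log, for example (the deployment name and namespace are assumptions for an ODF/OCS install):

# Check which Ceph image version the Rook operator detected.
oc -n openshift-storage logs deploy/rook-ceph-operator | grep "detected ceph image version"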

This issue is affecting all OSDs on PVCs upstream as described here:
https://github.com/rook/rook/issues/13136

Until a fix is available, you can use the previous known-good version of the Ceph image instead of the v17 tag (which picks up the latest, currently broken, image):
quay.io/ceph/ceph:v17.2.6
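
For illustration, one way to pin the exact image at the Rook level is a patch like the following; the CephCluster name and namespace are assumptions for a default ODF install, and ocs-operator may reconcile this field back to its own default (the actual upstream fix changes that default in ocs-operator itself, see pull 2245):

# Hypothetical patch: pin the exact Ceph image so the moving v17 tag is not pulled.
oc -n openshift-storage patch cephcluster ocs-storagecluster-cephcluster \
  --type merge \
  -p '{"spec":{"cephVersion":{"image":"quay.io/ceph/ceph:v17.2.6"}}}'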

Comment 8 errata-xmlrpc 2024-03-19 15:28:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

