Bug 2247313

Summary: rook-ceph-osd-prepare pods in CLBO during Installation
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: ocs-operator
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: unspecified
Status: CLOSED ERRATA
Reporter: Malay Kumar parida <mparida>
Assignee: Malay Kumar parida <mparida>
QA Contact: Aviad Polak <apolak>
CC: odf-bz-bot, tnielsen
Target Release: ODF 4.15.0
Fixed In Version: 4.15.0-103
Doc Type: No Doc Update
Type: Bug
Last Closed: 2024-03-19 15:28:13 UTC

Attachments:
Logs of one of the failing osd prepare pods
ocs-mustgather

Description Malay Kumar parida 2023-10-31 20:05:08 UTC
Created attachment 1996428 [details]
Logs of one of the failing osd prepare pods

Description of problem (please be as detailed as possible and provide log snippets):

Recently I noticed that the e2e tests in the upstream ocs-operator CI started failing. Upon investigation, I found that during installation the osd-prepare pods go into a CLBO (CrashLoopBackOff) state, which is why the CI was failing. I then recreated the same setup on my own cluster and was able to reproduce the issue. I am attaching the logs of one of the failing osd-prepare pods and the OCS must-gather I collected. Since nothing else changed, I suspect something has changed on the Ceph side, as we use the Ceph image quay.io/ceph/ceph:v17 in upstream ocs-operator.

It's blocking all PRs in the ocs-operator repo as the e2e tests are not working.


Is there any workaround available to the best of your knowledge?
No


Steps to Reproduce:
1. Install the upstream ocs-operator.
2. Create a StorageCluster (a minimal example manifest is sketched below).
3. Observe that the osd-prepare pods go into CLBO.
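
For step 2, a minimal StorageCluster manifest along the following lines can be used. This is only a sketch; the device set name, storage class, and requested size are assumptions and will vary per environment:

apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  storageDeviceSets:
  - name: ocs-deviceset          # assumed name
    count: 3
    replica: 1
    portable: true
    dataPVCTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        volumeMode: Block
        storageClassName: gp2    # assumed; use a storage class available in your cluster
        resources:
          requests:
            storage: 100Gi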

Comment 2 Malay Kumar parida 2023-10-31 20:07:11 UTC
Created attachment 1996429 [details]
ocs-mustgather

Adding the must-gather

Comment 3 Travis Nielsen 2023-11-01 20:14:58 UTC
The tests are using the latest Ceph Quincy image, quay.io/ceph/ceph:v17. As of four days ago that tag resolves to v17.2.7, which introduced this issue.

From the Rook operator log:

2023-10-31T14:38:45.593344167Z 2023-10-31 14:38:45.593319 I | ceph-spec: detecting the ceph image version for image quay.io/ceph/ceph:v17...
2023-10-31T14:38:47.307103937Z 2023-10-31 14:38:47.307067 I | ceph-spec: detected ceph image version: "17.2.7-0 quincy"

This issue is affecting all OSDs on PVCs upstream as described here:
https://github.com/rook/rook/issues/13136

Until a fix is available, you can instead use the previous known-good version of the Ceph image (rather than the v17 tag, which picks up the latest, currently broken image):
quay.io/ceph/ceph:v17.2.6
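
For reference, in the resulting Rook CephCluster CR this pin corresponds to the cephVersion.image field, roughly as below. This is a sketch only; the metadata names are assumptions based on the usual ocs-operator naming, and how ocs-operator plumbs the image through is not shown:

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: ocs-storagecluster-cephcluster   # assumed name generated by ocs-operator
  namespace: openshift-storage
spec:
  cephVersion:
    # Pin to the last known-good Quincy build instead of the floating v17 tag
    image: quay.io/ceph/ceph:v17.2.6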

Comment 8 errata-xmlrpc 2024-03-19 15:28:13 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383