Bug 2247313

Summary: rook-ceph-osd-prepare pods in CLBO during Installation
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: ocs-operator
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: unspecified
Status: CLOSED ERRATA
Reporter: Malay Kumar parida <mparida>
Assignee: Malay Kumar parida <mparida>
QA Contact: Aviad Polak <apolak>
CC: odf-bz-bot, tnielsen
Target Release: ODF 4.15.0
Fixed In Version: 4.15.0-103
Doc Type: No Doc Update
Type: Bug
Last Closed: 2024-03-19 15:28:13 UTC

Attachments:
Logs of one of the failing osd prepare pods
ocs-mustgather

Description Malay Kumar parida 2023-10-31 20:05:08 UTC
Created attachment 1996428 [details]
Logs of one of the failing osd prepare pods

Description of problem (please be as detailed as possible and provide log snippets):

Recently I noticed that the e2e tests in the upstream ocs-operator CI started failing. Upon investigation, I found that during installation the osd-prepare pods go into a CLBO (CrashLoopBackOff) state, which is why the CI was failing. I then recreated the same setup on my own cluster and was able to reproduce the issue. I am attaching the logs of one of the failing osd-prepare pods and the OCS must-gather I collected. Since nothing else changed, I suspect something has changed on the Ceph side, as we use the Ceph image quay.io/ceph/ceph:v17 in upstream ocs-operator.

It's blocking all PRs in the ocs-operator repo as the e2e tests are not working.


Is there any workaround available to the best of your knowledge?
No


Steps to Reproduce:
1. Install the upstream ocs-operator.
2. Create a StorageCluster (a minimal example manifest is sketched below).
3. Observe that the osd-prepare pods go into CLBO.
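
For step 2, a minimal StorageCluster manifest along the following lines can be used. This is only a sketch; the device set name, storage class, and requested size are assumptions and will vary per environment:

apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  storageDeviceSets:
  - name: ocs-deviceset          # assumed name
    count: 3
    replica: 1
    portable: true
    dataPVCTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        volumeMode: Block
        storageClassName: gp2    # assumed; use a storage class available in your cluster
        resources:
          requests:
            storage: 100Gi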

Comment 2 Malay Kumar parida 2023-10-31 20:07:11 UTC
Created attachment 1996429 [details]
ocs-mustgather

Adding the must-gather

Comment 3 Travis Nielsen 2023-11-01 20:14:58 UTC
The tests are using the latest Ceph Quincy image, quay.io/ceph/ceph:v17. As of four days ago that tag resolves to v17.2.7, which introduced this issue.

From the Rook operator log:

2023-10-31T14:38:45.593344167Z 2023-10-31 14:38:45.593319 I | ceph-spec: detecting the ceph image version for image quay.io/ceph/ceph:v17...
2023-10-31T14:38:47.307103937Z 2023-10-31 14:38:47.307067 I | ceph-spec: detected ceph image version: "17.2.7-0 quincy"

This issue is affecting all OSDs on PVCs upstream as described here:
https://github.com/rook/rook/issues/13136

Until a fix is available, you can instead use the previous known-good version of the Ceph image (rather than the v17 tag, which picks up the latest, currently broken image):
quay.io/ceph/ceph:v17.2.6
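
For reference, in the resulting Rook CephCluster CR this pin corresponds to the cephVersion.image field, roughly as below. This is a sketch only; the metadata names are assumptions based on the usual ocs-operator naming, and how ocs-operator plumbs the image through is not shown:

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: ocs-storagecluster-cephcluster   # assumed name generated by ocs-operator
  namespace: openshift-storage
spec:
  cephVersion:
    # Pin to the last known-good Quincy build instead of the floating v17 tag
    image: quay.io/ceph/ceph:v17.2.6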

Comment 8 errata-xmlrpc 2024-03-19 15:28:13 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383