Bug 2247313 - rook-ceph-osd-prepare pods in CLBO during Installation
Summary: rook-ceph-osd-prepare pods in CLBO during Installation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: Malay Kumar parida
QA Contact: Aviad Polak
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-10-31 20:05 UTC by Malay Kumar parida
Modified: 2024-03-19 15:28 UTC

Fixed In Version: 4.15.0-103
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-03-19 15:28:13 UTC
Embargoed:


Attachments:
Logs of one of the failing osd prepare pods (29.65 KB, text/plain), 2023-10-31 20:05 UTC, Malay Kumar parida
ocs-mustgather (8.88 MB, application/gzip), 2023-10-31 20:07 UTC, Malay Kumar parida


Links:
Github red-hat-storage ocs-operator pull 2245 (Merged): Use exact ceph image version v17.2.6 instead of v17 (last updated 2023-11-02 10:29:54 UTC)
Red Hat Product Errata RHSA-2024:1383 (last updated 2024-03-19 15:28:15 UTC)

Description Malay Kumar parida 2023-10-31 20:05:08 UTC
Created attachment 1996428 [details]
Logs of one of the failing osd prepare pods

Description of problem (please be as detailed as possible and provide log snippets):

Recently I noticed that the e2e tests in the upstream ocs-operator CI started failing. Upon investigation, I found that during installation the osd-prepare pods go into CLBO state, which is why the CI was failing. I then went on to recreate this on my own cluster and was able to reproduce it. I am attaching the logs of a failing osd-prepare pod and also the OCS must-gather I collected. Considering nothing else changed, I suspect something has changed on the Ceph side, as we use the Ceph image quay.io/ceph/ceph:v17 in our upstream ocs-operator.

It is blocking all PRs in the ocs-operator repo, as the e2e tests are not working.


Is there any workaround available to the best of your knowledge?
No


Steps to Reproduce:
1. Install upstream ocs operator
2. Create a storagecluster
3. Observe that the osd-prepare pods go into CLBO (see the example commands below)
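
For example, a couple of commands to confirm the failure (the openshift-storage namespace is an assumption here; adjust it for your deployment):

# List the OSD prepare pods; in the failing state they show CrashLoopBackOff.
oc -n openshift-storage get pods | grep rook-ceph-osd-prepare

# Hypothetical pod name; inspect the logs of one of the failing prepare pods.
oc -n openshift-storage logs rook-ceph-osd-prepare-<pvc-name>-<suffix>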

Comment 2 Malay Kumar parida 2023-10-31 20:07:11 UTC
Created attachment 1996429 [details]
ocs-mustgather

Adding the must-gather

Comment 3 Travis Nielsen 2023-11-01 20:14:58 UTC
The tests are using the latest Ceph Quincy image quay.io/ceph/ceph:v17. As of four days ago this tag resolves to v17.2.7, which introduced this issue.

From the Rook operator log:

2023-10-31T14:38:45.593344167Z 2023-10-31 14:38:45.593319 I | ceph-spec: detecting the ceph image version for image quay.io/ceph/ceph:v17...
2023-10-31T14:38:47.307103937Z 2023-10-31 14:38:47.307067 I | ceph-spec: detected ceph image version: "17.2.7-0 quincy"
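
To confirm what the moving v17 tag resolved to on a given cluster, the same "detected ceph image version" line can be grepped from the Rook operator log, for example (the deployment name and namespace are assumptions for an ODF/OCS install):

# Check which Ceph image version the Rook operator detected.
oc -n openshift-storage logs deploy/rook-ceph-operator | grep "detected ceph image version"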

This issue is affecting all OSDs on PVCs upstream as described here:
https://github.com/rook/rook/issues/13136

Until a fix is available, you can use the previous known-good version of the Ceph image instead of the v17 tag (which picks up the latest, currently broken, image):
quay.io/ceph/ceph:v17.2.6
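
For illustration, one way to pin the exact image at the Rook level is a patch like the following; the CephCluster name and namespace are assumptions for a default ODF install, and ocs-operator may reconcile this field back to its own default (the actual upstream fix changes that default in ocs-operator itself, see pull 2245):

# Hypothetical patch: pin the exact Ceph image so the moving v17 tag is not pulled.
oc -n openshift-storage patch cephcluster ocs-storagecluster-cephcluster \
  --type merge \
  -p '{"spec":{"cephVersion":{"image":"quay.io/ceph/ceph:v17.2.6"}}}'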

Comment 8 errata-xmlrpc 2024-03-19 15:28:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

