2241268 – Enabling ceph mirroring enables bluestore-rdr and OSDs fail to recreate

Summary: Enabling ceph mirroring enables bluestore-rdr and OSDs fail to recreate

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	rook
Sub Component:
Version:	4.14
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	ODF 4.15.0
Assignee:	Santosh Pillai
QA Contact:	Pratik Surve
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2023-09-28 22:13 UTC by Annette Clewett
Modified:	2024-07-18 04:25 UTC
CC List:	6 users
Fixed In Version:	4.15.0-110
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2024-03-19 15:25:09 UTC
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	red-hat-storage ocs-operator pull 2247	None	open	allow migration of osds to bluestore-rdr	2023-11-14 08:48:17 UTC
Github	red-hat-storage rook pull 552	None	open	bug 2241268: osd: create config before migration osd	2024-01-08 16:40:30 UTC
Github	rook rook pull 13524	None	open	osd: create ceph conf and keyring files before osd migration	2024-01-08 10:28:29 UTC
Red Hat Product Errata	RHSA-2024:1383	None	None	None	2024-03-19 15:25:12 UTC

Description Annette Clewett 2023-09-28 22:13:04 UTC

Created attachment 1990993
rook-ceph-prepare pod for missing OSD

Description of problem (please be as detailed as possible and provide log snippets):
Created an RDR test environment with OCP 4.14, ACM 2.9, and ODF 4.14. When ODF was initially installed, bluestore-rdr was NOT enabled.

After the first DRPolicy for RDR was created, Ceph mirroring was enabled, which caused bluestore-rdr to be enabled for the ODF StorageCluster. The first OSD was then deleted so that it could be recreated with bluestore-rdr. The prepare pod for that first (deleted) OSD never reaches the Completed state; it is stuck in Running.

Rook operator log before the first DRPolicy was created and mirroring enabled: http://pastebin.test.redhat.com/1110031

Rook operator log AFTER the first DRPolicy was created and mirroring enabled: http://pastebin.test.redhat.com/1110030

Version of all relevant components (if applicable):
ODF - 4.14.0-139.stable
OCP - 4.14.0-rc.2
ACM - 2.9.0-165

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes, it happened in 2 independent OCP clusters.

Steps to Reproduce:
1. Create an RDR test environment with ACM 2.9 (install MCO).
2. Create the first DRPolicy.


Actual results:
One of the three OSD pods is deleted and its associated prepare pod is stuck in Running (log attached). The CephCluster has bluestore-rdr enabled.

Expected results:
All three OSD pods are Running and bluestore-rdr is not enabled in the CephCluster.

Additional info:

Comment 2 Travis Nielsen 2023-09-28 22:22:02 UTC

Santosh, there is an OCS operator BZ for disabling brownfield RDR in 4.14, right? Given that BZ, we won't need to fix this issue with replacing OSDs in brownfield clusters yet and can move this BZ to 4.15.

Comment 3 Santosh Pillai 2023-09-29 04:15:16 UTC

(In reply to Travis Nielsen from comment #2)
> Santosh, there is an OCS operator BZ for disabling brownfield RDR in 4.14,
> right? Given that BZ, we won't need to fix this issue with replacing OSDs
> in brownfield clusters yet and can move this BZ to 4.15.


Yes, we have a BZ (https://bugzilla.redhat.com/show_bug.cgi?id=2234735) to revert these changes. Since QE does not have the resources to test it and the current BZ might be blocking testing, I will open a PR to revert the changes today.

Annette, can you please share the logs of the OSD prepare pod that is stuck?

Comment 4 Santosh Pillai 2023-09-29 09:32:20 UTC

Proposing to move this BZ to 4.15, since we are not supporting bluestore-rdr in this release.

Comment 5 Annette Clewett 2023-09-29 15:23:34 UTC

@sapillai I attached the log for the stuck OSD prepare pod when I created the BZ.

Comment 6 Santosh Pillai 2023-11-14 08:47:52 UTC

This should be retested with 4.15.

We now use `ceph-volume lvm zap` to clean up the resources. I've tested with some data on the OSD, and the OSD prepare pod no longer stays stuck for a long time while cleaning up that data.

Also, the flow has changed. In 4.14, OSD migration started as soon as the user enabled mirroring on a cluster with OSDs on bluestore.

In 4.15, migration no longer happens while mirroring is being enabled. The user must first migrate the OSDs (by adding the annotation), and only after the migration can they enable mirroring.

There is an OCS operator PR (https://github.com/red-hat-storage/ocs-operator/pull/2247) to support this new flow.
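The 4.15 ordering described above (migrate every OSD to bluestore-rdr first, enable mirroring only afterwards) can be modeled as a simple precondition. This is a hypothetical sketch, not Rook or ocs-operator code: the `OSD` class and the `osds_pending_migration` and `can_enable_mirroring` helpers are illustrative names, and the sketch does not set the migration annotation, whose name is not given in this bug.

```python
from dataclasses import dataclass

# Store-type values named in this bug; the data model below is hypothetical.
BLUESTORE = "bluestore"
BLUESTORE_RDR = "bluestore-rdr"


@dataclass
class OSD:
    """Hypothetical minimal view of an OSD: its ID and current store type."""
    osd_id: int
    store_type: str


def osds_pending_migration(osds: list[OSD]) -> list[int]:
    """Return the IDs of OSDs not yet on bluestore-rdr (still to be migrated)."""
    return [osd.osd_id for osd in osds if osd.store_type != BLUESTORE_RDR]


def can_enable_mirroring(osds: list[OSD]) -> bool:
    """4.15-style ordering: allow mirroring only once no OSD awaits migration."""
    return not osds_pending_migration(osds)


if __name__ == "__main__":
    cluster = [OSD(0, BLUESTORE_RDR), OSD(1, BLUESTORE), OSD(2, BLUESTORE_RDR)]
    print(osds_pending_migration(cluster))  # [1]
    print(can_enable_mirroring(cluster))    # False
```

In 4.14, by contrast, enabling mirroring was itself the trigger for migration, which is the path that led to the stuck prepare pod reported here.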

Comment 14 errata-xmlrpc 2024-03-19 15:25:09 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

Comment 15 Red Hat Bugzilla 2024-07-18 04:25:08 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
