Created attachment 1860594 [details] ocs-logs

Description of problem:
I created the StorageCluster with an updated YAML file for the 4.9 ODF version (odf-cluster.yaml). On the first attempt, all 3 OSD prepare pods were marked as Completed, but only 2 OSD pods were running. The OSD prepare pod logs reported that the device was already provisioned. The same situation occurred a few times with the 4.8 setup.

I performed a cleanup and created the StorageCluster again with the same YAML file. On the second attempt all OSDs came up, but the StorageCluster was stuck in the "Progressing" state. I also found some errors in the ODF operator and in the OCS operator (logs in attachments). Could you help us find the root cause?

Version-Release number of selected component (if applicable):
4.9

How reproducible:
100%

Steps to Reproduce:
1. Create a StorageCluster with the YAML file.

Actual results:
StorageCluster is in the Progressing state / missing OSD
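For reference, a minimal sketch of how the cluster state was checked, assuming the default openshift-storage namespace and the standard Rook labels app=rook-ceph-osd and app=rook-ceph-osd-prepare (output omitted):

```sh
# StorageCluster phase (stuck in "Progressing" on the second attempt)
oc get storagecluster -n openshift-storage

# Running OSD pods vs. completed OSD prepare pods
# (only 2 of 3 OSDs were running on the first attempt)
oc get pods -n openshift-storage -l app=rook-ceph-osd
oc get pods -n openshift-storage -l app=rook-ceph-osd-prepare
```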
Created attachment 1860595 [details] odf-logs
Created attachment 1860596 [details] ocs-error
Created attachment 1860597 [details] cluster setup
Created attachment 1860600 [details] osd-logs
I believe this is a case of Rook operating as intended. From the OSD prepare pod logs shared (relevant line copied below), Rook is reporting that it found an OSD belonging to a different Ceph cluster. To preserve user data, Rook will not clobber existing data on a disk in order to deploy an OSD. If you wish Rook to deploy successfully on that disk, you must wipe it first.

2022-02-08 19:41:02.828518 I | cephosd: skipping device "/wal/ocs-deviceset-2-wal-0rswt2": failed to detect if there is already an osd. osd.7: "17498860-4536-42fc-981e-c6e8df6d7d89" belonging to a different ceph cluster "77dca5b1-3d9b-436a-94a8-6c35fef679a8".

Generally `sgdisk --zap` is sufficient to wipe the disk. I have also recommended using `dd` to zero out the first 2MB of the disk to ensure LVM metadata is removed.
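As a rough sketch of that wipe procedure (the device path is a placeholder; substitute the disk backing the orphaned OSD, and run this only on disks you intend to reclaim; the --zap-all variant of sgdisk is used here to clear both GPT and MBR structures):

```sh
# Hypothetical device path; replace with the disk that held the old OSD.
DISK=/dev/sdX

# Clear partition table structures left by the previous Ceph cluster.
sgdisk --zap-all "$DISK"

# Zero the first 2MB of the disk so no LVM metadata remains.
dd if=/dev/zero of="$DISK" bs=1M count=2 conv=fsync
```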
Lowering the severity; please justify urgent severity if it is required.
Moving to 4.11 while waiting for confirmation that this is an issue.
Did the previous comment help resolve the issue? If we don't hear back in the next week we will close the issue, thanks.
Closing due to lack of information. Please open again if you encounter this issue. Thanks!