Bug 2053490

Summary: Red Hat OpenShift Data Foundation deployment issue
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: adrian.podlawski
Component: rook
Assignee: Blaine Gardner <brgardne>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Elad <ebenahar>
Severity: high
Priority: unspecified
Version: 4.9
CC: aos-bugs, brault, etamir, madam, mmuench, muagarwa, ocs-bugs, odf-bz-bot, sapillai, shan, srozen, tnielsen
Target Milestone: ---
Target Release: ---
Flags: brgardne: needinfo? (adrian.podlawski)
       tnielsen: needinfo? (adrian.podlawski)
Hardware: Unspecified
OS: Unspecified
Type: Bug
Regression: ---
Last Closed: 2022-03-21 16:43:59 UTC
Attachments:
ocs-logs
odf-logs
ocs-error
cluster setup
osd-logs

Description adrian.podlawski 2022-02-11 12:13:50 UTC
Created attachment 1860594 [details]
ocs-logs

Description of problem:
I created the StorageCluster with a YAML file updated for ODF version 4.9 (odf-cluster.yaml).
On the first attempt, 3 OSD prepare pods were marked as Completed, but only 2 OSD pods were running. The OSD prepare pod logs reported that the device was already provisioned. The same situation occurred a few times with the 4.8 setup.
I performed a cleanup and created the StorageCluster again with the same YAML file. On the second attempt all OSDs were provisioned, but the StorageCluster remained stuck in the "Progressing" state. I also found some issues in the ODF operator and in the OCS operator (logs in attachments). Could you help us find the root cause?


Version-Release number of selected component (if applicable): 4.9

How reproducible: 100%

Steps to Reproduce:
1. Create a StorageCluster with the YAML file (see the sketch below).
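
For reference, a minimal sketch of that step (the file name odf-cluster.yaml is the one mentioned above; the exact manifest contents are environment-specific):

    # Apply the StorageCluster manifest referenced above
    oc apply -f odf-cluster.yaml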

Actual results:
StorageCluster is stuck in the Progressing state and/or an OSD is missing.
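
A rough sketch of the checks that show this state, assuming the default openshift-storage namespace (pod names will differ per cluster):

    # StorageCluster phase (shows Progressing here)
    oc -n openshift-storage get storagecluster
    # Compare completed OSD prepare jobs with running OSD pods
    oc -n openshift-storage get pods | grep rook-ceph-osd
    # Inspect an OSD prepare pod log for the "already provisioned" message
    oc -n openshift-storage logs <rook-ceph-osd-prepare-pod-name>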

Comment 1 adrian.podlawski 2022-02-11 12:14:28 UTC
Created attachment 1860595 [details]
odf-logs

Comment 2 adrian.podlawski 2022-02-11 12:15:52 UTC
Created attachment 1860596 [details]
ocs-error

Comment 3 adrian.podlawski 2022-02-11 12:17:20 UTC
Created attachment 1860597 [details]
cluster setup

Comment 4 adrian.podlawski 2022-02-11 12:25:32 UTC
Created attachment 1860600 [details]
osd-logs

Comment 7 Blaine Gardner 2022-02-14 16:32:01 UTC
I believe this is a case of Rook operating as intended. From the OSD prepare pod logs shared (relevant line copied below), Rook is reporting that it found an OSD belonging to a different Ceph cluster. Rook will not clobber existing data on a disk in order to deploy an OSD; this preserves user data. If you want Rook to deploy an OSD on that disk, you must wipe it first.

2022-02-08 19:41:02.828518 I | cephosd: skipping device "/wal/ocs-deviceset-2-wal-0rswt2": failed to detect if there is already an osd. osd.7: "17498860-4536-42fc-981e-c6e8df6d7d89" belonging to a different ceph cluster "77dca5b1-3d9b-436a-94a8-6c35fef679a8".

Generally `sgdisk --zap` is sufficient to wipe the disk. I have also recommended using `dd` to zero out the first 2 MB of the disk to ensure LVM metadata is removed.
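
A minimal sketch of those commands, assuming the affected device is /dev/sdX (replace with the actual device; both commands destroy data on it):

    # Destroy the GPT partition data structures on the device
    sgdisk --zap /dev/sdX
    # Zero the first 2 MB of the disk to remove any leftover LVM metadata
    dd if=/dev/zero of=/dev/sdX bs=1M count=2 oflag=direct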

Comment 8 Mudit Agarwal 2022-02-15 13:42:20 UTC
Lowering the severity. Please justify the urgent severity if it is required.

Comment 9 Travis Nielsen 2022-02-28 16:23:49 UTC
Moving to 4.11 while waiting for confirmation of whether this is an issue.

Comment 10 Travis Nielsen 2022-03-07 16:24:10 UTC
Did the previous comment help resolve the issue? If we don't hear back within the next week, we will close the issue. Thanks.

Comment 11 Sébastien Han 2022-03-21 16:43:59 UTC
Closing due to lack of information. Please reopen if you encounter this issue again.
Thanks!