Created attachment 1871709 [details]
list of pods, storageclass and storagecluster yaml

Description of problem (please be detailed as possible and provide log snippets):

We installed the Red Hat ODF operator from the console and tried to create the StorageSystem for IBM FlashSystem with no encryption option. The OSD pods were not created and the rook-ceph-osd-prepare pods are stuck in CrashLoopBackOff with the following error:

2022-04-06 11:55:37.199155 E | op-osd: failed to provision OSD(s) on PVC ocs-deviceset-ibm-odf-test-0-data-0jmkhn. &{OSDs:[] Status:failed PvcBackedOSD:true Message:failed to set kek as an environment variable: key encryption key is empty}

Version of all relevant components (if applicable):
RH ODF operator image: quay.io/rhceph-dev/ocs-registry:4.10.0-211
OCP version: 4.10.3

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, we can't install our new IBM ODF operator.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Change the redhat-operator catalog source to use the quay.io/rhceph-dev/ocs-registry:4.10.0-211 image
2. Install the ODF operator for OCP 4.10 from the UI
3. Install the StorageSystem for IBM FlashSystem from the UI

Actual results:
The rook-ceph-osd-prepare pods are stuck in CrashLoopBackOff with the KMS error:

2022-04-06 11:55:37.199155 E | op-osd: failed to provision OSD(s) on PVC ocs-deviceset-ibm-odf-test-0-data-0jmkhn. &{OSDs:[] Status:failed PvcBackedOSD:true Message:failed to set kek as an environment variable: key encryption key is empty}

Expected results:
The OSD pods are created successfully with no KMS encryption.

Additional info:
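For reference, step 1 of the reproduce steps can be done by recreating the catalog source to point at the internal build. This is only a sketch; the catalog source name "redhat-operators", the openshift-marketplace namespace, and the displayName are assumptions and should be adjusted to match the catalog source the cluster actually uses:

# Sketch only: point a catalog source at the internal ODF build image.
# The name/namespace below are assumptions, not confirmed from this BZ.
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operators
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/rhceph-dev/ocs-registry:4.10.0-211
  displayName: ODF internal build
EOF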
Thanks Tal. This looks like an issue on the operator side; I am creating a setup to check.

Please also provide the must-gather logs:

oc adm must-gather --image="quay.io/rhceph-dev/ocs-must-gather:latest-4.10"
must gather logs were sent directly to Afreen as requested.
Created attachment 1871768 [details]
browser POST request for storage cluster creation with IBM FlashSystem and no KMS selection

The issue does not happen on non-IBM-FS ODF deployments when checked with the latest-stable-4.10 images. To be sure, I checked the UI flow, which also does not set this flag when the storage cluster is created via the IBM-FS option with no KMS (screenshot attached). The storage cluster YAML also does not have KMS enabled, but it is in error due to:

Error while reconciling: some StorageClasses [ocs-storagecluster-ceph-rbd] were skipped while waiting for pre-requisites to be met

This can be checked again with the latest "stable-4.10" builds to see whether it is persistent, as I cannot replicate it on normal deployments. Meanwhile, moving this to rook for more input and debugging.
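To confirm that neither cluster-wide encryption nor KMS is enabled, the relevant specs can be dumped directly. A sketch, assuming the default openshift-storage namespace and a StorageCluster named ocs-storagecluster (both are assumptions; adjust to the actual deployment):

# Assumptions: namespace openshift-storage, StorageCluster name ocs-storagecluster.
oc -n openshift-storage get storagecluster ocs-storagecluster \
  -o jsonpath='{.spec.encryption}{"\n"}'

# The generated CephCluster CR's security/KMS section can be checked as well.
oc -n openshift-storage get cephcluster -o yaml | grep -i -A5 'security:'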
Created attachment 1871769 [details]
storagesystem review page for creating ibm-fs with kms disabled
Some logs are empty, but I think I understand what's going on. The prepare job gets injected with some environment variables prefixed with IBM_, and this triggers some code logic. I'm sending a fix for this. I would consider this a blocker, since all deployments on IBM will fail.
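To see which environment variables the prepare jobs actually receive, the job specs can be dumped and filtered for IBM_-prefixed entries. A sketch, assuming the openshift-storage namespace and the standard app=rook-ceph-osd-prepare label:

# Assumptions: namespace openshift-storage, label app=rook-ceph-osd-prepare.
for job in $(oc -n openshift-storage get jobs -l app=rook-ceph-osd-prepare -o name); do
  echo "== $job =="
  oc -n openshift-storage get "$job" \
    -o jsonpath='{range .spec.template.spec.containers[0].env[*]}{.name}{"\n"}{end}' \
    | grep '^IBM_' || echo "(no IBM_-prefixed variables)"
done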
Please add doc text
Hi, When the fix is done, please let us know which ODF image to use. Thanks
Hi, Is there any ODF image we can use that contain this fix? Thanks
(In reply to Tal Yichye from comment #14)
> Hi,
> Is there any ODF image we can use that contain this fix?
> Thanks

No, since the downstream PR hasn't merged yet (the target being 4.10.z, we need to wait).
Hi RedHat team, Is there an ETA when 4.10.z will be available internally for testing?
Hi,
We installed ODF version 4.10.1 (which should contain the fix for the KMS issue), again without any encryption, but encountered a new error with the rook-ceph-osd-prepare pods. This message appears in the prepare pod logs:

2022-05-02 10:24:58.759564 E | cephosd: failed to determine if the encrypted block "/mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc" is from our cluster. failed to dump LUKS header for disk "/mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc". Device /mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc is not a valid LUKS device.

As I said, no encryption option was selected, as you can see here (taken from the prepare pod):

ROOK_PVC_BACKED_OSD: true
ROOK_ENCRYPTED_DEVICE: false

Thanks
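For reference, whether the backing device actually carries a LUKS header can be checked directly from the prepare pod or the node. A sketch, assuming the device path reported in the error message above (the path may differ per deviceset):

# Assumption: the device path matches the one in the prepare pod log.
lsblk -o NAME,TYPE,FSTYPE,SIZE,MOUNTPOINT /mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc
cryptsetup luksDump /mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc \
  || echo "no LUKS header present (expected when encryption is disabled)"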
(In reply to Tal Yichye from comment #17)
> Hi,
> We installed the ODF version 4.10.1 (which should contain the fix for the
> kms issue), again without any encryption, but encounter a new error with the
> rook-osd-prepare pods.
> This message appear in the prepare pod logs:
>
> 2022-05-02 10:24:58.759564 E | cephosd: failed to determine if the encrypted
> block "/mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc" is from our cluster.
> failed to dump LUKS header for disk
> "/mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc". Device
> /mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc is not a valid LUKS device.
>
> As I said, no encryption option was selected as you can see here (taken from
> the prepare pod):
> ROOK_PVC_BACKED_OSD: true
> ROOK_ENCRYPTED_DEVICE: false
>
> Thanks

Hi,
Can I get the full prepare job log? So no OSDs were deployed?
Thanks!
Created attachment 1876518 [details]
rook osd prepare pod logs
Yes, there are no OSDs, only osd-prepare pods.
The log is incomplete. What comes after this line?

2022-05-02 10:24:59.803582 D | exec: Running command: stdbuf -oL ceph-volume --log-path /var/log/ceph/ocs-deviceset-ibm-odf-test-1-data-0xx8fc raw prepare --bluestore --data /dev/mapper/mpathau

Thanks
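If the prepare pod is still running, the ceph-volume log can also be pulled straight out of the pod, using the --log-path shown in the command above. A sketch only; the pod name placeholder and the ceph-volume.log file name under that path are assumptions:

# Assumptions: namespace openshift-storage, <prepare-pod-name> from 'oc get pods',
# and a ceph-volume.log file under the --log-path shown above.
oc -n openshift-storage exec <prepare-pod-name> -- \
  cat /var/log/ceph/ocs-deviceset-ibm-odf-test-1-data-0xx8fc/ceph-volume.log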
That's what I get when I run 'oc logs <pod_name>'; there are no additional lines.
(In reply to Tal Yichye from comment #22)
> That's what I get when I run 'oc logs <pod_name>', there are no additional
> lines.

Is the process stuck? Is the prepare job still in the "Running" state and not "Completed"?
Thanks
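The job and pod states can be listed like this; a sketch, assuming the openshift-storage namespace and the standard prepare-job label:

# Assumptions: namespace openshift-storage, label app=rook-ceph-osd-prepare.
oc -n openshift-storage get jobs -l app=rook-ceph-osd-prepare
oc -n openshift-storage get pods -l app=rook-ceph-osd-prepare -o wide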
The prepare jobs are in progress. I am not sure whether they are stuck; it looks like the rook operator is waiting for the OSD prepare jobs to finish:

2022-05-02 14:53:53.000488 I | clusterdisruption-controller: reconciling osd pdb reconciler as the allowed disruptions in default pdb is 0
2022-05-02 14:54:00.489112 I | clusterdisruption-controller: reconciling osd pdb reconciler as the allowed disruptions in default pdb is 0
2022-05-02 14:54:23.553004 I | clusterdisruption-controller: reconciling osd pdb reconciler as the allowed disruptions in default pdb is 0
2022-05-02 14:54:31.012784 I | clusterdisruption-controller: reconciling osd pdb reconciler as the allowed disruptions in default pdb is 0
2022-05-02 14:54:34.427057 I | op-osd: waiting... 0 of 3 OSD prepare jobs have finished processing and 0 of 0 OSDs have been updated
Can you:
* log into the node the prepare job is running on
* verify whether ceph-volume is still running
* if so, try to strace it and see where it is pending

Also, can you confirm whether "/mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc" is backed by an LV, and tell us why?
Thanks
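A sketch of the checks above, assuming an OpenShift node debug session; <node-name> and <ceph-volume-pid> are placeholders to fill in from the actual cluster:

# Assumption: <node-name> is the node running the stuck prepare pod.
oc debug node/<node-name>
chroot /host

# Look for a running ceph-volume process and attach strace to it.
ps -ef | grep [c]eph-volume
strace -f -p <ceph-volume-pid>

# Check whether the backing device is a logical volume.
lsblk -o NAME,TYPE,SIZE,MOUNTPOINT
dmsetup ls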
Created attachment 1876691 [details]
ceph-volume log
Created attachment 1876693 [details]
ocs-devices output for the lsblk and ls -al commands

Also, this is the output of strace:

strace: Process 2980552 attached
wait4(-1,
As discussed offline, moving on with a different BZ since the original issue is fixed.
Hi,
After installing with the ODF 4.10.1 image, this issue no longer appears. The OSD prepare jobs are now in the 'in progress' state and no OSD pods were deployed - I opened another BZ as Sébastien requested to investigate the new issue: https://bugzilla.redhat.com/show_bug.cgi?id=2081431
Thanks
Hi,
As mentioned earlier, the fix was tested with the 4.10.1 image. We have not yet tested it with 4.11.
Hi,
We tested the fix with an internal build of ODF 4.11 and it works.
Thanks, Tal.
Moving to VERIFIED based on comment #37
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:6156