Bug 2073920 - rook osd prepare failed with this error - failed to set kek as an environment variable: key encryption key is empty
Summary: rook osd prepare failed with this error - failed to set kek as an environment...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ODF 4.11.0
Assignee: Sébastien Han
QA Contact: Tal Yichye
URL:
Whiteboard:
Depends On:
Blocks: 2056571
 
Reported: 2022-04-11 06:40 UTC by Tal Yichye
Modified: 2023-08-09 17:03 UTC
CC List: 13 users

Fixed In Version: 4.11.0-66
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-24 13:50:40 UTC
Embargoed:


Attachments
list of pods, storageclass and storagecluster yaml (10.77 KB, text/plain), 2022-04-11 06:40 UTC, Tal Yichye
browser post request for storage cluster creation with ibm flashsystem and no kms selection (63.98 KB, image/png), 2022-04-11 11:15 UTC, Afreen
storagesystem review page for creating ibm-fs with kms disabled (72.20 KB, image/png), 2022-04-11 11:16 UTC, Afreen
rook osd prepare pod logs (7.72 KB, text/plain), 2022-05-02 14:14 UTC, Tal Yichye
ceph-volume log (3.38 KB, text/plain), 2022-05-03 07:39 UTC, Tal Yichye
ocs-devices output for lsblk and ls -al commands (1.97 KB, text/plain), 2022-05-03 07:40 UTC, Tal Yichye


Links
Github red-hat-storage rook pull 365 (open): Bug 2073920: osd: only set kek to env var on encryption scenario (last updated 2022-04-12 07:19:56 UTC)
Github rook rook pull 10035 (Merged): osd: only set kek to env var on encryption scenario (last updated 2022-04-12 12:04:22 UTC)
Red Hat Product Errata RHSA-2022:6156 (last updated 2022-08-24 13:52:10 UTC)

Description Tal Yichye 2022-04-11 06:40:03 UTC
Created attachment 1871709 [details]
list of pods, storageclass and storagecluster yaml

Description of problem (please be as detailed as possible and provide log
snippets):

We installed the Red Hat ODF operator from the console and tried to create the StorageSystem for IBM FlashSystem with no encryption option.
The OSD pods were not created and the rook-ceph-osd-prepare pods are stuck in CrashLoopBackOff with the following error:

2022-04-06 11:55:37.199155 E | op-osd: failed to provision OSD(s) on PVC ocs-deviceset-ibm-odf-test-0-data-0jmkhn. &{OSDs:[] Status:failed PvcBackedOSD:true Message:failed to set kek as an environment variable: key encryption key is empty}


Version of all relevant components (if applicable):

RH ODF operator: Image quay.io/rhceph-dev/ocs-registry:4.10.0-211
OCP version: 4.10.3

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

Yes, we can't install our new IBM ODF operator.

Is there any workaround available to the best of your knowledge?

no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Change the redhat-operator catalog source to use the quay.io/rhceph-dev/ocs-registry:4.10.0-211 image
2. Install the ODF operator for OCP 4.10 from the UI
3. Install the StorageSystem ibm-flashsystem from the UI

Actual results:
rook-ceph-osd-prepare pods are stuck in CrashLoopBackOff with the KMS error:

2022-04-06 11:55:37.199155 E | op-osd: failed to provision OSD(s) on PVC ocs-deviceset-ibm-odf-test-0-data-0jmkhn. &{OSDs:[] Status:failed PvcBackedOSD:true Message:failed to set kek as an environment variable: key encryption key is empty}

Expected results:

The OSD pods are created successfully with no KMS encryption.

Additional info:

Comment 2 Afreen 2022-04-11 07:40:55 UTC
Thanks Tal,
Looks like an issue on the operator side; I am creating a setup to check.
Please also provide the must-gather logs: oc adm must-gather --image="quay.io/rhceph-dev/ocs-must-gather:latest-4.10"

Comment 3 Tal Yichye 2022-04-11 09:36:08 UTC
Must-gather logs were sent directly to Afreen as requested.

Comment 4 Afreen 2022-04-11 11:15:29 UTC
Created attachment 1871768 [details]
browser post request for storage cluster creation with ibm flashsystem and no kms selection

The issue is not happening on non-IBM-FS ODF deployments when checked with the latest-stable-4.10 images.

To be sure, I checked the UI flow, which is also not setting the KMS flag when the StorageSystem is created via the IBM-FS option with no KMS (screenshot attached).
The StorageCluster YAML also does not have KMS enabled, but it is in an error state due to:

Error while reconciling: some StorageClasses [ocs-storagecluster-ceph-rbd]
        were skipped while waiting for pre-requisites to be met

It can be checked again with the latest "stable-4.10" tag builds to see if it persists, as I cannot replicate it for normal deployments.
Meanwhile, moving this to rook for more input and debugging.

Comment 5 Afreen 2022-04-11 11:16:26 UTC
Created attachment 1871769 [details]
storagesystem review page for creating ibm-fs with kms disabled

Comment 7 Sébastien Han 2022-04-11 14:53:18 UTC
Some logs are empty, but I think I understand what's going on: the prepare job gets injected with some environment variables prefixed with IBM_, which triggers the KMS code path even though no encryption was requested.
I'm sending a fix for this.

I would consider this a blocker since all deployments on IBM will fail.
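
For illustration, the gating described by the linked PR title ("osd: only set kek to env var on encryption scenario") could look roughly like the Go sketch below. This is not Rook's actual code; the function name, the env var name, and the structure are assumptions.

// Minimal sketch, assuming names; not Rook's actual implementation.
package main

import (
    "fmt"
    "os"
)

// setKEKIfEncrypted exports the key encryption key only when the OSD is
// configured for encryption. On a non-encrypted deployment it is a no-op,
// so an empty KEK can never surface as an error.
func setKEKIfEncrypted(encrypted bool, kek string) error {
    if !encrypted {
        // No encryption requested: nothing to export.
        return nil
    }
    if kek == "" {
        return fmt.Errorf("key encryption key is empty")
    }
    // Env var name is an assumption for this sketch.
    return os.Setenv("CEPH_VOLUME_DMCRYPT_SECRET", kek)
}

func main() {
    // Non-encrypted scenario, as in this bug: no error, no env var set.
    if err := setKEKIfEncrypted(false, ""); err != nil {
        fmt.Println("unexpected error:", err)
    }
}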

Comment 8 Mudit Agarwal 2022-04-11 16:13:11 UTC
Please add doc text

Comment 9 Tal Yichye 2022-04-12 06:55:54 UTC
Hi,
When the fix is done, please let us know which ODF image to use.
Thanks

Comment 14 Tal Yichye 2022-04-14 06:50:27 UTC
Hi,
Is there any ODF image we can use that contains this fix?
Thanks

Comment 15 Sébastien Han 2022-04-14 07:33:57 UTC
(In reply to Tal Yichye from comment #14)
> Hi,
> Is there any ODF image we can use that contains this fix?
> Thanks

No, since the downstream PR hasn't merged yet (the target being 4.10.z, we need to wait).

Comment 16 Vered Berenstein Paz 2022-04-26 08:03:46 UTC
Hi Red Hat team,
Is there an ETA for when 4.10.z will be available internally for testing?

Comment 17 Tal Yichye 2022-05-02 12:24:39 UTC
Hi,
We installed ODF version 4.10.1 (which should contain the fix for the KMS issue), again without any encryption, but encountered a new error with the rook-osd-prepare pods.
This message appears in the prepare pod logs:

2022-05-02 10:24:58.759564 E | cephosd: failed to determine if the encrypted block "/mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc" is from our cluster. failed to dump LUKS header for disk "/mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc". Device /mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc is not a valid LUKS device.

As I said, no encryption option was selected as you can see here (taken from the prepare pod):
ROOK_PVC_BACKED_OSD:            true
ROOK_ENCRYPTED_DEVICE:          false

Thanks
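
As background, these two variables are what a prepare process would consult before taking any encryption-specific path. The Go sketch below is purely illustrative (the helper name and the skipped LUKS step are assumptions, not Rook's actual code), showing how ROOK_ENCRYPTED_DEVICE=false would bypass LUKS handling entirely:

// Purely illustrative; not Rook's actual implementation.
package main

import (
    "fmt"
    "os"
    "strconv"
)

// envBool parses a boolean environment variable such as ROOK_ENCRYPTED_DEVICE,
// defaulting to false when the variable is unset or malformed.
func envBool(name string) bool {
    v, err := strconv.ParseBool(os.Getenv(name))
    if err != nil {
        return false
    }
    return v
}

func main() {
    pvcBacked := envBool("ROOK_PVC_BACKED_OSD")
    encrypted := envBool("ROOK_ENCRYPTED_DEVICE")
    fmt.Printf("pvc-backed=%v encrypted=%v\n", pvcBacked, encrypted)

    if !encrypted {
        // With encryption disabled there is no LUKS header to inspect,
        // so any encrypted-block check would be skipped here.
        fmt.Println("skipping LUKS header inspection")
        return
    }
    // An encrypted deployment would inspect the device's LUKS header here.
}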

Comment 18 Sébastien Han 2022-05-02 13:47:44 UTC
(In reply to Tal Yichye from comment #17)
> Hi,
> We installed ODF version 4.10.1 (which should contain the fix for the
> KMS issue), again without any encryption, but encountered a new error with the
> rook-osd-prepare pods.
> This message appears in the prepare pod logs:
> 
> 2022-05-02 10:24:58.759564 E | cephosd: failed to determine if the encrypted
> block "/mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc" is from our cluster.
> failed to dump LUKS header for disk
> "/mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc". Device
> /mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc is not a valid LUKS device.
> 
> As I said, no encryption option was selected as you can see here (taken from
> the prepare pod):
> ROOK_PVC_BACKED_OSD:            true
> ROOK_ENCRYPTED_DEVICE:          false
> 
> Thanks

Hi,

Can I get the full prepare job log? So no OSDs were deployed?
Thanks!

Comment 19 Tal Yichye 2022-05-02 14:14:06 UTC
Created attachment 1876518 [details]
rook osd prepare pod logs

Comment 20 Tal Yichye 2022-05-02 14:16:19 UTC
Yes, there are no OSDs, only osd-prepare pods.

Comment 21 Sébastien Han 2022-05-02 14:35:51 UTC
The log is incomplete; what comes after this line:

2022-05-02 10:24:59.803582 D | exec: Running command: stdbuf -oL ceph-volume --log-path /var/log/ceph/ocs-deviceset-ibm-odf-test-1-data-0xx8fc raw prepare --bluestore --data /dev/mapper/mpathau

Thanks

Comment 22 Tal Yichye 2022-05-02 14:46:31 UTC
That's what I get when I run 'oc logs <pod_name>', there are no additional lines.

Comment 23 Sébastien Han 2022-05-02 14:51:14 UTC
(In reply to Tal Yichye from comment #22)
> That's what I get when I run 'oc logs <pod_name>', there are no additional
> lines.

Is the process stuck?
Is the prepare job still in running state and not in "Completed"?

Thanks

Comment 24 Tal Yichye 2022-05-02 15:00:58 UTC
The prepare jobs are in the 'in progress' state.
I am not sure if it is stuck; it looks like the rook-operator is waiting for the OSD prepare jobs to finish:

2022-05-02 14:53:53.000488 I | clusterdisruption-controller: reconciling osd pdb reconciler as the allowed disruptions in default pdb is 0
2022-05-02 14:54:00.489112 I | clusterdisruption-controller: reconciling osd pdb reconciler as the allowed disruptions in default pdb is 0
2022-05-02 14:54:23.553004 I | clusterdisruption-controller: reconciling osd pdb reconciler as the allowed disruptions in default pdb is 0
2022-05-02 14:54:31.012784 I | clusterdisruption-controller: reconciling osd pdb reconciler as the allowed disruptions in default pdb is 0
2022-05-02 14:54:34.427057 I | op-osd: waiting... 0 of 3 OSD prepare jobs have finished processing and 0 of 0 OSDs have been updated

Comment 25 Sébastien Han 2022-05-02 15:29:08 UTC
Can you:

* log into the node where the prepare job is running
* verify whether ceph-volume is still running
* if so, try to strace it and see where it is blocked

Also, can you confirm "/mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc" is backed by an LV and tell us why?

Thanks

Comment 26 Tal Yichye 2022-05-03 07:39:23 UTC
Created attachment 1876691 [details]
ceph-volume log

Comment 27 Tal Yichye 2022-05-03 07:40:55 UTC
Created attachment 1876693 [details]
ocs-devices output for lsblk and ls -al commands

Also, this is the strace output:
strace: Process 2980552 attached
wait4(-1,
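
For readers of the strace output: a process pending in wait4(-1, ...) is blocked waiting for any child process to exit. The Go sketch below is purely illustrative, showing a parent that behaves this way while a long-running child (a placeholder standing in for ceph-volume) runs:

// Purely illustrative: a parent process blocked waiting for its child,
// which is what a pending wait4(-1, ...) in strace corresponds to.
package main

import (
    "fmt"
    "os/exec"
)

func main() {
    // Placeholder long-running child standing in for ceph-volume.
    cmd := exec.Command("sleep", "3600")
    if err := cmd.Start(); err != nil {
        fmt.Println("failed to start child:", err)
        return
    }
    // Wait() blocks until the child exits; tracing the parent at this point
    // shows it pending in a wait call such as wait4.
    if err := cmd.Wait(); err != nil {
        fmt.Println("child exited with error:", err)
    }
}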

Comment 28 Sébastien Han 2022-05-03 12:25:24 UTC
As discussed offline, moving on with a different BZ since the original issue is fixed.

Comment 29 Tal Yichye 2022-05-09 08:49:57 UTC
Hi,
After installing with the ODF 4.10.1 image, this issue no longer appears.
The OSD prepare jobs are now in the 'in progress' state and no OSD pods were deployed. I opened another BZ as Sébastien requested, to investigate the new issue: https://bugzilla.redhat.com/show_bug.cgi?id=2081431

Thanks

Comment 34 Tal Yichye 2022-05-19 07:45:06 UTC
Hi,
As mentioned earlier, the fix was tested with the 4.10.1 image. We have not yet tested it with 4.11.

Comment 37 Tal Yichye 2022-07-20 16:24:58 UTC
Hi,
We tested the fix with an internal build of ODF 4.11 and it works.

Thanks, Tal.

Comment 38 Elad 2022-07-21 07:24:23 UTC
Moving to VERIFIED based on comment #37

Comment 40 errata-xmlrpc 2022-08-24 13:50:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6156

