Bug 2073920 - rook osd prepare failed with this error - failed to set kek as an environment variable: key encryption key is empty
Summary: rook osd prepare failed with this error - failed to set kek as an environment...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ODF 4.11.0
Assignee: Sébastien Han
QA Contact: Tal Yichye
URL:
Whiteboard:
Depends On:
Blocks: 2056571
 
Reported: 2022-04-11 06:40 UTC by Tal Yichye
Modified: 2023-08-09 17:03 UTC
CC List: 13 users

Fixed In Version: 4.11.0-66
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-24 13:50:40 UTC
Embargoed:


Attachments
list of pods, storageclass and storagecluster yaml (10.77 KB, text/plain), 2022-04-11 06:40 UTC, Tal Yichye
browser post request for storage cluster creation with ibm flashsystem and no kms selection (63.98 KB, image/png), 2022-04-11 11:15 UTC, Afreen
storagesystem review page for creating ibm-fs with kms disabled (72.20 KB, image/png), 2022-04-11 11:16 UTC, Afreen
rook osd prepare pod logs (7.72 KB, text/plain), 2022-05-02 14:14 UTC, Tal Yichye
ceph-volume log (3.38 KB, text/plain), 2022-05-03 07:39 UTC, Tal Yichye
ocs-devices output for lsblk and ls -al commands (1.97 KB, text/plain), 2022-05-03 07:40 UTC, Tal Yichye


Links
Github red-hat-storage rook pull 365 (open): Bug 2073920: osd: only set kek to env var on encryption scenario (last updated 2022-04-12 07:19:56 UTC)
Github rook rook pull 10035 (Merged): osd: only set kek to env var on encryption scenario (last updated 2022-04-12 12:04:22 UTC)
Red Hat Product Errata RHSA-2022:6156 (last updated 2022-08-24 13:52:10 UTC)

Description Tal Yichye 2022-04-11 06:40:03 UTC
Created attachment 1871709 [details]
list of pods, storageclass and storagecluster yaml

Description of problem (please be as detailed as possible and provide log
snippets):

We installed the Red Hat ODF operator from the console and tried to create the StorageSystem for IBM FlashSystem with no encryption option.
The OSD pods were not created and the rook-ceph-osd-prepare pods are stuck in CrashLoopBackOff with the following error:

2022-04-06 11:55:37.199155 E | op-osd: failed to provision OSD(s) on PVC ocs-deviceset-ibm-odf-test-0-data-0jmkhn. &{OSDs:[] Status:failed PvcBackedOSD:true Message:failed to set kek as an environment variable: key encryption key is empty}


Version of all relevant components (if applicable):

RH ODF operator: Image quay.io/rhceph-dev/ocs-registry:4.10.0-211
OCP version: 4.10.3

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

Yes, we can't install our new IBM ODF operator.

Is there any workaround available to the best of your knowledge?

no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Change the redhat-operator catalog source to use the quay.io/rhceph-dev/ocs-registry:4.10.0-211 image
2. Install the ODF operator for OCP 4.10 from the UI
3. Install the StorageSystem ibm-flashsystem from the UI

Actual results:
rook-ceph-osd-prepare pods are stuck in CrashLoopBackOff with the KMS error:

2022-04-06 11:55:37.199155 E | op-osd: failed to provision OSD(s) on PVC ocs-deviceset-ibm-odf-test-0-data-0jmkhn. &{OSDs:[] Status:failed PvcBackedOSD:true Message:failed to set kek as an environment variable: key encryption key is empty}

Expected results:

The OSD pods are created successfully with no KMS encryption.

Additional info:

Comment 2 Afreen 2022-04-11 07:40:55 UTC
Thanks Tal,
Looks like an issue on the operator side; I am creating a setup to check.
Please also provide the must-gather logs: oc adm must-gather --image="quay.io/rhceph-dev/ocs-must-gather:latest-4.10"

Comment 3 Tal Yichye 2022-04-11 09:36:08 UTC
Must-gather logs were sent directly to Afreen as requested.

Comment 4 Afreen 2022-04-11 11:15:29 UTC
Created attachment 1871768 [details]
browser post request for storage cluster creation with ibm flashsystem and no kms selection

The issue is not happening on non-IBM-FS ODF deployments when checked with the latest-stable-4.10 images.

To be sure, I checked the UI flow, which is also not setting the KMS flag when the StorageSystem is created via the IBM-FS option with no KMS (screenshot attached).
The StorageCluster YAML also does not have KMS enabled, but it is in an error state due to:

Error while reconciling: some StorageClasses [ocs-storagecluster-ceph-rbd]
        were skipped while waiting for pre-requisites to be met

It can be checked again with the latest "stable-4.10" tag builds to see if it persists, as I cannot replicate it for normal deployments.
Meanwhile, moving this to rook for more input and debugging.

Comment 5 Afreen 2022-04-11 11:16:26 UTC
Created attachment 1871769 [details]
storagesystem review page for creating ibm-fs with kms disabled

Comment 7 Sébastien Han 2022-04-11 14:53:18 UTC
Some logs are empty, but I think I understand what's going on: the prepare job gets injected with some environment variables prefixed with IBM_, which triggers the KMS code path even though no encryption was requested.
I'm sending a fix for this.

I would consider this a blocker since all deployments on IBM will fail.
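
For illustration, the gating described by the linked PR title ("osd: only set kek to env var on encryption scenario") could look roughly like the Go sketch below. This is not Rook's actual code; the function name, the env var name, and the structure are assumptions.

// Minimal sketch, assuming names; not Rook's actual implementation.
package main

import (
    "fmt"
    "os"
)

// setKEKIfEncrypted exports the key encryption key only when the OSD is
// configured for encryption. On a non-encrypted deployment it is a no-op,
// so an empty KEK can never surface as an error.
func setKEKIfEncrypted(encrypted bool, kek string) error {
    if !encrypted {
        // No encryption requested: nothing to export.
        return nil
    }
    if kek == "" {
        return fmt.Errorf("key encryption key is empty")
    }
    // Env var name is an assumption for this sketch.
    return os.Setenv("CEPH_VOLUME_DMCRYPT_SECRET", kek)
}

func main() {
    // Non-encrypted scenario, as in this bug: no error, no env var set.
    if err := setKEKIfEncrypted(false, ""); err != nil {
        fmt.Println("unexpected error:", err)
    }
}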

Comment 8 Mudit Agarwal 2022-04-11 16:13:11 UTC
Please add doc text

Comment 9 Tal Yichye 2022-04-12 06:55:54 UTC
Hi,
When the fix is done, please let us know which ODF image to use.
Thanks

Comment 14 Tal Yichye 2022-04-14 06:50:27 UTC
Hi,
Is there any ODF image we can use that contains this fix?
Thanks

Comment 15 Sébastien Han 2022-04-14 07:33:57 UTC
(In reply to Tal Yichye from comment #14)
> Hi,
> Is there any ODF image we can use that contains this fix?
> Thanks

No, since the downstream PR hasn't merged yet (the target being 4.10.z, we need to wait).

Comment 16 Vered Berenstein Paz 2022-04-26 08:03:46 UTC
Hi Red Hat team,
Is there an ETA for when 4.10.z will be available internally for testing?

Comment 17 Tal Yichye 2022-05-02 12:24:39 UTC
Hi,
We installed ODF version 4.10.1 (which should contain the fix for the KMS issue), again without any encryption, but encountered a new error with the rook-osd-prepare pods.
This message appears in the prepare pod logs:

2022-05-02 10:24:58.759564 E | cephosd: failed to determine if the encrypted block "/mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc" is from our cluster. failed to dump LUKS header for disk "/mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc". Device /mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc is not a valid LUKS device.

As I said, no encryption option was selected as you can see here (taken from the prepare pod):
ROOK_PVC_BACKED_OSD:            true
ROOK_ENCRYPTED_DEVICE:          false

Thanks
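
As background, these two variables are what a prepare process would consult before taking any encryption-specific path. The Go sketch below is purely illustrative (the helper name and the skipped LUKS step are assumptions, not Rook's actual code), showing how ROOK_ENCRYPTED_DEVICE=false would bypass LUKS handling entirely:

// Purely illustrative; not Rook's actual implementation.
package main

import (
    "fmt"
    "os"
    "strconv"
)

// envBool parses a boolean environment variable such as ROOK_ENCRYPTED_DEVICE,
// defaulting to false when the variable is unset or malformed.
func envBool(name string) bool {
    v, err := strconv.ParseBool(os.Getenv(name))
    if err != nil {
        return false
    }
    return v
}

func main() {
    pvcBacked := envBool("ROOK_PVC_BACKED_OSD")
    encrypted := envBool("ROOK_ENCRYPTED_DEVICE")
    fmt.Printf("pvc-backed=%v encrypted=%v\n", pvcBacked, encrypted)

    if !encrypted {
        // With encryption disabled there is no LUKS header to inspect,
        // so any encrypted-block check would be skipped here.
        fmt.Println("skipping LUKS header inspection")
        return
    }
    // An encrypted deployment would inspect the device's LUKS header here.
}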

Comment 18 Sébastien Han 2022-05-02 13:47:44 UTC
(In reply to Tal Yichye from comment #17)
> Hi,
> We installed ODF version 4.10.1 (which should contain the fix for the
> KMS issue), again without any encryption, but encountered a new error with the
> rook-osd-prepare pods.
> This message appears in the prepare pod logs:
> 
> 2022-05-02 10:24:58.759564 E | cephosd: failed to determine if the encrypted
> block "/mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc" is from our cluster.
> failed to dump LUKS header for disk
> "/mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc". Device
> /mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc is not a valid LUKS device.
> 
> As I said, no encryption option was selected as you can see here (taken from
> the prepare pod):
> ROOK_PVC_BACKED_OSD:            true
> ROOK_ENCRYPTED_DEVICE:          false
> 
> Thanks

Hi,

Can I get the full prepare job log? So no OSDs were deployed?
Thanks!

Comment 19 Tal Yichye 2022-05-02 14:14:06 UTC
Created attachment 1876518 [details]
rook osd prepare pod logs

Comment 20 Tal Yichye 2022-05-02 14:16:19 UTC
Yes, there are no OSDs, only osd-prepare pods.

Comment 21 Sébastien Han 2022-05-02 14:35:51 UTC
The log is incomplete; what comes after this line:

2022-05-02 10:24:59.803582 D | exec: Running command: stdbuf -oL ceph-volume --log-path /var/log/ceph/ocs-deviceset-ibm-odf-test-1-data-0xx8fc raw prepare --bluestore --data /dev/mapper/mpathau

Thanks

Comment 22 Tal Yichye 2022-05-02 14:46:31 UTC
That's what I get when I run 'oc logs <pod_name>', there are no additional lines.

Comment 23 Sébastien Han 2022-05-02 14:51:14 UTC
(In reply to Tal Yichye from comment #22)
> That's what I get when I run 'oc logs <pod_name>', there are no additional
> lines.

Is the process stuck?
Is the prepare job still in running state and not in "Completed"?

Thanks

Comment 24 Tal Yichye 2022-05-02 15:00:58 UTC
The prepare jobs are in the 'in progress' state.
I am not sure if it is stuck; it looks like the rook-operator is waiting for the OSD prepare jobs to finish:

2022-05-02 14:53:53.000488 I | clusterdisruption-controller: reconciling osd pdb reconciler as the allowed disruptions in default pdb is 0
2022-05-02 14:54:00.489112 I | clusterdisruption-controller: reconciling osd pdb reconciler as the allowed disruptions in default pdb is 0
2022-05-02 14:54:23.553004 I | clusterdisruption-controller: reconciling osd pdb reconciler as the allowed disruptions in default pdb is 0
2022-05-02 14:54:31.012784 I | clusterdisruption-controller: reconciling osd pdb reconciler as the allowed disruptions in default pdb is 0
2022-05-02 14:54:34.427057 I | op-osd: waiting... 0 of 3 OSD prepare jobs have finished processing and 0 of 0 OSDs have been updated

Comment 25 Sébastien Han 2022-05-02 15:29:08 UTC
Can you:

* log into the node where the prepare job is running
* verify whether ceph-volume is still running
* if so, try to strace it and see where it is blocked

Also, can you confirm "/mnt/ocs-deviceset-ibm-odf-test-1-data-0xx8fc" is backed by an LV and tell us why?

Thanks

Comment 26 Tal Yichye 2022-05-03 07:39:23 UTC
Created attachment 1876691 [details]
ceph-volume log

Comment 27 Tal Yichye 2022-05-03 07:40:55 UTC
Created attachment 1876693 [details]
ocs-devices output for lsblk and ls -al commands

Also, this is the strace output:
strace: Process 2980552 attached
wait4(-1,
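
For readers of the strace output: a process pending in wait4(-1, ...) is blocked waiting for any child process to exit. The Go sketch below is purely illustrative, showing a parent that behaves this way while a long-running child (a placeholder standing in for ceph-volume) runs:

// Purely illustrative: a parent process blocked waiting for its child,
// which is what a pending wait4(-1, ...) in strace corresponds to.
package main

import (
    "fmt"
    "os/exec"
)

func main() {
    // Placeholder long-running child standing in for ceph-volume.
    cmd := exec.Command("sleep", "3600")
    if err := cmd.Start(); err != nil {
        fmt.Println("failed to start child:", err)
        return
    }
    // Wait() blocks until the child exits; tracing the parent at this point
    // shows it pending in a wait call such as wait4.
    if err := cmd.Wait(); err != nil {
        fmt.Println("child exited with error:", err)
    }
}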

Comment 28 Sébastien Han 2022-05-03 12:25:24 UTC
As discussed offline, moving on with a different BZ since the original issue is fixed.

Comment 29 Tal Yichye 2022-05-09 08:49:57 UTC
Hi,
After installing with the ODF 4.10.1 image, this issue no longer appears.
The OSD prepare jobs are now in the 'in progress' state and no OSD pods were deployed. I opened another BZ as Sébastien requested, to investigate the new issue: https://bugzilla.redhat.com/show_bug.cgi?id=2081431

Thanks

Comment 34 Tal Yichye 2022-05-19 07:45:06 UTC
Hi,
As mentioned earlier, the fix was tested with the 4.10.1 image. We have not yet tested it with 4.11.

Comment 37 Tal Yichye 2022-07-20 16:24:58 UTC
Hi,
We tested the fix with an internal build of ODF 4.11 and it works.

Thanks, Tal.

Comment 38 Elad 2022-07-21 07:24:23 UTC
Moving to VERIFIED based on comment #37

Comment 40 errata-xmlrpc 2022-08-24 13:50:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6156

