Bug 2030291

Summary: [GSS] idempotent prepare job for cluster-wide encryption
Product: [Red Hat Storage] Red Hat OpenShift Container Storage Reporter: Sébastien Han <shan>
Component: rookAssignee: Sébastien Han <shan>
Status: CLOSED ERRATA QA Contact: Rachael <rgeorge>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.7CC: agantony, assingh, bkunal, etamir, khover, kramdoss, madam, mmuench, muagarwa, nberry, ocs-bugs, sheggodu, tdesala
Target Milestone: ---Keywords: ZStream
Target Release: OCS 4.7.8   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Previously, when you configured the cluster with cluster-wide encryption, it was impossible to identify whether or not an encrypted volume was part of the cluster, as no label was set and the encrypted nature of the volume did not allow you to inspect its content. Before you attempted to use it, you had to read the Linux Unified Key Setup (LUKS) header or label to determine whether or not the encrypted device was an Object Storage Device (OSD). With this update, the `ceph_fsid` tag is set on the LUKS header of the device. Now, it is possible to identify whether or not an encrypted volume is part of your cluster.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-02-15 06:33:56 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2004746    

Description Sébastien Han 2021-12-08 11:40:50 UTC
Description of problem (please be as detailed as possible and provide log
snippets):


If the prepare job runs again, the PVC will be skipped with:

skipping device "/mnt/set1-data-0mqnt6" because it contains a filesystem "crypto_LUKS"

In 4.8 we added tags on the LUKS header to recognize whether a disk is an OSD and if this OSD belongs to our cluster.

This could be an issue if an OSD is removed and then re-added to the cluster, for instance if the rook-ceph-osd deployment is accidentally removed.
The OSD details must be refreshed, and we need the prepare job for that.

This is essentially a backport request.
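The intended behavior can be sketched as follows (a minimal illustration, not Rook's actual code; the function name and the `cluster_fsid` parameter are hypothetical):

```python
def should_skip_device(fs_type, luks_subsystem, cluster_fsid):
    """Decide whether the prepare job should skip a device.

    A device carrying a foreign filesystem is skipped, but a LUKS device
    whose ceph_fsid tag matches our cluster is an existing OSD and should
    be picked up again instead of being skipped.
    """
    if fs_type == "crypto_LUKS":
        # Tags added in 4.8: the LUKS Subsystem field carries "ceph_fsid=<fsid>"
        return luks_subsystem != f"ceph_fsid={cluster_fsid}"
    # Any other pre-existing filesystem does not belong to us: skip it
    return fs_type is not None

# An encrypted OSD tagged with this cluster's fsid is not skipped:
fsid = "811e7dc0-ea13-4951-b000-24a8565d0735"
print(should_skip_device("crypto_LUKS", f"ceph_fsid={fsid}", fsid))  # False
```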

Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Remove an OSD deployment
2. The operator will kick off a new prepare job (in 4.7 the operator might need to be restarted)
3. Look at the prepare logs: the disk is skipped

Actual results:

The OSD does not come up again

Expected results:

OSD comes online.

Comment 1 Sébastien Han 2021-12-14 09:28:07 UTC
Neha, who can give the QA ack? Thanks

Comment 7 krishnaram Karthick 2021-12-23 16:21:00 UTC
Mudit/Bipin - can you please provide the justification for having this fix in 4.7?

@Sebastian - what are the steps to verify this bug and what additional tests would be needed to make sure the fix is tested completely?

Comment 10 Sébastien Han 2022-01-07 13:18:51 UTC
(In reply to krishnaram Karthick from comment #7)
> Mudit/Bipin - can you please provide the justification for having this fix
> in 4.7?
> 
> @Sebastian - what are the steps to verify this bug and what additional tests
> would be needed to make sure the fix is tested completely?

To verify this bug we need to:

* deploy an encrypted cluster before the fix is present; so if the fix is in 4.7.8, deploy a 4.7.7 cluster
* check that no label/subsystem is set on the encrypted disk: run "cryptsetup luksDump <dev>" and verify that the "Label" and "Subsystem" fields are empty
* upgrade to the version with the fix
* check the label/subsystem on the OSD disk again; it should look like this:

Label:          pvc_name=set1-data-0lmdjp
Subsystem:      ceph_fsid=811e7dc0-ea13-4951-b000-24a8565d0735
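The last check can be automated by parsing the "cryptsetup luksDump" output for these two fields (a hedged sketch; the helper name is hypothetical and the sample text mirrors the fields shown above):

```python
def parse_luks_tags(luks_dump: str) -> dict:
    """Extract the Label and Subsystem fields from `cryptsetup luksDump` output."""
    tags = {}
    for line in luks_dump.splitlines():
        key, _, value = line.partition(":")
        if key.strip() in ("Label", "Subsystem"):
            tags[key.strip()] = value.strip()
    return tags

sample = """\
Label:          pvc_name=set1-data-0lmdjp
Subsystem:      ceph_fsid=811e7dc0-ea13-4951-b000-24a8565d0735
"""
tags = parse_luks_tags(sample)
print(tags["Subsystem"])  # ceph_fsid=811e7dc0-ea13-4951-b000-24a8565d0735
```

If both fields come back empty after the upgrade, the OSD has not been redeployed yet and the tags have not been applied.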

Comment 11 khover 2022-01-10 15:24:31 UTC
Message from the customer.

We are planning an OCP 4.8.x upgrade, so we will not need a 4.7 backport fix.

Comment 14 Sébastien Han 2022-01-18 09:44:14 UTC
Bipin, 

Based on the above comments, should we just close this BZ?
The fix is in 4.8 and above.
Thanks

Comment 26 Sébastien Han 2022-02-07 14:24:14 UTC
Thanks Rachael, the testing looks good and the results are correct.

Comment 33 errata-xmlrpc 2022-02-15 06:33:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.7.8 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0528