Bug 2030291 - [GSS] idempotent prepare job for cluster-wide encryption
Summary: [GSS] idempotent prepare job for cluster-wide encryption
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: OCS 4.7.8
Assignee: Sébastien Han
QA Contact: Rachael
URL:
Whiteboard:
Depends On:
Blocks: 2004746
Reported: 2021-12-08 11:40 UTC by Sébastien Han
Modified: 2022-04-26 08:27 UTC (History)
13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, when a cluster was configured with cluster-wide encryption, there was no way to identify whether an encrypted volume was part of the cluster: no label was set, and the encrypted nature of the volume did not allow you to inspect its contents. Before attempting to use a device, you had to read its Linux Unified Key Setup (LUKS) header or label to determine whether the encrypted device was an Object Storage Device (OSD). With this update, the cluster `fsid` is set as a label in the LUKS header, so it is now possible to identify whether an encrypted volume is part of your cluster.
Clone Of:
Environment:
Last Closed: 2022-02-15 06:33:56 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage ocs-ci pull 5555 0 None Merged #bz2030291 checks label on luks header info for all the encrypted osds 2022-04-26 08:27:07 UTC
Github red-hat-storage rook pull 317 0 None open Bug 2030291: ceph: add ceph cluster fsid to LUKS header 2021-12-14 09:28:06 UTC
Github rook rook pull 8005 0 None Merged ceph: add ceph cluster fsid to LUKS header 2021-12-13 15:48:52 UTC
Red Hat Product Errata RHBA-2022:0528 0 None None None 2022-02-15 06:34:02 UTC

Internal Links: 2069722

Description Sébastien Han 2021-12-08 11:40:50 UTC
Description of problem (please be detailed as possible and provide log
snippets):


If the prepare job runs again, the PVC will be skipped with:

skipping device "/mnt/set1-data-0mqnt6" because it contains a filesystem "crypto_LUKS"

In 4.8 we added tags to the LUKS header to recognize whether a disk is an OSD and whether that OSD belongs to our cluster.

This could be an issue if an OSD is removed and then needs to be re-added to the cluster, for instance if the rook-ceph-osd deployment is accidentally deleted. The OSD details must be refreshed, and we need the prepare job for this.

This is essentially a backport request.
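To make the decision concrete, here is a minimal sketch of the check the LUKS tags enable. The `belongs_to_cluster` helper and the sample dump text are illustrative assumptions, not code from rook; a real prepare job would feed it the output of `cryptsetup luksDump "$dev"` and the fsid from `ceph fsid`.

```shell
# Hypothetical sketch: with a ceph_fsid tag in the LUKS header, a prepare
# job can tell its own encrypted OSDs apart from foreign crypto_LUKS
# devices instead of skipping every device that carries a filesystem.
# belongs_to_cluster reads luksDump output on stdin; the name is assumed.
belongs_to_cluster() {
  cluster_fsid="$1"
  grep -q "ceph_fsid=${cluster_fsid}"
}

# Illustrative sample; in a real job this comes from: cryptsetup luksDump "$dev"
sample_dump='Subsystem:      ceph_fsid=811e7dc0-ea13-4951-b000-24a8565d0735'

if printf '%s\n' "$sample_dump" | belongs_to_cluster 811e7dc0-ea13-4951-b000-24a8565d0735; then
  result="reuse as OSD"
else
  result="skip device"
fi
echo "$result"
```

Without the tag (the 4.7 behavior this backport addresses), the only signal is the `crypto_LUKS` filesystem type, which forces the job to skip the device unconditionally.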

Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Remove an OSD deployment
2. The Operator will kick off a new prepare job (or the operator might need to be restarted in 4.7)
3. Look at the prepare job logs; the disk is skipped

Actual results:

The OSD does not come up again

Expected results:

OSD comes online.

Comment 1 Sébastien Han 2021-12-14 09:28:07 UTC
Neha, who can give the QA ack? Thanks

Comment 7 krishnaram Karthick 2021-12-23 16:21:00 UTC
Mudit/Bipin - can you please provide the justification for having this fix in 4.7?

@Sebastian - what are the steps to verify this bug and what additional tests would be needed to make sure the fix is tested completely?

Comment 10 Sébastien Han 2022-01-07 13:18:51 UTC
(In reply to krishnaram Karthick from comment #7)
> Mudit/Bipin - can you please provide the justification for having this fix
> in 4.7?
> 
> @Sebastian - what are the steps to verify this bug and what additional tests
> would be needed to make sure the fix is tested completely?

To verify this bug we need to:

* deploy an encrypted cluster before the fix is present; so if the fix is in 4.7.8, deploy a 4.7.7 cluster
* check that no label/subsystem is set on the encrypted disk: run "cryptsetup luksDump <dev>" and confirm the "Label" and "Subsystem" fields are empty
* upgrade to the version with the fix
* check the label/subsystem on the OSD disk again; it should look like this:

Label:          pvc_name=set1-data-0lmdjp
Subsystem:      ceph_fsid=811e7dc0-ea13-4951-b000-24a8565d0735
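The field values can also be checked mechanically rather than by eye. The sketch below parses the two fields out of the dump text; the `dump` variable reuses the sample lines above for illustration, whereas a real check would capture `cryptsetup luksDump <dev>` output.

```shell
# Hypothetical parsing sketch: extract pvc_name and ceph_fsid from
# `cryptsetup luksDump` output. The dump text is the sample from this
# comment; in practice it would come from the real command.
dump='Label:          pvc_name=set1-data-0lmdjp
Subsystem:      ceph_fsid=811e7dc0-ea13-4951-b000-24a8565d0735'

pvc_name=$(printf '%s\n' "$dump" | awk '/^Label:/ {sub("pvc_name=", "", $2); print $2}')
ceph_fsid=$(printf '%s\n' "$dump" | awk '/^Subsystem:/ {sub("ceph_fsid=", "", $2); print $2}')

echo "pvc_name=$pvc_name"
echo "ceph_fsid=$ceph_fsid"
```

The extracted `ceph_fsid` can then be compared against the output of `ceph fsid` to confirm the disk belongs to this cluster.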

Comment 11 khover 2022-01-10 15:24:31 UTC
Message from the customer.

We are planning an OCP 4.8.x upgrade we will not need a 4.7 backport fix.

Comment 14 Sébastien Han 2022-01-18 09:44:14 UTC
Bipin, 

Based on the above comments, should we just close this BZ?
The fix is in 4.8 and above.
Thanks

Comment 26 Sébastien Han 2022-02-07 14:24:14 UTC
Thanks Rachael, the testing looks good and the results are correct.

Comment 33 errata-xmlrpc 2022-02-15 06:33:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.7.8 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0528
