Description of problem (please be as detailed as possible and provide log snippets):

If the prepare job runs again, the PVC will be skipped with:

  skipping device "/mnt/set1-data-0mqnt6" because it contains a filesystem "crypto_LUKS"

In 4.8 we added tags to the LUKS header to recognize whether a disk is an OSD and whether that OSD belongs to our cluster. This can be an issue if an OSD is removed and then re-added to the cluster, for instance if the rook-ceph-osd deployment is accidentally removed: the OSD details must be refreshed, and we need the prepare job for this. This is essentially a backport request.

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Remove an OSD deployment
2. The operator will kick off a new prepare job (or the operator might need to be restarted in 4.7)
3. Look at the prepare logs; the disk is skipped

Actual results:
The OSD does not come up again.

Expected results:
The OSD comes back online.
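The skip decision described above can be sketched as a small shell function. This is a hypothetical illustration, not Rook's actual code: the function name, argument layout, and "skip"/"prepare" outputs are assumptions; the point is that a crypto_LUKS device should only be re-prepared when its LUKS subsystem tag names this cluster's ceph_fsid.

```shell
# Hypothetical sketch of the prepare job's decision (not Rook's real code).
# fstype:       filesystem type on the device, e.g. from: blkid -o value -s TYPE "$dev"
# subsystem:    LUKS2 subsystem field, e.g. from: cryptsetup luksDump "$dev"
# cluster_fsid: this Ceph cluster's fsid
should_skip_device() {
    fstype="$1"; subsystem="$2"; cluster_fsid="$3"
    if [ "$fstype" = "crypto_LUKS" ] && [ "$subsystem" != "ceph_fsid=$cluster_fsid" ]; then
        echo "skip"      # untagged or foreign LUKS device: leave it alone
    elif [ -n "$fstype" ] && [ "$fstype" != "crypto_LUKS" ]; then
        echo "skip"      # some other filesystem is already present
    else
        echo "prepare"   # empty device, or a LUKS device tagged for this cluster
    fi
}
```

Without the tag (the pre-4.8 situation in this bug), the first branch fires for the cluster's own OSD disk and the prepare job skips it, so the OSD never comes back.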
Neha, who can give the QA ack? Thanks
Mudit/Bipin - can you please provide the justification for having this fix in 4.7?

@Sebastian - what are the steps to verify this bug, and what additional tests would be needed to make sure the fix is tested completely?
(In reply to krishnaram Karthick from comment #7)
> Mudit/Bipin - can you please provide the justification for having this fix in 4.7?
>
> @Sebastian - what are the steps to verify this bug and what additional tests would be needed to make sure the fix is tested completely?

To verify this bug we need to:
* deploy an encrypted cluster before the fix is present; so if the fix is in 4.7.8, deploy a 4.7.7 cluster
* check that no label/subsystem is set on the encrypted disk; you can use "cryptsetup luksDump <dev>" and see that "Label" and "Subsystem" are empty
* upgrade to the version with the fix
* check the label/subsystem on the OSD disk again; it should look like this:

  Label:          pvc_name=set1-data-0lmdjp
  Subsystem:      ceph_fsid=811e7dc0-ea13-4951-b000-24a8565d0735
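The before/after check in those steps can be scripted. The helper below is a hypothetical sketch (not part of any shipped tooling): it reads "cryptsetup luksDump" output on stdin and succeeds only when both the Label and Subsystem fields carry a key=value tag such as "pvc_name=..." or "ceph_fsid=...", which is the format shown above.

```shell
# Hypothetical check: exit 0 if both Label and Subsystem are set to key=value
# tags in luksDump output read from stdin, exit 1 otherwise. An unset field
# (e.g. "(no label)") contains no "=" and does not count.
luks_tags_set() {
    awk '/^Label:/     && $2 ~ /=/ { l = 1 }
         /^Subsystem:/ && $2 ~ /=/ { s = 1 }
         END { exit !(l && s) }'
}
```

Intended usage on a node (device path is an example): `cryptsetup luksDump /dev/sdb | luks_tags_set && echo "tags present"` — this should fail before the upgrade and succeed after it.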
Message from the customer: since we are planning an OCP 4.8.x upgrade, we will not need a 4.7 backport fix.
Bipin, based on the above comments, should we just close this BZ? The fix is in 4.8 and above. Thanks
Thanks Rachael, the testing looks good and the results are correct.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.7.8 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0528