Cause:
Deleting the OSD deployment in an encrypted cluster backed by CSI-provisioned PVCs causes the rook-ceph-osd-prepare job for that OSD to get stuck in the CrashLoopBackOff (CLBO) state.
Consequence:
The rook-ceph-osd-prepare job remains stuck in the CrashLoopBackOff (CLBO) state, and the affected OSD pod never comes up.
Fix:
The rook-ceph-osd-prepare job now cleans up the stale encrypted device and opens it again, avoiding the CLBO state.
Result:
The rook-ceph-osd-prepare job runs as expected and the OSD comes up.
Description of problem (please be as detailed as possible and provide log
snippets):
On an ODF 4.12 cluster deployed with Ciphertrust KMS enabled for cluster-wide encryption, deleting the OSD deployment leaves the rook-ceph-osd-prepare pods stuck in the CrashLoopBackOff (CLBO) state.
$ oc get pods | grep rook-ceph-osd-prepare
rook-ceph-osd-prepare-09be0bc7271ca33da51142ad5efd3efc-k86hm 0/1 CrashLoopBackOff 4 (31s ago) 2m8s
rook-ceph-osd-prepare-6b07717c2d455fdaf2da3ac9794c5b26-n89nw 0/1 Completed 0 98m
rook-ceph-osd-prepare-8dd7ce809cb962fc829a804f5f76a3ae-qhfff 0/1 CrashLoopBackOff 4 (19s ago) 2m4s
rook-ceph-osd-prepare-b860d3da40fceae884f9706def3ebe36-hqzpk 0/1 CrashLoopBackOff 4 (81s ago) 3m7s
rook-ceph-osd-prepare-fd4d91d26c907b2b4e7aca12b07791a2-kqr4v 0/1 Completed 0 98m
The following error message was seen in the rook-ceph-operator logs:
2022-11-15 09:06:08.244081 E | op-osd: failed to provision OSD(s) on PVC ocs-deviceset-gp3-csi-0-data-0mpv8c. &{OSDs:[] Status:failed PvcBackedOSD:true Message:failed to configure devices: failed to get device already provisioned by ceph-volume raw: failed to find the encrypted block device for "/mnt/ocs-deviceset-gp3-csi-0-data-0mpv8c", not opened?}
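To find which PVC is affected, the PVC name can be pulled out of operator log lines like the one above. The following is a minimal sketch, not part of the original report: the function name `failed_pvc` is hypothetical, and the pattern assumes the log line keeps the exact "failed to provision OSD(s) on PVC <name>." wording shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads operator log lines on stdin and prints the PVC
# name from every "failed to provision OSD(s) on PVC <name>." message.
# Assumes the message wording matches the log line quoted in this report.
failed_pvc() {
  sed -n 's/.*failed to provision OSD(s) on PVC \([^ ]*\)\. .*/\1/p'
}
```

Assuming the operator runs as the `rook-ceph-operator` deployment in the `openshift-storage` namespace (neither is stated in this report), it could be used as `oc -n openshift-storage logs deploy/rook-ceph-operator | failed_pvc`.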
Version of all relevant components (if applicable):
---------------------------------------------------
OCP: 4.12.0-0.nightly-2022-11-10-033725
ODF: odf-operator.v4.12.0-111.stable full_version=4.12.0-111
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, the OSDs are down
Is there any workaround available to the best of your knowledge?
Manually open the encrypted block device. However, I don't know the exact steps.
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3
Is this issue reproducible?
1/1
Can this issue reproduce from the UI?
Yes
If this is a regression, please provide more details to justify this:
For Vault KMS provider, this test runs successfully.
Steps to Reproduce:
-------------------
1. Deploy an ODF 4.12 cluster, with Ciphertrust KMS enabled
2. Delete the rook-ceph-osd deployment
3. Check for the status of new pods
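Steps 2 and 3 can be sketched as shell commands. This is only a sketch under stated assumptions, not the exact commands used for this report: the `openshift-storage` namespace, the `app=rook-ceph-osd` label selector, and the helper names `stuck_prepare_pods` and `reproduce` do not appear in the report.

```shell
#!/usr/bin/env bash
# Assumption: ODF is installed in the openshift-storage namespace.
NS="${NS:-openshift-storage}"

# Hypothetical helper: reads `oc get pods` output on stdin and prints the
# names of rook-ceph-osd-prepare pods whose STATUS column is CrashLoopBackOff.
stuck_prepare_pods() {
  awk '$1 ~ /^rook-ceph-osd-prepare-/ && $3 == "CrashLoopBackOff" { print $1 }'
}

# Hypothetical wrapper for steps 2 and 3. Not called automatically, because
# it deletes the OSD deployments on whatever cluster `oc` is logged in to.
reproduce() {
  # Step 2: delete the OSD deployments (assumes they carry app=rook-ceph-osd).
  oc -n "$NS" delete deployment -l app=rook-ceph-osd
  # Step 3: list any prepare pods stuck in CrashLoopBackOff.
  oc -n "$NS" get pods | stuck_prepare_pods
}
```

The filter is kept separate from `reproduce` so it can also be run against saved `oc get pods` output without touching a cluster.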
Actual results:
---------------
The rook-ceph-osd-prepare pods are stuck in CrashLoopBackOff (CLBO) state, with the following error:
2022-11-15 09:06:08.244081 E | op-osd: failed to provision OSD(s) on PVC ocs-deviceset-gp3-csi-0-data-0mpv8c. &{OSDs:[] Status:failed PvcBackedOSD:true Message:failed to configure devices: failed to get device already provisioned by ceph-volume raw: failed to find the encrypted block device for "/mnt/ocs-deviceset-gp3-csi-0-data-0mpv8c", not opened?}
Expected results:
-----------------
New OSD pods should spin up when the rook-ceph-osd deployment is deleted.
Additional info:
----------------
See bug: https://bugzilla.redhat.com/show_bug.cgi?id=2032656
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Red Hat OpenShift Data Foundation 4.12.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2023:0551