Description of problem (please be as detailed as possible and provide log snippets):
===================================================================
Storage cluster deletion has been stuck for hours, waiting on noobaa resources, in a partially installed OCS 4.7 cluster with KMS encryption enabled. Due to the issues described below, the OSDs could not be created, and hence the noobaa-db-pg-0 PVC was stuck in Pending state. Could this be the resource whose deletion noobaa is waiting on?

Snip from ocs-operator logs
-----------------------------
{"level":"info","ts":1610467637.3470006,"logger":"controllers.StorageCluster","msg":"Uninstall in progress","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","Status":"Uninstall: Waiting on NooBaa system to be deleted"}

======= PVC ==========
NAME                STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
db-noobaa-db-pg-0   Pending                                      ocs-storagecluster-ceph-rbd   4h

>> Details and background:
We tried to install OCS with KMS encryption from the UI. In this attempt, we hit the issue in [1]. To work around it, we added [VAULT_SKIP_VERIFY: "true"] to the configmap, and the deployment progressed a little, but the OSDs failed to come up due to the following error:

[1] Bug 1915202 - Can not configure KMS with unknown CA certificate

Snip from rook-op logs
-----
2021-01-12 12:14:20.978135 E | op-osd: failed to store secret. failed to init vault kms: failed to initialize vault secret store: Error making API request.
...
2021-01-12 12:14:21.375825 E | op-osd: failed to store secret. failed to init vault kms: failed to initialize vault secret store: Error making API request. URL: GET https://10.0.106.147:8200/v1/sys/mounts Code: 403. Errors: * permission denied

Version of all relevant components (if applicable):
===================================================================
OCP = 4.7.0-0.nightly-2021-01-07-034013
OCS = ocs-operator.v4.7.0-230.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
===================================================================
Yes. Unable to uninstall and re-install.

Is there any workaround available to the best of your knowledge?
===================================================================
No idea

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
===================================================================
4

Is this issue reproducible?
===================================================================
Not sure

Can this issue be reproduced from the UI?
===================================================================
Not sure

If this is a regression, please provide more details to justify this:
===================================================================
Not sure

Steps to Reproduce:
===================================================================
1. Install OCP 4.7 on VMware
2. Install the OCS 4.7 operator and then click Create Storage Cluster
3. In the Configure section, enable cluster-wide encryption and add the KMS details of the external Vault server.
4. Click Create on the Review and Create page
5. If you hit Bug 1915202, edit the configmap below to add [VAULT_SKIP_VERIFY: "true"]
6. Check whether the install succeeds; OSD creation is seen to still fail due to KMS-related permission denied errors
7. The noobaa-db-pg-0 PVC stays in Pending state
8. Try to uninstall OCS by deleting the StorageCluster from the UI or CLI. Make sure no extra OBCs or PVCs exist apart from the OSD/MON/NooBaa DB PVCs.
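For steps 5 and 6, it can help to tell a certificate problem (Bug 1915202) apart from the permission failure seen afterwards. The following is a minimal sketch, not part of any OCS or Rook tooling: `classify_kms_error` is a hypothetical helper, and the `URL: <METHOD> <url> Code: <n>.` layout is inferred only from the single rook-op log line in this report, so other Vault errors may not match it.

```python
import re

# Layout inferred from the one rook-op log line in this report (an assumption):
#   "URL: GET https://host:8200/v1/sys/mounts Code: 403. Errors: ..."
VAULT_ERROR = re.compile(
    r"URL:\s+(?P<method>[A-Z]+)\s+(?P<url>\S+)\s+Code:\s+(?P<code>\d{3})\."
)


def classify_kms_error(log_line):
    """Return (method, url, code, hint), or None if no Vault API error is found."""
    match = VAULT_ERROR.search(log_line)
    if match is None:
        return None
    code = int(match.group("code"))
    if code == 403:
        # Vault answered the request, so the connection itself worked;
        # the request was rejected for lack of permission.
        hint = "permission denied by Vault for this path"
    else:
        hint = "unclassified Vault API error"
    return match.group("method"), match.group("url"), code, hint


# The second log line quoted in the description, joined into one string.
LINE = (
    "2021-01-12 12:14:21.375825 E | op-osd: failed to store secret. "
    "failed to init vault kms: failed to initialize vault secret store: "
    "Error making API request. URL: GET https://10.0.106.147:8200/v1/sys/mounts "
    "Code: 403. Errors: * permission denied"
)

if __name__ == "__main__":
    print(classify_kms_error(LINE))
```

The first log line in the description is truncated before the URL, so the helper returns None for it.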
The configmap "ocs-kms-connection-details" was edited:

data:
  KMS_PROVIDER: vault
  KMS_SERVICE_NAME: vault
  VAULT_ADDR: https://10.0.106.147:8200
  VAULT_BACKEND_PATH: ""
  VAULT_NAMESPACE: ""
  VAULT_SKIP_VERIFY: "true"
  VAULT_TLS_SERVER_NAME: ""

Actual results:
===================================================================
Storagecluster deletion has been stuck for hours

$ oc get storagecluster
NAME                 AGE     PHASE      EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   4h22m   Deleting              2021-01-12T12:04:45Z   4.7.0

deletionTimestamp: "2021-01-12T15:07:07Z"

Expected results:
===================================================================
Uninstall should succeed, especially for clusters that failed to deploy properly due to various issues and need to be cleaned up.
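A Kubernetes object that has a deletionTimestamp remains visible until all of its finalizers are removed, so the finalizer list is a natural first thing to check when a resource sits in Deleting. The sketch below is illustrative only: `deletion_status` is a hypothetical helper, the timestamps are the ones quoted above, the finalizer name is a placeholder rather than the real one, and `NOW` is an assumed observation time (creation time plus the reported AGE of 4h22m).

```python
import json
from datetime import datetime, timezone


def deletion_status(resource_json, now):
    """Summarise a resource that has been marked for deletion.

    Returns None if the resource has no deletionTimestamp.
    """
    meta = json.loads(resource_json)["metadata"]
    stamp = meta.get("deletionTimestamp")
    if stamp is None:
        return None
    started = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return {
        "name": meta["name"],
        "pending_seconds": int((now - started).total_seconds()),
        # Deletion cannot complete while this list is non-empty.
        "finalizers": meta.get("finalizers", []),
    }


# Sample data: timestamps are from this report, the finalizer is a placeholder.
SAMPLE = json.dumps({
    "metadata": {
        "name": "ocs-storagecluster",
        "creationTimestamp": "2021-01-12T12:04:45Z",
        "deletionTimestamp": "2021-01-12T15:07:07Z",
        "finalizers": ["example.com/hypothetical-finalizer"],
    }
})

# Assumed observation time: creationTimestamp + 4h22m.
NOW = datetime(2021, 1, 12, 16, 26, 45, tzinfo=timezone.utc)

if __name__ == "__main__":
    print(deletion_status(SAMPLE, NOW))
```

On a live cluster the JSON would come from `oc get storagecluster ocs-storagecluster -n openshift-storage -o json`, and the same check could be run against the NooBaa system the operator reports it is waiting on.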
This is being looked at by the NooBaa team (https://chat.google.com/room/AAAAREGEba8/_IBytsgj4uo); moving it there.
Marking as a regression because uninstall used to work up until now
I've noticed that when KMS is configured, must-gather takes hours. Probably the same issue.
Yeah, sure. I guess this issue is already there for NooBaa uninstall. You can create one for cephobjectstore.
Deleting the storage cluster works as expected when KMS is set up correctly, and also when the cluster installation failed due to misconfiguration of the KMS certificates.
--> VERIFIED

Tested with: ocs v4.7.0-250.ci
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041