Description of problem (please be as detailed as possible and provide log snippets):

When trying to deploy an ODF 4.11 cluster with cluster-wide encryption enabled, using the Kubernetes authentication method for KMS with Vault namespaces, the deployment fails with the following error in the Rook operator logs:

2022-05-24 08:16:33.051746 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: failed to validate kms connection details: failed to validate vault connection details: failed to find TLS connection details k8s secret "ocs-kms-ca-secret-0rky0c"

The secret mentioned in the error message exists in the openshift-storage namespace:

$ oc get secret ocs-kms-ca-secret-0rky0c -n openshift-storage -o yaml
apiVersion: v1
data:
  cert: LS0tLS1CRUdJTiBDRVJUS....
kind: Secret
metadata:
  creationTimestamp: "2022-05-24T07:53:46Z"
  name: ocs-kms-ca-secret-0rky0c
  namespace: openshift-storage
  resourceVersion: "83446"
  uid: 11e6915a-6555-4fd3-87ac-bed5b6c39c7b
type: Opaque

$ oc get cm ocs-kms-connection-details -o yaml
apiVersion: v1
data:
  KMS_PROVIDER: vault
  KMS_SERVICE_NAME: vault
  VAULT_ADDR: https://vault-cluster.vault.2467e33a-73f9-408b-b9ff-b0476a654d30.aws.hashicorp.cloud:8200
  VAULT_AUTH_KUBERNETES_ROLE: odf-rook-ceph-op
  VAULT_AUTH_METHOD: kubernetes
  VAULT_AUTH_MOUNT_PATH: ""
  VAULT_BACKEND_PATH: rook
  VAULT_CACERT: ocs-kms-ca-secret-0rky0c
  VAULT_NAMESPACE: admin
  VAULT_TLS_SERVER_NAME: ""
kind: ConfigMap
metadata:
  creationTimestamp: "2022-05-24T07:53:46Z"
  name: ocs-kms-connection-details
  namespace: openshift-storage
  resourceVersion: "51613"
  uid: 2ba1ab36-3516-47be-a2ea-c9d4d4f57c56

Version of all relevant components (if applicable):
---------------------------------------------------
OCP: 4.11.0-0.nightly-2022-05-20-213928
ODF: odf-operator.v4.11.0 full_version=4.11.0-75

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, the deployment fails.

Is there any workaround available to the best of your knowledge?
Not that I am aware of.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
-------------------
1. Deploy an ODF cluster using the Kubernetes authentication method for Vault, where the auth method is enabled inside a Vault namespace. Follow the steps described here: https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html-single/deploying_openshift_data_foundation_using_amazon_web_services/index#enabling-cluster-wide-encryprtion-with-the-token-authentication-using-kms_cloud-storage

Actual results:
---------------
The deployment fails with the following error, even though the secret is present in the cluster:

2022-05-24 08:16:33.051746 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: failed to validate kms connection details: failed to validate vault connection details: failed to find TLS connection details k8s secret "ocs-kms-ca-secret-0rky0c"

Expected results:
-----------------
The deployment should succeed.

Additional info:
----------------
- This issue was not seen when the same Vault instance was used with the token authentication method and Vault namespaces; that deployment succeeded.
- Without Vault namespaces, the Kubernetes auth method worked for cluster-wide encryption.
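For reference on where the error comes from: the failing check is essentially a ConfigMap-to-Secret lookup, where Rook reads VAULT_CACERT from ocs-kms-connection-details and then fetches the Secret with that name. Below is a minimal sketch of that lookup using client-go; the function name and error strings are illustrative, not the actual Rook code.

package kms

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// validateVaultCACert is a hypothetical stand-in for Rook's KMS TLS
// validation: read VAULT_CACERT from the connection-details ConfigMap
// and verify that the referenced Secret exists.
func validateVaultCACert(ctx context.Context, cs kubernetes.Interface, ns string) error {
    cm, err := cs.CoreV1().ConfigMaps(ns).Get(ctx, "ocs-kms-connection-details", metav1.GetOptions{})
    if err != nil {
        return fmt.Errorf("failed to get kms connection details: %w", err)
    }
    secretName := cm.Data["VAULT_CACERT"]
    if secretName == "" {
        return nil // no CA certificate configured, nothing to check
    }
    // If ctx was already canceled, this Get fails even though the
    // Secret exists, which would surface as the "failed to find TLS
    // connection details k8s secret" error seen above.
    if _, err := cs.CoreV1().Secrets(ns).Get(ctx, secretName, metav1.GetOptions{}); err != nil {
        return fmt.Errorf("failed to find TLS connection details k8s secret %q: %w", secretName, err)
    }
    return nil
}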
Although the error is hidden, I suspect the context is still being canceled, which is why we fail to execute the command. We need to investigate that.
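If that suspicion is correct, the lookup error should wrap context.Canceled rather than a real NotFound from the API server. A small sketch (assuming the error comes from a client-go Get call) of how to tell the two cases apart:

package kms

import (
    "context"
    "errors"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// classifyGetError distinguishes a genuinely missing Secret from a
// lookup aborted by a canceled reconcile context.
func classifyGetError(ctx context.Context, err error) string {
    switch {
    case errors.Is(err, context.Canceled) || ctx.Err() != nil:
        return "lookup aborted: reconcile context was canceled"
    case apierrors.IsNotFound(err):
        return "secret is genuinely missing from the namespace"
    default:
        return "other API error"
    }
}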
Sebastien/Rachel, is this a blocker for 4.11? If not, let's move it out of 4.11, as we are in a blocker-only phase now.
I just pushed a new patch; it's still under review. We have a workaround, which is to restart the operator, but I'd prefer to keep this as a blocker. Rachel? Thoughts?
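For anyone hitting this before the fix ships: restarting the operator means deleting the rook-ceph-operator pod so its Deployment recreates it with a fresh context. Assuming the default operator labels, something like:

$ oc delete pod -n openshift-storage -l app=rook-ceph-operator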
https://github.com/rook/rook/pull/10523 has been merged upstream. I don't see how this relates to anything in the UI or doc changes; I believe it's just a reconcile fix, right Seb? If there is no UI or doc change, I agree with Seb that we should merge it to 4.11. Rachael and Seb, could you clarify? Thanks!
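For readers without access to the PR, the general shape of this class of reconcile fix is to scope API calls to the context of the current reconcile instead of a stored, long-lived context that an earlier reconcile may have canceled. The following is a generic sketch of that pattern only; it is not the contents of rook/rook#10523.

package kms

import (
    "context"

    ctrl "sigs.k8s.io/controller-runtime"
)

type clusterReconciler struct {
    // Anti-pattern: a context stored at controller setup time. Once it
    // is canceled (e.g. when a previous reconcile is aborted), every
    // later API call made with it fails with "context canceled".
    // staleCtx context.Context
}

// Reconcile uses the per-call ctx for all API access, so a canceled
// context from an earlier reconcile cannot leak into this one.
func (r *clusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    _ = ctx // pass this ctx down to validation and client calls
    return ctrl.Result{}, nil
}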
Yeah, I agree with Travis, and I am confused because it was acked by QE earlier. So what has changed since then? My question was only because we entered the blocker-only phase; otherwise, it was already approved (and fixed).
Thanks Rachael.
Moving this BZ to 4.12. Rachel, can you please file a doc BZ for this, since the first build of 4.12 will contain the fix?
Moving to MODIFIED since the Rook reconcile fix is in. Rachel, will you open the UI and doc BZs? Thanks!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.12.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:0551