Bug 2089755

Summary: [KMS] Deployment of clusterwide encryption with kube auth using vault namespace fails due to TLS check failure
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Rachael <rgeorge>
Component: rook
Assignee: Sébastien Han <shan>
Status: CLOSED ERRATA
QA Contact: Rachael <rgeorge>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.11
CC: ebenahar, madam, mmuench, muagarwa, ocs-bugs, odf-bz-bot, shan, tnielsen
Target Milestone: ---
Target Release: ODF 4.12.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: 4.11.0-96
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-01-31 00:19:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2110868, 2131648, 2048902, 2110866, 2124827

Description Rachael 2022-05-24 11:21:53 UTC
Description of problem (please be detailed as possible and provide log
snippets):

When deploying an ODF 4.11 cluster with cluster-wide encryption enabled, using the Kubernetes authentication method for KMS with Vault namespaces, the deployment fails with the following error in the rook operator logs:

2022-05-24 08:16:33.051746 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: failed to validate kms connection details: failed to validate vault connection details: failed to find TLS connection details k8s secret "ocs-kms-ca-secret-0rky0c"


The secret mentioned in the error message exists in the openshift-storage namespace: 

$ oc get secret ocs-kms-ca-secret-0rky0c -n openshift-storage -o yaml
apiVersion: v1
data:
  cert: LS0tLS1CRUdJTiBDRVJUS....
kind: Secret
metadata:
  creationTimestamp: "2022-05-24T07:53:46Z"
  name: ocs-kms-ca-secret-0rky0c
  namespace: openshift-storage
  resourceVersion: "83446"
  uid: 11e6915a-6555-4fd3-87ac-bed5b6c39c7b
type: Opaque

$ oc get cm ocs-kms-connection-details -o yaml
apiVersion: v1
data:
  KMS_PROVIDER: vault
  KMS_SERVICE_NAME: vault
  VAULT_ADDR: https://vault-cluster.vault.2467e33a-73f9-408b-b9ff-b0476a654d30.aws.hashicorp.cloud:8200
  VAULT_AUTH_KUBERNETES_ROLE: odf-rook-ceph-op
  VAULT_AUTH_METHOD: kubernetes
  VAULT_AUTH_MOUNT_PATH: ""
  VAULT_BACKEND_PATH: rook
  VAULT_CACERT: ocs-kms-ca-secret-0rky0c
  VAULT_NAMESPACE: admin
  VAULT_TLS_SERVER_NAME: ""
kind: ConfigMap
metadata:
  creationTimestamp: "2022-05-24T07:53:46Z"
  name: ocs-kms-connection-details
  namespace: openshift-storage
  resourceVersion: "51613"
  uid: 2ba1ab36-3516-47be-a2ea-c9d4d4f57c56
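
The cross-check between the two objects above can be sketched in a few lines of Python. This is only an illustration of the check, not Rook's validation code: the `check_tls_secret` helper and the in-memory `secrets` dictionary are hypothetical, and because the certificate payload is truncated in this report, a placeholder PEM header stands in for it.

```python
import base64

# Values copied from the ConfigMap shown above.
config_map_data = {
    "VAULT_AUTH_METHOD": "kubernetes",
    "VAULT_CACERT": "ocs-kms-ca-secret-0rky0c",
    "VAULT_NAMESPACE": "admin",
}

# Hypothetical stand-in for the secrets in the openshift-storage namespace.
# The real cert is truncated in the report, so a placeholder PEM header is used.
secrets = {
    "ocs-kms-ca-secret-0rky0c": {
        "cert": base64.b64encode(b"-----BEGIN CERTIFICATE-----\n").decode(),
    },
}


def check_tls_secret(cm_data, secrets_by_name, key="cert"):
    """Return a list of problems; an empty list means the reference resolves."""
    problems = []
    secret_name = cm_data.get("VAULT_CACERT", "")
    if not secret_name:
        return problems  # no CA secret referenced, nothing to check
    secret = secrets_by_name.get(secret_name)
    if secret is None:
        problems.append(f"secret {secret_name!r} not found")
        return problems
    if key not in secret:
        problems.append(f"secret {secret_name!r} has no {key!r} key")
        return problems
    decoded = base64.b64decode(secret[key])
    if not decoded.startswith(b"-----BEGIN CERTIFICATE-----"):
        problems.append(f"secret {secret_name!r} key {key!r} is not PEM")
    return problems


print(check_tls_secret(config_map_data, secrets))  # prints []
```

With the data from this report the check finds no problems, which matches the observation that the secret named by VAULT_CACERT is present.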


Version of all relevant components (if applicable):
---------------------------------------------------
OCP: 4.11.0-0.nightly-2022-05-20-213928
ODF: odf-operator.v4.11.0      full_version=4.11.0-75


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, deployment fails

Is there any workaround available to the best of your knowledge?
Not that I am aware of 


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3


Is this issue reproducible?
Yes


Can this issue reproduce from the UI?
Yes


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
-------------------
1. Deploy an ODF cluster using the Kubernetes authentication method for Vault, where the auth method is enabled inside a Vault namespace.

Follow the steps mentioned here: https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html-single/deploying_openshift_data_foundation_using_amazon_web_services/index#enabling-cluster-wide-encryprtion-with-the-token-authentication-using-kms_cloud-storage


Actual results:
---------------
The deployment fails with the following error, even though the secret is present in the cluster:

2022-05-24 08:16:33.051746 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: failed to validate kms connection details: failed to validate vault connection details: failed to find TLS connection details k8s secret "ocs-kms-ca-secret-0rky0c"


Expected results:
-----------------
The deployment should be successful


Additional info:
----------------
- This issue was not seen when the same Vault instance was used with the token authentication method and Vault namespaces. That deployment was successful.

- Without Vault namespaces, the kube auth method worked for cluster-wide encryption.

Comment 4 Sébastien Han 2022-05-25 07:18:27 UTC
Although the error is hidden, I suspect the context is still canceled, which is why we fail to execute the command.
We need to investigate that.
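
A minimal Python sketch of that hypothesis, for illustration only: it is not Rook's code, and the `get_secret`, `validate_hiding_cause`, and `validate_wrapping_cause` names are hypothetical. A `threading.Event` stands in for the canceled context. It shows how a lookup that fails because of cancellation can surface as "failed to find ... secret" when the underlying error is dropped.

```python
import threading

# Hypothetical stand-in for cluster state: the secret does exist.
SECRETS = {"ocs-kms-ca-secret-0rky0c": {"cert": "<base64 PEM>"}}


def get_secret(cancelled: threading.Event, name: str) -> dict:
    """Simulated API call that refuses to run once the context is canceled."""
    if cancelled.is_set():
        raise RuntimeError("context canceled")
    return SECRETS[name]


def validate_hiding_cause(cancelled: threading.Event, name: str) -> str:
    # Drops the underlying error, so cancellation looks like a missing secret.
    try:
        get_secret(cancelled, name)
    except Exception:
        return f'failed to find TLS connection details k8s secret "{name}"'
    return "ok"


def validate_wrapping_cause(cancelled: threading.Event, name: str) -> str:
    # Appends the underlying error, so the real cause stays visible.
    try:
        get_secret(cancelled, name)
    except Exception as err:
        return f'failed to find TLS connection details k8s secret "{name}": {err}'
    return "ok"


ctx = threading.Event()
print(validate_hiding_cause(ctx, "ocs-kms-ca-secret-0rky0c"))
ctx.set()  # simulate a context canceled by an earlier reconcile
print(validate_hiding_cause(ctx, "ocs-kms-ca-secret-0rky0c"))
print(validate_wrapping_cause(ctx, "ocs-kms-ca-secret-0rky0c"))
```

In the sketch, only the wrapping variant reports "context canceled"; the hiding variant produces a message of the same shape as the one in the operator log even though the secret exists.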

Comment 10 Mudit Agarwal 2022-07-12 09:30:13 UTC
Sebastien/Rachael, is this a blocker for 4.11? If not, let's move it out of 4.11, as we are in a blocker-only phase now.

Comment 11 Sébastien Han 2022-07-12 09:44:17 UTC
I just pushed a new patch; it's still under review. We have a workaround, which is to restart the operator, but I'd prefer keeping this as a blocker.
Rachael? Thoughts?

Comment 13 Travis Nielsen 2022-07-18 17:24:50 UTC
https://github.com/rook/rook/pull/10523 has been merged upstream. I don't see how this relates to any UI or doc changes. I believe it's just a reconcile fix, right Seb? If there is no UI or doc change, I agree with Seb that we should merge it to 4.11. Rachael and Seb, could you clarify? Thanks!

Comment 14 Mudit Agarwal 2022-07-19 14:01:11 UTC
Yeah, I agree with Travis, and I am confused because it was acked by QE earlier. So, what's changed since then?

I only asked because we entered the blocker-only phase; otherwise, it was already approved (and fixed).

Comment 16 Sébastien Han 2022-07-21 08:06:38 UTC
Thanks Rachael.

Comment 17 Mudit Agarwal 2022-07-25 11:27:43 UTC
Moving this BZ to 4.12.
Rachael, can you please file a doc BZ for the same, since the first build of 4.12 will contain the fix?

Comment 21 Travis Nielsen 2022-07-25 21:51:15 UTC
Moving to MODIFIED since the Rook reconcile fix is in. Rachael, you'll open the UI and doc BZs? Thanks!

Comment 31 errata-xmlrpc 2023-01-31 00:19:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.12.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:0551