Description of problem (please be as detailed as possible and provide log snippets):

When trying to deploy an ODF 4.11 cluster with cluster-wide encryption enabled, using the Kubernetes authentication method for KMS with Vault namespaces, the deployment fails with the following error in the Rook operator logs:

2022-05-24 08:16:33.051746 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: failed to validate kms connection details: failed to validate vault connection details: failed to find TLS connection details k8s secret "ocs-kms-ca-secret-0rky0c"

The secret mentioned in the error message exists in the openshift-storage namespace:

$ oc get secret ocs-kms-ca-secret-0rky0c -n openshift-storage -o yaml
apiVersion: v1
data:
  cert: LS0tLS1CRUdJTiBDRVJUS....
kind: Secret
metadata:
  creationTimestamp: "2022-05-24T07:53:46Z"
  name: ocs-kms-ca-secret-0rky0c
  namespace: openshift-storage
  resourceVersion: "83446"
  uid: 11e6915a-6555-4fd3-87ac-bed5b6c39c7b
type: Opaque

$ oc get cm ocs-kms-connection-details -o yaml
apiVersion: v1
data:
  KMS_PROVIDER: vault
  KMS_SERVICE_NAME: vault
  VAULT_ADDR: https://vault-cluster.vault.2467e33a-73f9-408b-b9ff-b0476a654d30.aws.hashicorp.cloud:8200
  VAULT_AUTH_KUBERNETES_ROLE: odf-rook-ceph-op
  VAULT_AUTH_METHOD: kubernetes
  VAULT_AUTH_MOUNT_PATH: ""
  VAULT_BACKEND_PATH: rook
  VAULT_CACERT: ocs-kms-ca-secret-0rky0c
  VAULT_NAMESPACE: admin
  VAULT_TLS_SERVER_NAME: ""
kind: ConfigMap
metadata:
  creationTimestamp: "2022-05-24T07:53:46Z"
  name: ocs-kms-connection-details
  namespace: openshift-storage
  resourceVersion: "51613"
  uid: 2ba1ab36-3516-47be-a2ea-c9d4d4f57c56

Version of all relevant components (if applicable):
---------------------------------------------------
OCP: 4.11.0-0.nightly-2022-05-20-213928
ODF: odf-operator.v4.11.0 full_version=4.11.0-75

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, the deployment fails.

Is there any workaround available to the best of your knowledge?
Not that I am aware of.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
-------------------
1. Deploy an ODF cluster using the Kubernetes authentication method for Vault, where the auth method is enabled inside a Vault namespace. Follow the steps described here: https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html-single/deploying_openshift_data_foundation_using_amazon_web_services/index#enabling-cluster-wide-encryprtion-with-the-token-authentication-using-kms_cloud-storage

Actual results:
---------------
The deployment fails with the following error, even though the secret is present in the cluster:

2022-05-24 08:16:33.051746 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: failed to validate kms connection details: failed to validate vault connection details: failed to find TLS connection details k8s secret "ocs-kms-ca-secret-0rky0c"

Expected results:
-----------------
The deployment should succeed.

Additional info:
----------------
- This issue was not seen when the same Vault instance was used with the token authentication method and Vault namespaces; that deployment succeeded.
- Without Vault namespaces, the Kubernetes auth method worked for cluster-wide encryption.
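For reference on where the error comes from: the failing check is essentially a ConfigMap-to-Secret lookup, where Rook reads VAULT_CACERT from ocs-kms-connection-details and then fetches the Secret with that name. Below is a minimal sketch of that lookup using client-go; the function name and error strings are illustrative, not the actual Rook code.

package kms

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// validateVaultCACert is a hypothetical stand-in for Rook's KMS TLS
// validation: read VAULT_CACERT from the connection-details ConfigMap
// and verify that the referenced Secret exists.
func validateVaultCACert(ctx context.Context, cs kubernetes.Interface, ns string) error {
    cm, err := cs.CoreV1().ConfigMaps(ns).Get(ctx, "ocs-kms-connection-details", metav1.GetOptions{})
    if err != nil {
        return fmt.Errorf("failed to get kms connection details: %w", err)
    }
    secretName := cm.Data["VAULT_CACERT"]
    if secretName == "" {
        return nil // no CA certificate configured, nothing to check
    }
    // If ctx was already canceled, this Get fails even though the
    // Secret exists, which would surface as the "failed to find TLS
    // connection details k8s secret" error seen above.
    if _, err := cs.CoreV1().Secrets(ns).Get(ctx, secretName, metav1.GetOptions{}); err != nil {
        return fmt.Errorf("failed to find TLS connection details k8s secret %q: %w", secretName, err)
    }
    return nil
}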
Although the error is hidden, I suspect the context is still being canceled, which is why we fail to execute the command. We need to investigate that.
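If that suspicion is correct, the lookup error should wrap context.Canceled rather than a real NotFound from the API server. A small sketch (assuming the error comes from a client-go Get call) of how to tell the two cases apart:

package kms

import (
    "context"
    "errors"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// classifyGetError distinguishes a genuinely missing Secret from a
// lookup aborted by a canceled reconcile context.
func classifyGetError(ctx context.Context, err error) string {
    switch {
    case errors.Is(err, context.Canceled) || ctx.Err() != nil:
        return "lookup aborted: reconcile context was canceled"
    case apierrors.IsNotFound(err):
        return "secret is genuinely missing from the namespace"
    default:
        return "other API error"
    }
}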
Sebastien/Rachel, is this a blocker for 4.11? If not, let's move it out of 4.11, as we are in a blocker-only phase now.
I just pushed a new patch; it's still under review. We have a workaround, which is to restart the operator, but I'd prefer to keep this as a blocker. Rachel? Thoughts?
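For anyone hitting this before the fix ships: restarting the operator means deleting the rook-ceph-operator pod so its Deployment recreates it with a fresh context. Assuming the default operator labels, something like:

$ oc delete pod -n openshift-storage -l app=rook-ceph-operator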
https://github.com/rook/rook/pull/10523 has been merged upstream. I don't see how this relates to anything in the UI or doc changes; I believe it's just a reconcile fix, right Seb? If there is no UI or doc change, I agree with Seb that we should merge it to 4.11. Rachael and Seb, could you clarify? Thanks!
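For readers without access to the PR, the general shape of this class of reconcile fix is to scope API calls to the context of the current reconcile instead of a stored, long-lived context that an earlier reconcile may have canceled. The following is a generic sketch of that pattern only; it is not the contents of rook/rook#10523.

package kms

import (
    "context"

    ctrl "sigs.k8s.io/controller-runtime"
)

type clusterReconciler struct {
    // Anti-pattern: a context stored at controller setup time. Once it
    // is canceled (e.g. when a previous reconcile is aborted), every
    // later API call made with it fails with "context canceled".
    // staleCtx context.Context
}

// Reconcile uses the per-call ctx for all API access, so a canceled
// context from an earlier reconcile cannot leak into this one.
func (r *clusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    _ = ctx // pass this ctx down to validation and client calls
    return ctrl.Result{}, nil
}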
Yeah, I agree with Travis, and I am confused because it was acked by QE earlier. So what has changed since then? My question was only because we entered the blocker-only phase; otherwise, it was already approved (and fixed).
Thanks Rachael.
Moving this BZ to 4.12. Rachel, can you please file a doc BZ for this, since the first build of 4.12 will contain the fix?
Moving to MODIFIED since the Rook reconcile fix is in. Rachel, will you open the UI and doc BZs? Thanks!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.12.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:0551