Bug 2039240

Summary: [KMS] Deployment of ODF cluster fails when cluster wide encryption is enabled using service account for KMS auth
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Rachael <rgeorge>
Component: rook
Assignee: Sébastien Han <shan>
Status: CLOSED ERRATA
QA Contact: Rachael <rgeorge>
Severity: high
Priority: unspecified
Version: 4.10
CC: madam, muagarwa, nberry, ocs-bugs, odf-bz-bot, shan
Target Release: ODF 4.10.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.10.0-113
Doc Type: No Doc Update
Type: Bug
Last Closed: 2022-04-13 18:51:24 UTC

Description Rachael 2022-01-11 10:34:44 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

When cluster-wide encryption is enabled using a service account for KMS authentication, the OSD pods fail to come up and are stuck in the Init:CrashLoopBackOff state. The following error is seen in the logs:


$ oc logs rook-ceph-osd-0-7cd85d4c67-9dxvp -c encryption-kms-get-kek 
2022-01-11 08:57:56.688171 C | rookcmd: failed to get ceph cluster in namespace "openshift-storage": cephclusters.ceph.rook.io "openshift-storage" not found


$ oc get pods|grep osd
NAME                                                              READY   STATUS                  RESTARTS         AGE
rook-ceph-osd-0-7cd85d4c67-9dxvp                                  0/2     Init:CrashLoopBackOff   28 (2m46s ago)   120m
rook-ceph-osd-1-6699c6c4f7-26sml                                  0/2     Init:CrashLoopBackOff   28 (2m21s ago)   120m
rook-ceph-osd-2-547ffc96b9-t8v4s                                  0/2     Init:CrashLoopBackOff   28 (2m38s ago)   120m
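
For reference, the CephCluster CR that ODF creates is normally named after the StorageCluster (typically ocs-storagecluster-cephcluster), not after the namespace, so a lookup for a CephCluster named "openshift-storage" is expected to fail. The actual CR name can be checked with the command below (the default name mentioned here is an assumption, not taken from this report):

$ oc -n openshift-storage get cephcluster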



Version of all relevant components (if applicable):
---------------------------------------------------

ODF: odf-operator.v4.10.0      full_version=4.10.0-79
OCP: 4.10.0-0.nightly-2022-01-10-144202

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes, the deployment fails and the cluster is not ready to be used.

Is there any workaround available to the best of your knowledge?
Not that I am aware of

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Tried it once

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
No

Steps to Reproduce:
-------------------

1. Install the ODF operator

2. In the openshift-storage namespace, create a service account called odf-vault-auth
   # oc -n openshift-storage create serviceaccount odf-vault-auth

3. Create clusterrolebinding as shown below
   # oc -n openshift-storage create clusterrolebinding vault-tokenreview-binding --clusterrole=system:auth-delegator --serviceaccount=openshift-storage:odf-vault-auth

4. Get the token secret name from the service account and store it in the variable used in the next step
   # VAULT_SA_SECRET_NAME=$(oc -n openshift-storage get sa odf-vault-auth -o jsonpath="{.secrets[*]['name']}" | grep -o "[^ ]*-token-[^ ]*")

5. Get the Token and CA cert used to configure the kube auth in Vault
   # SA_JWT_TOKEN=$(oc -n openshift-storage get secret "$VAULT_SA_SECRET_NAME" -o jsonpath="{.data.token}" | base64 --decode; echo)
   # SA_CA_CRT=$(oc -n openshift-storage get secret "$VAULT_SA_SECRET_NAME" -o jsonpath="{.data['ca\.crt']}" | base64 --decode; echo)

6. Get the OCP endpoint and sa issuer
   # K8S_HOST=$(oc config view --minify --flatten -o jsonpath="{.clusters[0].cluster.server}")
   # issuer="$(oc get authentication.config cluster -o template="{{ .spec.serviceAccountIssuer }}")"

7. On the Vault node/pod, configure the Kubernetes auth method (the "odf" policy referenced by the roles below is sketched in the note after step 13)
   # vault auth enable kubernetes
   
   # vault write auth/kubernetes/config \
          token_reviewer_jwt="$SA_JWT_TOKEN" \
          kubernetes_host="$K8S_HOST" \
          kubernetes_ca_cert="$SA_CA_CRT" \
          issuer="$issuer"

   # vault write auth/kubernetes/role/odf-rook-ceph-op \
        bound_service_account_names=rook-ceph-system,rook-ceph-osd,noobaa \
        bound_service_account_namespaces=openshift-storage \
        policies=odf \
        ttl=1440h

   # vault write auth/kubernetes/role/odf-rook-ceph-osd \
        bound_service_account_names=rook-ceph-osd \
        bound_service_account_namespaces=openshift-storage \
        policies=odf \
        ttl=1440h

8. From the ODF management console, follow the steps to create the storagesystem.
9. On the Security and network page, click on "Enable data encryption for block and file storage"
10. Select "Cluster-wide encryption" as the encryption level and click on "Connect to an external key management service".
11. Set Authentication method to "Kubernetes" and fill out the rest of the details 
12. Review and create the storagesystem
13. Check the status of the OSD pods
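
   Note: the roles in step 7 reference a Vault policy named "odf" that is not created in these steps. A minimal sketch of the backend and policy, assuming a KV-v2 secrets engine mounted at odf/ (the path and capabilities below are assumptions, not taken from this report):

   # vault secrets enable -path=odf kv-v2
   # echo '
     path "odf/*" {
       capabilities = ["create", "read", "update", "delete", "list"]
     }
     path "sys/mounts" {
       capabilities = ["read"]
     }' | vault policy write odf -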


Actual results:
---------------
The OSD pods are in Init:CrashLoopBackOff state. 


Expected results:
-----------------
The deployment should be successful and the OSD pods should be up and running.
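
A quick way to confirm this once the fix is in place (the app=rook-ceph-osd label and the ocs-storagecluster name are the usual ODF defaults, assumed here rather than taken from this report):

$ oc -n openshift-storage get pods -l app=rook-ceph-osd
$ oc -n openshift-storage get storagecluster ocs-storagecluster -o jsonpath='{.spec.encryption}'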

Comment 3 Sébastien Han 2022-01-11 15:36:36 UTC
Will be in the next resync https://github.com/red-hat-storage/rook/pull/326

Comment 11 errata-xmlrpc 2022-04-13 18:51:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1372