Bug 2066289

Summary: [KMS] OSDs are up and running even after the TTL for the vault role has expired
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Rachael <rgeorge>
Component: rook
Assignee: Sébastien Han <shan>
Status: CLOSED NOTABUG
QA Contact: Neha Berry <nberry>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.10
CC: madam, mmuench, ocs-bugs, odf-bz-bot
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-18 09:21:02 UTC
Type: Bug

Description Rachael 2022-03-21 12:26:39 UTC
Description of problem (please be as detailed as possible and provide log snippets):
---------------------------------------------------------------------------------

When configuring the Kubernetes auth method for cluster-wide encryption using KMS, the TTL value for the Vault roles was set to 1h.

$ vault write auth/kubernetes/role/odf-role bound_service_account_names=rook-ceph-system,rook-ceph-osd,noobaa bound_service_account_namespaces=openshift-storage policies=odf ttl=1h
Success! Data written to: auth/kubernetes/role/odf-role

$ vault write auth/kubernetes/role/odf-rook-ceph-osd bound_service_account_names=rook-ceph-osd bound_service_account_namespaces=openshift-storage policies=odf ttl=1h
Success! Data written to: auth/kubernetes/role/odf-rook-ceph-osd
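
(For reference, the configured TTL can be confirmed by reading a role back; depending on the Vault version it is reported as ttl or token_ttl:)

$ vault read auth/kubernetes/role/odf-rook-ceph-osd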

$ oc get pods|grep osd
rook-ceph-osd-0-68d85f7ccd-mdvct                                  2/2     Running     0          75m
rook-ceph-osd-1-688fc5476-w4wt8                                   2/2     Running     0          75m
rook-ceph-osd-2-5555988759-ks8nl                                  2/2     Running     0          75m

After 1 hour, the OSD pods were re-spun. Since the TTL had expired, the OSD pods were not expected to come up, but no issues were observed: the pods were up and running after the re-spin.

$ oc get pods|grep osd
rook-ceph-osd-0-68d85f7ccd-6z8r9                                  2/2     Running     0          3m14s
rook-ceph-osd-1-688fc5476-b27kh                                   2/2     Running     0          3m4s
rook-ceph-osd-2-5555988759-rww2m                                  2/2     Running     0          2m50s

Deleting the OSD deployments also resulted in the creation of new OSD pods, which came up and were running (11:46:49.285985 in the rook operator logs).
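
(The deployments were deleted with a command along these lines; names taken from the pod list above:)

$ oc delete deployment -n openshift-storage rook-ceph-osd-0 rook-ceph-osd-1 rook-ceph-osd-2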

$ oc get pods|grep osd
rook-ceph-osd-0-7dfbc9dbbd-82vlv                                  2/2     Running     0          3m23s
rook-ceph-osd-1-85bcb496ff-rj5js                                  2/2     Running     0          2m17s
rook-ceph-osd-2-5595d4f9b5-7h7bs                                  2/2     Running     0          2m20s


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
OCP: 4.10.0-0.nightly-2022-03-19-230512
ODF: odf-operator.v4.10.0   full_version=4.10.0-199


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?
N/A

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes


If this is a regression, please provide more details to justify this: 
Not a regression


Steps to Reproduce:
-------------------
1. Configure Kubernetes auth with a 1-hour TTL for the Vault roles.
   Eg:

   $ vault write auth/kubernetes/role/odf-role bound_service_account_names=rook-ceph-system,rook-ceph-osd,noobaa bound_service_account_namespaces=openshift-storage policies=odf ttl=1h
   Success! Data written to: auth/kubernetes/role/odf-role

   $ vault write auth/kubernetes/role/odf-rook-ceph-osd bound_service_account_names=rook-ceph-osd bound_service_account_namespaces=openshift-storage policies=odf ttl=1h
   Success! Data written to: auth/kubernetes/role/odf-rook-ceph-osd

2. Deploy an ODF cluster with cluster-wide encryption enabled using KMS
3. After the TTL expires, respin the OSD pods
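   Eg (illustrative; assumes the default openshift-storage namespace and the standard app=rook-ceph-osd label):

   $ oc delete pod -n openshift-storage -l app=rook-ceph-osd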


Actual results:
---------------
New OSD pods come up without any failures or errors, even after the TTL has expired.


Expected results:
-----------------
If the TTL has expired, the OSD pods shouldn't be able to come up.


Additional info:
----------------

Deleting the role in Vault and restarting the OSD pods resulted in the pods going into Init:CrashLoopBackOff.
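
(For reference, this was done along these lines, using the role created above:)

$ vault delete auth/kubernetes/role/odf-rook-ceph-osd
$ oc delete pod -n openshift-storage -l app=rook-ceph-osd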

Comment 3 Sébastien Han 2022-03-28 12:33:49 UTC
I think the TTL applies to a given internal token, which is renewed every hour, so I don't think we should expect any failures after an hour. Failures would only occur if we kept using an outdated token, but that's not the case: internally, the newly generated token is used.

Comment 4 Sébastien Han 2022-05-18 09:21:02 UTC
Ok, after some more digging: the TTL we set applies to the role's token, which is automatically renewed after it expires. The token for that role is valid for an hour and is generated upon the initial OSD request.
So when Rook authenticates with Vault through the Kubernetes service account, the client (the Rook library) is given a token that is valid for an hour.
When we create a new OSD, a new authentication is performed and a new token is delivered, again valid for an hour.
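
To illustrate (not captured from this cluster): the Kubernetes auth login exchanges the pod's service account JWT for a Vault client token whose duration comes from the role's TTL, so every fresh login yields a new one-hour token:

$ vault write auth/kubernetes/login role=odf-rook-ceph-osd \
    jwt=@/var/run/secrets/kubernetes.io/serviceaccount/token

The response includes a client token with a token_duration of 1h; an expired token would only matter if a client kept reusing it instead of logging in again.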

I'm closing this now as it's not a bug but more a question about the internals.
Thanks.