Bug 2066289 - [KMS] OSDs are up and running even after the TTL for the vault role has expired
Summary: [KMS] OSDs are up and running even after the TTL for the vault role has expired
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Sébastien Han
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-21 12:26 UTC by Rachael
Modified: 2023-08-09 17:03 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-18 09:21:02 UTC
Embargoed:



Description Rachael 2022-03-21 12:26:39 UTC
Description of problem (please be detailed as possible and provide log snippets):
---------------------------------------------------------------------------------

When configuring the Kubernetes auth method for clusterwide encryption using KMS, the TTL for the Vault roles was set to 1h.

$ vault write auth/kubernetes/role/odf-role bound_service_account_names=rook-ceph-system,rook-ceph-osd,noobaa bound_service_account_namespaces=openshift-storage policies=odf ttl=1h
Success! Data written to: auth/kubernetes/role/odf-role

$ vault write auth/kubernetes/role/odf-rook-ceph-osd bound_service_account_names=rook-ceph-osd bound_service_account_namespaces=openshift-storage policies=odf ttl=1h
Success! Data written to: auth/kubernetes/role/odf-rook-ceph-osd
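
To confirm the TTL that actually applied, the role configuration can be read back (hedged example; depending on the Vault version the field is reported as ttl or token_ttl):

$ vault read auth/kubernetes/role/odf-rook-ceph-osd
# Among other fields, this reports the bound service accounts and a
# token_ttl (or ttl) of 1h for the configuration above.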

$ oc get pods|grep osd
rook-ceph-osd-0-68d85f7ccd-mdvct                                  2/2     Running     0          75m
rook-ceph-osd-1-688fc5476-w4wt8                                   2/2     Running     0          75m
rook-ceph-osd-2-5555988759-ks8nl                                  2/2     Running     0          75m

After 1 hour, the OSD pods were re-spun. Since the TTL had expired, the OSD pods were not expected to come up, but no issues were observed: the pods were up and running after the re-spin.

$ oc get pods|grep osd
rook-ceph-osd-0-68d85f7ccd-6z8r9                                  2/2     Running     0          3m14s
rook-ceph-osd-1-688fc5476-b27kh                                   2/2     Running     0          3m4s
rook-ceph-osd-2-5555988759-rww2m                                  2/2     Running     0          2m50s
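
For reference, the re-spin amounted to deleting the running OSD pods so their deployments recreate them (hedged sketch; the label selector is an assumption based on the standard Rook pod labels):

$ oc delete pod -n openshift-storage -l app=rook-ceph-osd
# The rook-ceph-osd-* deployments immediately create replacement pods, which
# re-authenticate against Vault through the rook-ceph-osd service account.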

Deleting the OSD deployments also resulted in the creation of new OSD pods, which came up and ran without errors (see 11:46:49.285985 in the rook operator logs).

$ oc get pods|grep osd
rook-ceph-osd-0-7dfbc9dbbd-82vlv                                  2/2     Running     0          3m23s
rook-ceph-osd-1-85bcb496ff-rj5js                                  2/2     Running     0          2m17s
rook-ceph-osd-2-5595d4f9b5-7h7bs                                  2/2     Running     0          2m20s
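
The deployment deletion was along these lines (hedged sketch; deployment names inferred from the pod names above). The rook operator reconciles the CephCluster, recreates the OSD deployments, and each new pod performs a fresh Vault login:

$ oc delete deployment -n openshift-storage rook-ceph-osd-0 rook-ceph-osd-1 rook-ceph-osd-2
# The operator notices the missing deployments during reconciliation and
# recreates them (see 11:46:49.285985 in the rook operator logs).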


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
OCP: 4.10.0-0.nightly-2022-03-19-230512
ODF: odf-operator.v4.10.0   full_version=4.10.0-199


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
N/A

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes


If this is a regression, please provide more details to justify this: 
Not a regression


Steps to Reproduce:
-------------------
1. Configure the Kubernetes auth method with a 1-hour TTL for the Vault roles.
   E.g.:

   $ vault write auth/kubernetes/role/odf-role bound_service_account_names=rook-ceph-system,rook-ceph-osd,noobaa bound_service_account_namespaces=openshift-storage policies=odf ttl=1h
   Success! Data written to: auth/kubernetes/role/odf-role

   $ vault write auth/kubernetes/role/odf-rook-ceph-osd bound_service_account_names=rook-ceph-osd bound_service_account_namespaces=openshift-storage policies=odf ttl=1h
   Success! Data written to: auth/kubernetes/role/odf-rook-ceph-osd

2. Deploy an ODF cluster with clusterwide encryption enabled using KMS
3. After the TTL expires, respin the OSD pods


Actual results:
---------------
New OSD pods come up without any failures or errors, even after the TTL has expired


Expected results:
-----------------
If the TTL has expired, the OSD pods shouldn't be able to come up.


Additional info:
----------------

Deleting the role in Vault and restarting the OSD pods resulted in the pods going into Init:CrashLoopBackOff
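
For comparison, that negative test looks roughly like this (hedged sketch; role name as configured above, label selector assumed from the standard Rook pod labels):

$ vault delete auth/kubernetes/role/odf-rook-ceph-osd
$ oc delete pod -n openshift-storage -l app=rook-ceph-osd
# With no role left to log in against, the OSD init container cannot fetch its
# encryption key from Vault and the pods stay in Init:CrashLoopBackOff.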

Comment 3 Sébastien Han 2022-03-28 12:33:49 UTC
I think the TTL applies to a given internal token, which is renewed every hour, so I don't think we should expect any failures after an hour. Failures would only happen if we kept using an outdated token, but that's not the case; internally the newly generated one is used.
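
If it helps to check this, the remaining lifetime of the token a client currently holds can be inspected with that token exported as VAULT_TOKEN (hedged example):

$ vault token lookup
# The output includes ttl (time remaining), creation_ttl and renewable.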

Comment 4 Sébastien Han 2022-05-18 09:21:02 UTC
Ok, after some more digging: the TTL we set applies to the role's token, which is automatically renewed after it expires. The token for that role is valid for an hour and is generated upon the initial OSD request.
So when Rook authenticates with Vault through the Kubernetes service account, the client (the Rook library) is given a token that is valid for an hour.
When we create a new OSD, a new authentication is performed and a new token is delivered, again valid for an hour.
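
In other words, every new OSD triggers roughly this login exchange against the Kubernetes auth mount (hedged sketch; role name and mount path follow the configuration above, and the JWT is the pod's projected service account token):

$ vault write auth/kubernetes/login \
      role=odf-rook-ceph-osd \
      jwt="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
# Vault validates the service account token and returns a fresh client token
# whose 1h TTL starts counting from this login, not from role creation.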

I'm closing this now as it's not a bug but more a question about the internals.
Thanks.

