Description of problem (please be detailed as possible and provide log snippets):
---------------------------------------------------------------------------------
When configuring the kubernetes auth method for clusterwide encryption using KMS, the TTL value for the roles was set to 1h.

$ vault write auth/kubernetes/role/odf-role bound_service_account_names=rook-ceph-system,rook-ceph-osd,noobaa bound_service_account_namespaces=openshift-storage policies=odf ttl=1h
Success! Data written to: auth/kubernetes/role/odf-role

$ vault write auth/kubernetes/role/odf-rook-ceph-osd bound_service_account_names=rook-ceph-osd bound_service_account_namespaces=openshift-storage policies=odf ttl=1h
Success! Data written to: auth/kubernetes/role/odf-rook-ceph-osd

$ oc get pods|grep osd
rook-ceph-osd-0-68d85f7ccd-mdvct   2/2   Running   0   75m
rook-ceph-osd-1-688fc5476-w4wt8    2/2   Running   0   75m
rook-ceph-osd-2-5555988759-ks8nl   2/2   Running   0   75m

After 1 hour, the OSD pods were re-spun. Since the TTL had expired, the OSD pods were not expected to come up, but no issues were observed. The pods were up and running after the re-spin.

$ oc get pods|grep osd
rook-ceph-osd-0-68d85f7ccd-6z8r9   2/2   Running   0   3m14s
rook-ceph-osd-1-688fc5476-b27kh    2/2   Running   0   3m4s
rook-ceph-osd-2-5555988759-rww2m   2/2   Running   0   2m50s

Deleting the OSD deployments also resulted in the creation of new OSD pods, which came up and ran without issue (11:46:49.285985 in the rook operator logs).

$ oc get pods|grep osd
rook-ceph-osd-0-7dfbc9dbbd-82vlv   2/2   Running   0   3m23s
rook-ceph-osd-1-85bcb496ff-rj5js   2/2   Running   0   2m17s
rook-ceph-osd-2-5595d4f9b5-7h7bs   2/2   Running   0   2m20s

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
OCP: 4.10.0-0.nightly-2022-03-19-230512
ODF: odf-operator.v4.10.0  full_version=4.10.0-199

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)? Is there any workaround available to the best of your knowledge?
N/A

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
Not a regression

Steps to Reproduce:
-------------------
1. Configure kubernetes auth with a 1 hour TTL for the Vault roles. E.g.:

$ vault write auth/kubernetes/role/odf-role bound_service_account_names=rook-ceph-system,rook-ceph-osd,noobaa bound_service_account_namespaces=openshift-storage policies=odf ttl=1h
Success! Data written to: auth/kubernetes/role/odf-role

$ vault write auth/kubernetes/role/odf-rook-ceph-osd bound_service_account_names=rook-ceph-osd bound_service_account_namespaces=openshift-storage policies=odf ttl=1h
Success! Data written to: auth/kubernetes/role/odf-rook-ceph-osd

2. Deploy an ODF cluster with clusterwide encryption enabled using KMS
3. After the TTL expires, re-spin the OSD pods

Actual results:
---------------
New OSD pods come up without any failure or errors, even after the TTL expires.

Expected results:
-----------------
If the TTL has expired, the OSD pods shouldn't be able to come up.

Additional info:
----------------
Deleting the role in Vault and restarting the OSD pods resulted in the pods going into Init:CrashLoopBackOff.
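For reference, a minimal sketch of how the role TTL and the login flow can be inspected manually with the Vault CLI. It assumes VAULT_ADDR and a valid VAULT_TOKEN are already exported; the OSD pod name is just an example taken from the output above, and the container name "osd" is an assumption:

# Show the role definition, including its token TTL (expected to report 1h)
$ vault read auth/kubernetes/role/odf-rook-ceph-osd

# Grab the rook-ceph-osd service account JWT from a running OSD pod
$ SA_JWT=$(oc -n openshift-storage exec rook-ceph-osd-0-68d85f7ccd-mdvct -c osd -- \
    cat /var/run/secrets/kubernetes.io/serviceaccount/token)

# Log in against the role (comparable to what the operator does); Vault returns a fresh client token with a 1h TTL
$ vault write auth/kubernetes/login role=odf-rook-ceph-osd jwt="$SA_JWT"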
I think the TTL applies to a given internal token, which is renewed every hour, so I don't think we should expect any failures after an hour. Failures would only occur if we used an outdated token, but that's not the case; internally, the newly generated one is used.
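If it helps to confirm whether the token is being renewed or simply re-issued, the remaining TTL of whatever client token is in use can be checked directly. A sketch, assuming the token Rook obtained is exported as ROOK_TOKEN (that variable is purely illustrative; in practice the token lives inside the Rook library's Vault client):

# Reports creation_ttl, ttl (time remaining) and whether the token is renewable
$ VAULT_TOKEN="$ROOK_TOKEN" vault token lookup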
OK, after some more digging: the TTL we set applies to the role's token, which is effectively renewed when it expires (a new login yields a new token). The token for that role is valid for an hour and is generated upon the initial OSD request. So when Rook authenticates with Vault through the Kubernetes service account, the client (the Rook library) is given a token that is valid for an hour. When we create a new OSD, a new authentication is performed and a new token is delivered, again valid for an hour. I'm closing this as it's not a bug but rather a question about the internals. Thanks.
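To tie this back to the "Additional info" above: because each new OSD performs a fresh login against the role, the only thing that breaks new pods is removing the role itself, not the expiry of a previously issued token. A sketch of that login at the Vault HTTP API level, assuming VAULT_ADDR and SA_JWT as in the earlier example:

# Each OSD start triggers a login like this and receives a brand new client token (1h TTL)
$ curl -s --request POST \
    --data "{\"role\": \"odf-rook-ceph-osd\", \"jwt\": \"$SA_JWT\"}" \
    "$VAULT_ADDR/v1/auth/kubernetes/login"

# Once the role is deleted, the same login is rejected, which is what surfaces as
# Init:CrashLoopBackOff on the new OSD pods
$ vault delete auth/kubernetes/role/odf-rook-ceph-osd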