Description of problem: Users should be notified if KMS integration is configured but the connection to the KMS server is lost.
This should manifest itself in Ceph alerts, with OSDs failing at their next start. Do you at least see Ceph health warnings if you try to restart an OSD pod while in this state?
Just to clarify, if the connection to Vault is lost then the following operations will fail:
* maintenance: when you evict pods to new nodes, OSDs won't start and fail on the "encryption-kms-get-kek" init container
* disk replacement: same error as above
If you *only* restart the OSD pod while Vault is down, it will succeed.
(In reply to Travis Nielsen from comment #3)
> This should manifest itself in ceph alerts with OSDs failing at their next start. Do you at least see ceph health warnings if you try to restart an OSD pod when in this state?

As written by Sébastien: if KMS is unavailable and an OSD is restarted, the OSD loads correctly and no new alert is generated. When I tried to deploy an OSD on a different node, the OSD pod was not created and the following alerts appeared:
* KubeDeploymentReplicasMismatch
* KubePodNotReady
* CephClusterWarningState
* CephDataRecoveryTakingTooLong
* CephOSDDiskNotResponding
Honestly, I don't see how we could. "encryption-kms-get-kek" going into CrashLoopBackOff or exiting with 1 generates an event, so can't we watch for that event and react to it?
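For illustration, a rough sketch (not a merged proposal) of what "watch the event and react" could look like with client-go: list warning events for pods in the storage namespace and flag the ones that point at the encryption-kms-get-kek init container. The openshift-storage namespace and the BackOff reason filter are assumptions.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Warning events in the storage namespace; BackOff covers CrashLoopBackOff
	// of the init container, though other reasons could also be relevant.
	events, err := cs.CoreV1().Events("openshift-storage").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "type=Warning,reason=BackOff",
	})
	if err != nil {
		panic(err)
	}
	for _, ev := range events.Items {
		// The event's involved object field path names the failing container,
		// e.g. spec.initContainers{encryption-kms-get-kek}.
		if ev.InvolvedObject.FieldPath == "spec.initContainers{encryption-kms-get-kek}" {
			fmt.Printf("KMS init container failing on %s: %s\n", ev.InvolvedObject.Name, ev.Message)
		}
	}
}
```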
We can't do it from the monitoring side unless there is a metric available in Prometheus. Is there a way to identify this reliably so that we can push this data point to Prometheus? Otherwise it's a no-go from the monitoring perspective.
This needs an engineering solution; I can't really help here.
Nishanth, if we set the KMS connection status in the CephCluster CR, can you read it?
Yes, we can. The OCS exporter can read it, push it to Prometheus, and generate alerts. If that's the case, can I move this BZ back to you? Or do you want to track this as a separate BZ?
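As a minimal sketch (not the actual ocs-operator exporter code), this is roughly how the metrics exporter could surface a KMS connection status read from the CR so Prometheus can alert on it. The metric name and the 0/1 convention are assumptions for illustration; in the real exporter the value would come from the CR status rather than being hard-coded.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical gauge: 0 when the KMS server is reachable, 1 when the connection is lost.
var kmsConnectionStatus = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "ocs_storagecluster_kms_connection_status",
	Help: "0 when the KMS server is reachable, 1 when the connection is lost.",
})

func main() {
	prometheus.MustRegister(kmsConnectionStatus)

	// Placeholder: pretend the KMS connection is currently down. In practice this
	// would be set from the StorageCluster (or CephCluster) status on each reconcile.
	kmsConnectionStatus.Set(1)

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```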
OK, cool. Actually, I think it makes more sense to do it in the StorageCluster CR, since KMS is used not only by Rook but also by NooBaa and Ceph-CSI. Moving to ocs-op.
I agree that something should be done in ocs-operator. Specifically, in the metrics exporter in the ocs-operator repo. That said, this is not a high priority issue, so moving to ODF 4.9.
Will be targeting this for 4.10.
What's the current status on this one? Is it still on for 4.10?
I'm working to get the CI fixed right now so that we can get the PR reviewed: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/red-hat-storage_ocs-operator/1419/pull-ci-red-hat-storage-ocs-operator-main-red-hat-storage-ocs-ci-e2e-aws/1488267652865462272#1:build-log.txt%3A259
No RFEs in 4.10 at this point in time; moving it to 4.11.
(In reply to Pranshu Srivastava from comment #16)
> I'm working to get the CI fixed right now so that we can get the PR reviewed:
> https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/red-hat-storage_ocs-operator/1419/pull-ci-red-hat-storage-ocs-operator-main-red-hat-storage-ocs-ci-e2e-aws/1488267652865462272#1:build-log.txt%3A259

Any updates?
Providing QA ack and QE assignment after a bug triage.
Pranshu, do we have the alert already, or does this still need to be worked on?
Version 4.13.0-164 includes an alert, KMSServerConnectionAlert, with the description `Storage Cluster KMS Server is in un-connected state for more than 5s. Please check KMS config.` and the message `Storage Cluster KMS Server is in un-connected state. Please check KMS config.`
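For reference, a minimal sketch of what such a rule entry could look like using the prometheus-operator monitoring/v1 Go types, assuming the exporter publishes the gauge named above where a non-zero value means "not connected". This is not the shipped rule; the actual definition lives in the ocs-operator manifests.

```go
package main

import (
	"fmt"

	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func kmsAlertRule() monitoringv1.Rule {
	return monitoringv1.Rule{
		Alert: "KMSServerConnectionAlert",
		// Hypothetical expression: fire while the connection-status gauge is non-zero.
		Expr:   intstr.FromString(`ocs_storagecluster_kms_connection_status != 0`),
		Labels: map[string]string{"severity": "warning"},
		Annotations: map[string]string{
			"description": "Storage Cluster KMS Server is in un-connected state for more than 5s. Please check KMS config.",
			"message":     "Storage Cluster KMS Server is in un-connected state. Please check KMS config.",
		},
		// The real rule would also carry a `for: 5s` clause so the alert only fires
		// once the condition has persisted; it is omitted here because the Go field
		// type for that clause varies across prometheus-operator versions.
	}
}

func main() {
	fmt.Println(kmsAlertRule().Alert)
}
```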
The alert is raised when an unreachable address is set in the ocs-kms-connection-details config map. --> VERIFIED

During testing, a problem with resolving the alert was found, as described in: https://bugzilla.redhat.com/show_bug.cgi?id=2192852

Tested with:
OCS 4.13.0-179
OCP 4.13.0-0.nightly-2023-05-02-134729
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742