Bug 1944687 - [RFE] KMS server connection lost alert
Summary: [RFE] KMS server connection lost alert
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.13.0
Assignee: Nikhil Ladha
QA Contact: Filip Balák
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-30 13:48 UTC by Filip Balák
Modified: 2023-08-09 17:00 UTC
CC List: 12 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-21 15:22:14 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage ocs-operator pull 1419 0 None Merged Notify if KMS connection to the server is lost 2022-06-01 08:23:09 UTC
Github red-hat-storage ocs-operator pull 1923 0 None Merged Add KMS alert 2023-03-06 12:53:52 UTC
Red Hat Bugzilla 2192852 0 unspecified CLOSED KMSServerConnectionAlert is not cleared when connection to kms is restored 2023-11-08 18:50:22 UTC
Red Hat Product Errata RHBA-2023:3742 0 None None None 2023-06-21 15:22:53 UTC

Description Filip Balák 2021-03-30 13:48:14 UTC
Description of problem:
Users should be notified if KMS integration is set but the connection to KMS server is lost.

Comment 3 Travis Nielsen 2021-03-31 05:43:32 UTC
This should manifest itself in ceph alerts with OSDs failing at their next start. Do you at least see ceph health warnings if you try to restart an OSD pod when in this state?

Comment 4 Sébastien Han 2021-04-06 09:11:33 UTC
Just to clarify: if the connection to Vault is lost, the following operations will fail:

* maintenance: when pods are evicted to new nodes, the OSDs won't start and fail in the "encryption-kms-get-kek" init container
* disk replacement: same error as above

If you *only* restart the OSD pod while Vault is down, it will still succeed.
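
For context, a minimal sketch of the kind of reachability probe this implies, assuming the official HashiCorp Vault Go client (github.com/hashicorp/vault/api); the function name and address are illustrative, not actual Rook or ocs-operator code:

package main

import (
	"fmt"
	"log"

	vault "github.com/hashicorp/vault/api"
)

// checkVaultConnection probes the Vault health endpoint. It is an
// illustrative stand-in for the kind of periodic check an operator would
// need in order to notice a lost KMS connection before the next OSD start.
func checkVaultConnection(addr string) error {
	cfg := vault.DefaultConfig()
	cfg.Address = addr

	client, err := vault.NewClient(cfg)
	if err != nil {
		return fmt.Errorf("creating vault client: %w", err)
	}

	health, err := client.Sys().Health()
	if err != nil {
		return fmt.Errorf("vault unreachable: %w", err)
	}
	if health.Sealed {
		return fmt.Errorf("vault at %s is reachable but sealed", addr)
	}
	return nil
}

func main() {
	// The address is a placeholder; in practice it would come from the
	// KMS connection details configured for the storage cluster.
	if err := checkVaultConnection("https://vault.example.com:8200"); err != nil {
		log.Printf("KMS connection problem: %v", err)
	}
}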

Comment 5 Filip Balák 2021-04-12 13:27:34 UTC
(In reply to Travis Nielsen from comment #3)
> This should manifest itself in ceph alerts with OSDs failing at their next
> start. Do you at least see ceph health warnings if you try to restart an OSD
> pod when in this state?

As Sébastien wrote: if KMS is unavailable and an OSD is restarted, the OSD loads correctly and no new alert is generated.

When I tried to deploy an OSD on a different node, the OSD pod was not created and the following alerts appeared:
 * KubeDeploymentReplicasMismatch
 * KubePodNotReady
 * CephClusterWarningState
 * CephDataRecoveryTakingTooLong
 * CephOSDDiskNotResponding

Comment 7 Sébastien Han 2021-04-26 08:14:19 UTC
I honestly don't see how we could. A CrashLoopBackOff or exit 1 of the "encryption-kms-get-kek" init container generates an event, so can't we watch that event and react to it?
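
Purely as an illustration of that suggestion, a rough sketch of what watching for such events could look like with client-go; the namespace, field selector, and matching logic are assumptions, and this is not the approach that was ultimately taken:

package main

import (
	"context"
	"log"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the watcher runs inside the cluster with RBAC to watch events.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Watch Warning events in the storage namespace (name assumed here) and
	// look for failures of the "encryption-kms-get-kek" init container.
	watcher, err := clientset.CoreV1().Events("openshift-storage").Watch(context.TODO(),
		metav1.ListOptions{FieldSelector: "type=Warning"})
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Stop()

	for we := range watcher.ResultChan() {
		event, ok := we.Object.(*corev1.Event)
		if !ok {
			continue
		}
		if event.Reason == "BackOff" && strings.Contains(event.Message, "encryption-kms-get-kek") {
			// React here: surface a notification or set a status condition.
			log.Printf("KEK fetch failing on %s/%s: %s",
				event.InvolvedObject.Namespace, event.InvolvedObject.Name, event.Message)
		}
	}
}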

Comment 8 Nishanth Thomas 2021-04-29 09:45:24 UTC
We can't do it from the monitoring side unless there is a metric available in Prometheus. Is there a way to identify this reliably so that we can push this data point to Prometheus? Otherwise it's a no-go from the monitoring perspective.

Comment 9 Eran Tamir 2021-04-29 09:50:47 UTC
This needs an engineering solution. I can't really help here.

Comment 10 Sébastien Han 2021-04-29 11:04:54 UTC
Nishanth, if we set the KMS connection status into the CephCluster CR, can you read it?

Comment 11 Nishanth Thomas 2021-05-04 15:30:01 UTC
Yes, we can. The OCS exporter can read it, push it to Prometheus, and generate alerts.
If that's the case, can I move this BZ back to you? Or do you want to track this as a separate BZ?

Comment 12 Sébastien Han 2021-05-04 15:41:15 UTC
OK, cool. Actually, I think it makes more sense to do it in the StorageCluster CR, since KMS is used not only by Rook but also by NooBaa and Ceph-CSI.
Moving to ocs-operator.
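
A minimal sketch of how a metrics exporter could surface such a status as a Prometheus gauge, using client_golang; the metric name, the 0/1 encoding, and the listen port are assumptions for the example, not necessarily what the ocs-operator exporter ships:

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// kmsConnectionStatus is an illustrative gauge: 0 = connected, 1 = connection
// error. The metric name and encoding are assumptions for this sketch.
var kmsConnectionStatus = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "ocs_storagecluster_kms_connection_status",
	Help: "0 when the KMS server is reachable, 1 when the connection is lost.",
})

func main() {
	prometheus.MustRegister(kmsConnectionStatus)

	// In a real exporter this value would be derived from the KMS connection
	// status reported in the StorageCluster CR and refreshed on every scrape;
	// here it is simply set once for demonstration.
	kmsConnectionStatus.Set(1)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}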

Comment 13 Jose A. Rivera 2021-06-07 16:20:59 UTC
I agree that something should be done in ocs-operator. Specifically, in the metrics exporter in the ocs-operator repo.

That said, this is not a high-priority issue, so moving to ODF 4.9.

Comment 14 Sanjal Katiyar 2021-08-19 11:47:01 UTC
Will be targeting this for 4.10.

Comment 15 Mudit Agarwal 2022-01-31 17:48:14 UTC
What's the current status on this one? Is it still on for 4.10?

Comment 17 Mudit Agarwal 2022-02-16 03:56:30 UTC
No RFEs in 4.10 at this point in time; moving it to 4.11.

Comment 18 Yaniv Kaul 2022-05-31 09:24:15 UTC
(In reply to Pranshu Srivastava from comment #16)
> I'm working to get the CI fixed right now so that we can get the PR reviewed:
> https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/red-hat-storage_ocs-operator/1419/pull-ci-red-hat-storage-ocs-operator-main-red-hat-storage-ocs-ci-e2e-aws/1488267652865462272#1:build-log.txt%3A259

Any updates?

Comment 19 Martin Bukatovic 2022-05-31 15:19:08 UTC
Providing QA ack and QE assignment after a bug triage.

Comment 29 Mudit Agarwal 2022-10-31 03:24:53 UTC
Pranshu, do we have the alert already, or does this still need to be worked on?

Comment 50 Filip Balák 2023-05-02 12:05:38 UTC
Version 4.13.0-164 contains an alert named KMSServerConnectionAlert with the description `Storage Cluster KMS Server is in un-connected state for more than 5s. Please check KMS config.` and the message `Storage Cluster KMS Server is in un-connected state. Please check KMS config.`
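
For illustration, a sketch of a PrometheusRule (expressed with the prometheus-operator Go API types) that would match the alert text above; the metric name in the expression, the object name, and the severity label are assumptions, and the rule actually shipped with ODF may differ:

package rules

import (
	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// kmsServerConnectionRule sketches a PrometheusRule matching the alert text
// above. A `for: 5s` clause on the rule would correspond to the
// "for more than 5s" wording in the description.
var kmsServerConnectionRule = &monitoringv1.PrometheusRule{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "ocs-kms-alert-rules",
		Namespace: "openshift-storage",
	},
	Spec: monitoringv1.PrometheusRuleSpec{
		Groups: []monitoringv1.RuleGroup{{
			Name: "kms-alert.rules",
			Rules: []monitoringv1.Rule{{
				Alert: "KMSServerConnectionAlert",
				Expr:  intstr.FromString("ocs_storagecluster_kms_connection_status == 1"),
				Labels: map[string]string{
					"severity": "warning",
				},
				Annotations: map[string]string{
					"description": "Storage Cluster KMS Server is in un-connected state for more than 5s. Please check KMS config.",
					"message":     "Storage Cluster KMS Server is in un-connected state. Please check KMS config.",
				},
			}},
		}},
	},
}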

Comment 51 Filip Balák 2023-05-03 10:37:37 UTC
The alert is raised when an unreachable address is set in the ocs-kms-connection-details config map. --> VERIFIED
During testing, a problem with resolving the alert was found, as described in: https://bugzilla.redhat.com/show_bug.cgi?id=2192852

Tested with:
OCS 4.13.0-179
OCP 4.13.0-0.nightly-2023-05-02-134729

Comment 52 errata-xmlrpc 2023-06-21 15:22:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742

