Description of problem: Users should be notified if KMS integration is configured but the connection to the KMS server is lost.
This should manifest itself in Ceph alerts, with OSDs failing at their next start. Do you at least see Ceph health warnings if you try to restart an OSD pod while in this state?
Just to clarify, if the connection to Vault is lost then the following operations will fail:
* maintenance: when you evict pods to new nodes, OSDs won't start and fail on the "encryption-kms-get-kek" init container
* disk replacement: same error as above
If you *only* restart the OSD pod while Vault is down, it will succeed.
(In reply to Travis Nielsen from comment #3)
> This should manifest itself in ceph alerts with OSDs failing at their next start. Do you at least see ceph health warnings if you try to restart an OSD pod when in this state?

As written by Sébastien: if KMS is unavailable and an OSD is restarted, the OSD loads correctly and no new alert is generated. When I tried to deploy an OSD on a different node, the OSD pod was not created and the following alerts appeared:
* KubeDeploymentReplicasMismatch
* KubePodNotReady
* CephClusterWarningState
* CephDataRecoveryTakingTooLong
* CephOSDDiskNotResponding
Honestly, I don't see how we could. "encryption-kms-get-kek" going into CrashLoopBackOff or exiting with 1 generates an event, so can't we watch for that event and react to it?
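For illustration, a rough sketch (not a merged proposal) of what "watch the event and react" could look like with client-go: list warning events for pods in the storage namespace and flag the ones that point at the encryption-kms-get-kek init container. The openshift-storage namespace and the BackOff reason filter are assumptions.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Warning events in the storage namespace; BackOff covers CrashLoopBackOff
	// of the init container, though other reasons could also be relevant.
	events, err := cs.CoreV1().Events("openshift-storage").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "type=Warning,reason=BackOff",
	})
	if err != nil {
		panic(err)
	}
	for _, ev := range events.Items {
		// The event's involved object field path names the failing container,
		// e.g. spec.initContainers{encryption-kms-get-kek}.
		if ev.InvolvedObject.FieldPath == "spec.initContainers{encryption-kms-get-kek}" {
			fmt.Printf("KMS init container failing on %s: %s\n", ev.InvolvedObject.Name, ev.Message)
		}
	}
}
```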
We can't do it from the monitoring side unless there is a metric available in Prometheus. Is there a way to identify this reliably so that we can push this data point to Prometheus? Otherwise it's a no-go from the monitoring perspective.
This needs an engineering solution; I can't really help here.
Nishanth, if we set the KMS connection status in the CephCluster CR, can you read it?
Yes, we can. The OCS exporter can read it, push it to Prometheus, and generate alerts. If that's the case, can I move this BZ back to you? Or do you want to track this as a separate BZ?
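As a minimal sketch (not the actual ocs-operator exporter code), this is roughly how the metrics exporter could surface a KMS connection status read from the CR so Prometheus can alert on it. The metric name and the 0/1 convention are assumptions for illustration; in the real exporter the value would come from the CR status rather than being hard-coded.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical gauge: 0 when the KMS server is reachable, 1 when the connection is lost.
var kmsConnectionStatus = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "ocs_storagecluster_kms_connection_status",
	Help: "0 when the KMS server is reachable, 1 when the connection is lost.",
})

func main() {
	prometheus.MustRegister(kmsConnectionStatus)

	// Placeholder: pretend the KMS connection is currently down. In practice this
	// would be set from the StorageCluster (or CephCluster) status on each reconcile.
	kmsConnectionStatus.Set(1)

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```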
OK, cool. Actually, I think it makes more sense to do it in the StorageCluster CR, since KMS is used not only by Rook but also by NooBaa and Ceph-CSI. Moving to ocs-op.
I agree that something should be done in ocs-operator. Specifically, in the metrics exporter in the ocs-operator repo. That said, this is not a high priority issue, so moving to ODF 4.9.
Will be targeting this for 4.10.
What's the current status on this one? Is it still on for 4.10?
I'm working to get the CI fixed right now so that we can get the PR reviewed: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/red-hat-storage_ocs-operator/1419/pull-ci-red-hat-storage-ocs-operator-main-red-hat-storage-ocs-ci-e2e-aws/1488267652865462272#1:build-log.txt%3A259
No RFEs in 4.10 at this point in time; moving it to 4.11.
(In reply to Pranshu Srivastava from comment #16)
> I'm working to get the CI fixed right now so that we can get the PR reviewed:
> https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/red-hat-storage_ocs-operator/1419/pull-ci-red-hat-storage-ocs-operator-main-red-hat-storage-ocs-ci-e2e-aws/1488267652865462272#1:build-log.txt%3A259

Any updates?
Providing QA ack and QE assignment after a bug triage.
Pranshu, do we have the alert already, or does this still need to be worked on?
Version 4.13.0-164 includes an alert, KMSServerConnectionAlert, with the description `Storage Cluster KMS Server is in un-connected state for more than 5s. Please check KMS config.` and the message `Storage Cluster KMS Server is in un-connected state. Please check KMS config.`
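For reference, a minimal sketch of what such a rule entry could look like using the prometheus-operator monitoring/v1 Go types, assuming the exporter publishes the gauge named above where a non-zero value means "not connected". This is not the shipped rule; the actual definition lives in the ocs-operator manifests.

```go
package main

import (
	"fmt"

	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func kmsAlertRule() monitoringv1.Rule {
	return monitoringv1.Rule{
		Alert: "KMSServerConnectionAlert",
		// Hypothetical expression: fire while the connection-status gauge is non-zero.
		Expr:   intstr.FromString(`ocs_storagecluster_kms_connection_status != 0`),
		Labels: map[string]string{"severity": "warning"},
		Annotations: map[string]string{
			"description": "Storage Cluster KMS Server is in un-connected state for more than 5s. Please check KMS config.",
			"message":     "Storage Cluster KMS Server is in un-connected state. Please check KMS config.",
		},
		// The real rule would also carry a `for: 5s` clause so the alert only fires
		// once the condition has persisted; it is omitted here because the Go field
		// type for that clause varies across prometheus-operator versions.
	}
}

func main() {
	fmt.Println(kmsAlertRule().Alert)
}
```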
The alert is raised when an unreachable address is set in the ocs-kms-connection-details config map. --> VERIFIED

During testing, a problem with resolving the alert was found, as described in: https://bugzilla.redhat.com/show_bug.cgi?id=2192852

Tested with:
OCS 4.13.0-179
OCP 4.13.0-0.nightly-2023-05-02-134729
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742