Bug 2192852

Summary: KMSServerConnectionAlert is not cleared when connection to kms is restored
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Filip Balák <fbalak>
Component: ceph-monitoring
Assignee: arun kumar mohan <amohan>
Status: CLOSED ERRATA
QA Contact: Parag Kamble <pakamble>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.13
CC: amohan, ebenahar, muagarwa, nthomas, odf-bz-bot
Target Milestone: ---
Target Release: ODF 4.14.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: 4.14.0-111
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-11-08 18:50:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Filip Balák 2023-05-03 10:34:26 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
KMSServerConnectionAlert gets correctly raised but the alert is not cleared when the connection is restored.

Version of all relevant components (if applicable):
OCS 4.13.0-179
OCP 4.13.0-0.nightly-2023-05-02-134729

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Can this issue reproduce from the UI?
yes

Steps to Reproduce:
1. Install an OCS cluster with cluster-wide encryption enabled, using Vault KMS.
2. Edit the ocs-kms-connection-details config map and set VAULT_ADDR to an incorrect address.
3. Observe the alert in the OCP console.
4. Edit the ocs-kms-connection-details config map again and set VAULT_ADDR back to the correct address.
5. Observe the alert in the OCP console.

Actual results:
The first edit of the config map triggers the alert, but the alert persists and is not cleared when the address is set back to the correct value.

Expected results:
The alert should be cleared when the configuration is correct again.

Additional info:
Testing covered only the alert; it was not checked whether the cluster actually re-establishes the connection. The severity of this bug should be raised if the cluster cannot restore its connection after downtime of the Vault KMS server.

Comment 3 arun kumar mohan 2023-07-13 07:27:38 UTC
RCA follows:

Alert: KMSServerConnectionAlert
Depends on query: ocs_storagecluster_kms_connection_status{job="ocs-metrics-exporter"} == 1
Metric used here: ocs_storagecluster_kms_connection_status
The KMS connection status values are:
0: Connected
1: Not Connected
2: KMS not enabled
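
To make the mapping concrete, here is a minimal Go sketch of the three status values and of the `== 1` comparison from the alert query. The type, constant, and function names are hypothetical and chosen only for illustration; they are not taken from ocs-operator or ocs-metrics-exporter.

```go
package main

// KMSConnectionStatus models the value of the
// ocs_storagecluster_kms_connection_status metric.
// The Go identifiers below are hypothetical, for illustration only.
type KMSConnectionStatus int

const (
	KMSConnected    KMSConnectionStatus = 0 // 0: Connected
	KMSNotConnected KMSConnectionStatus = 1 // 1: Not Connected
	KMSNotEnabled   KMSConnectionStatus = 2 // 2: KMS not enabled
)

// String returns the label used in the list above.
func (s KMSConnectionStatus) String() string {
	switch s {
	case KMSConnected:
		return "Connected"
	case KMSNotConnected:
		return "Not Connected"
	case KMSNotEnabled:
		return "KMS not enabled"
	default:
		return "Unknown"
	}
}

// alertFires mimics the comparison in the alert query (metric == 1):
// only the "Not Connected" value satisfies it.
func alertFires(s KMSConnectionStatus) bool {
	return s == KMSNotConnected
}
```

Only the value 1 satisfies the query, so the alert can clear only if the exporter reports 0 or 2 again.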

Connection status is determined (in the code) by checking the StorageCluster object's `Status.KMSServerConnection.KMSServerConnectionError` string field, which is set when KMS is unreachable.
However, nowhere in the code is this field unset or reset when the connection is (re-)established. Once populated, the field stays set, so the metric keeps reporting 1 and the alert never clears.
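
The following Go sketch illustrates the idea of resetting the error field on a successful check. It is a simplified model under stated assumptions, not the actual ocs-operator code and not necessarily what the PR does: the struct types are reduced to the one field named above, and `recordKMSProbe`, `metricValue`, and `errUnreachable` are hypothetical names.

```go
package main

import "errors"

// Simplified stand-ins for the StorageCluster status types; only the
// field named in the RCA is modeled here.
type KMSServerConnectionStatus struct {
	KMSServerConnectionError string
}

type StorageClusterStatus struct {
	KMSServerConnection KMSServerConnectionStatus
}

// errUnreachable is a hypothetical error used to simulate a failed check.
var errUnreachable = errors.New("kms server unreachable")

// recordKMSProbe (hypothetical) stores the outcome of a connectivity check.
// The important part is the success path: the error string is cleared, so a
// stale error from an earlier failure does not persist.
func recordKMSProbe(status *StorageClusterStatus, probeErr error) {
	if probeErr != nil {
		status.KMSServerConnection.KMSServerConnectionError = probeErr.Error()
		return
	}
	status.KMSServerConnection.KMSServerConnectionError = ""
}

// metricValue (hypothetical) derives the metric value from the status:
// 2 when KMS is not enabled, 1 when an error is recorded, 0 otherwise.
func metricValue(kmsEnabled bool, status StorageClusterStatus) int {
	if !kmsEnabled {
		return 2
	}
	if status.KMSServerConnection.KMSServerConnectionError != "" {
		return 1
	}
	return 0
}
```

In this model, a failed check followed by a successful one moves the derived value from 1 back to 0. Without the reset on the success path, the value would remain 1, which matches the behavior reported in this bug.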

Comment 4 arun kumar mohan 2023-07-13 08:41:34 UTC
Submitting a PR: https://github.com/red-hat-storage/ocs-operator/pull/2108

Comment 6 Mudit Agarwal 2023-08-09 16:03:28 UTC
Please follow up on reviews.

Comment 9 arun kumar mohan 2023-08-16 12:24:17 UTC
Updated the PR

Comment 15 errata-xmlrpc 2023-11-08 18:50:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832