Bug 1986981 - Revise Alert Severity in OCP 4.9
Summary: Revise Alert Severity in OCP 4.9
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Haoyu Sun
QA Contact: hongyan li
URL:
Whiteboard:
Duplicates: 1986983
Depends On:
Blocks: 1991836
Reported: 2021-07-28 16:07 UTC by Haoyu Sun
Modified: 2021-10-18 17:43 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1991836
Environment:
Last Closed: 2021-10-18 17:43:06 UTC
Target Upstream Version:


Attachments
Critical Alert Table (25.77 KB, application/vnd.oasis.opendocument.spreadsheet)
2021-07-28 16:07 UTC, Haoyu Sun


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-monitoring-operator pull 1310 | 0 | None | None | None | 2021-08-10 07:07:44 UTC
Github | openshift cluster-monitoring-operator pull 1317 | 0 | None | None | None | 2021-08-10 13:23:32 UTC
Red Hat Product Errata | RHSA-2021:3759 | 0 | None | None | None | 2021-10-18 17:43:10 UTC

Internal Links: 1986983

Description Haoyu Sun 2021-07-28 16:07:42 UTC
Created attachment 1806820 [details]
Critical Alert Table

Description of problem:

After reviewing the critical alerts in OCP, we found 21 alerts that need adjustment:
- Recommend changing Critical to Warning:  13
  - KubePersistentVolumeErrors
  - PrometheusBadConfig
  - PrometheusRemoteStorageFailures
  - PrometheusRuleFailures
  - AlertmanagerMembersInconsistent
  - AlertmanagerClusterFailedToSendAlerts
  - AlertmanagerConfigInconsistent
  - AlertmanagerClusterDown
  - KubeStateMetricsListErrors
  - KubeStateMetricsWatchErrors
  - ThanosRuleSenderIsFailingAlerts
  - ThanosRuleHighRuleEvaluationFailures
  - ThanosNoRuleEvaluations

- Recommend removing alert:  2
  - PrometheusErrorSendingAlertsToAnyAlertmanager
  - AlertmanagerClusterCrashlooping

- Recommend changing Critical to Info:  1
  - PrometheusRemoteWriteBehind
  
- Threshold Tweaks:  5
  - KubePersistentVolumeFillingUp
  - KubeletDown
  - NodeFilesystemFilesFillingUp
  - NodeFilesystemSpaceFillingUp
  - PrometheusRemoteStorageFailures

Please refer to this table for details (proposed modifications are in column F, "Comments"): https://docs.google.com/spreadsheets/d/10rL3loHz6a8lBfKsU2W9TVZSrSqndrnVmkzDeA3Z2kI/edit?usp=sharing
The table can also be found in the attachment.
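
For illustration only, the severity changes and removals listed above can be written down as data and applied to a list of rules. This is a minimal sketch, not the change made in the linked pull requests. It assumes each rule is a dict shaped like a parsed Prometheus alerting rule (an "alert" name and a "labels" map with "severity"), and that only rules currently labelled critical are affected. The names PROPOSED_SEVERITY and apply_proposal are hypothetical. The threshold tweaks are left out because their details are only in the spreadsheet.

```python
# Proposed severities from this report, keyed by alert name.
# None marks an alert that is recommended for removal.
PROPOSED_SEVERITY = {
    "KubePersistentVolumeErrors": "warning",
    "PrometheusBadConfig": "warning",
    "PrometheusRemoteStorageFailures": "warning",
    "PrometheusRuleFailures": "warning",
    "AlertmanagerMembersInconsistent": "warning",
    "AlertmanagerClusterFailedToSendAlerts": "warning",
    "AlertmanagerConfigInconsistent": "warning",
    "AlertmanagerClusterDown": "warning",
    "KubeStateMetricsListErrors": "warning",
    "KubeStateMetricsWatchErrors": "warning",
    "ThanosRuleSenderIsFailingAlerts": "warning",
    "ThanosRuleHighRuleEvaluationFailures": "warning",
    "ThanosNoRuleEvaluations": "warning",
    "PrometheusRemoteWriteBehind": "info",
    "PrometheusErrorSendingAlertsToAnyAlertmanager": None,
    "AlertmanagerClusterCrashlooping": None,
}


def apply_proposal(rules):
    """Return a new rule list with the proposed severities applied.

    Assumption: only rules currently labelled "critical" are touched;
    every other rule is passed through unchanged.
    """
    revised = []
    for rule in rules:
        name = rule.get("alert")
        severity = rule.get("labels", {}).get("severity")
        if name in PROPOSED_SEVERITY and severity == "critical":
            target = PROPOSED_SEVERITY[name]
            if target is None:
                continue  # alert recommended for removal: drop the rule
            # Copy the rule so the caller's input is not mutated.
            rule = {**rule, "labels": {**rule["labels"], "severity": target}}
        revised.append(rule)
    return revised
```

Alerts that appear only under "Threshold Tweaks" (for example KubeletDown) are not in the mapping, so this sketch leaves them unchanged.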


Version-Release number of selected component (if applicable): 4.9


How reproducible:
N/A

Steps to Reproduce:
N/A

Actual results:
N/A

Expected results:
N/A

Additional info:

Comment 2 Damien Grisonnet 2021-08-10 07:13:43 UTC
*** Bug 1986983 has been marked as a duplicate of this bug. ***

Comment 4 Haoyu Sun 2021-08-10 13:06:55 UTC
The Thanos-related alerts still need a fix. Setting the status to "assigned" for now.

Comment 5 Haoyu Sun 2021-08-16 10:14:57 UTC
Fix in progress:
https://github.com/openshift/cluster-monitoring-operator/pull/1317

Comment 7 hongyan li 2021-08-23 10:01:23 UTC
Tested with payload 4.9.0-0.nightly-2021-08-22-070405.
All alert rules are consistent with the doc:
https://docs.google.com/spreadsheets/d/10rL3loHz6a8lBfKsU2W9TVZSrSqndrnVmkzDeA3Z2kI/edit?usp=sharing
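
A check of this kind could be scripted roughly as follows. This is only a sketch and not necessarily how the verification above was done. It assumes the input is the parsed JSON body of the Prometheus GET /api/v1/rules endpoint, and that the expected severities are supplied by the caller from the spreadsheet. The function name find_severity_mismatches is hypothetical. Alerts that legitimately ship several severity variants would need a finer comparison than this.

```python
def find_severity_mismatches(rules_response, expected):
    """List alerting rules whose severity differs from the expected one.

    rules_response: parsed JSON from Prometheus GET /api/v1/rules.
    expected: alert name -> expected severity, or None if the alert
              should no longer exist at all.
    Returns (alert name, found severity, expected severity) tuples
    in the order the rules appear in the response.
    """
    mismatches = []
    for group in rules_response["data"]["groups"]:
        for rule in group["rules"]:
            if rule.get("type") != "alerting":
                continue  # skip recording rules
            name = rule["name"]
            if name not in expected:
                continue  # alert is not covered by the review
            found = rule.get("labels", {}).get("severity")
            want = expected[name]
            # A rule is wrong if it should have been removed (want is None)
            # or if it carries a different severity than expected.
            if want is None or found != want:
                mismatches.append((name, found, want))
    return mismatches
```

An empty result means every listed alert present in the response carries the expected severity and no removed alert is still defined.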

Comment 15 errata-xmlrpc 2021-10-18 17:43:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

