Bug 1986981

Summary: Revise Alert Severity in OCP 4.9
Product: OpenShift Container Platform Reporter: Haoyu Sun <hasun>
Component: Monitoring Assignee: Haoyu Sun <hasun>
Status: CLOSED ERRATA QA Contact: hongyan li <hongyli>
Severity: high Docs Contact:
Priority: high    
Version: 4.9 CC: amuller, anpicker, aos-bugs, arajkuma, dgrisonn, dofinn, erooth, hasun, jeder, rrackow, spasquie
Target Milestone: --- Keywords: ServiceDeliveryImpact
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1991836 (view as bug list) Environment:
Last Closed: 2021-10-18 17:43:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1991836    
Attachments:
Description Flags
Critical Alert Table none

Description Haoyu Sun 2021-07-28 16:07:42 UTC
Created attachment 1806820 [details]
Critical Alert Table

Description of problem:

After reviewing the critical alerts in OCP, we identified 21 alerts that need adjustments:
- Recommend changing Critical to Warning:  13
  - KubePersistentVolumeErrors
  - PrometheusBadConfig
  - PrometheusRemoteStorageFailures
  - PrometheusRuleFailures
  - AlertmanagerMembersInconsistent
  - AlertmanagerClusterFailedToSendAlerts
  - AlertmanagerConfigInconsistent
  - AlertmanagerClusterDown
  - KubeStateMetricsListErrors
  - KubeStateMetricsWatchErrors
  - ThanosRuleSenderIsFailingAlerts
  - ThanosRuleHighRuleEvaluationFailures
  - ThanosNoRuleEvaluations

- Recommend removing alert:  2
  - PrometheusErrorSendingAlertsToAnyAlertmanager
  - AlertmanagerClusterCrashlooping

- Recommend changing Critical to Info:  1
  - PrometheusRemoteWriteBehind
  
- Threshold Tweaks:  5
  - KubePersistentVolumeFillingUp
  - KubeletDown
  - NodeFilesystemFilesFillingUp
  - NodeFilesystemSpaceFillingUp
  - PrometheusRemoteStorageFailures

Please refer to this table for details (proposed modifications are in column F, "Comments"): https://docs.google.com/spreadsheets/d/10rL3loHz6a8lBfKsU2W9TVZSrSqndrnVmkzDeA3Z2kI/edit?usp=sharing
This table can also be found in the attachment.
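
For anyone tracking these changes in a script, the lists above can be captured as plain data. This is only a sketch: the action keys (`critical_to_warning`, `remove`, and so on) and the helper names are made up for illustration and do not come from the spreadsheet or from cluster-monitoring-operator. It also shows that PrometheusRemoteStorageFailures is listed under two actions, so the 21 adjustments cover 20 distinct alert names.

```python
from collections import Counter

# Proposed adjustments, copied from the lists in this report.
# The dictionary keys are illustrative labels, not official terminology.
PROPOSED = {
    "critical_to_warning": [
        "KubePersistentVolumeErrors",
        "PrometheusBadConfig",
        "PrometheusRemoteStorageFailures",
        "PrometheusRuleFailures",
        "AlertmanagerMembersInconsistent",
        "AlertmanagerClusterFailedToSendAlerts",
        "AlertmanagerConfigInconsistent",
        "AlertmanagerClusterDown",
        "KubeStateMetricsListErrors",
        "KubeStateMetricsWatchErrors",
        "ThanosRuleSenderIsFailingAlerts",
        "ThanosRuleHighRuleEvaluationFailures",
        "ThanosNoRuleEvaluations",
    ],
    "remove": [
        "PrometheusErrorSendingAlertsToAnyAlertmanager",
        "AlertmanagerClusterCrashlooping",
    ],
    "critical_to_info": [
        "PrometheusRemoteWriteBehind",
    ],
    "threshold_tweak": [
        "KubePersistentVolumeFillingUp",
        "KubeletDown",
        "NodeFilesystemFilesFillingUp",
        "NodeFilesystemSpaceFillingUp",
        "PrometheusRemoteStorageFailures",
    ],
}


def summarize(proposed):
    """Return the number of alerts listed under each action."""
    return {action: len(alerts) for action, alerts in proposed.items()}


def alerts_with_multiple_actions(proposed):
    """Return the sorted names of alerts that appear under more than one action."""
    counts = Counter(name for alerts in proposed.values() for name in alerts)
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    print(summarize(PROPOSED))
    print(alerts_with_multiple_actions(PROPOSED))
```

The per-action counts match the figures given in the description (13, 2, 1 and 5).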


Version-Release number of selected component (if applicable): 4.9


How reproducible:
N/A

Steps to Reproduce:
N/A

Actual results:
N/A

Expected results:
N/A

Additional info:

Comment 2 Damien Grisonnet 2021-08-10 07:13:43 UTC
*** Bug 1986983 has been marked as a duplicate of this bug. ***

Comment 4 Haoyu Sun 2021-08-10 13:06:55 UTC
The Thanos-related alerts still need a fix. Setting the status to "assigned" for now.

Comment 5 Haoyu Sun 2021-08-16 10:14:57 UTC
Fix in progress:
https://github.com/openshift/cluster-monitoring-operator/pull/1317

Comment 7 hongyan li 2021-08-23 10:01:23 UTC
Tested with payload 4.9.0-0.nightly-2021-08-22-070405.
All alert rules are consistent with the doc:
https://docs.google.com/spreadsheets/d/10rL3loHz6a8lBfKsU2W9TVZSrSqndrnVmkzDeA3Z2kI/edit?usp=sharing
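
A check like this could be partly scripted. The sketch below is not how the verification in this comment was done (the comment does not say); it only assumes the rules were fetched beforehand from the Prometheus `/api/v1/rules` endpoint and parsed into a dict. The function names and the sample data are hypothetical.

```python
def still_critical(rules_response, downgraded):
    """Return sorted alert names from `downgraded` that still have a critical rule.

    `rules_response` is the parsed JSON body of Prometheus's /api/v1/rules
    endpoint: {"data": {"groups": [{"rules": [...]}]}}.
    """
    found = set()
    for group in rules_response.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("type") != "alerting":
                continue  # skip recording rules
            severity = rule.get("labels", {}).get("severity")
            if rule.get("name") in downgraded and severity == "critical":
                found.add(rule["name"])
    return sorted(found)


def still_present(rules_response, removed):
    """Return sorted alert names from `removed` that still exist as alerting rules."""
    found = set()
    for group in rules_response.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("type") == "alerting" and rule.get("name") in removed:
                found.add(rule["name"])
    return sorted(found)


# Hypothetical sample data in the shape of an /api/v1/rules response.
SAMPLE = {
    "status": "success",
    "data": {
        "groups": [
            {
                "name": "example",
                "rules": [
                    {"type": "alerting", "name": "PrometheusBadConfig",
                     "labels": {"severity": "warning"}},
                    {"type": "alerting", "name": "AlertmanagerClusterDown",
                     "labels": {"severity": "critical"}},
                    {"type": "recording", "name": "example:record"},
                    {"type": "alerting", "name": "AlertmanagerClusterCrashlooping",
                     "labels": {"severity": "critical"}},
                ],
            }
        ]
    },
}

if __name__ == "__main__":
    print(still_critical(SAMPLE, {"PrometheusBadConfig", "AlertmanagerClusterDown"}))
    print(still_present(SAMPLE, {"AlertmanagerClusterCrashlooping"}))
```

Both functions return an empty list when the rules match the proposal. Threshold tweaks are not covered, since they require comparing rule expressions against the spreadsheet.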

Comment 15 errata-xmlrpc 2021-10-18 17:43:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759