This is a backport of the changes planned to fix bug 1986981.
Description of problem:
After reviewing the critical alerts in OCP, we identified 21 alerts that need adjustment:
- Recommend changing Critical to Warning: 13
  - KubePersistentVolumeErrors
  - PrometheusBadConfig
  - PrometheusRemoteStorageFailures
  - PrometheusRuleFailures
  - AlertmanagerMembersInconsistent
  - AlertmanagerClusterFailedToSendAlerts
  - AlertmanagerConfigInconsistent
  - AlertmanagerClusterDown
  - KubeStateMetricsListErrors
  - KubeStateMetricsWatchErrors
  - ThanosRuleSenderIsFailingAlerts
  - ThanosRuleHighRuleEvaluationFailures
  - ThanosNoRuleEvaluations
- Recommend removing alert: 2
  - PrometheusErrorSendingAlertsToAnyAlertmanager
  - AlertmanagerClusterCrashlooping
- Recommend changing Critical to Info: 1
  - PrometheusRemoteWriteBehind
- Threshold Tweaks: 5
  - KubePersistentVolumeFillingUp
  - KubeletDown
  - NodeFilesystemFilesFillingUp
  - NodeFilesystemSpaceFillingUp
  - PrometheusRemoteStorageFailures
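As a rough cross-check of the counts above, the lists can be transcribed into a small Python mapping and tallied. This is only an illustrative sketch: the category keys (such as `critical_to_warning`) and the `summarize` helper are hypothetical names chosen here, not part of any OCP tooling. Note that PrometheusRemoteStorageFailures is listed under both the severity change and the threshold tweaks, so the 21 entries cover 20 distinct alert names.

```python
# Proposed changes, transcribed from the lists in this report.
# The category keys are hypothetical labels used only for this sketch.
PROPOSED_CHANGES = {
    "critical_to_warning": [
        "KubePersistentVolumeErrors",
        "PrometheusBadConfig",
        "PrometheusRemoteStorageFailures",
        "PrometheusRuleFailures",
        "AlertmanagerMembersInconsistent",
        "AlertmanagerClusterFailedToSendAlerts",
        "AlertmanagerConfigInconsistent",
        "AlertmanagerClusterDown",
        "KubeStateMetricsListErrors",
        "KubeStateMetricsWatchErrors",
        "ThanosRuleSenderIsFailingAlerts",
        "ThanosRuleHighRuleEvaluationFailures",
        "ThanosNoRuleEvaluations",
    ],
    "remove": [
        "PrometheusErrorSendingAlertsToAnyAlertmanager",
        "AlertmanagerClusterCrashlooping",
    ],
    "critical_to_info": [
        "PrometheusRemoteWriteBehind",
    ],
    "threshold_tweak": [
        "KubePersistentVolumeFillingUp",
        "KubeletDown",
        "NodeFilesystemFilesFillingUp",
        "NodeFilesystemSpaceFillingUp",
        "PrometheusRemoteStorageFailures",
    ],
}


def summarize(changes):
    """Return per-category counts, the total number of entries,
    and any alert that is listed under more than one category."""
    counts = {category: len(alerts) for category, alerts in changes.items()}
    total = sum(counts.values())

    # Record every category each alert appears in.
    seen = {}
    for category, alerts in changes.items():
        for alert in alerts:
            seen.setdefault(alert, []).append(category)
    repeated = {alert: cats for alert, cats in seen.items() if len(cats) > 1}

    return counts, total, repeated


counts, total, repeated = summarize(PROPOSED_CHANGES)
print(counts)
print(total)
print(repeated)
```

Keeping the transcription in one mapping makes it easy to spot an alert that needs two changes at once, which matters when the backport is split across several rule files.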
Please refer to this table for details (the proposed modifications are in column F, "Comments"): https://docs.google.com/spreadsheets/d/10rL3loHz6a8lBfKsU2W9TVZSrSqndrnVmkzDeA3Z2kI/edit?usp=sharing
This table can also be found in the attachment.
Version-Release number of selected component (if applicable): 4.8
How reproducible:
N/A
Steps to Reproduce:
N/A
Actual results:
N/A
Expected results:
N/A
Additional info: