Bug 1889689 - AggregatedAPIErrors alert may never fire
Summary: AggregatedAPIErrors alert may never fire
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.8.0
Assignee: Pawel Krupa
QA Contact: hongyan li
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-10-20 11:39 UTC by Sergiusz Urbaniak
Modified: 2021-07-27 22:34 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:33:58 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | kubernetes-monitoring kubernetes-mixin pull 574 | 0 | None | closed | Update AggregatedAPIErrors after Kubernetes 1.19 changes | 2021-04-14 09:38:35 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:34:13 UTC

Description Sergiusz Urbaniak 2020-10-20 11:39:13 UTC
Investigating metrics changes in k8s 1.19 revealed that the aggregator_unavailable_apiservice_count metric was renamed to aggregator_unavailable_apiservice_total. The old name is still used in our stack in the "AggregatedAPIErrors" alert, so the alert may never fire: https://github.com/openshift/cluster-monitoring-operator/blob/57a33cb45dc97d23f0b77885c2acd10fd8b60717/assets/prometheus-k8s/rules.yaml#L1680-L1687

We need to fix upstream, vendor downstream and backport to 4.6.

As this is a symptom-based alert with alerting severity warning, I am setting the BZ severity to low (not release blocking).

Comment 2 Sergiusz Urbaniak 2020-11-13 09:03:43 UTC
UpcomingSprint: We don't have enough capacity to tackle this one in the next sprint (193).

Comment 9 hongyan li 2021-04-16 04:49:12 UTC
The rule is: sum by(name, namespace) (increase(aggregator_unavailable_apiservice_count[10m])) > 4
The metric aggregator_unavailable_apiservice_count doesn't exist.
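
To illustrate why a rule on a missing metric stays silent instead of erroring, here is a minimal Python sketch. It is a toy model of the alert expression, not how Prometheus evaluates rules internally. The function name, the sample label values, and the increase of 6 are all hypothetical.

```python
from collections import defaultdict


def aggregated_api_errors(increases, metric, threshold=4):
    """Toy model of: sum by(name, namespace)(increase(<metric>[10m])) > threshold.

    `increases` maps a metric name to a list of (labels, increase_over_10m) pairs.
    """
    totals = defaultdict(float)
    # A metric name with no series contributes nothing, so the result stays empty.
    for labels, value in increases.get(metric, []):
        key = (labels["name"], labels["namespace"])
        totals[key] += value
    # Only groups above the threshold would produce a firing alert.
    return {key: total for key, total in totals.items() if total > threshold}


# Hypothetical data: after Kubernetes 1.19 only the *_total name is exposed.
increases = {
    "aggregator_unavailable_apiservice_total": [
        ({"name": "v1beta1.metrics.k8s.io", "namespace": "default"}, 6.0),
    ],
}

old_rule = aggregated_api_errors(increases, "aggregator_unavailable_apiservice_count")
new_rule = aggregated_api_errors(increases, "aggregator_unavailable_apiservice_total")
print(old_rule)  # empty result: nothing to alert on
print(new_rule)  # one group above the threshold
```

In this model the old metric name always yields an empty result, which matches the symptom reported here: the alert cannot fire no matter how often the aggregated API is unavailable.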

Comment 10 Damien Grisonnet 2021-04-19 12:49:39 UTC
As the fix is already merged upstream, bumping kubernetes-mixin downstream should resolve this BZ. Thus, I'm reassigning this bug to Pawel as he is responsible for the 4.8 bumps.

Comment 12 hongyan li 2021-05-06 07:19:40 UTC
Test with payload

# oc get cm prometheus-k8s-rulefiles-0 -oyaml|grep -A10 AggregatedAPIErrors
      - alert: AggregatedAPIErrors
        annotations:
          description: An aggregated API {{ $labels.name }}/{{ $labels.namespace }} has
            reported errors. It has appeared unavailable {{ $value | humanize }} times
            averaged over the past 10m.
          summary: An aggregated API has reported errors.
        expr: |
          sum by(name, namespace)(increase(aggregator_unavailable_apiservice_total[10m])) > 4
        labels:
          severity: warning
      - alert: AggregatedAPIDown
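
A check like the one above could also be scripted. The following is a minimal Python sketch, assuming the rule file text is available as a string (for example, saved from the `oc get cm` output). The rename map and the function name are hypothetical, and the map lists only the one rename reported in this bug.

```python
import re

# Assumption: only the rename reported in this bug is listed here.
RENAMED_METRICS = {
    "aggregator_unavailable_apiservice_count": "aggregator_unavailable_apiservice_total",
}


def find_stale_metrics(rules_text, renamed=RENAMED_METRICS):
    """Return (line_number, old_name, new_name) for each stale metric reference."""
    findings = []
    for lineno, line in enumerate(rules_text.splitlines(), start=1):
        for old, new in renamed.items():
            # \b prevents matching inside a longer metric identifier.
            if re.search(r"\b" + re.escape(old) + r"\b", line):
                findings.append((lineno, old, new))
    return findings


# Expressions as quoted in comment 9 (before the fix) and comment 12 (after).
before_fix = "sum by(name, namespace) (increase(aggregator_unavailable_apiservice_count[10m])) > 4"
after_fix = "sum by(name, namespace)(increase(aggregator_unavailable_apiservice_total[10m])) > 4"

print(find_stale_metrics(before_fix))  # one finding for the old name
print(find_stale_metrics(after_fix))   # no findings
```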

Comment 15 errata-xmlrpc 2021-07-27 22:33:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

