Bug 1889689

Summary: AggregatedAPIErrors alert may never fire
Product: OpenShift Container Platform Reporter: Sergiusz Urbaniak <surbania>
Component: Monitoring Assignee: Pawel Krupa <pkrupa>
Status: CLOSED ERRATA QA Contact: hongyan li <hongyli>
Severity: low Docs Contact:
Priority: unspecified    
Version: 4.6 CC: alegrand, anpicker, dgrisonn, erooth, hongyli, jfajersk, kakkoyun, lcosic, pkrupa, spasquie
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 22:33:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Sergiusz Urbaniak 2020-10-20 11:39:13 UTC
Investigating metrics changes in k8s 1.19 revealed that the aggregator_unavailable_apiservice_count metric was renamed to aggregator_unavailable_apiservice_total. The old name is still used in our stack in the "AggregatedAPIErrors" alert, so the alert may never fire: https://github.com/openshift/cluster-monitoring-operator/blob/57a33cb45dc97d23f0b77885c2acd10fd8b60717/assets/prometheus-k8s/rules.yaml#L1680-L1687

We need to fix upstream, vendor downstream and backport to 4.6.

As this is a symptom-based alert with severity warning, I am setting the BZ severity to low (not release blocking).
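
To illustrate why a renamed metric silences the alert rather than breaking it loudly, here is a minimal sketch in Python. It is a toy in-memory model, not Prometheus code: the function firing_alerts, the sample label values and the sample numbers are all hypothetical. It assumes only what the rule states, namely that increases are summed by (name, namespace) and compared against 4. A selector for a metric name that no longer exists matches no series, so the comparison has nothing to return and the alert stays silent.

```python
from collections import defaultdict

THRESHOLD = 4  # taken from the rule: ... > 4


def firing_alerts(increases, metric_name, threshold=THRESHOLD):
    """Toy model of: sum by(name, namespace)(increase(<metric_name>[10m])) > threshold.

    `increases` maps (metric, name, namespace, instance) to the increase
    observed over the window. All of this is illustrative, not Prometheus code.
    """
    totals = defaultdict(float)
    for (metric, name, namespace, _instance), value in increases.items():
        if metric == metric_name:  # an unknown metric name matches no series
            totals[(name, namespace)] += value
    # Keep only the groups whose summed increase exceeds the threshold.
    return {key: total for key, total in totals.items() if total > threshold}


# Made-up sample: only the renamed metric is exposed.
sample = {
    ("aggregator_unavailable_apiservice_total", "v1.example.io", "example-ns", "a"): 3.0,
    ("aggregator_unavailable_apiservice_total", "v1.example.io", "example-ns", "b"): 2.0,
}

old_rule = firing_alerts(sample, "aggregator_unavailable_apiservice_count")
new_rule = firing_alerts(sample, "aggregator_unavailable_apiservice_total")
print(old_rule)  # no groups: the old name selects nothing
print(new_rule)  # one group above the threshold
```

With the old metric name the result is empty even though the sample data is above the threshold, which is the silent failure described here.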

Comment 2 Sergiusz Urbaniak 2020-11-13 09:03:43 UTC
UpcomingSprint: We don't have enough capacity to tackle this one in the next sprint (193).

Comment 9 hongyan li 2021-04-16 04:49:12 UTC
The rule is: sum by(name, namespace) (increase(aggregator_unavailable_apiservice_count[10m])) > 4
The metric aggregator_unavailable_apiservice_count doesn't exist.

Comment 10 Damien Grisonnet 2021-04-19 12:49:39 UTC
As the fix is already merged upstream, bumping kubernetes-mixin downstream should resolve this BZ. Thus, I'm reassigning this bug to Pawel as he is responsible for the 4.8 bumps.

Comment 12 hongyan li 2021-05-06 07:19:40 UTC
Test with payload

# oc get cm prometheus-k8s-rulefiles-0 -oyaml|grep -A10 AggregatedAPIErrors
      - alert: AggregatedAPIErrors
        annotations:
          description: An aggregated API {{ $labels.name }}/{{ $labels.namespace }} has
            reported errors. It has appeared unavailable {{ $value | humanize }} times
            averaged over the past 10m.
          summary: An aggregated API has reported errors.
        expr: |
          sum by(name, namespace)(increase(aggregator_unavailable_apiservice_total[10m])) > 4
        labels:
          severity: warning
      - alert: AggregatedAPIDown
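
The manual check above could also be scripted. The following is a minimal sketch, assuming the rule text is available as a plain string (for example, the output of the oc command above saved to a variable). The helper find_stale_metrics and the RENAMED_METRICS table are hypothetical names introduced for illustration; the only rename listed is the one reported in this bug.

```python
import re

# Old name -> new name, as reported in this bug.
RENAMED_METRICS = {
    "aggregator_unavailable_apiservice_count": "aggregator_unavailable_apiservice_total",
}


def find_stale_metrics(rules_text, renamed=RENAMED_METRICS):
    """Return {old_name: new_name} for each renamed metric still present in rules_text."""
    stale = {}
    for old_name, new_name in renamed.items():
        # "_" is a word character, so \b prevents matching inside longer metric names.
        if re.search(r"\b" + re.escape(old_name) + r"\b", rules_text):
            stale[old_name] = new_name
    return stale


# Expressions copied from comment 9 (before the fix) and comment 12 (after the fix).
BEFORE = "sum by(name, namespace) (increase(aggregator_unavailable_apiservice_count[10m])) > 4"
AFTER = "sum by(name, namespace)(increase(aggregator_unavailable_apiservice_total[10m])) > 4"

print(find_stale_metrics(BEFORE))  # reports the old name and its replacement
print(find_stale_metrics(AFTER))   # empty: nothing stale
```

A plain text search is used instead of a PromQL parser to keep the sketch dependency-free, so it would also flag the old name if it appeared in an annotation or comment.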

Comment 15 errata-xmlrpc 2021-07-27 22:33:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438