1889689 – AggregatedAPIErrors alert may never fire

Summary: AggregatedAPIErrors alert may never fire

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	low
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Pawel Krupa
QA Contact:	hongyan li
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-10-20 11:39 UTC by Sergiusz Urbaniak
Modified:	2021-07-27 22:34 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-07-27 22:33:58 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	kubernetes-monitoring kubernetes-mixin pull 574	0	None	closed	Update AggregatedAPIErrors after Kubernetes 1.19 changes	2021-04-14 09:38:35 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 22:34:13 UTC

Description Sergiusz Urbaniak 2020-10-20 11:39:13 UTC

Investigating metrics changes in k8s 1.19 revealed that the aggregator_unavailable_apiservice_count metric, which our stack uses in the "AggregatedAPIErrors" alert, was renamed to aggregator_unavailable_apiservice_total: https://github.com/openshift/cluster-monitoring-operator/blob/57a33cb45dc97d23f0b77885c2acd10fd8b60717/assets/prometheus-k8s/rules.yaml#L1680-L1687

We need to fix upstream, vendor downstream and backport to 4.6.

As this is a symptom-based alert with alerting severity warning, I am setting the BZ severity to low (not release blocking).
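
To illustrate why the alert can never fire once the metric name no longer exists, here is a toy Python model of the rule. It is only a sketch, not Prometheus code: the `firing_alerts` helper, the data layout, and the label values and sample numbers are all invented for illustration. The point is that a threshold comparison over a metric with no series yields an empty result instead of an error.

```python
# Toy model of a threshold rule such as
#   sum by(name, namespace)(increase(<metric>[10m])) > 4
# This is NOT Prometheus code; the data layout below is an assumption
# made purely for illustration.

def firing_alerts(series, metric, threshold=4):
    """Return {(name, namespace): summed_increase} for groups above threshold.

    `series` maps a metric name to a list of (labels, increase) pairs,
    standing in for the result of increase(metric[10m]).
    """
    totals = {}
    for labels, increase in series.get(metric, []):
        key = (labels["name"], labels["namespace"])
        totals[key] = totals.get(key, 0) + increase
    # The comparison only filters groups; with no input series nothing survives.
    return {key: value for key, value in totals.items() if value > threshold}


# After the rename only the *_total series exists (values are made up).
scraped = {
    "aggregator_unavailable_apiservice_total": [
        ({"name": "v1beta1.metrics.k8s.io", "namespace": "openshift-monitoring"}, 6),
    ],
}

old = firing_alerts(scraped, "aggregator_unavailable_apiservice_count")
new = firing_alerts(scraped, "aggregator_unavailable_apiservice_total")
print(old)  # {}
print(new)
```

With the old metric name the result is empty even though the API service was unavailable six times in this made-up sample, which is the silent failure described above.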

Comment 2 Sergiusz Urbaniak 2020-11-13 09:03:43 UTC

UpcomingSprint: We don't have enough capacity to tackle this one in the next sprint (193).

Comment 9 hongyan li 2021-04-16 04:49:12 UTC

The rule is: sum by(name, namespace) (increase(aggregator_unavailable_apiservice_count[10m])) > 4
The metric aggregator_unavailable_apiservice_count doesn't exist.

Comment 10 Damien Grisonnet 2021-04-19 12:49:39 UTC

As the fix is already merged upstream, bumping kubernetes-mixin downstream should resolve this BZ. Thus, I'm reassigning this bug to Pawel as he is responsible for the 4.8 bumps.

Comment 12 hongyan li 2021-05-06 07:19:40 UTC

Test with payload

# oc get cm prometheus-k8s-rulefiles-0 -oyaml|grep -A10 AggregatedAPIErrors
      - alert: AggregatedAPIErrors
        annotations:
          description: An aggregated API {{ $labels.name }}/{{ $labels.namespace }} has
            reported errors. It has appeared unavailable {{ $value | humanize }} times
            averaged over the past 10m.
          summary: An aggregated API has reported errors.
        expr: |
          sum by(name, namespace)(increase(aggregator_unavailable_apiservice_total[10m])) > 4
        labels:
          severity: warning
      - alert: AggregatedAPIDown
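
A check like the one above could also be scripted. The following is a minimal sketch in Python, assuming the rule text has already been fetched into a string; the `metric_referenced` helper is hypothetical and relies on a simple whole-identifier regex match, not a real PromQL parser.

```python
import re

OLD_METRIC = "aggregator_unavailable_apiservice_count"
NEW_METRIC = "aggregator_unavailable_apiservice_total"


def metric_referenced(rule_text, metric):
    """Return True if `metric` appears as a whole identifier in `rule_text`."""
    # Reject matches that are part of a longer metric name.
    pattern = r"(?<![A-Za-z0-9_:])" + re.escape(metric) + r"(?![A-Za-z0-9_:])"
    return re.search(pattern, rule_text) is not None


# Expression copied from the rule output above.
rule_text = """
expr: |
  sum by(name, namespace)(increase(aggregator_unavailable_apiservice_total[10m])) > 4
"""

uses_old = metric_referenced(rule_text, OLD_METRIC)
uses_new = metric_referenced(rule_text, NEW_METRIC)
print(uses_old, uses_new)  # False True
```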

Comment 15 errata-xmlrpc 2021-07-27 22:33:58 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438