Investigating metrics changes in k8s 1.19 revealed that the aggregator_unavailable_apiservice_count metric was renamed to aggregator_unavailable_apiservice_total. The old name is still used in our stack in the "AggregatedAPIErrors" alert: https://github.com/openshift/cluster-monitoring-operator/blob/57a33cb45dc97d23f0b77885c2acd10fd8b60717/assets/prometheus-k8s/rules.yaml#L1680-L1687
We need to fix this upstream, vendor it downstream, and backport to 4.6. As this is a symptom-based alert with alert severity "warning", I am setting the BZ severity to low (not release blocking).
UpcomingSprint: We don't have enough capacity to tackle this one in the next sprint (193).
The rule is still
  sum by(name, namespace) (increase(aggregator_unavailable_apiservice_count[10m])) > 4
but the metric aggregator_unavailable_apiservice_count no longer exists, so the alert can never fire.
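For anyone who wants to check other rule expressions for the same problem, here is a minimal sketch in Python. It is an illustration only, not part of cluster-monitoring-operator or kubernetes-mixin. The helper names (find_stale_metrics, rewrite_expr) and the rename table are hypothetical; the only rename taken from this report is aggregator_unavailable_apiservice_count to aggregator_unavailable_apiservice_total. It works on identifier-like tokens and does not parse PromQL.

```python
import re

# Hypothetical rename table. Only this single entry comes from the bug report;
# add further renames yourself if you know of them.
RENAMED_METRICS = {
    "aggregator_unavailable_apiservice_count": "aggregator_unavailable_apiservice_total",
}

# Identifier-like tokens. This also matches PromQL keywords and label names,
# which is harmless because only exact matches in the rename table are acted on.
_IDENT = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def find_stale_metrics(expr, renames=RENAMED_METRICS):
    """Return (old_name, new_name) pairs for renamed metrics referenced in expr."""
    found = []
    for token in _IDENT.findall(expr):
        pair = (token, renames.get(token))
        if token in renames and pair not in found:
            found.append(pair)
    return found


def rewrite_expr(expr, renames=RENAMED_METRICS):
    """Return expr with every renamed metric replaced by its new name."""
    return _IDENT.sub(lambda m: renames.get(m.group(0), m.group(0)), expr)


if __name__ == "__main__":
    rule = ("sum by(name, namespace) "
            "(increase(aggregator_unavailable_apiservice_count[10m])) > 4")
    print(find_stale_metrics(rule))
    print(rewrite_expr(rule))
```

Matching whole tokens rather than substrings avoids rewriting a longer metric name that merely contains the old name as a prefix.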
As the fix is already merged upstream, bumping kubernetes-mixin downstream should resolve this BZ. Thus, I'm reassigning this bug to Pawel, as he is responsible for the 4.8 bumps.
Test with payload
# oc get cm prometheus-k8s-rulefiles-0 -oyaml|grep -A10 AggregatedAPIErrors
    - alert: AggregatedAPIErrors
      annotations:
        description: An aggregated API {{ $labels.name }}/{{ $labels.namespace }} has reported errors. It has appeared unavailable {{ $value | humanize }} times averaged over the past 10m.
        summary: An aggregated API has reported errors.
      expr: |
        sum by(name, namespace)(increase(aggregator_unavailable_apiservice_total[10m])) > 4
      labels:
        severity: warning
    - alert: AggregatedAPIDown
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438