Bug 1888866

Summary:

AggregatedAPIDown permanently firing after removing APIService

Product:

OpenShift Container Platform

Reporter:

Seth Jennings <sjenning>

Component:

kube-apiserver

Assignee:

Damien Grisonnet <dgrisonn>

Status:

CLOSED ERRATA

QA Contact:

Ke Wang <kewang>

Severity:

high

Docs Contact:

Priority:

high

Version:

4.6

CC:

alegrand, anpicker, aos-bugs, dgrisonn, erooth, kakkoyun, kechung, lcosic, lmartinh, mfojtik, palonsor, pkrupa, spasquie, surbania, xxia

Target Milestone:

---

Target Release:

4.7.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

No Doc Update

Doc Text:

Story Points:

---

Clone Of:

Clones:

1915247 1916660

Environment:

Last Closed:

2021-02-24 15:26:23 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1915247

Attachments:

Description	Flags
console.png	none

Description Seth Jennings 2020-10-16 01:27:32 UTC

Description of problem:
AggregatedAPIDown permanently firing after removing APIService

Version-Release number of selected component (if applicable):
4.6.0-rc.4

How reproducible:
Always

Steps to Reproduce:
1. Create an APIService (for me, I installed ACM which created one)
2. Remove the APIService
3. Observe the alert

Actual results:
AggregatedAPIDown permanently firing after removing APIService

Expected results:
AggregatedAPIDown alert only checks for APIServices that actually still exist

Additional info:
https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/397
https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/406
https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/407
https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/408

Comment 1 Seth Jennings 2020-10-16 01:31:02 UTC

Created attachment 1721947
console.png

See attached console screenshot. APIServices triggering the alert do not exist anymore.

$ oc get apiservices | grep False
<nothing returned, all apiservices are responding>

$ oc get apiservices | grep v1.admission.hive.openshift.io
<nothing returned, apiservice triggering alert does not exist>
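
The check above can also be run from the metric side. The following is a minimal sketch, assuming the alert is driven by the kube-apiserver gauge aggregator_unavailable_apiservice and that the gauge carries a "name" label. The helper names unavailable_names and stale_names and the file names in the usage comments are hypothetical. Note that "oc get --raw /metrics" reaches only one kube-apiserver instance per request, so results may differ between calls.

```shell
#!/usr/bin/env bash
# Sketch: find APIServices that are still reported as unavailable in the
# kube-apiserver metrics although they no longer exist in the cluster.

# unavailable_names reads Prometheus text-format metrics on stdin and prints
# the "name" label of each aggregator_unavailable_apiservice series set to 1.
unavailable_names() {
  sed -n 's/^aggregator_unavailable_apiservice{.*name="\([^"]*\)".*} 1$/\1/p' | sort -u
}

# stale_names prints the names listed in file $1 (reported unavailable) that
# are missing from file $2 (existing APIServices). One name per line in both.
stale_names() {
  grep -Fxv -f "$2" "$1" || true
}

# Usage against a live cluster (file names are placeholders):
#   oc get --raw /metrics | unavailable_names > reported.txt
#   oc get apiservices -o name | sed 's|.*/||' > existing.txt
#   stale_names reported.txt existing.txt
```

Any name printed by stale_names is a series that outlived its APIService, which matches the behavior described in this report.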

Comment 4 Kevin Chung 2020-11-09 17:48:29 UTC

I'm observing this issue as well. Does a workaround (even an unsupported one) exist?  I have an ephemeral monitoring stack and have tried everything from deleting pods and prometheusrules to 'oc delete --raw /metrics', but after a few minutes the alert fires again in my dashboard:

An aggregated API <name of the apiservice>/default has been only 0% available over the last 5m.

Also, this was previously reported upstream in GitHub, but there doesn't seem to be progress there.  I'm going to link it to this BZ.

Comment 5 Damien Grisonnet 2020-11-09 17:56:51 UTC

There is, unfortunately, no workaround for this. However, I noticed that this only affects API services that were deleted while they were unavailable. Maybe this information can help you somehow.

> Also, this was previously reported upstream in GitHub, but there doesn't seem to be progress there.  I'm going to link it to this BZ.

Yes, I tried to give it some traction but without any result. Hence, I am currently working on a PR to fix the issue.

Comment 6 Damien Grisonnet 2020-11-10 18:25:42 UTC

I linked the PR I opened against Kubernetes to this BZ.

Comment 7 Damien Grisonnet 2020-11-26 10:38:46 UTC

The upstream PR has received an LGTM, so it is now in the hands of the API team to cherry-pick the fix.

Comment 9 Stefan Schimanski 2021-01-11 09:40:23 UTC

We need 4.6 and 4.5 backports of this.

Comment 10 Damien Grisonnet 2021-01-11 09:47:10 UTC

A workaround is to restart the kube-apiservers after deleting the APIService. This should stop the AggregatedAPIDown alert from firing.

Lowering to high priority and severity as a workaround exists.
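
The comment above does not say how to restart the kube-apiservers. One possible way, sketched here as an assumption rather than the documented procedure for this bug, is to ask the kube-apiserver operator for a new rollout by changing spec.forceRedeploymentReason on the kubeapiserver/cluster resource. The helper names redeploy_patch and restart_kube_apiservers and the reason string are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: trigger a rollout of the kube-apiserver static pods so that the
# stale aggregator metric series are dropped with the old processes.

# redeploy_patch prints a JSON merge patch that sets forceRedeploymentReason
# to the string passed as $1.
redeploy_patch() {
  printf '{"spec":{"forceRedeploymentReason":"%s"}}' "$1"
}

# restart_kube_apiservers applies the patch with a timestamped reason, so
# every call produces a new value and therefore a new rollout.
restart_kube_apiservers() {
  oc patch kubeapiserver cluster --type=merge \
    -p "$(redeploy_patch "clear-stale-apiservice-metrics-$(date +%s)")"
}

# Usage: run restart_kube_apiservers, then watch the rollout with
#   oc get clusteroperator kube-apiserver
```

The rollout replaces the kube-apiserver pods one node at a time, so expect it to take several minutes before the alert clears.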

Comment 11 Damien Grisonnet 2021-01-15 11:43:29 UTC

Manually moving this BZ to MODIFIED as the upstream PR was synced in 4.7 by the 1.20 rebase.

https://github.com/openshift/kubernetes/pull/471/commits/b525f9e0ed0003471438fb42fa37ff4ebe36d653

Comment 15 Ke Wang 2021-01-19 10:45:32 UTC

$ oc adm release info --commits registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-01-18-214951  | grep 'hyperkube'
  hyperkube           https://github.com/openshift/kubernetes         d9c52cc4e02894215b0d1c2aeea240fe77765c66

$ cd kubernetes
$ git pull

$ git log --date=local --pretty="%h %an %cd - %s" d9c52cc4 | grep  '#96421'
5ed4b76a03b Kubernetes Prow Robot Thu Nov 26 23:24:19 2020 - Merge pull request #96421 from dgrisonnet/fix-apiservice-availability

$ git log --date=local --pretty="%h %an %cd - %s" d9c52cc4 | grep  '#92671'
No results found.

PR 92671 is not yet included in the latest OCP 4.7 payload; I will wait for it to land.

Comment 21 errata-xmlrpc 2021-02-24 15:26:23 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633