Description of problem:
AggregatedAPIDown permanently firing after removing APIService

Version-Release number of selected component (if applicable):
4.6.0-rc.4

How reproducible:
Always

Steps to Reproduce:
1. Create an APIService (for me, I installed ACM which created one)
2. Remove the APIService
3. Observe the alert

Actual results:
AggregatedAPIDown permanently firing after removing the APIService

Expected results:
The AggregatedAPIDown alert only checks APIServices that actually still exist

Additional info:
https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/397
https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/406
https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/407
https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/408
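For anyone reproducing this without installing ACM: registering a minimal APIService whose backing service does not exist should produce the same unavailable-then-deleted sequence. The group, version, and service names below are hypothetical, made up for illustration:

```yaml
# Hypothetical APIService for reproduction: the group/version and the backing
# service are invented names. The referenced service intentionally does not
# exist, so the APIService reports Available=False.
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1alpha1.example.test
spec:
  group: example.test
  version: v1alpha1
  groupPriorityMinimum: 1000
  versionPriority: 15
  service:
    name: nonexistent-service
    namespace: default
```

Apply it, wait for it to become unavailable, delete it, and then watch for AggregatedAPIDown to keep firing.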
Created attachment 1721947 [details]
console.png

See attached console screenshot. The APIServices triggering the alert no longer exist.

$ oc get apiservices | grep False
<nothing returned, all apiservices are responding>

$ oc get apiservices | grep v1.admission.hive.openshift.io
<nothing returned, apiservice triggering alert does not exist>
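For context, the alert is driven by the kube-apiserver's aggregator_unavailable_apiservice gauge, and the kubernetes-mixin rule is roughly of the following shape (window and threshold are approximate, paraphrased from the linked upstream issue, not copied verbatim):

```yaml
# Sketch of the kubernetes-mixin AggregatedAPIDown rule; the exact
# range window and threshold vary between mixin releases.
alert: AggregatedAPIDown
expr: |
  (1 - max by (name, namespace) (
    avg_over_time(aggregator_unavailable_apiservice[10m])
  )) * 100 < 85
for: 5m
labels:
  severity: warning
```

Because kube-apiserver did not remove the gauge series when an APIService was deleted, Prometheus kept scraping the last (unavailable) value, so the alert could never resolve.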
I'm observing this issue as well; does a workaround (even an unsupported one) exist? I have an ephemeral monitoring stack and have tried everything from deleting pods and prometheusrules to 'oc delete --raw /metrics', but after a few minutes the alert still ends up firing in my dashboard:

An aggregated API <name of the apiservice>/default has been only 0% available over the last 5m.

Also, this was previously reported upstream on GitHub, but there doesn't seem to be any progress there. I'm going to link it to this BZ.
There is, unfortunately, no workaround for this. However, I noticed that it only affects APIServices that were deleted while they were unavailable. Maybe this information can help you somehow.

> Also, this was previously reported upstream on GitHub, but there doesn't seem to be any progress there. I'm going to link it to this BZ.

Yes, I tried to give it some traction, but without any result. Hence, I am currently working on a PR to fix the issue.
I linked the PR I opened against Kubernetes to this BZ.
The upstream PR has been LGTM'd; it is now in the hands of the API team to cherry-pick the fix.
We need 4.6 and 4.5 backports of this.
A workaround is to restart the kube-apiservers after deleting the APIService; this should silence the AggregatedAPIDown alert. Lowering to high priority and severity since a workaround exists.
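On OCP, that restart can be triggered through the operator rather than by deleting pods by hand. A sketch of the workaround, assuming the stale APIService is the hive one from the screenshot (the redeployment reason string is arbitrary):

```shell
# Delete the stale APIService first (name taken from the attached screenshot).
oc delete apiservice v1.admission.hive.openshift.io

# Force the kube-apiserver operator to roll out new static pods; the restart
# clears the stale aggregator_unavailable_apiservice series from /metrics.
oc patch kubeapiserver cluster --type=merge \
  -p '{"spec":{"forceRedeploymentReason":"clear-stale-apiservice-metric-'"$(date +%s)"'"}}'

# Watch the rollout converge on the new revision.
oc get kubeapiserver cluster -o jsonpath='{.status.nodeStatuses}'
```

The forced redeployment takes several minutes as each control-plane node restarts its kube-apiserver in turn.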
Manually moving this BZ to MODIFIED as the upstream PR was synced into 4.7 by the 1.20 rebase:
https://github.com/openshift/kubernetes/pull/471/commits/b525f9e0ed0003471438fb42fa37ff4ebe36d653
$ oc adm release info --commits registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-01-18-214951 | grep 'hyperkube'
  hyperkube  https://github.com/openshift/kubernetes  d9c52cc4e02894215b0d1c2aeea240fe77765c66

$ cd kubernetes
$ git pull
$ git log --date=local --pretty="%h %an %cd - %s" d9c52cc4 | grep '#96421'
5ed4b76a03b Kubernetes Prow Robot Thu Nov 26 23:24:19 2020 - Merge pull request #96421 from dgrisonnet/fix-apiservice-availability

$ git log --date=local --pretty="%h %an %cd - %s" d9c52cc4 | grep '#92671'
<no results found>

PR 92671 has not landed in the latest OCP 4.7 payload; will wait for it to land.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633