Bug 1888866 - AggregatedAPIDown permanently firing after removing APIService
Summary: AggregatedAPIDown permanently firing after removing APIService
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Damien Grisonnet
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks: 1915247
 
Reported: 2020-10-16 01:27 UTC by Seth Jennings
Modified: 2024-03-25 16:44 UTC
CC List: 15 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1915247 1916660
Environment:
Last Closed: 2021-02-24 15:26:23 UTC
Target Upstream Version:
Embargoed:


Attachments
console.png (59.00 KB, image/png)
2020-10-16 01:31 UTC, Seth Jennings


Links
Github kubernetes/kubernetes issue 92671 (closed): After the aggregation api is deleted, the deleted api still exists in the kube-apiserver metrics - last updated 2021-02-09 07:41:38 UTC
Github kubernetes/kubernetes pull 96421 (closed): Fix aggregator_unavailable_apiservice gauge - last updated 2021-02-09 07:41:39 UTC
Red Hat Product Errata RHSA-2020:5633 - last updated 2021-02-24 15:26:53 UTC

Description Seth Jennings 2020-10-16 01:27:32 UTC
Description of problem:
AggregatedAPIDown permanently firing after removing APIService

Version-Release number of selected component (if applicable):
4.6.0-rc.4

How reproducible:
Always

Steps to Reproduce:
1. Create an APIService (I installed ACM, which created one)
2. Remove the APIService
3. Observe the alert (a minimal reproduction sketch without ACM follows below)
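
The same situation can be reproduced without ACM by registering a throwaway APIService with no backing Service, letting it go unavailable, and then deleting it. A minimal sketch, assuming cluster-admin access; all names below are illustrative:

$ cat <<'EOF' | oc apply -f -
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1alpha1.demo.example.com
spec:
  group: demo.example.com
  version: v1alpha1
  groupPriorityMinimum: 1000
  versionPriority: 15
  insecureSkipTLSVerify: true
  service:
    name: demo-aggregator        # intentionally points at a Service that does not exist
    namespace: default
    port: 443
EOF

# Wait for the aggregator to mark it unavailable, then delete it.
$ oc wait --for=condition=Available=false apiservice/v1alpha1.demo.example.com --timeout=120s
$ oc delete apiservice v1alpha1.demo.example.com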

Actual results:
AggregatedAPIDown permanently firing after removing APIService

Expected results:
AggregatedAPIDown alert only checks for APIServices that actually still exist

Additional info:
https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/397
https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/406
https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/407
https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/408
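
The alert comes from the kubernetes-mixin rules referenced above and is evaluated from the kube-apiserver's aggregator_unavailable_apiservice gauge (the gauge targeted by the linked upstream fixes). One way to inspect the exact expression on a live cluster, without assuming which namespace ships the rule:

$ oc get prometheusrules --all-namespaces -o yaml | grep -B 2 -A 10 'alert: AggregatedAPIDown'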

Comment 1 Seth Jennings 2020-10-16 01:31:02 UTC
Created attachment 1721947 [details]
console.png

See attached console screenshot. APIServices triggering the alert do not exist anymore.

$ oc get apiservices | grep False
<nothing returned, all apiservices are responding>

$ oc get apiservices | grep v1.admission.hive.openshift.io
<nothing returned, apiservice triggering alert does not exist>
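
The stale series behind the alert can also be checked directly against the kube-apiserver metrics endpoint (requires a user allowed to read /metrics, e.g. cluster-admin):

$ oc get --raw /metrics | grep aggregator_unavailable_apiservice
# Any line still mentioning the deleted APIService here is the stale gauge
# sample that keeps AggregatedAPIDown firing.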

Comment 4 Kevin Chung 2020-11-09 17:48:29 UTC
I'm observing this issue as well. Does a workaround (even an unsupported one) exist? I have an ephemeral monitoring stack and have tried everything from deleting pods and prometheusrules to 'oc delete --raw /metrics', but after a few minutes this alert still ends up triggering in my dashboard:

An aggregated API <name of the apiservice>/default has been only 0% available over the last 5m.
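
If it helps anyone triaging this, one way to look at the raw alert series is to port-forward into one of the prometheus-k8s pods and query the HTTP API (the pod name and the in-pod listener on 9090 are assumptions and may differ between releases):

$ oc -n openshift-monitoring port-forward pod/prometheus-k8s-0 9090:9090 &
$ curl -s 'http://localhost:9090/api/v1/query' \
      --data-urlencode 'query=ALERTS{alertname="AggregatedAPIDown"}'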

Also, this was previously reported upstream in GitHub, but there doesn't seem to be progress there.  I'm going to link it to this BZ.

Comment 5 Damien Grisonnet 2020-11-09 17:56:51 UTC
There is, unfortunately, no workaround for this. However, I noticed that it only affects API services that were deleted while they were unavailable. Maybe this information can help you somehow.
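
A quick way to check whether a given APIService falls into that category before deleting it is to look at its Available condition, e.g. for the one from the screenshot:

$ oc get apiservice v1.admission.hive.openshift.io \
      -o jsonpath='{.status.conditions[?(@.type=="Available")].status}{"\n"}'
# "False" here means deleting the APIService now would leave the stale series behind.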

> Also, this was previously reported upstream in GitHub, but there doesn't seem to be progress there.  I'm going to link it to this BZ.

Yes, I tried to give it some traction, but without success. Hence, I am currently working on a PR to fix the issue.

Comment 6 Damien Grisonnet 2020-11-10 18:25:42 UTC
I linked the PR I opened against Kubernetes to this BZ.

Comment 7 Damien Grisonnet 2020-11-26 10:38:46 UTC
The upstream PR has been LGTM'd; it is now in the hands of the API team to cherry-pick the fix.

Comment 9 Stefan Schimanski 2021-01-11 09:40:23 UTC
We need 4.6 and 4.5 backports of this.

Comment 10 Damien Grisonnet 2021-01-11 09:47:10 UTC
A workaround would be to restart the kube-apiservers after deleting the APIService. This should silence the AggregatedAPIDown alert.

Lowering to high priority and severity as a workaround exists.
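
For anyone applying the workaround, a sketch of one way to roll the kube-apiserver static pods on OCP 4 is to force a redeployment through the operator (the reason string is arbitrary but must change every time):

$ oc patch kubeapiserver cluster --type merge \
      -p "{\"spec\":{\"forceRedeploymentReason\":\"clear-stale-aggregator-metrics-$(date +%s)\"}}"
$ oc get pods -n openshift-kube-apiserver --watch   # wait for all masters to finish rolling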

Comment 11 Damien Grisonnet 2021-01-15 11:43:29 UTC
Manually moving this BZ to MODIFIED as the upstream PR was synced in 4.7 by the 1.20 rebase.

https://github.com/openshift/kubernetes/pull/471/commits/b525f9e0ed0003471438fb42fa37ff4ebe36d653

Comment 15 Ke Wang 2021-01-19 10:45:32 UTC
$ oc adm release info --commits registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-01-18-214951  | grep 'hyperkube'
  hyperkube           https://github.com/openshift/kubernetes         d9c52cc4e02894215b0d1c2aeea240fe77765c66

$ cd kubernetes
$ git pull

$ git log --date=local --pretty="%h %an %cd - %s" d9c52cc4 | grep  '#96421'
5ed4b76a03b Kubernetes Prow Robot Thu Nov 26 23:24:19 2020 - Merge pull request #96421 from dgrisonnet/fix-apiservice-availability

$ git log --date=local --pretty="%h %an %cd - %s" d9c52cc4 | grep  '#92671'
No results found.

PR 92671 has not landed in the latest OCP 4.7 payload yet; will wait for it to land.
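
Once the fix is in a payload, one possible functional check (reusing the throwaway v1alpha1.demo.example.com APIService from the reproduction sketch in the description; names are illustrative) is to delete it while it is unavailable and confirm the gauge series disappears instead of leaving the alert firing:

$ oc delete apiservice v1alpha1.demo.example.com
$ sleep 120
$ oc get --raw /metrics | grep 'v1alpha1.demo.example.com' \
      || echo "stale aggregator_unavailable_apiservice series is gone"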

Comment 21 errata-xmlrpc 2021-02-24 15:26:23 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

