Bug 1888866

Summary: AggregatedAPIDown permanently firing after removing APIService
Product: OpenShift Container Platform Reporter: Seth Jennings <sjenning>
Component: kube-apiserver Assignee: Damien Grisonnet <dgrisonn>
Status: CLOSED ERRATA QA Contact: Ke Wang <kewang>
Severity: high Docs Contact:
Priority: high    
Version: 4.6 CC: alegrand, anpicker, aos-bugs, dgrisonn, erooth, kakkoyun, kechung, lcosic, lmartinh, mfojtik, palonsor, pkrupa, spasquie, surbania, xxia
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1915247 1916660 (view as bug list) Environment:
Last Closed: 2021-02-24 15:26:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1915247    
Attachments:
Description Flags
console.png none

Description Seth Jennings 2020-10-16 01:27:32 UTC
Description of problem:
AggregatedAPIDown permanently firing after removing APIService

Version-Release number of selected component (if applicable):
4.6.0-rc.4

How reproducible:
Always

Steps to Reproduce:
1. Create an APIService (for me, I installed ACM which created one)
2. Remove the APIService
3. Observe the alert

Actual results:
AggregatedAPIDown permanently firing after removing APIService

Expected results:
AggregatedAPIDown alert only checks for APIServices that actually still exist

Additional info:
https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/397
https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/406
https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/407
https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/408

Comment 1 Seth Jennings 2020-10-16 01:31:02 UTC
Created attachment 1721947 [details]
console.png

See attached console screenshot. APIServices triggering the alert do not exist anymore.

$ oc get apiservices | grep False
<nothing returned, all apiservices are responding>

$ oc get apiservices | grep v1.admission.hive.openshift.io
<nothing returned, apiservice triggering alert does not exist>
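
The manual check above can be sketched as a set difference: names the alert reports minus names that still exist. This is only an illustration; `stale_apiservices` and both input files are hypothetical, and how the alerting names are collected (for example, copied from the `name` label shown in the console) is an assumption.

```shell
#!/bin/sh
# Hypothetical helper, not part of oc: prints every APIService name that
# is listed in the alerting file but missing from the existing file.
#   $1: file with one alerting APIService name per line
#   $2: file with one existing APIService name per line
stale_apiservices() {
  # -F fixed strings, -x whole-line match, -v invert, -f patterns from file.
  # grep exits 1 when it prints nothing, so mask that with "|| true".
  grep -F -x -v -f "$2" "$1" || true
}

# Possible usage against a live cluster (assumes cluster access):
#   oc get apiservices -o custom-columns=NAME:.metadata.name --no-headers > existing.txt
#   stale_apiservices alerting.txt existing.txt
```

Any name this prints is one the alert still reports even though the APIService is gone, which is the situation described in this comment.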

Comment 4 Kevin Chung 2020-11-09 17:48:29 UTC
I'm observing this issue as well. Does a workaround (even an unsupported one) exist?  I have an ephemeral monitoring stack and have tried everything from deleting pods and prometheusrules to running 'oc delete --raw /metrics', but after a few minutes the alert still fires in my dashboard:

An aggregated API <name of the apiservice>/default has been only 0% available over the last 5m.

Also, this was previously reported upstream in GitHub, but there doesn't seem to be progress there.  I'm going to link it to this BZ.

Comment 5 Damien Grisonnet 2020-11-09 17:56:51 UTC
There is, unfortunately, no workaround for this. However, I noticed that this only affects API services that were deleted while they were unavailable. Maybe this information can help you somehow.

> Also, this was previously reported upstream in GitHub, but there doesn't seem to be progress there.  I'm going to link it to this BZ.

Yes, I tried to give it some traction but without any result. Hence, I am currently working on a PR to fix the issue.

Comment 6 Damien Grisonnet 2020-11-10 18:25:42 UTC
I linked the PR I opened against Kubernetes to this BZ.

Comment 7 Damien Grisonnet 2020-11-26 10:38:46 UTC
The upstream PR has been LGTM'd; it is now in the hands of the API team to cherry-pick the fix.

Comment 9 Stefan Schimanski 2021-01-11 09:40:23 UTC
We need 4.6 and 4.5 backports of this.

Comment 10 Damien Grisonnet 2021-01-11 09:47:10 UTC
A workaround would be to restart the kube-apiservers after deleting the APIService. That should silence the AggregatedAPIDown alert.

Lowering to high priority and severity as a workaround exists.
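
This comment does not say how to restart the kube-apiservers. One possible way, sketched below under assumptions, is to force a new rollout through the kube-apiserver operator by changing `spec.forceRedeploymentReason`. The function names are hypothetical, and whether this is the restart intended here is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: builds the merge patch that changes
# spec.forceRedeploymentReason on the kubeapiserver operator resource.
#   $1: free-form reason string; it must differ from the current value
#       for the operator to roll out new kube-apiserver pods
redeploy_patch() {
  printf '{"spec":{"forceRedeploymentReason":"%s"}}\n' "$1"
}

# Hypothetical wrapper: applies the patch (assumes cluster-admin access).
restart_kube_apiservers() {
  oc patch kubeapiserver cluster --type=merge -p "$(redeploy_patch "$1")"
}

# Possible usage after deleting the APIService:
#   oc delete apiservice <name of the apiservice>
#   restart_kube_apiservers "clear-stale-apiservice-$(date +%s)"
```

The rollout restarts the API server pods one at a time, so it can take several minutes to complete.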

Comment 11 Damien Grisonnet 2021-01-15 11:43:29 UTC
Manually moving this BZ to MODIFIED as the upstream PR was synced in 4.7 by the 1.20 rebase.

https://github.com/openshift/kubernetes/pull/471/commits/b525f9e0ed0003471438fb42fa37ff4ebe36d653

Comment 15 Ke Wang 2021-01-19 10:45:32 UTC
$ oc adm release info --commits registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-01-18-214951  | grep 'hyperkube'
  hyperkube           https://github.com/openshift/kubernetes         d9c52cc4e02894215b0d1c2aeea240fe77765c66

$ cd kubernetes
$ git pull

$ git log --date=local --pretty="%h %an %cd - %s" d9c52cc4 | grep  '#96421'
5ed4b76a03b Kubernetes Prow Robot Thu Nov 26 23:24:19 2020 - Merge pull request #96421 from dgrisonnet/fix-apiservice-availability

$ git log --date=local --pretty="%h %an %cd - %s" d9c52cc4 | grep  '#92671'
No results found.

PR 92671 has not landed in the latest OCP 4.7 payload yet; I will wait for it to land.

Comment 21 errata-xmlrpc 2021-02-24 15:26:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633