Description of problem:
AggregatedAPIDown permanently firing after removing APIService

Version-Release number of selected component (if applicable):
4.6.0-rc.4

How reproducible:
Always

Steps to Reproduce:
1. Create an APIService (for me, I installed ACM which created one)
2. Remove the APIService
3. Observe the alert

Actual results:
AggregatedAPIDown permanently firing after removing the APIService

Expected results:
The AggregatedAPIDown alert only checks APIServices that actually still exist

Additional info:
https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/397
https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/406
https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/407
https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/408
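For anyone reproducing this without installing ACM: registering a minimal APIService whose backing service does not exist should produce the same unavailable-then-deleted sequence. The group, version, and service names below are hypothetical, made up for illustration:

```yaml
# Hypothetical APIService for reproduction: the group/version and the backing
# service are invented names. The referenced service intentionally does not
# exist, so the APIService reports Available=False.
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1alpha1.example.test
spec:
  group: example.test
  version: v1alpha1
  groupPriorityMinimum: 1000
  versionPriority: 15
  service:
    name: nonexistent-service
    namespace: default
```

Apply it, wait for it to become unavailable, delete it, and then watch for AggregatedAPIDown to keep firing.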
Created attachment 1721947 [details]
console.png

See attached console screenshot. The APIServices triggering the alert no longer exist.

$ oc get apiservices | grep False
<nothing returned, all apiservices are responding>

$ oc get apiservices | grep v1.admission.hive.openshift.io
<nothing returned, apiservice triggering alert does not exist>
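For context, the alert is driven by the kube-apiserver's aggregator_unavailable_apiservice gauge, and the kubernetes-mixin rule is roughly of the following shape (window and threshold are approximate, paraphrased from the linked upstream issue, not copied verbatim):

```yaml
# Sketch of the kubernetes-mixin AggregatedAPIDown rule; the exact
# range window and threshold vary between mixin releases.
alert: AggregatedAPIDown
expr: |
  (1 - max by (name, namespace) (
    avg_over_time(aggregator_unavailable_apiservice[10m])
  )) * 100 < 85
for: 5m
labels:
  severity: warning
```

Because kube-apiserver did not remove the gauge series when an APIService was deleted, Prometheus kept scraping the last (unavailable) value, so the alert could never resolve.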
I'm observing this issue as well; does a workaround (even an unsupported one) exist? I have an ephemeral monitoring stack and have tried everything from deleting pods and prometheusrules to 'oc delete --raw /metrics', but after a few minutes the alert still ends up firing in my dashboard:

An aggregated API <name of the apiservice>/default has been only 0% available over the last 5m.

Also, this was previously reported upstream on GitHub, but there doesn't seem to be any progress there. I'm going to link it to this BZ.
There is, unfortunately, no workaround for this. However, I noticed that it only affects APIServices that were deleted while they were unavailable. Maybe this information can help you somehow.

> Also, this was previously reported upstream on GitHub, but there doesn't seem to be any progress there. I'm going to link it to this BZ.

Yes, I tried to give it some traction, but without any result. Hence, I am currently working on a PR to fix the issue.
I linked the PR I opened against Kubernetes to this BZ.
The upstream PR has been LGTM'd; it is now in the hands of the API team to cherry-pick the fix.
We need 4.6 and 4.5 backports of this.
A workaround is to restart the kube-apiservers after deleting the APIService; this should silence the AggregatedAPIDown alert. Lowering to high priority and severity since a workaround exists.
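On OCP, that restart can be triggered through the operator rather than by deleting pods by hand. A sketch of the workaround, assuming the stale APIService is the hive one from the screenshot (the redeployment reason string is arbitrary):

```shell
# Delete the stale APIService first (name taken from the attached screenshot).
oc delete apiservice v1.admission.hive.openshift.io

# Force the kube-apiserver operator to roll out new static pods; the restart
# clears the stale aggregator_unavailable_apiservice series from /metrics.
oc patch kubeapiserver cluster --type=merge \
  -p '{"spec":{"forceRedeploymentReason":"clear-stale-apiservice-metric-'"$(date +%s)"'"}}'

# Watch the rollout converge on the new revision.
oc get kubeapiserver cluster -o jsonpath='{.status.nodeStatuses}'
```

The forced redeployment takes several minutes as each control-plane node restarts its kube-apiserver in turn.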
Manually moving this BZ to MODIFIED as the upstream PR was synced into 4.7 by the 1.20 rebase:
https://github.com/openshift/kubernetes/pull/471/commits/b525f9e0ed0003471438fb42fa37ff4ebe36d653
$ oc adm release info --commits registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-01-18-214951 | grep 'hyperkube'
  hyperkube  https://github.com/openshift/kubernetes  d9c52cc4e02894215b0d1c2aeea240fe77765c66

$ cd kubernetes
$ git pull
$ git log --date=local --pretty="%h %an %cd - %s" d9c52cc4 | grep '#96421'
5ed4b76a03b Kubernetes Prow Robot Thu Nov 26 23:24:19 2020 - Merge pull request #96421 from dgrisonnet/fix-apiservice-availability

$ git log --date=local --pretty="%h %an %cd - %s" d9c52cc4 | grep '#92671'
<no results found>

PR 92671 has not landed in the latest OCP 4.7 payload; will wait for it to land.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633