We have metrics for aggregated API server status, but we should also have alerts for them. The metrics are aggregator_unavailable_apiservice_count and aggregator_unavailable_apiserver_gauge, from https://github.com/kubernetes/kubernetes/pull/71380/files#diff-ff2f46bf6d90de801158ee135cae230eR23-R39
I am assigning it to Lili as she will use it to merge https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/358 into the master branch.
I checked the latest OCP 4.4 nightly build; the related PR https://github.com/openshift/cluster-kube-apiserver-operator/pull/746 has not been merged in.

[ke@ke-fedora cluster-kube-apiserver-operator]$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-03-26-041820 | grep kube-apiserver
cluster-kube-apiserver-operator https://github.com/openshift/cluster-kube-apiserver-operator c1adad084525c6252b5725d237b59d1068d145a2

[ke@ke-fedora cluster-kube-apiserver-operator]$ git log --date local --pretty="%h %an %cd - %s" c1adad08 | grep '#746'

Nothing found.
@Ke, the alerts have been added in https://github.com/openshift/cluster-monitoring-operator/pull/669. Could you check one more time?
@Lukasz, I checked PR 669 against OCP build 4.4.0-0.nightly-2020-03-26-041820; it is already in.

[ke@ke-fedora cluster-monitoring-operator]$ git log --date local --pretty="%h %an %cd - %s" 76b306f2 | grep '#669'
dfb08550 OpenShift Merge Robot Sat Mar 7 05:52:14 2020 - Merge pull request #669 from lilic/update-deps-4.4

Verified with OCP build 4.4.0-0.nightly-2020-03-26-225521.

Verification steps:

1. Make some apiservices fail, e.g. remove openshift-apiserver with:

$ oc patch openshiftapiserver cluster --type=json -p '[{"op": "replace", "path": "/spec/managementState", "value": "Removed"}]'

2. Wait for more than 5 minutes, then run the following commands:

$ TK=`oc sa get-token cluster-monitoring-operator -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- curl -s -k -H "Authorization: Bearer $TK" https://localhost:9095/api/v1/alerts | jq -r '.data[] | select(.labels.alertname=="AggregatedAPIDown") | .labels'
{ "alertname": "AggregatedAPIDown", "name": "v1.security.openshift.io", "namespace": "default", "prometheus": "openshift-monitoring/k8s", "severity": "warning" }
{ "alertname": "AggregatedAPIDown", "name": "v1.authorization.openshift.io", "namespace": "default", "prometheus": "openshift-monitoring/k8s", "severity": "warning" }
{ "alertname": "AggregatedAPIDown", "name": "v1.oauth.openshift.io", "namespace": "default", "prometheus": "openshift-monitoring/k8s", "severity": "warning" }
{ "alertname": "AggregatedAPIDown", "name": "v1.image.openshift.io", "namespace": "default", "prometheus": "openshift-monitoring/k8s", "severity": "warning" }
{ "alertname": "AggregatedAPIDown", "name": "v1.route.openshift.io", "namespace": "default", "prometheus": "openshift-monitoring/k8s", "severity": "warning" }
{ "alertname": "AggregatedAPIDown", "name": "v1.project.openshift.io", "namespace": "default", "prometheus": "openshift-monitoring/k8s", "severity": "warning" }
{ "alertname": "AggregatedAPIDown", "name": "v1.build.openshift.io", "namespace": "default", "prometheus": "openshift-monitoring/k8s", "severity": "warning" }
{ "alertname": "AggregatedAPIDown", "name": "v1.apps.openshift.io", "namespace": "default", "prometheus": "openshift-monitoring/k8s", "severity": "warning" }
{ "alertname": "AggregatedAPIDown", "name": "v1.user.openshift.io", "namespace": "default", "prometheus": "openshift-monitoring/k8s", "severity": "warning" }
{ "alertname": "AggregatedAPIDown", "name": "v1.template.openshift.io", "namespace": "default", "prometheus": "openshift-monitoring/k8s", "severity": "warning" }
{ "alertname": "AggregatedAPIDown", "name": "v1.quota.openshift.io", "namespace": "default", "prometheus": "openshift-monitoring/k8s", "severity": "warning" }

In total, 11 APIs are reported down, so the alert works as expected with the PR merged into the OCP build.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581