Bug 1772564

Summary: need alerts for aggregated API metrics
Product: OpenShift Container Platform
Reporter: David Eads <deads>
Component: kube-apiserver
Assignee: Lili Cosic <lcosic>
Status: CLOSED ERRATA
QA Contact: Ke Wang <kewang>
Severity: high
Docs Contact:
Priority: high
Version: 4.3.0
CC: aos-bugs, lszaszki, mfojtik, nagrawal, sttts, xxia
Target Milestone: ---
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Release Note
Doc Text:
Added AggregatedAPIErrors Prometheus alert: An aggregated API has reported errors. The number of errors have increased for it in the past five minutes. High values indicate that the availability of the service changes too often.
Story Points: ---
Clone Of:
: 1810424
Environment:
Last Closed: 2020-05-04 11:15:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1810424
Bug Blocks:

Description David Eads 2019-11-14 15:58:30 UTC
We have metrics for aggregated API server status, but we should have alerts for the same.

 aggregator_unavailable_apiservice_count
 aggregator_unavailable_apiserver_gauge

from https://github.com/kubernetes/kubernetes/pull/71380/files#diff-ff2f46bf6d90de801158ee135cae230eR23-R39

Comment 2 Lukasz Szaszkiewicz 2020-02-17 12:15:10 UTC
I am assigning it to Lili as she will use it to merge https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/358 into the master branch.

Comment 5 Ke Wang 2020-03-26 10:00:43 UTC
I checked the latest OCP 4.4 nightly build; the related PR https://github.com/openshift/cluster-kube-apiserver-operator/pull/746 has not been merged in.

[ke@ke-fedora cluster-kube-apiserver-operator]$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-03-26-041820 |grep kube-apiserver
  cluster-kube-apiserver-operator                https://github.com/openshift/cluster-kube-apiserver-operator                c1adad084525c6252b5725d237b59d1068d145a2

[ke@ke-fedora cluster-kube-apiserver-operator]$ git log --date local --pretty="%h %an %cd - %s" c1adad08 | grep '#746'

Nothing found.
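
The check above can be wrapped in a small helper. This is only a sketch and not part of the original verification: the function name `pr_in_log` is hypothetical, and it assumes merge commits mention the pull request as `#N` in their subject line, as in the `git log` format used above.

```shell
# pr_in_log PR_NUMBER
# Reads `git log` lines on stdin and succeeds if any line mentions the pull
# request as "#PR_NUMBER". (Hypothetical helper name, for illustration only.)
pr_in_log() {
  # Require a non-digit or end of line after the number, so that looking for
  # "#746" does not match "#7461".
  grep -Eq -- "#$1([^0-9]|\$)"
}

# Usage inside a checkout of the operator repository (not run here):
#   git log --date local --pretty="%h %an %cd - %s" c1adad08 | pr_in_log 746 \
#     && echo "merged" || echo "not merged"
```

An exit status is easier to script against than eyeballing empty `grep` output.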

Comment 6 Lukasz Szaszkiewicz 2020-03-26 10:07:06 UTC
@Ke the alerts have been added in https://github.com/openshift/cluster-monitoring-operator/pull/669, could you check one more time?

Comment 7 Ke Wang 2020-03-27 07:57:09 UTC
@Lukasz, I checked PR 669 with OCP build 4.4.0-0.nightly-2020-03-26-041820, it's already in.

[ke@ke-fedora cluster-monitoring-operator]$ git log --date local --pretty="%h %an %cd - %s" 76b306f2 | grep '#669'
dfb08550 OpenShift Merge Robot Sat Mar 7 05:52:14 2020 - Merge pull request #669 from lilic/update-deps-4.4

Verified with OCP build 4.4.0-0.nightly-2020-03-26-225521.

Verification steps:
1. Make some APIServices fail, e.g. remove openshift-apiserver with:
 $ oc patch openshiftapiserver cluster --type=json -p '[{"op": "replace", "path": "/spec/managementState", "value": "Removed"}]'
 
2. Wait for more than 5 minutes, then run the following commands:

$ TK=`oc sa get-token cluster-monitoring-operator -n openshift-monitoring` 
$ oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- curl -s -k -H "Authorization: Bearer $TK" https://localhost:9095/api/v1/alerts | jq -r '.data[] | select(.labels.alertname=="AggregatedAPIDown") | .labels'
{
  "alertname": "AggregatedAPIDown",
  "name": "v1.security.openshift.io",
  "namespace": "default",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
}
{
  "alertname": "AggregatedAPIDown",
  "name": "v1.authorization.openshift.io",
  "namespace": "default",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
}
{
  "alertname": "AggregatedAPIDown",
  "name": "v1.oauth.openshift.io",
  "namespace": "default",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
}
{
  "alertname": "AggregatedAPIDown",
  "name": "v1.image.openshift.io",
  "namespace": "default",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
}
{
  "alertname": "AggregatedAPIDown",
  "name": "v1.route.openshift.io",
  "namespace": "default",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
}
{
  "alertname": "AggregatedAPIDown",
  "name": "v1.project.openshift.io",
  "namespace": "default",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
}
{
  "alertname": "AggregatedAPIDown",
  "name": "v1.build.openshift.io",
  "namespace": "default",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
}
{
  "alertname": "AggregatedAPIDown",
  "name": "v1.apps.openshift.io",
  "namespace": "default",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
}
{
  "alertname": "AggregatedAPIDown",
  "name": "v1.user.openshift.io",
  "namespace": "default",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
}
{
  "alertname": "AggregatedAPIDown",
  "name": "v1.template.openshift.io",
  "namespace": "default",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
}
{
  "alertname": "AggregatedAPIDown",
  "name": "v1.quota.openshift.io",
  "namespace": "default",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
}

In total, 11 APIs are reported down.
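
The total can be computed rather than counted by hand. The following is a minimal sketch; `count_alert` is a hypothetical helper name, and it assumes the pretty-printed `jq` output shown above, where each label object carries exactly one `"alertname"` line.

```shell
# count_alert NAME
# Counts the label objects for alert NAME in pretty-printed jq output read
# from stdin. (Hypothetical helper name, for illustration only.)
count_alert() {
  # grep -c still prints 0 but exits non-zero when nothing matches;
  # "|| true" keeps the helper usable under "set -e".
  grep -c -- "\"alertname\": \"$1\"" || true
}

# Usage: append to the curl | jq pipeline from step 2 (not run here):
#   ... | jq -r '.data[] | select(.labels.alertname=="AggregatedAPIDown") | .labels' \
#       | count_alert AggregatedAPIDown
```

Applied to the output listed above, this prints 11.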

The feature works as expected with the PR merged into the OCP build.
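
To undo step 1 after verification, the same patch can be reapplied with a different value. This is a sketch that was not part of the original verification: `management_state_patch` is a hypothetical helper name, and it assumes the cluster's `managementState` was `Managed` before the test.

```shell
# management_state_patch STATE
# Prints the JSON patch from step 1 with the given managementState value.
# (Hypothetical helper name, for illustration only.)
management_state_patch() {
  printf '[{"op": "replace", "path": "/spec/managementState", "value": "%s"}]\n' "$1"
}

# Usage against a live cluster (not run here); assumes the previous state
# was "Managed":
#   oc patch openshiftapiserver cluster --type=json -p "$(management_state_patch Managed)"
```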

Comment 9 errata-xmlrpc 2020-05-04 11:15:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581