Bug 1772564 - need alerts for aggregated API metrics
Summary: need alerts for aggregated API metrics
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Lili Cosic
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On: 1810424
Blocks:
Reported: 2019-11-14 15:58 UTC by David Eads
Modified: 2020-05-04 11:16 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Release Note
Doc Text:
Added the AggregatedAPIErrors Prometheus alert: an aggregated API has reported errors, and the number of errors has increased over the past five minutes. High values indicate that the availability of the service changes too often.
Clone Of:
Clones: 1810424
Environment:
Last Closed: 2020-05-04 11:15:35 UTC
Target Upstream Version:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-kube-apiserver-operator pull 746 | 0 | None | closed | Bug 1798450: alerts for aggregated API metrics | 2021-01-18 23:58:54 UTC
Github | openshift cluster-monitoring-operator pull 669 | 0 | None | closed | Bug 1772564: Bring in new alert via dependency | 2021-01-18 23:58:54 UTC
Red Hat Product Errata | RHBA-2020:0581 | 0 | None | None | None | 2020-05-04 11:16:06 UTC

Description David Eads 2019-11-14 15:58:30 UTC
We have metrics for aggregated API server status, but we should have alerts for the same.

 aggregator_unavailable_apiservice_count
 aggregator_unavailable_apiserver_gauge

from https://github.com/kubernetes/kubernetes/pull/71380/files#diff-ff2f46bf6d90de801158ee135cae230eR23-R39

Comment 2 Lukasz Szaszkiewicz 2020-02-17 12:15:10 UTC
I am assigning it to Lili as she will use it to merge https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/358 into the master branch.

Comment 5 Ke Wang 2020-03-26 10:00:43 UTC
I checked the latest OCP 4.4 nightly build; the related PR https://github.com/openshift/cluster-kube-apiserver-operator/pull/746 has not been merged in.

[ke@ke-fedora cluster-kube-apiserver-operator]$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-03-26-041820 |grep kube-apiserver
  cluster-kube-apiserver-operator                https://github.com/openshift/cluster-kube-apiserver-operator                c1adad084525c6252b5725d237b59d1068d145a2

[ke@ke-fedora cluster-kube-apiserver-operator]$ git log --date local --pretty="%h %an %cd - %s" c1adad08 | grep '#746'

Nothing found.
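
The check above greps commit subjects for the PR number. A minimal sketch of the same check as a reusable function follows; `pr_merged_in` is a hypothetical helper name, and it assumes the repository records merges with GitHub-style "Merge pull request #N from ..." subjects. An empty result is not conclusive if a PR was squashed or landed on another branch.

```shell
# Hypothetical helper: succeed if a "Merge pull request #<pr> ..." commit
# is reachable from <commit> in the current git repository.
pr_merged_in() {
  commit=$1
  pr=$2
  # The trailing space keeps "#746" from also matching "#7460".
  git log --pretty='%s' "$commit" | grep -q "Merge pull request #${pr} "
}
```

For example, `pr_merged_in c1adad08 746 && echo merged || echo "not found"` could be run from a checkout of the operator repository.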

Comment 6 Lukasz Szaszkiewicz 2020-03-26 10:07:06 UTC
@Ke the alerts have been added in https://github.com/openshift/cluster-monitoring-operator/pull/669, could you check one more time?

Comment 7 Ke Wang 2020-03-27 07:57:09 UTC
@Lukasz, I checked PR 669 against OCP build 4.4.0-0.nightly-2020-03-26-041820; it is already in.

[ke@ke-fedora cluster-monitoring-operator]$ git log --date local --pretty="%h %an %cd - %s" 76b306f2 | grep '#669'
dfb08550 OpenShift Merge Robot Sat Mar 7 05:52:14 2020 - Merge pull request #669 from lilic/update-deps-4.4

Verified with OCP build 4.4.0-0.nightly-2020-03-26-225521.

Verification steps:
1. Make some APIServices fail, e.g. by removing openshift-apiserver:
 $ oc patch openshiftapiserver cluster --type=json -p '[{"op": "replace", "path": "/spec/managementState", "value": "Removed"}]'
 
2. Wait more than 5 minutes, then run the following commands:

$ TK=`oc sa get-token cluster-monitoring-operator -n openshift-monitoring` 
$ oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- curl -s -k -H "Authorization: Bearer $TK" https://localhost:9095/api/v1/alerts | jq -r '.data[] | select(.labels.alertname=="AggregatedAPIDown") | .labels'
{
  "alertname": "AggregatedAPIDown",
  "name": "v1.security.openshift.io",
  "namespace": "default",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
}
{
  "alertname": "AggregatedAPIDown",
  "name": "v1.authorization.openshift.io",
  "namespace": "default",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
}
{
  "alertname": "AggregatedAPIDown",
  "name": "v1.oauth.openshift.io",
  "namespace": "default",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
}
{
  "alertname": "AggregatedAPIDown",
  "name": "v1.image.openshift.io",
  "namespace": "default",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
}
{
  "alertname": "AggregatedAPIDown",
  "name": "v1.route.openshift.io",
  "namespace": "default",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
}
{
  "alertname": "AggregatedAPIDown",
  "name": "v1.project.openshift.io",
  "namespace": "default",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
}
{
  "alertname": "AggregatedAPIDown",
  "name": "v1.build.openshift.io",
  "namespace": "default",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
}
{
  "alertname": "AggregatedAPIDown",
  "name": "v1.apps.openshift.io",
  "namespace": "default",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
}
{
  "alertname": "AggregatedAPIDown",
  "name": "v1.user.openshift.io",
  "namespace": "default",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
}
{
  "alertname": "AggregatedAPIDown",
  "name": "v1.template.openshift.io",
  "namespace": "default",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
}
{
  "alertname": "AggregatedAPIDown",
  "name": "v1.quota.openshift.io",
  "namespace": "default",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
}

Total 11 APIs down.
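
The count can also be derived rather than tallied by hand. A rough sketch, assuming the pretty-printed label objects from the jq filter above were saved to a file; `count_down_apis` and `list_down_apis` are hypothetical helper names and only handle that exact output layout:

```shell
# Hypothetical helpers; they expect the pretty-printed label objects
# produced by the jq filter above, saved to a file.

# Print the number of AggregatedAPIDown alerts in the file.
count_down_apis() {
  grep -c '"alertname": "AggregatedAPIDown"' "$1"
}

# Print the APIService name of each alert, one per line, sorted.
# The pattern anchors on "name" so the "alertname" lines are skipped.
list_down_apis() {
  sed -n 's/^ *"name": "\([^"]*\)",*$/\1/p' "$1" | sort
}
```

With the output above saved to a file, `count_down_apis` should report 11, matching the manual total.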

The feature works as expected with the PR merged into the OCP build.

Comment 9 errata-xmlrpc 2020-05-04 11:15:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

