1893850 – Add an alert for requests rejected by the apiserver

Bug 1893850 - Add an alert for requests rejected by the apiserver [NEEDINFO]

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	kube-apiserver
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	medium
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Stefan Schimanski
QA Contact:	Ke Wang
Docs Contact:
URL:
Whiteboard:	LifecycleReset
Depends On:
Blocks:	1893854

Reported:	2020-11-02 19:08 UTC by Abu Kashem
Modified:	2021-07-27 22:34 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Clones:	1893854
Environment:
Last Closed:	2021-07-27 22:34:10 UTC
Target Upstream Version:
Embargoed:
Flags:	mfojtik: needinfo?

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 22:34:30 UTC

Description Abu Kashem 2020-11-02 19:08:13 UTC

The current alert 'KubeAPIErrorsHigh' does not take into account requests rejected by the apiserver. The alert checks the 'apiserver_request_total' metric for 5xx errors. When the apiserver rejects a request, it does not record the 'apiserver_request_total' metric; instead, it records the 'apiserver_request_terminations_total' metric.

So 'KubeAPIErrorsHigh' is not aware of any requests rejected by the apiserver. We need to add an alert that inspects the 'apiserver_request_terminations_total' metric and alerts if requests are being rejected.

Note:
- 'KubeAPIErrorsHigh' has been removed in 4.6 and replaced with 'KubeAPIErrorBudgetBurn'.
- The alert should be added to mixin first and then imported to OpenShift. This is where we would add the new alert - https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/kube_apiserver.libsonnet#L19

Comment 1 Michal Fojtik 2020-12-02 19:46:19 UTC

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 2 Ivan Sim 2020-12-20 18:52:30 UTC

@akashem I have been doing some testing on the `apiserver_request_terminations_total`
metric. My current thought is that the new alert rule should be expressed as a percentage:
the total rejected requests divided by the total requests, over a predefined
duration, i.e.,

sum(rate(apiserver_request_terminations_total[15m])) / ( sum(rate(apiserver_request_total[15m])) + sum(rate(apiserver_request_terminations_total[15m])) )

The denominator needs to be the sum of the two metrics because apiserver_request_total
doesn't capture the 429 errors.

This feels more reasonable to me than an alert rule that compares
`apiserver_request_terminations_total` with a constant, since the threshold may be different
for different environments.

WDYT?
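
For illustration, the arithmetic behind the proposed ratio can be sketched in Python. This is only a sketch: the function name and the sample rates are made up for the example, and returning 0 when there is no traffic is a choice made here, not something the expression above specifies.

```python
def terminated_ratio(terminated_rate: float, served_rate: float) -> float:
    """Fraction of incoming requests that the apiserver terminated.

    terminated_rate: per-second rate derived from apiserver_request_terminations_total
    served_rate:     per-second rate derived from apiserver_request_total
    """
    # Terminated requests are not counted in apiserver_request_total,
    # so the total incoming rate is the sum of both rates.
    total = served_rate + terminated_rate
    if total == 0:
        return 0.0  # assumption for this sketch: no traffic means nothing terminated
    return terminated_rate / total


# Hypothetical rates, for illustration only.
print(terminated_ratio(terminated_rate=25.0, served_rate=75.0))  # 0.25
```

Dividing by the served rate alone would overstate the ratio (25 / 75 instead of 25 / 100 in the sample above), which is why both metrics appear in the denominator.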

Comment 3 Michal Fojtik 2020-12-20 18:58:28 UTC

The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 4 Ivan Sim 2021-01-04 21:20:44 UTC

The upstream PR[1] to the kubernetes-mixin repo has been reviewed and merged. 
The next step is to update the jsonnet dependencies in the 
cluster-monitoring-operator repo. AFAICT, the kubernetes-mixin is pulled in
as a dependency of the kube-prometheus config[2][3]. 

I will confirm with the monitoring team regarding the steps to update these
jsonnet assets.


[1] https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/538
[2] https://github.com/openshift/cluster-monitoring-operator/blob/master/jsonnet/jsonnetfile.json#L13-L21
[3] https://github.com/prometheus-operator/kube-prometheus/blob/808c2f8c3d760c7c02ab4ef1b41987da83d54f90/jsonnet/kube-prometheus/jsonnetfile.json#L41-L48

Comment 5 Ivan Sim 2021-01-06 05:17:36 UTC

The monitoring team confirmed that the upstream kubernetes-mixin
jsonnet assets are pulled in as part of their release process where
the kube-prometheus dependency is upgraded. This BZ will have to be
put on-hold until the next CMO release.

Comment 6 Abu Kashem 2021-02-03 22:26:39 UTC

isim,
can you check with the monitoring team whether this alert is going to make it to 4.7? Otherwise, we will need to set the target release to 4.8.

Comment 7 Ivan Sim 2021-02-04 21:15:45 UTC

@akashem given that the code freeze starts tomorrow (Feb 5), and the CMO is still
pointing to the December release of kube-prometheus, I'd say it's unlikely that this will be
included in 4.7. I reached out to the CMO team but haven't received any response. I think we
can move this to 4.8. If anything changes, I will let you know.

Comment 8 Abu Kashem 2021-02-05 06:02:44 UTC

changing the target release to 4.8 per Ivan's feedback.

Comment 9 Michal Fojtik 2021-03-07 06:26:53 UTC

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 10 Ivan Sim 2021-04-22 21:14:10 UTC

The `KubeAPITerminatedRequests` alert landed in CMO about 2 months ago[1]. This
alert should now be visible in the OpenShift monitoring alerting console. 

To test, deploy this[2] custom controller to the cluster; it will send a
series of loads to the API server to trigger P&F throttling. The TICK_INTERVAL
env var can be decreased/increased to increase/decrease the load QPS. The alert
should be activated when P&F starts rejecting requests. To recover from the
alert, scale the custom controller deployment down to zero pods.

[1] https://github.com/openshift/cluster-monitoring-operator/pull/1044
[2] https://github.com/ihcsim/controllers/blob/master/podlister/deployment.yaml

Comment 12 Michal Fojtik 2021-04-22 22:06:18 UTC

The LifecycleStale keyword was removed because the bug moved to QE.
The bug assignee was notified.

Comment 13 Ke Wang 2021-04-30 09:12:37 UTC

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-30-031222   True        False         3h4m    Cluster version is 4.8.0-0.nightly-2021-04-30-031222

$ oc get PrometheusRule -n openshift-monitoring -o yaml | grep -A8 'KubeAPITerminatedRequests'
      - alert: KubeAPITerminatedRequests
        annotations:
          description: The apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests.
          summary: The apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests.
        expr: |
          sum(rate(apiserver_request_terminations_total{job="apiserver"}[10m]))  / (  sum(rate(apiserver_request_total{job="apiserver"}[10m])) + sum(rate(apiserver_request_terminations_total{job="apiserver"}[10m])) ) > 0.20
        for: 5m
        labels:
          severity: warning

This alert can also be seen in the web-console UI, so moving the bug to VERIFIED.
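
To look at the current value of the ratio without waiting for the alert to fire, one option is to send the left-hand side of the rule expression (everything before "> 0.20") to the Prometheus HTTP API instant-query endpoint, /api/v1/query. The Python sketch below only builds the request URL. The host name is a placeholder, the helper name is made up, and authentication against the cluster's monitoring stack is deliberately left out.

```python
from urllib.parse import urlencode

# Left-hand side of the KubeAPITerminatedRequests rule, without the "> 0.20" threshold.
EXPR = (
    'sum(rate(apiserver_request_terminations_total{job="apiserver"}[10m]))'
    ' / ( sum(rate(apiserver_request_total{job="apiserver"}[10m]))'
    ' + sum(rate(apiserver_request_terminations_total{job="apiserver"}[10m])) )'
)


def instant_query_url(base_url: str, expr: str) -> str:
    """Build a Prometheus instant-query URL for the given PromQL expression."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": expr})


# Placeholder host; substitute the Prometheus endpoint of the cluster under test.
url = instant_query_url("https://prometheus.example.invalid", EXPR)
print(url)
```

A returned value above 0.20 that persists for 5 minutes corresponds to the condition under which the rule shown above fires.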

Comment 17 errata-xmlrpc 2021-07-27 22:34:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
