Bug 1893850
| Summary: | Add an alert for requests rejected by the apiserver | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Abu Kashem <akashem> | |
| Component: | kube-apiserver | Assignee: | Stefan Schimanski <sttts> | |
| Status: | CLOSED ERRATA | QA Contact: | Ke Wang <kewang> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | low | |||
| Version: | 4.5 | CC: | aos-bugs, mfojtik, xxia | |
| Target Milestone: | --- | Flags: | mfojtik: needinfo? | |
| Target Release: | 4.8.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | LifecycleReset | |||
| Fixed In Version: | Doc Type: | No Doc Update | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1893854 (view as bug list) | Environment: | ||
| Last Closed: | 2021-07-27 22:34:10 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1893854 | |||
|
Description
Abu Kashem
2020-11-02 19:08:13 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

@akashem I have been doing some testing on the `apiserver_request_terminations_total` metric. My current thought is that the new alert rule should be expressed as a percentage, calculated as the total rejected requests over the total requests, over a predefined duration, i.e.,

sum(rate(apiserver_request_terminations_total[15m])) / ( sum(rate(apiserver_request_total[15m])) + sum(rate(apiserver_request_terminations_total[15m])) )

The denominator needs to be the sum of the two metrics because apiserver_request_total doesn't capture the 429 errors. This feels more reasonable to me than an alert rule that relies on comparing `apiserver_request_terminations_total` with a constant, since the threshold may be different for different environments. WDYT?

The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.

The upstream PR[1] to the kubernetes-mixin repo has been reviewed and merged. The next step is to update the jsonnet dependencies in the cluster-monitoring-operator repo. AFAICT, the kubernetes-mixin is pulled in as a dependency of the kube-prometheus config[2][3].
I will confirm with the monitoring team regarding the steps to update these jsonnet assets.

[1] https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/538
[2] https://github.com/openshift/cluster-monitoring-operator/blob/master/jsonnet/jsonnetfile.json#L13-L21
[3] https://github.com/prometheus-operator/kube-prometheus/blob/808c2f8c3d760c7c02ab4ef1b41987da83d54f90/jsonnet/kube-prometheus/jsonnetfile.json#L41-L48

The monitoring team confirmed that the upstream kubernetes-mixin jsonnet assets are pulled in as part of their release process where the kube-prometheus dependency is upgraded. This BZ will have to be put on hold until the next CMO release.

isim, can you check with the monitoring team whether this alert is going to make it to 4.7? Otherwise, we will need to set the target release to 4.8.

@akashem given that the code freeze starts tomorrow (Feb 5), and the CMO is still pointing to the December release of kube-prometheus, I'd say it's unlikely that this will be included in 4.7. I reached out to the CMO team but haven't received any response. I think we can move this to 4.8. If anything changes, I will let you know.

Changing the target release to 4.8 per Ivan's feedback.

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale.
Please consult with bug assignee before you do that.

The `KubeAPITerminatedRequests` alert landed in CMO about 2 months ago[1]. This alert should now be visible in the OpenShift monitoring alerting console. To test, deploy this[2] custom controller to the cluster; it will send a series of loads to the API server to trigger P&F throttling. The TICK_INTERVAL env var can be decreased or increased to raise or lower the load QPS, respectively. The alert should fire when P&F starts rejecting requests. To recover from the alert, scale the custom controller deployment down to zero pods.

[1] https://github.com/openshift/cluster-monitoring-operator/pull/1044
[2] https://github.com/ihcsim/controllers/blob/master/podlister/deployment.yaml

The LifecycleStale keyword was removed because the bug moved to QE. The bug assignee was notified.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-30-031222   True        False         3h4m    Cluster version is 4.8.0-0.nightly-2021-04-30-031222
$ oc get PrometheusRule -n openshift-monitoring -o yaml | grep -A8 'KubeAPITerminatedRequests'
    - alert: KubeAPITerminatedRequests
      annotations:
        description: The apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests.
        summary: The apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests.
      expr: |
        sum(rate(apiserver_request_terminations_total{job="apiserver"}[10m])) / ( sum(rate(apiserver_request_total{job="apiserver"}[10m])) + sum(rate(apiserver_request_terminations_total{job="apiserver"}[10m])) ) > 0.20
      for: 5m
      labels:
        severity: warning
This alert can also be seen in the web console UI, so moving the bug to VERIFIED.
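
For readers who want to sanity-check the threshold, the arithmetic in the alert expression can be sketched in Python. This is only an illustration of the ratio and the 0.20 cutoff taken from the rule above. The function names and the sample rates are hypothetical, and the sketch does not model the `for: 5m` hold period or the `rate()` windowing that Prometheus performs.

```python
def terminated_ratio(terminated_rate: float, served_rate: float) -> float:
    """Fraction of incoming requests that the apiserver terminated.

    Mirrors the shape of the alert expression:
        terminated / (served + terminated)
    The terminated rate is added to the denominator because, per the
    discussion in this bug, apiserver_request_total does not count the
    rejected (429) requests.
    """
    incoming = served_rate + terminated_rate
    if incoming == 0:
        # Assumption for this sketch: report 0 when there is no traffic.
        # In PromQL the division would yield NaN, which never passes "> 0.20".
        return 0.0
    return terminated_rate / incoming


def should_alert(terminated_rate: float, served_rate: float,
                 threshold: float = 0.20) -> bool:
    """True when the terminated fraction exceeds the rule's threshold."""
    return terminated_ratio(terminated_rate, served_rate) > threshold


# Hypothetical per-second rates, not taken from a real cluster.
print(terminated_ratio(30.0, 90.0))  # 30 / (90 + 30)
print(should_alert(30.0, 90.0))      # above the 20% threshold
print(should_alert(10.0, 90.0))      # 10 / 100, below the threshold
```

Because the result is a ratio rather than an absolute count, the same threshold can apply to clusters with very different request volumes, which is the motivation given earlier in the thread.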
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438