The current alert 'KubeAPIErrorsHigh' does not take into account requests rejected by the apiserver. The alert checks the 'apiserver_request_total' metric for 5xx errors, but when the apiserver rejects a request it does not record 'apiserver_request_total'; instead it records 'apiserver_request_terminations_total'. As a result, 'KubeAPIErrorsHigh' is not aware of any requests rejected by the apiserver. We need to add an alert that inspects the 'apiserver_request_terminations_total' metric and fires when requests are being rejected.

Note:
- 'KubeAPIErrorsHigh' has been removed in 4.6 and replaced with 'KubeAPIErrorBudgetBurn'.
- The alert should be added to the mixin first and then imported into OpenShift. This is where the new alert would be added: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/kube_apiserver.libsonnet#L19
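For reference, a quick way to inspect the rejected-request rate is a PromQL query along these lines (a rough sketch only; the job selector and the code/component labels are assumptions and may differ between clusters and Kubernetes versions):

    # Rate of requests terminated (rejected) by the apiserver over the last 5 minutes,
    # broken out by HTTP code and component (labels assumed to exist on the metric).
    sum by (code, component) (rate(apiserver_request_terminations_total{job="apiserver"}[5m]))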
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
@akashem I have been doing some testing on the `apiserver_request_terminations_total` metric. My current thought is that the new alert rule should be expressed as a percentage: the rate of rejected requests over the total request rate, over a predefined duration, i.e.,

    sum(rate(apiserver_request_terminations_total[15m])) / ( sum(rate(apiserver_request_total[15m])) + sum(rate(apiserver_request_terminations_total[15m])) )

The denominator needs to be a summation of the two metrics because apiserver_request_total doesn't capture the 429 errors. This feels more reasonable to me than an alert rule that compares `apiserver_request_terminations_total` against a constant threshold, since the appropriate threshold may differ across environments. WDYT?
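Expressed as a Prometheus alerting rule, that ratio could look something like the sketch below (the alert name, threshold, rate window, and `for` duration are illustrative placeholders, not the final values merged upstream):

    - alert: KubeAPITerminatedRequests
      annotations:
        description: The apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests.
      expr: |
        sum(rate(apiserver_request_terminations_total{job="apiserver"}[15m]))
          /
        (
          sum(rate(apiserver_request_total{job="apiserver"}[15m]))
          +
          sum(rate(apiserver_request_terminations_total{job="apiserver"}[15m]))
        ) > 0.20
      for: 15m
      labels:
        severity: warning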
The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.
The upstream PR[1] to the kubernetes-mixin repo has been reviewed and merged. The next step is to update the jsonnet dependencies in the cluster-monitoring-operator repo. AFAICT, the kubernetes-mixin is pulled in as a dependency of the kube-prometheus config[2][3]. I will confirm with the monitoring team regarding the steps to update these jsonnet assets. [1] https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/538 [2] https://github.com/openshift/cluster-monitoring-operator/blob/master/jsonnet/jsonnetfile.json#L13-L21 [3] https://github.com/prometheus-operator/kube-prometheus/blob/808c2f8c3d760c7c02ab4ef1b41987da83d54f90/jsonnet/kube-prometheus/jsonnetfile.json#L41-L48
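For context, a minimal sketch of how such a dependency bump is typically done with jsonnet-bundler (jb) from the directory containing jsonnetfile.json; the exact workflow in cluster-monitoring-operator (e.g. a dedicated make target) is an assumption to be confirmed with the monitoring team:

    # Update the pinned jsonnet dependencies listed in jsonnetfile.json,
    # including kube-prometheus (which in turn pulls in kubernetes-mixin).
    jb update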
The monitoring team confirmed that the upstream kubernetes-mixin jsonnet assets are pulled in as part of their release process, when the kube-prometheus dependency is upgraded. This BZ will have to be put on hold until the next CMO release.
isim, can you check with the monitoring team whether this alert is going to make it to 4.7? Otherwise, we will need to set the target release to 4.8.
@akashem given that the code freeze starts tomorrow (Feb 5), and the CMO is still pointing to the December release of kube-prometheus, I'd say it's unlikely that this will be included in 4.7. I reached out to the CMO team but haven't received any response. I think we can move this to 4.8. If anything changes, I will let you know.
changing the target release to 4.8 per Ivan's feedback.
The `KubeAPITerminatedRequests` alert landed in CMO about 2 months ago[1]. This alert should now be visible in the OpenShift monitoring alerting console. To test, deploy this[2] custom controller to the cluster; it will generate load against the API server to trigger P&F (API Priority and Fairness) throttling. The TICK_INTERVAL env var can be decreased/increased to increase/decrease the load QPS. The alert should fire when P&F starts rejecting requests. To recover from the alert, scale the custom controller deployment down to zero pods. [1] https://github.com/openshift/cluster-monitoring-operator/pull/1044 [2] https://github.com/ihcsim/controllers/blob/master/podlister/deployment.yaml
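A hedged sketch of those verification steps (the deployment name "podlister" is an assumption based on the repo path in [2] and may differ; adjust the namespace as needed):

    # Deploy the load-generating controller from [2] (download the manifest first,
    # optionally tuning the TICK_INTERVAL env var before applying).
    oc apply -f deployment.yaml

    # Once P&F starts rejecting requests, KubeAPITerminatedRequests should become active.
    # To recover, scale the controller back down to zero replicas:
    oc scale deployment podlister --replicas=0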
The LifecycleStale keyword was removed because the bug moved to QE. The bug assignee was notified.
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-30-031222   True        False         3h4m    Cluster version is 4.8.0-0.nightly-2021-04-30-031222

$ oc get PrometheusRule -n openshift-monitoring -o yaml | grep -A8 'KubeAPITerminatedRequests'
    - alert: KubeAPITerminatedRequests
      annotations:
        description: The apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests.
        summary: The apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests.
      expr: |
        sum(rate(apiserver_request_terminations_total{job="apiserver"}[10m]))  /  (  sum(rate(apiserver_request_total{job="apiserver"}[10m])) + sum(rate(apiserver_request_terminations_total{job="apiserver"}[10m])) ) > 0.20
      for: 5m
      labels:
        severity: warning

This alert can also be seen in the web console UI, so moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438