Bug 2057967
Summary: | KubeJobCompletion does not account for possible job states | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
Component: | Monitoring | Assignee: | Simon Pasquier <spasquie> |
Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
Severity: | medium | Docs Contact: | |
Priority: | low | | |
Version: | 4.6 | CC: | amuller, anpicker, aos-bugs, juzhao, kgordeev, spasquie |
Target Milestone: | --- | | |
Target Release: | 4.11.0 | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | No Doc Update |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2022-08-10 10:51:15 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
W. Trevor King
2022-02-24 09:01:11 UTC
In 4.10, the alert is KubeJobCompletion:

    - alert: KubeJobCompletion
      annotations:
        description: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than 12 hours to complete.
        summary: Job did not complete in time
      expr: |
        kube_job_spec_completions{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"} - kube_job_status_succeeded{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"} > 0
      for: 12h
      labels:
        severity: warning

In 4.11.0-0.nightly-2022-04-12-072444, KubeJobCompletion is renamed to KubeJobNotCompleted, with the expression below:

    - alert: KubeJobNotCompleted
      annotations:
        description: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than {{ "43200" | humanizeDuration }} to complete.
        summary: Job did not complete in time
      expr: |
        time() - max by(namespace, job_name) (kube_job_status_start_time{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"}
          and
        kube_job_status_active{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"} > 0) > 43200
      labels:
        severity: warning

With this expression, the alert is never in Pending status; we only see it firing after 12 hours. The expression is fine, but I think keeping the `for` setting would be better. WDYT, @Arunprasad Rajkumar

@juzhao, thanks for testing this. This alert expression simply relies on the job start time value from k8s/etcd instead of Prometheus keeping track of it. This was discussed in https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/746#discussion_r818861718, and it was decided to remove the `for` clause.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
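For readers comparing the two rules discussed in this thread, the following is a minimal Python sketch of why the 4.11 expression never shows a Pending state. It is a simplified model, not Prometheus code: the function names and the three state strings are illustrative assumptions, and only the 43200-second threshold and the general shape of each rule come from the comments above.

```python
from typing import Optional

THRESHOLD_SECONDS = 43200  # 12 hours, the value used in the 4.11 rule


def state_with_for_clause(now: float, first_true_at: Optional[float]) -> str:
    """Model of a rule with `for: 12h` (4.10 style).

    `first_true_at` is the (hypothetical) time at which the rule expression
    first returned a result; the rule engine itself tracks the elapsed time.
    """
    if first_true_at is None:
        return "inactive"
    if now - first_true_at >= THRESHOLD_SECONDS:
        return "firing"
    return "pending"


def state_with_start_time(
    now: float, job_start_time: Optional[float], active_pods: int
) -> str:
    """Model of `time() - start_time > 43200` with no `for` clause (4.11 style).

    The duration is encoded in the expression, so the expression returns
    nothing until the threshold is crossed and the alert then fires at once.
    """
    if job_start_time is None or active_pods <= 0:
        return "inactive"
    if now - job_start_time > THRESHOLD_SECONDS:
        return "firing"
    return "inactive"


if __name__ == "__main__":
    one_hour = 3600.0
    print(state_with_for_clause(one_hour, 0.0))
    print(state_with_start_time(one_hour, 0.0, active_pods=1))
    print(state_with_start_time(THRESHOLD_SECONDS + 1.0, 0.0, active_pods=1))
```

In this model the 4.11-style rule can survive a Prometheus restart without losing elapsed time, because the start time comes from the job object rather than from rule-engine state, which matches the reasoning given for dropping the `for` clause.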