Since it landed in 2018, the expr has been basically [1]:

  kube_job_spec_completions - kube_job_status_succeeded

However, there are more job states [2]:

* kube_job_status_active
* kube_job_status_succeeded
* kube_job_status_failed
* suspended? Beta in v1.22 [3]. Doesn't seem to have kube-state-metrics handling. Possibly rolled in with "active".
* aborted? E.g. if the parent CronJob has 'concurrencyPolicy: Replace' [4]. Possibly rolled in with "failed"?

Poking at a CI run [5] in PromeCIeus:

  max by (namespace,job_name) (kube_job_status_start_time{job_name=~"ip-reconciler-.*"})

gives:

  {job_name="ip-reconciler-27427395", namespace="openshift-multus"} 1645643712 (19:15)

for pretty much the life of the cluster, with a small gap while the Prom that was gathered gets rolled late in the update. Over that same time range, the other metrics for this job were:

  kube_job_status_active{job_name=~"ip-reconciler-.*"} -> 0
  kube_job_status_succeeded{job_name=~"ip-reconciler-.*"} -> 0
  kube_job_status_failed{job_name=~"ip-reconciler-.*"} -> 1

which looks a lot more like "job failed" to me than the nominal "taking more than 12 hours to complete" [6]. And indeed:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1496559844449062912/artifacts/e2e-aws-upgrade/gather-must-gather/artifacts/must-gather.tar | tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-5ed35298ab46f53dcc1a6c68452a87245e88da245f1b9bfb0b60b00e11fc0a93/namespaces/openshift-multus/batch/jobs.yaml | yaml2json | jq -r '.items[].status'
  {
    "conditions": [
      {
        "lastProbeTime": "2022-02-23T19:17:10Z",
        "lastTransitionTime": "2022-02-23T19:17:10Z",
        "message": "Job has reached the specified backoff limit",
        "reason": "BackoffLimitExceeded",
        "status": "True",
        "type": "Failed"
      }
    ],
    "failed": 1,
    "startTime": "2022-02-23T19:15:12Z",
    "uncountedTerminatedPods": {}
  }

Looking at events:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1496559844449062912/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.involvedObject.name // "" | contains("ip-reconciler"))] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + .involvedObject.name + " " + .reason + ": " + .message'
  2022-02-23T19:15:12Z ip-reconciler-27427395-lkxwz FailedScheduling: 0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  ...
  2022-02-23T19:16:51Z ip-reconciler-27427395-lkxwz Scheduled: Successfully assigned openshift-multus/ip-reconciler-27427395-lkxwz to ip-10-0-190-112.us-west-1.compute.internal
  ...
  2022-02-23T19:17:00Z ip-reconciler-27427395-lkxwz Started: Started container whereabouts
  2022-02-23T19:17:10Z ip-reconciler-27427395 BackoffLimitExceeded: Job has reached the specified backoff limit
  2022-02-23T19:17:10Z ip-reconciler SawCompletedJob: Saw completed job: ip-reconciler-27427395, status: Failed
  2022-02-23T19:30:00Z ip-reconciler-27427410-6b4j9 Scheduled: Successfully assigned openshift-multus/ip-reconciler-27427410-6b4j9 to ip-10-0-167-184.us-west-1.compute.internal
  2022-02-23T19:30:01Z ip-reconciler-27427410 Completed: Job completed
  2022-02-23T19:30:01Z ip-reconciler SawCompletedJob: Saw completed job: ip-reconciler-27427410, status: Complete
  ...
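For reference, the legacy expr [1] evaluated over that same window (a sketch I have not re-run against that Prom instance, assuming this Job reports kube_job_spec_completions as 1) would still be non-zero for the dead, failed job:

  kube_job_spec_completions{namespace="openshift-multus",job_name=~"ip-reconciler-.*"}
    - kube_job_status_succeeded{namespace="openshift-multus",job_name=~"ip-reconciler-.*"}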
So the CronJob launched one job early in install which failed, subsequent jobs succeeded, the failed job stuck around for inspection (although apparently the pod has not?), and the alert is pending to eventually fire on that old, failed job. Possibly something like:

  time() - max by (namespace,job_name) (
    kube_job_status_start_time
    * on (namespace,job_name) group_left (blah)
    group by (namespace,job_name) (kube_job_status_active > 0)
  ) > 12*60*60

and then use the value to say "this job has been active for $DURATION"? (A sketch of that annotation wiring follows after the footnotes.) With that approach, the duration would be served out of kube-state-metrics, and you wouldn't need to use 'for' and rely on 12h of gapless Prom data. I think the expr weakness was somewhat papered over in 4.6's bug 1859793, because if any one job still has metrics at all after 12h, it's probably not in a happy place.

Other random nits:

* The runbook has a stale 1h entry [7], while the alert itself has been bumped to 12h [6] via bug 1859793.
* KubeJobCompletion sounds like "hooray!", not "oh, no!". I would have expected KubeJobIncomplete or KubeJobSlow or something for "taking more than $DURATION to complete".

[1]: https://github.com/kubernetes-monitoring/kubernetes-mixin/commit/55214b1f02f4a646002a827f522ece47debd41e4#diff-c511c672e0359c4c2fbe255e8cc894caf1c05b8f3a0db74edc1c38e797e23afcR152
[2]: https://github.com/kubernetes/kube-state-metrics/blob/5db4c85d4e1b4655da68a3f3f882a6a0bcabde63/docs/job-metrics.md
[3]: https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job
[4]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.23/#cronjobspec-v1-batch
[5]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1496559844449062912
[6]: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/177bc8ec789fa049a9585713d232035b159f8c92/alerts/apps_alerts.libsonnet#L247
[7]: https://github.com/kubernetes-monitoring/kubernetes-mixin/blame/177bc8ec789fa049a9585713d232035b159f8c92/runbook.md#L61
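Circling back to the proposed expr, a minimal sketch of how its value could drive the alert text (not the shipped rule: KubeJobSlow is just the strawman name from the nit above, humanizeDuration is a stock Prometheus template function, the namespace/job matchers are elided, and 'and' stands in for the group_left join for brevity):

  - alert: KubeJobSlow
    annotations:
      description: Job {{ $labels.namespace }}/{{ $labels.job_name }} has been active for {{ $value | humanizeDuration }}.
      summary: Job did not complete in time
    expr: |
      time() - max by (namespace, job_name) (
        kube_job_status_start_time
          and
        kube_job_status_active > 0
      ) > 12*60*60
    labels:
      severity: warning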
4.10 has KubeJobCompletion:

  - alert: KubeJobCompletion
    annotations:
      description: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than 12 hours to complete.
      summary: Job did not complete in time
    expr: |
      kube_job_spec_completions{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"} - kube_job_status_succeeded{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"} > 0
    for: 12h
    labels:
      severity: warning

In 4.11.0-0.nightly-2022-04-12-072444, KubeJobCompletion has been renamed to KubeJobNotCompleted, with the expression below:

  - alert: KubeJobNotCompleted
    annotations:
      description: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than {{ "43200" | humanizeDuration }} to complete.
      summary: Job did not complete in time
    expr: |
      time() - max by(namespace, job_name) (kube_job_status_start_time{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"} and kube_job_status_active{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"} > 0) > 43200
    labels:
      severity: warning

With such an expr, the alert will never be in Pending status; we can only see it firing after 12 hours. The expr is fine, but I think keeping the `for` setting would be better. WDYT, @Arunprasad Rajkumar?
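A minimal sketch of that `for`-keeping variant, i.e. the same 4.11 expr with a short `for` added so the alert passes through Pending before firing (the 15m here is an arbitrary illustration, not a tested value):

  - alert: KubeJobNotCompleted
    annotations:
      description: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than {{ "43200" | humanizeDuration }} to complete.
      summary: Job did not complete in time
    expr: |
      time() - max by(namespace, job_name) (kube_job_status_start_time{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"} and kube_job_status_active{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"} > 0) > 43200
    for: 15m
    labels:
      severity: warning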
@juzhao, thanks for testing this. This alert expression simply relies on the job start time value from k8s/etcd instead of Prometheus keeping track of it. This was discussed in https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/746#discussion_r818861718, and it was decided to remove the `for` clause.
Based on Comments 4 - 6, setting to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069