Bug 2057967 - KubeJobCompletion does not account for possible job states
Summary: KubeJobCompletion does not account for possible job states
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-02-24 09:01 UTC by W. Trevor King
Modified: 2022-08-10 10:51 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 10:51:15 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github kubernetes-monitoring kubernetes-mixin pull 746 0 None Merged Rename KubeJobCompletion to KubeJobNotCompleted and fix alert expression 2022-04-14 21:25:51 UTC
Github openshift cluster-monitoring-operator pull 1628 0 None Merged Bug 2057967: [bot] Update jsonnet dependencies 2022-04-14 21:30:07 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:51:34 UTC

Description W. Trevor King 2022-02-24 09:01:11 UTC
Since it landed in 2018, the expr has been basically [1]:

  kube_job_spec_completions - kube_job_status_succeeded

However, there are more job states [2]:

* kube_job_status_active
* kube_job_status_succeeded
* kube_job_status_failed
* suspended?  Beta in v1.22 [3].  Doesn't seem to have kube-state-metrics handling.  Possibly rolled in with "active".
* aborted?  E.g. if the parent CronJob has 'concurrencyPolicy: Replace' [4].  Possibly rolled in with "failed"?
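
To make the gap concrete (a hedged sketch of mine, not part of the mixin): because the expr only subtracts succeeded from spec completions, a job that ends up in the "failed" bucket keeps the expression positive indefinitely.  One crude way to surface (or exclude) those cases would be something like:

  (kube_job_spec_completions - kube_job_status_succeeded > 0)
  unless on (namespace, job_name)
  (kube_job_status_failed > 0)

i.e. drop series for jobs that already report failed pods.  That's only an illustration of the gap, not a proposed fix.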

Poking at a CI run [5] in PromeCIeus:

  max by (namespace,job_name) (kube_job_status_start_time{job_name=~"ip-reconciler-.*"})

gives:

  {job_name="ip-reconciler-27427395", namespace="openshift-multus"}  1645643712 (19:15)

for pretty much the life of the cluster, with a small gap while the Prometheus pod whose data was gathered gets rolled late in the update.

Over that same time range, the other metrics for this job were:

  kube_job_status_active{job_name=~"ip-reconciler-.*"} -> 0
  kube_job_status_succeeded{job_name=~"ip-reconciler-.*"} -> 0
  kube_job_status_failed{job_name=~"ip-reconciler-.*"} -> 1

which looks a lot more like "job failed" to me than the nominal "taking more than 12 hours to complete" [6].  And indeed:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1496559844449062912/artifacts/e2e-aws-upgrade/gather-must-gather/artifacts/must-gather.tar | tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-5ed35298ab46f53dcc1a6c68452a87245e88da245f1b9bfb0b60b00e11fc0a93/namespaces/openshift-multus/batch/jobs.yaml | yaml2json  | jq -r '.items[].status'
  {
    "conditions": [
      {
        "lastProbeTime": "2022-02-23T19:17:10Z",
        "lastTransitionTime": "2022-02-23T19:17:10Z",
        "message": "Job has reached the specified backoff limit",
        "reason": "BackoffLimitExceeded",
        "status": "True",
        "type": "Failed"
      }
    ],
    "failed": 1,
    "startTime": "2022-02-23T19:15:12Z",
    "uncountedTerminatedPods": {}
  }

Looking at events:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1496559844449062912/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.involvedObject.name // "" | contains("ip-reconciler"))] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + .involvedObject.name + " " + .reason + ": " + .message'
  2022-02-23T19:15:12Z ip-reconciler-27427395-lkxwz FailedScheduling: 0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  ...
  2022-02-23T19:16:51Z ip-reconciler-27427395-lkxwz Scheduled: Successfully assigned openshift-multus/ip-reconciler-27427395-lkxwz to ip-10-0-190-112.us-west-1.compute.internal
  ...
  2022-02-23T19:17:00Z ip-reconciler-27427395-lkxwz Started: Started container whereabouts
  2022-02-23T19:17:10Z ip-reconciler-27427395 BackoffLimitExceeded: Job has reached the specified backoff limit
  2022-02-23T19:17:10Z ip-reconciler SawCompletedJob: Saw completed job: ip-reconciler-27427395, status: Failed
  2022-02-23T19:30:00Z ip-reconciler-27427410-6b4j9 Scheduled: Successfully assigned openshift-multus/ip-reconciler-27427410-6b4j9 to ip-10-0-167-184.us-west-1.compute.internal
  2022-02-23T19:30:01Z ip-reconciler-27427410 Completed: Job completed
  2022-02-23T19:30:01Z ip-reconciler SawCompletedJob: Saw completed job: ip-reconciler-27427410, status: Complete
  ...

So the CronJob launched one job early in the install which failed, subsequent jobs succeeded, the failed job stuck around for inspection (although apparently its pod has not?), and the alert is pending, eventually to fire on that old, failed job.

Possibly something like:

  time() - max by (namespace,job_name) (
    kube_job_status_start_time
    * on (namespace,job_name) group_left (blah)
    group by (namespace,job_name) (kube_job_status_active > 0)
  ) > 12*60*60

and then use the value to say "this job has been active for $DURATION"?  With that approach, the duration would be served out of kube-state-metrics, and you wouldn't need to use 'for' and rely on 12h of gapless Prom data.
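
For illustration only (my sketch, not necessarily what upstream would adopt), the group_left placeholder could be avoided with a plain set match, keeping start times only for jobs that still report active pods:

  time() - max by (namespace,job_name) (
    kube_job_status_start_time
    and on (namespace,job_name)
    (kube_job_status_active > 0)
  ) > 12*60*60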

I think the expr weakness was somewhat papered over in 4.6's bug 1859793, because if any one job still has metrics at all after 12h, it's probably not in a happy place.

Other random nits:

* The runbook has a stale 1h entry [7], while the alert itself has been bumped to 12h [6] (bug 1859793).
* KubeJobCompletion sounds like "hooray!", not "oh, no!".  I would have expected KubeJobIncomplete or KubeJobSlow or something for "taking more than $DURATION to complete".

[1]: https://github.com/kubernetes-monitoring/kubernetes-mixin/commit/55214b1f02f4a646002a827f522ece47debd41e4#diff-c511c672e0359c4c2fbe255e8cc894caf1c05b8f3a0db74edc1c38e797e23afcR152
[2]: https://github.com/kubernetes/kube-state-metrics/blob/5db4c85d4e1b4655da68a3f3f882a6a0bcabde63/docs/job-metrics.md
[3]: https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job
[4]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.23/#cronjobspec-v1-batch
[5]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1496559844449062912
[6]: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/177bc8ec789fa049a9585713d232035b159f8c92/alerts/apps_alerts.libsonnet#L247
[7]: https://github.com/kubernetes-monitoring/kubernetes-mixin/blame/177bc8ec789fa049a9585713d232035b159f8c92/runbook.md#L61

Comment 4 Junqi Zhao 2022-04-13 12:13:39 UTC
In 4.10, the alert is KubeJobCompletion:
    - alert: KubeJobCompletion
      annotations:
        description: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking
          more than 12 hours to complete.
        summary: Job did not complete in time
      expr: |
        kube_job_spec_completions{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"} - kube_job_status_succeeded{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"}  > 0
      for: 12h
      labels:
        severity: warning

In 4.11.0-0.nightly-2022-04-12-072444, KubeJobCompletion is renamed to KubeJobNotCompleted; the expression is below:
        - alert: KubeJobNotCompleted
          annotations:
            description: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more
              than {{ "43200" | humanizeDuration }} to complete.
            summary: Job did not complete in time
          expr: |
            time() - max by(namespace, job_name) (kube_job_status_start_time{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"}
              and
            kube_job_status_active{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"} > 0) > 43200
          labels:
            severity: warning

With such an expr the alert will never be in Pending status; we can only see it firing after 12 hours. The expr is fine, but I think keeping the `for` setting would be better.
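
A hedged sketch of what I mean (the `for` value is just an example, not what ships), using a trimmed-down version of the expression above plus a short `for` so the alert gets a Pending phase again before firing:

          expr: |
            time() - max by(namespace, job_name) (kube_job_status_start_time and kube_job_status_active > 0) > 43200
          for: 15m  # example value only
          labels:
            severity: warning
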
WDYT, @Arunprasad Rajkumar

Comment 6 Arunprasad Rajkumar 2022-04-14 06:37:23 UTC
@juzhao, thanks for testing this. The alert expression now simply relies on the job start time value from k8s/etcd instead of Prometheus keeping track of it. This was discussed in https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/746#discussion_r818861718 and the decision was to remove the `for` clause.

Comment 7 Junqi Zhao 2022-04-14 07:19:13 UTC
Based on comments 4 - 6, setting to VERIFIED.

Comment 12 errata-xmlrpc 2022-08-10 10:51:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

