Bug 2057967

Summary: KubeJobCompletion does not account for possible job states
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Monitoring
Assignee: Simon Pasquier <spasquie>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Docs Contact:
Priority: low
Version: 4.6
CC: amuller, anpicker, aos-bugs, juzhao, kgordeev, spasquie
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-08-10 10:51:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description W. Trevor King 2022-02-24 09:01:11 UTC
Since it landed in 2018, the expr has been basically [1]:

  kube_job_spec_completions - kube_job_status_succeeded

However, there are more job states [2]:

* kube_job_status_active
* kube_job_status_succeeded
* kube_job_status_failed
* suspended?  Beta in v1.22 [3].  Doesn't seem to have kube-state-metrics handling.  Possibly rolled in with "active".
* aborted?  E.g. if the parent CronJob has 'concurrencyPolicy: Replace' [4].  Possibly rolled in with "failed"?
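
For illustration, a minimal sketch (with a hypothetical job name) of how that expression behaves once a Job fails outright: kube_job_status_succeeded never catches up to kube_job_spec_completions, so the difference stays positive for as long as the failed Job object and its metrics exist, even though nothing is running any more:

  # Simplified form of the current alert expression; for a completions: 1 Job
  # that failed, kube_job_spec_completions == 1 and kube_job_status_succeeded == 0
  # indefinitely, so this stays > 0 with nothing left active.
  kube_job_spec_completions{job_name="example-failed-job"}
    - kube_job_status_succeeded{job_name="example-failed-job"}
  > 0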

Poking at a CI run [5] in PromeCIeus:

  max by (namespace,job_name) (kube_job_status_start_time{job_name=~"ip-reconciler-.*"})

gives:

  {job_name="ip-reconciler-27427395", namespace="openshift-multus"}  1645643712 (19:15)

for pretty much the life of the cluster, with a small gap late in the update while the Prometheus whose data was gathered was being rolled.

Over that same time range, the other metrics for this job were:

  kube_job_status_active{job_name=~"ip-reconciler-.*"} -> 0
  kube_job_status_succeeded{job_name=~"ip-reconciler-.*"} -> 0
  kube_job_status_failed{job_name=~"ip-reconciler-.*"} -> 1

which looks a lot more like "job failed" to me than the nominal "taking more than 12 hours to complete" [6].  And indeed:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1496559844449062912/artifacts/e2e-aws-upgrade/gather-must-gather/artifacts/must-gather.tar | tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-5ed35298ab46f53dcc1a6c68452a87245e88da245f1b9bfb0b60b00e11fc0a93/namespaces/openshift-multus/batch/jobs.yaml | yaml2json  | jq -r '.items[].status'
  {
    "conditions": [
      {
        "lastProbeTime": "2022-02-23T19:17:10Z",
        "lastTransitionTime": "2022-02-23T19:17:10Z",
        "message": "Job has reached the specified backoff limit",
        "reason": "BackoffLimitExceeded",
        "status": "True",
        "type": "Failed"
      }
    ],
    "failed": 1,
    "startTime": "2022-02-23T19:15:12Z",
    "uncountedTerminatedPods": {}
  }

Looking at events:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1496559844449062912/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.involvedObject.name // "" | contains("ip-reconciler"))] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + .involvedObject.name + " " + .reason + ": " + .message'
  2022-02-23T19:15:12Z ip-reconciler-27427395-lkxwz FailedScheduling: 0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  ...
  2022-02-23T19:16:51Z ip-reconciler-27427395-lkxwz Scheduled: Successfully assigned openshift-multus/ip-reconciler-27427395-lkxwz to ip-10-0-190-112.us-west-1.compute.internal
  ...
  2022-02-23T19:17:00Z ip-reconciler-27427395-lkxwz Started: Started container whereabouts
  2022-02-23T19:17:10Z ip-reconciler-27427395 BackoffLimitExceeded: Job has reached the specified backoff limit
  2022-02-23T19:17:10Z ip-reconciler SawCompletedJob: Saw completed job: ip-reconciler-27427395, status: Failed
  2022-02-23T19:30:00Z ip-reconciler-27427410-6b4j9 Scheduled: Successfully assigned openshift-multus/ip-reconciler-27427410-6b4j9 to ip-10-0-167-184.us-west-1.compute.internal
  2022-02-23T19:30:01Z ip-reconciler-27427410 Completed: Job completed
  2022-02-23T19:30:01Z ip-reconciler SawCompletedJob: Saw completed job: ip-reconciler-27427410, status: Complete
  ...

So the CronJob launched one job early in the install, which failed; subsequent jobs succeeded; the failed job stuck around for inspection (although apparently its pod has not?); and the alert is pending, waiting to eventually fire on that old, failed job.

Possibly something like:

  time() - max by (namespace,job_name) (
    kube_job_status_start_time
    * on (namespace,job_name) group_left (blah)
    group by (namespace,job_name) (kube_job_status_active > 0)
  ) > 12*60*60

and then use the value to say "this job has been active for $DURATION"?  With that approach, the duration would be served out of kube-state-metrics, and you wouldn't need to use 'for' and rely on 12h of gapless Prom data.
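
One way to write that idea down without the placeholder grouping label might be the following sketch (not a tested rule), using 'and' to restrict the start times to Jobs that still have active pods:

  # Sketch: seconds each still-active Job has been running, taken from
  # kube-state-metrics' start-time metric instead of Prometheus 'for' state.
  time() - max by (namespace, job_name) (
      kube_job_status_start_time
    and
      kube_job_status_active > 0
  ) > 12*60*60

The value of the expression is then the number of seconds the Job has been active, which the alert description could render as a human-readable duration.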

I think the expr weakness was somewhat papered over in 4.6's bug 1859793, because if any one job still has metrics at all after 12h, it's probably not in a happy place.

Other random nits:

* Runbook has a stale 1h entry [7], while the alert itself has been bumped to 12h [6] (bug 1859793).
* KubeJobCompletion sounds like "hooray!", not "oh, no!".  I would have expected KubeJobIncomplete or KubeJobSlow or something for "taking more than $DURATION to complete".

[1]: https://github.com/kubernetes-monitoring/kubernetes-mixin/commit/55214b1f02f4a646002a827f522ece47debd41e4#diff-c511c672e0359c4c2fbe255e8cc894caf1c05b8f3a0db74edc1c38e797e23afcR152
[2]: https://github.com/kubernetes/kube-state-metrics/blob/5db4c85d4e1b4655da68a3f3f882a6a0bcabde63/docs/job-metrics.md
[3]: https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job
[4]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.23/#cronjobspec-v1-batch
[5]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1496559844449062912
[6]: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/177bc8ec789fa049a9585713d232035b159f8c92/alerts/apps_alerts.libsonnet#L247
[7]: https://github.com/kubernetes-monitoring/kubernetes-mixin/blame/177bc8ec789fa049a9585713d232035b159f8c92/runbook.md#L61

Comment 4 Junqi Zhao 2022-04-13 12:13:39 UTC
In 4.10 the alert is KubeJobCompletion:
    - alert: KubeJobCompletion
      annotations:
        description: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking
          more than 12 hours to complete.
        summary: Job did not complete in time
      expr: |
        kube_job_spec_completions{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"} - kube_job_status_succeeded{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"}  > 0
      for: 12h
      labels:
        severity: warning

In 4.11.0-0.nightly-2022-04-12-072444, KubeJobCompletion is renamed to KubeJobNotCompleted; the expression is below:
        - alert: KubeJobNotCompleted
          annotations:
            description: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more
              than {{ "43200" | humanizeDuration }} to complete.
            summary: Job did not complete in time
          expr: |
            time() - max by(namespace, job_name) (kube_job_status_start_time{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"}
              and
            kube_job_status_active{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"} > 0) > 43200
          labels:
            severity: warning

With such an expr, the alert will never be in Pending status; we can only see it firing after 12 hours. The expr is fine, but I think it would be better with the `for` setting.
WDYT, @Arunprasad Rajkumar

Comment 6 Arunprasad Rajkumar 2022-04-14 06:37:23 UTC
@juzhao, thanks for testing this. The alert expression simply relies on the job start time value from k8s/etcd instead of Prometheus keeping track of it. This was discussed in https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/746#discussion_r818861718 and it was decided to remove the `for` clause.
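
To illustrate the reasoning, a minimal sketch with a hypothetical job name: the elapsed time is recomputed from the Job object's startTime on every rule evaluation, so even a freshly restarted Prometheus immediately sees the full duration, whereas a `for: 12h` clause would need 12 hours of uninterrupted pending state before firing:

  # The left-hand side is already the full elapsed time, taken from the Job's
  # .status.startTime via kube-state-metrics, not from Prometheus history.
  time() - kube_job_status_start_time{job_name="example-job"} > 43200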

Comment 7 Junqi Zhao 2022-04-14 07:19:13 UTC
Based on Comments 4-6, set to VERIFIED.

Comment 12 errata-xmlrpc 2022-08-10 10:51:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069