Bug 1794823 - Jobs should not be reconciled
Summary: Jobs should not be reconciled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.4.0
Assignee: Jack Ottofaro
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-01-24 19:12 UTC by Jesus M. Rodriguez
Modified: 2020-05-04 11:27 UTC
CC: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The CVO no longer attempts to change a Job's immutable spec.selector; Jobs cannot be modified after creation.
Clone Of:
Environment:
Last Closed: 2020-05-04 11:26:40 UTC
Target Upstream Version:
Embargoed:


Attachments
cluster-version-operator.log (1.80 MB, text/plain)
2020-01-24 19:12 UTC, Jesus M. Rodriguez
no flags Details
event data from the CI run (4.07 MB, text/plain)
2020-01-24 19:43 UTC, Jesus M. Rodriguez
no flags Details
full build-log.txt from CI run (5.42 KB, text/plain)
2020-01-24 19:49 UTC, Jesus M. Rodriguez
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 312 0 None closed Bug 1794823: lib/resourcemerge: Do not attempt to change Job's immutable spec.selector 2020-12-13 09:20:42 UTC
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-04 11:27:05 UTC

Description Jesus M. Rodriguez 2020-01-24 19:12:35 UTC
Created attachment 1655126 [details]
cluster-version-operator.log

Description of problem:
During installation, Jobs are being reconciled instead of being killed and rerun.

I0124 04:14:59.014889       1 sync_worker.go:621] Running sync for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536)
E0124 04:14:59.063107       1 task.go:81] error running apply for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536): Job.batch "openshift-service-catalog-apiserver-remover" is invalid: [spec.selector: Required value, spec.template.metadata.labels: Invalid value: map[string]string{"controller-uid":"a6d7c62d-63a0-46f2-a99b-77580f70ab4f", "job-name":"openshift-service-catalog-apiserver-remover"}: `selector` does not match template `labels`, spec.selector: Invalid value: "null": field is immutable]

Comment 1 Jesus M. Rodriguez 2020-01-24 19:43:38 UTC
Created attachment 1655140 [details]
event data from the CI run

Comment 2 W. Trevor King 2020-01-24 19:45:09 UTC
Excerpts from the CVO logs:

I0124 04:12:48.430317       1 sync_worker.go:621] Running sync for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536)
...
E0124 04:13:02.948359       1 task.go:81] error running apply for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536): timed out waiting for the condition
...
I0124 04:13:02.949247       1 task_graph.go:596] Result of work: [Could not update job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536)]
I0124 04:13:02.949266       1 sync_worker.go:783] Summarizing 1 errors
I0124 04:13:02.949275       1 sync_worker.go:787] Update error 474 of 536: UpdatePayloadFailed Could not update job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536) (*errors.errorString: timed out waiting for the condition)
...
I0124 04:14:59.014889       1 sync_worker.go:621] Running sync for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536)
E0124 04:14:59.063107       1 task.go:81] error running apply for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536): Job.batch "openshift-service-catalog-apiserver-remover" is invalid: [spec.selector: Required value, spec.template.metadata.labels: Invalid value: map[string]string{"controller-uid":"a6d7c62d-63a0-46f2-a99b-77580f70ab4f", "job-name":"openshift-service-catalog-apiserver-remover"}: `selector` does not match template `labels`, spec.selector: Invalid value: "null": field is immutable]
...

So yeah, we launch the job, it takes a while, we cycle back in the next round and fail to pick up the existing Job cleanly.
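
For reference, a minimal sketch (assumed and simplified, not the exact lib/resourcemerge/batch.go code) of the merge logic that trips here: a whole-spec comparison drags the server-populated, immutable spec.selector and the controller-uid pod-template label into the diff, so the already-created Job is always flagged as modified, and the resulting update is what the API server rejects:

    package resourcemerge

    import (
        batchv1 "k8s.io/api/batch/v1"
        "k8s.io/apimachinery/pkg/api/equality"
    )

    // Sketch of the pre-fix behavior (illustrative, not the real function):
    // the manifest leaves spec.selector unset, the API server defaults it at
    // creation, and any later whole-spec diff sees that as a change to push.
    func ensureJobSpecPreFix(modified *bool, existing *batchv1.Job, required batchv1.Job) {
        if !equality.Semantic.DeepEqual(existing.Spec, required.Spec) {
            *modified = true
            existing.Spec = required.Spec // clears the defaulted selector -> "field is immutable"
        }
    }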

Comment 3 Jesus M. Rodriguez 2020-01-24 19:49:57 UTC
Created attachment 1655141 [details]
full build-log.txt from CI run

Comment 4 W. Trevor King 2020-01-30 19:21:02 UTC
The issue is that at least [1] is setting 'modified' for this case, but apparently you aren't allowed to modify Jobs [citation-needed].  Ideally we'd recognize that the existing Job was fine and just watch it again instead of trying to modify it.  The worst-case hack would be to delete the previous Job and create a new one (which would hopefully pick up where its predecessor left off and exit before the ~4m timeout on the Job within the single CVO-manifest-application cycle).

[1]: https://github.com/openshift/cluster-version-operator/blob/54faf6fad0d4dfa7c2a7953076f608d018577fd1/lib/resourcemerge/batch.go#L11-L20
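
For what it's worth, a minimal sketch of the direction the eventual fix takes (PR 312, "lib/resourcemerge: Do not attempt to change Job's immutable spec.selector"); the function name and details below are illustrative, not the actual diff:

    package resourcemerge

    import (
        batchv1 "k8s.io/api/batch/v1"
        "k8s.io/apimachinery/pkg/api/equality"
    )

    // Carry the server-populated, immutable fields from the live Job over into
    // the desired spec before diffing, so an already-created Job is not flagged
    // as modified just because the manifest leaves spec.selector unset.
    func ensureJob(modified *bool, existing *batchv1.Job, required batchv1.Job) {
        required.Spec.Selector = existing.Spec.Selector // defaulted to match controller-uid at creation
        required.Spec.Template.Labels = existing.Spec.Template.Labels

        if !equality.Semantic.DeepEqual(existing.Spec, required.Spec) {
            *modified = true
            existing.Spec = required.Spec
        }
    }

With an approach like this, an existing Job whose pod template matches the manifest is left alone and simply watched for completion, which is the "recognize that the existing Job was fine" option above.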

Comment 5 W. Trevor King 2020-01-30 19:32:48 UTC
The Job manifest is [1], from [2].  The failed CI run is [3].  [4] has:

    2020-01-24T04:12:48.430341514Z I0124 04:12:48.430317       1 sync_worker.go:621] Running sync for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536)
    ...
    2020-01-24T04:13:02.948410091Z E0124 04:13:02.948359       1 task.go:81] error running apply for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536): timed out waiting for the condition
    ...
    2020-01-24T04:13:02.949262218Z I0124 04:13:02.949247       1 task_graph.go:596] Result of work: [Could not update job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536)]
    2020-01-24T04:13:02.949276199Z I0124 04:13:02.949266       1 sync_worker.go:783] Summarizing 1 errors
    2020-01-24T04:13:02.949292104Z I0124 04:13:02.949275       1 sync_worker.go:787] Update error 474 of 536: UpdatePayloadFailed Could not update job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536) (*errors.errorString: timed out waiting for the condition)
    2020-01-24T04:13:02.94934611Z E0124 04:13:02.949303       1 sync_worker.go:329] unable to synchronize image (waiting 1m26.262851224s): Could not update job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536)
    ...
    2020-01-24T04:14:59.014897431Z I0124 04:14:59.014889       1 sync_worker.go:621] Running sync for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536)
    2020-01-24T04:14:59.063169169Z E0124 04:14:59.063107       1 task.go:81] error running apply for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536): Job.batch "openshift-service-catalog-apiserver-remover" is invalid: [spec.selector: Required value, spec.template.metadata.labels: Invalid value: map[string]string{"controller-uid":"a6d7c62d-63a0-46f2-a99b-77580f70ab4f", "job-name":"openshift-service-catalog-apiserver-remover"}: `selector` does not match template `labels`, spec.selector: Invalid value: "null": field is immutable]

must-gather does not collect the live Job object, because it's not in a namespace referenced by a ClusterOperator, but we have the Pod created for the Job:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-svcat-apiserver-operator/74/pull-ci-openshift-cluster-svcat-apiserver-operator-master-e2e-aws/218/artifacts/e2e-aws/pods.json | jq '.items[] | select(.metadata.generateName == "openshift-service-catalog-apiserver-remover-").metadata'
{
  "annotations": {
    "k8s.v1.cni.cncf.io/networks-status": "",
    "openshift.io/scc": "anyuid"
  },
  "creationTimestamp": "2020-01-24T04:12:56Z",
  "generateName": "openshift-service-catalog-apiserver-remover-",
  "labels": {
    "controller-uid": "a6d7c62d-63a0-46f2-a99b-77580f70ab4f",
    "job-name": "openshift-service-catalog-apiserver-remover"
  },
  "name": "openshift-service-catalog-apiserver-remover-lwtrl",
  "namespace": "openshift-service-catalog-removed",
  "ownerReferences": [
    {
      "apiVersion": "batch/v1",
      "blockOwnerDeletion": true,
      "controller": true,
      "kind": "Job",
      "name": "openshift-service-catalog-apiserver-remover",
      "uid": "a6d7c62d-63a0-46f2-a99b-77580f70ab4f"
    }
  ],
  "resourceVersion": "18251",
  "selfLink": "/api/v1/namespaces/openshift-service-catalog-removed/pods/openshift-service-catalog-apiserver-remover-lwtrl",
  "uid": "46520d4f-4971-4e79-8da9-925b4605f465"
}

[1]: https://github.com/openshift/cluster-svcat-apiserver-operator/blob/c25b01e65b58a3e11ab2712664c7a0a6ad52fa9b/manifests/0000_90_cluster-svcat-apiserver-operator_01_remover_job.yaml
[2]: https://github.com/openshift/cluster-svcat-apiserver-operator/pull/74
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-svcat-apiserver-operator/74/pull-ci-openshift-cluster-svcat-apiserver-operator-master-e2e-aws/218
[4]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-svcat-apiserver-operator/74/pull-ci-openshift-cluster-svcat-apiserver-operator-master-e2e-aws/218/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ci-op-di0k4j40-stable-sha256-dd03a9e0ac7b6e710037d8f0d0a5001b47c27bf859e49b0256e8ac54f5bd8198/namespaces/openshift-cluster-version/pods/cluster-version-operator-7f7765db6f-gw7gh/cluster-version-operator/cluster-version-operator/logs/current.log

Comment 7 liujia 2020-02-11 06:39:45 UTC
@Jesus M. Rodriguez 
After going through the comments, I think that to verify this bug we need to re-run the e2e job based on pr74 [1] and check whether the fix from the CVO [2] works, right? But I found the latest passing CI job [3] was from pr76. So could you show me how QE can rerun the e2e test against [1] to verify the bug?

[1] https://github.com/openshift/cluster-svcat-apiserver-operator/pull/74
[2] https://github.com/openshift/cluster-version-operator/pull/312
[3] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-svcat-apiserver-operator/76/pull-ci-openshift-cluster-svcat-apiserver-operator-master-e2e-aws/221

Comment 8 W. Trevor King 2020-02-16 06:33:44 UTC
Looks like PR 74 needs a rebase [1].

[1]: https://github.com/openshift/cluster-svcat-apiserver-operator/pull/74#event-3037296216

Comment 9 liujia 2020-03-09 07:40:55 UTC
I checked several recent CI jobs against pr74 and did not find the original error.

For example, from [1] and [2]:
2020-02-20T02:43:40.010528266Z I0220 02:43:40.010504       1 sync_worker.go:621] Running sync for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (510 of 586)
2020-02-20T02:43:40.15841562Z I0220 02:43:40.158336       1 sync_worker.go:634] Done syncing for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (510 of 586)

So, marking the bug as verified.

[1] https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-svcat-apiserver-operator/74/pull-ci-openshift-cluster-svcat-apiserver-operator-master-e2e-aws/224/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ci-op-1qgpi2k9-stable-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/namespaces/openshift-cluster-version/pods/cluster-version-operator-5f9c5fbd57-r2xwq/cluster-version-operator/cluster-version-operator/logs/current.log

[2] https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-svcat-apiserver-operator/74/pull-ci-openshift-cluster-svcat-apiserver-operator-master-e2e-aws/229/artifacts/e2e-aws/pods/openshift-cluster-version_cluster-version-operator-85d545c4b9-rjtlk_cluster-version-operator.log

Comment 10 Jesus M. Rodriguez 2020-03-24 03:41:15 UTC
Dropping needinfo; the bug was verified in comment #9.

Comment 12 errata-xmlrpc 2020-05-04 11:26:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

