Created attachment 1655126 [details]
cluster-version-operator.log

Description of problem:

During installation, the CVO reconciles (updates in place) an already-created Job instead of deleting and rerunning it, and the update is rejected because Job selectors are immutable:

I0124 04:14:59.014889 1 sync_worker.go:621] Running sync for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536)
E0124 04:14:59.063107 1 task.go:81] error running apply for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536): Job.batch "openshift-service-catalog-apiserver-remover" is invalid: [spec.selector: Required value, spec.template.metadata.labels: Invalid value: map[string]string{"controller-uid":"a6d7c62d-63a0-46f2-a99b-77580f70ab4f", "job-name":"openshift-service-catalog-apiserver-remover"}: `selector` does not match template `labels`, spec.selector: Invalid value: "null": field is immutable]
Created attachment 1655140 [details]
event data from the CI run
Excerpts from the CVO logs:

I0124 04:12:48.430317 1 sync_worker.go:621] Running sync for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536)
...
E0124 04:13:02.948359 1 task.go:81] error running apply for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536): timed out waiting for the condition
...
I0124 04:13:02.949247 1 task_graph.go:596] Result of work: [Could not update job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536)]
I0124 04:13:02.949266 1 sync_worker.go:783] Summarizing 1 errors
I0124 04:13:02.949275 1 sync_worker.go:787] Update error 474 of 536: UpdatePayloadFailed Could not update job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536) (*errors.errorString: timed out waiting for the condition)
...
I0124 04:14:59.014889 1 sync_worker.go:621] Running sync for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536)
E0124 04:14:59.063107 1 task.go:81] error running apply for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536): Job.batch "openshift-service-catalog-apiserver-remover" is invalid: [spec.selector: Required value, spec.template.metadata.labels: Invalid value: map[string]string{"controller-uid":"a6d7c62d-63a0-46f2-a99b-77580f70ab4f", "job-name":"openshift-service-catalog-apiserver-remover"}: `selector` does not match template `labels`, spec.selector: Invalid value: "null": field is immutable]
...

So yeah, we launch the Job, it takes a while to complete, we cycle back in the next sync round, and fail to pick up the existing Job cleanly.
Created attachment 1655141 [details]
full build-log.txt from CI run
The issue is that at least [1] is setting 'modified' for this case, but apparently you aren't allowed to modify Jobs this way [citation-needed]. Ideally we'd recognize that the existing Job was fine and just watch it again instead of trying to modify it. The worst-case hack would be to delete the previous Job and create a new one (which would hopefully pick up where its predecessor left off and exit before the ~4m timeout on the Job within the single CVO-manifest-application cycle).

[1]: https://github.com/openshift/cluster-version-operator/blob/54faf6fad0d4dfa7c2a7953076f608d018577fd1/lib/resourcemerge/batch.go#L11-L20
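For illustration, here is a minimal, self-contained Go sketch of the "recognize the existing Job and just watch it" approach. The `job` struct and `ensureJob` helper are hypothetical stand-ins, not CVO code; the real implementation would go through client-go and batchv1 types.

```go
package main

import "fmt"

// job is a stand-in for batchv1.Job; in the real operator the lookup would go
// through client-go, and "already exists" would be apierrors.IsAlreadyExists.
type job struct {
	name      string
	completed bool
}

// ensureJob sketches the update-free strategy: create the Job if it is absent,
// otherwise leave the live object untouched and just report its status. There
// is deliberately no update path, because mutating spec.selector or the
// template labels of a live Job is rejected by the API server as immutable.
func ensureJob(live map[string]*job, name string) string {
	if j, ok := live[name]; ok {
		if j.completed {
			return "complete"
		}
		return "waiting" // poll again next sync instead of re-applying the manifest
	}
	live[name] = &job{name: name}
	return "created"
}

func main() {
	live := map[string]*job{}
	fmt.Println(ensureJob(live, "openshift-service-catalog-apiserver-remover")) // created
	fmt.Println(ensureJob(live, "openshift-service-catalog-apiserver-remover")) // waiting
	live["openshift-service-catalog-apiserver-remover"].completed = true
	fmt.Println(ensureJob(live, "openshift-service-catalog-apiserver-remover")) // complete
}
```

With this shape, a second sync pass that arrives while the remover Job is still running simply returns "waiting" and retries later, instead of producing the `field is immutable` apply error above.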
The Job manifest is [1], from [2]. The failed CI run is [3]. [4] has:

2020-01-24T04:12:48.430341514Z I0124 04:12:48.430317 1 sync_worker.go:621] Running sync for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536)
...
2020-01-24T04:13:02.948410091Z E0124 04:13:02.948359 1 task.go:81] error running apply for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536): timed out waiting for the condition
...
2020-01-24T04:13:02.949262218Z I0124 04:13:02.949247 1 task_graph.go:596] Result of work: [Could not update job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536)]
2020-01-24T04:13:02.949276199Z I0124 04:13:02.949266 1 sync_worker.go:783] Summarizing 1 errors
2020-01-24T04:13:02.949292104Z I0124 04:13:02.949275 1 sync_worker.go:787] Update error 474 of 536: UpdatePayloadFailed Could not update job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536) (*errors.errorString: timed out waiting for the condition)
2020-01-24T04:13:02.94934611Z E0124 04:13:02.949303 1 sync_worker.go:329] unable to synchronize image (waiting 1m26.262851224s): Could not update job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536)
...
2020-01-24T04:14:59.014897431Z I0124 04:14:59.014889 1 sync_worker.go:621] Running sync for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536)
2020-01-24T04:14:59.063169169Z E0124 04:14:59.063107 1 task.go:81] error running apply for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536): Job.batch "openshift-service-catalog-apiserver-remover" is invalid: [spec.selector: Required value, spec.template.metadata.labels: Invalid value: map[string]string{"controller-uid":"a6d7c62d-63a0-46f2-a99b-77580f70ab4f", "job-name":"openshift-service-catalog-apiserver-remover"}: `selector` does not match template `labels`, spec.selector: Invalid value: "null": field is immutable]

must-gather does not collect the live Job object, because it's not in a namespace referenced by a ClusterOperator, but we have the Pod created for the Job:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-svcat-apiserver-operator/74/pull-ci-openshift-cluster-svcat-apiserver-operator-master-e2e-aws/218/artifacts/e2e-aws/pods.json | jq '.items[] | select(.metadata.generateName == "openshift-service-catalog-apiserver-remover-").metadata'
{
  "annotations": {
    "k8s.v1.cni.cncf.io/networks-status": "",
    "openshift.io/scc": "anyuid"
  },
  "creationTimestamp": "2020-01-24T04:12:56Z",
  "generateName": "openshift-service-catalog-apiserver-remover-",
  "labels": {
    "controller-uid": "a6d7c62d-63a0-46f2-a99b-77580f70ab4f",
    "job-name": "openshift-service-catalog-apiserver-remover"
  },
  "name": "openshift-service-catalog-apiserver-remover-lwtrl",
  "namespace": "openshift-service-catalog-removed",
  "ownerReferences": [
    {
      "apiVersion": "batch/v1",
      "blockOwnerDeletion": true,
      "controller": true,
      "kind": "Job",
      "name": "openshift-service-catalog-apiserver-remover",
      "uid": "a6d7c62d-63a0-46f2-a99b-77580f70ab4f"
    }
  ],
  "resourceVersion": "18251",
  "selfLink": "/api/v1/namespaces/openshift-service-catalog-removed/pods/openshift-service-catalog-apiserver-remover-lwtrl",
  "uid": "46520d4f-4971-4e79-8da9-925b4605f465"
}

[1]: https://github.com/openshift/cluster-svcat-apiserver-operator/blob/c25b01e65b58a3e11ab2712664c7a0a6ad52fa9b/manifests/0000_90_cluster-svcat-apiserver-operator_01_remover_job.yaml
[2]: https://github.com/openshift/cluster-svcat-apiserver-operator/pull/74
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-svcat-apiserver-operator/74/pull-ci-openshift-cluster-svcat-apiserver-operator-master-e2e-aws/218
[4]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-svcat-apiserver-operator/74/pull-ci-openshift-cluster-svcat-apiserver-operator-master-e2e-aws/218/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ci-op-di0k4j40-stable-sha256-dd03a9e0ac7b6e710037d8f0d0a5001b47c27bf859e49b0256e8ac54f5bd8198/namespaces/openshift-cluster-version/pods/cluster-version-operator-7f7765db6f-gw7gh/cluster-version-operator/cluster-version-operator/logs/current.log
@Jesus M. Rodriguez After going through the comments, I think that to verify this bug we need to re-run the e2e job based on PR 74 [1] to check whether the CVO fix [2] works, right? But I found that the latest passed CI job [3] was from PR 76. Could you show me how QE can rerun the e2e test against [1] to verify the bug?

[1] https://github.com/openshift/cluster-svcat-apiserver-operator/pull/74
[2] https://github.com/openshift/cluster-version-operator/pull/312
[3] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-svcat-apiserver-operator/76/pull-ci-openshift-cluster-svcat-apiserver-operator-master-e2e-aws/221
Looks like PR 74 needs a rebase [1]. [1]: https://github.com/openshift/cluster-svcat-apiserver-operator/pull/74#event-3037296216
I checked several recent CI jobs against PR 74 and did not find the original error. For example, [1] and [2] show:

2020-02-20T02:43:40.010528266Z I0220 02:43:40.010504 1 sync_worker.go:621] Running sync for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (510 of 586)
2020-02-20T02:43:40.15841562Z I0220 02:43:40.158336 1 sync_worker.go:634] Done syncing for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (510 of 586)

So, verifying the bug.

[1] https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-svcat-apiserver-operator/74/pull-ci-openshift-cluster-svcat-apiserver-operator-master-e2e-aws/224/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ci-op-1qgpi2k9-stable-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/namespaces/openshift-cluster-version/pods/cluster-version-operator-5f9c5fbd57-r2xwq/cluster-version-operator/cluster-version-operator/logs/current.log
[2] https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-svcat-apiserver-operator/74/pull-ci-openshift-cluster-svcat-apiserver-operator-master-e2e-aws/229/artifacts/e2e-aws/pods/openshift-cluster-version_cluster-version-operator-85d545c4b9-rjtlk_cluster-version-operator.log
Dropping needinfo; the bug was verified in comment #9.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581