Description of problem:
During installation, if an operator defines a Job at a runlevel other than the end, the installation blocks until the job can finish. The problem is that the job can't run, because dependencies like the network aren't even up yet.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Pick an operator, like the cluster-svcat-apiserver-operator. Define a Job and its associated resources in the manifests, naming the job manifest something like 08_remover_job.yaml.
2. Watch the install fail to bootstrap.
3. Move the job's manifest to a later runlevel, e.g. 0000_90_cluster-svcat-apiserver-operator_01_remover_job.yaml. This puts the job at the end of the payload, and the cluster finishes bootstrapping because the job now has the resources it needs to complete its objective.
Another side effect: if a Job genuinely needs to run earlier than the end, this becomes a bigger problem.
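For context, the CVO applies release-payload manifests in lexical filename order, which is why moving the job from an 08-style prefix to 0000_90_ changes when it runs. A minimal sketch of that ordering (every filename except the remover job is a hypothetical example):

```python
# The CVO walks the payload's manifests in sorted (lexical) filename order,
# so the "0000_NN_" prefix effectively encodes the runlevel.
manifests = [
    "0000_90_cluster-svcat-apiserver-operator_01_remover_job.yaml",
    "0000_08_example-operator_01_remover_job.yaml",   # hypothetical early runlevel
    "0000_50_example-operator_00_namespace.yaml",     # hypothetical mid runlevel
]

for name in sorted(manifests):
    runlevel = name.split("_")[1]   # the NN portion of the 0000_NN_ prefix
    print(runlevel, name)
```

Sorting places the 08 manifest first and the 90 manifest last, which matches the observed behavior: at runlevel 08 the job blocks bootstrap, while at 90 it runs after the rest of the cluster is up.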
Created attachment 1655127
The CVO requires some amount of runlevel management by developers adding manifests. Dropping this down to medium; if nothing else, we'll need to make the docs clearer about what the expectations are for a given runlevel.
Jessica and Clayton point out that we don't block on other objects during the installation phase, where we just try to stuff as many objects into the cluster as fast as we can. Relaxing that to make install-time jobs non-blocking makes sense to me.
@Jesus M. Rodriguez
After going through the comments, I think you installed with a customized release image, right? Could you show me the detailed steps to verify the bug? It would also help to share your 0000_90_cluster-svcat-apiserver-operator_01_remover_job.yaml.
The PRs with the jobs are, for apiserver, https://github.com/openshift/cluster-svcat-apiserver-operator/pull/74 and, for controller-manager, https://github.com/openshift/cluster-svcat-controller-manager-operator/pull/68
The job definition file is as follows, for apiserver: https://github.com/jmrodri/cluster-svcat-apiserver-operator/blob/removal-job/manifests/0000_90_cluster-svcat-apiserver-operator_01_remover_job.yaml
The job definition file is as follows, for controller-manager: https://github.com/jmrodri/cluster-svcat-controller-manager-operator/blob/remove-svcat-add-job/manifests/0000_90_cluster-svcat-controller-manager-operator_01_remover_job.yaml
The apiserver job YAML is the most up to date. The PRs have failing tests, but the failures are not related to the CVO.
To verify this bug, you might want to create a dummy job that runs via the CVO and ensure the cluster starts up without blocking. Maybe have a job that looks at items in the cluster and logs what it finds. Does this help any?
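A dummy job along those lines might look like the following sketch; the name, namespace, service account, and image here are assumptions for illustration, not taken from the actual PRs:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: dummy-inspector                    # hypothetical name
  namespace: openshift-dummy-inspector     # hypothetical namespace
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: dummy-inspector  # assumes an SA with list permissions exists
      containers:
      - name: inspect
        image: quay.io/openshift/origin-cli:latest  # assumes an image carrying the oc client
        command:
        - /bin/sh
        - -c
        - oc get pods --all-namespaces     # just log what the cluster contains
```

Placed at an early runlevel in the payload, a job like this would reproduce the blocking behavior; placed at 0000_90_, it should complete once the cluster is up.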
Thanks @Jesus M. Rodriguez. It's not easy for us to trigger an installation with a dummy job included in the CVO, since QE can only trigger installations with nightly builds, which are built by the ART team. So I dug into the e2e job based on your info and found the original failed test, "Run template e2e-aws - e2e-aws container setup".
level=info msg="Credentials loaded from the \"default\" profile in file \"/etc/openshift-installer/.awscred\""
level=warning msg="Found override for release image. Please be warned, this is not advised"
level=info msg="Consuming Install Config from target directory"
level=info msg="Creating infrastructure resources..."
level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-di0k4j40-7ad33.origin-ci-int-aws.dev.rhcloud.com:6443..."
level=info msg="API v1.17.1 up"
level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
level=info msg="Destroying the bootstrap resources..."
level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-di0k4j40-7ad33.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
level=info msg="Cluster operator insights Disabled is False with : "
level=fatal msg="failed to initialize the cluster: Could not update job \"openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover\" (474 of 536): the object is invalid, possibly due to local cluster configuration"
Some error info from the CVO logs:
E0124 04:13:02.948359 1 task.go:81] error running apply for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536): timed out waiting for the condition
I0124 04:13:02.949247 1 task_graph.go:596] Result of work: [Could not update job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536)]
I0124 04:13:02.949266 1 sync_worker.go:783] Summarizing 1 errors
I0124 04:13:02.949275 1 sync_worker.go:787] Update error 474 of 536: UpdatePayloadFailed Could not update job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536) (*errors.errorString: timed out waiting for the condition)
E0124 04:13:02.949303 1 sync_worker.go:329] unable to synchronize image (waiting 1m26.262851224s): Could not update job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536)
Trying to verify it with the latest e2e jobs against pr74 and pr68.
Checked a newer e2e job against pr74 in : the "Run template e2e-aws - e2e-aws container setup" test passed, and the remover job completed successfully per the CVO logs.
I0305 03:47:25.362240 1 sync_worker.go:621] Running sync for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (494 of 570)
I0305 03:47:25.491843 1 sync_worker.go:634] Done syncing for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (494 of 570)
Checked a newer e2e job against pr68 in : the "Run template e2e-aws - e2e-aws container setup" test passed, and the remover job completed successfully per the CVO logs.
I0309 21:20:52.471410 1 sync_worker.go:621] Running sync for job "openshift-service-catalog-removed/openshift-service-catalog-controller-manager-remover" (499 of 571)
I0309 21:20:52.508906 1 sync_worker.go:634] Done syncing for job "openshift-service-catalog-removed/openshift-service-catalog-controller-manager-remover" (499 of 571)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.