1794817 – OpenShift installer blocking on jobs

Bug 1794817 - OpenShift installer blocking on jobs

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.4.0
Assignee:	Abhinav Dahiya
QA Contact:	liujia
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-01-24 19:02 UTC by Jesus M. Rodriguez
Modified:	2020-05-04 11:27 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-05-04 11:26:40 UTC
Target Upstream Version:
Embargoed:

Attachments
cluster-version-operator.log (1.80 MB, text/plain) 2020-01-24 19:13 UTC, Jesus M. Rodriguez

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-version-operator pull 317	0	None	closed	bug 1794817: lib/resourcebuilder/batch.go: skip waiting for job to complete when in Init mode	2020-10-14 15:15:49 UTC
Red Hat Product Errata	RHBA-2020:0581	0	None	None	None	2020-05-04 11:27:05 UTC

Description Jesus M. Rodriguez 2020-01-24 19:02:59 UTC

Description of problem:
During installation, if an operator defines a job at a runlevel other than the end, the installation blocks until the job finishes. The problem is that the job can't run, because things like the network aren't even up yet.

Version-Release number of selected component (if applicable):


How reproducible:

Pick an operator, such as the cluster-svcat-apiserver-operator. Define a Job and its associated objects in the operator's manifests, with the job in a file named something like 08_remover_job.yaml.

Watch the install fail to bootstrap.


Additional info:

Moving the job's manifest to a different runlevel, for example one at the end such as 0000_90_cluster-svcat-apiserver-operator_01_remover_job.yaml, lets the cluster finish bootstrapping, because by then the job has the resources it needs to complete.
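
The workaround above relies on the manifest file name deciding when the job is applied. A minimal sketch of that idea, assuming names follow the 0000_<runlevel>_<component>_<rest> pattern seen in this report; runLevel is a hypothetical helper written for illustration, not a CVO function, and the two example-operator names are made up:

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// runLevel extracts the run level from a manifest name of the form
// 0000_<runlevel>_<component>_<rest>. It returns -1 when the name does
// not follow that pattern. Hypothetical helper, not part of the CVO.
func runLevel(name string) int {
	parts := strings.SplitN(name, "_", 3)
	if len(parts) < 3 || parts[0] != "0000" {
		return -1
	}
	level, err := strconv.Atoi(parts[1])
	if err != nil {
		return -1
	}
	return level
}

func main() {
	manifests := []string{
		"0000_90_cluster-svcat-apiserver-operator_01_remover_job.yaml",
		"0000_50_example-operator_00_namespace.yaml", // made-up name
		"0000_10_example-operator_00_config.yaml",    // made-up name
	}
	// Sorting the names lexically also sorts them by run level here,
	// because the levels are zero-padded to the same width.
	sort.Strings(manifests)
	for _, m := range manifests {
		fmt.Printf("runlevel %2d  %s\n", runLevel(m), m)
	}
}
```

With these inputs the remover job sorts last, which matches the behavior described above: a job at runlevel 90 is only reached after the earlier manifests have been applied.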

Comment 1 Jesus M. Rodriguez 2020-01-24 19:05:21 UTC

Another side effect is that if the Job needs to be run earlier than the end, this will become a bigger problem.

Comment 2 Jesus M. Rodriguez 2020-01-24 19:06:49 UTC

https://github.com/openshift/cluster-version-operator/blob/ac394f1f73a6e1367d0af8622d4f9af545a6e39a/pkg/cvo/sync_worker.go#L587-L611

Comment 3 Jesus M. Rodriguez 2020-01-24 19:13:52 UTC

Created attachment 1655127
cluster-version-operator.log

Comment 4 Scott Dodson 2020-01-24 19:41:40 UTC

The CVO requires some amount of runlevel management by developers adding manifests. Dropping the priority to medium; if nothing else, we'll need to make the docs clearer about what to expect at a given runlevel.

Comment 5 W. Trevor King 2020-01-24 21:12:37 UTC

Jessica and Clayton point out that we don't block on other objects during the installation phase, where we just try to stuff as many objects into the cluster as fast as we can.  Relaxing that to make install-time jobs non-blocking makes sense to me.

Comment 8 liujia 2020-02-19 04:10:50 UTC

@Jesus M. Rodriguez 
After going through the comments, I think you installed with a customized release image, right? Could you show me the detailed steps to verify the bug? It would also help if you shared your 0000_90_cluster-svcat-apiserver-operator_01_remover_job.yaml.

Comment 9 Jesus M. Rodriguez 2020-03-10 03:59:17 UTC

The PRs with the jobs are for apiserver: https://github.com/openshift/cluster-svcat-apiserver-operator/pull/74 and for controller-manager: https://github.com/openshift/cluster-svcat-controller-manager-operator/pull/68

The job definition file is as follows, for apiserver: https://github.com/jmrodri/cluster-svcat-apiserver-operator/blob/removal-job/manifests/0000_90_cluster-svcat-apiserver-operator_01_remover_job.yaml
The job definition file is as follows, for controller-manager: https://github.com/jmrodri/cluster-svcat-controller-manager-operator/blob/remove-svcat-add-job/manifests/0000_90_cluster-svcat-controller-manager-operator_01_remover_job.yaml

The apiserver job yaml is the most up to date. The PRs are failing tests, but the failures are not related to the CVO.

To verify this bug, you might want to create a dummy job that runs in the CVO and ensure the cluster starts up without blocking. Maybe have a job that looks at items in the cluster and logs what it finds. Does this help any?

Comment 10 liujia 2020-03-10 10:12:36 UTC

Thanks @Jesus M. Rodriguez. It's not easy for us to trigger an installation with a dummy job included in the CVO, since QE can only trigger installations with a nightly build, which is built by the ART team. So I dug into the e2e job based on your info and found the original failed test, "Run template e2e-aws - e2e-aws container setup".

level=info msg="Credentials loaded from the \"default\" profile in file \"/etc/openshift-installer/.awscred\""
level=warning msg="Found override for release image. Please be warned, this is not advised"
level=info msg="Consuming Install Config from target directory"
level=info msg="Creating infrastructure resources..."
level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-di0k4j40-7ad33.origin-ci-int-aws.dev.rhcloud.com:6443..."
level=info msg="API v1.17.1 up"
level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
level=info msg="Destroying the bootstrap resources..."
level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-di0k4j40-7ad33.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
level=info msg="Cluster operator insights Disabled is False with : "
level=fatal msg="failed to initialize the cluster: Could not update job \"openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover\" (474 of 536): the object is invalid, possibly due to local cluster configuration"

Some error info from the CVO logs [1]:
0124 04:13:02.948359       1 task.go:81] error running apply for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536): timed out waiting for the condition
...
I0124 04:13:02.949247       1 task_graph.go:596] Result of work: [Could not update job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536)]
I0124 04:13:02.949266       1 sync_worker.go:783] Summarizing 1 errors
I0124 04:13:02.949275       1 sync_worker.go:787] Update error 474 of 536: UpdatePayloadFailed Could not update job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536) (*errors.errorString: timed out waiting for the condition)
E0124 04:13:02.949303       1 sync_worker.go:329] unable to synchronize image (waiting 1m26.262851224s): Could not update job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (474 of 536)

[1] https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-svcat-apiserver-operator/74/pull-ci-openshift-cluster-svcat-apiserver-operator-master-e2e-aws/218/artifacts/e2e-aws/pods/openshift-cluster-version_cluster-version-operator-7f7765db6f-gw7gh_cluster-version-operator.log


I tried to verify it with the latest e2e jobs against pr74 and pr68.

Checked a newer e2e job against pr74 in [2]: the "Run template e2e-aws - e2e-aws container setup" test passes, and the remover job finished successfully according to the CVO logs.
I0305 03:47:25.362240       1 sync_worker.go:621] Running sync for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (494 of 570)
I0305 03:47:25.491843       1 sync_worker.go:634] Done syncing for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (494 of 570)

[2] https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-svcat-apiserver-operator/74/pull-ci-openshift-cluster-svcat-apiserver-operator-master-e2e-aws/229/artifacts/e2e-aws/pods/openshift-cluster-version_cluster-version-operator-85d545c4b9-rjtlk_cluster-version-operator.log

Checked a newer e2e job against pr68 in [3]: the "Run template e2e-aws - e2e-aws container setup" test passes, and the remover job finished successfully according to the CVO logs.
I0309 21:20:52.471410       1 sync_worker.go:621] Running sync for job "openshift-service-catalog-removed/openshift-service-catalog-controller-manager-remover" (499 of 571)
I0309 21:20:52.508906       1 sync_worker.go:634] Done syncing for job "openshift-service-catalog-removed/openshift-service-catalog-controller-manager-remover" (499 of 571)

[3] https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-svcat-controller-manager-operator/68/pull-ci-openshift-cluster-svcat-controller-manager-operator-master-e2e-aws/162/artifacts/e2e-aws/pods/openshift-cluster-version_cluster-version-operator-7b7fc49b59-j8bsz_cluster-version-operator.log
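
The manual log check above can be sketched as a small scanner. It assumes only the two message shapes quoted in this comment ("Running sync for job" and "Done syncing for job"); jobSyncStatus is a hypothetical helper, and real CVO logs may contain other relevant lines that it ignores.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

var (
	startRe = regexp.MustCompile(`Running sync for job "([^"]+)"`)
	doneRe  = regexp.MustCompile(`Done syncing for job "([^"]+)"`)
)

// jobSyncStatus maps each job named in the log to whether its most
// recent "Running sync" line was followed by a "Done syncing" line.
func jobSyncStatus(log string) map[string]bool {
	status := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		line := sc.Text()
		if m := startRe.FindStringSubmatch(line); m != nil {
			status[m[1]] = false // sync started, not yet finished
		} else if m := doneRe.FindStringSubmatch(line); m != nil {
			status[m[1]] = true
		}
	}
	return status
}

func main() {
	// The two lines quoted above from the pr74 run.
	log := `I0305 03:47:25.362240       1 sync_worker.go:621] Running sync for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (494 of 570)
I0305 03:47:25.491843       1 sync_worker.go:634] Done syncing for job "openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover" (494 of 570)`

	for job, done := range jobSyncStatus(log) {
		fmt.Printf("%s done=%v\n", job, done)
	}
}
```

A job that appears with done=false started syncing but never logged completion, which is the pattern to look for when the install blocks.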

Comment 12 errata-xmlrpc 2020-05-04 11:26:40 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
