Description of problem:
cluster-logging operator installation does not start during ZTP installation; the InstallPlan reports "Bundle unpacking failed. Reason: DeadlineExceeded".

Version-Release number of selected component (if applicable):
4.9.6

How reproducible:
Not always

Steps to Reproduce:
1. Deploy a DU node via the ZTP process from http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/kni-qe-1-4.9
2. Wait for the policies to get applied
3. Check the CSVs in the cluster-logging namespace: oc -n openshift-logging get csv

Actual results:
The cluster-logging CSV is not created, and the InstallPlan reports "Bundle unpacking failed. Reason: DeadlineExceeded" with Message: "Job was active longer than specified deadline".

Expected results:
The cluster-logging CSV is created.

Additional info:
Attaching must-gather.
Workaround:
1. Delete the CatalogSource.
2. Delete the InstallPlan.
3. Delete the Subscription.
4. Wait for the CatalogSource and Subscription to be recreated via policies; the InstallPlan and CSV should then get created.
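The workaround above can be sketched as oc commands. The namespace and resource names below are placeholders; substitute the actual names your site policies create. A dry-run fallback is included so the sketch can run without a cluster.

```shell
# Dry-run fallback so this sketch is runnable without a cluster; with a
# real cluster and oc on PATH, the commands execute for real.
command -v oc >/dev/null 2>&1 || oc() { echo "would run: oc $*"; }

CATSRC=redhat-operators   # placeholder: CatalogSource referenced by the Subscription
NS=openshift-logging      # placeholder: namespace of the stuck operator

# Delete the stuck resources; the ZTP policies recreate the
# CatalogSource and Subscription automatically.
oc -n openshift-marketplace delete catalogsource "$CATSRC"
oc -n "$NS" delete installplan --all
oc -n "$NS" delete subscription --all

# Once the policies have recreated the CatalogSource and Subscription,
# a fresh InstallPlan and the CSV should appear:
oc -n "$NS" get installplan,csv
```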
I am seeing the same issue with the local-storage operator.
I also encountered this issue during 4.9.6 deployment via ZTP, 2 out of 3 times.
Today I encountered the same issue. Client Version: 4.10.0-0.nightly-2021-12-12-232810; Server Version: 4.9.11.
This needs to be handled by OLM. In telco-edge we create a cluster with an operator subscription at bootstrap time. The install plan job is limited to 10 minutes, but during installation the nodes may not yet be Ready, so the operator installation may not succeed.

We're seeing this in jobs like https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_assisted-service/3607/pull-ci-openshift-assisted-service-master-e2e-metal-assisted-cnv/1510926604388274176:

14:41:19 - 0/2 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
14:48:59 - 0/5 nodes are available: 2 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

and finally:

14:49:25 - Successfully assigned openshift-marketplace/cf3b12d45024d0f1fba7bf83031aca80346f2e77831f2dcb9a017daef5t4qvw to test-infra-cluster-a09b7cbb-worker-0

and the bundle cannot be extracted in the remaining minute.

This issue is reproducible; see https://search.ci.openshift.org/?search=Timeout+of+3600+seconds+expired+waiting+for+Monitored+.*olm.*+operators+to+be+in+of+the+statuses+.*available&maxAge=336h&context=1&type=build-log&name=.*e2e-metal-assisted-cnv.*%7C.*e2e-metal-assisted-ocs.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Extending the job timeout from 10 minutes to 15 minutes would help, but that solution is probably not ideal.
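On OLM versions that support it, the 10-minute bundle unpack deadline can be extended per namespace via the `operatorframework.io/bundle-unpack-timeout` annotation on the OperatorGroup, rather than patching OLM itself. A minimal sketch, assuming a logging OperatorGroup in openshift-logging (names and the 15m value are illustrative):

```yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: cluster-logging          # illustrative name
  namespace: openshift-logging
  annotations:
    # Overrides the default 10m deadline on OLM's bundle unpack Job
    # (illustrative value; requires an OLM version with this feature).
    operatorframework.io/bundle-unpack-timeout: "15m"
spec:
  targetNamespaces:
    - openshift-logging
```

Whether this annotation is honored depends on the OLM version shipped with the cluster, so it would need to be verified on the affected releases.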
In the Assisted Installer SaaS we have also been facing this for a long time: roughly 90% of installations end up with OCS, CNV, and LSO operator timeouts. With the help of @akalenyu we noticed the Job objects give up retrying after 3 failures. Reproduces on 4.10.25 and 4.11.0-rc.7 (and on earlier versions as well).
Any idea why it fails after 3 retries?
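Possibly relevant background (an assumption, not confirmed from this bug's must-gather): a Kubernetes Job fails permanently once it hits either spec.backoffLimit pod failures or spec.activeDeadlineSeconds, whichever comes first. The observed "gives up after 3 failures" behavior would match an unpack Job created with backoffLimit: 3. A minimal illustration of the two interacting limits (names, image, and command are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: bundle-unpack-example     # hypothetical; real unpack Jobs have generated names
spec:
  backoffLimit: 3                 # Job fails after 3 failed pods...
  activeDeadlineSeconds: 600      # ...or after 10 minutes, whichever comes first
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: extract
          image: example.com/bundle:latest   # placeholder image
          command: ["/bin/extract-bundle"]   # placeholder for the unpack command
```

Checking the actual unpack Job spec on an affected cluster would confirm whether backoffLimit is set to 3 there.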
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days