Bug 2021456

Summary: operators installation doesn't start during ZTP installation, installplan reports Bundle unpacking failed. Reason: DeadlineExceeded
Product: OpenShift Container Platform
Reporter: Marius Cornea <mcornea>
Component: OLM
Assignee: Per da Silva <pegoncal>
OLM sub component: OLM
QA Contact: Jian Zhang <jiazha>
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
CC: achernet, agurenko, aireilly, akalenyu, cchantse, ccrum, jkeister, keyoung, lalon, obochan, odepaz, pegoncal, racedoro, rfreiman, vrutkovs
Version: 4.9   
Last Closed: 2022-08-23 14:57:25 UTC
Type: Bug

Description Marius Cornea 2021-11-09 10:22:27 UTC
Description of problem:

The cluster-logging operator installation does not start during ZTP installation. The installplan reports: Bundle unpacking failed. Reason: DeadlineExceeded


Version-Release number of selected component (if applicable):
4.9.6


How reproducible:
Not always

Steps to Reproduce:
1. Deploy a DU node via the ZTP process from
http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/kni-qe-1-4.9
2. Wait for the policies to be applied
3. Check the CSVs in the openshift-logging namespace:
oc -n openshift-logging get csv


Actual results:

The cluster-logging CSV is not created, and the installplan reports "Bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline".
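
To check quickly whether an installplan shows this failure, one option is to search its status text for the signature quoted above. The following is a minimal sketch, not an official diagnostic: the helper name is_unpack_timeout is made up for illustration, and it assumes the message appears on a single line of the output it is given.

```shell
# Hypothetical helper: succeeds if the text on stdin contains the failure
# signature quoted in this report.
is_unpack_timeout() {
  grep -Fq 'Bundle unpacking failed. Reason: DeadlineExceeded'
}

# Demonstration on the message from this report (no cluster needed).
sample='Bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline'
if printf '%s\n' "$sample" | is_unpack_timeout; then
  echo "bundle unpack hit the job deadline"
fi

# Against a live cluster, one might pipe in the installplan YAML instead
# (assumes the message is not wrapped across lines in the YAML output):
#   oc -n openshift-logging get installplan -o yaml | is_unpack_timeout
```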

Expected results:
The cluster-logging CSV is created.

Additional info:
Attaching must-gather

Comment 2 Marius Cornea 2021-11-09 10:32:01 UTC
Workaround:

1. Delete the catalogsource.
2. Delete the installplan.
3. Delete the subscription (sub).
4. Wait for the catalogsource and subscription to be recreated via policies.
5. The installplan and CSV should then get created.
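
The workaround steps above could be scripted roughly as follows. This is only a sketch under assumptions: the catalogsource name, its namespace, and the subscription name are placeholders that do not come from this report, and the script defaults to a dry run that prints the commands instead of executing them. Deleting all installplans in the namespace is a simplification of "delete installplan".

```shell
# Placeholders (assumptions, not taken from this report); override via env.
NS="${NS:-openshift-logging}"                      # namespace used in the reproduction steps
CATALOG_NS="${CATALOG_NS:-openshift-marketplace}"  # assumed catalogsource namespace
CATALOG="${CATALOG:-example-catalog}"              # hypothetical catalogsource name
SUB="${SUB:-cluster-logging}"                      # hypothetical subscription name
DRY_RUN="${DRY_RUN:-1}"                            # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Delete the three objects; the policies are expected to recreate the
# catalogsource and subscription afterwards.
workaround() {
  run oc -n "$CATALOG_NS" delete catalogsource "$CATALOG"
  run oc -n "$NS" delete installplan --all
  run oc -n "$NS" delete subscriptions.operators.coreos.com "$SUB"
}

workaround
```

Set DRY_RUN=0 only after replacing the placeholder names with the ones from the affected cluster.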

Comment 3 Marius Cornea 2021-11-09 12:07:31 UTC
I am seeing the same issue with the local-storage operator.

Comment 5 yliu1 2021-11-12 20:22:16 UTC
I also encountered this issue in 2 out of 3 deployments of 4.9.6 via ZTP.

Comment 8 obochan 2021-12-15 11:56:13 UTC
Encountered the same issue today with:
Client Version: 4.10.0-0.nightly-2021-12-12-232810
Server Version: 4.9.11

Comment 9 Vadim Rutkovsky 2022-04-04 14:05:22 UTC
This needs to be handled by OLM. In telco-edge we create a cluster with operator subscriptions at bootstrap time.

The install plan job is limited to 10 minutes, but during installation the nodes may not have become Ready yet, so the operator installation may not succeed.
We're seeing this in jobs like https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_assisted-service/3607/pull-ci-openshift-assisted-service-master-e2e-metal-assisted-cnv/1510926604388274176:

14:41:19 - 0/2 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
14:48:59 - 0/5 nodes are available: 2 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
and finally
14:49:25 - Successfully assigned openshift-marketplace/cf3b12d45024d0f1fba7bf83031aca80346f2e77831f2dcb9a017daef5t4qvw to test-infra-cluster-a09b7cbb-worker-0

and the bundle cannot be extracted in the remaining minute.
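
The time budget in the log excerpt above can be worked through as follows. This is a rough sketch that assumes the 10-minute deadline started no later than the first scheduling event at 14:41:19; the actual job start time is not shown in the excerpt, so the result is an upper bound on the time left for unpacking.

```shell
# Convert HH:MM:SS to seconds since midnight.
to_seconds() {
  old_ifs=$IFS
  IFS=:
  set -- $1          # split the argument on ':' into $1 $2 $3
  IFS=$old_ifs
  # Strip one leading zero so values such as "08" are not parsed as octal.
  echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
}

DEADLINE=600                           # 10-minute limit mentioned in this comment
first=$(to_seconds 14:41:19)           # first "0/2 nodes are available" event
scheduled=$(to_seconds 14:49:25)       # "Successfully assigned" event
waited=$((scheduled - first))          # time spent unschedulable
left=$((DEADLINE - waited))            # upper bound on time left to unpack

echo "unschedulable for ${waited}s, at most ${left}s left to unpack the bundle"
```

With these timestamps the pod waits 486 seconds before being scheduled, leaving at most 114 seconds of the 600-second budget, which is consistent with the "remaining minute" described above if the job started somewhat before the first event.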

This issue is reproducible, see https://search.ci.openshift.org/?search=Timeout+of+3600+seconds+expired+waiting+for+Monitored+.*olm.*+operators+to+be+in+of+the+statuses+.*available&maxAge=336h&context=1&type=build-log&name=.*e2e-metal-assisted-cnv.*%7C.*e2e-metal-assisted-ocs.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

It seems that extending the job timeout from 10 to 15 minutes would help, though that solution is probably not ideal.

Comment 14 Lital Alon 2022-08-09 07:18:34 UTC
In the Assisted Installer SaaS we have also been facing this for a long time: 90% of the installations end up with OCS, CNV, or LSO operator timeouts.
With the help of @akalenyu, we noticed that the job objects give up retrying after 3 failures.
Reproduces on 4.10.25 and 4.11.0-rc.7, and also on earlier versions.

Comment 15 Lital Alon 2022-08-09 07:19:30 UTC
Any idea why it fails after 3 retries?

Comment 18 Red Hat Bugzilla 2023-09-15 01:49:53 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days