Description of problem:
cluster-logging operator installation does not start during ZTP installation; the InstallPlan reports "Bundle unpacking failed. Reason: DeadlineExceeded".

Version-Release number of selected component (if applicable):
4.9.6

How reproducible:
Not always

Steps to Reproduce:
1. Deploy a DU node via the ZTP process from http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/kni-qe-1-4.9
2. Wait for the policies to get applied
3. Check the CSVs in the cluster-logging namespace: oc -n openshift-logging get csv

Actual results:
The cluster-logging CSV is not created, and the InstallPlan reports "Bundle unpacking failed. Reason: DeadlineExceeded" with Message: "Job was active longer than specified deadline".

Expected results:
The cluster-logging CSV is created.

Additional info:
Attaching must-gather.
Workaround:
1. Delete the CatalogSource.
2. Delete the InstallPlan.
3. Delete the Subscription.
4. Wait for the CatalogSource and Subscription to be recreated via policies; the InstallPlan and CSV should then get created.
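The workaround above can be sketched as oc commands. The namespace and resource names below are placeholders; substitute the actual names your site policies create. A dry-run fallback is included so the sketch can run without a cluster.

```shell
# Dry-run fallback so this sketch is runnable without a cluster; with a
# real cluster and oc on PATH, the commands execute for real.
command -v oc >/dev/null 2>&1 || oc() { echo "would run: oc $*"; }

CATSRC=redhat-operators   # placeholder: CatalogSource referenced by the Subscription
NS=openshift-logging      # placeholder: namespace of the stuck operator

# Delete the stuck resources; the ZTP policies recreate the
# CatalogSource and Subscription automatically.
oc -n openshift-marketplace delete catalogsource "$CATSRC"
oc -n "$NS" delete installplan --all
oc -n "$NS" delete subscription --all

# Once the policies have recreated the CatalogSource and Subscription,
# a fresh InstallPlan and the CSV should appear:
oc -n "$NS" get installplan,csv
```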
I am seeing the same issue with the local-storage operator.
I also encountered this issue during 4.9.6 deployment via ZTP, 2 out of 3 times.
Today I encountered the same issue. Client Version: 4.10.0-0.nightly-2021-12-12-232810; Server Version: 4.9.11.
This needs to be handled by OLM. In telco-edge we create a cluster with an operator subscription at bootstrap time. The install plan job is limited to 10 minutes, but during installation the nodes may not yet be Ready, so the operator installation may not succeed.

We're seeing this in jobs like https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_assisted-service/3607/pull-ci-openshift-assisted-service-master-e2e-metal-assisted-cnv/1510926604388274176:

14:41:19 - 0/2 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
14:48:59 - 0/5 nodes are available: 2 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

and finally:

14:49:25 - Successfully assigned openshift-marketplace/cf3b12d45024d0f1fba7bf83031aca80346f2e77831f2dcb9a017daef5t4qvw to test-infra-cluster-a09b7cbb-worker-0

and the bundle cannot be extracted in the remaining minute.

This issue is reproducible; see https://search.ci.openshift.org/?search=Timeout+of+3600+seconds+expired+waiting+for+Monitored+.*olm.*+operators+to+be+in+of+the+statuses+.*available&maxAge=336h&context=1&type=build-log&name=.*e2e-metal-assisted-cnv.*%7C.*e2e-metal-assisted-ocs.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Extending the job timeout from 10 minutes to 15 minutes would help, but that solution is probably not ideal.
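On OLM versions that support it, the 10-minute bundle unpack deadline can be extended per namespace via the `operatorframework.io/bundle-unpack-timeout` annotation on the OperatorGroup, rather than patching OLM itself. A minimal sketch, assuming a logging OperatorGroup in openshift-logging (names and the 15m value are illustrative):

```yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: cluster-logging          # illustrative name
  namespace: openshift-logging
  annotations:
    # Overrides the default 10m deadline on OLM's bundle unpack Job
    # (illustrative value; requires an OLM version with this feature).
    operatorframework.io/bundle-unpack-timeout: "15m"
spec:
  targetNamespaces:
    - openshift-logging
```

Whether this annotation is honored depends on the OLM version shipped with the cluster, so it would need to be verified on the affected releases.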
In the Assisted Installer SaaS we have also been facing this for a long time: roughly 90% of installations end up with OCS, CNV, and LSO operator timeouts. With the help of @akalenyu we noticed the Job objects give up retrying after 3 failures. Reproduces on 4.10.25 and 4.11.0-rc.7 (and on earlier versions as well).
Any idea why it fails after 3 retries?
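Possibly relevant background (an assumption, not confirmed from this bug's must-gather): a Kubernetes Job fails permanently once it hits either spec.backoffLimit pod failures or spec.activeDeadlineSeconds, whichever comes first. The observed "gives up after 3 failures" behavior would match an unpack Job created with backoffLimit: 3. A minimal illustration of the two interacting limits (names, image, and command are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: bundle-unpack-example     # hypothetical; real unpack Jobs have generated names
spec:
  backoffLimit: 3                 # Job fails after 3 failed pods...
  activeDeadlineSeconds: 600      # ...or after 10 minutes, whichever comes first
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: extract
          image: example.com/bundle:latest   # placeholder image
          command: ["/bin/extract-bundle"]   # placeholder for the unpack command
```

Checking the actual unpack Job spec on an affected cluster would confirm whether backoffLimit is set to 3 there.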
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days