Bug 2021456
| Summary: | operators installation doesn't start during ZTP installation, installplan reports Bundle unpacking failed. Reason: DeadlineExceeded | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Marius Cornea <mcornea> |
| Component: | OLM | Assignee: | Per da Silva <pegoncal> |
| OLM sub component: | OLM | QA Contact: | Jian Zhang <jiazha> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | achernet, agurenko, aireilly, akalenyu, cchantse, ccrum, jkeister, keyoung, lalon, obochan, odepaz, pegoncal, racedoro, rfreiman, vrutkovs |
| Version: | 4.9 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-23 14:57:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Marius Cornea
2021-11-09 10:22:27 UTC
Workaround: delete the CatalogSource, delete the InstallPlan, delete the Subscription, then wait for the CatalogSource and Subscription to be recreated via policies; the InstallPlan and CSV should then get created. (Commands are sketched after the comment stream below.)

I am seeing the same issue with the local-storage operator.

I also encountered this issue during a 4.9.6 deployment via ZTP, 2 out of 3 times.

Today I encountered the same:
Client Version: 4.10.0-0.nightly-2021-12-12-232810
Server Version: 4.9.11

This needs to be handled by OLM. In telco-edge we create a cluster with operator subscriptions at bootstrap time. The install plan (bundle unpack) job is limited to 10 minutes, but during installation the nodes may not have become Ready yet, so the operator installation may not succeed.

We're seeing this in jobs like https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_assisted-service/3607/pull-ci-openshift-assisted-service-master-e2e-metal-assisted-cnv/1510926604388274176:

14:41:19 - 0/2 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
14:48:59 - 0/5 nodes are available: 2 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

and finally

14:49:25 - Successfully assigned openshift-marketplace/cf3b12d45024d0f1fba7bf83031aca80346f2e77831f2dcb9a017daef5t4qvw to test-infra-cluster-a09b7cbb-worker-0

and the bundle cannot be extracted in the remaining minute.

This issue is reproducible, see https://search.ci.openshift.org/?search=Timeout+of+3600+seconds+expired+waiting+for+Monitored+.*olm.*+operators+to+be+in+of+the+statuses+.*available&maxAge=336h&context=1&type=build-log&name=.*e2e-metal-assisted-cnv.*%7C.*e2e-metal-assisted-ocs.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Extending the job timeout from 10 minutes to 15 minutes would probably help, but that solution is likely not ideal.

In Assisted Installer SaaS we have also been facing this for a long time: 90% of installations end up with the OCS, CNV, and LSO operators timing out. With the help of @akalenyu we noticed that the Job objects give up retrying after 3 failures.

Reproduces on 4.10.25 and 4.11.0-rc.7 (and also on earlier versions).

Any idea why it fails after 3 retries?

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days.
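
A minimal sketch of the workaround described at the top of the comment stream, using the local-storage operator as the example since it is mentioned above; the resource names (`redhat-operators`, `local-storage-operator`, `openshift-local-storage`) are placeholders and should be replaced with whatever your ZTP policies actually create:

```bash
# Placeholders: substitute the CatalogSource, Subscription, and namespace
# created by your ZTP policies.
oc delete catalogsource redhat-operators -n openshift-marketplace
oc delete installplan --all -n openshift-local-storage
oc delete subscription local-storage-operator -n openshift-local-storage

# The policies should recreate the CatalogSource and Subscription, after which
# OLM generates a fresh InstallPlan and the CSV should install.
oc get subscription,installplan,csv -n openshift-local-storage
```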
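
To confirm the 10-minute limit and the 3-failure retry cap discussed above, the spec of the bundle unpack Job can be inspected directly. A sketch, assuming the unpack Job runs in openshift-marketplace (the catalog's namespace); `<unpack-job>` and `<operator-namespace>` are placeholders:

```bash
# List the unpack jobs OLM created for bundle extraction.
oc get jobs -n openshift-marketplace

# activeDeadlineSeconds caps how long the unpack may run (the ~10 min limit);
# backoffLimit caps how many failed pods the Job tolerates before giving up.
oc get job <unpack-job> -n openshift-marketplace \
  -o jsonpath='deadline={.spec.activeDeadlineSeconds} backoff={.spec.backoffLimit}{"\n"}'

# The InstallPlan surfaces the failure reported in the bug summary:
# "Bundle unpacking failed. Reason: DeadlineExceeded".
oc describe installplan -n <operator-namespace>
```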
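
If the goal is simply to give the unpack job more time, as the 10-to-15 minute suggestion above implies, later OLM releases accept a timeout override as an OperatorGroup annotation. Whether this is honored on the affected 4.9/4.10 versions is an assumption and should be verified before relying on it:

```bash
# Assumption: the operatorframework.io/bundle-unpack-timeout annotation is
# honored by the catalog operator on this OLM version; verify before use.
# <og-name> and <operator-namespace> are placeholders.
oc annotate operatorgroup <og-name> -n <operator-namespace> \
  operatorframework.io/bundle-unpack-timeout=15m
```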