Bug 1921264
| Summary: | Bundle unpack jobs can hang indefinitely | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ben Luddy <bluddy> |
| Component: | OLM | Assignee: | Haseeb Tariq <htariq> |
| OLM sub component: | OLM | QA Contact: | xzha |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | ||
| Priority: | high | CC: | anbhatta, krizza, tflannag |
| Version: | 4.6 | Keywords: | Triaged |
| Target Milestone: | --- | ||
| Target Release: | 4.8.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: |
Cause: The catalog-operator created bundle unpack jobs for an InstallPlan without setting a timeout
Consequence: In the case of a non-existent or deleted bundle image, the bundle unpack job would stay running forever and the InstallPlan would stay in the Installing phase with no indication of the bundle unpack Job's pod failing to resolve the image.
Fix: The catalog-operator now sets a default 10m timeout on bundle unpack jobs. This can be configured by the "--bundle-unpack-timeout" flag.
Result: The bundle unpack Job fails after the configured timeout and the InstallPlan also transitions to phase Failed with the failure reason visible in status.conditions and status.bundleLookups.conditions.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-07-27 22:36:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Ben Luddy
2021-01-27 19:47:58 UTC
Marking this with upcoming sprint and removing from 4.7.0. This will be addressed in a future sprint. Marking as triaged and reviewed in sprint. We just need to assign this one out. Marking as reviewed in sprint: We already have a PR in progress that adds a timeout and fails the InstallPlan on bundle unpack failures. We're currently working on covering the case of non-existent bundle images after which the PR should be ready to go. The linked PR: https://github.com/operator-framework/operator-lifecycle-manager/pull/2093 is only upstream for now so I've unlinked the BZ bot from that so this doesn't go to MODIFIED/ON_QA when that PR merges. That should happen once I have a corresponding PR on the downstream repo: https://github.com/openshift/operator-framework-olm Changing to assigned since only the upstream PR was merged. https://github.com/operator-framework/operator-lifecycle-manager/pull/2093 The BZ should move to POST once the downstream PR is opened. Verify:
zhaoxia@xzha-mac etcd % oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.8.0-0.nightly-2021-05-15-141455 True False 5h6m Cluster version is 4.8.0-0.nightly-2021-05-15-141455
zhaoxia@xzha-mac etcd % oc adm release info registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-05-15-141455 |grep operator-lifecycle-manager
operator-lifecycle-manager sha256:d03207b3ed3d3713bdd5e3f8cd912f61d91e676d3916da0c3e1957366069dc4d
1) prepare a index image which bundle is unpullable
zhaoxia@xzha-mac etcd % opm alpha bundle build -c 4.8 -e 4.8 -d ./0.9.2/ -p etcd -t quay.io/olmqe/etcd-bundle:bug-0.9.2 -o=true
INFO[0000] Building annotations.yaml
...
=> => naming to quay.io/olmqe/etcd-bundle:bug-0.9.2
zhaoxia@xzha-mac etcd % docker push quay.io/olmqe/etcd-bundle:bug-0.9.2
The push refers to repository [quay.io/olmqe/etcd-bundle]
f0911355df3b: Pushed
e5a10c489bac: Pushed
bug-0.9.2: digest: sha256:916850868de1b58d9b5ea72e571b071f4e618389395d1fbd2edf1709b36bfd9f size: 732
zhaoxia@xzha-mac etcd % opm index add -b quay.io/olmqe/etcd-bundle:bug-0.9.2 -t quay.io/olmqe/etcd-index:bug-0.9.2 -c docker --mode semver
INFO[0000] building the index bundles="[quay.io/olmqe/etcd-bundle:bug-0.9.2]"
INFO[0000] running /usr/local/bin/docker pull quay.io/olmqe/etcd-bundle:bug-0.9.2 bundles="[quay.io/olmqe/etcd-bundle:bug-0.9.2]"
INFO[0005] running docker create bundles="[quay.io/olmqe/etcd-bundle:bug-0.9.2]"
INFO[0005] running docker cp bundles="[quay.io/olmqe/etcd-bundle:bug-0.9.2]"
INFO[0005] running docker rm bundles="[quay.io/olmqe/etcd-bundle:bug-0.9.2]"
INFO[0005] Could not find optional dependencies file dir=bundle_tmp064132418 file=bundle_tmp064132418/metadata load=annotations
INFO[0005] found csv, loading bundle dir=bundle_tmp064132418 file=bundle_tmp064132418/manifests load=bundle
INFO[0005] loading bundle file dir=bundle_tmp064132418/manifests file=etcdbackups.etcd.database.coreos.com.crd.yaml load=bundle
INFO[0005] loading bundle file dir=bundle_tmp064132418/manifests file=etcdclusters.etcd.database.coreos.com.crd.yaml load=bundle
INFO[0005] loading bundle file dir=bundle_tmp064132418/manifests file=etcdoperator.v0.9.2.clusterserviceversion.yaml load=bundle
INFO[0005] loading bundle file dir=bundle_tmp064132418/manifests file=etcdrestores.etcd.database.coreos.com.crd.yaml load=bundle
INFO[0005] Generating dockerfile bundles="[quay.io/olmqe/etcd-bundle:bug-0.9.2]"
INFO[0005] writing dockerfile: index.Dockerfile695761296 bundles="[quay.io/olmqe/etcd-bundle:bug-0.9.2]"
INFO[0005] running docker build bundles="[quay.io/olmqe/etcd-bundle:bug-0.9.2]"
INFO[0005] [docker build -f index.Dockerfile695761296 -t quay.io/olmqe/etcd-index:bug-0.9.2 .] bundles="[quay.io/olmqe/etcd-bundle:bug-0.9.2]"
zhaoxia@xzha-mac etcd % docker push quay.io/olmqe/etcd-index:bug-0.9.2
The push refers to repository [quay.io/olmqe/etcd-index]
3c9814547744: Pushed
87d838f78e87: Mounted from olmqe/ditto-index
e9033f602913: Mounted from olmqe/ditto-index
479bfeeb0448: Layer already exists
71b397855bc4: Layer already exists
b2d5eeeaba3a: Layer already exists
bug-0.9.2: digest: sha256:b8fe75aae745d3d69e76bcef1eb1e7e27181006d217cad46d81e3df907f17d1c size: 1578
Delete quay.io/olmqe/etcd-bundle:bug-0.9.2 using quay.io GUI
2) create catsrc using this index image, create sub
zhaoxia@xzha-mac etcd % cat catsrc.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
name: etcd-index-1
namespace: openshift-marketplace
spec:
displayName: etcd-cr
publisher: testqe
sourceType: grpc
image: quay.io/olmqe/etcd-index:bug-0.9.2
updateStrategy:
registryPoll:
interval: 2m
zhaoxia@xzha-mac etcd % cat sub.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: etcd
namespace: test-bug
spec:
channel: "4.8"
installPlanApproval: Automatic
name: etcd
source: etcd-index-1
sourceNamespace: openshift-marketplace
zhaoxia@xzha-mac etcd % oc apply -f catsrc.yaml
catalogsource.operators.coreos.com/etcd-index-1 created
zhaoxia@xzha-mac etcd % oc get packagemanifests | grep etcd
etcd Community Operators 5h13m
etcd etcd-cr 46s
zhaoxia@xzha-mac etcd % date; oc apply -f sub.yaml
Mon May 17 16:11:27 CST 2021
subscription.operators.coreos.com/etcd created
3) monitor installplan status
zhaoxia@xzha-mac etcd % date; oc get ip -o=jsonpath={..status}
Mon May 17 16:12:01 CST 2021
{"bundleLookups":[{"catalogSourceRef":{"name":"etcd-index-1","namespace":"openshift-marketplace"},"conditions":[{"message":"bundle contents have not yet been persisted to installplan status","reason":"BundleNotUnpacked","status":"True","type":"BundleLookupNotPersisted"},{"lastTransitionTime":"2021-05-17T08:11:35Z","message":"unpack job not completed: Unpack pod(openshift-marketplace/4ac4da32fae9695941e732bb422cdc9c4db9dcaf24ee477ad1681f72624zmgw) container(pull) is pending. Reason: ImagePullBackOff, Message: Back-off pulling image \"quay.io/olmqe/etcd-bundle:bug-0.9.2\"","reason":"JobIncomplete","status":"True","type":"BundleLookupPending"}],"identifier":"etcdoperator.v0.9.2","path":"quay.io/olmqe/etcd-bundle:bug-0.9.2","properties":"{\"properties\":[{\"type\":\"olm.gvk\",\"value\":{\"group\":\"etcd.database.coreos.com\",\"kind\":\"EtcdBackup\",\"version\":\"v1beta2\"}},{\"type\":\"olm.gvk\",\"value\":{\"group\":\"etcd.database.coreos.com\",\"kind\":\"EtcdCluster\",\"version\":\"v1beta2\"}},{\"type\":\"olm.gvk\",\"value\":{\"group\":\"etcd.database.coreos.com\",\"kind\":\"EtcdRestore\",\"version\":\"v1beta2\"}},{\"type\":\"olm.package\",\"value\":{\"packageName\":\"etcd\",\"version\":\"0.9.2\"}}]}","replaces":""}],"catalogSources":[],"phase":"Installing"} True True%
zhaoxia@xzha-mac etcd % date; oc get ip -o=jsonpath={..status} -n test-bug
Mon May 17 16:22:07 CST 2021
{"bundleLookups":[{"catalogSourceRef":{"name":"etcd-index-1","namespace":"openshift-marketplace"},"conditions":[{"message":"bundle contents have not yet been persisted to installplan status","reason":"BundleNotUnpacked","status":"True","type":"BundleLookupNotPersisted"},{"lastTransitionTime":"2021-05-17T08:11:35Z","message":"unpack job not completed: Unpack pod(openshift-marketplace/4ac4da32fae9695941e732bb422cdc9c4db9dcaf24ee477ad1681f72624zmgw) container(pull) is pending. Reason: ImagePullBackOff, Message: Back-off pulling image \"quay.io/olmqe/etcd-bundle:bug-0.9.2\"","reason":"JobIncomplete","status":"True","type":"BundleLookupPending"},{"lastTransitionTime":"2021-05-17T08:21:38Z","message":"Job was active longer than specified deadline","reason":"DeadlineExceeded","status":"True","type":"BundleLookupFailed"}],"identifier":"etcdoperator.v0.9.2","path":"quay.io/olmqe/etcd-bundle:bug-0.9.2","properties":"{\"properties\":[{\"type\":\"olm.gvk\",\"value\":{\"group\":\"etcd.database.coreos.com\",\"kind\":\"EtcdBackup\",\"version\":\"v1beta2\"}},{\"type\":\"olm.gvk\",\"value\":{\"group\":\"etcd.database.coreos.com\",\"kind\":\"EtcdCluster\",\"version\":\"v1beta2\"}},{\"type\":\"olm.gvk\",\"value\":{\"group\":\"etcd.database.coreos.com\",\"kind\":\"EtcdRestore\",\"version\":\"v1beta2\"}},{\"type\":\"olm.package\",\"value\":{\"packageName\":\"etcd\",\"version\":\"0.9.2\"}}]}","replaces":""}],"catalogSources":[],"conditions":[{"lastTransitionTime":"2021-05-17T08:21:38Z","lastUpdateTime":"2021-05-17T08:21:38Z","message":"Bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline","reason":"InstallCheckFailed","status":"False","type":"Installed"}],"phase":"Failed"} True True True False%
4) check logs
1660 time="2021-05-17T08:20:22Z" level=info msg=syncing id=fgodz ip=install-cns6w namespace=test-bug phase=Installing
1661 time="2021-05-17T08:20:27Z" level=info msg=syncing id=z6vQR ip=install-cns6w namespace=test-bug phase=Installing
1662 time="2021-05-17T08:20:32Z" level=info msg=syncing id=kv014 ip=install-cns6w namespace=test-bug phase=Installing
1663 time="2021-05-17T08:20:37Z" level=info msg=syncing id=yuhel ip=install-cns6w namespace=test-bug phase=Installing
1664 time="2021-05-17T08:20:42Z" level=info msg=syncing id=zvtO2 ip=install-cns6w namespace=test-bug phase=Installing
1665 time="2021-05-17T08:20:47Z" level=info msg=syncing id=pUWDX ip=install-cns6w namespace=test-bug phase=Installing
1666 time="2021-05-17T08:20:53Z" level=info msg=syncing id=MQfiH ip=install-cns6w namespace=test-bug phase=Installing
1667 time="2021-05-17T08:20:58Z" level=info msg=syncing id=HXytH ip=install-cns6w namespace=test-bug phase=Installing
1668 time="2021-05-17T08:21:03Z" level=info msg=syncing id=uDYZn ip=install-cns6w namespace=test-bug phase=Installing
1669 time="2021-05-17T08:21:08Z" level=info msg=syncing id=uXa15 ip=install-cns6w namespace=test-bug phase=Installing
1670 time="2021-05-17T08:21:13Z" level=info msg=syncing id=Z/6cu ip=install-cns6w namespace=test-bug phase=Installing
1671 time="2021-05-17T08:21:18Z" level=info msg=syncing id=NIRpY ip=install-cns6w namespace=test-bug phase=Installing
1672 time="2021-05-17T08:21:23Z" level=info msg=syncing id=IEeXY ip=install-cns6w namespace=test-bug phase=Installing
1673 time="2021-05-17T08:21:28Z" level=info msg=syncing id=2uqS2 ip=install-cns6w namespace=test-bug phase=Installing
1674 time="2021-05-17T08:21:33Z" level=info msg=syncing id=AE9hO ip=install-cns6w namespace=test-bug phase=Installing
1675 time="2021-05-17T08:21:35Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=qe-app-registry state.State=CONNECTING"
1676 time="2021-05-17T08:21:38Z" level=info msg=syncing id=2mipO ip=install-cns6w namespace=test-bug phase=Installing
1677 time="2021-05-17T08:21:38Z" level=warning msg="status not equal, updating..." id=2mipO ip=install-cns6w namespace=test-bug phase=Installing
1678 time="2021-05-17T08:21:38Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
1679 time="2021-05-17T08:21:38Z" level=info msg=syncing id=U/5h8 ip=install-cns6w namespace=test-bug phase=Installing
1680 time="2021-05-17T08:21:38Z" level=info msg="Bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline" id=U/5h8 ip=install-cns6w namespace=test-b ug phase=Installing
1681 time="2021-05-17T08:21:38Z" level=info msg="transitioning InstallPlan to failed" id=U/5h8 ip=install-cns6w namespace=test-bug phase=Installing
LGTM, verified
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438 |