Bug 1921264 - Bundle unpack jobs can hang indefinitely
Summary: Bundle unpack jobs can hang indefinitely
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.6
Hardware: Unspecified
OS: Unspecified
high
medium
Target Milestone: ---
: 4.8.0
Assignee: Haseeb Tariq
QA Contact: xzha
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-01-27 19:47 UTC by Ben Luddy
Modified: 2021-07-27 22:37 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The catalog-operator created bundle unpack jobs for an InstallPlan without setting a timeout Consequence: In the case of a non-existent or deleted bundle image, the bundle unpack job would stay running forever and the InstallPlan would stay in the Installing phase with no indication of the bundle unpack Job's pod failing to resolve the image. Fix: The catalog-operator now sets a default 10m timeout on bundle unpack jobs. This can be configured by the "--bundle-unpack-timeout" flag. Result: The bundle unpack Job fails after the configured timeout and the InstallPlan also transitions to phase Failed with the failure reason visible in status.conditions and status.bundleLookups.conditions.
Clone Of:
Environment:
Last Closed: 2021-07-27 22:36:45 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift operator-framework-olm pull 78 0 None open Bug 1921264: Fail InstallPlan on bundle unpack timeout 2021-05-12 19:56:10 UTC
Github operator-framework operator-lifecycle-manager pull 2093 0 None open Bug 1921264: Fail InstallPlan on bundle unpack timeout 2021-04-15 23:10:51 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:37:15 UTC

Description Ben Luddy 2021-01-27 19:47:58 UTC
Description of problem:

If a bundle image reference in an InstallPlan is unpullable, the corresponding bundle unpack job will wait indefinitely because its managed pod never advances beyond Pending. The job should eventually fail if it does not succeed within a limited time period, and the InstallPlan's lookup status should reflect the job failure.

How reproducible: Always

Steps to Reproduce:

1. Create an InstallPlan whose status contains a BundleLookup that references an unpullable image.

Actual results:

The unpack job's pod never starts because it is unable to pull the image, and the job never terminates.

Expected results:

After a reasonable period of time (five or ten minutes?), the unpack job terminates and the InstallPlan status indicates that the bundle lookup has failed.

Comment 1 Kevin Rizza 2021-02-01 15:26:06 UTC
Marking this with upcoming sprint and removing from 4.7.0. This will be addressed in a future sprint.

Comment 2 Kevin Rizza 2021-02-22 18:33:48 UTC
Marking as triaged and reviewed in sprint. We just need to assign this one out.

Comment 3 Haseeb Tariq 2021-04-21 16:48:01 UTC
Marking as reviewed in sprint: We already have a PR in progress that adds a timeout and fails the InstallPlan on bundle unpack failures. We're currently working on covering the case of non-existent bundle images after which the PR should be ready to go.

Comment 4 Haseeb Tariq 2021-04-26 18:09:53 UTC
The linked PR: https://github.com/operator-framework/operator-lifecycle-manager/pull/2093 is only upstream for now so I've unlinked the BZ bot from that so this doesn't go to MODIFIED/ON_QA when that PR merges.

That should happen once I have a corresponding PR on the downstream repo: https://github.com/openshift/operator-framework-olm

Comment 6 Haseeb Tariq 2021-05-12 19:55:26 UTC
Changing to assigned since only the upstream PR was merged.
https://github.com/operator-framework/operator-lifecycle-manager/pull/2093

The BZ should move to POST once the downstream PR is opened.

Comment 8 xzha 2021-05-17 08:48:00 UTC
Verify:

zhaoxia@xzha-mac etcd % oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-15-141455   True        False         5h6m    Cluster version is 4.8.0-0.nightly-2021-05-15-141455
zhaoxia@xzha-mac etcd % oc adm release info registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-05-15-141455 |grep operator-lifecycle-manager         
  operator-lifecycle-manager                     sha256:d03207b3ed3d3713bdd5e3f8cd912f61d91e676d3916da0c3e1957366069dc4d

1) prepare a index image which bundle is unpullable
zhaoxia@xzha-mac etcd % opm alpha bundle build  -c 4.8 -e 4.8 -d ./0.9.2/ -p etcd -t quay.io/olmqe/etcd-bundle:bug-0.9.2 -o=true
INFO[0000] Building annotations.yaml                    
...
 => => naming to quay.io/olmqe/etcd-bundle:bug-0.9.2   
zhaoxia@xzha-mac etcd % docker push quay.io/olmqe/etcd-bundle:bug-0.9.2 
The push refers to repository [quay.io/olmqe/etcd-bundle]
f0911355df3b: Pushed 
e5a10c489bac: Pushed 
bug-0.9.2: digest: sha256:916850868de1b58d9b5ea72e571b071f4e618389395d1fbd2edf1709b36bfd9f size: 732

zhaoxia@xzha-mac etcd % opm index add -b quay.io/olmqe/etcd-bundle:bug-0.9.2 -t quay.io/olmqe/etcd-index:bug-0.9.2 -c docker --mode semver
INFO[0000] building the index                            bundles="[quay.io/olmqe/etcd-bundle:bug-0.9.2]"
INFO[0000] running /usr/local/bin/docker pull quay.io/olmqe/etcd-bundle:bug-0.9.2  bundles="[quay.io/olmqe/etcd-bundle:bug-0.9.2]"
INFO[0005] running docker create                         bundles="[quay.io/olmqe/etcd-bundle:bug-0.9.2]"
INFO[0005] running docker cp                             bundles="[quay.io/olmqe/etcd-bundle:bug-0.9.2]"
INFO[0005] running docker rm                             bundles="[quay.io/olmqe/etcd-bundle:bug-0.9.2]"
INFO[0005] Could not find optional dependencies file     dir=bundle_tmp064132418 file=bundle_tmp064132418/metadata load=annotations
INFO[0005] found csv, loading bundle                     dir=bundle_tmp064132418 file=bundle_tmp064132418/manifests load=bundle
INFO[0005] loading bundle file                           dir=bundle_tmp064132418/manifests file=etcdbackups.etcd.database.coreos.com.crd.yaml load=bundle
INFO[0005] loading bundle file                           dir=bundle_tmp064132418/manifests file=etcdclusters.etcd.database.coreos.com.crd.yaml load=bundle
INFO[0005] loading bundle file                           dir=bundle_tmp064132418/manifests file=etcdoperator.v0.9.2.clusterserviceversion.yaml load=bundle
INFO[0005] loading bundle file                           dir=bundle_tmp064132418/manifests file=etcdrestores.etcd.database.coreos.com.crd.yaml load=bundle
INFO[0005] Generating dockerfile                         bundles="[quay.io/olmqe/etcd-bundle:bug-0.9.2]"
INFO[0005] writing dockerfile: index.Dockerfile695761296  bundles="[quay.io/olmqe/etcd-bundle:bug-0.9.2]"
INFO[0005] running docker build                          bundles="[quay.io/olmqe/etcd-bundle:bug-0.9.2]"
INFO[0005] [docker build -f index.Dockerfile695761296 -t quay.io/olmqe/etcd-index:bug-0.9.2 .]  bundles="[quay.io/olmqe/etcd-bundle:bug-0.9.2]"
zhaoxia@xzha-mac etcd % docker push quay.io/olmqe/etcd-index:bug-0.9.2
The push refers to repository [quay.io/olmqe/etcd-index]
3c9814547744: Pushed 
87d838f78e87: Mounted from olmqe/ditto-index 
e9033f602913: Mounted from olmqe/ditto-index 
479bfeeb0448: Layer already exists 
71b397855bc4: Layer already exists 
b2d5eeeaba3a: Layer already exists 
bug-0.9.2: digest: sha256:b8fe75aae745d3d69e76bcef1eb1e7e27181006d217cad46d81e3df907f17d1c size: 1578

Delete quay.io/olmqe/etcd-bundle:bug-0.9.2 using quay.io GUI

2) create catsrc using this index image, create sub
zhaoxia@xzha-mac etcd % cat catsrc.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: etcd-index-1
  namespace: openshift-marketplace
spec:
  displayName: etcd-cr
  publisher: testqe
  sourceType: grpc
  image: quay.io/olmqe/etcd-index:bug-0.9.2
  updateStrategy:
    registryPoll:
      interval: 2m

zhaoxia@xzha-mac etcd % cat sub.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd
  namespace: test-bug
spec:
  channel: "4.8"
  installPlanApproval: Automatic
  name: etcd
  source: etcd-index-1
  sourceNamespace: openshift-marketplace
zhaoxia@xzha-mac etcd % oc apply -f catsrc.yaml 
catalogsource.operators.coreos.com/etcd-index-1 created
zhaoxia@xzha-mac etcd % oc get packagemanifests | grep etcd
etcd                                                 Community Operators   5h13m
etcd                                                 etcd-cr               46s

zhaoxia@xzha-mac etcd % date; oc apply -f sub.yaml 
Mon May 17 16:11:27 CST 2021
subscription.operators.coreos.com/etcd created


3) monitor installplan status
zhaoxia@xzha-mac etcd % date; oc get ip -o=jsonpath={..status}
Mon May 17 16:12:01 CST 2021
{"bundleLookups":[{"catalogSourceRef":{"name":"etcd-index-1","namespace":"openshift-marketplace"},"conditions":[{"message":"bundle contents have not yet been persisted to installplan status","reason":"BundleNotUnpacked","status":"True","type":"BundleLookupNotPersisted"},{"lastTransitionTime":"2021-05-17T08:11:35Z","message":"unpack job not completed: Unpack pod(openshift-marketplace/4ac4da32fae9695941e732bb422cdc9c4db9dcaf24ee477ad1681f72624zmgw) container(pull) is pending. Reason: ImagePullBackOff, Message: Back-off pulling image \"quay.io/olmqe/etcd-bundle:bug-0.9.2\"","reason":"JobIncomplete","status":"True","type":"BundleLookupPending"}],"identifier":"etcdoperator.v0.9.2","path":"quay.io/olmqe/etcd-bundle:bug-0.9.2","properties":"{\"properties\":[{\"type\":\"olm.gvk\",\"value\":{\"group\":\"etcd.database.coreos.com\",\"kind\":\"EtcdBackup\",\"version\":\"v1beta2\"}},{\"type\":\"olm.gvk\",\"value\":{\"group\":\"etcd.database.coreos.com\",\"kind\":\"EtcdCluster\",\"version\":\"v1beta2\"}},{\"type\":\"olm.gvk\",\"value\":{\"group\":\"etcd.database.coreos.com\",\"kind\":\"EtcdRestore\",\"version\":\"v1beta2\"}},{\"type\":\"olm.package\",\"value\":{\"packageName\":\"etcd\",\"version\":\"0.9.2\"}}]}","replaces":""}],"catalogSources":[],"phase":"Installing"} True True% 

zhaoxia@xzha-mac etcd % date; oc get ip -o=jsonpath={..status} -n test-bug
Mon May 17 16:22:07 CST 2021
{"bundleLookups":[{"catalogSourceRef":{"name":"etcd-index-1","namespace":"openshift-marketplace"},"conditions":[{"message":"bundle contents have not yet been persisted to installplan status","reason":"BundleNotUnpacked","status":"True","type":"BundleLookupNotPersisted"},{"lastTransitionTime":"2021-05-17T08:11:35Z","message":"unpack job not completed: Unpack pod(openshift-marketplace/4ac4da32fae9695941e732bb422cdc9c4db9dcaf24ee477ad1681f72624zmgw) container(pull) is pending. Reason: ImagePullBackOff, Message: Back-off pulling image \"quay.io/olmqe/etcd-bundle:bug-0.9.2\"","reason":"JobIncomplete","status":"True","type":"BundleLookupPending"},{"lastTransitionTime":"2021-05-17T08:21:38Z","message":"Job was active longer than specified deadline","reason":"DeadlineExceeded","status":"True","type":"BundleLookupFailed"}],"identifier":"etcdoperator.v0.9.2","path":"quay.io/olmqe/etcd-bundle:bug-0.9.2","properties":"{\"properties\":[{\"type\":\"olm.gvk\",\"value\":{\"group\":\"etcd.database.coreos.com\",\"kind\":\"EtcdBackup\",\"version\":\"v1beta2\"}},{\"type\":\"olm.gvk\",\"value\":{\"group\":\"etcd.database.coreos.com\",\"kind\":\"EtcdCluster\",\"version\":\"v1beta2\"}},{\"type\":\"olm.gvk\",\"value\":{\"group\":\"etcd.database.coreos.com\",\"kind\":\"EtcdRestore\",\"version\":\"v1beta2\"}},{\"type\":\"olm.package\",\"value\":{\"packageName\":\"etcd\",\"version\":\"0.9.2\"}}]}","replaces":""}],"catalogSources":[],"conditions":[{"lastTransitionTime":"2021-05-17T08:21:38Z","lastUpdateTime":"2021-05-17T08:21:38Z","message":"Bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline","reason":"InstallCheckFailed","status":"False","type":"Installed"}],"phase":"Failed"} True True True False%   

4) check logs
1660 time="2021-05-17T08:20:22Z" level=info msg=syncing id=fgodz ip=install-cns6w namespace=test-bug phase=Installing
1661 time="2021-05-17T08:20:27Z" level=info msg=syncing id=z6vQR ip=install-cns6w namespace=test-bug phase=Installing
1662 time="2021-05-17T08:20:32Z" level=info msg=syncing id=kv014 ip=install-cns6w namespace=test-bug phase=Installing
1663 time="2021-05-17T08:20:37Z" level=info msg=syncing id=yuhel ip=install-cns6w namespace=test-bug phase=Installing
1664 time="2021-05-17T08:20:42Z" level=info msg=syncing id=zvtO2 ip=install-cns6w namespace=test-bug phase=Installing
1665 time="2021-05-17T08:20:47Z" level=info msg=syncing id=pUWDX ip=install-cns6w namespace=test-bug phase=Installing
1666 time="2021-05-17T08:20:53Z" level=info msg=syncing id=MQfiH ip=install-cns6w namespace=test-bug phase=Installing
1667 time="2021-05-17T08:20:58Z" level=info msg=syncing id=HXytH ip=install-cns6w namespace=test-bug phase=Installing
1668 time="2021-05-17T08:21:03Z" level=info msg=syncing id=uDYZn ip=install-cns6w namespace=test-bug phase=Installing
1669 time="2021-05-17T08:21:08Z" level=info msg=syncing id=uXa15 ip=install-cns6w namespace=test-bug phase=Installing
1670 time="2021-05-17T08:21:13Z" level=info msg=syncing id=Z/6cu ip=install-cns6w namespace=test-bug phase=Installing
1671 time="2021-05-17T08:21:18Z" level=info msg=syncing id=NIRpY ip=install-cns6w namespace=test-bug phase=Installing
1672 time="2021-05-17T08:21:23Z" level=info msg=syncing id=IEeXY ip=install-cns6w namespace=test-bug phase=Installing
1673 time="2021-05-17T08:21:28Z" level=info msg=syncing id=2uqS2 ip=install-cns6w namespace=test-bug phase=Installing
1674 time="2021-05-17T08:21:33Z" level=info msg=syncing id=AE9hO ip=install-cns6w namespace=test-bug phase=Installing
1675 time="2021-05-17T08:21:35Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=qe-app-registry state.State=CONNECTING"
1676 time="2021-05-17T08:21:38Z" level=info msg=syncing id=2mipO ip=install-cns6w namespace=test-bug phase=Installing
1677 time="2021-05-17T08:21:38Z" level=warning msg="status not equal, updating..." id=2mipO ip=install-cns6w namespace=test-bug phase=Installing
1678 time="2021-05-17T08:21:38Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
1679 time="2021-05-17T08:21:38Z" level=info msg=syncing id=U/5h8 ip=install-cns6w namespace=test-bug phase=Installing
1680 time="2021-05-17T08:21:38Z" level=info msg="Bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline" id=U/5h8 ip=install-cns6w namespace=test-b     ug phase=Installing
1681 time="2021-05-17T08:21:38Z" level=info msg="transitioning InstallPlan to failed" id=U/5h8 ip=install-cns6w namespace=test-bug phase=Installing


LGTM, verified

Comment 11 errata-xmlrpc 2021-07-27 22:36:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438


Note You need to log in before you can comment on or make changes to this bug.