Bug 1984829 - InstallPlan does not recover after not being able to pull the image
Summary: InstallPlan does not recover after not being able to pull the image
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Kevin Rizza
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-07-22 10:07 UTC by Marius Cornea
Modified: 2021-09-22 16:35 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-22 12:54:01 UTC
Target Upstream Version:
Embargoed:



Description Marius Cornea 2021-07-22 10:07:00 UTC
Description of problem:

This issue shows up in a disconnected environment that initially had no ImageContentSourcePolicy (ICSP) containing the mirror for operator images. The InstallPlan initially shows the following failure:

lastTransitionTime: "2021-07-21T18:26:32Z"
message: 'unpack job not completed: Unpack pod(openshift-marketplace/8448a620ab041469e30d9ce22dc6be76a624b8834ad8af66f50105cd73kxk6m)
  container(pull) is pending. Reason: ImagePullBackOff, Message: Back-off
  pulling image "registry-proxy.engineering.redhat.com/rh-osbs/openshift-ose-sriov-network-operator-bundle@sha256:9d93eb2a6f2cf7ba466784f72cf782e8c99921b43704f06dd493364aed95ace7"'
reason: JobIncomplete
status: "True"
type: BundleLookupPending

After creating the ImageContentSourcePolicy at 2021-07-21T18:27:00Z including the correct mirror for registry-proxy.engineering.redhat.com/rh-osbs/openshift-ose-sriov-network-operator-bundle@sha256:9d93eb2a6f2cf7ba466784f72cf782e8c99921b43704f06dd493364aed95ace7 :

apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  creationTimestamp: "2021-07-21T18:27:00Z"
  generation: 1
  name: redhat-internal-icsp
  resourceVersion: "2182653"
  uid: 550a03f4-5381-4014-b5aa-273adb787da2
spec:
  repositoryDigestMirrors:
  - mirrors:
    - registry.kni-qe-0.lab.eng.rdu2.redhat.com:5000
    source: registry.redhat.io
  - mirrors:
    - registry.kni-qe-0.lab.eng.rdu2.redhat.com:5000
    source: registry-proxy.engineering.redhat.com
  - mirrors:
    - registry.kni-qe-0.lab.eng.rdu2.redhat.com:5000
    source: registry.stage.redhat.io
  - mirrors:
    - registry.kni-qe-0.lab.eng.rdu2.redhat.com:5000/localimages/local-release-image
    source: registry.ci.openshift.org/ocp/release


the InstallPlan doesn't progress and eventually shows 'Job was active longer than specified deadline' at 2021-07-21T18:41:45Z:

apiVersion: v1
items:
- apiVersion: operators.coreos.com/v1alpha1
  kind: InstallPlan
  metadata:
    creationTimestamp: "2021-07-21T18:26:32Z"
    generateName: install-
    generation: 1
    labels:
      operators.coreos.com/sriov-network-operator.openshift-sriov-network-operator: ""
    name: install-449sn
    namespace: openshift-sriov-network-operator
    ownerReferences:
    - apiVersion: operators.coreos.com/v1alpha1
      blockOwnerDeletion: false
      controller: false
      kind: Subscription
      name: sriov-network-operator-subscription
      uid: 2d61735d-22f1-444e-91e5-d343f4fd6c12
    resourceVersion: "2188022"
    uid: 142a8acf-53c9-45dc-a709-7e3ca7c75efe
  spec:
    approval: Automatic
    approved: true
    clusterServiceVersionNames:
    - sriov-network-operator.4.8.0-202107081650
    generation: 1
  status:
    bundleLookups:
    - catalogSourceRef:
        name: sriov-network-operator
        namespace: openshift-marketplace
      conditions:
      - message: bundle contents have not yet been persisted to installplan status
        reason: BundleNotUnpacked
        status: "True"
        type: BundleLookupNotPersisted
      - lastTransitionTime: "2021-07-21T18:26:32Z"
        message: 'unpack job not completed: Unpack pod(openshift-marketplace/8448a620ab041469e30d9ce22dc6be76a624b8834ad8af66f50105cd73kxk6m)
          container(pull) is pending. Reason: ImagePullBackOff, Message: Back-off
          pulling image "registry-proxy.engineering.redhat.com/rh-osbs/openshift-ose-sriov-network-operator-bundle@sha256:9d93eb2a6f2cf7ba466784f72cf782e8c99921b43704f06dd493364aed95ace7"'
        reason: JobIncomplete
        status: "True"
        type: BundleLookupPending
      - lastTransitionTime: "2021-07-21T18:41:45Z"
        message: Job was active longer than specified deadline
        reason: DeadlineExceeded
        status: "True"
        type: BundleLookupFailed
      identifier: sriov-network-operator.4.8.0-202107081650
      path: registry-proxy.engineering.redhat.com/rh-osbs/openshift-ose-sriov-network-operator-bundle@sha256:9d93eb2a6f2cf7ba466784f72cf782e8c99921b43704f06dd493364aed95ace7
      properties: '{"properties":[{"type":"olm.gvk","value":{"group":"sriovnetwork.openshift.io","kind":"SriovIBNetwork","version":"v1"}},{"type":"olm.gvk","value":{"group":"sriovnetwork.openshift.io","kind":"SriovNetwork","version":"v1"}},{"type":"olm.gvk","value":{"group":"sriovnetwork.openshift.io","kind":"SriovNetworkNodePolicy","version":"v1"}},{"type":"olm.gvk","value":{"group":"sriovnetwork.openshift.io","kind":"SriovNetworkNodeState","version":"v1"}},{"type":"olm.gvk","value":{"group":"sriovnetwork.openshift.io","kind":"SriovOperatorConfig","version":"v1"}},{"type":"olm.package","value":{"packageName":"sriov-network-operator","version":"4.8.0-202107081650"}}]}'
      replaces: ""
    catalogSources: []
    conditions:
    - lastTransitionTime: "2021-07-21T18:41:47Z"
      lastUpdateTime: "2021-07-21T18:41:47Z"
      message: 'Bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job
        was active longer than specified deadline'
      reason: InstallCheckFailed
      status: "False"
      type: Installed
    phase: Failed
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
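
For reference, the timestamps quoted above show how long the bundle lookup stayed pending before it was marked failed. The following is a minimal Python sketch of that arithmetic, using only values copied from this report; the helper name parse_ts is illustrative, and the result is the observed interval, not necessarily the configured job deadline.

```python
from datetime import datetime, timezone


def parse_ts(ts: str) -> datetime:
    """Parse a UTC timestamp in the form used above, e.g. 2021-07-21T18:26:32Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


# Timestamps copied from the InstallPlan and ICSP output in this report.
install_plan_created = parse_ts("2021-07-21T18:26:32Z")  # BundleLookupPending
icsp_created = parse_ts("2021-07-21T18:27:00Z")          # ICSP creationTimestamp
deadline_exceeded = parse_ts("2021-07-21T18:41:45Z")     # BundleLookupFailed

total = deadline_exceeded - install_plan_created
after_icsp = deadline_exceeded - icsp_created

print(f"pending for {total.total_seconds():.0f}s in total")             # 913s
print(f"{after_icsp.total_seconds():.0f}s of that after ICSP creation")  # 885s
```

In other words, the lookup was marked failed about 15 minutes after the InstallPlan was created, and almost all of that time elapsed after the ICSP had been created.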


Version-Release number of selected component (if applicable):
4.8.0-rc.3

How reproducible:
100%

Steps to Reproduce:

1. Create a CatalogSource and a Subscription that generate an InstallPlan which tries to pull an unreachable bundle image (in this case, because no ICSP with the correct mirror existed).

2. Make the bundle image reachable (here, by creating the ICSP with the correct mirror).

Actual results:

The InstallPlan doesn't progress and stops at:

      lastUpdateTime: "2021-07-21T18:41:47Z"
      message: 'Bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job
        was active longer than specified deadline'
      reason: InstallCheckFailed


Expected results:

The InstallPlan retries pulling the bundle image once it becomes reachable.

Additional info:

To get it progressing, I had to delete the existing CatalogSource pod and the InstallPlan.
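
The workaround above could be sketched as follows. This Python snippet only builds and prints the oc commands; it does not run them. The resource names are taken from the InstallPlan output in this report. The helper name workaround_commands is hypothetical, and selecting the catalog pod by the olm.catalogSource label is an assumption about how the pod is labeled, so verify it on the cluster (for example with oc get pods --show-labels) before relying on it.

```python
import shlex


def workaround_commands(install_plan, install_plan_ns, catalog_source, catalog_ns):
    """Build (without executing) the oc commands for the manual workaround."""
    return [
        # Assumption: the catalog pod carries the label olm.catalogSource=<name>.
        ["oc", "delete", "pod", "-n", catalog_ns,
         "-l", f"olm.catalogSource={catalog_source}"],
        # Delete the failed InstallPlan named in the report.
        ["oc", "delete", "installplan", install_plan, "-n", install_plan_ns],
    ]


commands = workaround_commands(
    install_plan="install-449sn",
    install_plan_ns="openshift-sriov-network-operator",
    catalog_source="sriov-network-operator",
    catalog_ns="openshift-marketplace",
)
for cmd in commands:
    print(shlex.join(cmd))
```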

Comment 1 Kevin Rizza 2021-07-22 12:54:01 UTC
Hi Marius,

This is actually expected behavior. InstallPlans are not a declarative resource; they define an execution that runs on a cluster (similar to a Job). They perform a limited number of retries in certain specific cases, but once they exceed their retry limit (on the order of seconds), they go into a permanent failed state, and there are some failures that they cannot recover from.
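
As an illustration of the "permanent failed state" described here, the sketch below checks an InstallPlan's status for that state using the fields shown in the report (phase: Failed plus an Installed condition with status "False"). It is a minimal example operating on a plain dict, for instance one loaded from oc get installplan -o json; the function name terminal_failure is hypothetical and not part of OLM.

```python
from typing import Optional, Tuple


def terminal_failure(install_plan: dict) -> Optional[Tuple[str, str]]:
    """Return (reason, message) if the InstallPlan is in phase Failed, else None."""
    status = install_plan.get("status", {})
    if status.get("phase") != "Failed":
        return None
    for cond in status.get("conditions", []):
        if cond.get("type") == "Installed" and cond.get("status") == "False":
            return cond.get("reason", ""), cond.get("message", "")
    # Phase is Failed but no matching condition was recorded.
    return "Unknown", ""


# Status abbreviated from the InstallPlan in this report.
failed_plan = {
    "status": {
        "phase": "Failed",
        "conditions": [{
            "type": "Installed",
            "status": "False",
            "reason": "InstallCheckFailed",
            "message": "Bundle unpacking failed. Reason: DeadlineExceeded, "
                       "and Message: Job was active longer than specified deadline",
        }],
    }
}

result = terminal_failure(failed_plan)
print(result[0])  # InstallCheckFailed
```

A plan that this check reports as failed will not retry on its own; per the report above, it had to be deleted to get the installation progressing again.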

