Description of problem:
When trying to update OCP in a disconnected environment:
1. It keeps on trying to update the version
2. The status reported by `oc get clusterversion` switches
From:
"Working towards internal-registry.qe.devcluster.openshift.com:5000/ocp/release:4.4.0-0.nightly-2020-04-29-160236: downloading update"
To:
"Unable to apply internal-registry.qe.devcluster.openshift.com:5000/ocp/release:4.4.0-0.nightly-2020-04-29-160236: could not download the update"

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-04-27-013217

How reproducible:
Every time

Steps to Reproduce:
1. oc adm release mirror \
     --insecure=true \
     -a combined-secret.json \
     --from quay.io/openshift-release-dev/ocp-release-nightly:4.4.0-0.nightly-2020-04-29-160236 \
     --to-release-image registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image:4.4.0-0.nightly-2020-04-29-160236 \
     --to registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image

2. vi upgrade-image-policy.yml and paste in the ImageContentSourcePolicy from the output of the previous step:

   apiVersion: operator.openshift.io/v1alpha1
   kind: ImageContentSourcePolicy
   metadata:
     name: example
   spec:
     repositoryDigestMirrors:
     - mirrors:
       - registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image
       source: quay.io/openshift-release-dev/ocp-release-nightly
     - mirrors:
       - registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image
       source: quay.io/openshift-release-dev/ocp-v4.0-art-dev

3. oc create -f upgrade-image-policy.yml

4. oc wait --for=condition=UPDATED --timeout 1800s mcp/master

5. oc wait --for=condition=UPDATED --timeout 1800s mcp/worker

6. oc adm upgrade --to-image internal-registry.qe.devcluster.openshift.com:5000/ocp/release:4.4.0-0.nightly-2020-04-29-160236 --allow-explicit-upgrade --force

7. oc get clusterversion

Actual results:
1.
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-04-27-013217   True        True          5h      Working towards internal-registry.qe.devcluster.openshift.com:5000/ocp/release:4.4.0-0.nightly-2020-04-29-160236: downloading update

2.
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-04-27-013217   True        True          5h      Unable to apply internal-registry.qe.devcluster.openshift.com:5000/ocp/release:4.4.0-0.nightly-2020-04-29-160236: could not download the update

Expected results:
Cluster version updated to 4.4.0-0.nightly-2020-04-29-160236

Additional info:
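For reference, the STATUS column only shows a one-line summary; a minimal sketch of pulling more detail out of the ClusterVersion object (standard oc commands, using the default object name 'version'):

  # Dump every condition's type/status/message; the Failing and RetrievedUpdates
  # conditions usually carry more detail than the one-line STATUS summary
  oc get clusterversion version -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'

  # Or dump the whole object and read status.conditions / status.history
  oc get clusterversion version -o yaml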
'could not download the update' is the message for the UpdatePayloadRetrievalFailed reason [1]. That reason is set when targetUpdatePayloadDir fails [2]. That function has several possible failure modes though, and only one of them is actually a failure to retrieve the payload [3]. Can you attach your CVO logs and/or a must-gather from when this issue was presenting? They will hopefully contain the original error instead of a generic simplification.

[1]: https://github.com/openshift/cluster-version-operator/blob/23856901003b95b559087b8e83bffdee82872b2b/pkg/payload/task.go#L176-L177
[2]: https://github.com/openshift/cluster-version-operator/blob/23856901003b95b559087b8e83bffdee82872b2b/pkg/cvo/updatepayload.go#L108-L112
[3]: https://github.com/openshift/cluster-version-operator/blob/23856901003b95b559087b8e83bffdee82872b2b/pkg/cvo/updatepayload.go#L127
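In case it's unclear where to grab those from, something along these lines should work (default names/paths, adjust as needed):

  # CVO logs from the running deployment
  oc -n openshift-cluster-version logs deployment/cluster-version-operator > cvo.log

  # Full must-gather, written to a local directory
  oc adm must-gather --dest-dir=./must-gather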
(In reply to W. Trevor King from comment #1)
> Can you attach your CVO logs and/or a must-gather from when this issue was
> presenting? They will hopefully contain the original error instead of a
> generic simplification.

I reprovisioned it to get the requested logs.
After checking again, it turned out to be a configuration mistake. That said, you would expect it to fail after some time or after a number of attempts. Closing this as NOTABUG.
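For anyone landing here later: the mistake appears to be that step 6 points oc adm upgrade at internal-registry.qe.devcluster.openshift.com, while the release was actually mirrored to registry.ocp-edge-cluster-0.qe.lab.redhat.com in step 1. A rough sketch of what the corrected command would look like (the digest is a placeholder; use the by-digest pullspec printed by oc adm release mirror, since the ICSP mirrors apply by digest):

  oc adm upgrade \
    --to-image registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image@sha256:<digest-from-mirror-output> \
    --allow-explicit-upgrade --force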
Created attachment 1685682 [details] cvo-spec
Created attachment 1685683 [details] cvo-status-history
Created attachment 1685684 [details] must-gather-log
> That said, you would expect it to fail after some time or after a number of attempts.

How would the CVO distinguish between transients like "there's some network outage, and I'm currently unable to reach the target registry" (where we want it to retry) and "user pointed me at a bogus pullspec and there is never going to be a registry at that domain" (where... I dunno, maybe there would be a point where it was not worth retrying?). If you do think there is a case where the CVO is retrying but should not be, give some details around it so we can discuss.
I think there should be a limit on retries or time, because otherwise it could run endlessly and consume resources. Should I open a new BZ titled differently?
> I think there should be a limit on retries or time, because otherwise it could run endlessly and consume resources.

The rest of the CVO is already running endlessly, reconciling manifests and reporting on cluster state. An extra goroutine and an HTTP(S) request every few minutes are not consuming many additional resources. If, on the other hand, the CVO gives up after a while and fails to notice a recovered path to the upstream service, it would need a human admin or other external tooling to notice the recovery and kick the CVO to get it to start fetching again, and humans are expensive. If your CVO is banging its head against the wall and the overhead bothers you, clear spec.channel in your ClusterVersion to tell the CVO not to bother.
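A minimal sketch of clearing it, assuming the default ClusterVersion object name 'version':

  # Remove spec.channel so the CVO stops polling the upstream update service
  oc patch clusterversion version --type json -p '[{"op": "remove", "path": "/spec/channel"}]'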