Bug 1830714

Summary: Unable to update cluster in disconnected environment: UpdatePayloadRetrievalFailed: could not download the update
Product: OpenShift Container Platform
Component: Cluster Version Operator
Version: 4.4
Target Milestone: ---
Target Release: 4.5.0
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: medium
Reporter: Ori Michaeli <omichael>
Assignee: Lalatendu Mohanty <lmohanty>
QA Contact: liujia <jiajliu>
CC: aos-bugs, jokerman, omichael, ssmolyak, wking
Last Closed: 2020-05-06 12:01:19 UTC
Type: Bug
Attachments:
  cvo-spec (flags: none)
  cvo-status-history (flags: none)
  must-gather-log (flags: none)

Description Ori Michaeli 2020-05-03 11:32:44 UTC
Description of problem:

When trying to update OCP in a disconnected environment:
1. The cluster keeps retrying the update indefinitely.
2. The status reported by oc get clusterversion switches
   from: "Working towards internal-registry.qe.devcluster.openshift.com:5000/ocp/release:4.4.0-0.nightly-2020-04-29-160236: downloading update"
   to: "Unable to apply internal-registry.qe.devcluster.openshift.com:5000/ocp/release:4.4.0-0.nightly-2020-04-29-160236: could not download the update"

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-04-27-013217

How reproducible:
Every time

Steps to Reproduce:
1. oc adm release mirror \
    --insecure=true \
    -a combined-secret.json \
    --from quay.io/openshift-release-dev/ocp-release-nightly:4.4.0-0.nightly-2020-04-29-160236 \
    --to-release-image registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image:4.4.0-0.nightly-2020-04-29-160236 \
    --to registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image

2. Create upgrade-image-policy.yml and paste in the ImageContentSourcePolicy from the output of the previous step:
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: example
spec:
  repositoryDigestMirrors:
  - mirrors:
    - registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image
    source: quay.io/openshift-release-dev/ocp-release-nightly
  - mirrors:
    - registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image
    source: quay.io/openshift-release-dev/ocp-v4.0-art-dev

3. oc create -f upgrade-image-policy.yml

4. oc wait --for=condition=UPDATED --timeout 1800s mcp/master

5. oc wait --for=condition=UPDATED --timeout 1800s mcp/worker

6. oc adm upgrade --to-image internal-registry.qe.devcluster.openshift.com:5000/ocp/release:4.4.0-0.nightly-2020-04-29-160236 --allow-explicit-upgrade --force

7. oc get clusterversion

Actual results:
1. NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-04-27-013217   True        True          5h     Working towards internal-registry.qe.devcluster.openshift.com:5000/ocp/release:4.4.0-0.nightly-2020-04-29-160236: downloading update

2. NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-04-27-013217   True        True          5h     Unable to apply internal-registry.qe.devcluster.openshift.com:5000/ocp/release:4.4.0-0.nightly-2020-04-29-160236: could not download the update
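The same messages come straight from the Progressing and Failing conditions on the ClusterVersion object; a minimal way to read them directly (assuming the default object name "version") is:

oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Progressing")].message}{"\n"}'
oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Failing")].message}{"\n"}'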



Expected results:
Cluster version updated to 4.4.0-0.nightly-2020-04-29-160236

Additional info:
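One sanity check worth adding here (a sketch, not part of the original report; <node-name> is a placeholder, while the ICSP name "example" and the pull-secret path come from the steps above) is to confirm that the ImageContentSourcePolicy actually landed on the nodes and that the mirrored release image is readable before requesting the upgrade:

oc get imagecontentsourcepolicy example -o yaml
oc debug node/<node-name> -- chroot /host cat /etc/containers/registries.conf
oc adm release info --insecure=true -a combined-secret.json registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image:4.4.0-0.nightly-2020-04-29-160236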

Comment 1 W. Trevor King 2020-05-04 20:42:01 UTC
'could not download the update' is the message for the UpdatePayloadRetrievalFailed reason [1].  That reason is set when targetUpdatePayloadDir fails [2].  That function has several possible failure modes though, and only one of them is actually a failure to retrieve the payload [3].  Can you attach your CVO logs and/or a must-gather from when this issue was presenting?  They will hopefully contain the original error instead of a generic simplification.

[1]: https://github.com/openshift/cluster-version-operator/blob/23856901003b95b559087b8e83bffdee82872b2b/pkg/payload/task.go#L176-L177
[2]: https://github.com/openshift/cluster-version-operator/blob/23856901003b95b559087b8e83bffdee82872b2b/pkg/cvo/updatepayload.go#L108-L112
[3]: https://github.com/openshift/cluster-version-operator/blob/23856901003b95b559087b8e83bffdee82872b2b/pkg/cvo/updatepayload.go#L127
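(For reference, the requested data can be collected with something like the commands below; the namespace and deployment names are the standard ones for the cluster-version operator, and the destination directory is arbitrary.  If the failure really is a pull failure, the payload-download pod the CVO spawns in that namespace may also be sitting in ImagePullBackOff.)

oc -n openshift-cluster-version get pods
oc -n openshift-cluster-version logs deployment/cluster-version-operator > cvo.log
oc adm must-gather --dest-dir=./must-gather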

Comment 2 Ori Michaeli 2020-05-06 11:35:47 UTC
(In reply to W. Trevor King from comment #1)
> 'could not download the update' is the message for the
> UpdatePayloadRetrievalFailed reason [1].  That reason is set when
> targetUpdatePayloadDir fails [2].  That function has several possible
> failure modes though, and only one of them is actually a failure to retrieve
> the payload [3].  Can you attach your CVO logs and/or a must-gather from
> when this issue was presenting?  They will hopefully contain the original
> error instead of a generic simplification.
> 
> [1]: https://github.com/openshift/cluster-version-operator/blob/23856901003b95b559087b8e83bffdee82872b2b/pkg/payload/task.go#L176-L177
> [2]: https://github.com/openshift/cluster-version-operator/blob/23856901003b95b559087b8e83bffdee82872b2b/pkg/cvo/updatepayload.go#L108-L112
> [3]: https://github.com/openshift/cluster-version-operator/blob/23856901003b95b559087b8e83bffdee82872b2b/pkg/cvo/updatepayload.go#L127

I reprovisioned it to get the requested logs.

Comment 3 Ori Michaeli 2020-05-06 11:57:36 UTC
After checking again, it turned out to be a configuration mistake.

That said, you would expect it to fail after some time or after a number of attempts.

Closing this as NOTABUG.

Comment 4 Ori Michaeli 2020-05-06 11:58:14 UTC
Created attachment 1685682 [details]
cvo-spec

Comment 5 Ori Michaeli 2020-05-06 11:58:49 UTC
Created attachment 1685683 [details]
cvo-status-history

Comment 6 Ori Michaeli 2020-05-06 11:59:18 UTC
Created attachment 1685684 [details]
must-gather-log

Comment 7 W. Trevor King 2020-05-06 18:32:51 UTC
> That said, you would expect it to fail after some time or after a number of attempts.

How would the CVO distinguish between transients like "there's some network outage, and I'm currently unable to reach the target registry" (where we want it to retry) and "user pointed me at a bogus pullspec and there is never going to be a registry at that domain" (where... I dunno, maybe there would be a point where it was not worth retrying?).  If you do think there is a case where the CVO is retrying but should not be, give some details around it so we can discuss.
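(Tangentially, if a requested update does turn out to point at a bogus pullspec, the admin can abandon it rather than wait for the CVO to give up; a minimal example:

oc adm upgrade --clear

which clears the requested update from the ClusterVersion spec so the retries stop.)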

Comment 8 Ori Michaeli 2020-05-10 08:06:53 UTC
I think there should be a limit on retries or time, because otherwise it could run endlessly and consume resources.

Should I open a new BZ titled differently?

Comment 9 W. Trevor King 2020-05-11 17:15:24 UTC
> I think there should be a limit on retries or time, because otherwise it could run endlessly and consume resources.

The rest of the CVO is already running endlessly, reconciling manifests and reporting on cluster state.  An extra goroutine and an HTTP(S) request every few minutes is not consuming many additional resources.  If, on the other hand, the CVO gives up after a while and fails to notice a recovered path to the upstream service, it would need a human admin or other external tooling to notice the recovery and kick the CVO to get it to start fetching again, and humans are expensive.  If your CVO is banging its head against the wall and the overhead bothers you, clear spec.channel in your ClusterVersion to tell the CVO not to bother.
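(A minimal sketch of that spec.channel-clearing suggestion, assuming the default ClusterVersion object named "version":

oc patch clusterversion version --type json -p '[{"op": "remove", "path": "/spec/channel"}]'

Setting a channel again later turns update retrieval back on.)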