Bug 1830714 - Unable to update cluster in disconnected environment: UpdatePayloadRetrievalFailed: could not download the update
Summary: Unable to update cluster in disconnected environment: UpdatePayloadRetrievalFailed: could not download the update
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.4
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Lalatendu Mohanty
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-03 11:32 UTC by Ori Michaeli
Modified: 2020-05-11 17:15 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-06 12:01:19 UTC
Target Upstream Version:
Embargoed:


Attachments
cvo-spec (331 bytes, text/plain), 2020-05-06 11:58 UTC, Ori Michaeli
cvo-status-history (589 bytes, text/plain), 2020-05-06 11:58 UTC, Ori Michaeli
must-gather-log (491.34 KB, text/plain), 2020-05-06 11:59 UTC, Ori Michaeli

Description Ori Michaeli 2020-05-03 11:32:44 UTC
Description of problem:

When trying to update OCP in a disconnected environment:
1. The cluster keeps retrying the update indefinitely.
2. The status reported by oc get clusterversion alternates (the full condition messages can be inspected as sketched below):
From: "Working towards internal-registry.qe.devcluster.openshift.com:5000/ocp/release:4.4.0-0.nightly-2020-04-29-160236: downloading update"
To: "Unable to apply internal-registry.qe.devcluster.openshift.com:5000/ocp/release:4.4.0-0.nightly-2020-04-29-160236: could not download the update"

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-04-27-013217

How reproducible:
Every time

Steps to Reproduce:
1. oc adm release mirror \
    --insecure=true \
    -a combined-secret.json \
    --from quay.io/openshift-release-dev/ocp-release-nightly:4.4.0-0.nightly-2020-04-29-160236 \
    --to-release-image registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image:4.4.0-0.nightly-2020-04-29-160236 \
--to registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image

2. vi upgrade-image-policy.yml and paste in the ImageContentSourcePolicy from the output of the previous step:
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: example
spec:
  repositoryDigestMirrors:
  - mirrors:
    - registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image
    source: quay.io/openshift-release-dev/ocp-release-nightly
  - mirrors:
    - registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image
    source: quay.io/openshift-release-dev/ocp-v4.0-art-dev

3. oc create -f upgrade-image-policy.yml

4. oc wait --for=condition=UPDATED --timeout 1800s mcp/master

5. oc wait --for=condition=UPDATED --timeout 1800s mcp/worker

6. oc adm upgrade --to-image internal-registry.qe.devcluster.openshift.com:5000/ocp/release:4.4.0-0.nightly-2020-04-29-160236 --allow-explicit-upgrade --force

7. oc get clusterversion

Actual results:
1. NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-04-27-013217   True        True          5h     Working towards internal-registry.qe.devcluster.openshift.com:5000/ocp/release:4.4.0-0.nightly-2020-04-29-160236: downloading update

2. NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-04-27-013217   True        True          5h     Unable to apply internal-registry.qe.devcluster.openshift.com:5000/ocp/release:4.4.0-0.nightly-2020-04-29-160236: could not download the update



Expected results:
Cluster version updated to 4.4.0-0.nightly-2020-04-29-160236

Additional info:
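For completeness, the mirrored release and the ImageContentSourcePolicy can be sanity-checked before starting the upgrade with something along these lines (a sketch; the registry, tag, and pull-secret file are the ones used in the steps above):

# Confirm the mirrored release is readable from the local registry
oc adm release info --insecure=true -a combined-secret.json \
    registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image:4.4.0-0.nightly-2020-04-29-160236

# Confirm the ImageContentSourcePolicy was created with the expected mirrors
oc get imagecontentsourcepolicy example -o yaml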

Comment 1 W. Trevor King 2020-05-04 20:42:01 UTC
'could not download the update' is the message for the UpdatePayloadRetrievalFailed reason [1].  That reason is set when targetUpdatePayloadDir fails [2].  That function has several possible failure modes though, and only one of them is actually a failure to retrieve the payload [3].  Can you attach your CVO logs and/or a must-gather from when this issue was presenting?  They will hopefully contain the original error instead of a generic simplification.

[1]: https://github.com/openshift/cluster-version-operator/blob/23856901003b95b559087b8e83bffdee82872b2b/pkg/payload/task.go#L176-L177
[2]: https://github.com/openshift/cluster-version-operator/blob/23856901003b95b559087b8e83bffdee82872b2b/pkg/cvo/updatepayload.go#L108-L112
[3]: https://github.com/openshift/cluster-version-operator/blob/23856901003b95b559087b8e83bffdee82872b2b/pkg/cvo/updatepayload.go#L127
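For reference, the logs and status being asked for can usually be collected with something like the following (a sketch; it assumes the default openshift-cluster-version namespace and deployment name):

# CVO logs, which should contain the underlying error behind the generic message
oc -n openshift-cluster-version logs deployment/cluster-version-operator > cvo.log

# Full ClusterVersion spec, status, and history
oc get clusterversion version -o yaml > clusterversion.yaml

# Or collect a complete must-gather
oc adm must-gather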

Comment 2 Ori Michaeli 2020-05-06 11:35:47 UTC
(In reply to W. Trevor King from comment #1)
> 'could not download the update' is the message for the
> UpdatePayloadRetrievalFailed reason [1].  That reason is set when
> targetUpdatePayloadDir fails [2].  That function has several possible
> failure modes though, and only one of them is actually a failure to retrieve
> the payload [3].  Can you attach your CVO logs and/or a must-gather from
> when this issue was presenting?  They will hopefully contain the original
> error instead of a generic simplification.
> 
> [1]:
> https://github.com/openshift/cluster-version-operator/blob/
> 23856901003b95b559087b8e83bffdee82872b2b/pkg/payload/task.go#L176-L177
> [2]:
> https://github.com/openshift/cluster-version-operator/blob/
> 23856901003b95b559087b8e83bffdee82872b2b/pkg/cvo/updatepayload.go#L108-L112
> [3]:
> https://github.com/openshift/cluster-version-operator/blob/
> 23856901003b95b559087b8e83bffdee82872b2b/pkg/cvo/updatepayload.go#L127

I reprovisioned it to get the requested logs.

Comment 3 Ori Michaeli 2020-05-06 11:57:36 UTC
After checking again, this turned out to be a configuration mistake.

However, you would expect it to fail after some time or after a number of attempts.

Closing this as NOTABUG.
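For anyone hitting the same symptom, the mismatch between the requested pullspec and the configured mirrors can usually be spotted by comparing the two directly (a sketch; object names as used above):

# The release image the CVO was asked to move to
oc get clusterversion version -o jsonpath='{.spec.desiredUpdate.image}{"\n"}'

# The mirror configuration actually in place
oc get imagecontentsourcepolicy example -o yaml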

Comment 4 Ori Michaeli 2020-05-06 11:58:14 UTC
Created attachment 1685682 [details]
cvo-spec

Comment 5 Ori Michaeli 2020-05-06 11:58:49 UTC
Created attachment 1685683 [details]
cvo-status-history

Comment 6 Ori Michaeli 2020-05-06 11:59:18 UTC
Created attachment 1685684 [details]
must-gather-log

Comment 7 W. Trevor King 2020-05-06 18:32:51 UTC
> However, you would expect it to fail after some time or after a number of attempts.

How would the CVO distinguish between transients like "there's some network outage, and I'm currently unable to reach the target registry" (where we want it to retry) and "user pointed me at a bogus pullspec and there is never going to be a registry at that domain" (where... I dunno, maybe there would be a point where it was not worth retrying?).  If you do think there is a case where the CVO is retrying but should not be, give some details around it so we can discuss.

Comment 8 Ori Michaeli 2020-05-10 08:06:53 UTC
I think there should be a limit on tries/time, because otherwise it could run endlessly and consume resources.

Should I open a new BZ titled differently?

Comment 9 W. Trevor King 2020-05-11 17:15:24 UTC
> I think there should be a limit on tries/time, because otherwise it could run endlessly and consume resources.

The rest of the CVO is already running endlessly, reconciling manifests to, and reporting on, cluster state.  An extra goroutine and an HTTP(S) request every few minutes is not consuming many additional resources.  If, on the other hand, the CVO gives up after a while and fails to notice a recovered path to the upstream service, it would need a human admin or other external tooling to notice the recovery and kick the CVO to get it to start fetching again, and humans are expensive.  If your CVO is banging its head against the wall and the overhead bothers you, clear spec.channel in your ClusterVersion to tell the CVO not to bother.
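One way to clear spec.channel, if the retry overhead matters, is a JSON patch along these lines (a sketch; it simply removes the field mentioned above):

# Drop spec.channel so the CVO stops polling the upstream update service
oc patch clusterversion version --type=json -p '[{"op":"remove","path":"/spec/channel"}]'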

