Bug 1830714

Summary: Unable to update cluster in disconnected environment: UpdatePayloadRetrievalFailed: could not download the update
Product: OpenShift Container Platform
Component: Cluster Version Operator
Version: 4.4
Target Milestone: ---
Target Release: 4.5.0
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: medium
Reporter: Ori Michaeli <omichael>
Assignee: Lalatendu Mohanty <lmohanty>
QA Contact: liujia <jiajliu>
CC: aos-bugs, jokerman, omichael, ssmolyak, wking
Last Closed: 2020-05-06 12:01:19 UTC
Type: Bug
Attachments:
  cvo-spec (flags: none)
  cvo-status-history (flags: none)
  must-gather-log (flags: none)

Description Ori Michaeli 2020-05-03 11:32:44 UTC
Description of problem:

When trying to update OCP in a disconnected environment:
1. The cluster keeps retrying the update indefinitely.
2. The status reported by oc get clusterversion switches
   from: "Working towards internal-registry.qe.devcluster.openshift.com:5000/ocp/release:4.4.0-0.nightly-2020-04-29-160236: downloading update"
   to: "Unable to apply internal-registry.qe.devcluster.openshift.com:5000/ocp/release:4.4.0-0.nightly-2020-04-29-160236: could not download the update"

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-04-27-013217

How reproducible:
Every time

Steps to Reproduce:
1. oc adm release mirror \
    --insecure=true \
    -a combined-secret.json \
    --from quay.io/openshift-release-dev/ocp-release-nightly:4.4.0-0.nightly-2020-04-29-160236 \
    --to-release-image registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image:4.4.0-0.nightly-2020-04-29-160236 \
    --to registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image

2. Create upgrade-image-policy.yml and paste in the ImageContentSourcePolicy from the output of the previous step:
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: example
spec:
  repositoryDigestMirrors:
  - mirrors:
    - registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image
    source: quay.io/openshift-release-dev/ocp-release-nightly
  - mirrors:
    - registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image
    source: quay.io/openshift-release-dev/ocp-v4.0-art-dev

3. oc create -f upgrade-image-policy.yml

4. oc wait --for=condition=UPDATED --timeout 1800s mcp/master

5. oc wait --for=condition=UPDATED --timeout 1800s mcp/worker

6. oc adm upgrade --to-image internal-registry.qe.devcluster.openshift.com:5000/ocp/release:4.4.0-0.nightly-2020-04-29-160236 --allow-explicit-upgrade --force

7. oc get clusterversion

Actual results:
1. NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-04-27-013217   True        True          5h     Working towards internal-registry.qe.devcluster.openshift.com:5000/ocp/release:4.4.0-0.nightly-2020-04-29-160236: downloading update

2. NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-04-27-013217   True        True          5h     Unable to apply internal-registry.qe.devcluster.openshift.com:5000/ocp/release:4.4.0-0.nightly-2020-04-29-160236: could not download the update
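The same messages come straight from the Progressing and Failing conditions on the ClusterVersion object; a minimal way to read them directly (assuming the default object name "version") is:

oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Progressing")].message}{"\n"}'
oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Failing")].message}{"\n"}'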



Expected results:
Cluster version updated to 4.4.0-0.nightly-2020-04-29-160236

Additional info:
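One sanity check worth adding here (a sketch, not part of the original report; <node-name> is a placeholder, while the ICSP name "example" and the pull-secret path come from the steps above) is to confirm that the ImageContentSourcePolicy actually landed on the nodes and that the mirrored release image is readable before requesting the upgrade:

oc get imagecontentsourcepolicy example -o yaml
oc debug node/<node-name> -- chroot /host cat /etc/containers/registries.conf
oc adm release info --insecure=true -a combined-secret.json registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image:4.4.0-0.nightly-2020-04-29-160236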

Comment 1 W. Trevor King 2020-05-04 20:42:01 UTC
'could not download the update' is the message for the UpdatePayloadRetrievalFailed reason [1].  That reason is set when targetUpdatePayloadDir fails [2].  That function has several possible failure modes though, and only one of them is actually a failure to retrieve the payload [3].  Can you attach your CVO logs and/or a must-gather from when this issue was presenting?  They will hopefully contain the original error instead of a generic simplification.

[1]: https://github.com/openshift/cluster-version-operator/blob/23856901003b95b559087b8e83bffdee82872b2b/pkg/payload/task.go#L176-L177
[2]: https://github.com/openshift/cluster-version-operator/blob/23856901003b95b559087b8e83bffdee82872b2b/pkg/cvo/updatepayload.go#L108-L112
[3]: https://github.com/openshift/cluster-version-operator/blob/23856901003b95b559087b8e83bffdee82872b2b/pkg/cvo/updatepayload.go#L127
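(For reference, the requested data can be collected with something like the commands below; the namespace and deployment names are the standard ones for the cluster-version operator, and the destination directory is arbitrary.  If the failure really is a pull failure, the payload-download pod the CVO spawns in that namespace may also be sitting in ImagePullBackOff.)

oc -n openshift-cluster-version get pods
oc -n openshift-cluster-version logs deployment/cluster-version-operator > cvo.log
oc adm must-gather --dest-dir=./must-gather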

Comment 2 Ori Michaeli 2020-05-06 11:35:47 UTC
(In reply to W. Trevor King from comment #1)
> 'could not download the update' is the message for the
> UpdatePayloadRetrievalFailed reason [1].  That reason is set when
> targetUpdatePayloadDir fails [2].  That function has several possible
> failure modes though, and only one of them is actually a failure to retrieve
> the payload [3].  Can you attach your CVO logs and/or a must-gather from
> when this issue was presenting?  They will hopefully contain the original
> error instead of a generic simplification.
> 
> [1]: https://github.com/openshift/cluster-version-operator/blob/23856901003b95b559087b8e83bffdee82872b2b/pkg/payload/task.go#L176-L177
> [2]: https://github.com/openshift/cluster-version-operator/blob/23856901003b95b559087b8e83bffdee82872b2b/pkg/cvo/updatepayload.go#L108-L112
> [3]: https://github.com/openshift/cluster-version-operator/blob/23856901003b95b559087b8e83bffdee82872b2b/pkg/cvo/updatepayload.go#L127

I reprovisioned it to get the requested logs.

Comment 3 Ori Michaeli 2020-05-06 11:57:36 UTC
After checking again, it turned out to be a configuration mistake.

That said, you would expect it to fail after some time or after a number of attempts.

Closing this as NOTABUG.

Comment 4 Ori Michaeli 2020-05-06 11:58:14 UTC
Created attachment 1685682 [details]
cvo-spec

Comment 5 Ori Michaeli 2020-05-06 11:58:49 UTC
Created attachment 1685683 [details]
cvo-status-history

Comment 6 Ori Michaeli 2020-05-06 11:59:18 UTC
Created attachment 1685684 [details]
must-gather-log

Comment 7 W. Trevor King 2020-05-06 18:32:51 UTC
> That said, you would expect it to fail after some time or after a number of attempts.

How would the CVO distinguish between transients like "there's some network outage, and I'm currently unable to reach the target registry" (where we want it to retry) and "user pointed me at a bogus pullspec and there is never going to be a registry at that domain" (where... I dunno, maybe there would be a point where it was not worth retrying?).  If you do think there is a case where the CVO is retrying but should not be, give some details around it so we can discuss.
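(Tangentially, if a requested update does turn out to point at a bogus pullspec, the admin can abandon it rather than wait for the CVO to give up; a minimal example:

oc adm upgrade --clear

which clears the requested update from the ClusterVersion spec so the retries stop.)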

Comment 8 Ori Michaeli 2020-05-10 08:06:53 UTC
I think there should be a limit on retries or time, because otherwise it could run endlessly and consume resources.

Should I open a new BZ titled differently?

Comment 9 W. Trevor King 2020-05-11 17:15:24 UTC
> I think there should be a limit on retries or time, because otherwise it could run endlessly and consume resources.

The rest of the CVO is already running endlessly, reconciling manifests and reporting on cluster state.  An extra goroutine and an HTTP(S) request every few minutes is not consuming many additional resources.  If, on the other hand, the CVO gives up after a while and fails to notice a recovered path to the upstream service, it would need a human admin or other external tooling to notice the recovery and kick the CVO to get it to start fetching again, and humans are expensive.  If your CVO is banging its head against the wall and the overhead bothers you, clear spec.channel in your ClusterVersion to tell the CVO not to bother.
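(A minimal sketch of that spec.channel-clearing suggestion, assuming the default ClusterVersion object named "version":

oc patch clusterversion version --type json -p '[{"op": "remove", "path": "/spec/channel"}]'

Setting a channel again later turns update retrieval back on.)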