Bug 1830714
Summary: | Unable to update cluster in disconnected environment: UpdatePayloadRetrievalFailed: could not download the update | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Ori Michaeli <omichael>
Component: | Cluster Version Operator | Assignee: | Lalatendu Mohanty <lmohanty>
Status: | CLOSED NOTABUG | QA Contact: | liujia <jiajliu>
Severity: | high | Priority: | medium
Version: | 4.4 | CC: | aos-bugs, jokerman, omichael, ssmolyak, wking
Target Release: | 4.5.0 | Hardware: | x86_64
OS: | Linux | Type: | Bug
Last Closed: | 2020-05-06 12:01:19 UTC | |
Description
Ori Michaeli
2020-05-03 11:32:44 UTC
'could not download the update' is the message for the UpdatePayloadRetrievalFailed reason [1]. That reason is set when targetUpdatePayloadDir fails [2]. That function has several possible failure modes, though, and only one of them is actually a failure to retrieve the payload [3]. Can you attach your CVO logs and/or a must-gather from when this issue was presenting? They will hopefully contain the original error instead of a generic simplification.

[1]: https://github.com/openshift/cluster-version-operator/blob/23856901003b95b559087b8e83bffdee82872b2b/pkg/payload/task.go#L176-L177
[2]: https://github.com/openshift/cluster-version-operator/blob/23856901003b95b559087b8e83bffdee82872b2b/pkg/cvo/updatepayload.go#L108-L112
[3]: https://github.com/openshift/cluster-version-operator/blob/23856901003b95b559087b8e83bffdee82872b2b/pkg/cvo/updatepayload.go#L127

(In reply to W. Trevor King from comment #1)
I reprovisioned it to get the requested logs. After checking again, it was a configuration mistake. Still, you would expect it to fail after some time or after a number of attempts.
Closing this as NOTABUG.

Created attachment 1685682: cvo-spec
Created attachment 1685683: cvo-status-history
Created attachment 1685684: must-gather-log
> Although, you would expect it to fail after some time or after a number of attempts.
How would the CVO distinguish between transients like "there's some network outage, and I'm currently unable to reach the target registry" (where we want it to retry) and "user pointed me at a bogus pullspec and there is never going to be a registry at that domain" (where... I dunno, maybe there would be a point where it was not worth retrying?). If you do think there is a case where the CVO is retrying but should not be, give some details around it so we can discuss.
I think there should be a limit of tries/time because otherwise it could run endlessly and consume resources. Should I open a new BZ titled differently?

> I think there should be a limit of tries/time because otherwise it could run endlessly and consume resources.
The rest of the CVO is already running endlessly, reconciling manifests and reporting on cluster state. An extra goroutine and an HTTP(S) request every few minutes is not consuming many additional resources. If, on the other hand, the CVO gives up after a while and fails to notice a recovered path to the upstream service, it would need a human admin or other external tooling to notice the recovery and kick the CVO to get it to start fetching again, and humans are expensive. If your CVO is banging its head against the wall and the overhead bothers you, clear spec.channel in your ClusterVersion to tell the CVO not to bother.