Bug 1970421 - CVO does not provide a good enough reason why an upgrade payload pull failed
Summary: CVO does not provide a good enough reason why an upgrade payload pull failed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.9.0
Assignee: Arjun Naik
QA Contact: Yang Yang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-10 13:03 UTC by Standa Laznicka
Modified: 2021-10-18 17:34 UTC (History)
8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:33:33 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 614 0 None open Bug 1970421: Do not check if Job has Failed count set 2021-06-28 15:39:15 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:34:03 UTC

Description Standa Laznicka 2021-06-10 13:03:04 UTC
Description of problem:
When CVO pulls an upgrade payload and it fails, it does not provide an actionable reason as to why the pull failed.

Version-Release number of selected component (if applicable):
4.8

How reproducible:
100%

Steps to Reproduce:
1. push a release payload to a registry that the cluster does not have credentials for
2. oc adm upgrade --to-image=awesome.registry.cz/image/tag:latest --allow-explicit-upgrade --force

Actual results:
CVO logs that it is continuing the pull; it does not mention that it is unable to do so due to unauthorized errors.

Eventually, CVO reports in clusterversion/version that the pull failed, again without any reason.

(I fixed this and, being in a bit of a hurry, did not capture the exact logs and status, sorry :( )

Expected results:
I can at least see that CVO failed to authenticate / was not authorized when contacting the remote registry.

Additional info:

Comment 1 Vadim Rutkovsky 2021-06-10 13:09:45 UTC
CVO doesn't pull the image directly; instead it starts a job to pull it. If the job fails we set the `UpdatePayloadRetrievalFailed` status in ClusterVersion.

Perhaps we should also log the job progress.
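
For illustration, a minimal sketch (not the actual CVO code) of how a controller could surface the retrieval Job's failure reason from its conditions, assuming only the standard k8s.io/api/batch/v1 and k8s.io/api/core/v1 types:

package payload

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// jobFailureReason returns the reason/message of the Job's Failed=True
// condition (e.g. "DeadlineExceeded: Job was active longer than specified
// deadline"), or false if the Job has not failed (yet).
func jobFailureReason(job *batchv1.Job) (string, bool) {
	for _, c := range job.Status.Conditions {
		if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue {
			return fmt.Sprintf("%s: %s", c.Reason, c.Message), true
		}
	}
	return "", false
}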

Comment 2 Lalatendu Mohanty 2021-06-10 13:20:21 UTC
You should be able to see more details if you look into the cluster version status, i.e. oc get clusterversion -o yaml or -o json.

@slaznick can you tell us what the output of `oc get clusterversion` is when the upgrade fails?

Comment 3 Lalatendu Mohanty 2021-06-21 03:35:57 UTC
Removing needinfo as we can use the steps to reproduce to find out more.

Comment 4 Arjun Naik 2021-06-21 10:56:29 UTC
After performing the update as given in the instructions, this is found in the logs:


I0621 10:52:35.946490       1 batch.go:21] Job version--8qr8h in namespace openshift-cluster-version is not ready, continuing to wait.

I0621 10:52:38.947496       1 batch.go:21] Job version--8qr8h in namespace openshift-cluster-version is not ready, continuing to wait.
I0621 10:52:41.948606       1 batch.go:21] Job version--8qr8h in namespace openshift-cluster-version is not ready, continuing to wait.
I0621 10:52:42.377824       1 leaderelection.go:273] successfully renewed lease openshift-cluster-version/version
I0621 10:52:44.947118       1 batch.go:21] Job version--8qr8h in namespace openshift-cluster-version is not ready, continuing to wait.
I0621 10:52:46.871307       1 metrics.go:369] skipping metrics for kube-storage-version-migrator ClusterOperator condition Upgradeable=Unknown (neither True nor False)
I0621 10:52:47.945497       1 batch.go:21] Job version--8qr8h in namespace openshift-cluster-version is not ready, continuing to wait.
I0621 10:52:48.844185       1 cvo.go:481] Started syncing cluster version "openshift-cluster-version/version" (2021-06-21 10:52:48.844173102 +0000 UTC m=+2809.870740387)
I0621 10:52:48.844234       1 cvo.go:510] Desired version from spec is v1.Update{Version:"", Image:"awesome.registry.cz/image/tag:latest", Force:true}
I0621 10:52:48.844252       1 sync_worker.go:227] Update work is equal to current target; no change required
I0621 10:52:48.844262       1 status.go:161] Synchronizing errs=field.ErrorList{} status=&cvo.SyncWorkerStatus{Generation:2, Step:"RetrievePayload", Failure:error(nil), Done:0, Total:0, Completed:0, Reconciling:false, Initial:false, VersionHash:"", LastProgress:time.Time{wall:0xc02c399a76d52ccd, ext:2658946507579, loc:(*time.Location)(0x2ab0a00)}, Actual:v1.Release{Version:"", Image:"awesome.registry.cz/image/tag:latest", URL:"", Channels:[]string(nil)}, Verified:false}
I0621 10:52:48.844302       1 status.go:81] merge into existing history completed=false desired=v1.Release{Version:"", Image:"awesome.registry.cz/image/tag:latest", URL:"", Channels:[]string(nil)} last=&v1.UpdateHistory{State:"Partial", StartedTime:v1.Time{Time:time.Time{wall:0x0, ext:63759869417, loc:(*time.Location)(0x2ab0a00)}}, CompletionTime:(*v1.Time)(nil), Version:"", Image:"awesome.registry.cz/image/tag:latest", Verified:false}
I0621 10:52:48.844376       1 cvo.go:483] Finished syncing cluster version "openshift-cluster-version/version" (200.815µs)
I0621 10:52:50.945824       1 batch.go:21] Job version--8qr8h in namespace openshift-cluster-version is not ready, continuing to wait.
I0621 10:52:52.414253       1 leaderelection.go:273] successfully renewed lease openshift-cluster-version/version
I0621 10:52:53.947754       1 batch.go:21] Job version--8qr8h in namespace openshift-cluster-version is not ready, continuing to wait.
I0621 10:52:54.020293       1 metrics.go:369] skipping metrics for kube-storage-version-migrator ClusterOperator condition Upgradeable=Unknown (neither True nor False)
I0621 10:52:56.949318       1 batch.go:21] Job version--8qr8h in namespace openshift-cluster-version is not ready, continuing to wait.
I0621 10:52:59.946345       1 batch.go:21] Job version--8qr8h in namespace openshift-cluster-version is not ready, continuing to wait.
I0621 10:53:02.471886       1 leaderelection.go:273] successfully renewed lease openshift-cluster-version/version
I0621 10:53:02.945892       1 batch.go:21] Job version--8qr8h in namespace openshift-cluster-version is not ready, continuing to wait.
I0621 10:53:03.843973       1 cvo.go:481] Started syncing cluster version "openshift-cluster-version/version" (2021-06-21 10:53:03.843961817 +0000 UTC m=+2824.870529113)
I0621 10:53:03.844029       1 cvo.go:510] Desired version from spec is v1.Update{Version:"", Image:"awesome.registry.cz/image/tag:latest", Force:true}
I0621 10:53:03.844051       1 sync_worker.go:227] Update work is equal to current target; no change required
I0621 10:53:03.844059       1 status.go:161] Synchronizing errs=field.ErrorList{} status=&cvo.SyncWorkerStatus{Generation:2, Step:"RetrievePayload", Failure:error(nil), Done:0, Total:0, Completed:0, Reconciling:false, Initial:false, VersionHash:"", LastProgress:time.Time{wall:0xc02c399a76d52ccd, ext:2658946507579, loc:(*time.Location)(0x2ab0a00)}, Actual:v1.Release{Version:"", Image:"awesome.registry.cz/image/tag:latest", URL:"", Channels:[]string(nil)}, Verified:false}
I0621 10:53:03.844110       1 status.go:81] merge into existing history completed=false desired=v1.Release{Version:"", Image:"awesome.registry.cz/image/tag:latest", URL:"", Channels:[]string(nil)} last=&v1.UpdateHistory{State:"Partial", StartedTime:v1.Time{Time:time.Time{wall:0x0, ext:63759869417, loc:(*time.Location)(0x2ab0a00)}}, CompletionTime:(*v1.Time)(nil), Version:"", Image:"awesome.registry.cz/image/tag:latest", Verified:false}

Comment 5 Standa Laznicka 2021-06-21 11:42:37 UTC
Thanks, I did not notice the needinfo. The posted logs are exactly what I found; as you can see, there is no reason whatsoever, and I remember `oc get clusterversion` was not particularly helpful either. I believe there's work to be done here to make the reason for the failure more transparent, not only in the logs but in the status, too.

Comment 7 Yang Yang 2021-07-19 11:31:47 UTC
Reproducing with 4.7.19

# oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release@sha256:56dc7d529400ff7ee2eb1c50f14352fb0773ed9f3a2d24383507188ceb98900b --allow-explicit-upgrade --force

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.19    True        True          67m     Working towards quay.io/openshift-release-dev/ocp-release@sha256:56dc7d529400ff7ee2eb1c50f14352fb0773ed9f3a2d24383507188ceb98900b: downloading update

$ oc logs cluster-version-operator-76bcf5c675-h2gtc -n openshift-cluster-version | grep batch.go

I0719 11:22:04.024110       1 batch.go:21] Job version--kk49s in namespace openshift-cluster-version is not ready, continuing to wait.
I0719 11:22:07.023461       1 batch.go:21] Job version--kk49s in namespace openshift-cluster-version is not ready, continuing to wait.
I0719 11:22:10.023790       1 batch.go:21] Job version--kk49s in namespace openshift-cluster-version is not ready, continuing to wait.
I0719 11:22:13.024548       1 batch.go:21] Job version--kk49s in namespace openshift-cluster-version is not ready, continuing to wait.
I0719 11:22:16.023475       1 batch.go:21] Job version--kk49s in namespace openshift-cluster-version is not ready, continuing to wait.
E0719 11:22:19.023342       1 batch.go:18] deadline exceeded, reason: "DeadlineExceeded", message: "Job was active longer than specified deadline"
E0719 11:22:22.025066       1 batch.go:18] deadline exceeded, reason: "DeadlineExceeded", message: "Job was active longer than specified deadline"
E0719 11:22:25.023708       1 batch.go:18] deadline exceeded, reason: "DeadlineExceeded", message: "Job was active longer than specified deadline"
E0719 11:22:28.024860       1 batch.go:18] deadline exceeded, reason: "DeadlineExceeded", message: "Job was active longer than specified deadline"

Comment 8 Yang Yang 2021-07-20 03:28:31 UTC
(In reply to Yang Yang from comment #7)
> Reproducing with 4.7.19
> 
> # oc adm upgrade
> --to-image=quay.io/openshift-release-dev/ocp-release@sha256:
> 56dc7d529400ff7ee2eb1c50f14352fb0773ed9f3a2d24383507188ceb98900b
> --allow-explicit-upgrade --force
> 
> # oc get clusterversion
> NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
> version   4.7.19    True        True          67m     Working towards
> quay.io/openshift-release-dev/ocp-release@sha256:
> 56dc7d529400ff7ee2eb1c50f14352fb0773ed9f3a2d24383507188ceb98900b:
> downloading update
> 
> $oc logs cluster-version-operator-76bcf5c675-h2gtc -n
> openshift-cluster-version | grep batch.go
> 
> I0719 11:22:04.024110       1 batch.go:21] Job version--kk49s in namespace
> openshift-cluster-version is not ready, continuing to wait.
> I0719 11:22:07.023461       1 batch.go:21] Job version--kk49s in namespace
> openshift-cluster-version is not ready, continuing to wait.
> I0719 11:22:10.023790       1 batch.go:21] Job version--kk49s in namespace
> openshift-cluster-version is not ready, continuing to wait.
> I0719 11:22:13.024548       1 batch.go:21] Job version--kk49s in namespace
> openshift-cluster-version is not ready, continuing to wait.
> I0719 11:22:16.023475       1 batch.go:21] Job version--kk49s in namespace
> openshift-cluster-version is not ready, continuing to wait.
> E0719 11:22:19.023342       1 batch.go:18] deadline exceeded, reason:
> "DeadlineExceeded", message: "Job was active longer than specified deadline"
> E0719 11:22:22.025066       1 batch.go:18] deadline exceeded, reason:
> "DeadlineExceeded", message: "Job was active longer than specified deadline"
> E0719 11:22:25.023708       1 batch.go:18] deadline exceeded, reason:
> "DeadlineExceeded", message: "Job was active longer than specified deadline"
> E0719 11:22:28.024860       1 batch.go:18] deadline exceeded, reason:
> "DeadlineExceeded", message: "Job was active longer than specified deadline"

CVO does not detect the download failure.

  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2021-07-19T08:24:55Z"
      message: Done applying 4.7.19
      status: "True"
      type: Available
    - lastTransitionTime: "2021-07-20T03:23:43Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2021-07-19T10:18:45Z"
      message: 'Working towards quay.io/openshift-release-dev/ocp-release@sha256:56dc7d529400ff7ee2eb1c50f14352fb0773ed9f3a2d24383507188ceb98900b: downloading update'
      reason: DownloadingUpdate
      status: "True"
      type: Progressing
    - lastTransitionTime: "2021-07-19T07:58:57Z"
      status: "True"
      type: RetrievedUpdates

Comment 9 Yang Yang 2021-07-20 03:43:11 UTC
Verifying with 4.9.0-0.nightly-2021-07-18-155939

Upgrade to a fake payload

$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release@sha256:eb6db523e8ffe1c7d1a90d879df749ad2d63cef41abc5e5d8a9ec8d3321b5a1a --allow-explicit-upgrade --force


$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-07-18-155939   True        True          19m     Unable to apply registry.ci.openshift.org/ocp/release@sha256:eb6db523e8ffe1c7d1a90d879df749ad2d63cef41abc5e5d8a9ec8d3321b5a1a: could not download the update

$ oc get clusterversion -ojson |jq -r .items[].status
{
  "availableUpdates": null,
  "conditions": [
    {
      "lastTransitionTime": "2021-07-19T09:36:51Z",
      "message": "Done applying 4.9.0-0.nightly-2021-07-18-155939",
      "status": "True",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2021-07-20T03:40:26Z",
      "message": "Unable to download and prepare the update: timed out waiting for the condition",
      "reason": "UpdatePayloadRetrievalFailed",
      "status": "True",
      "type": "Failing"
    },
    {
      "lastTransitionTime": "2021-07-20T03:05:12Z",
      "message": "Unable to apply registry.ci.openshift.org/ocp/release@sha256:eb6db523e8ffe1c7d1a90d879df749ad2d63cef41abc5e5d8a9ec8d3321b5a1a: could not download the update",
      "reason": "UpdatePayloadRetrievalFailed",
      "status": "True",
      "type": "Progressing"
    },
    {
      "lastTransitionTime": "2021-07-19T09:13:10Z",
      "message": "Unable to retrieve available updates: currently reconciling cluster version 4.9.0-0.nightly-2021-07-18-155939 not found in the \"stable-4.8\" channel",
      "reason": "VersionNotFound",
      "status": "False",
      "type": "RetrievedUpdates"
    }
  ],
  "desired": {
    "image": "registry.ci.openshift.org/ocp/release@sha256:eb6db523e8ffe1c7d1a90d879df749ad2d63cef41abc5e5d8a9ec8d3321b5a1a",
    "version": ""
  },
  "history": [
    {
      "completionTime": null,
      "image": "registry.ci.openshift.org/ocp/release@sha256:eb6db523e8ffe1c7d1a90d879df749ad2d63cef41abc5e5d8a9ec8d3321b5a1a",
      "startedTime": "2021-07-20T03:05:12Z",
      "state": "Partial",
      "verified": false,
      "version": ""
    },
    {
      "completionTime": "2021-07-19T09:36:51Z",
      "image": "registry.ci.openshift.org/ocp/release@sha256:c61fe2b05944244eb8814806655ad74390aa0395e6bd8e32f9fa196bca09f5db",
      "startedTime": "2021-07-19T09:13:10Z",
      "state": "Completed",
      "verified": false,
      "version": "4.9.0-0.nightly-2021-07-18-155939"
    }
  ],
  "observedGeneration": 2,
  "versionHash": "SJCDimW4TWg="
}

When the image download fails, CVO sets Failing=true with UpdatePayloadRetrievalFailed reason. Moving it to verified state.

Comment 10 Yang Yang 2021-07-20 03:57:39 UTC
(In reply to Yang Yang from comment #9)
> Verifying with 4.9.0-0.nightly-2021-07-18-155939

In the meanwhile, CVO logs the job progress:

$ oc logs cluster-version-operator-7c57756cb4-bhvr6 -n openshift-cluster-version | grep batch.go

I0720 03:50:26.764796       1 batch.go:22] Job version--grg78 in namespace openshift-cluster-version is not ready, continuing to wait.
I0720 03:50:29.765791       1 batch.go:22] Job version--grg78 in namespace openshift-cluster-version is not ready, continuing to wait.
I0720 03:50:32.765319       1 batch.go:53] No active pods for job version--grg78 in namespace openshift-cluster-version
E0720 03:50:32.765357       1 batch.go:19] deadline exceeded, reason: "DeadlineExceeded", message: "Job was active longer than specified deadline"
I0720 03:50:35.765436       1 batch.go:53] No active pods for job version--grg78 in namespace openshift-cluster-version
E0720 03:50:35.765473       1 batch.go:19] deadline exceeded, reason: "DeadlineExceeded", message: "Job was active longer than specified deadline"
I0720 03:50:38.764876       1 batch.go:53] No active pods for job version--grg78 in namespace openshift-cluster-version
E0720 03:50:38.764907       1 batch.go:19] deadline exceeded, reason: "DeadlineExceeded", message: "Job was active longer than specified deadline"

Comment 11 Yang Yang 2021-07-20 08:36:13 UTC
I misunderstood the PR. Please ignore comment#7 and comment#8.

Attempting to reproduce it one more time.

Install a cluster

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-07-19-192457   True        False         41m     Cluster version is 4.8.0-0.nightly-2021-07-19-192457


Upgrade to a fake pull spec

$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release@sha256:eb6db523e8ffe1c7d1a90d879df749ad2d63cef41abc5e5d8a9ec8d3321b5a1a --allow-explicit-upgrade --force

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-07-19-192457   True        True          7m21s   Unable to apply registry.ci.openshift.org/ocp/release@sha256:eb6db523e8ffe1c7d1a90d879df749ad2d63cef41abc5e5d8a9ec8d3321b5a1a: could not download the update

$ oc get clusterversion -oyaml | jq -r .items[].status
status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2021-07-20T07:28:08Z"
      message: Done applying 4.8.0-0.nightly-2021-07-19-192457
      status: "True"
      type: Available
    - lastTransitionTime: "2021-07-20T08:15:53Z"
      message: 'Unable to download and prepare the update: timed out waiting for the condition'
      reason: UpdatePayloadRetrievalFailed
      status: "True"
      type: Failing
    - lastTransitionTime: "2021-07-20T08:10:10Z"
      message: 'Unable to apply registry.ci.openshift.org/ocp/release@sha256:eb6db523e8ffe1c7d1a90d879df749ad2d63cef41abc5e5d8a9ec8d3321b5a1a: could not download the update'
      reason: UpdatePayloadRetrievalFailed
      status: "True"
      type: Progressing
    - lastTransitionTime: "2021-07-20T07:01:56Z"
      message: 'Unable to retrieve available updates: currently reconciling cluster version 4.8.0-0.nightly-2021-07-19-192457 not found in the "stable-4.8" channel'
      reason: VersionNotFound
      status: "False"
      type: RetrievedUpdates

It failed with UpdatePayloadRetrievalFailed, which seems to be similar to the test results with the PR (comment#9).

Arjun Naik, it sounds like the reporter is expecting CVO to log helpful and proper messages indicating why the update retrieval failed. But from my testing (refer to comment#9), it seems that the PR does not address this kind of issue. Could you please help me understand what the PR is fixing or enhancing? Thanks

Comment 12 Arjun Naik 2021-07-20 10:17:57 UTC
As Vadim mentioned in comment#1, the CVO does not download the update itself; it uses a job to do that. In OCP 4.7 and older, when the image pull failed there was no feedback in the CV status because the job did not show that it had failed. However, there was an upstream fix in 4.8 which detects a pull failure and sets the status in the job accordingly, which can then be read by CVO. In addition, the PR in CVO added some more logic to detect the job failure faster. So effectively this bug was fixed in 4.8, with some improvements in detection in 4.9. With regard to showing the exact cause of failure in the CVO, this is not trivial because the cause is not stored in the job, only the status is.
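
For what it's worth, a hedged sketch of one possible way to dig the underlying pull error out of the Job's pods rather than the Job object itself. This is not what CVO does today; the pod list is assumed to have been fetched elsewhere (e.g. by the "job-name" label selector), and the snippet is illustrative only:

package payload

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// imagePullFailure scans the pods belonging to the retrieval Job for a
// container stuck in an image-pull error; the kubelet's waiting message
// carries the registry response, e.g. "unauthorized: authentication required".
func imagePullFailure(pods []corev1.Pod) (string, bool) {
	for _, p := range pods {
		for _, statuses := range [][]corev1.ContainerStatus{p.Status.InitContainerStatuses, p.Status.ContainerStatuses} {
			for _, cs := range statuses {
				if w := cs.State.Waiting; w != nil && (w.Reason == "ErrImagePull" || w.Reason == "ImagePullBackOff") {
					return fmt.Sprintf("%s/%s %s: %s", p.Name, cs.Name, w.Reason, w.Message), true
				}
			}
		}
	}
	return "", false
}

Something along these lines could turn "could not download the update" into a message that names the authorization failure the reporter is asking for.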

Comment 13 Yang Yang 2021-07-21 03:30:07 UTC
> In OCP 4.7 and older when the image pull failed there was no feedback in the CV status

From my testing, in OCP 4.7, CVO does set Failing=true in the CV conditions when the image pull failed.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-07-19-144858   True        True          6m37s   Unable to apply quay.io/openshift-release-dev/ocp-release@sha256:56dc7d529400ff7ee2eb1c50f14352fb0773ed9f3a2d24383507188ceb98900b: could not download the update

# oc get clusterversion -ojson |jq -r '.items[].status.conditions[]'
{
  "lastTransitionTime": "2021-07-21T02:31:51Z",
  "message": "Done applying 4.7.0-0.nightly-2021-07-19-144858",
  "status": "True",
  "type": "Available"
}
{
  "lastTransitionTime": "2021-07-21T03:12:06Z",
  "message": "Unable to download and prepare the update: timed out waiting for the condition",
  "reason": "UpdatePayloadRetrievalFailed",
  "status": "True",
  "type": "Failing"
}
{
  "lastTransitionTime": "2021-07-21T03:06:19Z",
  "message": "Unable to apply quay.io/openshift-release-dev/ocp-release@sha256:56dc7d529400ff7ee2eb1c50f14352fb0773ed9f3a2d24383507188ceb98900b: could not download the update",
  "reason": "UpdatePayloadRetrievalFailed",
  "status": "True",
  "type": "Progressing"
}
{
  "lastTransitionTime": "2021-07-21T02:05:45Z",
  "message": "Unable to retrieve available updates: currently reconciling cluster version 4.7.0-0.nightly-2021-07-19-144858 not found in the \"stable-4.7\" channel",
  "reason": "VersionNotFound",
  "status": "False",
  "type": "RetrievedUpdates"
}

> In addition the PR in CVO added some more logic to detect the job failure faster.

From my testing, in OCP 4.7 the image pull started at 2021-07-21T03:06:19Z and failed at 2021-07-21T03:12:06Z, so the failure was detected in less than 6 minutes. In OCP 4.9, the image pull started at 2021-07-21T03:15:12Z and failed at 2021-07-21T03:21:01Z, so the failure was detected in roughly the same time, just under 6 minutes.

It seems that the PR does not make the detection faster.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-07-18-155939   True        True          9m16s   Working towards registry.ci.openshift.org/ocp/release@sha256:eb6db523e8ffe1c7d1a90d879df749ad2d63cef41abc5e5d8a9ec8d3321b5a1a: could not download the update

# oc get clusterversion -ojson |jq -r '.items[].status.conditions[]'
{
  "lastTransitionTime": "2021-07-19T09:36:51Z",
  "message": "Done applying 4.9.0-0.nightly-2021-07-18-155939",
  "status": "True",
  "type": "Available"
}
{
  "lastTransitionTime": "2021-07-21T03:21:01Z",
  "message": "Unable to download and prepare the update: timed out waiting for the condition",
  "reason": "UpdatePayloadRetrievalFailed",
  "status": "True",
  "type": "Failing"
}
{
  "lastTransitionTime": "2021-07-21T03:15:12Z",
  "message": "Unable to apply registry.ci.openshift.org/ocp/release@sha256:eb6db523e8ffe1c7d1a90d879df749ad2d63cef41abc5e5d8a9ec8d3321b5a1a: could not download the update",
  "reason": "UpdatePayloadRetrievalFailed",
  "status": "True",
  "type": "Progressing"
}
{
  "lastTransitionTime": "2021-07-19T09:13:10Z",
  "message": "Unable to retrieve available updates: currently reconciling cluster version 4.9.0-0.nightly-2021-07-18-155939 not found in the \"stable-4.8\" channel",
  "reason": "VersionNotFound",
  "status": "False",
  "type": "RetrievedUpdates"
}

Comment 14 Arjun Naik 2021-07-21 08:57:24 UTC
You're right, the PR does not make detection faster, and even the older versions detected the failure, as mentioned in comment#0. What the reporter wants is a detailed description of the failure, which this PR does not provide.

Comment 15 Yang Yang 2021-07-21 09:48:13 UTC
Is the PR intended to change the code logic only, without any functional changes? If that's correct, the behavior works as before, and I feel I can move it to the verified state based on comment#9.

Comment 16 Arjun Naik 2021-07-22 09:18:05 UTC
Yang Yang, the PR does not address the problem mentioned in comment#1, which is that the user expects the error to contain details about the cause of the failure. So if you can move it to verified without closing this BZ, then please do so.

Comment 17 Yang Yang 2021-07-22 09:21:53 UTC
Thanks Arjun. I see what you mean.

Looking back at the code logic in the PR, it covers the following branches (a rough sketch follows below):

[1] Job failed > 0 and job condition set
[2] Job failed > 0 but job condition not yet set
[3] Job not yet failed but condition set with reason = DeadlineExceeded
[4] Job not yet failed but condition set with reason != DeadlineExceeded
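
A rough sketch of how those four branches could look in code (assumed shape for illustration, not the PR's exact implementation):

package payload

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// failedCondition returns the Job's Failed=True condition, if any.
func failedCondition(job *batchv1.Job) *batchv1.JobCondition {
	for i, c := range job.Status.Conditions {
		if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue {
			return &job.Status.Conditions[i]
		}
	}
	return nil
}

// classifyJob maps a Job's status onto the four branches listed above.
func classifyJob(job *batchv1.Job) string {
	cond := failedCondition(job)
	switch {
	case job.Status.Failed > 0 && cond != nil:
		return "[1] failed pods counted and Failed condition set: " + cond.Reason
	case job.Status.Failed > 0:
		return "[2] failed pods counted, Failed condition not yet set"
	case cond != nil && cond.Reason == "DeadlineExceeded":
		return "[3] no failed pods counted, Failed condition set with DeadlineExceeded"
	case cond != nil:
		return "[4] no failed pods counted, Failed condition set with reason " + cond.Reason
	default:
		return "job still in progress"
	}
}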


Upgrading a regular cluster to a fake payload (comment#9) touches [1].

$ oc get job.batch/version--l946z -n openshift-cluster-version -ojson | jq -r .status
{
  "conditions": [
    {
      "lastProbeTime": "2021-07-22T08:55:00Z",
      "lastTransitionTime": "2021-07-22T08:55:00Z",
      "message": "Job was active longer than specified deadline",
      "reason": "DeadlineExceeded",
      "status": "True",
      "type": "Failed"
    }
  ],
  "startTime": "2021-07-22T08:53:00Z"
}


$ oc get clusterversion -ojson |jq -r '.items[].status.conditions[] | select(.type == "Failing")'
{
  "lastTransitionTime": "2021-07-22T08:58:42Z",
  "message": "Unable to download and prepare the update: timed out waiting for the condition",
  "reason": "UpdatePayloadRetrievalFailed",
  "status": "True",
  "type": "Failing"
}

CVO sets Failing=true. However, the job fails and its condition is set at 2021-07-22T08:55:00Z, while CVO sets its condition at 2021-07-22T08:58:42Z, more than 3 minutes behind. I'm curious whether this is something CVO needs to fix.

Arjun, could you please help take a look ^^? Thanks!

Comment 18 Arjun Naik 2021-07-22 09:34:21 UTC
This is expected. There is a wait function which polls the job, but it never returns an error even when the job fails: https://github.com/openshift/cluster-version-operator/blob/3a68652568e9075c23f491bc8c037942bd67ec82/lib/resourcebuilder/batch.go#L17 . It only returns an error when it times out, which explains the delay. Fixing this would make the status update faster, but it's not critical and definitely not related to this BZ.
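
A hedged sketch of the polling pattern being described (illustrative only, not the actual batch.go code): the condition function logs the Job failure but returns (false, nil), so the only error the caller ever sees is the generic wait timeout.

package payload

import (
	"context"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"
)

// waitForJob polls the Job every few seconds; a failure is logged inside the
// condition func but not propagated, so wait.Poll can only return the generic
// "timed out waiting for the condition" error once the timeout elapses.
func waitForJob(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	return wait.Poll(3*time.Second, 2*time.Minute, func() (bool, error) {
		job, err := client.BatchV1().Jobs(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, nil // transient API errors: keep waiting
		}
		for _, c := range job.Status.Conditions {
			if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue {
				// The failure is only logged here; no error is returned.
				klog.Errorf("job failed, reason: %q, message: %q", c.Reason, c.Message)
				return false, nil
			}
		}
		return job.Status.Succeeded > 0, nil
	})
}

That generic timeout is what surfaces as "Unable to download and prepare the update: timed out waiting for the condition" in the ClusterVersion conditions, minutes after the Job itself has already reported Failed=True.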

Comment 19 Yang Yang 2021-07-22 13:20:42 UTC
Thanks Arjun for your explanation. That issue is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1984916 with low severity.

Comment 20 Yang Yang 2021-07-23 08:08:29 UTC
In order to exercise code branch [3], change activeDeadlineSeconds to 20 once the job is created.

[3] Job yet failed but condition set with reason = DeadlineExceeded

$ oc get job.batch/version--4t2s6 -ojson | jq -r .status
status:
  conditions:
  - lastProbeTime: "2021-07-23T07:09:51Z"
    lastTransitionTime: "2021-07-23T07:09:51Z"
    message: Job was active longer than specified deadline
    reason: DeadlineExceeded
    status: "True"
    type: Failed
  startTime: "2021-07-23T07:09:31Z"

$ oc get clusterversion -ojson | jq -r '.items[].status.conditions'
{
    "lastTransitionTime": "2021-07-23T07:21:49Z",
    "message": "Unable to download and prepare the update: timed out waiting for the condition",
    "reason": "UpdatePayloadRetrievalFailed",
    "status": "True",
    "type": "Failing"
  },

CVO sets Failing=True in the CV status. Passed.

Based on comment#17 and comment#19, moving it to the verified state.

Comment 23 errata-xmlrpc 2021-10-18 17:33:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

