Description of problem:

As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1970421#c17, when a job started by the CVO goes Failed=True with reason=DeadlineExceeded, the CVO does not set Failing=True in the ClusterVersion status until its own timeout expires. In addition, the ClusterVersion message "Unable to download and prepare the update: timed out waiting for the condition" does not describe the real failure.

Version-Release number of the following components:
4.9.0-0.nightly-2021-07-18-155939

How reproducible:
100%

Steps to Reproduce:
1. Upgrade the cluster to a fake payload.
2. Check that the image-pull job failed:

$ oc get job.batch/version--l946z -n openshift-cluster-version -ojson | jq -r .status
{
  "conditions": [
    {
      "lastProbeTime": "2021-07-22T08:55:00Z",
      "lastTransitionTime": "2021-07-22T08:55:00Z",
      "message": "Job was active longer than specified deadline",
      "reason": "DeadlineExceeded",
      "status": "True",
      "type": "Failed"
    }
  ],
  "startTime": "2021-07-22T08:53:00Z"
}

3. Check the ClusterVersion status:

$ oc get clusterversion -ojson |jq -r '.items[].status.conditions[] | select(.type == "Failing")'
{
  "lastTransitionTime": "2021-07-22T08:58:42Z",
  "message": "Unable to download and prepare the update: timed out waiting for the condition",
  "reason": "UpdatePayloadRetrievalFailed",
  "status": "True",
  "type": "Failing"
}

Actual results:
The job fails and its Failed condition is set at 2021-07-22T08:55:00Z, but the CVO only sets the ClusterVersion condition at 2021-07-22T08:58:42Z; it does not update the ClusterVersion status until its own timeout. The failure message does not state the real reason the job failed, and Progressing is still True with reason DownloadingUpdate after the job has failed.

Expected results:
The CVO sets the ClusterVersion status as soon as the job fails and reports the real failure message.
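To see the lag at a glance, the job's Failed transition can be compared with the ClusterVersion Failing transition; a minimal sketch reusing the commands from the steps above (the job name is specific to this reproduction and will differ per cluster):

$ oc get job.batch/version--l946z -n openshift-cluster-version -ojson | jq -r '.status.conditions[] | select(.type == "Failed") | .lastTransitionTime + " " + .reason'
2021-07-22T08:55:00Z DeadlineExceeded
$ oc get clusterversion -ojson | jq -r '.items[].status.conditions[] | select(.type == "Failing") | .lastTransitionTime + " " + .reason'
2021-07-22T08:58:42Z UpdatePayloadRetrievalFailed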
Adding another scenario: when a deployment times out progressing, the deployment gets Progressing=False with reason ProgressDeadlineExceeded, set here at 2022-01-20T09:11:21Z.

# oc -n openshift-operator-lifecycle-manager get po
NAME                                     READY   STATUS      RESTARTS   AGE
catalog-operator-6bccb67f6-r85t6         0/1     Pending     0          10m
catalog-operator-d64c86f65-w257p         1/1     Running     0          7h2m
collect-profiles-27377790-s46rj          0/1     Completed   0          41m
collect-profiles-27377805-75qth          0/1     Completed   0          26m
collect-profiles-27377820-fv2cl          0/1     Completed   0          11m
olm-operator-867d686df6-d2v2s            1/1     Running     0          7h2m
package-server-manager-b9bbc9f65-55pbr   1/1     Running     0          7h2m
packageserver-5f457d4b85-qjtsg           1/1     Running     0          6h57m
packageserver-5f457d4b85-z8nl6           1/1     Running     0          6h57m

# oc -n openshift-operator-lifecycle-manager get deployment catalog-operator -ojson | jq -r '.status.conditions[]'
{
  "lastTransitionTime": "2022-01-20T02:14:35Z",
  "lastUpdateTime": "2022-01-20T02:14:35Z",
  "message": "Deployment has minimum availability.",
  "reason": "MinimumReplicasAvailable",
  "status": "True",
  "type": "Available"
}
{
  "lastTransitionTime": "2022-01-20T09:11:21Z",
  "lastUpdateTime": "2022-01-20T09:11:21Z",
  "message": "ReplicaSet \"catalog-operator-6bccb67f6\" has timed out progressing.",
  "reason": "ProgressDeadlineExceeded",
  "status": "False",
  "type": "Progressing"
}

Yet the CVO does not fail until about 7 minutes later. It would be better to fail fast.

I0120 09:02:52.808628 1 sync_worker.go:555] Running sync 4.10.0-0.nightly-2022-01-19-150530 (force=false) on generation 2 in state Reconciling at attempt 0
I0120 09:02:53.371833 1 sync_worker.go:768] Running sync for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769)
I0120 09:02:53.379966 1 sync_worker.go:780] Done syncing for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769)
I0120 09:03:11.517190 1 sync_worker.go:768] Running sync for customresourcedefinition "storagestates.migration.k8s.io" (336 of 769)
I0120 09:03:21.539370 1 task_graph.go:546] Result of work: []
I0120 09:06:57.178489 1 sync_worker.go:555] Running sync 4.10.0-0.nightly-2022-01-19-150530 (force=false) on generation 2 in state Reconciling at attempt 0
I0120 09:06:57.664737 1 sync_worker.go:768] Running sync for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769)
I0120 09:06:57.672867 1 sync_worker.go:780] Done syncing for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769)
I0120 09:07:15.888068 1 sync_worker.go:768] Running sync for customresourcedefinition "storagestates.migration.k8s.io" (336 of 769)
I0120 09:07:25.904827 1 task_graph.go:546] Result of work: []
I0120 09:11:01.544163 1 sync_worker.go:555] Running sync 4.10.0-0.nightly-2022-01-19-150530 (force=false) on generation 2 in state Reconciling at attempt 0
I0120 09:11:02.158329 1 sync_worker.go:768] Running sync for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769)
I0120 09:11:02.167054 1 sync_worker.go:780] Done syncing for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769)
I0120 09:11:20.254397 1 sync_worker.go:768] Running sync for customresourcedefinition "storagestates.migration.k8s.io" (336 of 769)
I0120 09:18:12.822571 1 task_graph.go:546] Result of work: [Cluster operator machine-config is degraded]
I0120 09:18:42.226379 1 sync_worker.go:555] Running sync 4.10.0-0.nightly-2022-01-19-150530 (force=false) on generation 2 in state Reconciling at attempt 1
I0120 09:18:57.535277 1 sync_worker.go:768] Running sync for customresourcedefinition "storagestates.migration.k8s.io" (336 of 769)
I0120 09:19:09.135678 1 sync_worker.go:768] Running sync for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769)
E0120 09:19:09.234956 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.
E0120 09:19:34.616459 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.
E0120 09:20:00.310804 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.
E0120 09:20:21.493720 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.
E0120 09:20:44.772366 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.
E0120 09:21:10.058041 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.
E0120 09:21:29.586525 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.
E0120 09:21:47.770738 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.
E0120 09:22:10.906396 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.
E0120 09:22:31.167054 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.
E0120 09:22:46.275551 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.
E0120 09:23:10.984216 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.
E0120 09:23:36.265768 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.
E0120 09:23:56.507808 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.
E0120 09:24:13.785923 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.
E0120 09:24:33.520671 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.
E0120 09:24:46.760273 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.
E0120 09:25:02.995628 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.
E0120 09:25:24.075328 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.
E0120 09:25:39.317692 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.
E0120 09:25:53.347323 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (580 of 769): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.
I0120 09:25:53.504600 1 task_graph.go:546] Result of work: [Cluster operator machine-config is degraded deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.]
I0120 09:25:53.504667 1 sync_worker.go:940] Update error 580 of 769: WorkloadNotProgressing deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing. (*errors.errorString: deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.)
* deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6bccb67f6" has timed out progressing.
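In a live cluster, that lag can be measured by comparing the two lastTransitionTime values directly; a minimal sketch reusing the same commands shown above (the deployment and condition names are the ones from this scenario):

# oc -n openshift-operator-lifecycle-manager get deployment catalog-operator -ojson | jq -r '.status.conditions[] | select(.type == "Progressing") | .lastTransitionTime + " " + .reason'
# oc get clusterversion version -ojson | jq -r '.status.conditions[] | select(.type == "Failing") | .lastTransitionTime + " " + .reason'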
> Yet the CVO does not fail until about 7 minutes later. It would be better to fail fast.

The CVO fails at 09:18:12 to report the machine-config degraded error, and fails again at 09:25:53 to report the catalog-operator ProgressDeadlineExceeded error. So the CVO waits about 14 minutes (measured from the deployment's 09:11:21 transition) before failing on catalog-operator.
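For reference, a quick check of that arithmetic, assuming GNU date (the inputs are the deployment's 09:11:21 Progressing transition and the 09:25:53 failure from the logs above):

$ start=$(date -u -d 2022-01-20T09:11:21Z +%s); end=$(date -u -d 2022-01-20T09:25:53Z +%s); echo "$(( (end - start) / 60 )) minutes"
14 minutes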
I expect [1], when we get back around to reviving it, would help with the ProgressDeadlineExceeded reporting latency.

[1]: https://github.com/openshift/cluster-version-operator/pull/573/
^ that commit landed in 4.12 and later via [1], but it's about comment 2's ProgressDeadlineExceeded, not comment 0's "Unable to download and prepare the update". Reproducing the latter with a cluster-bot cluster from 'launch 4.12.0', and a 4.12 oc:

$ oc version --client
Client Version: 4.12.0
Kustomize Version: v4.5.7

Set a channel, because cluster bot clears that like all CI jobs, but I want this test to be more like a customer install, and they default to stable channels:

$ oc adm upgrade channel stable-4.12

Updating to a non-existent image (--force to bypass the lack of signature, the other options to go outside channel recommendations):

$ sha256sum </dev/null
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  -
$ oc adm upgrade --force --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

And check on the UX:

$ oc adm upgrade
Cluster version is 4.12.0

ReleaseAccepted=False

  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" failure=Unable to download and prepare the update: deadline exceeded, reason: "DeadlineExceeded", message: "Job was active longer than specified deadline"

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.12 (available channels: candidate-4.12, candidate-4.13, eus-4.12, fast-4.12, stable-4.12)

No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and may result in downtime or data loss.

Poking at condition details:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | select(.type == "ReleaseAccepted") | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2023-01-18T23:36:31Z ReleaseAccepted=False RetrievePayload: Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" failure=Unable to download and prepare the update: deadline exceeded, reason: "DeadlineExceeded", message: "Job was active longer than specified deadline"

And logs:

$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 | grep 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\|version--qnprr'
I0118 23:34:31.811068 1 cvo.go:596] Desired version from spec is v1.Update{Version:"", Image:"registry.ci.openshift.org/ocp/release@sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855", Force:true}
I0118 23:34:31.811095 1 sync_worker.go:811] Detected while considering cluster version generation 7: version changed (from {4.12.0 registry.build05.ci.openshift.org/ci-ln-q516qrb/release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18 false} to { registry.ci.openshift.org/ocp/release@sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 true})
I0118 23:34:31.811104 1 sync_worker.go:261] syncPayload: registry.ci.openshift.org/ocp/release@sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 (force=true)
I0118 23:34:31.811342 1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RetrievePayload' Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
W0118 23:34:31.885947 1 updatepayload.go:117] Target release version="" image="registry.ci.openshift.org/ocp/release@sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" cannot be verified, but continuing anyway because the update was forced: unable to verify sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 against keyrings: verifier-public-key-redhat
I0118 23:34:31.921372 1 batch.go:55] No active pods for job version--qnprr in namespace openshift-cluster-version
I0118 23:34:31.921389 1 batch.go:29] Job version--qnprr in namespace openshift-cluster-version is not ready, continuing to wait.
...
I0118 23:36:28.924703 1 batch.go:29] Job version--qnprr in namespace openshift-cluster-version is not ready, continuing to wait.
I0118 23:36:31.938189 1 batch.go:55] No active pods for job version--qnprr in namespace openshift-cluster-version
I0118 23:36:31.938254 1 status.go:170] Synchronizing status errs=field.ErrorList(nil) status=&cvo.SyncWorkerStatus{Generation:6, Failure:error(nil), Done:827, Total:827, Completed:1, Reconciling:true, Initial:false, VersionHash:"HlaWrkSyQi4=", Architecture:"amd64", LastProgress:time.Date(2023, time.January, 18, 23, 32, 23, 92593953, time.Local), Actual:v1.Release{Version:"4.12.0", Image:"registry.build05.ci.openshift.org/ci-ln-q516qrb/release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18", URL:"https://access.redhat.com/errata/RHSA-2022:7399", Channels:[]string(nil)}, Verified:false, loadPayloadStatus:cvo.LoadPayloadStatus{Step:"RetrievePayload", Message:"Retrieving payload failed version=\"\" image=\"registry.ci.openshift.org/ocp/release@sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\" failure=Unable to download and prepare the update: deadline exceeded, reason: \"DeadlineExceeded\", message: \"Job was active longer than specified deadline\"", AcceptedRisks:"", Failure:(*payload.UpdateError)(0xc002368ab0), Update:v1.Update{Version:"", Image:"registry.ci.openshift.org/ocp/release@sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855", Force:true}, Verified:false, Local:false, LastTransitionTime:time.Time{wall:0xc0ea3e7ff7ec4eb2, ext:333905283433, loc:(*time.Location)(0x2f49fa0)}}, CapabilitiesStatus:cvo.CapabilityStatus{Status:v1.ClusterVersionCapabilitiesStatus{EnabledCapabilities:[]v1.ClusterVersionCapability{"CSISnapshot", "Console", "Insights", "Storage", "baremetal", "marketplace", "openshift-samples"}, KnownCapabilities:[]v1.ClusterVersionCapability{"CSISnapshot", "Console", "Insights", "Storage", "baremetal", "marketplace", "openshift-samples"}}, ImplicitlyEnabledCaps:[]v1.ClusterVersionCapability(nil)}}
I0118 23:36:31.938458 1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RetrievePayloadFailed' Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" failure=Unable to download and prepare the update: deadline exceeded, reason: "DeadlineExceeded", message: "Job was active longer than specified deadline"
...
And events:

$ oc -n openshift-cluster-version get -o json events | jq -r '.items[] | select(.involvedObject.name == "version--qnprr") | .firstTimestamp + " " + (.count | tostring) + " " + .lastTimestamp + " " + .reason + ": " + .message'
2023-01-18T23:34:31Z 1 2023-01-18T23:34:31Z SuccessfulCreate: Created pod: version--qnprr-5bd4b
2023-01-18T23:36:31Z 1 2023-01-18T23:36:31Z SuccessfulDelete: Deleted pod: version--qnprr-5bd4b
2023-01-18T23:36:31Z 1 2023-01-18T23:36:31Z DeadlineExceeded: Job was active longer than specified deadline

Comparing with comment 0's:

> Actual results:
> The job fails and its Failed condition is set at 2021-07-22T08:55:00Z, but the CVO only sets the ClusterVersion condition at 2021-07-22T08:58:42Z; it does not update the ClusterVersion status until its own timeout. The failure message does not state the real reason the job failed, and Progressing is still True with reason DownloadingUpdate after the job has failed.
>
> Expected results:
> The CVO sets the ClusterVersion status as soon as the job fails and reports the real failure message.

So in this case, the job failed at 23:36:31, and we got ReleaseAccepted=False right then with a matching lastTransitionTime. Since we grew the ReleaseAccepted condition, we no longer go Progressing=True while considering potential new targets:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | select(.type == "Progressing") | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2023-01-18T23:32:29Z Progressing=False : Cluster version is 4.12.0

So it looks like we're good vs. the comment 0 ask. To dig into comment 2's ask, cordon the control-plane nodes to block new pods:

$ oc get nodes -l node-role.kubernetes.io/control-plane= -o name | while read NODE; do oc adm cordon "${NODE}"; done

And kick the comment 2 Deployment to trigger a roll-out attempt (maybe there's a more convenient way to do this? see the aside below):

$ oc -n openshift-operator-lifecycle-manager patch deployment catalog-operator --type json --patch '[{"op": "add", "path": "/spec/template/metadata/annotations/bump", "value": "123"}]'

Watch the deployment:

$ oc -n openshift-operator-lifecycle-manager get -w -o json deployment catalog-operator | jq -r '.status.conditions[] | select(.type == "Progressing") | .lastTransitionTime + " " + .lastUpdateTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2023-01-18T22:30:40Z 2023-01-19T00:46:12Z Progressing=True ReplicaSetUpdated: ReplicaSet "catalog-operator-6f78dd9b56" is progressing.
2023-01-19T00:56:13Z 2023-01-19T00:56:13Z Progressing=False ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6f78dd9b56" has timed out progressing.
^C

And checking in on ClusterVersion:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | select(.type == "Failing") | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2023-01-19T00:52:29Z Failing=True MultipleErrors: Multiple errors are preventing progress:
* Cluster operator machine-config is degraded
* deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6f78dd9b56" has timed out progressing.

Looks like machine-config beat us to going sad, and that makes the timestamp harder to figure out.
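(Aside, on the "maybe there's a more convenient way to do this?" question above: 'oc rollout restart' should trigger an equivalent roll-out by stamping a restart annotation into the pod template. Untested here, and the CVO may well stomp either change, since it manages this Deployment.)

$ oc -n openshift-operator-lifecycle-manager rollout restart deployment/catalog-operator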
Drop into CVO logs:

$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 | grep catalog-operator-6f78dd9b56
E0119 00:56:46.738309 1 task.go:117] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (626 of 827): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6f78dd9b56" has timed out progressing.
I0119 00:57:11.888330 1 task_graph.go:546] Result of work: [deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6f78dd9b56" has timed out progressing. Cluster operator machine-config is degraded]
I0119 00:57:11.888354 1 sync_worker.go:1173] Update error 626 of 827: WorkloadNotProgressing deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6f78dd9b56" has timed out progressing. (*errors.errorString: deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6f78dd9b56" has timed out progressing.)
* deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-6f78dd9b56" has timed out progressing.

So something like 00:56:13Z for the Deployment transition, and 00:57:11 for ClusterVersion starting to complain about it, which is a lot faster than comment 2's seven minutes. I'm going to mark this one closed CURRENTRELEASE, but if I'm missing something in all of this, feel free to re-open or file new bugs :).

[1]: https://github.com/openshift/cluster-version-operator/pull/577