Description of problem:

After trying to upgrade to an unavailable payload (no upgrade happens, as expected), the CVO cannot start a new upgrade even when given a correct payload repo.
=======================================
The CVO log shows the CVO struggling with the update job version--v5f88 and failing due to a timeout, but it does not respond to the new upgrade request after that.

# ./oc -n openshift-cluster-version logs cluster-version-operator-68ccb8c4fd-p7x4r|grep 'quay.io/openshift-release-dev/ocp-release@sha256\:90fabdb'|head -n1
I0310 04:52:15.072040 1 cvo.go:546] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release@sha256:90fabdb570eb248f93472cc06ef28d09d5820e80b9ed578e2484f4ef526fe6d4", Force:false}

# ./oc -n openshift-cluster-version logs cluster-version-operator-68ccb8c4fd-p7x4r|grep 'registry.ci.openshift.org/ocp/release@sha256\:90fabdb'|head -n1
#

...
I0310 04:52:15.072040 1 cvo.go:546] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release@sha256:90fabdb570eb248f93472cc06ef28d09d5820e80b9ed578e2484f4ef526fe6d4", Force:false}
...
I0310 04:52:15.225739 1 batch.go:53] No active pods for job version--v5f88 in namespace openshift-cluster-version
I0310 04:52:15.225778 1 batch.go:22] Job version--v5f88 in namespace openshift-cluster-version is not ready, continuing to wait.
...
I0310 05:03:12.238308 1 batch.go:53] No active pods for job version--v5f88 in namespace openshift-cluster-version
E0310 05:03:12.238525 1 batch.go:19] deadline exceeded, reason: "DeadlineExceeded", message: "Job was active longer than specified deadline"
.....
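When the CVO gets stuck like this, inspecting the failed retrieval job directly can show why the payload pull never completed. The commands below are a sketch: the job name version--v5f88 comes from the log excerpt above and will differ per cluster.

```shell
# Inspect the stuck payload-retrieval job.
oc -n openshift-cluster-version describe job version--v5f88

# Events often show why the pod never became active (image pull errors, etc.).
oc -n openshift-cluster-version get events \
  --field-selector involvedObject.name=version--v5f88
```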
# ./oc get all -n openshift-cluster-version
NAME                                            READY   STATUS    RESTARTS   AGE
pod/cluster-version-operator-68ccb8c4fd-p7x4r   1/1     Running   0          61m

NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/cluster-version-operator   ClusterIP   172.30.220.176   <none>        9099/TCP   62m

NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cluster-version-operator   1/1     1            1           61m

NAME                                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/cluster-version-operator-68ccb8c4fd   1         1         1       61m

NAME                       COMPLETIONS   DURATION   AGE
job.batch/version--v5f88   0/1           30m        30m

Version-Release number of the following components:
4.11.0-0.nightly-2022-03-04-063157

How reproducible:
always

Steps to Reproduce:
1. Trigger an upgrade to an unavailable image (by mistake), from 4.11.0-0.nightly-2022-03-04-063157 to 4.11.0-0.nightly-2022-03-08-191358

# ./oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:90fabdb570eb248f93472cc06ef28d09d5820e80b9ed578e2484f4ef526fe6d4 --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
Updating to release image quay.io/openshift-release-dev/ocp-release@sha256:90fabdb570eb248f93472cc06ef28d09d5820e80b9ed578e2484f4ef526fe6d4

2. Wait several minutes (>5 mins): no upgrade happens (expected), but there is also no failure info (not expected).

# ./oc get clusterversion -ojson|jq .items[].status.conditions
    {
      "lastTransitionTime": "2022-03-10T04:20:12Z",
      "message": "Payload loaded version=\"4.11.0-0.nightly-2022-03-04-063157\" image=\"registry.ci.openshift.org/ocp/release@sha256:cdeb8497920d9231ecc1ea7535e056b192f2ccf0fa6257d65be3bb876c1b9de6\"",
      "reason": "PayloadLoaded",
      "status": "True",
      "type": "ReleaseAccepted"
    },
...
# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-03-04-063157   True        False         27m     Cluster version is 4.11.0-0.nightly-2022-03-04-063157

# ./oc adm upgrade
Cluster version is 4.11.0-0.nightly-2022-03-04-063157

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.11
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-03-04-063157 not found in the "stable-4.11" channel

3. Continue the upgrade to the target payload with the correct repo

# ./oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:90fabdb570eb248f93472cc06ef28d09d5820e80b9ed578e2484f4ef526fe6d4 --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
Updating to release image registry.ci.openshift.org/ocp/release@sha256:90fabdb570eb248f93472cc06ef28d09d5820e80b9ed578e2484f4ef526fe6d4

4. Still no upgrade happens, the same as in step 2 (not expected).

Actual results:
An upgrade attempt to an unavailable payload leaves the CVO in a broken state: a subsequent update to an available payload is never started.

Expected results:
The upgrade to the correct target payload should be triggered.

Additional info:
Running `oc adm upgrade --clear` to cancel the initial invalid upgrade before triggering a new upgrade does not help. Only deleting the CVO pod, so that it gets re-deployed, makes the CVO work again.
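The workaround described above can be sketched as the following commands. The label selector is an assumption about how the CVO Deployment labels its pods, not something taken from this report; deleting the pod by name works too.

```shell
# 'oc adm upgrade --clear' alone did not recover the CVO in this report.
oc adm upgrade --clear

# Deleting the CVO pod so the Deployment re-creates it did recover it.
# The k8s-app label selector below is assumed, not verified in this report.
oc -n openshift-cluster-version delete pod \
  -l k8s-app=cluster-version-operator
```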
Not a blocker. Most release-pulling jobs will succeed in less than the job timeout, which is two minutes [1]. I suspect you'd need to connect to a very slow registry or have other pod-launching issues to hit this. Mitigation should be possible via:

  $ oc -n openshift-cluster-version delete jobs --all

or similar to remove the failed job. Inspecting the failed job to understand why it failed would also be useful.

[1]: https://github.com/openshift/cluster-version-operator/blob/0e9bc4ef03004fcc2bb0c58d39e5b49445a9f8f6/pkg/cvo/updatepayload.go#L162
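For context, the two-minute deadline in [1] uses the standard Kubernetes Job activeDeadlineSeconds mechanism, which is what produces the "Job was active longer than specified deadline" message in the logs. A rough, illustrative sketch of such a Job spec follows; the field values here are assumptions, not the actual CVO-generated Job:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: version--example            # illustrative name only
  namespace: openshift-cluster-version
spec:
  activeDeadlineSeconds: 120        # Job fails with reason DeadlineExceeded after 2 minutes
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: payload
        image: quay.io/openshift-release-dev/ocp-release@sha256:<digest>  # placeholder
```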
Version: 4.11.0-0.nightly-2022-03-13-055724

1. Upgrade from 4.11.0-0.nightly-2022-03-13-055724 to 4.11.0-0.nightly-2022-03-14-113722 with the wrong repo

# ./oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:f21d4bb9ccb8a82cc14906bf89b0422ffd5c423b5e5dfc10b843957181de87f2 --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
Updating to release image quay.io/openshift-release-dev/ocp-release@sha256:f21d4bb9ccb8a82cc14906bf89b0422ffd5c423b5e5dfc10b843957181de87f2

2. No upgrade happens; the failure is now reported in the ReleaseAccepted condition.

# ./oc get clusterversion -ojson|jq .items[].status.conditions
...
    {
      "lastTransitionTime": "2022-03-15T01:20:26Z",
      "message": "Retrieving payload failed version=\"\" image=\"quay.io/openshift-release-dev/ocp-release@sha256:f21d4bb9ccb8a82cc14906bf89b0422ffd5c423b5e5dfc10b843957181de87f2\" failure=Unable to download and prepare the update: deadline exceeded, reason: \"DeadlineExceeded\", message: \"Job was active longer than specified deadline\"",
      "reason": "RetrievePayload",
      "status": "False",
      "type": "ReleaseAccepted"
    },
...

# ./oc get clusterversion -ojson|jq .items[].status.history
[
  {
    "completionTime": "2022-03-15T00:59:59Z",
    "image": "registry.ci.openshift.org/ocp/release@sha256:9653c71def3a3cf89e2b973a0328ac684f8bb6f913eab3bfbd106737fe09e57c",
    "startedTime": "2022-03-15T00:42:38Z",
    "state": "Completed",
    "verified": false,
    "version": "4.11.0-0.nightly-2022-03-13-055724"
  }
]

3. Continue the upgrade to the target payload with the correct repo.
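To pull just the ReleaseAccepted condition instead of scanning the full conditions array, a jq filter like the following can help (a convenience sketch, not from the report):

```shell
oc get clusterversion version -o json \
  | jq '.status.conditions[] | select(.type == "ReleaseAccepted")'
```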
# ./oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:f21d4bb9ccb8a82cc14906bf89b0422ffd5c423b5e5dfc10b843957181de87f2 --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
Updating to release image registry.ci.openshift.org/ocp/release@sha256:f21d4bb9ccb8a82cc14906bf89b0422ffd5c423b5e5dfc10b843957181de87f2

4. The upgrade is triggered successfully.

# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-03-13-055724   True        True          58s     Working towards 4.11.0-0.nightly-2022-03-14-113722: 118 of 777 done (15% complete)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069