2062568 – CVO does not trigger new upgrade again after fail to update to unavailable payload

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	4.11
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.11.0
Assignee:	W. Trevor King
QA Contact:	liujia
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2071211

Reported:	2022-03-10 07:14 UTC by liujia
Modified:	2022-11-07 10:04 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	2071211
Environment:
Last Closed:	2022-08-10 10:53:13 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-version-operator pull 748	0	None	open	Bug 2062568: lib/resourcebuilder/batch: Stop waiting on Job deadline exceeded	2022-03-10 07:44:22 UTC
Red Hat Product Errata	RHSA-2022:5069	0	None	None	None	2022-08-10 10:53:49 UTC

Description liujia 2022-03-10 07:14:35 UTC

Description of problem:
After an attempted upgrade to an unavailable payload (no upgrade happens, as expected), the CVO cannot start a new upgrade, even when given a correct payload repo.

=======================================
The CVO log shows the CVO waiting on the update job version--v5f88 until it fails with a timeout. After that, it does not respond to the new upgrade request.

# ./oc -n openshift-cluster-version logs cluster-version-operator-68ccb8c4fd-p7x4r|grep 'quay.io/openshift-release-dev/ocp-release@sha256\:90fabdb'|head -n1
I0310 04:52:15.072040       1 cvo.go:546] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release@sha256:90fabdb570eb248f93472cc06ef28d09d5820e80b9ed578e2484f4ef526fe6d4", Force:false}

# ./oc -n openshift-cluster-version logs cluster-version-operator-68ccb8c4fd-p7x4r|grep 'registry.ci.openshift.org/ocp/release@sha256\:90fabdb'|head -n1
#


...
I0310 04:52:15.072040       1 cvo.go:546] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release@sha256:90fabdb570eb248f93472cc06ef28d09d5820e80b9ed578e2484f4ef526fe6d4", Force:false}
...
I0310 04:52:15.225739       1 batch.go:53] No active pods for job version--v5f88 in namespace openshift-cluster-version
I0310 04:52:15.225778       1 batch.go:22] Job version--v5f88 in namespace openshift-cluster-version is not ready, continuing to wait.
...
I0310 05:03:12.238308       1 batch.go:53] No active pods for job version--v5f88 in namespace openshift-cluster-version
E0310 05:03:12.238525       1 batch.go:19] deadline exceeded, reason: "DeadlineExceeded", message: "Job was active longer than specified deadline"
.....
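
For reference, the time between the first "continuing to wait" line and the "deadline exceeded" error in the excerpt above can be measured with a small script. This is a minimal sketch, not part of the CVO: the names `parse_klog_line` and `wait_duration` are made up for illustration, and it assumes the usual klog prefix (severity letter, MMDD, time with microseconds, thread id, file:line). klog lines carry no year, so a placeholder year is used and only differences within the same day are meaningful.

```python
import re
from datetime import datetime, timedelta

# klog prefix: severity letter, MMDD, HH:MM:SS.micros, thread id, "file:line]"
KLOG_RE = re.compile(
    r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ [^\]]+\] (.*)$"
)

def parse_klog_line(line):
    """Return (timestamp, message) for a klog line, or None if it does not match."""
    m = KLOG_RE.match(line.strip())
    if m is None:
        return None
    # Placeholder year 2000: klog omits the year, and only differences matter here.
    stamp = datetime.strptime(
        f"2000 {m.group(1)} {m.group(2)}", "%Y %m%d %H:%M:%S.%f"
    )
    return stamp, m.group(3)

def wait_duration(lines):
    """Time from the first 'continuing to wait' line to the next 'deadline exceeded' line."""
    start = None
    for line in lines:
        parsed = parse_klog_line(line)
        if parsed is None:
            continue  # skip "..." separators and anything that is not a klog line
        stamp, message = parsed
        if start is None and "continuing to wait" in message:
            start = stamp
        elif start is not None and "deadline exceeded" in message:
            return stamp - start
    return None

# Lines copied from the log excerpt in this report.
excerpt = [
    'I0310 04:52:15.225739       1 batch.go:53] No active pods for job version--v5f88 in namespace openshift-cluster-version',
    'I0310 04:52:15.225778       1 batch.go:22] Job version--v5f88 in namespace openshift-cluster-version is not ready, continuing to wait.',
    '...',
    'I0310 05:03:12.238308       1 batch.go:53] No active pods for job version--v5f88 in namespace openshift-cluster-version',
    'E0310 05:03:12.238525       1 batch.go:19] deadline exceeded, reason: "DeadlineExceeded", message: "Job was active longer than specified deadline"',
]

print(wait_duration(excerpt))
```

For the lines quoted here, the result is a wait of roughly eleven minutes on job version--v5f88 before the error is logged.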

# ./oc get all -n openshift-cluster-version
NAME                                            READY   STATUS    RESTARTS   AGE
pod/cluster-version-operator-68ccb8c4fd-p7x4r   1/1     Running   0          61m

NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/cluster-version-operator   ClusterIP   172.30.220.176   <none>        9099/TCP   62m

NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cluster-version-operator   1/1     1            1           61m

NAME                                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/cluster-version-operator-68ccb8c4fd   1         1         1       61m

NAME                       COMPLETIONS   DURATION   AGE
job.batch/version--v5f88   0/1           30m        30m


Version-Release number of the following components:
4.11.0-0.nightly-2022-03-04-063157

How reproducible:
always

Steps to Reproduce:
1. Trigger an upgrade to an unavailable image (by mistake), from 4.11.0-0.nightly-2022-03-04-063157 to 4.11.0-0.nightly-2022-03-08-191358

#./oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:90fabdb570eb248f93472cc06ef28d09d5820e80b9ed578e2484f4ef526fe6d4 --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
Updating to release image quay.io/openshift-release-dev/ocp-release@sha256:90fabdb570eb248f93472cc06ef28d09d5820e80b9ed578e2484f4ef526fe6d4

2. Wait several minutes (>5 min). No upgrade happens (expected), but no failure information is reported either (not expected)
# ./oc get clusterversion -ojson|jq .items[].status.conditions
  {
    "lastTransitionTime": "2022-03-10T04:20:12Z",
    "message": "Payload loaded version=\"4.11.0-0.nightly-2022-03-04-063157\" image=\"registry.ci.openshift.org/ocp/release@sha256:cdeb8497920d9231ecc1ea7535e056b192f2ccf0fa6257d65be3bb876c1b9de6\"",
    "reason": "PayloadLoaded",
    "status": "True",
    "type": "ReleaseAccepted"
  },
...
# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-03-04-063157   True        False         27m     Cluster version is 4.11.0-0.nightly-2022-03-04-063157

# ./oc adm upgrade
Cluster version is 4.11.0-0.nightly-2022-03-04-063157

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.11
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-03-04-063157 not found in the "stable-4.11" channel

3. Retry the upgrade to the target payload with the correct repo
# ./oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:90fabdb570eb248f93472cc06ef28d09d5820e80b9ed578e2484f4ef526fe6d4 --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
Updating to release image registry.ci.openshift.org/ocp/release@sha256:90fabdb570eb248f93472cc06ef28d09d5820e80b9ed578e2484f4ef526fe6d4

4. Still no upgrade happens, the same as in step 2 (not expected)

Actual results:
An update to an unavailable payload leaves the CVO unable to start any further update.

Expected results:
The upgrade to the correct target payload should be triggered.

Additional info:
Running `oc adm upgrade --clear` to cancel the initial invalid upgrade before triggering the new one does not help. The only workaround found is to delete the CVO pod so that it is redeployed; after that, the CVO works again.

Comment 2 W. Trevor King 2022-03-10 08:08:05 UTC

Not a blocker.  Most release-pulling jobs will succeed in less than the job timeout, which is two minutes [1].  I suspect you'd need to connect to a very slow registry or have other pod-launching issues to hit this.  Mitigation should be possible via:

  $ oc -n openshift-cluster-version delete jobs --all

or similar to remove the failed job.  Inspecting the failed job to understand why it failed would also be useful.

[1]: https://github.com/openshift/cluster-version-operator/blob/0e9bc4ef03004fcc2bb0c58d39e5b49445a9f8f6/pkg/cvo/updatepayload.go#L162
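
The linked pull request is titled "Stop waiting on Job deadline exceeded". The general idea, treating a Job that has already failed as terminal instead of continuing to poll it, can be sketched as below. This is only an illustration in Python, not the CVO's Go code: the function name `job_wait_decision` and the sample Job are hypothetical, and the field names (`status.conditions`, `type`, `reason`, `succeeded`) are assumed to follow the Kubernetes batch/v1 Job status as it would appear in `oc get job -o json` output.

```python
def job_wait_decision(job):
    """Classify a batch/v1 Job dict as finished, failed, or still pending.

    Returns ("done", None), ("failed", detail), or ("wait", None).
    """
    status = job.get("status", {})
    for cond in status.get("conditions") or []:
        if cond.get("status") != "True":
            continue  # only conditions that currently hold are relevant
        if cond.get("type") == "Complete":
            return "done", None
        if cond.get("type") == "Failed":
            # A failed Job (for example reason DeadlineExceeded) will not
            # recover on its own, so the caller should stop waiting on it.
            detail = f'{cond.get("reason", "")}: {cond.get("message", "")}'
            return "failed", detail
    if status.get("succeeded", 0) > 0:
        return "done", None
    return "wait", None

# Hypothetical Job status modelled on the error text quoted in this report.
failed_job = {
    "metadata": {"name": "version--v5f88", "namespace": "openshift-cluster-version"},
    "status": {
        "conditions": [
            {
                "type": "Failed",
                "status": "True",
                "reason": "DeadlineExceeded",
                "message": "Job was active longer than specified deadline",
            }
        ]
    },
}

print(job_wait_decision(failed_job))
```

A caller that receives "failed" can surface the detail and move on to the next requested payload, while "wait" keeps the existing polling behaviour.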

Comment 5 liujia 2022-03-15 01:48:45 UTC

Version: 4.11.0-0.nightly-2022-03-13-055724

1. Upgrade from 4.11.0-0.nightly-2022-03-13-055724 to 4.11.0-0.nightly-2022-03-14-113722 with the wrong repo
# ./oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:f21d4bb9ccb8a82cc14906bf89b0422ffd5c423b5e5dfc10b843957181de87f2 --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
Updating to release image quay.io/openshift-release-dev/ocp-release@sha256:f21d4bb9ccb8a82cc14906bf89b0422ffd5c423b5e5dfc10b843957181de87f2

2. No upgrade happens, and the failure is now reported in the ReleaseAccepted condition
# ./oc get clusterversion -ojson|jq .items[].status.conditions
...
{
    "lastTransitionTime": "2022-03-15T01:20:26Z",
    "message": "Retrieving payload failed version=\"\" image=\"quay.io/openshift-release-dev/ocp-release@sha256:f21d4bb9ccb8a82cc14906bf89b0422ffd5c423b5e5dfc10b843957181de87f2\" failure=Unable to download and prepare the update: deadline exceeded, reason: \"DeadlineExceeded\", message: \"Job was active longer than specified deadline\"",
    "reason": "RetrievePayload",
    "status": "False",
    "type": "ReleaseAccepted"
  },
...
# ./oc get clusterversion -ojson|jq .items[].status.history
[
  {
    "completionTime": "2022-03-15T00:59:59Z",
    "image": "registry.ci.openshift.org/ocp/release@sha256:9653c71def3a3cf89e2b973a0328ac684f8bb6f913eab3bfbd106737fe09e57c",
    "startedTime": "2022-03-15T00:42:38Z",
    "state": "Completed",
    "verified": false,
    "version": "4.11.0-0.nightly-2022-03-13-055724"
  }
]

3. Continue upgrade to target payload with correct repo.
# ./oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:f21d4bb9ccb8a82cc14906bf89b0422ffd5c423b5e5dfc10b843957181de87f2 --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
Updating to release image registry.ci.openshift.org/ocp/release@sha256:f21d4bb9ccb8a82cc14906bf89b0422ffd5c423b5e5dfc10b843957181de87f2

4. Upgrade is triggered successfully.
# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-03-13-055724   True        True          58s     Working towards 4.11.0-0.nightly-2022-03-14-113722: 118 of 777 done (15% complete)
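
The ReleaseAccepted condition shown in step 2 of this verification can also be checked from a script. The following is a minimal sketch under stated assumptions: the function name `release_accepted_problem` is invented for illustration, and the input is assumed to be the list object produced by `oc get clusterversion -ojson`, already parsed into a Python dict. The sample below reuses the condition quoted in this comment.

```python
def release_accepted_problem(clusterversion_list):
    """Return "reason: message" if ReleaseAccepted is not True, else None."""
    for item in clusterversion_list.get("items", []):
        for cond in item.get("status", {}).get("conditions", []):
            if cond.get("type") == "ReleaseAccepted" and cond.get("status") != "True":
                return f'{cond.get("reason")}: {cond.get("message")}'
    return None

# Condition copied from the output in this comment; other fields are omitted.
sample = {
    "items": [
        {
            "status": {
                "conditions": [
                    {
                        "lastTransitionTime": "2022-03-15T01:20:26Z",
                        "message": 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:f21d4bb9ccb8a82cc14906bf89b0422ffd5c423b5e5dfc10b843957181de87f2" failure=Unable to download and prepare the update: deadline exceeded, reason: "DeadlineExceeded", message: "Job was active longer than specified deadline"',
                        "reason": "RetrievePayload",
                        "status": "False",
                        "type": "ReleaseAccepted",
                    }
                ]
            }
        }
    ]
}

print(release_accepted_problem(sample))
```

A return value of None means no ClusterVersion in the list reports a rejected or unretrievable release.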

Comment 8 errata-xmlrpc 2022-08-10 10:53:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
