1872906 – Cluster did not acknowledge request to upgrade in a reasonable time

Summary: Cluster did not acknowledge request to upgrade in a reasonable time

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	medium
Target Milestone:	---
Target Release:	4.5.z
Assignee:	W. Trevor King
QA Contact:	liujia
Docs Contact:
URL:
Whiteboard:
Depends On:	1843505
Blocks:

Reported:	2020-08-26 21:22 UTC by W. Trevor King
Modified:	2020-11-17 19:37 UTC
CC List:	12 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:	1843505
Environment:
Last Closed:	2020-10-19 14:54:24 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-version-operator pull 446	0	None	closed	Bug 1872906: pkg/start: Release leader lease on graceful shutdown	2021-02-20 06:08:00 UTC
Red Hat Product Errata	RHBA-2020:4228	0	None	None	None	2020-10-19 14:54:40 UTC

Description W. Trevor King 2020-08-26 21:22:02 UTC

+++ This bug was initially created as a clone of Bug #1843505 +++

Description of problem:

Upgrade jobs failing with:

fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:138]: during upgrade to registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-06-03-045338
Unexpected error:
    <*errors.errorString | 0xc0022d5740>: {
        s: "Cluster did not acknowledge request to upgrade in a reasonable time: timed out waiting for the condition",
    }
    Cluster did not acknowledge request to upgrade in a reasonable time: timed out waiting for the condition
occurred

example:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107

recurring quite a bit:
https://search.apps.build01.ci.devcluster.openshift.com/?search=Cluster+did+not+acknowledge+request+to+upgrade+in+a+reasonable+time&maxAge=48h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

--- Additional comment from W. Trevor King on 2020-08-03 19:56:22 UTC ---

Abhinav points out that the delay may be the outgoing CVO not releasing the lease, so the incoming CVO has to wait for the old lease to expire.  We have a TODO about setting ReleaseOnCancel [1].  I'll address that TODO, and we'll see if it fixes the slow leader elections...

[1]: https://github.com/openshift/cluster-version-operator/blob/ed864d6f1ed3b43e7ec719d8b3691813a05cc34f/pkg/start/start.go#L136
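
To illustrate the hypothesis above, here is a minimal toy model in Python. It is not the CVO's code (the CVO is written in Go), and the `Lease` class, the `handoff_delay` function, and the 90-second lease and 5-second retry values are all hypothetical, chosen only to show why an unreleased lease makes the incoming process wait for expiry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """Toy model of a leader lease; every name here is hypothetical."""
    holder: Optional[str] = None
    expires_at: float = 0.0

    def try_acquire(self, candidate: str, now: float, duration: float) -> bool:
        # A candidate may take the lease if it is unheld, already theirs, or expired.
        if self.holder in (None, candidate) or now >= self.expires_at:
            self.holder = candidate
            self.expires_at = now + duration
            return True
        return False

    def release(self, holder: str) -> None:
        # Graceful shutdown: clear the lease so a successor need not wait for expiry.
        if self.holder == holder:
            self.holder = None
            self.expires_at = 0.0


def handoff_delay(release_on_shutdown: bool, duration: float = 90.0,
                  retry_period: float = 5.0) -> float:
    """Return how many seconds the incoming process waits before becoming leader."""
    lease = Lease()
    lease.try_acquire("outgoing", now=0.0, duration=duration)
    shutdown_at = 10.0  # assumed: the outgoing process exits 10s after its last renewal
    if release_on_shutdown:
        lease.release("outgoing")
    now = shutdown_at
    # The incoming process retries until the lease is free or has expired.
    while not lease.try_acquire("incoming", now=now, duration=duration):
        now += retry_period
    return now - shutdown_at
```

With these made-up numbers, the incoming process becomes leader immediately when the outgoing one releases the lease, and otherwise keeps retrying until the old lease expires.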

--- Additional comment from W. Trevor King on 2020-08-26 18:08:34 UTC ---

For Sippy, from bug 1872826, seems like some test suites call this test:

  [sig-cluster-lifecycle] Cluster version operator acknowledges upgrade

Comment 1 W. Trevor King 2020-08-27 04:46:17 UTC

Setting "No Doc Update", because the effect of this was a new CVO possibly taking a minute or two to pick up the orphaned lease.  Doesn't seem like a big deal, and we don't have a formal commitment around how quickly we turn that status acknowledgement around.

Comment 2 W. Trevor King 2020-09-12 20:48:53 UTC

4.6 bug was re-opened, so the backport won't land until that's been verified.

Comment 3 W. Trevor King 2020-10-04 02:25:30 UTC

Waiting on the patch manager to tag us in.

Comment 6 liujia 2020-10-13 03:52:56 UTC

From the CI test results in the past 48h, there is still one "Cluster version operator acknowledges upgrade" failure, in job "release-openshift-origin-installer-e2e-aws-upgrade".

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1315617602956955648

2020/10/12 11:37:47 Resolved release initial to registry.svc.ci.openshift.org/ocp/release:4.5.14
2020/10/12 11:37:47 Resolved release latest to registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-10-12-113401

fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:137]: during upgrade to registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-10-12-113401
Unexpected error:
    <*errors.errorString | 0xc001ac2440>: {
        s: "timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition",
    }
    timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition
occurred

Comment 7 W. Trevor King 2020-10-13 04:49:49 UTC

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.5.14-x86_64 | grep cluster-version-operator
  cluster-version-operator                       https://github.com/openshift/cluster-version-operator                       8bfbc20d65c296f48b21d40c47a443fefc0c8d77
$ git --no-pager log --first-parent --oneline -3 origin/release-4.5
0a34ac3632 (origin/release-4.5) Merge pull request #470 from wking/v0.18-go-clients
2c849e5729 Merge pull request #446 from wking/gracefully-release-leader-lease-4.5
8bfbc20d65 Merge pull request #433 from openshift-cherrypick-robot/cherry-pick-428-to-release-4.5

So the outgoing version (4.5.14) was built from 8bfbc20d65, which predates the #446 merge, and didn't have the fix yet.
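
The same check can be sketched in Python. This is an illustration only: `release_includes_fix` is a hypothetical helper, and it assumes the input is newest-first `git log --first-parent --oneline` output that reaches back to the commit the release was built from.

```python
from typing import List


def release_includes_fix(log_lines: List[str], release_commit: str,
                         fix_marker: str) -> bool:
    """Report whether the fix merge is at or before the release commit.

    log_lines: newest-first `git log --first-parent --oneline` output.
    release_commit: full hash of the commit the release was built from.
    fix_marker: text identifying the fix merge, e.g. "pull request #446".
    """
    seen_release = False
    for line in log_lines:
        if not line.strip():
            continue  # skip blank lines
        short_hash = line.split()[0]
        if release_commit.startswith(short_hash):
            seen_release = True
        # Only merges at or below (older than) the release commit are in the release.
        if seen_release and fix_marker in line:
            return True
    if not seen_release:
        raise ValueError("release commit not found in log")
    return False


log = [
    "0a34ac3632 (origin/release-4.5) Merge pull request #470 from wking/v0.18-go-clients",
    "2c849e5729 Merge pull request #446 from wking/gracefully-release-leader-lease-4.5",
    "8bfbc20d65 Merge pull request #433 from openshift-cherrypick-robot/cherry-pick-428-to-release-4.5",
]
included = release_includes_fix(
    log, "8bfbc20d65c296f48b21d40c47a443fefc0c8d77", "pull request #446")
```

Here `included` is false, because the #446 merge sits above (is newer than) the commit 4.5.14 was built from. In a real checkout, `git merge-base --is-ancestor` answers the same question without parsing log text.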

Comment 8 liujia 2020-10-13 08:51:01 UTC

According to the above, the original failure has not appeared in job "release-openshift-origin-installer-e2e-aws-upgrade" in the past 48h. The only v4.5-to-v4.6 upgrade failure is against the v4.5.14 stable build, which does not include the fix yet. Moreover, I checked that there are two successful v4.5 upgrade jobs [1][2] (4.5.0-0.ci-2020-10-09-151943 to 4.5.0-0.ci-2020-10-12-153413). So moving the bug to verified.

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1315677855731945472
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1315687645593997312

Comment 10 errata-xmlrpc 2020-10-19 14:54:24 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.15 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4228

Comment 11 Hongkai Liu 2020-11-17 19:09:56 UTC

Saw another one today.
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.4-stable-to-4.5-ci/1328703700692111360

Comment 12 W. Trevor King 2020-11-17 19:37:50 UTC

[1] shows that it picked up the 4.5 target shortly after the test timed out.  Improving 4.4's lease-release logic would help, but a minute or so of bumpy lease handoff doesn't seem important enough to be worth mucking with the maintenance-phase 4.4 [2].

[1]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.4-stable-to-4.5-ci/1328703700692111360/artifacts/e2e-gcp-upgrade/clusterversion.json
[2]: https://access.redhat.com/support/policy/updates/openshift#dates
