+++ This bug was initially created as a clone of Bug #1843505 +++

Description of problem:

Upgrade jobs failing with:

fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:138]: during upgrade to registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-06-03-045338
Unexpected error:
    <*errors.errorString | 0xc0022d5740>: {
        s: "Cluster did not acknowledge request to upgrade in a reasonable time: timed out waiting for the condition",
    }
    Cluster did not acknowledge request to upgrade in a reasonable time: timed out waiting for the condition
occurred

example: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107

recurring quite a bit: https://search.apps.build01.ci.devcluster.openshift.com/?search=Cluster+did+not+acknowledge+request+to+upgrade+in+a+reasonable+time&maxAge=48h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

--- Additional comment from W. Trevor King on 2020-08-03 19:56:22 UTC ---

Abhinav points out that the delay may be the outgoing CVO not releasing the lease, so the incoming CVO has to wait for the old lease to expire. We have a TODO about setting ReleaseOnCancel [1]. I'll address that TODO, and we'll see if it fixes the slow leader elections...

[1]: https://github.com/openshift/cluster-version-operator/blob/ed864d6f1ed3b43e7ec719d8b3691813a05cc34f/pkg/start/start.go#L136

--- Additional comment from W. Trevor King on 2020-08-26 18:08:34 UTC ---

For Sippy, from bug 1872826, it seems some test suites call this test:

[sig-cluster-lifecycle] Cluster version operator acknowledges upgrade
Setting "No Doc Update", because the effect of this was just a new CVO possibly taking a minute or two to pick up the orphaned lease. That doesn't seem like a big deal, and we don't have a formal commitment around how quickly we turn that status acknowledgement around.
The 4.6 bug was re-opened, so the backport won't land until that's been verified.
Waiting on the patch manager to tag us in.
From CI test results in the past 48h, there is still one failure of "Cluster version operator acknowledges upgrade" in the job "release-openshift-origin-installer-e2e-aws-upgrade":

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1315617602956955648

2020/10/12 11:37:47 Resolved release initial to registry.svc.ci.openshift.org/ocp/release:4.5.14
2020/10/12 11:37:47 Resolved release latest to registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-10-12-113401
fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:137]: during upgrade to registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-10-12-113401
Unexpected error:
    <*errors.errorString | 0xc001ac2440>: {
        s: "timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition",
    }
    timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition
occurred
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.5.14-x86_64 | grep cluster-version-operator
  cluster-version-operator  https://github.com/openshift/cluster-version-operator  8bfbc20d65c296f48b21d40c47a443fefc0c8d77
$ git --no-pager log --first-parent --oneline -3 origin/release-4.5
0a34ac3632 (origin/release-4.5) Merge pull request #470 from wking/v0.18-go-clients
2c849e5729 Merge pull request #446 from wking/gracefully-release-leader-lease-4.5
8bfbc20d65 Merge pull request #433 from openshift-cherrypick-robot/cherry-pick-428-to-release-4.5

So the outgoing 4.5.14 version did not yet include the fix.
According to the above, the original failure has not occurred in the job "release-openshift-origin-installer-e2e-aws-upgrade" in the past 48h. The only failure of a 4.5-to-4.6 upgrade was against the 4.5.14 stable build, which does not include the fix yet. Moreover, I checked two successful 4.5 upgrade jobs [1][2] (4.5.0-0.ci-2020-10-09-151943 to 4.5.0-0.ci-2020-10-12-153413). So moving the bug to Verified.

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1315677855731945472
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1315687645593997312
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5.15 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4228
Saw another one today:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.4-stable-to-4.5-ci/1328703700692111360
[1] shows that it picked up the 4.5 target shortly after the test timed out. Improving 4.4's lease-release logic would help, but a minute or so of bumpy lease handoff doesn't seem important enough to be worth mucking with the maintenance-phase 4.4 [2].

[1]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.4-stable-to-4.5-ci/1328703700692111360/artifacts/e2e-gcp-upgrade/clusterversion.json
[2]: https://access.redhat.com/support/policy/updates/openshift#dates