Bug 1872906

Summary: Cluster did not acknowledge request to upgrade in a reasonable time
Product: OpenShift Container Platform
Component: Cluster Version Operator
Version: 4.5
Status: CLOSED ERRATA
Severity: medium
Priority: low
Reporter: W. Trevor King <wking>
Assignee: W. Trevor King <wking>
QA Contact: liujia <jiajliu>
Docs Contact:
CC: aos-bugs, bleanhar, bparees, deads, dosmith, hongkliu, jack.ottofaro, jiajliu, jokerman, sdodson, wking, yanyang
Target Release: 4.5.z
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Clone Of: 1843505
Last Closed: 2020-10-19 14:54:24 UTC
Bug Depends On: 1843505

Description W. Trevor King 2020-08-26 21:22:02 UTC
+++ This bug was initially created as a clone of Bug #1843505 +++

Description of problem:

Upgrade jobs failing with:

fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:138]: during upgrade to registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-06-03-045338
Unexpected error:
    <*errors.errorString | 0xc0022d5740>: {
        s: "Cluster did not acknowledge request to upgrade in a reasonable time: timed out waiting for the condition",
    }
    Cluster did not acknowledge request to upgrade in a reasonable time: timed out waiting for the condition
occurred

example:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/31107

recurring quite a bit:
https://search.apps.build01.ci.devcluster.openshift.com/?search=Cluster+did+not+acknowledge+request+to+upgrade+in+a+reasonable+time&maxAge=48h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
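
For readers unfamiliar with the two-part error message: the trailing "timed out waiting for the condition" is the generic timeout error from the Kubernetes wait helpers, and the leading text is the prefix the test adds. The following is a minimal Python sketch of that poll-then-wrap pattern, not the code in upgrade.go. The names `poll`, `wait_for_ack`, `FakeClock`, and `try_upgrade`, the 5 s interval, and the timeouts in the usage note are all assumptions made for illustration.

```python
import time
from typing import Callable


class WaitTimeout(Exception):
    """Raised when the condition never became true before the deadline."""


def poll(condition: Callable[[], bool], interval: float, timeout: float,
         clock: Callable[[], float] = time.monotonic,
         sleep: Callable[[float], None] = time.sleep) -> None:
    """Check `condition` immediately, then every `interval` seconds until `timeout`."""
    deadline = clock() + timeout
    while not condition():
        if clock() >= deadline:
            raise WaitTimeout("timed out waiting for the condition")
        sleep(interval)


def wait_for_ack(acknowledged: Callable[[], bool], interval: float,
                 timeout: float, **kwargs) -> None:
    """Wrap the generic timeout with a message naming what was being awaited."""
    try:
        poll(acknowledged, interval, timeout, **kwargs)
    except WaitTimeout as err:
        raise RuntimeError(
            "Cluster did not acknowledge request to upgrade "
            f"in a reasonable time: {err}"
        ) from err


class FakeClock:
    """Deterministic clock so the sketch runs instantly (hypothetical helper)."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def try_upgrade(ack_after: float, timeout: float) -> str:
    """Simulate a cluster that acknowledges the request `ack_after` seconds in."""
    clock = FakeClock()
    try:
        wait_for_ack(lambda: clock.now >= ack_after, interval=5.0,
                     timeout=timeout, clock=clock.time, sleep=clock.sleep)
        return "acknowledged"
    except RuntimeError as err:
        return str(err)
```

With these assumed numbers, `try_upgrade(ack_after=10.0, timeout=120.0)` returns "acknowledged", while `try_upgrade(ack_after=150.0, timeout=120.0)` returns the same message seen in the failing jobs. The point is that a slow acknowledgement and a missing acknowledgement look identical to the test.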

--- Additional comment from W. Trevor King on 2020-08-03 19:56:22 UTC ---

Abhinav points out that the delay may be the outgoing CVO not releasing the lease, so the incoming CVO has to wait for the old lease to expire.  We have a TODO about setting ReleaseOnCancel [1].  I'll address that TODO, and we'll see if it fixes the slow leader elections...

[1]: https://github.com/openshift/cluster-version-operator/blob/ed864d6f1ed3b43e7ec719d8b3691813a05cc34f/pkg/start/start.go#L136
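
To make the lease-handoff theory concrete, here is a toy Python model of why an unreleased lease delays the incoming leader. It is not the CVO or client-go implementation. The `Lease` fields, the function names, and every number (a 90 s lease duration and the timestamps) are assumptions chosen only to illustrate the effect of releasing on shutdown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """Toy stand-in for a leader-election lease record (hypothetical fields)."""
    holder: Optional[str]  # None means the lease was explicitly released
    renewed_at: float      # seconds; time of the holder's last renewal
    duration: float        # seconds the lease stays valid after a renewal


def shutdown(lease: Lease, now: float, release_on_cancel: bool) -> Lease:
    """Outgoing leader exits at `now`; it clears the holder only if asked to."""
    if release_on_cancel:
        return Lease(holder=None, renewed_at=now, duration=lease.duration)
    return lease  # orphaned: still names the old holder until it expires


def wait_before_takeover(lease: Lease, now: float) -> float:
    """Seconds a new candidate arriving at `now` must wait before acquiring."""
    if lease.holder is None:
        return 0.0
    expires_at = lease.renewed_at + lease.duration
    return max(0.0, expires_at - now)


# Assumed timeline: lease last renewed at t=100, old leader exits at t=110,
# new candidate arrives at t=115.
old = Lease(holder="outgoing-cvo", renewed_at=100.0, duration=90.0)
orphaned = shutdown(old, now=110.0, release_on_cancel=False)
released = shutdown(old, now=110.0, release_on_cancel=True)

print(wait_before_takeover(orphaned, now=115.0))  # 75.0
print(wait_before_takeover(released, now=115.0))  # 0.0
```

In this model the orphaned lease costs the incoming process the remainder of the lease duration, while an explicit release lets it take over immediately.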

--- Additional comment from W. Trevor King on 2020-08-26 18:08:34 UTC ---

For Sippy (from bug 1872826): it seems some test suites report this failure under the test name:

  [sig-cluster-lifecycle] Cluster version operator acknowledges upgrade

Comment 1 W. Trevor King 2020-08-27 04:46:17 UTC
Setting "No Doc Update", because the effect of this was a new CVO possibly taking a minute or two to pick up the orphaned lease.  Doesn't seem like a big deal, and we don't have a formal commitment around how quickly we turn that status acknowledgement around.

Comment 2 W. Trevor King 2020-09-12 20:48:53 UTC
4.6 bug was re-opened, so the backport won't land until that's been verified.

Comment 3 W. Trevor King 2020-10-04 02:25:30 UTC
Waiting on the patch manager to tag us in.

Comment 6 liujia 2020-10-13 03:52:56 UTC
From CI test results in the past 48h, there is still one failure of "Cluster version operator acknowledges upgrade" in job "release-openshift-origin-installer-e2e-aws-upgrade".

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1315617602956955648

2020/10/12 11:37:47 Resolved release initial to registry.svc.ci.openshift.org/ocp/release:4.5.14
2020/10/12 11:37:47 Resolved release latest to registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-10-12-113401

fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:137]: during upgrade to registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-10-12-113401
Unexpected error:
    <*errors.errorString | 0xc001ac2440>: {
        s: "timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition",
    }
    timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition
occurred

Comment 7 W. Trevor King 2020-10-13 04:49:49 UTC
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.5.14-x86_64 | grep cluster-version-operator
  cluster-version-operator                       https://github.com/openshift/cluster-version-operator                       8bfbc20d65c296f48b21d40c47a443fefc0c8d77
$ git --no-pager log --first-parent --oneline -3 origin/release-4.5
0a34ac3632 (origin/release-4.5) Merge pull request #470 from wking/v0.18-go-clients
2c849e5729 Merge pull request #446 from wking/gracefully-release-leader-lease-4.5
8bfbc20d65 Merge pull request #433 from openshift-cherrypick-robot/cherry-pick-428-to-release-4.5

So the outgoing version (4.5.14, built from 8bfbc20d65) didn't have the fix yet; the lease-release change (#446) is the next merge after it on release-4.5.
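
The check in this comment can be written down as a small Python sketch that reads the first-parent log quoted above. It assumes, based on the branch name, that PR #446 (gracefully-release-leader-lease-4.5) is the lease-release fix; the function names are hypothetical.

```python
from typing import List

# First-parent history of origin/release-4.5 as quoted above, newest first.
FIRST_PARENT_LOG = """\
0a34ac3632 (origin/release-4.5) Merge pull request #470 from wking/v0.18-go-clients
2c849e5729 Merge pull request #446 from wking/gracefully-release-leader-lease-4.5
8bfbc20d65 Merge pull request #433 from openshift-cherrypick-robot/cherry-pick-428-to-release-4.5
"""


def short_hashes(log: str) -> List[str]:
    """Return the abbreviated hashes from `git log --oneline` output."""
    return [line.split()[0] for line in log.splitlines() if line.strip()]


def position(commit: str, hashes: List[str]) -> int:
    """Index of `commit` (full or abbreviated hash) in the newest-first list."""
    for index, short in enumerate(hashes):
        if commit.startswith(short) or short.startswith(commit):
            return index
    raise ValueError(f"{commit} is not in the quoted log")


def build_includes(build_commit: str, fix_commit: str, log: str) -> bool:
    """True if `fix_commit` is at or below `build_commit` in first-parent order."""
    hashes = short_hashes(log)
    # The log is newest first, so a smaller index means a later commit.
    return position(build_commit, hashes) <= position(fix_commit, hashes)


# 4.5.14 was built from the commit reported by `oc adm release info`.
RELEASE_4_5_14 = "8bfbc20d65c296f48b21d40c47a443fefc0c8d77"
LEASE_FIX_MERGE = "2c849e5729"  # assumed to be the fix, per the branch name

print(build_includes(RELEASE_4_5_14, LEASE_FIX_MERGE, FIRST_PARENT_LOG))  # False
print(build_includes("0a34ac3632", LEASE_FIX_MERGE, FIRST_PARENT_LOG))    # True
```

In a real checkout, `git merge-base --is-ancestor <fix> <build>` answers the same question without parsing log output.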

Comment 8 liujia 2020-10-13 08:51:01 UTC
According to the above, we have not seen the original failure in job "release-openshift-origin-installer-e2e-aws-upgrade" in the past 48h. The only failure of a v4.5-to-v4.6 upgrade is against the v4.5.14 stable build, which does not include the fix yet. Moreover, I checked that there are two successful v4.5 upgrade jobs [1][2] (4.5.0-0.ci-2020-10-09-151943 to 4.5.0-0.ci-2020-10-12-153413). So moving the bug to VERIFIED.

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1315677855731945472
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1315687645593997312

Comment 10 errata-xmlrpc 2020-10-19 14:54:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.15 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4228

Comment 12 W. Trevor King 2020-11-17 19:37:50 UTC
[1] shows that it picked up the 4.5 target shortly after the test timed out.  Improving 4.4's lease-release logic would help, but a minute or so of bumpy lease handoff doesn't seem important enough to be worth mucking with the maintenance-phase 4.4 [2].

[1]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.4-stable-to-4.5-ci/1328703700692111360/artifacts/e2e-gcp-upgrade/clusterversion.json
[2]: https://access.redhat.com/support/policy/updates/openshift#dates