Bug 1915559 - [sig-cluster-lifecycle] Cluster version operator acknowledges upgrade : timed out waiting for cluster to acknowledge upgrade
Summary: [sig-cluster-lifecycle] Cluster version operator acknowledges upgrade : timed out waiting for cluster to acknowledge upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: medium
Target Milestone: ---
Target Release: 4.6.z
Assignee: Jack Ottofaro
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks: 1957927
 
Reported: 2021-01-12 22:06 UTC by Jack Ottofaro
Modified: 2021-05-20 11:53 UTC (History)
11 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
This change is an additional log output in a CI test.
Clone Of: 1909875
: 1957927
Environment:
[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early]
Last Closed: 2021-05-20 11:52:25 UTC
Target Upstream Version:
jiajliu: needinfo-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift origin pull 26055 0 None open [release-4.6] [release-4.7] Bug 1915559: upgrade/upgrade.go: Enhance upgrade ack time out error 2021-04-09 16:20:20 UTC
Red Hat Product Errata RHBA-2021:1521 0 None None None 2021-05-20 11:53:02 UTC

Comment 1 W. Trevor King 2021-01-13 19:04:36 UTC
Delays of a few minutes are not serious enough to block a release.

Comment 2 Scott Dodson 2021-01-13 19:10:33 UTC
(In reply to W. Trevor King from comment #1)
> Delays of a few minutes are not serious enough to block a release.

Can we tune the test to a value that you feel is reasonable to block releases on? To me, that's what our tests should be tuned to reflect. We can negotiate with whoever picked the current value.

Comment 4 W. Trevor King 2021-01-13 19:44:52 UTC
The 2-minute timeout is from Clayton in [1].  I'm fine leaving that alone in master, because we want to be fast and hear about it when we are slow.  And we definitely want to block if we actually hang (which we have done before, bug 1891143).  It may be possible to distinguish between <2m (great), <10m (acceptable), and >10m (blocker), but I'm not sure how to represent "acceptable" in our JUnit output in a way we'd notice.  If older releases are noisy with this test, I'm fine saying "the risk of a slowdown from great to acceptable is small, so let's bump the timeout to 10m and only hear about blockers".  Thoughts?

[1]: https://github.com/openshift/origin/commit/a53efd5e2788e7ce37b6e4b251e80bf2b4720739#diff-a3e17408eaaa387d9a91030c6d3cd0fe5ad10f976a605dc1d081b56a57f79162R414

Comment 5 Lalatendu Mohanty 2021-03-09 17:35:18 UTC
Changing the sev/prio as per the parent bug.

Comment 8 liujia 2021-04-13 02:31:39 UTC
# w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Cluster+version+operator+acknowledges+upgrade&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job'|grep 'failures match'
release-openshift-origin-installer-e2e-aws-upgrade (all) - 7 runs, 57% failed, 25% of failures match = 14% impact
-> 4.4 to 4.5 upgrade
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade (all) - 24 runs, 58% failed, 36% of failures match = 21% impact
-> 4.5 to 4.6 upgrade, error info is updated to "Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 1; updated.Generation: 2"
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-upgrade (all) - 8 runs, 88% failed, 14% of failures match = 13% impact
-> 4.5 to 4.6 upgrade, error info is updated to "Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 1; updated.Generation: 2"
periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-azure-upgrade (all) - 8 runs, 63% failed, 60% of failures match = 38% impact
-> 4.4 to 4.5 upgrade
release-openshift-origin-installer-old-rhcos-e2e-aws-4.6 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
-> 4.5 to registry.build02.ci.openshift.org/ci-op-7e71bcc2/release:upgrade 
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-ovn-upgrade (all) - 21 runs, 100% failed, 19% of failures match = 19% impact
-> 4.5 to 4.6 upgrade, error info is updated to "Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 1; updated.Generation: 2"
periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-aws-upgrade (all) - 8 runs, 25% failed, 50% of failures match = 13% impact
-> 4.4 to 4.5 upgrade
periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-gcp-upgrade (all) - 8 runs, 38% failed, 67% of failures match = 25% impact
-> 4.4 to 4.5 upgrade
periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-gcp-ovn-upgrade (all) - 8 runs, 88% failed, 43% of failures match = 38% impact
-> 4.4 to 4.5 upgrade
release-openshift-origin-installer-e2e-aws-upgrade-4.5-to-4.6-to-4.7-to-4.8-ci (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
-> 4.5 to 4.8 upgrade, error info is updated to "Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 1; updated.Generation: 2"
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
-> 4.5 to 4.6 upgrade, error info is updated to "Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 1; updated.Generation: 2"
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade-rollback (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
-> 4.5 to 4.6 upgrade, error info is updated to "Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 1; updated.Generation: 2"
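For reference, the "impact" figure in the search output above appears to be simply the failure rate multiplied by the match rate, i.e. the fraction of all runs that both failed and matched the query. A minimal check of that arithmetic:

```go
package main

import "fmt"

// impact reproduces the apparent arithmetic behind the ci-search
// "impact" column: the fraction of all runs that both failed and
// matched the query. Rates are fractions in [0, 1].
func impact(failRate, matchRate float64) float64 {
	return failRate * matchRate
}

func main() {
	// 7 runs, 57% failed, 25% of failures match -> ~14% impact,
	// matching the first row above.
	fmt.Printf("%.0f%%\n", 100*impact(0.57, 0.25))
	// prints "14%"
}
```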

From the above CI test results, the error info has been updated to the expected one on v4.6 now.

@Jack
But I notice this PR just adds more logging to help diagnose the issue rather than fixing the timeout error. I'm not sure whether we should keep this bug open for a further fix or close this one and track the issue in another bug.

Comment 9 Jack Ottofaro 2021-04-13 13:01:38 UTC
(In reply to liujia from comment #8)
> [...]
> 
> @Jack
> But I notice this PR just adds more logging to help diagnose the issue
> rather than fixing the timeout error. I'm not sure whether we should
> keep this bug open for a further fix or close this one and track the
> issue in another bug.

We can leave it open and I'll use it to work the issue. Thanks.

Comment 10 liujia 2021-04-14 00:40:51 UTC
OK, updating the bug's status.

Comment 13 Jack Ottofaro 2021-05-06 21:25:46 UTC
(In reply to liujia from comment #10)
> ok, update bug's status.

Revisiting this and looking at the logs in detail again, the CVO "upgrade pod" misses updating the version/generation by just seconds before the test's 2-minute timeout expires. I'm sure this is due to the connection errors and churn that accompany the upgrade whenever this happens. I believe this issue only occurs in 4.5 CI, so I have dup'ed this bug with a 4.5.z target and have created a PR to bump the test timeout to 4 minutes. Since no change should be required for 4.6, can we move this to verified so bug 1957927 is valid?

Comment 14 liujia 2021-05-07 01:26:37 UTC
Gotcha, then let's track the issue in bz1957927 and close this one.

According to comment 8 and comment 13, moving the bug to verified.

Comment 17 errata-xmlrpc 2021-05-20 11:52:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.29 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1521

