Bug 1915559
Summary: | [sig-cluster-lifecycle] Cluster version operator acknowledges upgrade : timed out waiting for cluster to acknowledge upgrade | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Jack Ottofaro <jack.ottofaro> | |
Component: | Cluster Version Operator | Assignee: | Jack Ottofaro <jack.ottofaro> | |
Status: | CLOSED ERRATA | QA Contact: | liujia <jiajliu> | |
Severity: | medium | Docs Contact: | ||
Priority: | medium | |||
Version: | 4.6 | CC: | aos-bugs, bleanhar, bparees, jack.ottofaro, jiajliu, jialiu, jokerman, lmohanty, pmahajan, vrutkovs, wking | |
Target Milestone: | --- | Flags: | jiajliu:
needinfo-
|
|
Target Release: | 4.6.z | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | No Doc Update | ||
Doc Text: |
This change is an additional log output in a CI test.
|
Story Points: | --- | |
Clone Of: | 1909875 | |||
: | 1957927 (view as bug list) | Environment: |
[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early]
|
|
Last Closed: | 2021-05-20 11:52:25 UTC | Type: | --- | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1957927 |
Comment 1
W. Trevor King
2021-01-13 19:04:36 UTC
(In reply to W. Trevor King from comment #1) > Delays of a few minutes are not serious enough to block a release. Can we tune the test to a value that you feel is reasonable to block releases? To me that's what our tests should be tuned to reflect. We can negotiate with whomever picked the current value. 2-minute timeout is Clayton in [1]. I'm fine leaving that alone in master, because we want to be fast and hear about it when we are slow. And we definitely want to block if we actually hang (which we have done before, bug 1891143). It may be possible to distinguish between <2m (great), <10m (acceptable), and >10m (blocker), but I'm not sure how to represent "acceptable" in our output JUnit in a way we'd notice. If older releases are noisy with this test, I'm fine saying "risk of slowdown from great->acceptable is small, so let's bump the timeout to 10m and only hear about blockers". Thoughts? [1]: https://github.com/openshift/origin/commit/a53efd5e2788e7ce37b6e4b251e80bf2b4720739#diff-a3e17408eaaa387d9a91030c6d3cd0fe5ad10f976a605dc1d081b56a57f79162R414 Changing the sev/prio as per the parent bug. # w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Cluster+version+operator+acknowledges+upgrade&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job'|grep 'failures match' release-openshift-origin-installer-e2e-aws-upgrade (all) - 7 runs, 57% failed, 25% of failures match = 14% impact -> 4.4 to 4.5 upgrade periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade (all) - 24 runs, 58% failed, 36% of failures match = 21% impact -> 4.5 to 4.6 upgrade,error info is updated to "Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 1; updated.Generation: 2" periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-upgrade (all) - 8 runs, 88% failed, 14% of failures match = 13% impact -> 4.5 to 4.6 upgrade,error info is updated to "Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 1; updated.Generation: 2" periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-azure-upgrade (all) - 8 runs, 63% failed, 60% of failures match = 38% impact -> 4.4 to 4.5 upgrade release-openshift-origin-installer-old-rhcos-e2e-aws-4.6 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact -> 4.5 to registry.build02.ci.openshift.org/ci-op-7e71bcc2/release:upgrade periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-ovn-upgrade (all) - 21 runs, 100% failed, 19% of failures match = 19% impact -> 4.5 to 4.6 upgrade,error info is updated to "Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 1; updated.Generation: 2" periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-aws-upgrade (all) - 8 runs, 25% failed, 50% of failures match = 13% impact -> 4.4 to 4.5 upgrade periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-gcp-upgrade (all) - 8 runs, 38% failed, 67% of failures match = 25% impact -> 4.4 to 4.5 upgrade periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-gcp-ovn-upgrade (all) - 8 runs, 88% failed, 43% of failures match = 38% impact -> 4.4 to 4.5 upgrade release-openshift-origin-installer-e2e-aws-upgrade-4.5-to-4.6-to-4.7-to-4.8-ci (all) - 2 runs, 100% failed, 50% of failures match = 50% impact -> 4.5 to 4.8 upgrade,error info is updated to "Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 1; updated.Generation: 2" periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 25% of failures match = 25% impact -> 4.5 to 4.6 upgrade,error info is updated to "Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 1; updated.Generation: 2" periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade-rollback (all) - 2 runs, 100% failed, 50% of failures match = 50% impact -> 4.5 to 4.6 upgrade,error info is updated to "Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 1; updated.Generation: 2" From above ci test result, the error info was updated to expected one on v4.6 now. @Jack But i notice this pr is just for adding more log to help diagnose issue instead of fixing timeout error. Not sure if we still need keep this bug open for further fix or just close this one and track the issue in another bug. (In reply to liujia from comment #8) > # w3m -dump -cols 200 > 'https://search.ci.openshift.org/ > ?search=Cluster+version+operator+acknowledges+upgrade&maxAge=48h&context=1&ty > pe=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job'|grep > 'failures match' > release-openshift-origin-installer-e2e-aws-upgrade (all) - 7 runs, 57% > failed, 25% of failures match = 14% impact > -> 4.4 to 4.5 upgrade > periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws- > upgrade (all) - 24 runs, 58% failed, 36% of failures match = 21% impact > -> 4.5 to 4.6 upgrade,error info is updated to "Timed out waiting for > cluster to acknowledge upgrade: timed out waiting for the condition; > observedGeneration: 1; updated.Generation: 2" > periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp- > upgrade (all) - 8 runs, 88% failed, 14% of failures match = 13% impact > -> 4.5 to 4.6 upgrade,error info is updated to "Timed out waiting for > cluster to acknowledge upgrade: timed out waiting for the condition; > observedGeneration: 1; updated.Generation: 2" > periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e- > azure-upgrade (all) - 8 runs, 63% failed, 60% of failures match = 38% impact > -> 4.4 to 4.5 upgrade > release-openshift-origin-installer-old-rhcos-e2e-aws-4.6 (all) - 1 runs, > 100% failed, 100% of failures match = 100% impact > -> 4.5 to registry.build02.ci.openshift.org/ci-op-7e71bcc2/release:upgrade > periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws- > ovn-upgrade (all) - 21 runs, 100% failed, 19% of failures match = 19% impact > -> 4.5 to 4.6 upgrade,error info is updated to "Timed out waiting for > cluster to acknowledge upgrade: timed out waiting for the condition; > observedGeneration: 1; updated.Generation: 2" > periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-aws- > upgrade (all) - 8 runs, 25% failed, 50% of failures match = 13% impact > -> 4.4 to 4.5 upgrade > periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-gcp- > upgrade (all) - 8 runs, 38% failed, 67% of failures match = 25% impact > -> 4.4 to 4.5 upgrade > periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-gcp- > ovn-upgrade (all) - 8 runs, 88% failed, 43% of failures match = 38% impact > -> 4.4 to 4.5 upgrade > release-openshift-origin-installer-e2e-aws-upgrade-4.5-to-4.6-to-4.7-to-4.8- > ci (all) - 2 runs, 100% failed, 50% of failures match = 50% impact > -> 4.5 to 4.8 upgrade,error info is updated to "Timed out waiting for > cluster to acknowledge upgrade: timed out waiting for the condition; > observedGeneration: 1; updated.Generation: 2" > periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp- > ovn-upgrade (all) - 4 runs, 100% failed, 25% of failures match = 25% impact > -> 4.5 to 4.6 upgrade,error info is updated to "Timed out waiting for > cluster to acknowledge upgrade: timed out waiting for the condition; > observedGeneration: 1; updated.Generation: 2" > periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws- > upgrade-rollback (all) - 2 runs, 100% failed, 50% of failures match = 50% > impact > -> 4.5 to 4.6 upgrade,error info is updated to "Timed out waiting for > cluster to acknowledge upgrade: timed out waiting for the condition; > observedGeneration: 1; updated.Generation: 2" > > From above ci test result, the error info was updated to expected one on > v4.6 now. > > @Jack > But i notice this pr is just for adding more log to help diagnose issue > instead of fixing timeout error. Not sure if we still need keep this bug > open for further fix or just close this one and track the issue in another > bug. We can leave it open and I'll use it to work the issue. Thanks. ok, update bug's status. (In reply to liujia from comment #10) > ok, update bug's status. Revisiting this and looking at the logs in detail again, the CVO "upgrade pod" just misses updating the version/generation by seconds before the Test's 2 minute t/o blows. I'm sure due to the connection errors and churn that accompanies it whenever this happens. I believe this issue only occurs in version 4.5 CI so I have dup'ed this bug with a 4.5.z target and have created a PR to bump the test t/o to 4 minutes. Since no change should be required for 4.6 can we move this to verified so bug 1957927 is valid? Gotcha, then let's track the issue in bz1957927 and close this one. According to comment8 and comment13, move the bug to veirfy. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.29 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:1521 |