Delays of a few minutes are not serious enough to block a release.
(In reply to W. Trevor King from comment #1)
> Delays of a few minutes are not serious enough to block a release.

Can we tune the test to a value that you feel is reasonable to block releases? To me, that's what our tests should be tuned to reflect. We can negotiate with whomever picked the current value.
The 2-minute timeout is Clayton's, from [1]. I'm fine leaving that alone in master, because we want to be fast and hear about it when we are slow. And we definitely want to block if we actually hang (which we have done before, bug 1891143).

It may be possible to distinguish between <2m (great), <10m (acceptable), and >10m (blocker), but I'm not sure how to represent "acceptable" in our output JUnit in a way we'd notice. If older releases are noisy with this test, I'm fine saying "the risk of a slowdown from great to acceptable is small, so let's bump the timeout to 10m and only hear about blockers". Thoughts?

[1]: https://github.com/openshift/origin/commit/a53efd5e2788e7ce37b6e4b251e80bf2b4720739#diff-a3e17408eaaa387d9a91030c6d3cd0fe5ad10f976a605dc1d081b56a57f79162R414
Changing the sev/prio as per the parent bug.
# w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Cluster+version+operator+acknowledges+upgrade&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job' | grep 'failures match'

release-openshift-origin-installer-e2e-aws-upgrade (all) - 7 runs, 57% failed, 25% of failures match = 14% impact
-> 4.4 to 4.5 upgrade

periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade (all) - 24 runs, 58% failed, 36% of failures match = 21% impact
-> 4.5 to 4.6 upgrade; error info is updated to "Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 1; updated.Generation: 2"

periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-upgrade (all) - 8 runs, 88% failed, 14% of failures match = 13% impact
-> 4.5 to 4.6 upgrade; error info is updated to "Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 1; updated.Generation: 2"

periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-azure-upgrade (all) - 8 runs, 63% failed, 60% of failures match = 38% impact
-> 4.4 to 4.5 upgrade

release-openshift-origin-installer-old-rhcos-e2e-aws-4.6 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
-> 4.5 to registry.build02.ci.openshift.org/ci-op-7e71bcc2/release:upgrade

periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-ovn-upgrade (all) - 21 runs, 100% failed, 19% of failures match = 19% impact
-> 4.5 to 4.6 upgrade; error info is updated to "Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 1; updated.Generation: 2"

periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-aws-upgrade (all) - 8 runs, 25% failed, 50% of failures match = 13% impact
-> 4.4 to 4.5 upgrade

periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-gcp-upgrade (all) - 8 runs, 38% failed, 67% of failures match = 25% impact
-> 4.4 to 4.5 upgrade

periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-gcp-ovn-upgrade (all) - 8 runs, 88% failed, 43% of failures match = 38% impact
-> 4.4 to 4.5 upgrade

release-openshift-origin-installer-e2e-aws-upgrade-4.5-to-4.6-to-4.7-to-4.8-ci (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
-> 4.5 to 4.8 upgrade; error info is updated to "Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 1; updated.Generation: 2"

periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
-> 4.5 to 4.6 upgrade; error info is updated to "Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 1; updated.Generation: 2"

periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade-rollback (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
-> 4.5 to 4.6 upgrade; error info is updated to "Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 1; updated.Generation: 2"

From the above CI test results, the error info has been updated to the expected message on v4.6 now.

@Jack
But I notice this PR just adds more logging to help diagnose the issue instead of fixing the timeout error. I'm not sure whether we still need to keep this bug open for a further fix, or close this one and track the issue in another bug.
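For reference, the "impact" figure in the search output above is the fraction of all runs whose failure matched the search string, i.e. (failed runs / total runs) x (matching failures / failed runs). A hedged sketch of that arithmetic (the raw counts are back-calculated from the percentages, not taken from the tool's data):

```go
package main

import "fmt"

// impactPercent reconstructs the search.ci "impact" figure from raw counts:
// the share of all runs whose failure matched the search. Illustrative
// helper, not the tool's actual source.
func impactPercent(runs, matching int) float64 {
	if runs == 0 {
		return 0
	}
	return 100 * float64(matching) / float64(runs)
}

func main() {
	// release-openshift-origin-installer-e2e-aws-upgrade: 7 runs, 57% failed
	// (4 of 7), 25% of failures match (1 of 4), so 1 of 7 runs = 14% impact.
	fmt.Printf("%.0f%% impact\n", impactPercent(7, 1))
	// prints: 14% impact
}
```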
(In reply to liujia from comment #8)
> [CI search results snipped; see comment #8]
>
> From the above CI test results, the error info has been updated to the
> expected message on v4.6 now.
>
> @Jack
> But I notice this PR just adds more logging to help diagnose the issue
> instead of fixing the timeout error. I'm not sure whether we still need to
> keep this bug open for a further fix, or close this one and track the issue
> in another bug.

We can leave it open and I'll use it to work the issue. Thanks.
OK, updating the bug's status.
(In reply to liujia from comment #10)
> ok, update bug's status.

Revisiting this and looking at the logs in detail again: the CVO "upgrade pod" misses updating the version/generation by just seconds before the test's 2-minute timeout fires. I'm sure that's due to the connection errors and churn that accompany it whenever this happens. I believe this issue only occurs in 4.5 CI, so I have duplicated this bug with a 4.5.z target and have created a PR to bump the test timeout to 4 minutes. Since no change should be required for 4.6, can we move this to Verified so bug 1957927 is valid?
Gotcha. Then let's track the issue in bz1957927 and close this one. Per comment #8 and comment #13, moving the bug to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.29 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:1521