Description of problem:
The upgrade job has been broken for a long time:
https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade
There could be multiple factors at play here. The goal is to see whether any improvements can be made on the networking side of things.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
Upgrades going smoothly

Additional info:
IMO the synthetic tests that fail on "waiting for flows" are overly aggressive, and this should not be a blocker for the 4.8 release because it is not a regression. Instead, we should turn the synthetic "waiting for flows" tests into flakes for 4.8.
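To illustrate what "making a test a flake" could mean in practice, here is a minimal sketch. It assumes the convention that a test name reported as both failed and passed in the same run is treated as a flake rather than a failure. The `testCase` type and the `asFlake` and `classify` helpers are hypothetical and are not taken from openshift/origin.

```go
package main

import "fmt"

// testCase is a hypothetical, simplified stand-in for a JUnit test case record.
type testCase struct {
	Name    string
	Failure string // empty means the case passed
}

// asFlake returns the results to report for a check that should never
// hard-fail the job: the failure (if any) followed by a passing entry
// with the same name.
func asFlake(name, failure string) []testCase {
	results := []testCase{}
	if failure != "" {
		results = append(results, testCase{Name: name, Failure: failure})
	}
	return append(results, testCase{Name: name})
}

// classify applies the assumed convention: a name seen as both failed
// and passed counts as a flake.
func classify(results []testCase, name string) string {
	passed, failed := false, false
	for _, r := range results {
		if r.Name != name {
			continue
		}
		if r.Failure == "" {
			passed = true
		} else {
			failed = true
		}
	}
	switch {
	case passed && failed:
		return "flake"
	case failed:
		return "fail"
	case passed:
		return "pass"
	default:
		return "missing"
	}
}

func main() {
	name := "waiting for flows" // illustrative test name only
	results := asFlake(name, "flows were not programmed in time")
	fmt.Println(name, "=>", classify(results, name))
}
```

With this shape the failure text still shows up in the job results, so the signal is not lost, but the job itself is not marked failed because of it.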
One permafailing test case in these upgrade jobs is not OVN-specific and looks like it will be addressed when this 4.7 backport is merged: https://bugzilla.redhat.com/show_bug.cgi?id=1959238
Another permafailing test case in the upgrade jobs (also not specific to OVN) is "Application behind service load balancer with PDB is not disrupted". That appears to be getting worked on in https://bugzilla.redhat.com/show_bug.cgi?id=1929396
The "cluster upgrade should be fast" test case also fails almost every time. There was a recent Slack discussion about this:
https://coreos.slack.com/archives/C01CQA76KMX/p1620236543482500

Two bugs from that thread should hopefully help matters once they are fixed:
https://bugzilla.redhat.com/show_bug.cgi?id=1942164
https://bugzilla.redhat.com/show_bug.cgi?id=1817075

The test case has a 75m timeout before the failure shows up:
https://github.com/openshift/origin/blob/d704a4d2ab5e55731d11770c11eacd666940b944/test/e2e/upgrade/upgrade.go#L274

The test case does pass every once in a while, and in the most recent failing job you can see the upgrade took 77 minutes, so barely over the 75m window:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1392413115697598464

"upgrade to registry.build02.ci.openshift.org/ci-op-irx3q246/release@sha256:13d044e10254d79be573be32d7b3fcdb6d0da893cee5879c98fe4889a8d1e6da took too long: 77.8378904475"
Closing this bug, as it has no specific focus other than tracking other bugs that may be causing upgrade job failures. We need a specific bug for each distinct failure happening in those upgrade jobs, each marked as a blocker (or not) as appropriate.

Here is the current list as I know it, and I'm sure it's not complete. I'm also sure there are other bugs not yet filed for the upgrade job, but until we make progress on the existing bugs, the jobs are too noisy to tell which failures are already being tracked.

https://bugzilla.redhat.com/show_bug.cgi?id=1943334
https://bugzilla.redhat.com/show_bug.cgi?id=1927264
https://bugzilla.redhat.com/show_bug.cgi?id=1959200
https://bugzilla.redhat.com/show_bug.cgi?id=1942164
https://bugzilla.redhat.com/show_bug.cgi?id=1817075
https://bugzilla.redhat.com/show_bug.cgi?id=1968021
https://bugzilla.redhat.com/show_bug.cgi?id=1968030
https://bugzilla.redhat.com/show_bug.cgi?id=1968009
https://bugzilla.redhat.com/show_bug.cgi?id=1944264
https://bugzilla.redhat.com/show_bug.cgi?id=1943363