openshift-tests-upgrade.[sig-network] pods should successfully create sandboxes by other is failing frequently in CI; see: https://sippy.ci.openshift.org/sippy-ng/tests/4.10/analysis?test=openshift-tests-upgrade.%5Bsig-network%5D%20pods%20should%20successfully%20create%20sandboxes%20by%20other

Found in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-aws-ovn-upgrade-4.10-minor-release-openshift-release-analysis-aggregator/1483642143795843072, but it appears to be perma-failing in some jobs.
Ok, there's some important info here. You'll note the "openshift-tests-upgrade" suite name in the link above. This is brand new, resulting from a TRT change we made. A while back we discovered that testgrid, sippy, and aggregation all failed to properly differentiate multiple executions of openshift-tests. What was happening with this test is that we would run an openshift-tests upgrade suite and a conformance suite, all in one job. The junit results got merged together because the XML always had the same suite name. The merging would see one test run pass and one fail, and consider the test a flake, when in reality the test was hard failing in one of those runs and we couldn't tell, because suites were not being used properly.

We fixed this a couple of days ago such that different invocations of openshift-tests now have different suite names.

Now we see that this test fails in the upgrade suite. It has likely done so for a very long time, and it's a 100% failure. So I am dropping severity; we will ignore the test in aggregation so payloads start flowing. It still needs some kind of fix: either removal from the suite, or being made to pass somehow.
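[Editorial note: to make the merging problem concrete, here is a minimal, hypothetical Go sketch. This is not sippy's or the aggregator's actual code; Result, classify, and the keying functions are all illustrative assumptions. It only shows why merging junit results keyed on test name alone turns a hard failure into an apparent flake, and why distinct suite names expose it.]

    package main

    import "fmt"

    // Result is one junit test case outcome from a single
    // openshift-tests invocation within a job run.
    type Result struct {
        Suite string // junit testsuite name
        Test  string
        Pass  bool
    }

    // classify buckets results under a key and reports "pass", "fail",
    // or "flake" (at least one pass and one fail under the same key).
    func classify(results []Result, key func(Result) string) map[string]string {
        passed := map[string]bool{}
        failed := map[string]bool{}
        for _, r := range results {
            if r.Pass {
                passed[key(r)] = true
            } else {
                failed[key(r)] = true
            }
        }
        verdict := map[string]string{}
        for k := range passed {
            if failed[k] {
                verdict[k] = "flake"
            } else {
                verdict[k] = "pass"
            }
        }
        for k := range failed {
            if !passed[k] {
                verdict[k] = "fail"
            }
        }
        return verdict
    }

    func main() {
        // One job runs openshift-tests twice: the upgrade suite,
        // where the test hard-fails, and the conformance suite,
        // where it passes.
        results := []Result{
            {Suite: "openshift-tests-upgrade", Test: "pods should successfully create sandboxes", Pass: false},
            {Suite: "openshift-tests", Test: "pods should successfully create sandboxes", Pass: true},
        }

        // Old behavior: both invocations emitted the same suite name, so
        // merged results were effectively keyed by test name alone and
        // the hard failure was reported as a flake.
        fmt.Println(classify(results, func(r Result) string { return r.Test }))

        // New behavior: distinct suite names keep the invocations apart,
        // exposing the 100% failure in the upgrade suite.
        fmt.Println(classify(results, func(r Result) string { return r.Suite + "/" + r.Test }))
    }

That separation is exactly what the TRT change described above provides: each openshift-tests invocation now emits its own junit suite name, so the second keying becomes possible.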
(In reply to Devan Goodwin from comment #2)
> Now we see that this test fails in the upgrade suite. It has likely done so
> for a very long time, and it's a 100% failure. So I am dropping severity;
> we will ignore the test in aggregation so payloads start flowing. It still
> needs some kind of fix: either removal from the suite, or being made to
> pass somehow.

This was eating at me recently, because I spent so many cycles figuring out this failure for our 4.9->4.10 OVN upgrade jobs that started permafailing, yet I couldn't explain why it wasn't being reported on this 4.10->4.10 upgrade job. It was there after all.

This is just a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2038481, which has a few things:

1) A test PR we can merge now that will ignore this specific case: a guard pod is deleted on a node drain, then incorrectly gets restarted before the node is rebooted. The pod therefore exists as the node comes back up, before networking is deployed, and we get this sandbox error. We can merge this PR today if needed to get this out of the way (a sketch of this kind of test exception follows below): https://github.com/openshift/origin/pull/26763

2) The Slack conversation about the problem in #forum-workloads: https://coreos.slack.com/archives/CKJR6200N/p1642096272047700

3) There are PRs being worked (some already merged) that will be the final fix for this; not sure how long that will take. We can do 1) above, and I can keep track of these real-fix PRs and revert 1) when they are all in:
https://github.com/openshift/library-go/pull/1287
https://github.com/openshift/cluster-kube-apiserver-operator/pull/1295
https://github.com/openshift/cluster-kube-scheduler-operator/pull/397
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/590
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/591
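[Editorial note: for illustration only, a minimal Go sketch of the kind of event exception a test PR like the one in 1) above could add. The regex, function names, and message format here are assumptions, not the actual contents of openshift/origin#26763.]

    package main

    import (
        "fmt"
        "regexp"
    )

    // Hypothetical pattern for the known-benign case described above:
    // a static-pod guard pod restarted during a node drain hits a
    // sandbox-creation failure before networking is redeployed on the
    // rebooting node. Illustrative only, not the actual origin code.
    var guardPodSandboxFailure = regexp.MustCompile(
        `sandbox.*(kube-apiserver|kube-scheduler|kube-controller-manager)-guard`)

    // isKnownGuardPodRace reports whether a sandbox failure event
    // matches the guard-pod-restarted-before-reboot pattern and can
    // therefore be tolerated instead of failing the test.
    func isKnownGuardPodRace(eventMessage string) bool {
        return guardPodSandboxFailure.MatchString(eventMessage)
    }

    func main() {
        msg := "failed to create pod sandbox for pod openshift-kube-scheduler/kube-scheduler-guard-node-1: network is not ready"
        fmt.Println(isKnownGuardPodRace(msg)) // true: treated as a known exception, not a test failure
    }

The point of such an exception is that it is reversible: once the real-fix PRs listed in 3) all land, the exception can be reverted so the test enforces the stricter behavior again.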
*** This bug has been marked as a duplicate of bug 2038481 ***