Bug 2042956

Summary: openshift-tests-upgrade.[sig-network] pods should successfully create sandboxes by other
Product: OpenShift Container Platform Reporter: Devan Goodwin <dgoodwin>
Component: Networking Assignee: jamo luhrsen <jluhrsen>
Networking sub component: ovn-kubernetes QA Contact: Anurag saxena <anusaxen>
Status: CLOSED DUPLICATE Docs Contact:
Severity: high    
Priority: unspecified CC: sippy
Version: 4.10   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-01-20 17:32:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Comment 2 Devan Goodwin 2022-01-20 13:13:41 UTC
Ok there's some important info here. You'll note the "openshift-tests-upgrade" suite name in the link above. This is brand new, resulting from a TRT change we made. A while back we discovered that testgrid, sippy, and aggregation all failed to properly differentiate multiple executions of openshift-tests. What was happening with this test is that we would run an openshift-tests upgrade suite and a conformance suite all in one job. The junit results got merged together because the xml always had the same suite name. The merging would see one test run pass and one fail, and consider the test a flake, when in reality the test was hard failing in one of those runs and we couldn't tell because suites were not being used properly.

We fixed this a couple days ago such that different invocations of openshift-tests will have different suite names.

Now we see that this test fails in the upgrade suite. It likely has done so for a very long time and it's a 100% failure. So I am dropping severity, we will ignore the test in aggregation so payloads start flowing. It still needs some kind of a fix, either removed from the suite, or made to pass somehow.
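The merge-and-misclassify behavior described above can be illustrated with a minimal sketch (hypothetical names and logic, not the actual sippy/aggregation code): when two invocations share a suite name, a pass and a fail for the same test collapse into a "flake", while distinct suite names keep the upgrade suite's hard failure visible.

```python
from collections import defaultdict

def classify(results):
    """Classify each (suite, test) pair after merging runs by suite name.

    results: list of (suite_name, test_name, outcome) tuples,
    where outcome is "pass" or "fail".
    """
    merged = defaultdict(list)
    for suite, test, outcome in results:
        merged[(suite, test)].append(outcome)

    verdicts = {}
    for key, outcomes in merged.items():
        if all(o == "pass" for o in outcomes):
            verdicts[key] = "pass"
        elif all(o == "fail" for o in outcomes):
            verdicts[key] = "fail"
        else:
            # A mix of pass and fail across merged runs looks like a flake,
            # even if one invocation was hard-failing every time.
            verdicts[key] = "flake"
    return verdicts

# Before the fix: both invocations reported under the same suite name,
# so the upgrade run's 100% failure was masked as a flake.
before = [
    ("openshift-tests", "pods should successfully create sandboxes", "fail"),
    ("openshift-tests", "pods should successfully create sandboxes", "pass"),
]

# After the fix: distinct suite names keep the runs separate, so the
# upgrade-suite failure surfaces as a hard fail.
after = [
    ("openshift-tests-upgrade", "pods should successfully create sandboxes", "fail"),
    ("openshift-tests", "pods should successfully create sandboxes", "pass"),
]
```

With the `before` data, `classify` reports the test as a flake; with the `after` data it reports a fail for the upgrade suite and a pass for the conformance suite, which is the behavior change the TRT fix delivered.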

Comment 3 jamo luhrsen 2022-01-20 17:31:30 UTC
(In reply to Devan Goodwin from comment #2)
> Ok there's some important info here. You'll note the
> "openshift-tests-upgrade" suite name in the link above. This is brand new
> resulting from a TRT change we made. A while back we discovered that testgrid
> and sippy and aggregation all did not properly differentiate multiple
> executions of openshift-tests. What was happening with this test is we would
> run an openshift-tests upgrade suite, and a conformance suite, all in one
> job. The junit results get merged together because the xml always had the
> same suite name. The merging would see one test run pass, and one fail, and
> consider the test a flake, when in reality the test was hard failing in one
> of those runs and we couldn't tell because suites were not being used
> properly.
> 
> We fixed this a couple days ago such that different invocations of
> openshift-tests will have different suite names.
> 
> Now we see that this test fails in the upgrade suite. It likely has done so
> for a very long time and it's a 100% failure. So I am dropping severity, we
> will ignore the test in aggregation so payloads start flowing. It still
> needs some kind of a fix, either removed from the suite, or made to pass
> somehow.

This was eating at me recently because I spent so many cycles figuring out
this failure for our 4.9->4.10 ovn upgrade jobs that started permafailing.
I couldn't explain why it wasn't being reported on this 4.10->4.10 upgrade
job, when it was in fact there all along.

This is just a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2038481
which has a few things:

1) a test PR we can merge now that will ignore this specific case of a
guard pod that is deleted on a node drain and then incorrectly restarted
before the node is rebooted. The pod then exists as the node is coming up,
before networking is deployed, and we get this sandbox error. We can
merge this PR today if needed to get this out of the way:
  https://github.com/openshift/origin/pull/26763

2) here is the slack conversation about the problem in #forum-workloads:
  https://coreos.slack.com/archives/CKJR6200N/p1642096272047700

3) there are some PRs being worked on (some already merged) that will be the
final fix for this. Not sure how long that will take. We can do 1) above,
and then I can keep track of these real-fix PRs and revert 1) once those
are all in:
  https://github.com/openshift/library-go/pull/1287
  https://github.com/openshift/cluster-kube-apiserver-operator/pull/1295
  https://github.com/openshift/cluster-kube-scheduler-operator/pull/397
  https://github.com/openshift/cluster-kube-controller-manager-operator/pull/590
  https://github.com/openshift/cluster-kube-controller-manager-operator/pull/591
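The temporary workaround in item 1) amounts to an exception in the test's failure filtering. The following is a hypothetical sketch of that idea (the pod-name pattern, function names, and message matching are illustrative assumptions, not the contents of openshift/origin PR #26763): sandbox-creation failures for static guard pods hit by the drain/reboot race are tolerated, while the same error on any other pod still fails the test.

```python
import re

# Illustrative assumption: static guard pods follow a "*-guard" naming
# convention, and the race shows up as a sandbox-creation error message.
GUARD_POD_RE = re.compile(r".*-guard$")

def is_known_guard_pod_exception(pod_name, message):
    """Return True for the known guard-pod reboot race, False otherwise."""
    return bool(GUARD_POD_RE.match(pod_name)) and "sandbox" in message.lower()

def filter_failures(events):
    """Keep only failures that are NOT covered by the guard-pod exception.

    events: list of (pod_name, failure_message) tuples from the test run.
    """
    return [
        (pod, msg)
        for pod, msg in events
        if not is_known_guard_pod_exception(pod, msg)
    ]

events = [
    # Covered by the exception: a guard pod racing a node reboot.
    ("kube-apiserver-guard", "Error creating pod sandbox: networking not ready"),
    # Not covered: an ordinary pod with the same class of error.
    ("my-app-1", "Error creating pod sandbox: other failure"),
]
```

Once the real fixes in item 3) land, reverting the workaround would simply drop this exception so guard-pod sandbox errors fail the test again.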

Comment 4 jamo luhrsen 2022-01-20 17:32:37 UTC

*** This bug has been marked as a duplicate of bug 2038481 ***