Bug 2042956 - openshift-tests-upgrade.[sig-network] pods should successfully create sandboxes by other
Summary: openshift-tests-upgrade.[sig-network] pods should successfully create sandbox...
Keywords:
Status: CLOSED DUPLICATE of bug 2038481
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: jamo luhrsen
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-20 12:04 UTC by Devan Goodwin
Modified: 2022-01-21 09:29 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-01-20 17:32:37 UTC
Target Upstream Version:
Embargoed:



Comment 2 Devan Goodwin 2022-01-20 13:13:41 UTC
OK, there's some important info here. You'll note the "openshift-tests-upgrade" suite name in the link above. This is brand new, resulting from a TRT change we made. A while back we discovered that testgrid, sippy, and aggregation all failed to properly differentiate multiple executions of openshift-tests. What was happening with this test is that we would run an openshift-tests upgrade suite and a conformance suite, all in one job. The junit results get merged together because the XML always had the same suite name. The merging would see one test run pass and one fail, and consider the test a flake, when in reality the test was hard failing in one of those runs and we couldn't tell because suites were not being used properly.

We fixed this a couple days ago such that different invocations of openshift-tests will have different suite names.

Now we see that this test fails in the upgrade suite. It likely has done so for a very long time and it's a 100% failure. So I am dropping severity, we will ignore the test in aggregation so payloads start flowing. It still needs some kind of a fix, either removed from the suite, or made to pass somehow.
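To illustrate the mechanism (this is a simplified sketch, not the actual sippy/aggregation code; the types and the keying helper are made up), merging junit results by test name alone turns a pass + fail pair into a "flake", while keying by suite name surfaces the hard failure:

// Hedged sketch, not the real TRT/sippy merge code: shows why identical
// suite names hide a hard failure as a "flake", and why distinct suite
// names surface it.
package main

import "fmt"

// result is one junit test case outcome from one openshift-tests invocation.
type result struct {
    suite  string // e.g. "openshift-tests-upgrade" or "openshift-tests"
    test   string
    passed bool
}

// classify groups results by key and reports "pass", "fail", or "flake"
// (at least one pass and at least one fail under the same key).
func classify(results []result, key func(result) string) map[string]string {
    pass := map[string]bool{}
    fail := map[string]bool{}
    for _, r := range results {
        if r.passed {
            pass[key(r)] = true
        } else {
            fail[key(r)] = true
        }
    }
    out := map[string]string{}
    for k := range pass {
        if fail[k] {
            out[k] = "flake"
        } else {
            out[k] = "pass"
        }
    }
    for k := range fail {
        if !pass[k] {
            out[k] = "fail"
        }
    }
    return out
}

func main() {
    // One job runs the upgrade suite (test fails) and the conformance suite
    // (same test passes).
    results := []result{
        {"openshift-tests-upgrade", "[sig-network] pods should successfully create sandboxes", false},
        {"openshift-tests", "[sig-network] pods should successfully create sandboxes", true},
    }

    // Old behavior: both invocations reported the same suite name, so the
    // merge effectively keys on the test name alone and the hard failure
    // looks like a flake.
    fmt.Println(classify(results, func(r result) string { return r.test }))

    // New behavior: distinct suite names keep the invocations separate, so
    // the 100% failure in the upgrade suite stays visible.
    fmt.Println(classify(results, func(r result) string { return r.suite + "/" + r.test }))
}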

Comment 3 jamo luhrsen 2022-01-20 17:31:30 UTC
(In reply to Devan Goodwin from comment #2)
> OK, there's some important info here. You'll note the
> "openshift-tests-upgrade" suite name in the link above. This is brand new,
> resulting from a TRT change we made. A while back we discovered that
> testgrid, sippy, and aggregation all failed to properly differentiate
> multiple executions of openshift-tests. What was happening with this test
> is that we would run an openshift-tests upgrade suite and a conformance
> suite, all in one job. The junit results get merged together because the
> XML always had the same suite name. The merging would see one test run
> pass and one fail, and consider the test a flake, when in reality the test
> was hard failing in one of those runs and we couldn't tell because suites
> were not being used properly.
> 
> We fixed this a couple days ago such that different invocations of
> openshift-tests will have different suite names.
> 
> Now we see that this test fails in the upgrade suite. It likely has done so
> for a very long time and it's a 100% failure. So I am dropping severity, we
> will ignore the test in aggregation so payloads start flowing. It still
> needs some kind of a fix, either removed from the suite, or made to pass
> somehow.

This had been eating at me recently, because I spent so many cycles figuring
out this failure for our 4.9->4.10 OVN upgrade jobs that started
permafailing, and I couldn't explain why it wasn't being reported on this
4.10->4.10 upgrade job. It was there after all.

This is just a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2038481
which has a few things:

1) a test PR we can merge now that will ignore this specific case: a guard
pod that is deleted on a node drain and then incorrectly restarted before
the node is rebooted. The pod then exists as the node is coming back up,
before networking is deployed, and we get this sandbox error. We can merge
this PR today if needed to get this out of the way (see the sketch after
this list):
  https://github.com/openshift/origin/pull/26763

2) here is the slack conversation about the problem in #forum-workloads:
  https://coreos.slack.com/archives/CKJR6200N/p1642096272047700

3) there are some PRs being worked on (some already merged) that will be the
final fix for this. Not sure how long that will take. We can do 1) above,
and then I can keep track of these real-fix PRs and revert 1) when those
are all in:
  https://github.com/openshift/library-go/pull/1287
  https://github.com/openshift/cluster-kube-apiserver-operator/pull/1295
  https://github.com/openshift/cluster-kube-scheduler-operator/pull/397
  https://github.com/openshift/cluster-kube-controller-manager-operator/pull/590
  https://github.com/openshift/cluster-kube-controller-manager-operator/pull/591
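
To make 1) concrete, here is a rough sketch of the kind of exception that
test PR adds (hypothetical code, not the actual change in
openshift/origin#26763; the message fragments, pattern, and function name
are made up for illustration): a sandbox-creation failure is tolerated when
it comes from one of the guard pods hitting this reboot window.

// Hypothetical sketch only; not the actual openshift/origin#26763 change.
package main

import (
    "fmt"
    "regexp"
    "strings"
)

// knownGuardPodPattern matches the static-pod guard pods (kube-apiserver,
// kube-scheduler, kube-controller-manager). Assumption: the real matching
// is more precise than a simple name check.
var knownGuardPodPattern = regexp.MustCompile(`-guard`)

// isTolerableSandboxFailure reports whether a sandbox-creation failure
// event should be ignored by the test because it is the known guard-pod
// case described in 1). The message fragments are illustrative.
func isTolerableSandboxFailure(pod, message string) bool {
    if !strings.Contains(message, "failed to create pod network sandbox") &&
        !strings.Contains(message, "error adding container to network") {
        return false
    }
    return knownGuardPodPattern.MatchString(pod)
}

func main() {
    fmt.Println(isTolerableSandboxFailure(
        "kube-apiserver-guard-master-0",
        "failed to create pod network sandbox: no CNI configuration file found"))
    fmt.Println(isTolerableSandboxFailure(
        "my-app-7d4b9",
        "failed to create pod network sandbox: no CNI configuration file found"))
}

Once the real-fix PRs in 3) are all in, an exception like this becomes
unnecessary and can be reverted.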

Comment 4 jamo luhrsen 2022-01-20 17:32:37 UTC

*** This bug has been marked as a duplicate of bug 2038481 ***

