1923231 – [sig-network] Conntrack should be able to preserve UDP traffic when server pod cycles for a NodePort service

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	4.7.z
Assignee:	Antonio Ojea
QA Contact:	zhaozhanqi
Docs Contact:
URL:
Whiteboard:
Depends On:	1949063
Blocks:

Reported:	2021-02-01 15:20 UTC by Antonio Ojea
Modified:	2021-12-01 13:35 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1949063
Environment:	[sig-network] Conntrack should be able to preserve UDP traffic when server pod cycles for a NodePort service
Last Closed:	2021-12-01 13:35:22 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	kubernetes kubernetes pull 98305	None	Merged	kube-proxy has to clear NodePort stale UDP entries	2021-11-09 17:57:09 UTC
Github	openshift sdn pull 286	None	Merged	Bug 1923231: rebase to sdn-4.7-kubernetes-1.20.0-rc.0	2021-11-09 17:57:05 UTC
Red Hat Product Errata	RHBA-2021:4802	None	None	None	2021-12-01 13:35:43 UTC

Description Antonio Ojea 2021-02-01 15:20:49 UTC

The test
[sig-network] Conntrack should be able to preserve UDP traffic when server pod cycles for a NodePort service

is failing frequently in CI; see the search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-network%5C%5D+Conntrack+should+be+able+to+preserve+UDP+traffic+when+server+pod+cycles+for+a+NodePort+service


It is a known issue upstream:
https://github.com/kubernetes/kubernetes/issues/91236

It can be reproduced by running the test multiple times:
/e2e.test -kubeconfig ~/Downloads/kubeconfig -ginkgo.focus "Conntrack should be able to preserve UDP traffic when server pod cycles for a NodePort service" -test.count 150 -test.failfast

Comment 1 Clayton Coleman 2021-02-01 15:28:53 UTC

This fails roughly 1 in 150 times in our new network stress test; it is the only remaining known unfixed flake in over 300 runs of each test. Once we have this fixed, we can use network-stress as a "flake introduction PR blocker": the occurrence of a new flake on this test suite in a PR would block the merge, which could help us catch regressions introduced by new versions of the OS, network plugins, etc.
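
As a rough illustration of what a 1-in-150 flake rate means for a reproducer, the sketch below estimates how likely a batch of runs is to hit the failure at least once. It assumes each run fails independently with the same probability, which real CI runs may not satisfy; the function names are hypothetical and exist only for this example.

```python
import math


def prob_at_least_one_failure(per_run_failure_rate: float, runs: int) -> float:
    """Chance of seeing one or more failures in `runs` independent runs."""
    return 1.0 - (1.0 - per_run_failure_rate) ** runs


def runs_needed(per_run_failure_rate: float, confidence: float) -> int:
    """Smallest run count whose detection probability reaches `confidence`."""
    return math.ceil(
        math.log(1.0 - confidence) / math.log(1.0 - per_run_failure_rate)
    )


if __name__ == "__main__":
    rate = 1 / 150  # assumption: the approximate flake rate quoted above
    for n in (150, 300, 450):
        print(n, round(prob_at_least_one_failure(rate, n), 3))
    print("runs for 95% detection:", runs_needed(rate, 0.95))
```

Under these assumptions, 150 runs would hit the flake only about 63% of the time, so a single clean batch of 150 is suggestive rather than conclusive.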

Comment 3 Antonio Ojea 2021-02-02 11:53:38 UTC

Tentative fix: https://github.com/kubernetes/kubernetes/pull/98305

I still need to run it with the reproducer to confirm that it fixes the problem.

Comment 4 Antonio Ojea 2021-02-03 18:42:50 UTC

The fix is in https://github.com/kubernetes/kubernetes/pull/98305

I ran the test 150 times without any failures.
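
The linked PR is titled "kube-proxy has to clear NodePort stale UDP entries". As a minimal sketch of that idea, and not the code from the PR, the snippet below only builds the argument list for a conntrack-tools invocation that deletes UDP flow entries by destination port. The function name is hypothetical, and using the `conntrack` CLI here is an assumption made for illustration; nothing is executed.

```python
from typing import List


def conntrack_delete_udp_by_dport(node_port: int) -> List[str]:
    """Build (but do not run) a conntrack command that deletes UDP entries
    whose original destination port is `node_port`.

    Assumption: the conntrack-tools CLI is used; -D deletes matching entries,
    -p selects the protocol, and --dport matches the destination port.
    """
    if not 0 < node_port < 65536:
        raise ValueError(f"invalid port: {node_port}")
    return ["conntrack", "-D", "-p", "udp", "--dport", str(node_port)]


if __name__ == "__main__":
    # 30080 is an arbitrary example NodePort, not a value from this bug.
    print(" ".join(conntrack_delete_udp_by_dport(30080)))
```

Stale entries matter for UDP because an existing conntrack entry can keep steering a client's packets toward a backend that no longer exists after the server pod cycles.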

Comment 5 jamo luhrsen 2021-02-23 19:08:31 UTC

(In reply to Antonio Ojea from comment #4)
> Fix on https://github.com/kubernetes/kubernetes/pull/98305
> 
> I run the test 150 times without any failure

This is still failing:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-ovn-upgrade-4.6-stable-to-4.7-ci/1364177777766436864

It looks like it is happening sporadically across several different jobs:
https://search.ci.openshift.org/?search=Conntrack+should+be+able+to+preserve+UDP+traffic+when+server+pod+cycles+for+a+NodePort+service&maxAge=48h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 6 Antonio Ojea 2021-02-24 07:53:48 UTC

(In reply to jamo luhrsen from comment #5)

The patch is not in OpenShift yet; it was merged only in Kubernetes and needs to be backported.

Comment 7 jamo luhrsen 2021-02-24 16:51:22 UTC

(In reply to Antonio Ojea from comment #6)

Got it. Who takes care of this, or how can we track it? I keep running across this failure in our downstream CI, so it would be nice
to get it resolved.

Comment 8 Antonio Ojea 2021-03-03 12:17:16 UTC

The upstream backport merged in the 1.20 branch:
https://github.com/kubernetes/kubernetes/pull/99017/commits

Comment 9 Antonio Ojea 2021-03-04 08:57:26 UTC

Downstream backport to OpenShift:
https://github.com/openshift/kubernetes/pull/602

Comment 10 Antonio Ojea 2021-04-13 11:28:11 UTC

https://github.com/openshift/sdn/pull/285

Comment 11 Antonio Ojea 2021-11-08 15:50:55 UTC

The fix merged in https://github.com/openshift/sdn/pull/286

Comment 12 jamo luhrsen 2021-11-09 18:06:25 UTC

Following up: I looked at the testgrid for some jobs, and this test case does flake every once in a while (it fails on the first try
and passes on the second), but it is mostly passing. That seems like a good enough reason to close this bug as Verified.

Some testgrid links:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-ovn-upgrade

Comment 13 Antonio Ojea 2021-11-10 10:10:31 UTC

(In reply to jamo luhrsen from comment #12)

This bug is about kube-proxy (openshift-sdn), not OVN, which uses different logic.

Comment 14 jamo luhrsen 2021-11-10 17:45:00 UTC

(In reply to Antonio Ojea from comment #13)

OK, but that is not the kind of job I was commenting on in comment #5. It looks like
things got better some other way.

Do we need to change this back from Verified?

Comment 15 Antonio Ojea 2021-11-11 08:11:59 UTC

(In reply to jamo luhrsen from comment #14)

I just wanted to clarify that my PRs are unrelated to OVN; this bug was fixed upstream (and downstream, AFAIK).

I suggest closing this bug and opening a new one for OVN if you want to track it, but it is really up to you how to handle it.

Comment 18 errata-xmlrpc 2021-12-01 13:35:22 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.38 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4802
