Bug 1990335

Summary: Flaky CI - [sig-network] Conntrack should be able to preserve UDP traffic when server pod cycles for a ClusterIP service
Product: OpenShift Container Platform
Reporter: Martin Kennelly <mkennell>
Component: Networking
Assignee: Surya Seetharaman <surya>
Networking sub component: ovn-kubernetes
QA Contact: Anurag saxena <anusaxen>
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
CC: aconstan, surya, trozet
Version: 4.9
Keywords: Reopened
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Last Closed: 2022-09-03 12:51:28 UTC
Type: Bug

Description Martin Kennelly 2021-08-05 08:35:00 UTC
Description of problem:
Flaky test - fails roughly once in every three runs, as seen here on master: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-ovn

test/suite name: [sig-network] Conntrack should be able to preserve UDP traffic when server pod cycles for a ClusterIP service [Suite:openshift/conformance/parallel] [Suite:k8s]


How reproducible:
Unknown


Actual results:
"Failed to connect to backend 2" [1]


Expected results:
Test passes consistently, as observed on testgrid

Additional info:
[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-ovn/1420528506080595968

Comment 1 Martin Kennelly 2021-08-09 17:47:14 UTC
The suite/test above is passing in the latest test runs of PR https://github.com/openshift/kubernetes/pull/862 - e.g. https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_kubernetes/862/pull-ci-openshift-kubernetes-master-e2e-gcp/1423849437725200384/build-log.txt

It remains to be seen whether this holds consistently and the flakiness is fixed.

Comment 2 Martin Kennelly 2021-11-02 15:58:42 UTC
Test is now stable.

Comment 3 Tim Rozet 2022-04-08 18:36:26 UTC
Jamo saw this in some cluster bot runs with GCP:
https://prow.ci.openshift.org/a12c8ac6-2c96-46ac-a482-06b0cd60d092

Dug into the test case and the failure; this is a legitimate problem. The test works like this:
1. There is a ClusterIP service with a single backend, podserver-1.
2. The client sends traffic to the ClusterIP service, reusing the same source port each time; does agnhost report a reply from podserver-1? Yes, pass. (The client behavior is sketched below.)
3. Create a podserver-2 endpoint, delete podserver-1, and keep sending traffic from the client; does it now return podserver-2? No - bug.
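
For reference, a minimal sketch of the client side of that flow (not the actual upstream e2e code; the ClusterIP, ports and message are placeholders). The important details are that the client reuses one UDP source port, so every datagram maps to the same conntrack entry, and that the agnhost UDP server echoes its hostname back, which is how the test tells podserver-1 and podserver-2 apart:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Fixed source port: every datagram below maps onto one conntrack entry.
	laddr := &net.UDPAddr{Port: 54321}
	// ClusterIP and port of the service under test (placeholder values).
	raddr := &net.UDPAddr{IP: net.ParseIP("172.30.0.10"), Port: 80}

	conn, err := net.DialUDP("udp", laddr, raddr)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	buf := make([]byte, 1024)
	for i := 0; i < 10; i++ {
		// The agnhost UDP server replies with its hostname, which is how the
		// test knows whether podserver-1 or podserver-2 answered.
		if _, err := conn.Write([]byte("hostname")); err != nil {
			fmt.Println("write failed:", err)
			continue
		}
		_ = conn.SetReadDeadline(time.Now().Add(3 * time.Second))
		n, err := conn.Read(buf)
		if err != nil {
			// With a stale conntrack entry pointing at the deleted pod,
			// every read times out: "Failed to connect to backend 2".
			fmt.Println("no reply:", err)
		} else {
			fmt.Println("reply from:", string(buf[:n]))
		}
		time.Sleep(time.Second)
	}
}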


In OVNK-node we flush conntrack entries when the endpoint for a service is deleted. We also remove that endpoint from the service load balancer in OVN. However, there is no synchronization between these two actions. Therefore the following can happen:
1. A conntrack entry exists for cluster IP -> podserver-1.
2. podserver-1 is deleted, podserver-2 is added.
3. The client is continuously sending traffic to the cluster IP.
4. The node watcher detects the endpoint deletion and flushes the conntrack entry.
5. The client sends another UDP packet to the cluster IP, which still DNATs to podserver-1 because OVN has not been updated yet.
6. A new conntrack entry is created for clusterIP -> podserver-1.
7. OVN updates the load balancer to remove podserver-1 and add podserver-2.
8. The client is now stuck with a stale conntrack entry and its traffic is blackholed.
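
For context, the flush in step 4 amounts to deleting the UDP conntrack entries whose original destination is the service cluster IP. A minimal sketch of that operation, assuming the conntrack CLI is available on the node (the real ovn-kubernetes code goes through netlink and differs in detail; the IP and port are placeholders):

package main

import (
	"fmt"
	"os/exec"
)

// flushServiceConntrack deletes UDP conntrack entries whose original
// destination is the given cluster IP and port, roughly what happens in
// step 4 above when the endpoint deletion is observed.
func flushServiceConntrack(clusterIP string, port int) error {
	// Equivalent to: conntrack -D -p udp --orig-dst <clusterIP> --dport <port>
	out, err := exec.Command("conntrack", "-D",
		"-p", "udp",
		"--orig-dst", clusterIP,
		"--dport", fmt.Sprint(port)).CombinedOutput()
	if err != nil {
		// Note: conntrack exits non-zero when no entries matched; a real
		// implementation would tolerate that case.
		return fmt.Errorf("conntrack flush failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	// Placeholder cluster IP/port. Step 5 of the race can recreate a stale
	// entry immediately after this runs, because the OVN load balancer still
	// points at podserver-1.
	if err := flushServiceConntrack("172.30.0.10", 80); err != nil {
		fmt.Println(err)
	}
}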

Not sure how to fix this yet. Perhaps a level-driven node controller would help detect stale conntrack entries. For now, we could have a periodic cleanup goroutine that detects invalid entries and flushes them (rough sketch below).
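
A rough sketch of what such a periodic cleanup goroutine could look like; listServiceEntries, currentEndpoints and deleteEntry are hypothetical placeholders for the real conntrack/netlink and endpoint-cache plumbing, not existing ovn-kubernetes APIs:

package main

import "time"

type conntrackEntry struct {
	clusterIP string // original destination (the service VIP)
	backendIP string // DNAT'd destination (the pod that answered)
}

// Stand-ins for the real conntrack listing, endpoint cache and deletion code.
func listServiceEntries() []conntrackEntry              { return nil }
func currentEndpoints(clusterIP string) map[string]bool { return nil }
func deleteEntry(e conntrackEntry)                      {}

// runStaleConntrackCleaner periodically deletes UDP conntrack entries whose
// DNAT'd backend is no longer a valid endpoint of the service (e.g. the
// clusterIP -> podserver-1 entry left behind in step 8).
func runStaleConntrackCleaner(stop <-chan struct{}, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			for _, e := range listServiceEntries() {
				if !currentEndpoints(e.clusterIP)[e.backendIP] {
					deleteEntry(e)
				}
			}
		}
	}
}

func main() {
	stop := make(chan struct{})
	go runStaleConntrackCleaner(stop, 30*time.Second)
	// ... the rest of ovnkube-node would run here; close(stop) on shutdown.
	time.Sleep(time.Minute)
	close(stop)
}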

Comment 4 Surya Seetharaman 2022-09-03 12:51:28 UTC

*** This bug has been marked as a duplicate of bug 2106554 ***

Comment 5 Surya Seetharaman 2023-01-09 18:41:14 UTC
The periodic go-routine approach was discussed in one of the PRs (need to dig it up, or dig up the conversation on Slack); it was deemed too heavy, and the right way was, it seems, to actually fix this in OVN. See https://bugzilla.redhat.com/show_bug.cgi?id=1839103#c5 and https://bugzilla.redhat.com/show_bug.cgi?id=1839103#c8 for more details.