Bug 1990335

Summary: Flaky CI - [sig-network] Conntrack should be able to preserve UDP traffic when server pod cycles for a ClusterIP service
Product: OpenShift Container Platform
Reporter: Martin Kennelly <mkennell>
Component: Networking
Assignee: Surya Seetharaman <surya>
Networking sub component: ovn-kubernetes
QA Contact: Anurag saxena <anusaxen>
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
CC: aconstan, surya, trozet
Version: 4.9
Keywords: Reopened
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Last Closed: 2022-09-03 12:51:28 UTC
Type: Bug

Description Martin Kennelly 2021-08-05 08:35:00 UTC
Description of problem:
Flaky test - fails roughly once in every three runs, as seen here on master: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-ovn

test/suite name: [sig-network] Conntrack should be able to preserve UDP traffic when server pod cycles for a ClusterIP service [Suite:openshift/conformance/parallel] [Suite:k8s]


How reproducible:
Unknown


Actual results:
"Failed to connect to backend 2" [1]


Expected results:
Test passes consistently, as observed on testgrid

Additional info:
[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-ovn/1420528506080595968

Comment 1 Martin Kennelly 2021-08-09 17:47:14 UTC
The suite/test above is passing in the latest test runs of PR https://github.com/openshift/kubernetes/pull/862 - e.g. https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_kubernetes/862/pull-ci-openshift-kubernetes-master-e2e-gcp/1423849437725200384/build-log.txt

It remains to be seen whether this holds consistently and the flakiness is fixed.

Comment 2 Martin Kennelly 2021-11-02 15:58:42 UTC
Test is now stable.

Comment 3 Tim Rozet 2022-04-08 18:36:26 UTC
Jamo saw this in some cluster bot runs with GCP:
https://prow.ci.openshift.org/a12c8ac6-2c96-46ac-a482-06b0cd60d092

Dug into the test case and the failure; this is a legitimate problem. The test works like this:
1. There is a ClusterIP service with a single backend, podserver-1.
2. The client sends traffic to the ClusterIP service, reusing the same source port each time; does agnhost report a reply from podserver-1? Yes, pass. (The client behavior is sketched below.)
3. Create a podserver-2 endpoint, delete podserver-1, and keep sending traffic from the client; does it now return podserver-2? No - bug.
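
For reference, a minimal sketch of the client side of that flow (not the actual upstream e2e code; the ClusterIP, ports and message are placeholders). The important details are that the client reuses one UDP source port, so every datagram maps to the same conntrack entry, and that the agnhost UDP server echoes its hostname back, which is how the test tells podserver-1 and podserver-2 apart:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Fixed source port: every datagram below maps onto one conntrack entry.
	laddr := &net.UDPAddr{Port: 54321}
	// ClusterIP and port of the service under test (placeholder values).
	raddr := &net.UDPAddr{IP: net.ParseIP("172.30.0.10"), Port: 80}

	conn, err := net.DialUDP("udp", laddr, raddr)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	buf := make([]byte, 1024)
	for i := 0; i < 10; i++ {
		// The agnhost UDP server replies with its hostname, which is how the
		// test knows whether podserver-1 or podserver-2 answered.
		if _, err := conn.Write([]byte("hostname")); err != nil {
			fmt.Println("write failed:", err)
			continue
		}
		_ = conn.SetReadDeadline(time.Now().Add(3 * time.Second))
		n, err := conn.Read(buf)
		if err != nil {
			// With a stale conntrack entry pointing at the deleted pod,
			// every read times out: "Failed to connect to backend 2".
			fmt.Println("no reply:", err)
		} else {
			fmt.Println("reply from:", string(buf[:n]))
		}
		time.Sleep(time.Second)
	}
}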


In OVNK-node we flush conntrack entries when the endpoint for a service is deleted. We also remove that endpoint from the service load balancer in OVN. However, there is no synchronization between these two actions. Therefore the following can happen:
1. A conntrack entry exists for cluster IP -> podserver-1.
2. podserver-1 is deleted, podserver-2 is added.
3. The client is continuously sending traffic to the cluster IP.
4. The node watcher detects the endpoint deletion and flushes the conntrack entry.
5. The client sends another UDP packet to the cluster IP, which still DNATs to podserver-1 because OVN has not been updated yet.
6. A new conntrack entry is created for clusterIP -> podserver-1.
7. OVN updates the load balancer to remove podserver-1 and add podserver-2.
8. The client is now stuck with a stale conntrack entry and its traffic is blackholed.
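
For context, the flush in step 4 amounts to deleting the UDP conntrack entries whose original destination is the service cluster IP. A minimal sketch of that operation, assuming the conntrack CLI is available on the node (the real ovn-kubernetes code goes through netlink and differs in detail; the IP and port are placeholders):

package main

import (
	"fmt"
	"os/exec"
)

// flushServiceConntrack deletes UDP conntrack entries whose original
// destination is the given cluster IP and port, roughly what happens in
// step 4 above when the endpoint deletion is observed.
func flushServiceConntrack(clusterIP string, port int) error {
	// Equivalent to: conntrack -D -p udp --orig-dst <clusterIP> --dport <port>
	out, err := exec.Command("conntrack", "-D",
		"-p", "udp",
		"--orig-dst", clusterIP,
		"--dport", fmt.Sprint(port)).CombinedOutput()
	if err != nil {
		// Note: conntrack exits non-zero when no entries matched; a real
		// implementation would tolerate that case.
		return fmt.Errorf("conntrack flush failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	// Placeholder cluster IP/port. Step 5 of the race can recreate a stale
	// entry immediately after this runs, because the OVN load balancer still
	// points at podserver-1.
	if err := flushServiceConntrack("172.30.0.10", 80); err != nil {
		fmt.Println(err)
	}
}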

Not sure how to fix this yet. Perhaps a level-driven node controller would help detect stale conntrack entries. For now, we could have a periodic cleanup goroutine that detects invalid entries and flushes them (rough sketch below).
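
A rough sketch of what such a periodic cleanup goroutine could look like; listServiceEntries, currentEndpoints and deleteEntry are hypothetical placeholders for the real conntrack/netlink and endpoint-cache plumbing, not existing ovn-kubernetes APIs:

package main

import "time"

type conntrackEntry struct {
	clusterIP string // original destination (the service VIP)
	backendIP string // DNAT'd destination (the pod that answered)
}

// Stand-ins for the real conntrack listing, endpoint cache and deletion code.
func listServiceEntries() []conntrackEntry              { return nil }
func currentEndpoints(clusterIP string) map[string]bool { return nil }
func deleteEntry(e conntrackEntry)                      {}

// runStaleConntrackCleaner periodically deletes UDP conntrack entries whose
// DNAT'd backend is no longer a valid endpoint of the service (e.g. the
// clusterIP -> podserver-1 entry left behind in step 8).
func runStaleConntrackCleaner(stop <-chan struct{}, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			for _, e := range listServiceEntries() {
				if !currentEndpoints(e.clusterIP)[e.backendIP] {
					deleteEntry(e)
				}
			}
		}
	}
}

func main() {
	stop := make(chan struct{})
	go runStaleConntrackCleaner(stop, 30*time.Second)
	// ... the rest of ovnkube-node would run here; close(stop) on shutdown.
	time.Sleep(time.Minute)
	close(stop)
}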

Comment 4 Surya Seetharaman 2022-09-03 12:51:28 UTC

*** This bug has been marked as a duplicate of bug 2106554 ***

Comment 5 Surya Seetharaman 2023-01-09 18:41:14 UTC
The periodic go-routine approach was discussed in one of the PRs (need to dig it up, or dig up the conversation on Slack); it was deemed too heavy, and the right way was, it seems, to actually fix this in OVN. See https://bugzilla.redhat.com/show_bug.cgi?id=1839103#c5 and https://bugzilla.redhat.com/show_bug.cgi?id=1839103#c8 for more details.