Bug 1990335 - Flaky CI - [sig-network] Conntrack should be able to preserve UDP traffic when server pod cycles for a ClusterIP service
Summary: Flaky CI - [sig-network] Conntrack should be able to preserve UDP traffic when server pod cycles for a ClusterIP service
Keywords:
Status: CLOSED DUPLICATE of bug 2106554
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Surya Seetharaman
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-05 08:35 UTC by Martin Kennelly
Modified: 2023-01-09 18:41 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-03 12:51:28 UTC
Target Upstream Version:
Embargoed:



Description Martin Kennelly 2021-08-05 08:35:00 UTC
Description of problem:
Flaky test - fails roughly one in every three runs, as seen here on master: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-ovn

test/suite name: [sig-network] Conntrack should be able to preserve UDP traffic when server pod cycles for a ClusterIP service [Suite:openshift/conformance/parallel] [Suite:k8s]


How reproducible:
Unknown


Actual results:
"Failed to connect to backend 2" [1]


Expected results:
Test passes consistently, as observed on testgrid

Additional info:
[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-ovn/1420528506080595968

Comment 1 Martin Kennelly 2021-08-09 17:47:14 UTC
This suite/test above are passing on PRs (https://github.com/openshift/kubernetes/pull/862) latest test runs - e.g. https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_kubernetes/862/pull-ci-openshift-kubernetes-master-e2e-gcp/1423849437725200384/build-log.txt

It remains to be seen whether this is consistent and actually fixes the flakiness.

Comment 2 Martin Kennelly 2021-11-02 15:58:42 UTC
Test is now stable.

Comment 3 Tim Rozet 2022-04-08 18:36:26 UTC
Jamo saw this in some cluster bot runs with GCP:
https://prow.ci.openshift.org/a12c8ac6-2c96-46ac-a482-06b0cd60d092

I dug into the test case and the failure; this is a legitimate problem. The test works like this (rough client-side sketch after the list):
1. There is a ClusterIP service with a backend podserver-1.
2. The client sends to the ClusterIP service using the same source port; does it return podserver-1 (checked via agnhost)? Yes, pass.
3. Create a podserver-2 endpoint, delete podserver-1, keep sending traffic from the client; does it return podserver-2? No - bug.
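
For anyone trying to reproduce this outside the e2e suite, here is a minimal Go sketch of what the client side of the test does. This is not the actual test code; the ClusterIP, port and client source port below are placeholders.

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Fixed client source port so every request reuses the same conntrack entry.
	laddr := &net.UDPAddr{Port: 54321}
	// ClusterIP:port of the UDP service (placeholder values).
	raddr := &net.UDPAddr{IP: net.ParseIP("172.30.0.10"), Port: 80}

	conn, err := net.DialUDP("udp", laddr, raddr)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	buf := make([]byte, 1024)
	for {
		// agnhost netexec answers a UDP "hostname" request with the pod's hostname,
		// which is how the test tells podserver-1 from podserver-2.
		if _, err := conn.Write([]byte("hostname")); err != nil {
			fmt.Println("write failed:", err)
		}
		_ = conn.SetReadDeadline(time.Now().Add(3 * time.Second))
		n, err := conn.Read(buf)
		if err != nil {
			fmt.Println("no reply (this is the failure mode):", err)
		} else {
			fmt.Println("served by:", string(buf[:n]))
		}
		time.Sleep(time.Second)
	}
}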


In OVNK-node we flush conntrack entries when an endpoint is deleted for a service. We also remove that endpoint from the service load balancer in OVN. However, there is no synchronization between these two operations. Therefore the following can happen:
1. A conntrack entry exists for ClusterIP -> podserver-1.
2. podserver-1 is deleted, podserver-2 is added.
3. The client is continuously sending traffic to the ClusterIP.
4. The node watcher detects the endpoint deletion and flushes the conntrack entry.
5. The client sends another UDP packet to the ClusterIP; it gets DNAT'ed to podserver-1 because OVN is not updated yet.
6. A new conntrack entry is created for ClusterIP -> podserver-1.
7. OVN updates the load balancer to remove podserver-1 and add podserver-2.
8. The client is now stuck with a stale conntrack entry and traffic is blackholed (see the debug sketch after the list).
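
To confirm step 8 on the affected node, one quick check (outside of ovnkube) is to list the UDP conntrack rows for the service ClusterIP and look for one still DNAT'ed to the deleted podserver-1 IP. Minimal Go sketch that just shells out to conntrack-tools; the ClusterIP is a placeholder.

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	clusterIP := "172.30.0.10" // placeholder service ClusterIP

	// List UDP conntrack entries whose original destination is the ClusterIP.
	// An entry whose reply side still points at the deleted podserver-1 IP is
	// the stale entry described in step 8.
	out, err := exec.Command("conntrack", "-L", "-p", "udp", "--orig-dst", clusterIP).CombinedOutput()
	if err != nil {
		fmt.Println("conntrack -L failed:", err)
	}
	fmt.Print(string(out))
}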

Not sure how to fix this yet. Perhaps a level-driven node controller would help detect stale conntrack entries. For now we could have a periodic cleanup goroutine that detects invalid entries and flushes them (rough sketch below).
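
Rough sketch of what such a periodic cleanup goroutine could look like. This is not the OVN-Kubernetes implementation: the stale-IP detection is left as a stub, and the flush simply shells out to conntrack-tools.

package main

import (
	"log"
	"os/exec"
	"time"
)

// staleUDPServiceIPs is a stand-in for real bookkeeping: it should return the
// ClusterIPs whose conntrack entries may still point at deleted endpoints.
func staleUDPServiceIPs() []string {
	return nil // placeholder
}

func flushUDPEntries(clusterIP string) {
	// Delete UDP conntrack entries whose original destination is the ClusterIP,
	// forcing the next packet to be re-DNATed against the current OVN load balancer.
	if out, err := exec.Command("conntrack", "-D", "-p", "udp", "--orig-dst", clusterIP).CombinedOutput(); err != nil {
		log.Printf("conntrack flush for %s failed: %v (%s)", clusterIP, err, out)
	}
}

func main() {
	ticker := time.NewTicker(30 * time.Second) // arbitrary interval
	defer ticker.Stop()
	for range ticker.C {
		for _, ip := range staleUDPServiceIPs() {
			flushUDPEntries(ip)
		}
	}
}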

Comment 4 Surya Seetharaman 2022-09-03 12:51:28 UTC

*** This bug has been marked as a duplicate of bug 2106554 ***

Comment 5 Surya Seetharaman 2023-01-09 18:41:14 UTC
The periodic goroutine idea was discussed in one of the PRs (need to dig it up OR dig up the conversation on Slack); it was deemed too heavy, and the right way was actually to fix this in OVN. See https://bugzilla.redhat.com/show_bug.cgi?id=1839103#c5 and https://bugzilla.redhat.com/show_bug.cgi?id=1839103#c8 for more details.

