Bug 1990335 - Flaky CI - [sig-network] Conntrack should be able to preserve UDP traffic when server pod cycles for a ClusterIP service
Summary: Flaky CI - [sig-network] Conntrack should be able to preserve UDP traffic when server pod cycles for a ClusterIP service
Keywords:
Status: CLOSED DUPLICATE of bug 2106554
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Surya Seetharaman
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-05 08:35 UTC by Martin Kennelly
Modified: 2023-01-09 18:41 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-03 12:51:28 UTC
Target Upstream Version:
Embargoed:



Description Martin Kennelly 2021-08-05 08:35:00 UTC
Description of problem:
Flaky test - fails roughly one in every three runs, as seen here on master: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-ovn

test/suite name: [sig-network] Conntrack should be able to preserve UDP traffic when server pod cycles for a ClusterIP service [Suite:openshift/conformance/parallel] [Suite:k8s]


How reproducible:
Unknown


Actual results:
"Failed to connect to backend 2" [1]


Expected results:
Test passes consistently, as observed on testgrid

Additional info:
[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-ovn/1420528506080595968

Comment 1 Martin Kennelly 2021-08-09 17:47:14 UTC
This suite/test above are passing on PRs (https://github.com/openshift/kubernetes/pull/862) latest test runs - e.g. https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_kubernetes/862/pull-ci-openshift-kubernetes-master-e2e-gcp/1423849437725200384/build-log.txt

It remains to be seen whether this is consistent and actually fixes the flakiness.

Comment 2 Martin Kennelly 2021-11-02 15:58:42 UTC
Test is now stable.

Comment 3 Tim Rozet 2022-04-08 18:36:26 UTC
Jamo saw this in some cluster bot runs with GCP:
https://prow.ci.openshift.org/a12c8ac6-2c96-46ac-a482-06b0cd60d092

I dug into the test case and the failure; this is a legitimate problem. The test works like this (rough client-side sketch after the list):
1. There is a ClusterIP service with a backend podserver-1.
2. The client sends to the ClusterIP service using the same source port; does it return podserver-1 (checked via agnhost)? Yes, pass.
3. Create a podserver-2 endpoint, delete podserver-1, keep sending traffic from the client; does it return podserver-2? No - bug.
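
For anyone trying to reproduce this outside the e2e suite, here is a minimal Go sketch of what the client side of the test does. This is not the actual test code; the ClusterIP, port and client source port below are placeholders.

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Fixed client source port so every request reuses the same conntrack entry.
	laddr := &net.UDPAddr{Port: 54321}
	// ClusterIP:port of the UDP service (placeholder values).
	raddr := &net.UDPAddr{IP: net.ParseIP("172.30.0.10"), Port: 80}

	conn, err := net.DialUDP("udp", laddr, raddr)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	buf := make([]byte, 1024)
	for {
		// agnhost netexec answers a UDP "hostname" request with the pod's hostname,
		// which is how the test tells podserver-1 from podserver-2.
		if _, err := conn.Write([]byte("hostname")); err != nil {
			fmt.Println("write failed:", err)
		}
		_ = conn.SetReadDeadline(time.Now().Add(3 * time.Second))
		n, err := conn.Read(buf)
		if err != nil {
			fmt.Println("no reply (this is the failure mode):", err)
		} else {
			fmt.Println("served by:", string(buf[:n]))
		}
		time.Sleep(time.Second)
	}
}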


In OVNK-node we flush conntrack entries when an endpoint is deleted for a service. We also remove that endpoint from the service load balancer in OVN. However, there is no synchronization between these two operations. Therefore the following can happen:
1. A conntrack entry exists for ClusterIP -> podserver-1.
2. podserver-1 is deleted, podserver-2 is added.
3. The client is continuously sending traffic to the ClusterIP.
4. The node watcher detects the endpoint deletion and flushes the conntrack entry.
5. The client sends another UDP packet to the ClusterIP; it gets DNAT'ed to podserver-1 because OVN is not updated yet.
6. A new conntrack entry is created for ClusterIP -> podserver-1.
7. OVN updates the load balancer to remove podserver-1 and add podserver-2.
8. The client is now stuck with a stale conntrack entry and traffic is blackholed (see the debug sketch after the list).
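
To confirm step 8 on the affected node, one quick check (outside of ovnkube) is to list the UDP conntrack rows for the service ClusterIP and look for one still DNAT'ed to the deleted podserver-1 IP. Minimal Go sketch that just shells out to conntrack-tools; the ClusterIP is a placeholder.

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	clusterIP := "172.30.0.10" // placeholder service ClusterIP

	// List UDP conntrack entries whose original destination is the ClusterIP.
	// An entry whose reply side still points at the deleted podserver-1 IP is
	// the stale entry described in step 8.
	out, err := exec.Command("conntrack", "-L", "-p", "udp", "--orig-dst", clusterIP).CombinedOutput()
	if err != nil {
		fmt.Println("conntrack -L failed:", err)
	}
	fmt.Print(string(out))
}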

Not sure how to fix this yet. Perhaps a level-driven node controller would help detect stale conntrack entries. For now we could have a periodic cleanup goroutine that detects invalid entries and flushes them (rough sketch below).
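
Rough sketch of what such a periodic cleanup goroutine could look like. This is not the OVN-Kubernetes implementation: the stale-IP detection is left as a stub, and the flush simply shells out to conntrack-tools.

package main

import (
	"log"
	"os/exec"
	"time"
)

// staleUDPServiceIPs is a stand-in for real bookkeeping: it should return the
// ClusterIPs whose conntrack entries may still point at deleted endpoints.
func staleUDPServiceIPs() []string {
	return nil // placeholder
}

func flushUDPEntries(clusterIP string) {
	// Delete UDP conntrack entries whose original destination is the ClusterIP,
	// forcing the next packet to be re-DNATed against the current OVN load balancer.
	if out, err := exec.Command("conntrack", "-D", "-p", "udp", "--orig-dst", clusterIP).CombinedOutput(); err != nil {
		log.Printf("conntrack flush for %s failed: %v (%s)", clusterIP, err, out)
	}
}

func main() {
	ticker := time.NewTicker(30 * time.Second) // arbitrary interval
	defer ticker.Stop()
	for range ticker.C {
		for _, ip := range staleUDPServiceIPs() {
			flushUDPEntries(ip)
		}
	}
}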

Comment 4 Surya Seetharaman 2022-09-03 12:51:28 UTC

*** This bug has been marked as a duplicate of bug 2106554 ***

Comment 5 Surya Seetharaman 2023-01-09 18:41:14 UTC
The periodic goroutine idea was discussed in one of the PRs (need to dig it up OR dig up the conversation on Slack); it was deemed too heavy, and the right way was actually to fix this in OVN. See https://bugzilla.redhat.com/show_bug.cgi?id=1839103#c5 and https://bugzilla.redhat.com/show_bug.cgi?id=1839103#c8 for more details.

