Bug 1990335
Summary: | Flaky CI - [sig-network] Conntrack should be able to preserve UDP traffic when server pod cycles for a ClusterIP service | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Martin Kennelly <mkennell>
Component: | Networking | Assignee: | Surya Seetharaman <surya>
Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen>
Status: | CLOSED DUPLICATE | Docs Contact: |
Severity: | medium | |
Priority: | medium | CC: | aconstan, surya, trozet
Version: | 4.9 | Keywords: | Reopened
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-09-03 12:51:28 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Martin Kennelly
2021-08-05 08:35:00 UTC
This suite/test above is passing on the latest test runs for PR https://github.com/openshift/kubernetes/pull/862 - e.g. https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_kubernetes/862/pull-ci-openshift-kubernetes-master-e2e-gcp/1423849437725200384/build-log.txt. It remains to be seen whether this is consistent and fixes the flakiness.

Test is now stable.

Jamo saw this in some cluster-bot runs with GCP: https://prow.ci.openshift.org/a12c8ac6-2c96-46ac-a482-06b0cd60d092

Dug into the test case and the failure; this is a legitimate problem. The test works like this:

1. There is a ClusterIP service with one backend, podserver-1.
2. The client sends to the ClusterIP service using the same source port; does it get a reply from podserver-1 (checked via agnhost)? Yes, pass.
3. Create a podserver-2 endpoint, delete podserver-1, and keep sending traffic from the client; does it now get a reply from podserver-2? No. Bug.

(A rough sketch of the client side of this test is included below.)

In OVNK-node we flush conntrack entries when an endpoint is deleted for a service. We also remove that endpoint from the service load balancer in OVN. However, there is no synchronization between these two actions, so the following can happen:

1. A conntrack entry exists for ClusterIP -> podserver-1.
2. podserver-1 is deleted, podserver-2 is added.
3. The client is continuously sending traffic to the ClusterIP.
4. The node watcher detects the endpoint deletion and flushes the conntrack entry.
5. The client sends another UDP packet to the ClusterIP; it is DNATed to podserver-1 because OVN has not been updated yet.
6. A new conntrack entry is created for ClusterIP -> podserver-1.
7. OVN updates the load balancer to remove podserver-1 and add podserver-2.
8. The client is now stuck with a stale conntrack entry and its traffic is blackholed.

Not sure how to fix this yet. Perhaps a level-driven node controller would help detect stale conntrack entries. For now we could have a periodic cleanup goroutine that detects invalid entries and flushes them (see the sketch after this description).

*** This bug has been marked as a duplicate of bug 2106554 ***

The periodic go-routine discussion has happened in one of the PRs (need to dig it up OR dig up the conversation on Slack); it was deemed too heavy, and the right way was to fix this in OVN actually? See https://bugzilla.redhat.com/show_bug.cgi?id=1839103#c5 and https://bugzilla.redhat.com/show_bug.cgi?id=1839103#c8 for more details.
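For reference, a minimal sketch of the client side of the test described in steps 1-3 above, assuming illustrative values: the ClusterIP 172.30.0.10, the port numbers, and the "hostname" probe are not taken from the actual e2e test source. The point it illustrates is that binding to a fixed source port makes every request reuse the same conntrack entry, so a stale DNAT entry blackholes all subsequent traffic.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Fixed local (source) port: reusing the same 5-tuple means every request
	// maps onto the same conntrack entry (illustrative values).
	laddr := &net.UDPAddr{IP: net.IPv4zero, Port: 54321}
	raddr := &net.UDPAddr{IP: net.ParseIP("172.30.0.10"), Port: 80} // ClusterIP:port

	conn, err := net.DialUDP("udp", laddr, raddr)
	if err != nil {
		fmt.Println("dial:", err)
		return
	}
	defer conn.Close()

	buf := make([]byte, 1024)
	for {
		// agnhost's netexec UDP handler replies to the payload "hostname" with
		// the serving pod's name, which is how the test distinguishes
		// podserver-1 from podserver-2.
		if _, err := conn.Write([]byte("hostname")); err != nil {
			fmt.Println("write:", err)
		}
		_ = conn.SetReadDeadline(time.Now().Add(3 * time.Second))
		n, err := conn.Read(buf)
		if err != nil {
			fmt.Println("no reply (what a stale conntrack entry looks like):", err)
		} else {
			fmt.Println("served by:", string(buf[:n]))
		}
		time.Sleep(time.Second)
	}
}
```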
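And for context on the "periodic cleanup goroutine" idea floated above (which, per the last comment, was deemed too heavy), a rough sketch of what such a sweep could look like. This is not the ovn-kubernetes implementation: the udpService type, the svcs() lookup, and the use of the conntrack(8) CLI instead of netlink are all assumptions made for illustration.

```go
package main

import (
	"os/exec"
	"strings"
	"time"
)

// udpService stands in for data that would come from the node's Service and
// EndpointSlice informer caches (an assumption for this sketch).
type udpService struct {
	clusterIP string
	endpoints map[string]bool // pod IPs currently backing the service
}

// staleReplySources lists UDP conntrack entries whose original destination is
// the ClusterIP and returns the reply-source (backend pod) IPs that are no
// longer valid endpoints, i.e. entries like the one created in step 6 above.
func staleReplySources(svc udpService) []string {
	out, err := exec.Command("conntrack", "-L", "-p", "udp",
		"--orig-dst", svc.clusterIP).Output()
	if err != nil {
		return nil
	}
	var stale []string
	for _, line := range strings.Split(string(out), "\n") {
		// In conntrack -L output the second "src=" field is the reply source,
		// i.e. the pod IP the entry DNATs to.
		var srcs []string
		for _, f := range strings.Fields(line) {
			if strings.HasPrefix(f, "src=") {
				srcs = append(srcs, strings.TrimPrefix(f, "src="))
			}
		}
		if len(srcs) == 2 && !svc.endpoints[srcs[1]] {
			stale = append(stale, srcs[1])
		}
	}
	return stale
}

// periodicConntrackCleanup sweeps on every tick and deletes stale entries, so
// a client stuck on a deleted backend (step 8 above) recovers within one
// interval.
func periodicConntrackCleanup(svcs func() []udpService, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			for _, svc := range svcs() {
				for _, podIP := range staleReplySources(svc) {
					// conntrack -D -p udp --orig-dst <clusterIP> --reply-src <stale pod IP>
					_ = exec.Command("conntrack", "-D", "-p", "udp",
						"--orig-dst", svc.clusterIP, "--reply-src", podIP).Run()
				}
			}
		}
	}
}

func main() {
	// Example wiring with one hard-coded service; real code would read the
	// informer caches instead (illustrative values only).
	svcs := func() []udpService {
		return []udpService{{
			clusterIP: "172.30.0.10",
			endpoints: map[string]bool{"10.128.2.15": true},
		}}
	}
	periodicConntrackCleanup(svcs, 30*time.Second, make(chan struct{}))
}
```

Each sweep diffs the reply source of ClusterIP conntrack entries against the current endpoint set, which is the "detect invalid entries and flush them" behaviour described above; the trade-off is the per-interval listing cost, which is part of why the approach was considered too heavy.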