Description of problem:
Application pods and build pods are frequently (but not 100% of the time) unable to reach external network addresses to pull resources, apply updates, reach endpoints, etc.

Version-Release number of selected component (if applicable):
OpenShift 4.8.24, multiple clusters, all pre-production (go-live on the 28th of January).

How reproducible:
About 80% of the time, pods deployed to a namespace (or build pods) fail to reach upstream addresses external to the cluster; sometimes they succeed. Bouncing a pod re-rolls the dice on whether outbound connections work. Once a pod has connected to the outbound address, it stays connected and there are no routing problems. If we change something in the namespace (add/remove an egressIP, add/remove a NetworkPolicy, or disable/re-enable multitenant isolation), the problem can reappear. We have removed Dynatrace from the picture on one of the clusters and the issue persists.

Steps to Reproduce:
1. Deploy a namespace in the cluster with no NetworkPolicy, egressIP, or multitenant isolation enabled.
2. Spin up a test application pod or push a build refresh; observe a chance of failure on the build refresh (timeout reaching the host address) or a curl failure when rsh'd into the pod (a rough command sketch is included at the end of this report).
3. Delete and redeploy the pod, rsh in, and try curl again; it succeeds or fails at random.

The issue can be present across multiple pods (but not all pods) on the same:
- node
- subnet
- egressIP
- pod application baseline

Actual results:
curls from inside a pod to multiple different external addresses fail to return a result. This is not a DNS issue: upstream nameservers resolve correctly, and the pod appears to attempt the connection to the resolved IP of the upstream address. Interestingly, curls from the host node always succeed; only pod traffic is affected. There are no firewall rules in place (or firewalls in general) between the cluster nodes and the target remote addresses, which are in the same datacenter.

Expected results:
curls to external addresses should succeed every time, not some of the time.

Additional info:
We suspect this is an issue with OVN rule management preventing a successful allocation of a route to external outbound addresses. The fact that we can reproduce the problem with egressIP entirely disabled, no NetworkPolicy in place, and Dynatrace removed suggests a northbound database ruleset that is triggering partially. The linked case has a lot of specific data, including OVN debug output; happy to request or gather any additional data points as needed. There is some urgency on this case, unfortunately: the clusters need to go live in production by the 28th of January (they are currently pre-prod, so testing is OK).
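For completeness, this is roughly how we exercise the failure. The namespace, pod name, image, and target URL below are illustrative placeholders, not the exact ones from the linked case; any reachable external address shows the same behaviour.

  # Fresh namespace with no NetworkPolicy, egressIP, or multitenant isolation applied
  oc new-project egress-repro

  # Throwaway test pod (image is just an example that ships curl)
  oc run curl-test --image=registry.access.redhat.com/ubi8/ubi --command -- sleep 3600
  oc wait --for=condition=Ready pod/curl-test

  # From inside the pod: attempt to reach an external address (fails ~80% of the time)
  oc rsh curl-test curl -sv --max-time 10 https://cdn.redhat.com/ -o /dev/null

  # Same request from the node hosting the pod (always succeeds in our testing)
  NODE=$(oc get pod curl-test -o jsonpath='{.spec.nodeName}')
  oc debug node/"$NODE" -- chroot /host curl -sv --max-time 10 https://cdn.redhat.com/ -o /dev/null

  # Re-roll the dice: delete the pod, recreate it, and repeat the in-pod curl
  oc delete pod curl-test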
https://access.redhat.com/solutions/6664731 created for this issue. I agree, marking this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2034459 is logical/warranted: after implementing the workaround detailed above, the customer's reports confirm duplicate egressIP NAT entries in the OVN nbdb NAT table (a rough sketch of the check is included below). Thanks very much for the help; we'll follow the other BZ listed above for when the patch is made available and will link the case. Best, ~Will
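For anyone landing here from the KCS article, a minimal sketch of the check referenced above, assuming an OVN-Kubernetes cluster. The pod label, container name, and whether you need flags such as --no-leader-only vary by OCP release, so treat this as a starting point rather than the exact procedure.

  # Pick one ovnkube-master pod (label and container names are assumptions; adjust per release)
  MASTER_POD=$(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master \
    -o jsonpath='{.items[0].metadata.name}')

  # Dump the northbound DB NAT table; duplicate egressIP SNAT entries show up as
  # repeated external_ip/logical_ip pairs
  oc -n openshift-ovn-kubernetes exec "$MASTER_POD" -c northd -- \
    ovn-nbctl --no-leader-only list nat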
*** This bug has been marked as a duplicate of bug 2034459 ***