Description of problem:
Application pods and build pods are frequently (but not 100% of the time) unable to reach external network addresses to pull resources, apply updates, reach endpoints, etc.

Version-Release number of selected component (if applicable):
OpenShift 4.8.24, multiple clusters, all pre-production (go-live on the 28th of January).

How reproducible:
About 80% of the time, pods deployed to a namespace (or build pods) fail to reach upstream addresses external to the cluster; sometimes they succeed. Bouncing a pod re-rolls the dice on whether outbound connections work. Once a pod has connected to the outbound address, it stays connected and there are no routing problems. If we change something in the namespace (add/remove an egressIP, add/remove a NetworkPolicy, or disable/re-enable multitenant isolation), the problem can reappear. We have removed Dynatrace from the picture on one of the clusters and the issue persists.

Steps to Reproduce:
1. Deploy a namespace in the cluster with no NetworkPolicy, egressIP, or multitenant isolation enabled.
2. Spin up a test application pod or push a build refresh; observe a chance of failure on the build refresh (timeout reaching the host address) or a curl failure when rsh'd into the pod (a rough command sketch is included at the end of this report).
3. Delete and redeploy the pod, rsh in, and try curl again; it succeeds or fails at random.

The issue can be present across multiple pods (but not all pods) on the same:
- node
- subnet
- egressIP
- pod application baseline

Actual results:
curls from inside a pod to multiple different external addresses fail to return a result. This is not a DNS issue: upstream nameservers resolve correctly, and the pod appears to attempt the connection to the resolved IP of the upstream address. Interestingly, curls from the host node always succeed; only pod traffic is affected. There are no firewall rules in place (or firewalls in general) between the cluster nodes and the target remote addresses, which are in the same datacenter.

Expected results:
curls to external addresses should succeed every time, not some of the time.

Additional info:
We suspect this is an issue with OVN rule management preventing a successful allocation of a route to external outbound addresses. The fact that we can reproduce the problem with egressIP entirely disabled, no NetworkPolicy in place, and Dynatrace removed suggests a northbound database ruleset that is triggering partially. The linked case has a lot of specific data, including OVN debug output; happy to request or gather any additional data points as needed. There is some urgency on this case, unfortunately: the clusters need to go live in production by the 28th of January (they are currently pre-prod, so testing is OK).
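For completeness, this is roughly how we exercise the failure. The namespace, pod name, image, and target URL below are illustrative placeholders, not the exact ones from the linked case; any reachable external address shows the same behaviour.

  # Fresh namespace with no NetworkPolicy, egressIP, or multitenant isolation applied
  oc new-project egress-repro

  # Throwaway test pod (image is just an example that ships curl)
  oc run curl-test --image=registry.access.redhat.com/ubi8/ubi --command -- sleep 3600
  oc wait --for=condition=Ready pod/curl-test

  # From inside the pod: attempt to reach an external address (fails ~80% of the time)
  oc rsh curl-test curl -sv --max-time 10 https://cdn.redhat.com/ -o /dev/null

  # Same request from the node hosting the pod (always succeeds in our testing)
  NODE=$(oc get pod curl-test -o jsonpath='{.spec.nodeName}')
  oc debug node/"$NODE" -- chroot /host curl -sv --max-time 10 https://cdn.redhat.com/ -o /dev/null

  # Re-roll the dice: delete the pod, recreate it, and repeat the in-pod curl
  oc delete pod curl-test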
https://access.redhat.com/solutions/6664731 created for this issue. I agree, marking this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2034459 is logical/warranted: after implementing the workaround detailed above, the customer's reports confirm duplicate egressIP NAT entries in the OVN nbdb NAT table (a rough sketch of the check is included below). Thanks very much for the help; we'll follow the other BZ listed above for when the patch is made available and will link the case. Best, ~Will
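For anyone landing here from the KCS article, a minimal sketch of the check referenced above, assuming an OVN-Kubernetes cluster. The pod label, container name, and whether you need flags such as --no-leader-only vary by OCP release, so treat this as a starting point rather than the exact procedure.

  # Pick one ovnkube-master pod (label and container names are assumptions; adjust per release)
  MASTER_POD=$(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master \
    -o jsonpath='{.items[0].metadata.name}')

  # Dump the northbound DB NAT table; duplicate egressIP SNAT entries show up as
  # repeated external_ip/logical_ip pairs
  oc -n openshift-ovn-kubernetes exec "$MASTER_POD" -c northd -- \
    ovn-nbctl --no-leader-only list nat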
*** This bug has been marked as a duplicate of bug 2034459 ***