Description of problem:
Ingress cluster traffic sent directly to pods will not be delivered if the traffic is IP fragmented. This includes UDP traffic. The traffic makes it into the node and OVS, then gets dropped as it is sent to the pod due to a refragmentation issue in OVS:
The purpose of this bug is to track this fix landing into OCP.
*** Bug 1936010 has been marked as a duplicate of this bug. ***
OVS patch was rejected upstream. So we need a different solution...
dcbw has an idea: make all of the MTUs in our cluster equal, then use ip route <pod network> mtu <max mtu - geneve overhead> to force pods to send traffic that will fit the tunnel. I think this is a great idea. Need to think about it more and how it will affect upgrades, etc.
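To make the arithmetic in that idea concrete, here is a minimal sketch that derives the route MTU and builds the matching command string. The 100-byte Geneve overhead, the `eth0` device name, the example pod CIDR, and the helper names are assumptions for illustration only; none of this is taken from ovn-kubernetes.

```python
import ipaddress

# Assumption: 100 bytes of Geneve encapsulation overhead. The real value
# depends on the IP family and the tunnel options in use.
GENEVE_OVERHEAD = 100


def pod_route_mtu(max_mtu: int, overhead: int = GENEVE_OVERHEAD) -> int:
    """Return the largest packet size that still fits inside the tunnel."""
    if overhead < 0 or max_mtu <= overhead:
        raise ValueError("max_mtu must be larger than the tunnel overhead")
    return max_mtu - overhead


def route_command(pod_network: str, max_mtu: int, dev: str) -> str:
    """Build an `ip route` command that caps the MTU toward the pod network."""
    # ip_network() raises ValueError if the CIDR is malformed.
    network = ipaddress.ip_network(pod_network)
    return f"ip route replace {network} dev {dev} mtu {pod_route_mtu(max_mtu)}"
```

For example, `route_command("10.128.0.0/14", 1500, "eth0")` yields `ip route replace 10.128.0.0/14 dev eth0 mtu 1400`, which would have to be applied inside each pod's network namespace.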
Tried this idea out:
Unfortunately I think it introduces more potential pitfalls: we would also have to lower the MTU for some nodeport services and external IPs, because those would resolve to east/west endpoints. That does not scale, because we would end up having to go into every pod and update routes whenever services change.
Spoke with the OVN team and the new plan is to just allow OVN to detect if the packet is too large (larger than the pod MTU) and then send the correct ICMP message to indicate that fragmentation is needed. Moving dependent bug to OVN team.
Note, in 4.8 this issue affects service access in shared gateway mode but not in local gateway mode. In local gateway mode, service packets are handled by the kernel, so the kernel will respond with ICMP needs frag or packet too big. However, in 4.6->4.8, both gateway modes are affected for external gateway -> pod traffic: in both modes, those packets go directly via br-ex to the pod and not via the kernel.
Rather than wait on an OVN fix to handle this case, I've implemented a fix in ovn-kubernetes for these releases: if we detect a packet in the respective mode that is going to be larger than the pod MTU, we send it to the kernel. The kernel will then send out the ICMP needs frag/pkt too big based on routes and next hop interfaces where we have the MTU set to the pod's MTU.
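The steering decision described above can be modeled roughly as follows. This is only an illustrative sketch: the actual fix lives in ovn-kubernetes and its flows on br-ex, and the function names and return labels here are hypothetical. The ICMP type and code values are the standard ones for "fragmentation needed" (IPv4) and "packet too big" (IPv6).

```python
def steer(packet_len: int, pod_mtu: int) -> str:
    """Decide where an ingress packet goes (hypothetical labels).

    Packets that fit the pod MTU go straight to the pod; larger ones are
    handed to the kernel so it can generate the ICMP error.
    """
    if packet_len <= 0 or pod_mtu <= 0:
        raise ValueError("lengths must be positive")
    return "kernel" if packet_len > pod_mtu else "pod"


def icmp_error(ip_version: int) -> tuple[str, int, int]:
    """Return (name, type, code) of the error the kernel would send back."""
    if ip_version == 4:
        # ICMP Destination Unreachable, "fragmentation needed and DF set".
        return ("fragmentation needed", 3, 4)
    if ip_version == 6:
        # ICMPv6 Packet Too Big.
        return ("packet too big", 2, 0)
    raise ValueError("ip_version must be 4 or 6")
```

With a pod MTU of 1400, `steer(1500, 1400)` returns `"kernel"` while `steer(1400, 1400)` returns `"pod"`.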
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.