Description of problem:

Consider the following topology:

pod 1400 MTU -----OVS---- eth0 1500 MTU -------- external net ---------- UDP server on port 9999

The pod IP is 10.244.0.3. The pod sends a packet towards the UDP server; the server receives the packet, echoes back the payload, and sends the packet back to the pod. This works just fine with small packets. However, with a jumbo packet:

1. The pod sends a 2k byte packet. It gets fragmented as it leaves the pod. We can see this with tcpdump in the pod:

22:13:07.578540 0a:58:0a:f4:00:03 > 0a:58:0a:f4:00:01, ethertype IPv4 (0x0800), length 1410: (tos 0x0, ttl 64, id 58534, offset 0, flags [+], proto UDP (17), length 1396)
    10.244.0.3.47932 > 10.96.87.70.9999: UDP, bad length 2005 > 1368
22:13:07.578686 0a:58:0a:f4:00:03 > 0a:58:0a:f4:00:01, ethertype IPv4 (0x0800), length 671: (tos 0x0, ttl 64, id 58534, offset 1376, flags [none], proto UDP (17), length 657)
    10.244.0.3 > 10.96.87.70: ip-proto-17

2. In OVS, OVN flows send the packets into conntrack, which forces reassembly of the packet as it traverses the OpenFlow pipeline.

3. The packet leaves OVS, is fragmented on the way out of eth0, and is received by the UDP server.

4. The UDP server sends back a packet with the same payload; it is fragmented as it is sent back to the originating node. With tcpdump on eth0 we can see the fragments arrive at the node:

21:35:16.952611 02:42:ac:12:00:02 > 02:42:ac:12:00:03, ethertype IPv4 (0x0800), length 1514: (tos 0x0, ttl 64, id 30325, offset 0, flags [+], proto UDP (17), length 1500)
    172.18.0.2.9999 > 172.18.0.3.47932: UDP, bad length 2000 > 1472
21:35:16.952741 02:42:ac:12:00:02 > 02:42:ac:12:00:03, ethertype IPv4 (0x0800), length 562: (tos 0x0, ttl 64, id 30325, offset 1480, flags [none], proto UDP (17), length 548)
    172.18.0.2 > 172.18.0.3: ip-proto-17

5. We can also see in OVS that the packet is reassembled by CT and sent through the OVN pipeline.

Before the packet is received:
cookie=0xdeff105, duration=75314.612s, table=1, n_packets=193, n_bytes=35606, priority=100,ct_state=+est+trk,ip actions=output:"patch-breth0_ov"

After the packet is received:
cookie=0xdeff105, duration=75320.602s, table=1, n_packets=194, n_bytes=37648, priority=100,ct_state=+est+trk,ip actions=output:"patch-breth0_ov"

^ shows a single packet, with around a 2k byte size difference.

6. The packet is then dropped on its way to the pod. This is because OVS/CT cannot refragment it: the MTU of the pod is 1400, but the original maximum fragment was 1500. From the OVS documentation:

    If ct is executed on IPv4 (or IPv6) fragments, then the message is implicitly reassembled before sending to the connection tracker and refragmented upon output, to the original maximum received fragment size.

7. If I then change the MTU of the eth0 interface to 1400, the re-fragmentation works and the packet arrives at the pod.

Unfortunately it is not an option to change the eth0 MTU to match the pod MTU: with OVN we have to reserve headroom for Geneve encapsulation, so we must set the pod MTU lower than the NIC MTU.

We need the ability for CT/OVS to re-fragment according to the new MTU on the veth interface, and not just to the original fragment size. Otherwise OVN/OVS cannot work with IP fragmentation.
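The fragment sizes in the captures above follow directly from standard IPv4 fragmentation: each non-final fragment carries the largest multiple of 8 bytes that fits in (MTU minus the 20-byte IP header), and the UDP header rides in the first fragment. A minimal sketch (the `fragment_sizes` helper is illustrative, not part of any of the tools involved) that reproduces the IP total-lengths seen in both tcpdump traces:

```python
IP_HDR = 20   # IPv4 header, no options
UDP_HDR = 8

def fragment_sizes(udp_payload: int, mtu: int):
    """Return the IP total-length of each fragment of one UDP datagram."""
    data = UDP_HDR + udp_payload          # UDP header travels in the first fragment
    per_frag = (mtu - IP_HDR) // 8 * 8    # fragment offsets are in 8-byte units
    sizes = []
    while data > 0:
        chunk = min(per_frag, data)
        sizes.append(IP_HDR + chunk)
        data -= chunk
    return sizes

# Pod side, 1400 MTU, 2005-byte UDP payload (first trace):
print(fragment_sizes(2005, 1400))  # -> [1396, 657]
# eth0 side, 1500 MTU, 2000-byte UDP payload (second trace):
print(fragment_sizes(2000, 1500))  # -> [1500, 548]
```

This also makes the failure in step 6 concrete: CT refragments to the original maximum fragment (1480 bytes of data, from the 1500 MTU), which cannot be transmitted on the 1400 MTU veth.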
Copying some Slack discussion between Tim and me... I think the key difference here is UDP. With TCP, we have previously demonstrated that, thanks to GSO, large packets make it all the way to the NIC for segmentation and are untouched by veth, OVS, and any tunnel drivers. I'd hope that would work the same on the receive path: the kernel/OVS would know that the packet is being delivered to a veth that can do GSO, and thus it doesn't actually need to fragment, but can just forward the whole large packet.
Posted a patch for this issue upstream: https://lore.kernel.org/netdev/20210319204307.3128280-1-aconole@redhat.com/T/#u
Upstream didn't like the quick fix I proposed, so it was NAK'd. I have scheduled a meeting to discuss this bug within network engineering for OVN/OVS.
Updates from meeting:
- Tim will check with Girish about what he is doing with the custom VTEP
- Aaron will push back upstream as much as he credibly can
- OVN/ovn-kube will investigate using check_pkt_larger/gateway-mtu to send ICMP needs-frag for large packets, but drop them
- OVN will implement rate control on the pinctl thread
- (step 2) OVN will investigate fragmenting the larger packet, sending it along, and then sending ICMP needs-frag

I pushed back on upstream, but it seems the patch probably won't be accepted. Needs follow-up from the OVN team.
After investigating alternative solutions, it looks like our best bet at this point is to use OVN's check_pkt_larger and have OVN send the correct ICMP message. Moving this to the OVN team.
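For context on what that ICMP message looks like: for IPv4 it is a Destination Unreachable (type 3) with code 4 (fragmentation needed), carrying the next-hop MTU plus the offending packet's IP header and first 8 payload bytes (RFC 792/1191). A rough sketch of constructing one in Python; this is only an illustration of the wire format, not OVN's actual implementation:

```python
import struct

def icmp_checksum(data: bytes) -> int:
    # Standard Internet checksum (RFC 1071): fold the one's-complement
    # sum of 16-bit words, then complement.
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_needs_frag(next_hop_mtu: int, orig_ip_header_plus_8: bytes) -> bytes:
    # ICMP type 3 (Destination Unreachable), code 4 (Fragmentation Needed
    # and DF set): 2 unused bytes, 2-byte next-hop MTU, then the original
    # IP header + first 8 bytes of its payload.
    header = struct.pack("!BBHHH", 3, 4, 0, 0, next_hop_mtu)
    chk = icmp_checksum(header + orig_ip_header_plus_8)
    return struct.pack("!BBHHH", 3, 4, chk, 0, next_hop_mtu) + orig_ip_header_plus_8
```

In our topology the advertised next-hop MTU would be the pod MTU (1400), so a well-behaved sender performs path MTU discovery and stops emitting fragments that CT cannot refragment.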
upstream feature: http://patchwork.ozlabs.org/project/ovn/cover/cover.1627405420.git.lorenzo.bianconi@redhat.com/