Bug 1927046 - IP refragmentation does not work with MTU mismatch between orig and dest interfaces
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: OVN
Version: RHEL 8.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: lorenzo bianconi
QA Contact: Ehsan Elahi
Reported: 2021-02-09 22:24 UTC by Tim Rozet
Modified: 2023-03-13 07:06 UTC (History)

Last Closed: 2023-03-13 07:06:38 UTC
Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | FD-1077 | 0 | None | None | None | 2021-09-22 13:11:33 UTC

Description Tim Rozet 2021-02-09 22:24:20 UTC
Description of problem:
Consider the following topology:
pod 1400 MTU -----OVS----eth0 1500 MTU  --------external net ---------- UDP Server on port 9999

pod IP is 10.244.0.3

The pod sends a packet to the UDP server. The server receives the packet and echoes the payload back to the pod. This works fine with small packets. However, with a packet larger than the MTU:

1. pod sends 2k byte packet. It gets fragmented as it leaves the pod. We can see this in tcpdump in the pod:
22:13:07.578540 0a:58:0a:f4:00:03 > 0a:58:0a:f4:00:01, ethertype IPv4 (0x0800), length 1410: (tos 0x0, ttl 64, id 58534, offset 0, flags [+], proto UDP (17), length 1396)
    10.244.0.3.47932 > 10.96.87.70.9999: UDP, bad length 2005 > 1368
22:13:07.578686 0a:58:0a:f4:00:03 > 0a:58:0a:f4:00:01, ethertype IPv4 (0x0800), length 671: (tos 0x0, ttl 64, id 58534, offset 1376, flags [none], proto UDP (17), length 657)
    10.244.0.3 > 10.96.87.70: ip-proto-17

2. in OVS, OVN flows send the packets into conntrack, which forces reassembly of the packet as it traverses the OpenFlow pipeline.

3. the packet leaves the OVS, and is fragmented on the way out of eth0, and the packet is received by UDP Server.

4. UDP Server sends back a packet with the same payload. It is fragmented as it is sent back to the originating node. We can see with tcpdump on eth0 of the node that the fragments arrive:
21:35:16.952611 02:42:ac:12:00:02 > 02:42:ac:12:00:03, ethertype IPv4 (0x0800), length 1514: (tos 0x0, ttl 64, id 30325, offset 0, flags [+], proto UDP (17), length 1500)
    172.18.0.2.9999 > 172.18.0.3.47932: UDP, bad length 2000 > 1472
21:35:16.952741 02:42:ac:12:00:02 > 02:42:ac:12:00:03, ethertype IPv4 (0x0800), length 562: (tos 0x0, ttl 64, id 30325, offset 1480, flags [none], proto UDP (17), length 548)
    172.18.0.2 > 172.18.0.3: ip-proto-17

5. We can also see in OVS that the packet is reassembled by CT and sent through the OVN pipeline:
BEFORE packet received:
cookie=0xdeff105, duration=75314.612s, table=1, n_packets=193, n_bytes=35606, priority=100,ct_state=+est+trk,ip actions=output:"patch-breth0_ov"

AFTER packet received:
cookie=0xdeff105, duration=75320.602s, table=1, n_packets=194, n_bytes=37648, priority=100,ct_state=+est+trk,ip actions=output:"patch-breth0_ov"

^ n_packets increases by 1 and n_bytes by 2042, i.e. a single packet of roughly 2k bytes

6. The packet is then dropped on its way to the pod. OVS/CT cannot refragment it, because the pod MTU is 1400 while the original maximum received fragment size was 1500. From the OVS documentation:

If ct is executed on IPv4 (or IPv6) fragments, then the message
       is implicitly reassembled before sending to the connection
       tracker and refragmented upon output, to the original maximum
       received fragment size.

7. If I then change the MTU of the eth0 interface to be 1400, the re-frag works and the packet arrives at the pod.

Unfortunately it is not an option to change the eth0 MTU to match the pod MTU. This is because with OVN we have to reserve headroom for Geneve traffic, so we must set the pod MTU lower than the NIC MTU. We need the ability for CT/OVS to re-fragment according to the new MTU on the veth interface and not just to the original frag size. Otherwise OVN/OVS cannot work with IP fragmentation.
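
The numbers in the tcpdump output and flow counters above can be reproduced with a little arithmetic. The following is a minimal sketch, assuming a 20-byte IPv4 header without options, an 8-byte UDP header and a 14-byte Ethernet header; the function name fragment_lengths is made up for illustration and is not part of OVS or the kernel:

```python
# Sketch of IPv4 fragmentation arithmetic (not OVS or kernel code).
IP_HEADER = 20    # assumed: IPv4 header without options
UDP_HEADER = 8
ETH_HEADER = 14


def fragment_lengths(udp_payload, mtu):
    """Return (offset, ip_total_length) for each fragment of a UDP datagram."""
    remaining = udp_payload + UDP_HEADER
    # Every fragment except the last must carry a multiple of 8 data bytes.
    max_data = (mtu - IP_HEADER) // 8 * 8
    offset = 0
    fragments = []
    while remaining > 0:
        data = min(max_data, remaining)
        fragments.append((offset, data + IP_HEADER))
        offset += data
        remaining -= data
    return fragments


# Step 1: 2005-byte payload leaving the pod (MTU 1400).
pod_to_server = fragment_lengths(2005, 1400)

# Step 4: 2000-byte payload arriving on eth0 (MTU 1500).
server_to_pod = fragment_lengths(2000, 1500)

# Step 5: size of the reassembled frame, compared with the n_bytes delta.
reassembled_frame = 2000 + UDP_HEADER + IP_HEADER + ETH_HEADER
counter_delta = 37648 - 35606

# Step 6: refragmenting to the original maximum fragment size yields a
# fragment that does not fit the pod MTU.
largest_fragment = max(length for _, length in server_to_pod)
fits_pod_mtu = largest_fragment <= 1400

# What is being asked for: refragment to the MTU of the output interface.
wanted = fragment_lengths(2000, 1400)

print(pod_to_server, server_to_pod, reassembled_frame, counter_delta)
print(largest_fragment, fits_pod_mtu, wanted)
```

The computed offsets and lengths match the captured fragments (1396/657 bytes at offsets 0/1376 leaving the pod, 1500/548 bytes at offsets 0/1480 arriving on eth0), and the reassembled frame size equals the n_bytes increase of 2042.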

Comment 1 Dan Williams 2021-02-09 22:43:05 UTC
Copying some Slack discussion between Tim and me... I think the key difference is UDP here. With TCP, we have previously demonstrated that, due to GSO, large packets make it all the way to the NIC for fragmentation and are untouched by veth, OVS, and any tunnel drivers. I'd hope that would work the same on the receive path; that the kernel/OVS would know that the packet is getting delivered to a veth that can do GSO and thus it doesn't actually need to fragment, but can just forward the whole large packet.

Comment 8 Aaron Conole 2021-03-19 20:49:38 UTC
Posted a patch for this issue upstream:

https://lore.kernel.org/netdev/20210319204307.3128280-1-aconole@redhat.com/T/#u

Comment 9 Aaron Conole 2021-04-01 17:46:57 UTC
Upstream didn't like the quick fix I proposed, so it was NAK'd.

I have scheduled a meeting to discuss this bug within network engineering for OVN/OVS.

Comment 10 Aaron Conole 2021-04-13 17:23:09 UTC
Updates from meeting:
- Tim will check with Girish about what he is doing with the custom VTEP
- Aaron will push back upstream as much as he credibly can
- OVN/ovn-kube will investigate using check_pkt_larger/gateway-mtu to send ICMP needsfrag for large packets but drop the packets themselves
- OVN will implement rate control on pinctl thread
- (step 2) OVN will investigate fragging larger packet, sending along, and then sending ICMP needsfrag


I pushed back upstream, but it seems the patch probably won't be accepted. This needs follow-up from the OVN team.

Comment 11 Tim Rozet 2021-04-20 14:43:48 UTC
After investigating alternative solutions it looks like our best bet at this point is using OVN check_pkt_larger and having OVN send the correct ICMP message. Moving this to OVN team.
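
For illustration, the "check size, then answer with ICMP" idea can be sketched as below. This is not OVN's implementation: the function names and the sample packet are hypothetical, and the size check here uses the IPv4 total length, which may differ from what check_pkt_larger actually compares. The ICMP layout (type 3, code 4, next-hop MTU in the low 16 bits of the second word, followed by the original IP header plus 8 data bytes) follows RFC 792 and RFC 1191.

```python
import struct


def inet_checksum(data: bytes) -> int:
    """Standard 16-bit one's-complement Internet checksum."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def icmp_frag_needed(original_ip_packet: bytes, next_hop_mtu: int) -> bytes:
    """Build an ICMP Destination Unreachable / Fragmentation Needed message."""
    ihl = (original_ip_packet[0] & 0x0F) * 4
    quoted = original_ip_packet[: ihl + 8]   # original IP header + 8 bytes
    unchecked = struct.pack("!BBHHH", 3, 4, 0, 0, next_hop_mtu) + quoted
    checksum = inet_checksum(unchecked)
    return struct.pack("!BBHHH", 3, 4, checksum, 0, next_hop_mtu) + quoted


def handle(ip_packet: bytes, mtu: int):
    """Hypothetical decision: forward if it fits, else drop and reply."""
    total_length = struct.unpack("!H", ip_packet[2:4])[0]
    if total_length > mtu:
        return "icmp", icmp_frag_needed(ip_packet, mtu)
    return "forward", ip_packet


# Hypothetical sample: header of the reassembled 2028-byte reply from step 4
# (payload bytes omitted, since only the header and first 8 bytes are quoted).
sample = struct.pack(
    "!BBHHHBBH4s4s", 0x45, 0, 2028, 30325, 0, 64, 17, 0,
    bytes([172, 18, 0, 2]), bytes([172, 18, 0, 3]),
) + struct.pack("!HHHH", 9999, 47932, 2008, 0)

action, message = handle(sample, 1400)
print(action, message[:8].hex())
```

With a 1400-byte limit, the sample takes the "icmp" branch and the reply advertises a next-hop MTU of 1400, which lets the sender lower its path MTU for subsequent packets.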

