Description of problem:

OpenShift has a use case on AWS where some nodes of a cluster, but not all, are deployed on AWS Local Zones. As a result, these nodes reside on a different network than the other nodes of the cluster. While both networks and all nodes of the cluster could be configured with a high MTU value (say 9001), the paths between those networks contain segments using a lower MTU value (say 1300), forcing the cluster to be configured and function with that suboptimal lower MTU value. Ideally, communication within each network and to other external networks could use the higher MTU value.

While PMTUD should work in such a scenario for intra-cluster traffic, there are issues when geneve traffic is involved. When observing such a cluster configured with the higher MTU value and inspecting geneve traffic, we can see constant ICMP NEEDS FRAG replies as a result of the geneve traffic traversing the lower MTU segment:

sh-4.4# tcpdump -i br-ex -vveenn icmp
dropped privs to tcpdump
tcpdump: listening on br-ex, link-type EN10MB (Ethernet), capture size 262144 bytes
15:49:33.942921 16:e5:eb:e1:a6:37 > 16:69:d7:61:83:63, ethertype IPv4 (0x0800), length 70: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto ICMP (1), length 56)
    10.0.192.1 > 10.0.193.203: ICMP 10.0.22.231 unreachable - need to frag (mtu 1300), length 36
        (tos 0x0, ttl 64, id 38422, offset 0, flags [DF], proto UDP (17), length 2747)
    10.0.193.203.47768 > 10.0.22.231.6081: Geneve [|geneve]
15:49:35.430295 16:e5:eb:e1:a6:37 > 16:69:d7:61:83:63, ethertype IPv4 (0x0800), length 70: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto ICMP (1), length 56)
    10.0.192.1 > 10.0.193.203: ICMP 10.0.8.2 unreachable - need to frag (mtu 1300), length 36
        (tos 0x0, ttl 64, id 54341, offset 0, flags [DF], proto UDP (17), length 2747)
    10.0.193.203.39672 > 10.0.8.2.6081: Geneve [|geneve]

These ICMP NEEDS FRAG replies are not observed as inner traffic, as expected:

sh-4.4# tcpdump -i genev_sys_6081 -vveenn icmp
dropped privs to tcpdump
tcpdump: listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel

But the route exception due to PMTUD does not seem to be happening, which is unexpected:

sh-4.4# ip r get 10.0.8.2
10.0.8.2 via 10.0.192.1 dev br-ex src 10.0.193.203 uid 0
    cache

If we trigger the PMTUD route exception, using tracepath for example:

sh-4.4# tracepath -m 1 -n 10.0.8.2
 1?: [LOCALHOST]                      pmtu 9001
 1:  10.0.192.1                       0.260ms pmtu 1300
 1:  no reply
     Too many hops: pmtu 1300
     Resume: pmtu 1300
sh-4.4# ip r get 10.0.8.2
10.0.8.2 via 10.0.192.1 dev br-ex src 10.0.193.203 uid 0
    cache expires 507sec mtu 1300

then we no longer see ICMP NEEDS FRAG replies to the geneve traffic towards that peer (10.0.8.2):

sh-4.4# tcpdump -i br-ex -vveenn icmp
dropped privs to tcpdump
tcpdump: listening on br-ex, link-type EN10MB (Ethernet), capture size 262144 bytes
...
15:53:39.062932 16:e5:eb:e1:a6:37 > 16:69:d7:61:83:63, ethertype IPv4 (0x0800), length 70: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto ICMP (1), length 56)
    10.0.192.1 > 10.0.193.203: ICMP 10.0.22.231 unreachable - need to frag (mtu 1300), length 36
        (tos 0x0, ttl 64, id 23763, offset 0, flags [DF], proto UDP (17), length 2747)
    10.0.193.203.11297 > 10.0.22.231.6081: Geneve [|geneve]
15:53:42.244009 16:e5:eb:e1:a6:37 > 16:69:d7:61:83:63, ethertype IPv4 (0x0800), length 70: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto ICMP (1), length 56)
    10.0.192.1 > 10.0.193.203: ICMP 10.0.22.231 unreachable - need to frag (mtu 1300), length 36
        (tos 0x0, ttl 64, id 23960, offset 0, flags [DF], proto UDP (17), length 2747)
    10.0.193.203.17650 > 10.0.22.231.6081: Geneve [|geneve]
...
But now we start to see ICMP NEEDS FRAG replies to the inner traffic that would be encapsulated and sent to that peer:

sh-4.4# tcpdump -i genev_sys_6081 -vveenn icmp
dropped privs to tcpdump
tcpdump: listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
16:00:05.470810 0a:58:a8:fe:00:07 > 0a:58:a8:fe:00:08, ethertype IPv4 (0x0800), length 590: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto ICMP (1), length 576)
    10.129.2.16 > 10.130.2.3: ICMP 10.129.2.16 unreachable - need to frag (mtu 1242), length 556 (wrong icmp cksum 832f (->9532)!)
        (tos 0x0, ttl 63, id 35816, offset 0, flags [DF], proto TCP (6), length 2689)
    10.130.2.3.8443 > 10.129.2.16.55720: Flags [P.], seq 4026138638:4026141275, ack 1994649385, win 495, options [nop,nop,TS val 1293625627 ecr 1191795028], length 2637
16:00:05.686909 0a:58:a8:fe:00:07 > 0a:58:a8:fe:00:08, ethertype IPv4 (0x0800), length 590: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto ICMP (1), length 576)
    10.129.2.16 > 10.130.2.3: ICMP 10.129.2.16 unreachable - need to frag (mtu 1242), length 556 (wrong icmp cksum 8257 (->945a)!)
        (tos 0x0, ttl 63, id 35817, offset 0, flags [DF], proto TCP (6), length 2689)

Assuming a working PMTUD towards the geneve peers: when the geneve kernel driver is about to encapsulate a packet and send it out through the geneve tunnel, it checks the PMTU towards the tunnel peer. If the packet plus the encapsulation overhead would exceed this PMTU, it drops the packet, fabricates an ICMP NEEDS FRAG packet and sends that back to the transmitter.

Presumably, this ICMP packet reaches the OVS br-int bridge through the geneve OF port. While this mechanism might work on simple OVS bridge implementations with standard switching via a NORMAL flow, it is likely that the more complex OVN pipeline relies on the geneve VNI/TLV options metadata to know what to do with it; if that metadata is missing or interpreted incorrectly, the packet may be dropped, preventing it from reaching the original transmitter.
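For reference, the mtu 1242 reported in the inner NEEDS FRAG replies is consistent with the 1300-byte path MTU minus the geneve encapsulation overhead. A small sketch of that arithmetic; the per-header sizes below are my assumption of the encapsulation in use (IPv4 underlay, one 8-byte geneve option), not something taken from the captures:

```python
# Assumed per-header sizes for the geneve encapsulation (not confirmed
# against the captures; adjust if the deployment differs).
OUTER_IPV4 = 20   # outer IPv4 header
OUTER_UDP = 8     # outer UDP header (dst port 6081)
GENEVE_BASE = 8   # fixed geneve header
GENEVE_OPT = 8    # metadata carried as a geneve TLV option
INNER_ETH = 14    # inner Ethernet header

OVERHEAD = OUTER_IPV4 + OUTER_UDP + GENEVE_BASE + GENEVE_OPT + INNER_ETH

def inner_pmtu(outer_pmtu: int) -> int:
    """PMTU the geneve driver should report back to inner senders."""
    return outer_pmtu - OVERHEAD

print(OVERHEAD)          # 58
print(inner_pmtu(1300))  # 1242, matching the mtu in the inner NEEDS FRAG replies
```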
I say "presumably" because I have yet to find a way to trace for certain where those ICMP NEEDS FRAG replies are being dropped. But I know that a pod does not become aware of the proper PMTU:

❯ kubectl exec -ti nettools -- tracepath -n 10.129.2.6
 1?: [LOCALHOST]                      pmtu 8000
 1:  10.129.2.6                       0.932ms asymm  2
 1:  10.129.2.6                       0.445ms asymm  2
 2:  no reply
...
30:  no reply
     Too many hops: pmtu 8000
     Resume: pmtu 8000

even though the ICMP NEEDS FRAG reply does happen (this required manually triggering PMTUD on the involved nodes as mentioned earlier):

sh-4.4# tcpdump -i genev_sys_6081 -eennvv icmp
dropped privs to tcpdump
tcpdump: listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
...
18:00:29.800259 0a:58:a8:fe:00:07 > 0a:58:a8:fe:00:08, ethertype IPv4 (0x0800), length 590: (tos 0x0, ttl 4, id 0, offset 0, flags [DF], proto ICMP (1), length 576)
    10.129.2.6 > 10.130.2.5: ICMP 10.129.2.6 unreachable - need to frag (mtu 1242), length 556
        (tos 0x0, ttl 4, id 0, offset 0, flags [DF], proto UDP (17), length 8000)
    10.130.2.5.44754 > 10.129.2.6.44456: UDP, length 7972
...

So there are three aspects I would like to look at:
(1) Why the ICMP NEEDS FRAG replies to the geneve traffic are not triggering the route exception due to PMTUD.
(2) How I can know for certain whether those ICMP NEEDS FRAG replies are being dropped in the OVS pipeline.
(3) If they are, whether it would be possible to do something here. Thinking about some options:
- OVN could restore usable geneve metadata for that ICMP packet from conntrack.
- Have the geneve driver include the geneve metadata if it is not doing so, or change OVN so that it interprets this metadata correctly if not doing so, or both.

A couple of links to relevant bits of the geneve kernel driver implementation:
https://github.com/torvalds/linux/blob/9ed22ae6be817d7a3f5c15ca22cbc9d3963b481d/drivers/net/geneve.c#L923C18-L923C18
https://github.com/torvalds/linux/blob/9ed22ae6be817d7a3f5c15ca22cbc9d3963b481d/net/ipv4/ip_tunnel_core.c#L422
upstream patch: https://patchwork.ozlabs.org/project/ovn/patch/9d44c99689fe17899ef9228c7149379929af3e80.1701167801.git.lorenzo.bianconi@redhat.com/
Hi Lorenzo,
what is the status for this issue? From the changelog, it seems that the patch is reverted:

* Mon Dec 18 2023 Numan Siddique <numans> - 23.06.1-73
- Revert "ovn: add geneve PMTUD support"
  [Upstream: bbeec7987576b3fe43dd15b080307ee9ae7333ed]
(In reply to Jianlin Shi from comment #6)
> Hi Lorenzo,
> what is the status for this issue? From the changelog, it seems that the
> patch is reverted:
> * Mon Dec 18 2023 Numan Siddique <numans> - 23.06.1-73
> - Revert "ovn: add geneve PMTUD support"
>   [Upstream: bbeec7987576b3fe43dd15b080307ee9ae7333ed]

The new fix has been applied last week upstream:
https://github.com/ovn-org/ovn/commit/221476a01f2670cf4eb78cd9353e709cb8a16329
(In reply to lorenzo bianconi from comment #7)
> (In reply to Jianlin Shi from comment #6)
> > Hi Lorenzo,
> > what is the status for this issue? From the changelog, it seems that the
> > patch is reverted:
> > * Mon Dec 18 2023 Numan Siddique <numans> - 23.06.1-73
> > - Revert "ovn: add geneve PMTUD support"
> >   [Upstream: bbeec7987576b3fe43dd15b080307ee9ae7333ed]
>
> The new fix has been applied last week upstream:
> https://github.com/ovn-org/ovn/commit/221476a01f2670cf4eb78cd9353e709cb8a16329

Is it backported to downstream? If yes, in which version? As the bug is in ON_QA status, we need to find the right version to test.
The version in the errata is ovn23.06.1-85. Setting the bug back to ASSIGNED per comment 8.
Tested with the following steps:

1. Start OVN on the server:

systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:1.1.207.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=1.1.207.25
systemctl restart ovn-controller
ovn-nbctl ls-add sw0
ovn-nbctl lsp-add sw0 sw0-port1
ovn-nbctl lsp-set-addresses sw0-port1 "50:54:00:00:00:03 10.0.0.3 1000::3"
ovn-nbctl lsp-add sw0 sw0-port2
ovn-nbctl lsp-set-addresses sw0-port2 "50:54:00:00:00:04 10.0.0.4 1000::4"
ovn-nbctl ls-add sw1
ovn-nbctl lsp-add sw1 sw1-port1
ovn-nbctl lsp-set-addresses sw1-port1 "40:54:00:00:00:03 20.0.0.3 2000::3"
ovn-nbctl lr-add lr0
ovn-nbctl lrp-add lr0 lr0-sw0 00:00:00:00:ff:01 10.0.0.1/24 1000::a/64
ovn-nbctl lsp-add sw0 sw0-lr0
ovn-nbctl lsp-set-type sw0-lr0 router
ovn-nbctl lsp-set-addresses sw0-lr0 router
ovn-nbctl lsp-set-options sw0-lr0 router-port=lr0-sw0
ovn-nbctl lrp-add lr0 lr0-sw1 00:00:00:00:ff:02 20.0.0.1/24 2000::a/64
ovn-nbctl lsp-add sw1 sw1-lr0
ovn-nbctl lsp-set-type sw1-lr0 router
ovn-nbctl lsp-set-addresses sw1-lr0 router
ovn-nbctl lsp-set-options sw1-lr0 router-port=lr0-sw1
ovn-nbctl ls-add public
ovn-nbctl lsp-add public ln-public
ovn-nbctl lsp-set-type ln-public localnet
ovn-nbctl lsp-set-addresses ln-public unknown
ovn-nbctl lsp-set-options ln-public network_name=public
ovn-nbctl lrp-add lr0 lr0-public 00:11:22:00:ff:01 172.20.0.100/24
ovn-nbctl lsp-add public public-lr0
ovn-nbctl lsp-set-type public-lr0 router
ovn-nbctl lsp-set-addresses public-lr0 router
ovn-nbctl lsp-set-options public-lr0 router-port=lr0-public
ovn-nbctl lrp-set-gateway-chassis lr0-public hv1 10
ovn-nbctl lr-route-add lr0 0.0.0.0/0 172.20.0.1
ovn-nbctl lr-nat-add lr0 snat 172.20.0.100 10.0.0.0/24
ovn-nbctl lr-nat-add lr0 snat 172.20.0.100 20.0.0.0/24
ovn-nbctl acl-add sw0 from-lport 1002 'ip4 || ip6' allow-related
ovn-nbctl acl-add sw1 from-lport 1002 'ip4 || ip6' allow-related
ovs-vsctl add-br br-ex
ovs-vsctl set open . external-ids:ovn-bridge-mappings=public:br-ex
ip link add sw0p1_v type veth peer name sw0p1_vp
ovs-vsctl add-port br-int sw0p1_vp
ovs-vsctl set interface sw0p1_vp external_ids:iface-id=sw0-port1
ip link set sw0p1_vp up
ip netns add sw0p1
ip link set sw0p1_v netns sw0p1
ip netns exec sw0p1 ip link set sw0p1_v address 50:54:00:00:00:03
ip netns exec sw0p1 ip link set sw0p1_v up
ip netns exec sw0p1 ip addr add 10.0.0.3/24 dev sw0p1_v
ip netns exec sw0p1 ip route add default via 10.0.0.1
ip netns exec sw0p1 ip addr add 1000::3/64 dev sw0p1_v
ip netns exec sw0p1 ip -6 route add default via 1000::a

2. Start ovn-controller on the client:

systemctl start openvswitch
ovs-vsctl set open . external_ids:system-id=hv0 external_ids:ovn-remote=tcp:1.1.207.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=1.1.207.26
systemctl restart ovn-controller
ovs-vsctl add-br br-ex
ovs-vsctl set open . external-ids:ovn-bridge-mappings=public:br-ex
ovs-vsctl add-port br-int sw0p2 -- set interface sw0p2 type=internal external_ids:iface-id=sw0-port2
ip netns add sw0p2
ip link set sw0p2 netns sw0p2
ip netns exec sw0p2 ip link set sw0p2 address 50:54:00:00:00:04
ip netns exec sw0p2 ip link set sw0p2 up
ip netns exec sw0p2 ip addr add 10.0.0.4/24 dev sw0p2
ip netns exec sw0p2 ip route add default via 10.0.0.1
ip netns exec sw0p2 ip addr add 1000::4/64 dev sw0p2
ip netns exec sw0p2 ip -6 route add default via 1000::a
ovs-vsctl add-port br-int sw1p1 -- set interface sw1p1 type=internal external_ids:iface-id=sw1-port1
ip netns add sw1p1
ip link set sw1p1 netns sw1p1
ip netns exec sw1p1 ip link set sw1p1 address 40:54:00:00:00:03
ip netns exec sw1p1 ip link set sw1p1 up
ip netns exec sw1p1 ip addr add 20.0.0.3/24 dev sw1p1
ip netns exec sw1p1 ip route add default via 20.0.0.1
ip netns exec sw1p1 ip addr add 2000::3/64 dev sw1p1
ip netns exec sw1p1 ip -6 route add default via 2000::a

3. Change the MTU on the route used by geneve on the server:

ip route change 1.1.207.0/24 dev ens1f0np0 mtu 1000

4. Run ping in sw0p1:

ip netns exec sw0p1 ping 20.0.0.3 -c 2 -s 1100 -M do

Result on ovn23.09-23.09.0-103.el9:

[root@wsfd-advnetlab18 bz2241711]# ip netns exec sw0p1 ping 20.0.0.3 -c 2 -s 1100 -M do
PING 20.0.0.3 (20.0.0.3) 1100(1128) bytes of data.

--- 20.0.0.3 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1032ms

[root@wsfd-advnetlab18 bz2241711]# rpm -qa | grep -E "ovn|openvswitch3.2"
openvswitch3.2-3.2.0-39.el9fdp.x86_64
ovn23.09-23.09.0-103.el9fdp.x86_64
ovn23.09-central-23.09.0-103.el9fdp.x86_64
ovn23.09-host-23.09.0-103.el9fdp.x86_64

Result on ovn23.09-23.09.0-105.el9:

[root@wsfd-advnetlab18 bz2241711]# ip netns exec sw0p1 ping 20.0.0.3 -c 2 -s 1100 -M do
PING 20.0.0.3 (20.0.0.3) 1100(1128) bytes of data.
From 20.0.0.3 icmp_seq=2 Frag needed and DF set (mtu = 942)

--- 20.0.0.3 ping statistics ---
2 packets transmitted, 0 received, +1 errors, 100% packet loss, time 1004ms

[root@wsfd-advnetlab18 bz2241711]# ip netns exec sw0p1 ip route get 20.0.0.3
20.0.0.3 via 10.0.0.1 dev sw0p1_v src 10.0.0.3 uid 0
    cache expires 582sec mtu 942
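As a sanity check, the mtu = 942 reported to the pinger is consistent with the underlay route MTU of 1000 minus the geneve encapsulation overhead. The header sizes below are my assumption of a typical geneve encapsulation (IPv4 underlay, one 8-byte geneve option), not something read from a capture:

```python
# Assumed geneve encapsulation overhead: outer IPv4 (20) + outer UDP (8)
# + geneve base header (8) + one 8-byte geneve option + inner Ethernet (14).
GENEVE_OVERHEAD = 20 + 8 + 8 + 8 + 14  # = 58 bytes

route_mtu = 1000                        # ip route change ... mtu 1000
inner_mtu = route_mtu - GENEVE_OVERHEAD
print(inner_mtu)                        # 942, matching "Frag needed and DF set (mtu = 942)"

# Largest ICMP echo payload that still fits through the tunnel:
# inner MTU minus the inner IPv4 header (20) and ICMP header (8).
max_payload = inner_mtu - 20 - 8
print(max_payload)                      # 914, so "ping -s 1100 -M do" has to fail
```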
sw0p1 doesn't get a PMTU route exception when it pings 10.0.0.4 in the same subnet:

[root@wsfd-advnetlab18 bz2241711]# ip netns exec sw0p1 ping 10.0.0.4 -c 3 -s 1100 -M do
PING 10.0.0.4 (10.0.0.4) 1100(1128) bytes of data.

--- 10.0.0.4 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2038ms

[root@wsfd-advnetlab18 bz2241711]# ip netns exec sw0p1 ping 20.0.0.3 -c 3 -s 1100 -M do
PING 20.0.0.3 (20.0.0.3) 1100(1128) bytes of data.
From 20.0.0.3 icmp_seq=2 Frag needed and DF set (mtu = 942)
ping: local error: message too long, mtu=942

--- 20.0.0.3 ping statistics ---
3 packets transmitted, 0 received, +2 errors, 100% packet loss, time 2022ms

[root@wsfd-advnetlab18 bz2241711]# ip netns exec sw0p1 ip route get 20.0.0.3
20.0.0.3 via 10.0.0.1 dev sw0p1_v src 10.0.0.3 uid 0
    cache expires 580sec mtu 942

[root@wsfd-advnetlab18 bz2241711]# ip netns exec sw0p1 ip route get 10.0.0.4
10.0.0.4 dev sw0p1_v src 10.0.0.3 uid 0
    cache

[root@wsfd-advnetlab18 bz2241711]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch-selinux-extra-policy-1.0-34.el9fdp.noarch
ovn23.09-23.09.0-105.el9fdp.x86_64
ovn23.09-central-23.09.0-105.el9fdp.x86_64
ovn23.09-host-23.09.0-105.el9fdp.x86_64
openvswitch3.2-3.2.0-52.el9fdp.x86_64
python3-openvswitch3.2-3.2.0-52.el9fdp.x86_64

Lorenzo, why is that?
Reported https://issues.redhat.com/browse/FDP-362 to track the issue in comment 12.
I'm closing this issue since FDP-362 is tracking the issue reported as a result of testing. I wanted to mark this as "MIGRATED" or "DUPLICATE", but Bugzilla won't allow me to link to the FDP issue mentioned in comment 13.