Description of problem:

OpenShift has a use case on AWS where some nodes of a cluster, but not all, are deployed on AWS Local Zones. As a result, these nodes reside on a different network than the other nodes of the cluster. While both networks and all nodes of the cluster could be configured with a high MTU value (say 9001), the paths between those networks contain segments using a lower MTU value (say 1300), forcing the cluster to be configured and function with that suboptimal lower MTU value. Ideally, communication within each network and to other external networks could use the higher MTU value.

While PMTUD should work in such a scenario for intra-cluster traffic, there are issues when geneve traffic is involved. When observing such a cluster configured with the higher MTU value and inspecting geneve traffic, we can see constant ICMP NEEDS FRAG replies as a result of the geneve traffic traversing the lower MTU segment:

sh-4.4# tcpdump -i br-ex -vveenn icmp
dropped privs to tcpdump
tcpdump: listening on br-ex, link-type EN10MB (Ethernet), capture size 262144 bytes
15:49:33.942921 16:e5:eb:e1:a6:37 > 16:69:d7:61:83:63, ethertype IPv4 (0x0800), length 70: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto ICMP (1), length 56)
    10.0.192.1 > 10.0.193.203: ICMP 10.0.22.231 unreachable - need to frag (mtu 1300), length 36
        (tos 0x0, ttl 64, id 38422, offset 0, flags [DF], proto UDP (17), length 2747)
    10.0.193.203.47768 > 10.0.22.231.6081: Geneve [|geneve]
15:49:35.430295 16:e5:eb:e1:a6:37 > 16:69:d7:61:83:63, ethertype IPv4 (0x0800), length 70: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto ICMP (1), length 56)
    10.0.192.1 > 10.0.193.203: ICMP 10.0.8.2 unreachable - need to frag (mtu 1300), length 36
        (tos 0x0, ttl 64, id 54341, offset 0, flags [DF], proto UDP (17), length 2747)
    10.0.193.203.39672 > 10.0.8.2.6081: Geneve [|geneve]

These ICMP NEEDS FRAG replies are not observed as inner traffic, as expected:

sh-4.4# tcpdump -i genev_sys_6081 -vveenn icmp
dropped privs to tcpdump
tcpdump: listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel

But the route exception due to PMTUD does not seem to be happening, which is unexpected:

sh-4.4# ip r get 10.0.8.2
10.0.8.2 via 10.0.192.1 dev br-ex src 10.0.193.203 uid 0
    cache

If we trigger the PMTUD route exception, using tracepath for example:

sh-4.4# tracepath -m 1 -n 10.0.8.2
 1?: [LOCALHOST]                      pmtu 9001
 1:  10.0.192.1                       0.260ms pmtu 1300
 1:  no reply
     Too many hops: pmtu 1300
     Resume: pmtu 1300
sh-4.4# ip r get 10.0.8.2
10.0.8.2 via 10.0.192.1 dev br-ex src 10.0.193.203 uid 0
    cache expires 507sec mtu 1300

then we no longer see ICMP NEEDS FRAG replies to the geneve traffic towards that peer (10.0.8.2):

sh-4.4# tcpdump -i br-ex -vveenn icmp
dropped privs to tcpdump
tcpdump: listening on br-ex, link-type EN10MB (Ethernet), capture size 262144 bytes
...
15:53:39.062932 16:e5:eb:e1:a6:37 > 16:69:d7:61:83:63, ethertype IPv4 (0x0800), length 70: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto ICMP (1), length 56)
    10.0.192.1 > 10.0.193.203: ICMP 10.0.22.231 unreachable - need to frag (mtu 1300), length 36
        (tos 0x0, ttl 64, id 23763, offset 0, flags [DF], proto UDP (17), length 2747)
    10.0.193.203.11297 > 10.0.22.231.6081: Geneve [|geneve]
15:53:42.244009 16:e5:eb:e1:a6:37 > 16:69:d7:61:83:63, ethertype IPv4 (0x0800), length 70: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto ICMP (1), length 56)
    10.0.192.1 > 10.0.193.203: ICMP 10.0.22.231 unreachable - need to frag (mtu 1300), length 36
        (tos 0x0, ttl 64, id 23960, offset 0, flags [DF], proto UDP (17), length 2747)
    10.0.193.203.17650 > 10.0.22.231.6081: Geneve [|geneve]
...
But now we start to see ICMP NEEDS FRAG replies to the inner traffic that would be encapsulated and sent to that peer:

sh-4.4# tcpdump -i genev_sys_6081 -vveenn icmp
dropped privs to tcpdump
tcpdump: listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
16:00:05.470810 0a:58:a8:fe:00:07 > 0a:58:a8:fe:00:08, ethertype IPv4 (0x0800), length 590: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto ICMP (1), length 576)
    10.129.2.16 > 10.130.2.3: ICMP 10.129.2.16 unreachable - need to frag (mtu 1242), length 556 (wrong icmp cksum 832f (->9532)!)
        (tos 0x0, ttl 63, id 35816, offset 0, flags [DF], proto TCP (6), length 2689)
    10.130.2.3.8443 > 10.129.2.16.55720: Flags [P.], seq 4026138638:4026141275, ack 1994649385, win 495, options [nop,nop,TS val 1293625627 ecr 1191795028], length 2637
16:00:05.686909 0a:58:a8:fe:00:07 > 0a:58:a8:fe:00:08, ethertype IPv4 (0x0800), length 590: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto ICMP (1), length 576)
    10.129.2.16 > 10.130.2.3: ICMP 10.129.2.16 unreachable - need to frag (mtu 1242), length 556 (wrong icmp cksum 8257 (->945a)!)
        (tos 0x0, ttl 63, id 35817, offset 0, flags [DF], proto TCP (6), length 2689)

Assuming a working PMTUD towards the geneve peers: when the geneve kernel driver is about to encapsulate a packet and send it out through the geneve tunnel, it checks the PMTU towards the tunnel peer. If the packet plus the encapsulation overhead would exceed this PMTU, it drops the packet, fabricates an ICMP NEEDS FRAG packet and sends that back to the transmitter.

Presumably, this ICMP packet reaches the OVS br-int bridge through the geneve OF port. While this mechanism might work on simple OVS bridge implementations with standard switching via a NORMAL flow, it is likely that the more complex OVN pipeline relies on the geneve VNI/TLV options metadata to know what to do with it; if that metadata is missing or interpreted incorrectly, the packet may be dropped, preventing it from reaching the original transmitter.
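For reference, the mtu 1242 reported in the inner NEEDS FRAG replies is consistent with the 1300-byte path MTU minus the geneve encapsulation overhead. A small sketch of that arithmetic; the per-header sizes below are my assumption of the encapsulation in use (IPv4 underlay, one 8-byte geneve option), not something taken from the captures:

```python
# Assumed per-header sizes for the geneve encapsulation (not confirmed
# against the captures; adjust if the deployment differs).
OUTER_IPV4 = 20   # outer IPv4 header
OUTER_UDP = 8     # outer UDP header (dst port 6081)
GENEVE_BASE = 8   # fixed geneve header
GENEVE_OPT = 8    # metadata carried as a geneve TLV option
INNER_ETH = 14    # inner Ethernet header

OVERHEAD = OUTER_IPV4 + OUTER_UDP + GENEVE_BASE + GENEVE_OPT + INNER_ETH

def inner_pmtu(outer_pmtu: int) -> int:
    """PMTU the geneve driver should report back to inner senders."""
    return outer_pmtu - OVERHEAD

print(OVERHEAD)          # 58
print(inner_pmtu(1300))  # 1242, matching the mtu in the inner NEEDS FRAG replies
```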
I say "presumably" because I have yet to find a way to trace for certain where those ICMP NEEDS FRAG replies are being dropped. But I know that a pod does not become aware of the proper PMTU:

❯ kubectl exec -ti nettools -- tracepath -n 10.129.2.6
 1?: [LOCALHOST]                      pmtu 8000
 1:  10.129.2.6                       0.932ms asymm  2
 1:  10.129.2.6                       0.445ms asymm  2
 2:  no reply
...
30:  no reply
     Too many hops: pmtu 8000
     Resume: pmtu 8000

even though the ICMP NEEDS FRAG reply does happen (this required manually triggering PMTUD on the involved nodes as mentioned earlier):

sh-4.4# tcpdump -i genev_sys_6081 -eennvv icmp
dropped privs to tcpdump
tcpdump: listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
...
18:00:29.800259 0a:58:a8:fe:00:07 > 0a:58:a8:fe:00:08, ethertype IPv4 (0x0800), length 590: (tos 0x0, ttl 4, id 0, offset 0, flags [DF], proto ICMP (1), length 576)
    10.129.2.6 > 10.130.2.5: ICMP 10.129.2.6 unreachable - need to frag (mtu 1242), length 556
        (tos 0x0, ttl 4, id 0, offset 0, flags [DF], proto UDP (17), length 8000)
    10.130.2.5.44754 > 10.129.2.6.44456: UDP, length 7972
...

So there are three aspects I would like to look at:
(1) Why the ICMP NEEDS FRAG replies to the geneve traffic are not triggering the route exception due to PMTUD.
(2) How I can know for certain whether those ICMP NEEDS FRAG replies are being dropped in the OVS pipeline.
(3) If they are, whether it would be possible to do something here. Thinking about some options:
- OVN could restore usable geneve metadata for that ICMP packet from conntrack.
- Have the geneve driver include the geneve metadata if it is not doing so, or change OVN so that it interprets this metadata correctly if not doing so, or both.

A couple of links to relevant bits of the geneve kernel driver implementation:
https://github.com/torvalds/linux/blob/9ed22ae6be817d7a3f5c15ca22cbc9d3963b481d/drivers/net/geneve.c#L923C18-L923C18
https://github.com/torvalds/linux/blob/9ed22ae6be817d7a3f5c15ca22cbc9d3963b481d/net/ipv4/ip_tunnel_core.c#L422
upstream patch: https://patchwork.ozlabs.org/project/ovn/patch/9d44c99689fe17899ef9228c7149379929af3e80.1701167801.git.lorenzo.bianconi@redhat.com/
Hi Lorenzo,
what is the status for this issue? From the changelog, it seems that the patch is reverted:

* Mon Dec 18 2023 Numan Siddique <numans> - 23.06.1-73
- Revert "ovn: add geneve PMTUD support"
  [Upstream: bbeec7987576b3fe43dd15b080307ee9ae7333ed]
(In reply to Jianlin Shi from comment #6)
> Hi Lorenzo,
> what is the status for this issue? From the changelog, it seems that the
> patch is reverted:
> * Mon Dec 18 2023 Numan Siddique <numans> - 23.06.1-73
> - Revert "ovn: add geneve PMTUD support"
>   [Upstream: bbeec7987576b3fe43dd15b080307ee9ae7333ed]

The new fix has been applied last week upstream:
https://github.com/ovn-org/ovn/commit/221476a01f2670cf4eb78cd9353e709cb8a16329
(In reply to lorenzo bianconi from comment #7)
> (In reply to Jianlin Shi from comment #6)
> > Hi Lorenzo,
> > what is the status for this issue? From the changelog, it seems that the
> > patch is reverted:
> > * Mon Dec 18 2023 Numan Siddique <numans> - 23.06.1-73
> > - Revert "ovn: add geneve PMTUD support"
> >   [Upstream: bbeec7987576b3fe43dd15b080307ee9ae7333ed]
>
> The new fix has been applied last week upstream:
> https://github.com/ovn-org/ovn/commit/221476a01f2670cf4eb78cd9353e709cb8a16329

Is it backported to downstream? If yes, in which version? As the bug is in ON_QA status, we need to find the right version to test.
The version in the errata is ovn23.06.1-85. Setting the bug back to ASSIGNED per comment 8.
Tested with the following steps:

1. Start OVN on the server:

systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:1.1.207.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=1.1.207.25
systemctl restart ovn-controller
ovn-nbctl ls-add sw0
ovn-nbctl lsp-add sw0 sw0-port1
ovn-nbctl lsp-set-addresses sw0-port1 "50:54:00:00:00:03 10.0.0.3 1000::3"
ovn-nbctl lsp-add sw0 sw0-port2
ovn-nbctl lsp-set-addresses sw0-port2 "50:54:00:00:00:04 10.0.0.4 1000::4"
ovn-nbctl ls-add sw1
ovn-nbctl lsp-add sw1 sw1-port1
ovn-nbctl lsp-set-addresses sw1-port1 "40:54:00:00:00:03 20.0.0.3 2000::3"
ovn-nbctl lr-add lr0
ovn-nbctl lrp-add lr0 lr0-sw0 00:00:00:00:ff:01 10.0.0.1/24 1000::a/64
ovn-nbctl lsp-add sw0 sw0-lr0
ovn-nbctl lsp-set-type sw0-lr0 router
ovn-nbctl lsp-set-addresses sw0-lr0 router
ovn-nbctl lsp-set-options sw0-lr0 router-port=lr0-sw0
ovn-nbctl lrp-add lr0 lr0-sw1 00:00:00:00:ff:02 20.0.0.1/24 2000::a/64
ovn-nbctl lsp-add sw1 sw1-lr0
ovn-nbctl lsp-set-type sw1-lr0 router
ovn-nbctl lsp-set-addresses sw1-lr0 router
ovn-nbctl lsp-set-options sw1-lr0 router-port=lr0-sw1
ovn-nbctl ls-add public
ovn-nbctl lsp-add public ln-public
ovn-nbctl lsp-set-type ln-public localnet
ovn-nbctl lsp-set-addresses ln-public unknown
ovn-nbctl lsp-set-options ln-public network_name=public
ovn-nbctl lrp-add lr0 lr0-public 00:11:22:00:ff:01 172.20.0.100/24
ovn-nbctl lsp-add public public-lr0
ovn-nbctl lsp-set-type public-lr0 router
ovn-nbctl lsp-set-addresses public-lr0 router
ovn-nbctl lsp-set-options public-lr0 router-port=lr0-public
ovn-nbctl lrp-set-gateway-chassis lr0-public hv1 10
ovn-nbctl lr-route-add lr0 0.0.0.0/0 172.20.0.1
ovn-nbctl lr-nat-add lr0 snat 172.20.0.100 10.0.0.0/24
ovn-nbctl lr-nat-add lr0 snat 172.20.0.100 20.0.0.0/24
ovn-nbctl acl-add sw0 from-lport 1002 'ip4 || ip6' allow-related
ovn-nbctl acl-add sw1 from-lport 1002 'ip4 || ip6' allow-related
ovs-vsctl add-br br-ex
ovs-vsctl set open . external-ids:ovn-bridge-mappings=public:br-ex
ip link add sw0p1_v type veth peer name sw0p1_vp
ovs-vsctl add-port br-int sw0p1_vp
ovs-vsctl set interface sw0p1_vp external_ids:iface-id=sw0-port1
ip link set sw0p1_vp up
ip netns add sw0p1
ip link set sw0p1_v netns sw0p1
ip netns exec sw0p1 ip link set sw0p1_v address 50:54:00:00:00:03
ip netns exec sw0p1 ip link set sw0p1_v up
ip netns exec sw0p1 ip addr add 10.0.0.3/24 dev sw0p1_v
ip netns exec sw0p1 ip route add default via 10.0.0.1
ip netns exec sw0p1 ip addr add 1000::3/64 dev sw0p1_v
ip netns exec sw0p1 ip -6 route add default via 1000::a

2. Start ovn-controller on the client:

systemctl start openvswitch
ovs-vsctl set open . external_ids:system-id=hv0 external_ids:ovn-remote=tcp:1.1.207.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=1.1.207.26
systemctl restart ovn-controller
ovs-vsctl add-br br-ex
ovs-vsctl set open . external-ids:ovn-bridge-mappings=public:br-ex
ovs-vsctl add-port br-int sw0p2 -- set interface sw0p2 type=internal external_ids:iface-id=sw0-port2
ip netns add sw0p2
ip link set sw0p2 netns sw0p2
ip netns exec sw0p2 ip link set sw0p2 address 50:54:00:00:00:04
ip netns exec sw0p2 ip link set sw0p2 up
ip netns exec sw0p2 ip addr add 10.0.0.4/24 dev sw0p2
ip netns exec sw0p2 ip route add default via 10.0.0.1
ip netns exec sw0p2 ip addr add 1000::4/64 dev sw0p2
ip netns exec sw0p2 ip -6 route add default via 1000::a
ovs-vsctl add-port br-int sw1p1 -- set interface sw1p1 type=internal external_ids:iface-id=sw1-port1
ip netns add sw1p1
ip link set sw1p1 netns sw1p1
ip netns exec sw1p1 ip link set sw1p1 address 40:54:00:00:00:03
ip netns exec sw1p1 ip link set sw1p1 up
ip netns exec sw1p1 ip addr add 20.0.0.3/24 dev sw1p1
ip netns exec sw1p1 ip route add default via 20.0.0.1
ip netns exec sw1p1 ip addr add 2000::3/64 dev sw1p1
ip netns exec sw1p1 ip -6 route add default via 2000::a

3. Change the MTU on the route used by geneve on the server:

ip route change 1.1.207.0/24 dev ens1f0np0 mtu 1000

4. Run ping in sw0p1:

ip netns exec sw0p1 ping 20.0.0.3 -c 2 -s 1100 -M do

Result on ovn23.09-23.09.0-103.el9:

[root@wsfd-advnetlab18 bz2241711]# ip netns exec sw0p1 ping 20.0.0.3 -c 2 -s 1100 -M do
PING 20.0.0.3 (20.0.0.3) 1100(1128) bytes of data.

--- 20.0.0.3 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1032ms

[root@wsfd-advnetlab18 bz2241711]# rpm -qa | grep -E "ovn|openvswitch3.2"
openvswitch3.2-3.2.0-39.el9fdp.x86_64
ovn23.09-23.09.0-103.el9fdp.x86_64
ovn23.09-central-23.09.0-103.el9fdp.x86_64
ovn23.09-host-23.09.0-103.el9fdp.x86_64

Result on ovn23.09-23.09.0-105.el9:

[root@wsfd-advnetlab18 bz2241711]# ip netns exec sw0p1 ping 20.0.0.3 -c 2 -s 1100 -M do
PING 20.0.0.3 (20.0.0.3) 1100(1128) bytes of data.
From 20.0.0.3 icmp_seq=2 Frag needed and DF set (mtu = 942)

--- 20.0.0.3 ping statistics ---
2 packets transmitted, 0 received, +1 errors, 100% packet loss, time 1004ms

[root@wsfd-advnetlab18 bz2241711]# ip netns exec sw0p1 ip route get 20.0.0.3
20.0.0.3 via 10.0.0.1 dev sw0p1_v src 10.0.0.3 uid 0
    cache expires 582sec mtu 942
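As a sanity check, the mtu = 942 reported to the pinger is consistent with the underlay route MTU of 1000 minus the geneve encapsulation overhead. The header sizes below are my assumption of a typical geneve encapsulation (IPv4 underlay, one 8-byte geneve option), not something read from a capture:

```python
# Assumed geneve encapsulation overhead: outer IPv4 (20) + outer UDP (8)
# + geneve base header (8) + one 8-byte geneve option + inner Ethernet (14).
GENEVE_OVERHEAD = 20 + 8 + 8 + 8 + 14  # = 58 bytes

route_mtu = 1000                        # ip route change ... mtu 1000
inner_mtu = route_mtu - GENEVE_OVERHEAD
print(inner_mtu)                        # 942, matching "Frag needed and DF set (mtu = 942)"

# Largest ICMP echo payload that still fits through the tunnel:
# inner MTU minus the inner IPv4 header (20) and ICMP header (8).
max_payload = inner_mtu - 20 - 8
print(max_payload)                      # 914, so "ping -s 1100 -M do" has to fail
```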
sw0p1 doesn't get a PMTU route exception when it pings 10.0.0.4 in the same subnet:

[root@wsfd-advnetlab18 bz2241711]# ip netns exec sw0p1 ping 10.0.0.4 -c 3 -s 1100 -M do
PING 10.0.0.4 (10.0.0.4) 1100(1128) bytes of data.

--- 10.0.0.4 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2038ms

[root@wsfd-advnetlab18 bz2241711]# ip netns exec sw0p1 ping 20.0.0.3 -c 3 -s 1100 -M do
PING 20.0.0.3 (20.0.0.3) 1100(1128) bytes of data.
From 20.0.0.3 icmp_seq=2 Frag needed and DF set (mtu = 942)
ping: local error: message too long, mtu=942

--- 20.0.0.3 ping statistics ---
3 packets transmitted, 0 received, +2 errors, 100% packet loss, time 2022ms

[root@wsfd-advnetlab18 bz2241711]# ip netns exec sw0p1 ip route get 20.0.0.3
20.0.0.3 via 10.0.0.1 dev sw0p1_v src 10.0.0.3 uid 0
    cache expires 580sec mtu 942

[root@wsfd-advnetlab18 bz2241711]# ip netns exec sw0p1 ip route get 10.0.0.4
10.0.0.4 dev sw0p1_v src 10.0.0.3 uid 0
    cache

[root@wsfd-advnetlab18 bz2241711]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch-selinux-extra-policy-1.0-34.el9fdp.noarch
ovn23.09-23.09.0-105.el9fdp.x86_64
ovn23.09-central-23.09.0-105.el9fdp.x86_64
ovn23.09-host-23.09.0-105.el9fdp.x86_64
openvswitch3.2-3.2.0-52.el9fdp.x86_64
python3-openvswitch3.2-3.2.0-52.el9fdp.x86_64

Lorenzo, why is that?
Reported https://issues.redhat.com/browse/FDP-362 to track the issue in comment 12.
I'm closing this issue since FDP-362 is tracking the issue reported as a result of testing. I wanted to mark this as "MIGRATED" or "DUPLICATE", but Bugzilla won't allow me to link to the FDP issue mentioned in comment 13.