+++ This bug was initially created as a clone of Bug #2169839 +++

OVS version:
[root@ovn-worker ~]# rpm -qa | grep openvswitch
openvswitch-2.17.0-3.fc36.x86_64

Description of problem:
Recently OVN added the ability to handle sending ICMP needs frag. In my case I am triggering this behavior with a UDP packet, and ovn-controller gets the packet_in and responds with an ICMP needs frag; however, somehow the ICMP response ends up being packet_in'ed again to ovn-controller:

2023-02-14T17:29:49.506Z|00004|pinctrl(ovn_pinctrl0)|DBG|pinctrl received packet-in | opcode=ICMP4_ERROR| OF_Table_ID=27| OF_Cookie_ID=0xde8a2ac| in-port=3| src-mac=0a:58:64:40:00:02, dst-mac=0a:58:64:40:00:01| src-ip=169.254.169.2, dst-ip=10.244.0.6
2023-02-14T17:29:49.506Z|00005|pinctrl(ovn_pinctrl0)|DBG|pinctrl received packet-in | opcode=PUT_ICMP4_FRAG_MTU| OF_Table_ID=0| OF_Cookie_ID=0xffffffffffffffff| in-port=65533| src-mac=0a:58:64:40:00:02, dst-mac=02:42:ac:12:00:02| src-ip=172.18.0.2, dst-ip=169.254.169.2

Version-Release number of selected component (if applicable):
[root@ovn-worker ~]# rpm -qa | grep ovn
ovn-22.12.90-1.fc37.x86_64
ovn-central-22.12.90-1.fc37.x86_64
ovn-host-22.12.90-1.fc37.x86_64
ovn-vtep-22.12.90-1.fc37.x86_64

How reproducible:
Every time

Steps to Reproduce:

1. Launch kind in ovn-kubernetes (./kind.sh -wk 1).

2. Create a pod that will serve as an endpoint for a service:

cat ~/basic.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: trozet3
  labels:
    pod-name: client
    #role: webserver
    #app: spk-coredns
spec:
  containers:
  - name: pod1
    image: python
    #command:
    #  - /sbin/init
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 3000000; done;" ]
    imagePullPolicy: IfNotPresent
    ports:
    - name: dns-tcp
      containerPort: 53
      protocol: TCP
    - name: dns-udp
      containerPort: 9999
      protocol: UDP

3. Create a service:

[trozet@fedora ~]$ cat ~/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service1
  #namespace: exgw2
spec:
  selector:
    pod-name: client
  ports:
    - protocol: UDP
      port: 1337
      targetPort: 1337
      name: blah
    - protocol: TCP
      port: 80
      targetPort: 80
      name: blah2
  externalIPs:
    - 172.18.0.99
  type: NodePort
  externalTrafficPolicy: Local

4. The pod should be placed on the ovn-worker node. Exec into the pod and run a UDP server (change localIP to match your pod IP, or use 0.0.0.0 to bind to all addresses):

import socket

localIP = "10.244.0.6"
localPort = 1337
bufferSize = 1024

msgFromServer = "Hello UDP Client"
bytesToSend = str.encode(msgFromServer)

# Create a datagram socket
UDPServerSocket = socket.socket(family=socket.AF_INET, type=socket.SOCK_DGRAM)

# Bind to address and ip
UDPServerSocket.bind((localIP, localPort))

print("UDP server up and listening")

# Listen for incoming datagrams
while(True):
    bytesAddressPair = UDPServerSocket.recvfrom(bufferSize)

    message = bytesAddressPair[0]
    address = bytesAddressPair[1]

    clientMsg = "Message from Client:{}".format(message)
    clientIP = "Client IP Address:{}".format(address)

    print(clientMsg)
    print(clientIP)

    # Sending a reply to client
    UDPServerSocket.sendto(bytesToSend, address)

5. Log in to the ovn-worker node and modify these 2 routes so that they no longer have MTU 1400:

169.254.169.1 dev breth0 src 172.18.0.2
10.96.0.0/16 via 169.254.169.4 dev breth0
6. Login to the ovn-worker node and run a python client (modify service IP):

import socket
import sys

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

send_data = "echo n" + "o" * 1400
send_work = "echo blah"
service_ip = "10.96.165.251"
service_port = 1337

s.sendto(send_work.encode('utf-8'), (service_ip, service_port))
### ^this will work and send a small packet
data, address = s.recvfrom(4096)
s.sendto(send_data.encode('utf-8'), (service_ip, service_port))
### ^this will send the large packet which will trigger ICMP needs frag

Actual results:
In tcpdump I only see the single large packet sent and no ICMP needs frag:

[root@ovn-worker ~]# tcpdump -i any port 1337 or icmp -nneev
tcpdump: data link type LINUX_SLL2
dropped privs to tcpdump
tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
17:27:35.929295 breth0 Out ifindex 6 02:42:ac:12:00:02 ethertype IPv4 (0x0800), length 57: (tos 0x0, ttl 64, id 20241, offset 0, flags [DF], proto UDP (17), length 37)
    169.254.169.2.34358 > 10.96.165.251.1337: UDP, length 9
17:27:35.929980 171f3e0078841b7 Out ifindex 10 0a:58:0a:f4:00:01 ethertype IPv4 (0x0800), length 57: (tos 0x0, ttl 62, id 20241, offset 0, flags [DF], proto UDP (17), length 37)
    100.64.0.2.34358 > 10.244.0.6.1337: UDP, length 9
17:27:35.930179 171f3e0078841b7 P ifindex 10 0a:58:0a:f4:00:06 ethertype IPv4 (0x0800), length 64: (tos 0x0, ttl 64, id 6994, offset 0, flags [DF], proto UDP (17), length 44)
    10.244.0.6.1337 > 100.64.0.2.34358: UDP, length 16
17:27:35.930524 breth0 In ifindex 6 0a:58:a9:fe:a9:04 ethertype IPv4 (0x0800), length 64: (tos 0x0, ttl 62, id 6994, offset 0, flags [DF], proto UDP (17), length 44)
    10.96.165.251.1337 > 169.254.169.2.34358: UDP, length 16

^ Small packet works fine

17:28:21.861554 breth0 Out ifindex 6 02:42:ac:12:00:02 ethertype IPv4 (0x0800), length 1454: (tos 0x0, ttl 64, id 21977, offset 0, flags [DF], proto UDP (17), length 1434)
    169.254.169.2.34358 > 10.96.165.251.1337: UDP, length 1406

^ nothing after sending large packet

Expected results:
Should get an ICMP needs frag back to my host.
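For reference, one way to check from the client side whether the expected ICMP needs frag actually arrives (this is only a sketch and not part of the original reproduction steps): with a connected UDP socket and path MTU discovery forced on, a second oversized send should fail with EMSGSIZE once the kernel has cached the smaller path MTU. The service IP is the one used above; the numeric socket option values are the Linux ones, used only as a fallback when the Python build does not expose the constants.

import errno
import socket
import time

# Linux values, used as a fallback if this Python build does not expose them.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)

service_ip = "10.96.165.251"   # adjust to match your service
service_port = 1337

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.connect((service_ip, service_port))
# Set DF and let the kernel track the path MTU for this destination.
s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)

payload = ("echo n" + "o" * 1400).encode('utf-8')
s.send(payload)                # goes out with DF set; should trigger needs frag
time.sleep(1)                  # give the ICMP error time to arrive

try:
    s.send(payload)            # should now fail if the needs frag was delivered
except OSError as e:
    if e.errno == errno.EMSGSIZE:
        print("path MTU learned; ICMP needs frag was received")
    else:
        raise
else:
    print("no EMSGSIZE; the ICMP needs frag likely never made it back")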
--- Additional comment from Tim Rozet on 2023-02-14 20:54:21 UTC ---

After some live debugging with Dumitru, we noticed that the MACs do not look correct in the output of:

2023-02-14T17:29:49.506Z|00004|pinctrl(ovn_pinctrl0)|DBG|pinctrl received packet-in | opcode=ICMP4_ERROR| OF_Table_ID=27| OF_Cookie_ID=0xde8a2ac| in-port=3| src-mac=0a:58:64:40:00:02, dst-mac=0a:58:64:40:00:01| src-ip=169.254.169.2, dst-ip=10.244.0.6

Dumitru found that this is because this log actually prints the packet after it has been modified to be sent back out, rather than the original packet_in. This should also be fixed.

Additionally, the MAC 0a:58:64:40:00:02 belongs to ovn_cluster_router (the next hop after routing in the GR). Looking at an ovn-trace:

12. lr_in_ip_routing_pre (northd.c:11927): 1, priority 0, uuid d6dd449c
    reg7 = 0;
    next;
13. lr_in_ip_routing (northd.c:10317): reg7 == 0 && ip4.dst == 10.244.0.0/16, priority 49, uuid 049e28ea
    ip.ttl--;
    reg8[0..15] = 0;
    reg0 = 100.64.0.1;
    reg1 = 100.64.0.2;
    eth.src = 0a:58:64:40:00:02;
    outport = "rtoj-GR_ovn-worker";
    flags.loopback = 1;
    next;
14. lr_in_ip_routing_ecmp (northd.c:12005): reg8[0..15] == 0, priority 150, uuid b1ae4386
    next;
15. lr_in_policy (northd.c:12172): 1, priority 0, uuid 41142d7b
    reg8[0..15] = 0;
    next;
16. lr_in_policy_ecmp (northd.c:12174): reg8[0..15] == 0, priority 150, uuid 7e839f56
    next;
17. lr_in_arp_resolve (northd.c:12209): ip4, priority 1, uuid ad2b17bc
    get_arp(outport, reg0);
    /* MAC binding to 0a:58:64:40:00:01. */
    next;
18. lr_in_chk_pkt_len (northd.c:11532): outport == "rtoj-GR_ovn-worker", priority 50, uuid bb1f45da
    reg9[1] = check_pkt_larger(1414);
    next;
21. lr_in_arp_request (northd.c:12908): 1, priority 0, uuid bf4acee0
    output;

Routing and then setting the next-hop MAC occur before checking whether the packet would be too large to forward and punting it to the controller. This seems like another bug: the next-hop MAC should not be set, because this packet will not be forwarded to the destination, and ovn-controller uses this packet to determine the reply MAC.

Finally, Dumitru pointed out that the OVN flow resets the destination MAC to the router port's own MAC, which also does not make much sense:

actions : "icmp4_error {reg9[0] = 1; reg9[1] = 0; eth.dst = 02:42:ac:12:00:02; ip4.dst = ip4.src; ip4.src = 172.18.0.2; ip.ttl = 255; icmp4.type = 3; /* Destination Unreachable. */ icmp4.code = 4; /* Frag Needed and DF was Set. */ icmp4.frag_mtu = 1400; next(pipeline=ingress, table=0); };"

ds_put_format(actions,
              "icmp4_error {"
              REGBIT_EGRESS_LOOPBACK" = 1; "
              REGBIT_PKT_LARGER" = 0; "
              "eth.dst = %s; "
              "ip4.dst = ip4.src; "
              "ip4.src = %s; "
              "ip.ttl = 255; "
              "icmp4.type = 3; /* Destination Unreachable. */ "
              "icmp4.code = 4; /* Frag Needed and DF was Set. */ "
              "icmp4.frag_mtu = %d; "
              "next(pipeline=ingress, table=%d); };",
              op->lrp_networks.ea_s,
              op->lrp_networks.ipv4_addrs[0].addr_s,
              mtu, ovn_stage_get_table(S_ROUTER_IN_ADMISSION));
ovn_lflow_add_with_hint__(lflows, op->od, stage, 150,
                          ds_cstr(match), ds_cstr(actions), NULL,
                          copp_meter_get(COPP_ICMP4_ERR, op->od->nbr->copp,
                                         meter_groups),
                          &op->nbrp->header_);

--- Additional comment from Tim Rozet on 2023-02-14 21:04:00 UTC ---

--- Additional comment from Tim Rozet on 2023-02-15 20:00:37 UTC ---

Simulating a trace with the same packet out:

[root@ovn-worker openvswitch]# ovs-appctl ofproto/trace-packet-out br-int in_port=controller,icmp,vlan_tci=0x0000,dl_src=0a:58:64:40:00:02,dl_dst=0a:58:64:40:00:01,nw_src=169.254.169.2,nw_dst=10.244.0.6,nw_tos=0,nw_ecn=0,nw_ttl=255,nw_frag=no,icmp_type=3,icmp_code=1 'set_field:0x64400001->reg0,set_field:0x64400002->reg1,set_field:0x5390006->reg9,set_field:0x49->reg10,set_field:0x8->reg11,set_field:0x2->reg14,set_field:0x1->reg15,set_field:0x7->metadata,set_field:0x1/0x1->xreg4,set_field:0/0x2->xreg4,set_field:02:42:ac:12:00:02->eth_dst,move:NXM_OF_IP_SRC[]->NXM_OF_IP_DST[],set_field:172.18.0.2->ip_src,set_field:255->nw_ttl,set_field:3->icmp_type,set_field:4->icmp_code,controller(userdata=00.00.00.0d.00.00.00.00.05.78,pause,meter_id=2),resubmit(,8)'

Flow: icmp,in_port=CONTROLLER,vlan_tci=0x0000,dl_src=0a:58:64:40:00:02,dl_dst=0a:58:64:40:00:01,nw_src=169.254.169.2,nw_dst=10.244.0.6,nw_tos=0,nw_ecn=0,nw_ttl=255,nw_frag=no,icmp_type=3,icmp_code=1

bridge("br-int")
----------------
 set_field:0x64400001->reg0
 set_field:0x64400002->reg1
 set_field:0x5390006->reg9
 set_field:0x49->reg10
 set_field:0x8->reg11
 set_field:0x2->reg14
 set_field:0x1->reg15
 set_field:0x7->metadata
 set_field:0x1/0x1->xreg4
 set_field:0/0x2->xreg4
 set_field:02:42:ac:12:00:02->eth_dst
 move:NXM_OF_IP_SRC[]->NXM_OF_IP_DST[]
  -> NXM_OF_IP_DST[] is now 169.254.169.2
 set_field:172.18.0.2->ip_src
 set_field:255->nw_ttl
 set_field:3->icmp_type
 set_field:4->icmp_code
 controller(userdata=00.00.00.0d.00.00.00.00.05.78,pause,meter_id=2)
Final flow: icmp,reg0=0x64400001,reg1=0x64400002,reg9=0x5390005,reg10=0x49,reg11=0x8,reg14=0x2,reg15=0x1,metadata=0x7,in_port=CONTROLLER,vlan_tci=0x0000,dl_src=0a:58:64:40:00:02,dl_dst=02:42:ac:12:00:02,nw_src=172.18.0.2,nw_dst=169.254.169.2,nw_tos=0,nw_ecn=0,nw_ttl=255,nw_frag=no,icmp_type=3,icmp_code=4
Megaflow: recirc_id=0,eth,icmp,in_port=CONTROLLER,dl_dst=0a:58:64:40:00:01,nw_src=169.254.169.2,nw_dst=10.244.0.6,nw_ttl=255,nw_frag=no,icmp_type=0x3/0xff,icmp_code=0x1/0xff
Datapath actions: set(eth(dst=02:42:ac:12:00:02)),set(ipv4(src=172.18.0.2,dst=169.254.169.2)),set(icmp(type=3,code=4))
This flow is handled by the userspace slow path because it:
        - Uses action(s) not supported by datapath.

--- Additional comment from Tim Rozet on 2023-02-17 14:38:21 UTC ---

vconn logs:

2023-02-15T19:43:58.774Z|7013790|vconn|DBG|unix#556308: received: OFPT_PACKET_OUT (OF1.5) (xid=0x1a27): in_port=CONTROLLER actions=set_field:0x64400001->reg0,set_field:0x64400002->reg1,set_field:0x5390006->reg9,set_field:0x49->reg10,set_field:0x8->reg11,set_field:0x2->reg14,set_field:0x1->reg15,set_field:0x7->metadata,set_field:0x1/0x1->xreg4,set_field:0/0x2->xreg4,set_field:02:42:ac:12:00:02->eth_dst,move:NXM_OF_IP_SRC[]->NXM_OF_IP_DST[],set_field:172.18.0.2->ip_src,set_field:255->nw_ttl,set_field:3->icmp_type,set_field:4->icmp_code,controller(userdata=00.00.00.0d.00.00.00.00.05.78,pause,meter_id=2),resubmit(,8) data_len=576
icmp,vlan_tci=0x0000,dl_src=0a:58:64:40:00:02,dl_dst=0a:58:64:40:00:01,nw_src=169.254.169.2,nw_dst=10.244.0.6,nw_tos=0,nw_ecn=0,nw_ttl=255,nw_frag=no,icmp_type=3,icmp_code=1 icmp_csum:4273

2023-02-15T19:43:58.775Z|7013791|vconn|DBG|unix#556308: sent (Success): NXT_PACKET_IN2 (OF1.5) (xid=0x0): total_len=576 reg0=0x64400001,reg1=0x64400002,reg9=0x5390005,reg10=0x49,reg11=0x8,reg14=0x2,reg15=0x1,metadata=0x7,in_port=CONTROLLER (via action) data_len=576 (unbuffered)
 userdata=00.00.00.0d.00.00.00.00.05.78
 continuation.bridge=2c7f8bd4-501c-42ee-8607-7be74fc19d13
 continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,8)
 continuation.odp_port=4294967295
icmp,vlan_tci=0x0000,dl_src=0a:58:64:40:00:02,dl_dst=02:42:ac:12:00:02,nw_src=172.18.0.2,nw_dst=169.254.169.2,nw_tos=0,nw_ecn=0,nw_ttl=255,nw_frag=no,icmp_type=3,icmp_code=4 icmp_csum:4270

2023-02-15T19:43:58.775Z|7013792|vconn|DBG|unix#556308: received: NXT_RESUME (OF1.5) (xid=0x0): total_len=576 reg0=0x64400001,reg1=0x64400002,reg9=0x5390005,reg10=0x49,reg11=0x8,reg14=0x2,reg15=0x1,metadata=0x7,in_port=CONTROLLER (via action) data_len=576 (unbuffered)
 continuation.bridge=2c7f8bd4-501c-42ee-8607-7be74fc19d13
 continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,8)
 continuation.odp_port=4294967295
icmp,vlan_tci=0x0000,dl_src=0a:58:64:40:00:02,dl_dst=02:42:ac:12:00:02,nw_src=172.18.0.2,nw_dst=169.254.169.2,nw_tos=0,nw_ecn=0,nw_ttl=255,nw_frag=no,icmp_type=3,icmp_code=4 icmp_csum:3cf8

2023-02-15T19:43:58.775Z|00001|ofproto_dpif_upcall(handler2)|INFO|received packet on unassociated datapath port 4294967295 (no OpenFlow port for datapath port 4294967295)
2023-02-15T19:43:59.102Z|00011|ofproto_dpif_upcall(revalidator10)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:ceca1501-c11d-4ccb-8fd2-446514f32fa8

--- Additional comment from Tim Rozet on 2023-02-17 14:41:19 UTC ---

missed the first vconn log:
2023-02-15T19:43:58.774Z|7013789|vconn|DBG|unix#556308: sent (Success): NXT_PACKET_IN2 (OF1.5) (xid=0x0): table_id=27 cookie=0xde8a2ac total_len=1448 ct_state=new|trk|snat|dnat,ct_zone=8,ct_mark=0xa,ct_nw_src=169.254.169.2,ct_nw_dst=10.96.165.251,ct_nw_proto=17,ct_tp_src=52159,ct_tp_dst=1337,ip,reg0=0x64400001,reg1=0x64400002,reg9=0x5390006,reg10=0x49,reg11=0x8,reg14=0x2,reg15=0x1,metadata=0x7,in_port=3 (via action) data_len=1448 (unbuffered)
 userdata=00.00.00.0e.00.00.00.00.00.19.00.18.80.01.09.10.00.00.00.00.00.00.00.01.00.00.00.00.00.00.00.01.00.19.00.18.80.01.09.10.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.02.00.19.00.10.80.00.06.06.02.42.ac.12.00.02.00.00.00.1c.00.18.00.20.00.00.00.00.00.00.80.00.16.04.80.00.18.04.00.00.00.00.00.19.00.10.80.00.16.04.ac.12.00.02.00.00.00.00.00.19.00.10.00.01.3a.01.ff.00.00.00.00.00.00.00.00.19.00.10.80.00.26.01.03.00.00.00.00.00.00.00.00.19.00.10.80.00.28.01.04.00.00.00.00.00.00.00.ff.ff.00.30.00.00.23.20.00.25.00.00.00.00.00.00.00.03.00.0e.00.00.00.0d.00.00.00.00.05.78.00.00.00.04.00.04.00.00.00.00.00.05.00.08.00.00.00.02.ff.ff.00.10.00.00.23.20.00.0e.ff.f8.08.00.00.00
udp,vlan_tci=0x0000,dl_src=0a:58:64:40:00:02,dl_dst=0a:58:64:40:00:01,nw_src=169.254.169.2,nw_dst=10.244.0.6,nw_tos=0,nw_ecn=0,nw_ttl=63,nw_frag=no,tp_src=52159,tp_dst=1337 udp_csum:22f9

--- Additional comment from Tim Rozet on 2023-02-17 16:29:13 UTC ---

OK, so after some further investigation with Lorenzo, this is the outcome:

1. This problem looks to be an OVS bug. We have not exactly identified what triggers it, but we think it may have something to do with traffic originating from the LOCAL port. I'm going to clone this bz to the OVS team.

OVS error:
2023-02-15T19:43:58.775Z|00001|ofproto_dpif_upcall(handler2)|INFO|received packet on unassociated datapath port 4294967295 (no OpenFlow port for datapath port 4294967295)
2023-02-15T19:43:59.102Z|00011|ofproto_dpif_upcall(revalidator10)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:ceca1501-c11d-4ccb-8fd2-446514f32fa8

2. This BZ will remain open to track 2 issues:

a. Lorenzo and Dumitru uncovered that when fragmented packets are sent to an OVN load balancer, only the first fragment (the one carrying the UDP destination port) makes it to the destination; the other fragment does not. A rough way to generate such traffic is sketched below.

b. OVN team to investigate whether the method of using 2 PACKET_INs to handle an ICMP needs frag could/should be reduced to 1.
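For illustration, a minimal way to generate the fragmented UDP traffic described in (a) above. This is only a sketch, not the exact traffic used during debugging: it assumes scapy is installed and the script is run as root on the client node; the VIP and port are the ones used earlier in this bug. The datagram is pre-fragmented on the sender, so only the first fragment carries the UDP header.

from scapy.all import IP, UDP, Raw, fragment, send

service_ip = "10.96.165.251"   # service VIP from the repro above
service_port = 1337

# ~2000 bytes of payload so the datagram cannot fit in a single 1400-byte fragment.
pkt = IP(dst=service_ip) / UDP(sport=34358, dport=service_port) / Raw(b"o" * 2000)

# Pre-fragment on the sender; only the first fragment contains the UDP header.
for frag in fragment(pkt, fragsize=1400):
    send(frag)

Running tcpdump on the backend pod side should then show whether the non-first fragment ever arrives.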
Paolo, is there any update on this bug, or on when you will start working on it?
Paolo provided a patch which fixes the issue:

diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index c01177718..23f510ffb 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -1532,7 +1532,8 @@ xlate_lookup_ofproto_(const struct dpif_backer *backer,
     }
 
     ofp_port_t in_port = recirc_id_node->state.metadata.in_port;
-    if (in_port != OFPP_NONE && in_port != OFPP_CONTROLLER) {
+    if (in_port != OFPP_NONE && in_port != OFPP_CONTROLLER &&
+        !uuid_is_zero(&recirc_id_node->state.xport_uuid)) {
         struct uuid xport_uuid = recirc_id_node->state.xport_uuid;
         xport = xport_lookup_by_uuid(xcfg, &xport_uuid);
         if (xport && xport->xbridge && xport->xbridge->ofproto) {
After some tests on both OVN and OVS, the problem was reproduced in a simpler setup involving OVS only. It is observable when a packet originating from the controller toward br1 is sent to a patch port and then recirculates before proceeding through the pipeline. Continuation and local ports don't seem to be relevant to triggering the issue. As Tim wrote in the previous comment, a quick patch was tested.
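For illustration only, a rough and unverified sketch of the kind of OVS-only setup described above: a packet injected from the controller crosses a patch port into a second bridge and hits a ct() action, which forces a recirculation. The bridge names, port names, flows, and addresses here are all made up for the sketch and are not the exact reproducer that was used; scapy is used only to build the hex frame for packet-out.

import subprocess
from scapy.all import Ether, IP, UDP, Raw

def sh(cmd):
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

# Two bridges joined by a patch port pair.
sh("ovs-vsctl --may-exist add-br br0")
sh("ovs-vsctl --may-exist add-br br1")
sh("ovs-vsctl --may-exist add-port br0 p0 -- set interface p0 type=patch options:peer=p1")
sh("ovs-vsctl --may-exist add-port br1 p1 -- set interface p1 type=patch options:peer=p0")

# br0 hands everything to the patch port; br1 runs untracked IP traffic through
# conntrack (which recirculates the packet) and then forwards it normally.
sh("ovs-ofctl add-flow br0 'priority=10,actions=output:p0'")
sh("ovs-ofctl add-flow br1 'priority=10,ip,ct_state=-trk,actions=ct(table=0)'")
sh("ovs-ofctl add-flow br1 'priority=20,ip,ct_state=+trk,actions=normal'")

# Inject a UDP frame as a packet-out from the controller port of br0.
pkt = (Ether(src="0a:00:00:00:00:01", dst="0a:00:00:00:00:02") /
       IP(src="192.0.2.1", dst="192.0.2.2") /
       UDP(sport=1234, dport=5678) / Raw(b"x" * 64))
sh("ovs-ofctl -O OpenFlow13 packet-out br0 "
   "'in_port=CONTROLLER, packet=%s, actions=resubmit(,0)'" % bytes(pkt).hex())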
(In reply to Paolo Valerio from comment #4)
> After some tests on both OVN and OVS, the problem was reproduced in a
> simpler setup involving OVS only.
> It is observable when a packet originating from the controller toward br1
> is sent to a patch port and then recirculates before proceeding through
> the pipeline.
> Continuation and local ports don't seem to be relevant to triggering the
> issue.
> As Tim wrote in the previous comment, a quick patch was tested.

For completeness: the patch, which was aimed at confirming and root-causing the issue, was tested both on the OVS setup (affected) and on the ovn-fake-multinode setup (not affected) that Lorenzo and I worked on in order to reproduce the issue.
Paolo, do you mind providing a link to the posted fix? Thanks.
(In reply to Tim Rozet from comment #6)
> Paolo, do you mind providing a link to the posted fix? Thanks.

Sure, here's the link to the proposed fix:
https://patchwork.ozlabs.org/project/openvswitch/patch/168416414918.7273.15419064875718310114.stgit@rawp/
Hi Paolo, do you know if this patch will be backported to OCP 4.12?
(In reply to William Zhao from comment #8)
> Hi Paolo, do you know if this patch will be backported to OCP 4.12?

Hi William, I can't say for sure here as it depends on upstream. If the backport there works out, we will provide the updated downstream build for OCP 4.12.
Any update on this BZ or ETA on when it will land?
The next version is under test. I'll post a new revision soon.
v2: https://patchwork.ozlabs.org/project/openvswitch/patch/169394118209.2946148.11773928893766161946.stgit@rawp/
Fix included in: https://errata.devel.redhat.com/advisory/120352