Description of problem:

Recently OVN added the ability to handle sending ICMP needs frag. In my case I am triggering this behavior with a UDP packet, and ovn-controller gets the packet-in and responds with an ICMP needs frag. However, somehow the ICMP response ends up being packet-in'ed again to ovn-controller:

2023-02-14T17:29:49.506Z|00004|pinctrl(ovn_pinctrl0)|DBG|pinctrl received packet-in | opcode=ICMP4_ERROR| OF_Table_ID=27| OF_Cookie_ID=0xde8a2ac| in-port=3| src-mac=0a:58:64:40:00:02, dst-mac=0a:58:64:40:00:01| src-ip=169.254.169.2, dst-ip=10.244.0.6
2023-02-14T17:29:49.506Z|00005|pinctrl(ovn_pinctrl0)|DBG|pinctrl received packet-in | opcode=PUT_ICMP4_FRAG_MTU| OF_Table_ID=0| OF_Cookie_ID=0xffffffffffffffff| in-port=65533| src-mac=0a:58:64:40:00:02, dst-mac=02:42:ac:12:00:02| src-ip=172.18.0.2, dst-ip=169.254.169.2

Version-Release number of selected component (if applicable):

[root@ovn-worker ~]# rpm -qa | grep ovn
ovn-22.12.90-1.fc37.x86_64
ovn-central-22.12.90-1.fc37.x86_64
ovn-host-22.12.90-1.fc37.x86_64
ovn-vtep-22.12.90-1.fc37.x86_64

How reproducible:
Every time

Steps to Reproduce:

1. Launch kind in ovn-kubernetes (./kind.sh -wk 1)

2. Create a pod that will serve as an endpoint to a service:

[trozet@fedora ~]$ cat ~/basic.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: trozet3
  labels:
    pod-name: client
    #role: webserver
    #app: spk-coredns
spec:
  containers:
  - name: pod1
    image: python
    #command:
    #- /sbin/init
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 3000000; done;" ]
    imagePullPolicy: IfNotPresent
    ports:
    - name: dns-tcp
      containerPort: 53
      protocol: TCP
    - name: dns-udp
      containerPort: 9999
      protocol: UDP

3. Create a service:

[trozet@fedora ~]$ cat ~/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service1
  #namespace: exgw2
spec:
  selector:
    pod-name: client
  ports:
    - protocol: UDP
      port: 1337
      targetPort: 1337
      name: blah
    - protocol: TCP
      port: 80
      targetPort: 80
      name: blah2
  externalIPs:
    - 172.18.0.99
  type: NodePort
  externalTrafficPolicy: Local

4. The pod should be placed on the ovn-worker node. Exec into the pod and run a UDP server (change localIP to match your pod IP, or use 0.0.0.0 to bind all):

import socket

localIP = "10.244.0.6"
localPort = 1337
bufferSize = 1024

msgFromServer = "Hello UDP Client"
bytesToSend = str.encode(msgFromServer)

# Create a datagram socket
UDPServerSocket = socket.socket(family=socket.AF_INET, type=socket.SOCK_DGRAM)

# Bind to address and ip
UDPServerSocket.bind((localIP, localPort))

print("UDP server up and listening")

# Listen for incoming datagrams
while(True):
    bytesAddressPair = UDPServerSocket.recvfrom(bufferSize)
    message = bytesAddressPair[0]
    address = bytesAddressPair[1]
    clientMsg = "Message from Client:{}".format(message)
    clientIP = "Client IP Address:{}".format(address)
    print(clientMsg)
    print(clientIP)
    # Sending a reply to client
    UDPServerSocket.sendto(bytesToSend, address)

5. Log in to the ovn-worker node and modify these 2 routes so they no longer have "mtu 1400" set:

169.254.169.1 dev breth0 src 172.18.0.2
10.96.0.0/16 via 169.254.169.4 dev breth0
6. Log in to the ovn-worker node and run a Python client (modify the service IP):

import socket
import sys

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_data = "echo n" + "o" * 1400
send_work = "echo blah"
service_ip = "10.96.165.251"
service_port = 1337

s.sendto(send_work.encode('utf-8'), (service_ip, service_port))
# ^ this will work and send a small packet
data, address = s.recvfrom(4096)
s.sendto(send_data.encode('utf-8'), (service_ip, service_port))
# ^ this will send the large packet which will trigger ICMP needs frag

Actual results:

In tcpdump I only see the single large packet sent and no ICMP needs frag:

[root@ovn-worker ~]# tcpdump -i any port 1337 or icmp -nneev
tcpdump: data link type LINUX_SLL2
dropped privs to tcpdump
tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
17:27:35.929295 breth0 Out ifindex 6 02:42:ac:12:00:02 ethertype IPv4 (0x0800), length 57: (tos 0x0, ttl 64, id 20241, offset 0, flags [DF], proto UDP (17), length 37)
    169.254.169.2.34358 > 10.96.165.251.1337: UDP, length 9
17:27:35.929980 171f3e0078841b7 Out ifindex 10 0a:58:0a:f4:00:01 ethertype IPv4 (0x0800), length 57: (tos 0x0, ttl 62, id 20241, offset 0, flags [DF], proto UDP (17), length 37)
    100.64.0.2.34358 > 10.244.0.6.1337: UDP, length 9
17:27:35.930179 171f3e0078841b7 P ifindex 10 0a:58:0a:f4:00:06 ethertype IPv4 (0x0800), length 64: (tos 0x0, ttl 64, id 6994, offset 0, flags [DF], proto UDP (17), length 44)
    10.244.0.6.1337 > 100.64.0.2.34358: UDP, length 16
17:27:35.930524 breth0 In ifindex 6 0a:58:a9:fe:a9:04 ethertype IPv4 (0x0800), length 64: (tos 0x0, ttl 62, id 6994, offset 0, flags [DF], proto UDP (17), length 44)
    10.96.165.251.1337 > 169.254.169.2.34358: UDP, length 16

^ Small packet works fine

17:28:21.861554 breth0 Out ifindex 6 02:42:ac:12:00:02 ethertype IPv4 (0x0800), length 1454: (tos 0x0, ttl 64, id 21977, offset 0, flags [DF], proto UDP (17), length 1434)
    169.254.169.2.34358 > 10.96.165.251.1337: UDP, length 1406

^ Nothing after sending the large packet

Expected results:

Should get an ICMP needs frag back to my host.
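If OVN's reply did make it back, the host kernel would cache the 1400-byte path MTU and surface it to the sender. A minimal sketch of observing that from the client side (assuming Linux; the numeric option values are from <linux/in.h>, which Python's socket module does not export):

import errno
import socket

# Constants from <linux/in.h>; not exported by Python's socket module.
IP_MTU_DISCOVER = 10
IP_PMTUDISC_DO = 2   # always set DF and perform PMTU discovery
IP_MTU = 14

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
s.connect(("10.96.165.251", 1337))  # connect() so the kernel reports PMTU errors

try:
    s.send(b"o" * 1406)  # 1434 bytes on the wire; over the 1400-byte path MTU
    s.send(b"o" * 1406)  # once the ICMP error is cached, this raises EMSGSIZE
except OSError as e:
    if e.errno == errno.EMSGSIZE:
        print("kernel learned path MTU:", s.getsockopt(socket.IPPROTO_IP, IP_MTU))

With the bug present, the second send never raises EMSGSIZE, because the ICMP error never reaches the host.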
After some live debugging with Dumitru, we noticed the MACs do not look correct in the output of:

2023-02-14T17:29:49.506Z|00004|pinctrl(ovn_pinctrl0)|DBG|pinctrl received packet-in | opcode=ICMP4_ERROR| OF_Table_ID=27| OF_Cookie_ID=0xde8a2ac| in-port=3| src-mac=0a:58:64:40:00:02, dst-mac=0a:58:64:40:00:01| src-ip=169.254.169.2, dst-ip=10.244.0.6

Dumitru found that this is because this log actually prints the packet after it is modified to be sent back out, rather than the original packet-in. This should also be fixed. Additionally, the MAC 0a:58:64:40:00:02 belongs to ovn_cluster_router (the next hop after routing in the GR). Looking at an ovn-trace:

12. lr_in_ip_routing_pre (northd.c:11927): 1, priority 0, uuid d6dd449c
    reg7 = 0;
    next;
13. lr_in_ip_routing (northd.c:10317): reg7 == 0 && ip4.dst == 10.244.0.0/16, priority 49, uuid 049e28ea
    ip.ttl--;
    reg8[0..15] = 0;
    reg0 = 100.64.0.1;
    reg1 = 100.64.0.2;
    eth.src = 0a:58:64:40:00:02;
    outport = "rtoj-GR_ovn-worker";
    flags.loopback = 1;
    next;
14. lr_in_ip_routing_ecmp (northd.c:12005): reg8[0..15] == 0, priority 150, uuid b1ae4386
    next;
15. lr_in_policy (northd.c:12172): 1, priority 0, uuid 41142d7b
    reg8[0..15] = 0;
    next;
16. lr_in_policy_ecmp (northd.c:12174): reg8[0..15] == 0, priority 150, uuid 7e839f56
    next;
17. lr_in_arp_resolve (northd.c:12209): ip4, priority 1, uuid ad2b17bc
    get_arp(outport, reg0);
    /* MAC binding to 0a:58:64:40:00:01. */
    next;
18. lr_in_chk_pkt_len (northd.c:11532): outport == "rtoj-GR_ovn-worker", priority 50, uuid bb1f45da
    reg9[1] = check_pkt_larger(1414);
    next;
21. lr_in_arp_request (northd.c:12908): 1, priority 0, uuid bf4acee0
    output;

Routing (which sets the next-hop MAC) occurs before the check of whether the packet is too large to forward and must be punted to the controller. This seems like another bug: the next-hop MAC should not be set, because this packet will never be forwarded to the destination, and ovn-controller uses this packet to determine the reply MAC. Finally, Dumitru pointed out that the OVN flow resets the dst MAC to be itself, which also does not make much sense:

actions : "icmp4_error {reg9[0] = 1; reg9[1] = 0; eth.dst = 02:42:ac:12:00:02; ip4.dst = ip4.src; ip4.src = 172.18.0.2; ip.ttl = 255; icmp4.type = 3; /* Destination Unreachable. */ icmp4.code = 4; /* Frag Needed and DF was Set. */ icmp4.frag_mtu = 1400; next(pipeline=ingress, table=0); };"

which is generated by this northd code:

ds_put_format(actions,
              "icmp4_error {"
              REGBIT_EGRESS_LOOPBACK" = 1; "
              REGBIT_PKT_LARGER" = 0; "
              "eth.dst = %s; "
              "ip4.dst = ip4.src; "
              "ip4.src = %s; "
              "ip.ttl = 255; "
              "icmp4.type = 3; /* Destination Unreachable. */ "
              "icmp4.code = 4; /* Frag Needed and DF was Set. */ "
              "icmp4.frag_mtu = %d; "
              "next(pipeline=ingress, table=%d); };",
              op->lrp_networks.ea_s,
              op->lrp_networks.ipv4_addrs[0].addr_s,
              mtu,
              ovn_stage_get_table(S_ROUTER_IN_ADMISSION));
ovn_lflow_add_with_hint__(lflows, op->od, stage, 150,
                          ds_cstr(match), ds_cstr(actions), NULL,
                          copp_meter_get(COPP_ICMP4_ERR, op->od->nbr->copp,
                                         meter_groups),
                          &op->nbrp->header_);
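For reference, this is roughly the packet the router pipeline is trying to synthesize. A minimal scapy sketch (illustrative only; addresses and MTU are taken from the flow above) of a well-formed ICMP needs frag error, which per RFC 792 quotes the original IP header plus at least the first 8 bytes of its payload, with the next-hop MTU field from RFC 1191. It covers only L3/L4; the Ethernet addressing is exactly the part this report argues the flows get wrong:

from scapy.all import ICMP, IP, UDP, raw

# The oversized original datagram (addresses/ports from the reproducer).
orig = IP(src="169.254.169.2", dst="10.244.0.6") / UDP(sport=34358, dport=1337) / ("o" * 1406)

# RFC 792: quote the original IP header (20 bytes, assuming no options)
# plus the first 8 bytes of its payload -- enough to cover the UDP header.
quoted = raw(orig)[:20 + 8]

# Field values mirroring the OVN flow above: ip4.src = 172.18.0.2,
# ip.ttl = 255, icmp4.type/code = 3/4, icmp4.frag_mtu = 1400.
needs_frag = (IP(src="172.18.0.2", dst=orig.src, ttl=255)
              / ICMP(type=3, code=4, nexthopmtu=1400)
              / quoted)
needs_frag.show2()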
Created attachment 1944176 [details] nbdb, sbdb, OF flows, ovsdb, etc
Simulating a trace with the same packet-out:

[root@ovn-worker openvswitch]# ovs-appctl ofproto/trace-packet-out br-int in_port=controller,icmp,vlan_tci=0x0000,dl_src=0a:58:64:40:00:02,dl_dst=0a:58:64:40:00:01,nw_src=169.254.169.2,nw_dst=10.244.0.6,nw_tos=0,nw_ecn=0,nw_ttl=255,nw_frag=no,icmp_type=3,icmp_code=1 'set_field:0x64400001->reg0,set_field:0x64400002->reg1,set_field:0x5390006->reg9,set_field:0x49->reg10,set_field:0x8->reg11,set_field:0x2->reg14,set_field:0x1->reg15,set_field:0x7->metadata,set_field:0x1/0x1->xreg4,set_field:0/0x2->xreg4,set_field:02:42:ac:12:00:02->eth_dst,move:NXM_OF_IP_SRC[]->NXM_OF_IP_DST[],set_field:172.18.0.2->ip_src,set_field:255->nw_ttl,set_field:3->icmp_type,set_field:4->icmp_code,controller(userdata=00.00.00.0d.00.00.00.00.05.78,pause,meter_id=2),resubmit(,8)'

Flow: icmp,in_port=CONTROLLER,vlan_tci=0x0000,dl_src=0a:58:64:40:00:02,dl_dst=0a:58:64:40:00:01,nw_src=169.254.169.2,nw_dst=10.244.0.6,nw_tos=0,nw_ecn=0,nw_ttl=255,nw_frag=no,icmp_type=3,icmp_code=1

bridge("br-int")
----------------
 set_field:0x64400001->reg0
 set_field:0x64400002->reg1
 set_field:0x5390006->reg9
 set_field:0x49->reg10
 set_field:0x8->reg11
 set_field:0x2->reg14
 set_field:0x1->reg15
 set_field:0x7->metadata
 set_field:0x1/0x1->xreg4
 set_field:0/0x2->xreg4
 set_field:02:42:ac:12:00:02->eth_dst
 move:NXM_OF_IP_SRC[]->NXM_OF_IP_DST[]
  -> NXM_OF_IP_DST[] is now 169.254.169.2
 set_field:172.18.0.2->ip_src
 set_field:255->nw_ttl
 set_field:3->icmp_type
 set_field:4->icmp_code
 controller(userdata=00.00.00.0d.00.00.00.00.05.78,pause,meter_id=2)

Final flow: icmp,reg0=0x64400001,reg1=0x64400002,reg9=0x5390005,reg10=0x49,reg11=0x8,reg14=0x2,reg15=0x1,metadata=0x7,in_port=CONTROLLER,vlan_tci=0x0000,dl_src=0a:58:64:40:00:02,dl_dst=02:42:ac:12:00:02,nw_src=172.18.0.2,nw_dst=169.254.169.2,nw_tos=0,nw_ecn=0,nw_ttl=255,nw_frag=no,icmp_type=3,icmp_code=4
Megaflow: recirc_id=0,eth,icmp,in_port=CONTROLLER,dl_dst=0a:58:64:40:00:01,nw_src=169.254.169.2,nw_dst=10.244.0.6,nw_ttl=255,nw_frag=no,icmp_type=0x3/0xff,icmp_code=0x1/0xff
Datapath actions: set(eth(dst=02:42:ac:12:00:02)),set(ipv4(src=172.18.0.2,dst=169.254.169.2)),set(icmp(type=3,code=4))
This flow is handled by the userspace slow path because it:
 - Uses action(s) not supported by datapath.
vconn logs:

2023-02-15T19:43:58.774Z|7013790|vconn|DBG|unix#556308: received: OFPT_PACKET_OUT (OF1.5) (xid=0x1a27):
 in_port=CONTROLLER actions=set_field:0x64400001->reg0,set_field:0x64400002->reg1,set_field:0x5390006->reg9,set_field:0x49->reg10,set_field:0x8->reg11,set_field:0x2->reg14,set_field:0x1->reg15,set_field:0x7->metadata,set_field:0x1/0x1->xreg4,set_field:0/0x2->xreg4,set_field:02:42:ac:12:00:02->eth_dst,move:NXM_OF_IP_SRC[]->NXM_OF_IP_DST[],set_field:172.18.0.2->ip_src,set_field:255->nw_ttl,set_field:3->icmp_type,set_field:4->icmp_code,controller(userdata=00.00.00.0d.00.00.00.00.05.78,pause,meter_id=2),resubmit(,8) data_len=576
 icmp,vlan_tci=0x0000,dl_src=0a:58:64:40:00:02,dl_dst=0a:58:64:40:00:01,nw_src=169.254.169.2,nw_dst=10.244.0.6,nw_tos=0,nw_ecn=0,nw_ttl=255,nw_frag=no,icmp_type=3,icmp_code=1 icmp_csum:4273

2023-02-15T19:43:58.775Z|7013791|vconn|DBG|unix#556308: sent (Success): NXT_PACKET_IN2 (OF1.5) (xid=0x0):
 total_len=576 reg0=0x64400001,reg1=0x64400002,reg9=0x5390005,reg10=0x49,reg11=0x8,reg14=0x2,reg15=0x1,metadata=0x7,in_port=CONTROLLER (via action) data_len=576 (unbuffered)
 userdata=00.00.00.0d.00.00.00.00.05.78
 continuation.bridge=2c7f8bd4-501c-42ee-8607-7be74fc19d13
 continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,8)
 continuation.odp_port=4294967295
 icmp,vlan_tci=0x0000,dl_src=0a:58:64:40:00:02,dl_dst=02:42:ac:12:00:02,nw_src=172.18.0.2,nw_dst=169.254.169.2,nw_tos=0,nw_ecn=0,nw_ttl=255,nw_frag=no,icmp_type=3,icmp_code=4 icmp_csum:4270

2023-02-15T19:43:58.775Z|7013792|vconn|DBG|unix#556308: received: NXT_RESUME (OF1.5) (xid=0x0):
 total_len=576 reg0=0x64400001,reg1=0x64400002,reg9=0x5390005,reg10=0x49,reg11=0x8,reg14=0x2,reg15=0x1,metadata=0x7,in_port=CONTROLLER (via action) data_len=576 (unbuffered)
 continuation.bridge=2c7f8bd4-501c-42ee-8607-7be74fc19d13
 continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,8)
 continuation.odp_port=4294967295
 icmp,vlan_tci=0x0000,dl_src=0a:58:64:40:00:02,dl_dst=02:42:ac:12:00:02,nw_src=172.18.0.2,nw_dst=169.254.169.2,nw_tos=0,nw_ecn=0,nw_ttl=255,nw_frag=no,icmp_type=3,icmp_code=4 icmp_csum:3cf8

2023-02-15T19:43:58.775Z|00001|ofproto_dpif_upcall(handler2)|INFO|received packet on unassociated datapath port 4294967295 (no OpenFlow port for datapath port 4294967295)
2023-02-15T19:43:59.102Z|00011|ofproto_dpif_upcall(revalidator10)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:ceca1501-c11d-4ccb-8fd2-446514f32fa8
Missed the first vconn log:

2023-02-15T19:43:58.774Z|7013789|vconn|DBG|unix#556308: sent (Success): NXT_PACKET_IN2 (OF1.5) (xid=0x0):
 table_id=27 cookie=0xde8a2ac total_len=1448 ct_state=new|trk|snat|dnat,ct_zone=8,ct_mark=0xa,ct_nw_src=169.254.169.2,ct_nw_dst=10.96.165.251,ct_nw_proto=17,ct_tp_src=52159,ct_tp_dst=1337,ip,reg0=0x64400001,reg1=0x64400002,reg9=0x5390006,reg10=0x49,reg11=0x8,reg14=0x2,reg15=0x1,metadata=0x7,in_port=3 (via action) data_len=1448 (unbuffered)
 userdata=00.00.00.0e.00.00.00.00.00.19.00.18.80.01.09.10.00.00.00.00.00.00.00.01.00.00.00.00.00.00.00.01.00.19.00.18.80.01.09.10.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.02.00.19.00.10.80.00.06.06.02.42.ac.12.00.02.00.00.00.1c.00.18.00.20.00.00.00.00.00.00.80.00.16.04.80.00.18.04.00.00.00.00.00.19.00.10.80.00.16.04.ac.12.00.02.00.00.00.00.00.19.00.10.00.01.3a.01.ff.00.00.00.00.00.00.00.00.19.00.10.80.00.26.01.03.00.00.00.00.00.00.00.00.19.00.10.80.00.28.01.04.00.00.00.00.00.00.00.ff.ff.00.30.00.00.23.20.00.25.00.00.00.00.00.00.00.03.00.0e.00.00.00.0d.00.00.00.00.05.78.00.00.00.04.00.04.00.00.00.00.00.05.00.08.00.00.00.02.ff.ff.00.10.00.00.23.20.00.0e.ff.f8.08.00.00.00
 udp,vlan_tci=0x0000,dl_src=0a:58:64:40:00:02,dl_dst=0a:58:64:40:00:01,nw_src=169.254.169.2,nw_dst=10.244.0.6,nw_tos=0,nw_ecn=0,nw_ttl=63,nw_frag=no,tp_src=52159,tp_dst=1337 udp_csum:22f9
OK, so after some further investigation with Lorenzo, this is the outcome:

1. This problem looks to be an OVS bug. We have not exactly identified what triggers it, but we think it may have something to do with traffic originating from the LOCAL port. I'm going to clone this BZ to the OVS team. OVS error:

2023-02-15T19:43:58.775Z|00001|ofproto_dpif_upcall(handler2)|INFO|received packet on unassociated datapath port 4294967295 (no OpenFlow port for datapath port 4294967295)
2023-02-15T19:43:59.102Z|00011|ofproto_dpif_upcall(revalidator10)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:ceca1501-c11d-4ccb-8fd2-446514f32fa8

2. This BZ will remain open to track 2 issues:
   a. Lorenzo and Dumitru uncovered that when fragmented packets are sent to an OVN load balancer, only the first fragment (the one carrying the UDP destination port) makes it to the destination; the remaining fragments do not. (A client sketch that triggers such fragmentation follows below.)
   b. OVN team to investigate whether the method of using 2 PACKET_INs to handle an ICMP needs frag could/should be reduced to 1.
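For issue 2.a, a minimal client-side sketch (assuming Linux and the service VIP from the reproducer) that makes the kernel emit a fragmented UDP datagram; the numeric IP_MTU_DISCOVER constant is from <linux/in.h>, since Python's socket module does not export it:

import socket

IP_MTU_DISCOVER = 10   # from <linux/in.h>; not exported by Python's socket module
IP_PMTUDISC_DONT = 0   # never set DF, so the kernel fragments oversized datagrams

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DONT)

# 3000 bytes of payload exceeds the path MTU, so this leaves the host as
# multiple IP fragments; per issue 2.a only the first fragment (the one
# carrying the UDP header) reaches the backend pod.
s.sendto(b"x" * 3000, ("10.96.165.251", 1337))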
(In reply to Tim Rozet from comment #6)
> OK, so after some further investigation with Lorenzo this is the outcome:
>
> 1. This problem looks to be an OVS bug. We have not exactly identified what
> triggers it, but we think it may have something to do with traffic originating
> from the LOCAL port. I'm going to clone this bz to the OVS team. OVS error:
> 2023-02-15T19:43:58.775Z|00001|ofproto_dpif_upcall(handler2)|INFO|received
> packet on unassociated datapath port 4294967295 (no OpenFlow port for
> datapath port 4294967295)
> 2023-02-15T19:43:59.102Z|00011|ofproto_dpif_upcall(revalidator10)|WARN|Failed
> to acquire udpif_key corresponding to unexpected flow (Invalid argument):
> ufid:ceca1501-c11d-4ccb-8fd2-446514f32fa8

A possible OVS fix for this issue [0] has been proposed here:
https://patchwork.ozlabs.org/project/openvswitch/patch/168416414918.7273.15419064875718310114.stgit@rawp/

> 2. This BZ will remain open to track 2 issues:
> a. Lorenzo and Dumitru uncovered that when fragmented packets are sent to
> an OVN load balancer, only the first fragment (with the UDP dest port) makes
> it to the destination. The other packet does not.

This is an OVN issue fixed in the following commit:

commit ce46a1bacf69140e64689a066a507471ba72d80f
Author: Ales Musil <amusil>
Date:   Thu Mar 9 07:21:44 2023 +0100

    northd: Use generic ct.est flows for LR LBs

    Currently, there is one ct.est flow per LB VIP, that was required to
    keep track if we need to pass the "skip_snat" or "force_snat" flags.
    However since c1d6b8ac ("northd: Store skip_snat and force_snat in
    ct_label/mark") the flags are carried in the ct entry and we can use
    match on them the same way we do for related traffic.

    Simplify the logic for established traffic through load balancers,
    by removing the requirement for one ct.est flow per VIP and replacing
    them with three generic ct.est flows:

    match=(ct.est && !ct.rel && !ct.new && ct_mark.natted), action=(next;)
    match=(ct.est && !ct.rel && !ct.new && ct_mark.natted && ct_mark.force_snat == 1), action=(flags.force_snat_for_lb = 1; next;)
    match=(ct.est && !ct.rel && !ct.new && ct_mark.natted && ct_mark.skip_snat == 1), action=(flags.skip_snat_for_lb = 1; next;)

    This allows us avoiding of matching on L4 in defrag stage by not
    storing the L3 and L4 destination in registers. Match directly on L3
    and L4 destination for ct.new in DNAT stage. Populate the registers
    in LB affinity check stage as they are needed for LB affinity learn.

    Reported-at: https://bugzilla.redhat.com/2172048
    Reported-at: https://bugzilla.redhat.com/2170885
    Signed-off-by: Ales Musil <amusil>
    Reviewed-by: Simon Horman <simon.horman>
    Signed-off-by: Dumitru Ceara <dceara>

> b. OVN team to investigate if the method of using 2 PACKET_INs to handle
> an ICMP needs frag could/should be reduced to 1.

This approach was explicitly requested by Ben Pfaff, and it has no performance implications since it occurs only on the first incoming packet, which is transformed into an ICMP error message.

For the reasons above I will close this BZ as 'not a bug', since (1) and (2.a) are not directly tracked by this BZ.

[0] https://bugzilla.redhat.com/show_bug.cgi?id=2170920