check_pkt_larger action translation is incomplete in OVS

When this action is encountered, translation of the remaining OpenFlow actions stops once check_pkt_larger itself has been translated. With flows like the ones below, this does not work as expected:

table=0,in_port=1 actions=load:0x1->NXM_NX_REG1[],resubmit(,1),load:0x2->NXM_NX_REG1[],resubmit(,1),load:0x3->NXM_NX_REG1[],resubmit(,1)
table=1,in_port=1,reg1=0x1 actions=check_pkt_larger(200)->NXM_NX_REG0[0],resubmit(,4)
table=1,in_port=1,reg1=0x2 actions=output:2
table=1,in_port=1,reg1=0x3 actions=output:4
table=4,in_port=1 actions=output:3

For the above OpenFlow rules, the datapath flow should be:
check_pkt_len(size=200,gt(3),le(3)),2,4

But right now it is:
check_pkt_len(size=200,gt(3),le(3))

Notice that the packet is not output to ports 2 and 4.

+++ This bug was initially created as a clone of Bug #2017424 +++

Description of problem:

This issue is reproduced on a RHOS 16.2 environment, using ovn-2021-21.09.0-12.

There is a problem with VMs connected directly to the external/provider network: they have no connectivity to the metadata service. When I ping the IP of the metadata namespace from the VM instance and capture traffic in the metadata namespace with tcpdump, no packets are captured at all.

This issue is not a race condition; it always happens with ovn-2021-21.09.0-12 + an external network with vlan-transparency=True.
It doesn't occur with instances connected to a tenant network with ovn-2021-21.09.0-12.
It doesn't occur with instances connected to the external network with ovn-2021-21.09.0-12 when the external network has vlan-transparency=False.
It doesn't occur with ovn-2021-21.06.0-29, the latest OVN version officially included in RHOS 16.2.

# VM INSTANCE RUNNING ON COMPUTE-1 AND CONNECTED TO THE EXTERNAL NETWORK
[root@localhost ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:32:c9:c7 brd ff:ff:ff:ff:ff:ff
    inet 10.218.0.188/24 brd 10.218.0.255 scope global dynamic noprefixroute eth0
       valid_lft 37959sec preferred_lft 37959sec
    inet6 fe80::f816:3eff:fe32:c9c7/64 scope link
       valid_lft forever preferred_lft forever
[root@localhost ~]# ip r
default via 10.218.0.10 dev eth0 proto dhcp metric 100
10.218.0.0/24 dev eth0 proto kernel scope link src 10.218.0.188 metric 100
169.254.169.254 via 10.218.0.160 dev eth0 proto dhcp metric 100
[root@localhost ~]# ping 10.218.0.160 -c1
PING 10.218.0.160 (10.218.0.160) 56(84) bytes of data.
From 10.218.0.188 icmp_seq=1 Destination Host Unreachable

--- 10.218.0.160 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

# METADATA NAMESPACE ON THE SAME COMPUTE-1 (NO PACKETS CAPTURED / EXPECTED ONE ICMP PACKET)
[root@compute-1 ~]# ip netns e ovnmeta-50aaa274-19b2-4d99-93fd-84843917fd27 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: tap50aaa274-11@if279: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fa:16:3e:3f:0b:3d brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.218.0.160/24 brd 10.218.0.255 scope global tap50aaa274-11
       valid_lft forever preferred_lft forever
    inet 169.254.169.254/16 brd 169.254.255.255 scope global tap50aaa274-11
       valid_lft forever preferred_lft forever
[root@compute-1 ~]# ip netns e ovnmeta-50aaa274-19b2-4d99-93fd-84843917fd27 tcpdump -vne -i tap50aaa274-11
dropped privs to tcpdump
tcpdump: listening on tap50aaa274-11, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
0 packets captured
0 packets received by filter

Version-Release number of selected component (if applicable):
ovn-2021-21.09.0-12

How reproducible:
100%

Steps to Reproduce (an OpenStack CLI sketch of these steps appears after Eduardo's comment below):
1. Create an external network with vlan-transparency enabled.
2. Create a VM with a port connected to the external network.
3. Check that the VM cannot reach the metadata service during the cloud-init stage.

--- Additional comment from Eduardo Olivares on 2021-10-26 14:48:08 UTC ---

Due to this issue, several tempest tests failed during this RHOS 16.2 job:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp_3net-ipv4-geneve-composable-vlan-provider-network/16/testReport/

For example, test_dscp_marking_external_network failed with:
tempest.lib.exceptions.SSHTimeout: Connection to the 10.218.0.170 via SSH timed out.

The SSH connection cannot be established because the VM instance could not access the metadata service and could not obtain the authorized keys.
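For reference, a rough OpenStack CLI sketch of the reproduction steps above (all resource names, the image, the flavor, the subnet range and the physical network are placeholders, not values taken from this report):

openstack network create ext-net --external --transparent-vlan \
    --provider-network-type vlan --provider-physical-network datacentre
openstack subnet create ext-subnet --network ext-net --subnet-range 10.218.0.0/24
openstack server create vm1 --image cirros --flavor m1.tiny --network ext-net
# During cloud-init the instance should then fail to reach the metadata service (169.254.169.254).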
--- Additional comment from Ihar Hrachyshka on 2021-10-28 22:33:48 UTC ---

Several clarifications to the original description after spending quality time on the cluster w/ Numan (thanks!):

1) This is not specific to the ovnmeta- localport; the whole network is affected, with broadcast traffic not delivered to any ports (the same issue affects flows for communication between two VIF ports).

2) This affects all broadcast flows, not just ARP. We noticed the bug because VLAN transparent networks disable the local ARP responder, forcing OVN to send broadcast ARP requests originating from one switch port to all other ports; broken broadcast in transparent networks effectively means no IP-to-MAC resolution works, which is a lot more visible than for usual networks.

3) We noticed that the flow in table=37 that is meant to fan out the same broadcast frame to all switch ports (including VIFs and the localport ovnmeta- port) does resubmit() the frame to some ports but not others.

4) We noticed that the flow in table=37 sends the frame to the first port, which is a router port, but not to the rest. Example of the fanout flow:

cookie=0xd28ed8be, duration=40941.405s, table=37, n_packets=26458, n_bytes=1259820, idle_age=0, hard_age=40376, priority=100,reg15=0x8000,metadata=0x1 actions=load:0x9->NXM_NX_REG15[],resubmit(,39),load:0x5->NXM_NX_REG15[],resubmit(,39),load:0x2->NXM_NX_REG15[],resubmit(,39),load:0x3->NXM_NX_REG15[],resubmit(,39),load:0x6->NXM_NX_REG15[],resubmit(,39),load:0x8000->NXM_NX_REG15[],resubmit(,38)

5) The first fanout of the frame, to the router port, is processed through the tables, reaching table=65, where it's redirected to the router datapath and resubmitted into table=8 again to run through the router pipeline; it's then blocked there.

6) One would expect that once the router copy of the frame is handled, OVN would return to the original action list in table=37 and continue delivering to the rest of the ports, but it doesn't; we can see the counters in table=39 updating by 1, not by the number of ports in the table=37 action list.

7) We had a hypothesis that something in the router pipeline breaks in such a way that the whole pipeline is short-circuited.

8) One change in 21.09 that is related to the router pipeline and that we were aware of was the PMTU discovery enforcement, enabled by the gateway_mtu option set on the router port.

9) When we manually unset the option for the router port, the pipeline reverted to fanning out broadcast traffic to all switch ports.

We suspect that there is a bug in the PMTU enforcement mechanism that circumvents execution of the whole action list from table=37, and that the bug may be visible when the router port is not processed last in the action list. Numan suggests the culprit is the check_pkt_larger implementation.

That said, we were not able to reproduce the bug on a new network that seems identical to the one affected, and attaching a router port to the network didn't help in reproducing the issue either (even though the router port was inserted first in the list of fanout actions in the table=37 flow). So we are actually not sure about the exact mechanism by which PMTU affects the environment; we just know that unsetting gateway_mtu helps with processing the complete action list.

Also note that another PMTU-related bug was reported for the same environment, affecting the N-S direction (not broadcast): https://bugzilla.redhat.com/show_bug.cgi?id=2018179

AFAIU check_pkt_larger is part of OVS proper, not OVN, and we will need to clone the bug to get a fix there.
I am not sure if anything can be optimized / fixed on the OVN side though. Numan also suggested there may be some inefficiencies in how and when the OpenStack OVN driver sets gateway_mtu (setting it when it's not required, exacerbating problems with the OVS packet length check). If that's the case, we may also want to clone the bug to openstack-neutron.

Side note: this is the 4th bug I am aware of that was revealed (but not necessarily triggered) by disabling the ARP responder for VLAN transparent networks. It looks like people don't really use broadcast for anything but ARP resolution, and that is generally handled locally in OVN and not offloaded to port owners...

--- Additional comment from Numan Siddique on 2021-10-28 22:39:44 UTC ---

IMO, to unblock this we can handle it in Neutron too. The Neutron ML2 OVN driver should set the gateway_mtu option only if the Neutron Geneve network's MTU is greater (or perhaps even lesser) than the provider network MTU. I think this would also help performance-wise, since there would be no need to check the packet length, and the check_pkt_len datapath action can't be offloaded.

--- Additional comment from Ihar Hrachyshka on 2021-10-28 22:58:51 UTC ---

"nova" network is the affected one

--- Additional comment from Ihar Hrachyshka on 2021-10-28 23:00:39 UTC ---

Attached OVN dbs, since Numan confirmed they can be used to reproduce the issue (note the "nova" switch). Here is the output that shows what happens in the router port pipeline:

[root@ovn-chassis-1 data]# ovs-dpctl dump-flows | grep "in_port(9"   [18:50:13]
recirc_id(0),in_port(9),ct_state(-new-est-rel-rpl-inv-trk),ct_label(0/0x1),eth(src=fa:16:3e:ae:a5:59,dst=ff:ff:ff:ff:ff:ff),eth_type(0x0806),arp(sip=10.218.0.227,tip=10.218.0.188,op=1/0xff,sha=fa:16:3e:ae:a5:59), packets:93, bytes:3906, used:0.997s, actions:check_pkt_len(size=1518,gt(drop),le(drop))

--- Additional comment from Ihar Hrachyshka on 2021-10-29 00:39:10 UTC ---

Since this affects gateway_mtu handling, I think a Neutron-side workaround could be setting ovn_emit_need_to_frag to False (achieved by removing OVNEmitNeedToFrag: true from the job definition). That said, this should really be fixed on the OVS side. Numan suggested this: https://paste.centos.org/view/7db11803
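A sketch of the manual workaround described in the comments above, i.e. unsetting gateway_mtu on the affected router port and re-checking the flows; the router port name is a placeholder and br-int is the OVN integration bridge:

# Placeholder router port name; use the affected router port in your deployment.
ovn-nbctl remove logical_router_port <router-port> options gateway_mtu
# Re-check the table=37 fanout flow on the affected chassis.
ovs-ofctl -O OpenFlow13 dump-flows br-int table=37 | grep reg15=0x8000
# The broadcast ARP datapath flow should no longer carry the check_pkt_len(...)
# action that ended in gt(drop),le(drop).
ovs-dpctl dump-flows | grep check_pkt_len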
The patch is up for review - https://patchwork.ozlabs.org/project/openvswitch/patch/20211029172648.3859172-1-numans@ovn.org/
Reproduced the issue on openvswitch2.16-2.16.0-31.el8fdp.x86_64 / ovn-2021-21.09.1-23.el8fdp.x86_64 and verified the fix on openvswitch2.16-2.16.0-32.el8fdp.x86_64 / ovn-2021-21.09.1-23.el8fdp.x86_64, using the reproducer below.

The IPs are:
netns ls1p1  192.168.1.1
netns ls1p2  192.168.1.2
netns lp     192.168.1.11

systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:20.0.183.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=20.0.183.25
systemctl restart ovn-controller
ovs-vsctl add-br br-ext
ovs-vsctl set Open_vSwitch . external-ids:ovn-bridge-mappings=provider:br-ext

ovn-nbctl ls-add ls1
ovn-nbctl lsp-add ls1 ls1p1
ovn-nbctl lsp-set-addresses ls1p1 "00:00:00:01:01:01 192.168.1.1 2001::1"
ovn-nbctl lsp-add ls1 ls1p2
ovn-nbctl lsp-set-addresses ls1p2 "00:00:00:01:01:02 192.168.1.2 2001::2"
ovn-nbctl lsp-add ls1 lp
ovn-nbctl lsp-set-type lp localport
ovn-nbctl lsp-set-addresses lp "00:00:00:01:01:11 192.168.1.11 2001::11"
ovn-nbctl lsp-add ls1 public
ovn-nbctl lsp-set-type public localnet
ovn-nbctl lsp-set-addresses public unknown
ovn-nbctl lsp-set-options public network_name=provider

ovn-nbctl lr-add lr1
ovn-nbctl lrp-add lr1 lr1-ls1 00:00:00:00:00:01 192.168.1.254/24 2001::a/64
ovn-nbctl lsp-add ls1 ls1-lr1
ovn-nbctl lsp-set-addresses ls1-lr1 router
ovn-nbctl lsp-set-type ls1-lr1 router
ovn-nbctl lsp-set-options ls1-lr1 router-port=lr1-ls1

ovn-nbctl set logical_switch ls1 other_config:vlan-passthru=true
ovn-nbctl set logical_router_port lr1-ls1 options:gateway_mtu=1300

ovs-vsctl add-port br-int ls1p1 -- set interface ls1p1 type=internal external_ids:iface-id=ls1p1
ip netns add ls1p1
ip link set ls1p1 netns ls1p1
ip netns exec ls1p1 ip link set ls1p1 address 00:00:00:01:01:01
ip netns exec ls1p1 ip link set ls1p1 up
ip netns exec ls1p1 ip addr add 192.168.1.1/24 dev ls1p1
ip netns exec ls1p1 ip addr add 2001::1/64 dev ls1p1

ovs-vsctl add-port br-int ls1p2 -- set interface ls1p2 type=internal external_ids:iface-id=ls1p2
ip netns add ls1p2
ip link set ls1p2 netns ls1p2
ip netns exec ls1p2 ip link set ls1p2 address 00:00:00:01:01:02
ip netns exec ls1p2 ip link set ls1p2 up
ip netns exec ls1p2 ip addr add 192.168.1.2/24 dev ls1p2
ip netns exec ls1p2 ip addr add 2001::2/64 dev ls1p2

ovs-vsctl add-port br-int lp -- set interface lp type=internal external_ids:iface-id=lp
ip netns add lp
ip link set lp netns lp
ip netns exec lp ip link set lp address 00:00:00:01:01:11
ip netns exec lp ip link set lp up
ip netns exec lp ip addr add 192.168.1.11/24 dev lp
ip netns exec lp ip addr add 2001::11/64 dev lp

ip netns exec ls1p1 ping 192.168.1.2 -c 1
ip netns exec ls1p1 ping 192.168.1.11 -c 1
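As an additional sanity check (not part of the original reproducer), the installed datapath flows can be dumped right after the pings; with the fixed openvswitch2.16 packages, the broadcast flows that contain check_pkt_len() should also list the subsequent output actions instead of ending there:

# Run on the same chassis immediately after the pings, before the flows expire.
ovs-dpctl dump-flows | grep check_pkt_len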
Based on the upstream test "AT_SETUP([ofproto-dpif - check_pkt_larger action])", I verified the fix for OVS 2.15 without OVN. The issue was reproduced on openvswitch2.15-2.15.0-51.el8fdp and the fix was verified on openvswitch2.15-2.15.0-55.el8fdp.

The steps are:

systemctl start openvswitch
ovs-vsctl add-br br0
ovs-vsctl add-port br0 p1 -- set Interface p1 type=internal ofport_request=1
ovs-vsctl add-port br0 p2 -- set Interface p2 type=internal ofport_request=2
ovs-vsctl add-port br0 p3 -- set Interface p3 type=internal ofport_request=3
ovs-vsctl add-port br0 p4 -- set Interface p4 type=internal ofport_request=4

ip netns add p1
ip link set p1 netns p1
ip netns exec p1 ip link set p1 address 50:54:00:00:00:09
ip netns exec p1 ip link set p1 up
ip netns exec p1 ip address add 10.10.10.1/24 dev p1

ip netns add p2
ip link set p2 netns p2
ip netns exec p2 ip link set p2 address 50:54:00:00:00:a1
ip netns exec p2 ip link set p2 up
ip netns exec p2 ip address add 10.10.10.2/24 dev p2

ip netns add p3
ip link set p3 netns p3
ip netns exec p3 ip link set p3 address 50:54:00:00:00:0a
ip netns exec p3 ip link set p3 up
ip netns exec p3 ip address add 10.10.10.3/24 dev p3

ip netns add p4
ip link set p4 netns p4
ip netns exec p4 ip link set p4 address 50:54:00:00:00:a2
ip netns exec p4 ip link set p4 up
ip netns exec p4 ip address add 10.10.10.4/24 dev p4

#ovs-vsctl -- --columns=name,ofport list Interface
ovs-ofctl dump-flows br0
ovs-ofctl del-flows br0
ovs-ofctl dump-flows br0
ovs-ofctl --protocols=OpenFlow10 add-flows br0 flows.txt
ovs-ofctl dump-flows br0
ovs-appctl ofproto/trace br0 in_port=p1,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,dl_type=0x0800,nw_src=10.10.10.2,nw_dst=10.10.10.1,nw_proto=1,nw_tos=1,icmp_type=8,icmp_code=0

# Through a 2nd ssh session, monitor:
ip netns exec p2 tcpdump -i p2 -n
# Through a 3rd ssh session, monitor:
ip netns exec p3 tcpdump -i p3 -n
# Through a 4th ssh session, monitor:
ip netns exec p4 tcpdump -i p4 -n

# Send ping packets from p1:
ip netns exec p1 ping 10.10.10.2
ip netns exec p1 ping 10.10.10.3
ip netns exec p1 ping 10.10.10.4

# cat flows.txt
table=0,in_port=1 actions=load:0x1->NXM_NX_REG1[],resubmit(,1),load:0x2->NXM_NX_REG1[],resubmit(,1),load:0x3->NXM_NX_REG1[],resubmit(,1)
table=1,in_port=1,reg1=0x1 actions=check_pkt_larger(200)->NXM_NX_REG0[0],resubmit(,4)
table=1,in_port=1,reg1=0x2 actions=output:2
table=1,in_port=1,reg1=0x3 actions=output:4
table=4,in_port=1 actions=output:3

~~~~~~
On openvswitch2.15-2.15.0-51.el8fdp.x86_64, the output of ovs-appctl ofproto/trace is below:

[root@netqe6 ~]# ovs-appctl ofproto/trace br0 in_port=p1,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,dl_type=0x0800,nw_src=10.10.10.2,nw_dst=10.10.10.1,nw_proto=1,nw_tos=1,icmp_type=8,icmp_code=0
Flow: icmp,in_port=1,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.10.10.2,nw_dst=10.10.10.1,nw_tos=0,nw_ecn=0,nw_ttl=0,icmp_type=8,icmp_code=0

bridge("br0")
-------------
 0. in_port=1, priority 32768
    load:0x1->NXM_NX_REG1[]
    resubmit(,1)
 1. reg1=0x1,in_port=1, priority 32768
    check_pkt_larger(200)->NXM_NX_REG0[0]
    resubmit(,4)
 4. in_port=1, priority 32768
    output:3
    resubmit(,4)
 4. in_port=1, priority 32768
    output:3

Final flow: icmp,reg1=0x1,in_port=1,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.10.10.2,nw_dst=10.10.10.1,nw_tos=0,nw_ecn=0,nw_ttl=0,icmp_type=8,icmp_code=0
Megaflow: recirc_id=0,eth,ip,in_port=1,nw_frag=no
Datapath actions: check_pkt_len(size=200,gt(4),le(4))    <<------

Ping test result (from namespace p1): only namespace p3 (on the 3rd OVS port) receives the ICMP packets; namespaces p2 and p4 did not receive any packets.

~~~~~~~
On openvswitch2.15-2.15.0-55.el8fdp.x86_64, the output is below:

[root@netqe6 ~]# ovs-appctl ofproto/trace br0 in_port=p1,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,dl_type=0x0800,nw_src=10.10.10.2,nw_dst=10.10.10.1,nw_proto=1,nw_tos=1,icmp_type=8,icmp_code=0
Flow: icmp,in_port=1,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.10.10.2,nw_dst=10.10.10.1,nw_tos=0,nw_ecn=0,nw_ttl=0,icmp_type=8,icmp_code=0

bridge("br0")
-------------
 0. in_port=1, priority 32768
    load:0x1->NXM_NX_REG1[]
    resubmit(,1)
 1. reg1=0x1,in_port=1, priority 32768
    check_pkt_larger(200)->NXM_NX_REG0[0]
    resubmit(,4)
 4. in_port=1, priority 32768
    output:3
    resubmit(,4)
 4. in_port=1, priority 32768
    output:3
    load:0x2->NXM_NX_REG1[]
    resubmit(,1)
 1. reg1=0x2,in_port=1, priority 32768
    output:2
    load:0x3->NXM_NX_REG1[]
    resubmit(,1)
 1. reg1=0x3,in_port=1, priority 32768
    output:4

Final flow: icmp,reg1=0x3,in_port=1,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.10.10.2,nw_dst=10.10.10.1,nw_tos=0,nw_ecn=0,nw_ttl=0,icmp_type=8,icmp_code=0
Megaflow: recirc_id=0,eth,ip,in_port=1,nw_frag=no
Datapath actions: check_pkt_len(size=200,gt(4),le(4)),3,5    <<--------------- other two ports have been added

Ping test result (from namespace p1): all three namespaces p2, p3 and p4 receive the ICMP packets.

On p2:
23:38:51.006804 IP 10.10.10.1 > 10.10.10.2: ICMP echo request, id 25888, seq 1, length 64
23:39:08.228559 IP 10.10.10.1 > 10.10.10.3: ICMP echo request, id 25895, seq 1, length 64
23:39:28.152465 IP 10.10.10.1 > 10.10.10.4: ICMP echo request, id 25904, seq 1, length 64

On p3:
23:38:51.006803 IP 10.10.10.1 > 10.10.10.2: ICMP echo request, id 25888, seq 1, length 64
23:39:08.228558 IP 10.10.10.1 > 10.10.10.3: ICMP echo request, id 25895, seq 1, length 64
23:39:28.152465 IP 10.10.10.1 > 10.10.10.4: ICMP echo request, id 25904, seq 1, length 64

On p4:
23:38:51.006804 IP 10.10.10.1 > 10.10.10.2: ICMP echo request, id 25888, seq 1, length 64
23:39:08.228559 IP 10.10.10.1 > 10.10.10.3: ICMP echo request, id 25895, seq 1, length 64
23:39:28.152466 IP 10.10.10.1 > 10.10.10.4: ICMP echo request, id 25904, seq 1, length 64
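A quick pass/fail check derived from the traces above (not part of the original verification steps): on the fixed build, the "Datapath actions:" line lists output ports after check_pkt_len(), while on the broken build it ends at check_pkt_len():

ovs-appctl ofproto/trace br0 in_port=p1,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,dl_type=0x0800,nw_src=10.10.10.2,nw_dst=10.10.10.1,nw_proto=1,nw_tos=1,icmp_type=8,icmp_code=0 | grep 'Datapath actions:'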
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (openvswitch2.15 update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0052