Bug 2123837
Summary: | [OVN][DVR][VLAN] some FIP's are unreachable | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Fiorella Yanac <fyanac> | |
Component: | ovn22.03 | Assignee: | Ales Musil <amusil> | |
Status: | CLOSED ERRATA | QA Contact: | Jianlin Shi <jishi> | |
Severity: | high | Docs Contact: | ||
Priority: | urgent | |||
Version: | FDP 22.B | CC: | amusil, chrisw, ctrautma, jiji, jishi, jlibosva, mmichels, scohen, spower | |
Target Milestone: | --- | Keywords: | AutomationBlocker, Regression | |
Target Release: | --- | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | ovn22.03-22.03.0-104.el8fdp | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 2125061 (view as bug list) | Environment: | ||
Last Closed: | 2022-11-03 00:30:13 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 2125061 |
Description
Fiorella Yanac
2022-09-02 18:26:56 UTC
Also:
- VM that fails ping-icmp from the undercloud: tempest-maclearning-server-1449286456 (10.100.0.11) cannot ping its gw router (10.100.0.1) -> ping NOK
- VM that passes ping-icmp from the undercloud: tempest-maclearning-server-524944209 (10.100.0.8) can ping its gw router (10.100.0.1) -> ping OK

I looked at this problem, and the reason why one VM works and the other does not is the vlan network. Given that the ports have port_security disabled, OVN doesn't have their IPs and MACs. Therefore the router interfaces need to learn the VMs' MACs and store them in the mac_binding table in OVN. For some reason, the ARP requests generated by ovn-controller arrive at one VM tagged with the network vlan tag 1092. See the difference between the broadcast ARP requests from the router port to 10.100.0.43 (non-working) on the two VM interfaces:

[root@compute-1 ~]# tcpdump -ennvvi tap33e0c10b-b2 -c2
dropped privs to tcpdump
tcpdump: listening on tap33e0c10b-b2, link-type EN10MB (Ethernet), snapshot length 262144 bytes
17:58:17.060090 fa:16:3e:95:5c:f3 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1092, p 0, ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4), Request who-has 10.100.0.43 tell 10.100.0.1, length 28
17:58:18.084180 fa:16:3e:95:5c:f3 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1092, p 0, ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4), Request who-has 10.100.0.43 tell 10.100.0.1, length 28
2 packets captured
2 packets received by filter
0 packets dropped by kernel

[root@compute-1 ~]# tcpdump -ennvvi tap09ce0d1b-54 -c2
dropped privs to tcpdump
tcpdump: listening on tap09ce0d1b-54, link-type EN10MB (Ethernet), snapshot length 262144 bytes
17:58:25.252080 fa:16:3e:46:bc:58 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 10.100.0.43 tell 10.100.0.1, length 28
17:58:26.276405 fa:16:3e:46:bc:58 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 10.100.0.43 tell 10.100.0.1, length 28
2 packets captured
2 packets received by filter
0 packets dropped by kernel

In the first case the packets arrive tagged, which is wrong because the vlan tag should be stripped. I'm still investigating why one VM gets the tag while the other does not. I also tried creating the mac binding entry manually and the ping started to work, so this is related only to the L2 broadcast traffic.

After I sent my comment, I noticed that the source MAC differs too. In the first, non-working case the MAC comes from the ovn-chassis-mac-mappings:

[root@compute-1 ~]# ovs-vsctl get open . external_ids:ovn-chassis-mac-mappings
"datacentre:fa:16:3e:31:f9:d8,tenant:fa:16:3e:95:5c:f3"

while the working case correctly uses the router port MAC:

[root@controller-0 /]# ovn-nbctl --no-leader get logical_router_port lrp-3101dcb6-9ed4-4200-a162-ddc12790126c mac
"fa:16:3e:46:bc:58"

I'm switching this BZ to the core OVN component as the problem seems to be with the flows. The OVN version in use is ovn22.03-22.03.0-69. I'm also going to attach the DBs.
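The two symptoms described above (an unexpected 802.1Q tag and a source MAC taken from ovn-chassis-mac-mappings) can be checked mechanically on captured `tcpdump -e` lines. The following is a minimal sketch, not part of the original report; the helper names `parse_chassis_macs`, `parse_frame` and `diagnose` are hypothetical, and it assumes the line layout shown in the captures above.

```python
import re
from typing import Dict, List, NamedTuple, Optional

MAC = r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}"
SRC_RE = re.compile(rf"({MAC}) > {MAC}")   # "src > dst" as printed by tcpdump -e
VLAN_RE = re.compile(r"\bvlan (\d+)\b")    # "vlan 1092" inside an 802.1Q frame


class Frame(NamedTuple):
    src_mac: str
    vlan: Optional[int]


def parse_chassis_macs(value: str) -> Dict[str, str]:
    """Turn 'net1:aa:..,net2:bb:..' (optionally quoted) into {network: mac}."""
    mappings = {}
    for entry in value.strip().strip('"').split(","):
        network, _, mac = entry.strip().partition(":")
        mappings[network] = mac.lower()
    return mappings


def parse_frame(line: str) -> Frame:
    """Extract the source MAC and the VLAN id (if any) from one tcpdump -e line."""
    src = SRC_RE.search(line.lower())
    if src is None:
        raise ValueError("no 'src > dst' MAC pair found in line")
    vlan = VLAN_RE.search(line)
    return Frame(src.group(1), int(vlan.group(1)) if vlan else None)


def diagnose(line: str, chassis_macs: Dict[str, str]) -> List[str]:
    """Return the list of symptoms from this bug that the frame exhibits."""
    frame = parse_frame(line)
    problems = []
    if frame.vlan is not None:
        problems.append(f"tagged with vlan {frame.vlan}")
    for network, mac in chassis_macs.items():
        if frame.src_mac == mac:
            problems.append(f"source is chassis MAC for '{network}'")
    return problems
```

Feeding it the first capture line from tap33e0c10b-b2 together with the mapping string from compute-1 reports both symptoms; the line from tap09ce0d1b-54 reports none.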
ovn-controller generated ARP requests:

working:
2022-09-07T19:31:47.651Z|00258|vconn(ovn_pinctrl0)|DBG|unix:/var/run/openvswitch/br-int.mgmt: sent (Success): OFPT_PACKET_OUT (OF1.5) (xid=0xd8e3): in_port=CONTROLLER actions=set_field:0xa640008->reg0,set_field:0xa640001->reg1,set_field:0x4->reg9,set_field:0x1->reg10,set_field:0x1->reg11,set_field:0x3->reg12,set_field:0x1->reg14,set_field:0x3->reg15,set_field:0x6->metadata,set_field:ff:ff:ff:ff:ff:ff->eth_dst,move:NXM_NX_XXREG0[64..95]->NXM_OF_ARP_SPA[],move:NXM_NX_XXREG0[96..127]->NXM_OF_ARP_TPA[],set_field:1->arp_op,resubmit(,37) data_len=42 arp,vlan_tci=0x0000,dl_src=fa:16:3e:46:bc:58,dl_dst=00:00:00:00:00:00,arp_spa=10.0.0.63,arp_tpa=10.100.0.8,arp_op=1,arp_sha=fa:16:3e:46:bc:58,arp_tha=00:00:00:00:00:00

not working:
2022-09-07T19:31:31.513Z|00249|vconn(ovn_pinctrl0)|DBG|unix:/var/run/openvswitch/br-int.mgmt: sent (Success): OFPT_PACKET_OUT (OF1.5) (xid=0xd8dd): in_port=CONTROLLER actions=set_field:0xa64002b->reg0,set_field:0xa640001->reg1,set_field:0x4->reg9,set_field:0x1->reg10,set_field:0x1->reg11,set_field:0x3->reg12,set_field:0x1->reg14,set_field:0x3->reg15,set_field:0x6->metadata,set_field:ff:ff:ff:ff:ff:ff->eth_dst,move:NXM_NX_XXREG0[64..95]->NXM_OF_ARP_SPA[],move:NXM_NX_XXREG0[96..127]->NXM_OF_ARP_TPA[],set_field:1->arp_op,resubmit(,37) data_len=42 arp,vlan_tci=0x0000,dl_src=fa:16:3e:46:bc:58,dl_dst=00:00:00:00:00:00,arp_spa=10.0.0.63,arp_tpa=10.100.0.43,arp_op=1,arp_sha=fa:16:3e:46:bc:58,arp_tha=00:00:00:00:00:00

Hi,

this one looks the same to me as https://bugzilla.redhat.com/show_bug.cgi?id=2119194 WDYT?

(In reply to Ales Musil from comment #7)
> Hi,
>
> this one looks the same to me as
> https://bugzilla.redhat.com/show_bug.cgi?id=2119194 WDYT?

It sounds to me like a different problem. Bug 2119194 is about ARP replies not being forwarded to the localnet port, while this BZ (2123837) is about broadcast traffic in the vlan-backed network still being delivered tagged to the ports.
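The two OFPT_PACKET_OUT entries above differ only in the value loaded into reg0. A small decoder makes that visible. This is a sketch under assumptions: the function name `decode_ip_regs` is hypothetical, and reading reg0 as the ARP target and reg1 as the ARP sender relies on my understanding that XXREG0[96..127] corresponds to reg0 and XXREG0[64..95] to reg1, which matches the `move` actions in the log but is not stated in the report.

```python
import ipaddress
import re
from typing import Dict, Iterable

# Matches e.g. "set_field:0xa640008->reg0"; MAC-valued set_field entries do not match.
SET_FIELD_RE = re.compile(r"set_field:(0x[0-9a-fA-F]+)->(reg\d+)")


def decode_ip_regs(actions: str, regs: Iterable[str] = ("reg0", "reg1")) -> Dict[str, str]:
    """Render the selected 32-bit registers of an action list as dotted-quad IPv4."""
    wanted = set(regs)
    decoded = {}
    for value, reg in SET_FIELD_RE.findall(actions):
        if reg in wanted:
            decoded[reg] = str(ipaddress.IPv4Address(int(value, 16)))
    return decoded
```

Applied to the action strings above, the working entry decodes to reg0 = 10.100.0.8 and the non-working one to reg0 = 10.100.0.43, with reg1 = 10.100.0.1 in both. That fits the observation that the packet-out itself looks the same in both cases apart from the target address.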
Correct me if I am wrong, but the ping works if the VMs are on the same compute node, right?

For the DVR, the traffic leaving through the localnet port will add the vlan tag of that port + the chassis mac. So that would explain why the tag is present there only if the traffic is going between two compute nodes.

There should be a flow on the other side which takes care of stripping the tag and changing the MAC address back to the router MAC. The flow should be present in table 0, e.g.:

cookie=0xd73de030, duration=17.164s, table=0, n_packets=0, n_bytes=0, idle_age=17, priority=180,conj_id=100,in_port=9,dl_vlan=1092 actions=strip_vlan,load:0xa->NXM_NX_REG11[],load:0x3->NXM_NX_REG12[],load:0x5->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],mod_dl_src:fa:16:3e:46:bc:58,resubmit(,8)

(In reply to Ales Musil from comment #10)
> Correct me if I am wrong, but the ping works if the VMs are on the same
> compute node right?
>

My understanding is that it doesn't matter because VM to VM works as long as it's the same network. To the best of my knowledge what doesn't work is an ARP response generated by ovn-controller from LRP to LSP on the same LS because that comes tagged.

> For the DVR the traffic leaving through localnet port will add vlan tag of
> that port + the chassis mac.
> So that would explain why the tag is present there only if the traffic is
> going between two compute nodes.
>
> There should be flow on the other side which takes care of stripping the tag
> and changing the MAC address back to the
> router MAC.
> The flow should be present in table 0 e.g.:
> cookie=0xd73de030, duration=17.164s, table=0, n_packets=0, n_bytes=0,
> idle_age=17, priority=180,conj_id=100,in_port=9,dl_vlan=1092
> actions=strip_vlan,load:0xa->NXM_NX_REG11[],load:0x3->NXM_NX_REG12[],load:
> 0x5->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],mod_dl_src:fa:16:3e:46:bc:58,
> resubmit(,8)

(In reply to Jakub Libosvar from comment #11)
> (In reply to Ales Musil from comment #10)
> > Correct me if I am wrong, but the ping works if the VMs are on the same
> > compute node right?
> >
>
> My understanding is that it doesn't matter because VM to VM works as long as
> it's the same network. To the best of my knowledge what doesn't work is an
> ARP response generated by ovn-controller from LRP to LSP on the same LS
> because that comes tagged.

It might matter because the ARP request is generated on the node that is doing the initial routing. In that case we would need to find out why the flow that strips the tag and changes the MAC address back is not applied to the ARP request.

Is this the first test of this scenario with ovn22.03-22.03.0-69?
Also does "ovn-appctl inc-engine/recompute" help with that issue?

Thanks,
Ales

>
> > For the DVR the traffic leaving through localnet port will add vlan tag of
> > that port + the chassis mac.
> > So that would explain why the tag is present there only if the traffic is
> > going between two compute nodes.
> >
> > There should be flow on the other side which takes care of stripping the tag
> > and changing the MAC address back to the
> > router MAC.
> > The flow should be present in table 0 e.g.:
> > cookie=0xd73de030, duration=17.164s, table=0, n_packets=0, n_bytes=0,
> > idle_age=17, priority=180,conj_id=100,in_port=9,dl_vlan=1092
> > actions=strip_vlan,load:0xa->NXM_NX_REG11[],load:0x3->NXM_NX_REG12[],load:
> > 0x5->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],mod_dl_src:fa:16:3e:46:bc:58,
> > resubmit(,8)

Patch posted: https://patchwork.ozlabs.org/project/ovn/patch/20220916092353.889390-1-amusil@redhat.com/

Accidentally set to MODIFIED, it should be POST.

(In reply to Ales Musil from comment #12)
> (In reply to Jakub Libosvar from comment #11)
> > (In reply to Ales Musil from comment #10)
> > > Correct me if I am wrong, but the ping works if the VMs are on the same
> > > compute node right?
> > >
> >
> > My understanding is that it doesn't matter because VM to VM works as long as
> > it's the same network. To the best of my knowledge what doesn't work is an
> > ARP response generated by ovn-controller from LRP to LSP on the same LS
> > because that comes tagged.
>
> It might matter because the ARP request is generated on the node that is
> doing the
> initial routing. In that case we would need to find out why the flow that
> strips the tag
> and changes the MAC address back is not applied to the ARP request.
>
> Is this the first test of this scenario with ovn22.03-22.03.0-69?
> Also does "ovn-appctl inc-engine/recompute" help with that issue?

I *think* we tried that and it didn't help. But I don't have the environment.

>
> Thanks,
> Ales
> >
> > > For the DVR the traffic leaving through localnet port will add vlan tag of
> > > that port + the chassis mac.
> > > So that would explain why the tag is present there only if the traffic is
> > > going between two compute nodes.
> > >
> > > There should be flow on the other side which takes care of stripping the tag
> > > and changing the MAC address back to the
> > > router MAC.
> > > The flow should be present in table 0 e.g.:
> > > cookie=0xd73de030, duration=17.164s, table=0, n_packets=0, n_bytes=0,
> > > idle_age=17, priority=180,conj_id=100,in_port=9,dl_vlan=1092
> > > actions=strip_vlan,load:0xa->NXM_NX_REG11[],load:0x3->NXM_NX_REG12[],load:
> > > 0x5->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],mod_dl_src:fa:16:3e:46:bc:58,
> > > resubmit(,8)

ovn22.03 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2131896
ovn22.06 fast-datapath-rhel-8 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2131897
ovn22.06 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2131898
ovn22.09 fast-datapath-rhel-8 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2131901
ovn22.09 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2131902

Hi Fiorella,

I'm not sure if the fix can really solve the issue on your side. Could you please help to test with the fixed ovn: http://download-node-02.eng.bos.redhat.com/brewroot/packages/ovn22.03/22.03.0/106.el8fdp/?

Thanks & Best Regards,
Jianlin Shi

Reproducer created based on https://patchwork.ozlabs.org/project/ovn/patch/20220916092353.889390-1-amusil@redhat.com/:

systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:20.0.36.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=20.0.36.25
systemctl restart ovn-controller
ovs-vsctl add-br br-ext
ovs-vsctl set open . external_ids:ovn-bridge-mappings=phys:br-ext
ovs-vsctl set open . external-ids:ovn-chassis-mac-mappings="phys:ee:00:00:00:00:10"
ovn-nbctl ls-add internal
ovn-nbctl lsp-add internal ln_internal "" 100
ovn-nbctl lsp-set-addresses ln_internal unknown
ovn-nbctl lsp-set-type ln_internal localnet
ovn-nbctl lsp-set-options ln_internal network_name=phys
ovn-nbctl lsp-add internal internal-gw
ovn-nbctl lsp-set-type internal-gw router
ovn-nbctl lsp-set-addresses internal-gw router
ovn-nbctl lsp-set-options internal-gw router-port=gw-internal
ovn-nbctl lsp-add internal vif0
ovn-nbctl lsp-set-addresses vif0 unknown
ovn-nbctl lr-add gw
ovn-nbctl lrp-add gw gw-internal 00:00:00:00:20:00 192.168.20.1/24
ip netns add vif0
ovs-vsctl add-port br-int vif0 -- set interface vif0 type=internal external_ids:iface-id=vif0
ip link set vif0 netns vif0
ip netns exec vif0 ip link set vif0 address 00:00:00:00:20:10
ip netns exec vif0 ip link set vif0 up
ip netns exec vif0 ip addr add 192.168.20.10/24 dev vif0
ip netns exec vif0 ping 192.168.20.1 -c 3

result on ovn22.03-22.03.0-101.el8:

+ ip netns exec vif0 ip addr add 192.168.20.10/24 dev vif0
+ ip netns exec vif0 ping 192.168.20.1 -c 3
PING 192.168.20.1 (192.168.20.1) 56(84) bytes of data.

--- 192.168.20.1 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2085ms

[root@dell-per740-12 bz2123837]# rpm -qa | grep -E "openvswitch2.17|ovn22.03"
openvswitch2.17-2.17.0-58.el8fdp.x86_64
ovn22.03-22.03.0-101.el8fdp.x86_64
ovn22.03-host-22.03.0-101.el8fdp.x86_64
ovn22.03-central-22.03.0-101.el8fdp.x86_64

result on ovn22.03-22.03.0-106.el8:

+ ip netns exec vif0 ip addr add 192.168.20.10/24 dev vif0
+ ip netns exec vif0 ping 192.168.20.1 -c 3
PING 192.168.20.1 (192.168.20.1) 56(84) bytes of data.
64 bytes from 192.168.20.1: icmp_seq=1 ttl=254 time=1008 ms
64 bytes from 192.168.20.1: icmp_seq=2 ttl=254 time=4.100 ms
64 bytes from 192.168.20.1: icmp_seq=3 ttl=254 time=0.414 ms

--- 192.168.20.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2005ms
rtt min/avg/max/mdev = 0.414/337.841/1008.112/473.956 ms, pipe 2

[root@dell-per740-12 bz2123837]# rpm -qa | grep -E "openvswitch2.17|ovn22.03"
ovn22.03-22.03.0-106.el8fdp.x86_64
openvswitch2.17-2.17.0-58.el8fdp.x86_64
ovn22.03-central-22.03.0-106.el8fdp.x86_64
ovn22.03-host-22.03.0-106.el8fdp.x86_64

Hi Ales,

please help to check comment 25, thanks

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (ovn22.03), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7393
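When re-testing on other builds, it can help to check whether the table 0 flow that comment 10 expects (match on the network vlan, strip the tag, restore the router MAC) is present in a flow dump. The following is a minimal sketch that inspects a single line in the format quoted in comment 10. The function name `flow_restores_router_mac` is hypothetical, and the check looks only at the three properties named in that comment, not at the full set of flows the fix may install.

```python
import re


def flow_restores_router_mac(flow: str, vlan: int, router_mac: str) -> bool:
    """True if the flow matches dl_vlan=<vlan>, strips the tag and rewrites dl_src."""
    match_part, sep, actions = flow.partition(" actions=")
    if not sep:
        return False  # not a flow line in the expected "match actions=..." layout
    matches_vlan = re.search(rf"\bdl_vlan={vlan}\b", match_part) is not None
    strips_tag = "strip_vlan" in actions.split(",")
    restores_mac = f"mod_dl_src:{router_mac.lower()}" in actions.lower()
    return matches_vlan and strips_tag and restores_mac
```

For the flow quoted in comment 10, calling it with vlan 1092 and router MAC fa:16:3e:46:bc:58 returns True; a different vlan or MAC returns False.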