Bug 2172048
| Summary: | lsp failed to ping its own floating ip with special configuration | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Jianlin Shi <jishi> |
| Component: | ovn23.03 | Assignee: | Ales Musil <amusil> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Jianlin Shi <jishi> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | FDP 23.B | CC: | amusil, ctrautma, jiji, mmichels |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | ovn23.03-23.03.0-11.el8fdp | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-05-24 07:02:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
List of commits between ovn22.09-22.09.0-47 and ovn22.09-22.09.0-59
6b7d2d835 actions: Add new action called ct_commit_nat
ec4474b0f northd: Allow related traffic through LB
17134a2f5 northd: Add logical flow to defrag ICMP traffic
33020b776 northd: Store skip_snat and force_snat in ct_label/mark
545989afd northd: Add flag for CT related
26bb3fa8f northd: Add logical flows to allow rpl/rel traffic in acl_after_lb stage.
3a9aabb09 ovn-trace: Use the original ovnact for execute_load
3d9a8ca40 Add the metalLB install flag for CI actions
d0c83ca47 ovn-macros: support ipv6 in ovn_attach
f90f3961b pinctrl: Send RARPs for external ipv6 interfaces
c538e6bc3 tests: Fix flaky test "IPv6 Neighbor Solicitation for unknown MAC"
6ba977581 northd.c: Validate port type to avoid unexpected behavior.
Of those, the bottom 6 should have nothing to do with this regression. It is likely one of the top 6 that resulted in this issue. This should make it relatively easy for an OVN dev to bisect and find the offending commit.
My uneducated guess is that since removing the LB causes the problem to go away, it is likely
17134a2f5 northd: Add logical flow to defrag ICMP traffic
that is causing the problem. This commit causes ICMP traffic to go through a "ct_dnat" action when a load balancer with a port is configured, which may result in "ct.inv" for this packet.
@jishi Can you re-run the test, but add the following line after northd is started:
ovn-nbctl set NB_Global . options:use_ct_inv_match=false
Check if this causes the test to pass. If it does, then we at least know what is causing the issue to happen. It doesn't mean we can close this bug, but it at least gives us a root cause for the regression. Thanks.
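A minimal sketch of how the pass/fail check requested above could be scripted. The helper name `ping_passed` is hypothetical and not part of OVN or of this report; it only inspects the summary line that `ping` prints and assumes the usual iputils wording ("N% packet loss").

```shell
#!/bin/sh
# ping_passed (hypothetical helper): read `ping` output on stdin and
# succeed only if the summary line reports exactly 0% packet loss.
# The leading "(^|[^0-9])" keeps "100% packet loss" from matching.
ping_passed() {
    grep -Eq '(^|[^0-9])0% packet loss'
}

# Intended use on the server, with the commands from this report:
#   ovn-nbctl set NB_Global . options:use_ct_inv_match=false
#   ip netns exec ls1p1 ping 172.16.1.11 -c 1 | ping_passed \
#       && echo "ping passed" || echo "ping failed"
```

Checking the summary text rather than the exit status of the pipeline keeps the helper usable on saved logs as well as on live output.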
(In reply to Mark Michelson from comment #1)
Still failed even after adding the configuration:

[root@dell-per740-67 bz2172048]# ip netns exec ls1p1 ping 172.16.1.11 -c 1
PING 172.16.1.11 (172.16.1.11) 56(84) bytes of data.

--- 172.16.1.11 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

[root@dell-per740-67 bz2172048]# ovn-nbctl list nb_global
_uuid : 6214443d-4e5b-4b1a-8da1-e8557f59161a
connections : [4e07cafb-4b9c-4d7b-bb9d-a6d06e85f9cf]
external_ids : {}
hv_cfg : 0
hv_cfg_timestamp : 0
ipsec : false
name : ""
nb_cfg : 0
nb_cfg_timestamp : 0
options : {mac_prefix="7e:d8:05", max_tunid="16711680", northd_internal_version="22.09.2-20.25.0-69.4", svc_monitor_mac="26:90:ad:71:93:2f", use_ct_inv_match="false"}
sb_cfg : 0
sb_cfg_timestamp : 0
ssl : []

As Mark pointed out, it is because of commit 17134a2f5 ("northd: Add logical flow to defrag ICMP traffic"). We need to limit this flow to the LB VIP; I'll create a fix for that.

ovn23.03 fast-datapath-rhel-8 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2182255

ovn23.03 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2182256

It is fixed in 23.03 and later.
Description of problem:
lsp failed to ping its own floating ip under special configuration

Version-Release number of selected component (if applicable):
ovn22.09-22.09.0-59.el8

How reproducible:
Always

Steps to Reproduce:
1. setup on server
systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:20.0.203.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=20.0.203.25
systemctl restart ovn-controller
systemctl restart openvswitch
ovs-vsctl add-br br-ext
ovs-vsctl set Open_vSwitch . external-ids:ovn-bridge-mappings=phynet:br-ext
ovs-vsctl add-port br-ext ens1f1
ip link set ens1f1 up
ovn-nbctl lr-add lr1
ovn-nbctl lrp-add lr1 lr1-ls1 00:00:01:ff:02:03 192.168.1.254/24
ovn-nbctl ls-add ls1
ovn-nbctl lsp-add ls1 ls1p1
ovn-nbctl lsp-set-addresses ls1p1 "00:00:01:01:01:01 192.168.1.1"
ovn-nbctl lsp-add ls1 ls1p2
ovn-nbctl lsp-add ls1 ls1p2.11 ls1p2 11
ovn-nbctl lsp-set-addresses ls1p2.11 "00:00:01:01:01:11 192.168.1.11"
ovn-nbctl lsp-add ls1 ls1-lr1
ovn-nbctl lsp-set-type ls1-lr1 router
ovn-nbctl lsp-set-options ls1-lr1 router-port=lr1-ls1
ovn-nbctl lsp-set-addresses ls1-lr1 router
ovn-nbctl ls-add pub
ovn-nbctl lrp-add lr1 lr1-pub 00:00:01:ff:01:03 172.16.1.1/24
ovn-nbctl lsp-add pub pub-lr1
ovn-nbctl lsp-set-type pub-lr1 router
ovn-nbctl lsp-set-addresses pub-lr1 router
ovn-nbctl lsp-set-options pub-lr1 router-port=lr1-pub
ovn-nbctl lsp-add pub pub-ln
ovn-nbctl lsp-set-type pub-ln localnet
ovn-nbctl lsp-set-addresses pub-ln unknown
ovn-nbctl lsp-set-options pub-ln network_name=phynet
ovn-nbctl lb-add lb_r1_tcp 172.16.1.101:50001 192.168.1.11:50001,192.168.1.1:50001 tcp
ovn-nbctl lr-lb-add lr1 lb_r1_tcp
ovs-vsctl add-port br-int ls1p1 -- set interface ls1p1 type=internal external_ids:iface-id=ls1p1
ip netns add ls1p1
ip link set ls1p1 netns ls1p1
ip netns exec ls1p1 ip link set ls1p1 address 00:00:01:01:01:01
ip netns exec ls1p1 ip link set ls1p1 up
ip netns exec ls1p1 ip addr add 192.168.1.1/24 dev ls1p1
ip netns exec ls1p1 ip route add default via 192.168.1.254
ovs-vsctl add-port br-int ls1p2 -- set interface ls1p2 type=internal external_ids:iface-id=ls1p2
ip link add link ls1p2 name ls1p2.11 type vlan id 11
ip link set ls1p2 up
ip netns add ls1p2.11
ip link set ls1p2.11 netns ls1p2.11
ip netns exec ls1p2.11 ip link set ls1p2.11 address 00:00:01:01:01:11
ip netns exec ls1p2.11 ip link set ls1p2.11 up
ip netns exec ls1p2.11 ip addr add 192.168.1.11/24 dev ls1p2.11
ip netns exec ls1p2.11 ip route add default via 192.168.1.254

2. setup on client
systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:20.0.203.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=20.0.203.26
systemctl restart ovn-controller
systemctl restart openvswitch
ovs-vsctl add-br br-ext
ovs-vsctl set Open_vSwitch . external-ids:ovn-bridge-mappings=phynet:br-ext
ovs-vsctl add-port br-ext ens1f1
ip link set ens1f1 up

3. set gateway chassis on server
client1=wsfd-advnetlab17.anl.lab.eng.bos.redhat.com
#hv1_system_id=$(ovn-sbctl find chassis hostname=$(hostname) | awk '/^name/{print $3}' | sed 's/"//g')
hv0_system_id=$(ovn-sbctl find chassis hostname=$client1 | awk '/^name/{print $3}' | sed 's/"//g')
ovn-nbctl ha-chassis-group-add hagrp1
ovn-nbctl ha-chassis-group-add-chassis hagrp1 $hv0_system_id 100
group1_id=$(ovn-nbctl get ha_chassis_group hagrp1 _uuid)
ovn-nbctl set logical_router_port lr1-pub ha_chassis_group=$group1_id
ovn-nbctl lr-nat-add lr1 dnat_and_snat 172.16.1.11 192.168.1.1 ls1p1 00:00:00:0a:0a:11
ovn-nbctl lr-nat-add lr1 dnat_and_snat 172.16.1.21 192.168.1.11 ls1p2.11 00:00:00:0a:0a:21

4. ip netns exec ls1p1 ping 172.16.1.11 -c 1

Actual results:
failed

Expected results:
pass

Additional info:
[root@wsfd-advnetlab16 test]# rpm -qa | grep -E "openvswitch2.17|ovn22.09"
openvswitch2.17-2.17.0-74.el8fdp.x86_64
ovn22.09-host-22.09.0-59.el8fdp.x86_64
ovn22.09-22.09.0-59.el8fdp.x86_64
ovn22.09-central-22.09.0-59.el8fdp.x86_64
python3-openvswitch2.17-2.17.0-74.el8fdp.x86_64

[root@wsfd-advnetlab16 scenario]# ip netns exec ls1p1 tcpdump -i ls1p1 -nnle -v
dropped privs to tcpdump
tcpdump: listening on ls1p1, link-type EN10MB (Ethernet), capture size 262144 bytes
05:00:07.340215 00:00:01:01:01:01 > 00:00:01:ff:02:03, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 2865, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.1 > 172.16.1.11: ICMP echo request, id 32760, seq 1, length 64
05:00:07.343209 00:00:01:ff:02:03 > 00:00:01:01:01:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 62, id 2865, offset 0, flags [DF], proto ICMP (1), length 84)
    172.16.1.11 > 192.168.1.1: ICMP echo request, id 32760, seq 1, length 64
05:00:07.343272 00:00:01:01:01:01 > 00:00:01:ff:02:03, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 2866, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.1.1 > 172.16.1.11: ICMP echo reply, id 32760, seq 1, length 64

If I remove the LB, the ping passes; if hagrp1 is set to hv1, the ping also passes.
By the way, TCP traffic passes, and the issue did not exist on ovn22.09-22.09.0-47.el8.
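The chassis lookup in step 3 pipes `ovn-sbctl find chassis` through `awk` and `sed` to obtain the bare chassis name. A small sketch of that pipeline as a reusable function follows; the function name `chassis_name` is hypothetical, and it assumes the `column : value` record layout that the `ovn-nbctl list nb_global` output in this report also uses.

```shell
#!/bin/sh
# chassis_name (hypothetical helper): read `ovn-sbctl find chassis ...`
# output on stdin and print the value of the "name" column without quotes.
# /^name/ is anchored, so the "hostname" row is not picked up by mistake.
chassis_name() {
    awk '/^name/{print $3}' | sed 's/"//g'
}

# Intended use, mirroring step 3 of the reproducer:
#   hv0_system_id=$(ovn-sbctl find chassis hostname=$client1 | chassis_name)
```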