Bug 1849162
Summary: Traffic fails to unDNAT without an allow-related ACL existing on the logical switch

Product: Red Hat Enterprise Linux Fast Datapath
Component: ovn2.13
Version: RHEL 8.0
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Tim Rozet <trozet>
Assignee: Numan Siddique <nusiddiq>
QA Contact: ying xu <yinxu>
CC: ctrautma, dcbw, jishi, mark.d.gray, nusiddiq, ralongi, rkhan, vpunj
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-07-27 05:11:50 UTC
Type: Bug
Bug Blocks: 1867844 (view as bug list)
Description
Tim Rozet
2020-06-19 18:03:00 UTC
Created attachment 1698139 [details]
logs, dbs
For some reason this problem does not happen every deployment; I would say it happens around 50% of the time. I'll attach all the logs and dbs from a working setup as well so they can be compared.

Created attachment 1698172 [details]
logs and dbs for when things work
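As the summary notes, the symptom goes away when an allow-related ACL exists on the logical switch, because such an ACL forces traffic through conntrack where the unDNAT state lives. A minimal sketch of such an ACL as a workaround, assuming the ls1 switch and 10.0.0.0/24 subnet used in the reproducer later in this bug (priorities and matches are illustrative, not from this report):

```shell
# Hypothetical workaround sketch: add allow-related ACLs to ls1 so that
# traffic is committed to conntrack and replies from the backend are
# unDNATed back to the VIP.  The switch name and subnet come from the
# reproducer; the priority value (1001) is arbitrary.
ovn-nbctl acl-add ls1 from-lport 1001 "ip4.src == 10.0.0.0/24" allow-related
ovn-nbctl acl-add ls1 to-lport 1001 "ip4.dst == 10.0.0.0/24" allow-related
```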
This problem can be hard to address without using conntrack. I'm working on an approach that sends traffic to conntrack only when necessary, as opposed to sending all traffic to conntrack whenever even one ACL with the allow-related action exists. I'm still not sure whether that approach will work out, but I'm giving it a try and working on a POC. I'll keep updating the status here. There is another related BZ - https://bugzilla.redhat.com/show_bug.cgi?id=1836804. So if this approach works, then ovn-k8s can continue to use allow-related (or a new type - allow-reply) ACLs.

Thanks
Numan

Found the issue. I've submitted the patch to fix it - https://patchwork.ozlabs.org/project/openvswitch/patch/20200707131622.581859-1-numans@ovn.org/

Steps to reproduce the issue
----------------------------

# Create a logical switch with two ports and attach it to a logical router
ovn-nbctl ls-add ls1
ovn-nbctl lsp-add ls1 ls1p1 -- lsp-set-addresses ls1p1 "10:14:00:00:00:04 10.0.0.4"
ovn-nbctl lsp-add ls1 ls1p2 -- lsp-set-addresses ls1p2 "10:14:00:00:00:05 10.0.0.5"
ovn-nbctl lr-add lr1
ovn-nbctl lrp-add lr1 lr1-ls1 00:00:00:00:ff:01 10.0.0.1/24
ovn-nbctl lsp-add ls1 ls1-lr1
ovn-nbctl lsp-set-type ls1-lr1 router
ovn-nbctl lsp-set-addresses ls1-lr1 router
ovn-nbctl lsp-set-options ls1-lr1 router-port=lr1-ls1

# Create two load balancers and attach them to the switch and the router
ovn-nbctl lb-add lb1 "10.0.0.10" "10.0.0.5"
ovn-nbctl ls-lb-add ls1 lb1
ovn-nbctl lr-lb-add lr1 lb1
ovn-nbctl lb-add lb2 "10.0.0.20" "10.0.0.5"
ovn-nbctl ls-lb-add ls1 lb2
ovn-nbctl lr-lb-add lr1 lb2

# On any node where ovn-controller is running
ovs-vsctl add-port br-int ls1p1 -- set interface ls1p1 type=internal
ip netns add ls1p1
ip link set ls1p1 netns ls1p1
ip netns exec ls1p1 ip link set lo up
ip netns exec ls1p1 ip link set ls1p1 up
ip netns exec ls1p1 ip link set ls1p1 address 10:14:00:00:00:04
ip netns exec ls1p1 ip addr add 10.0.0.4/24 dev ls1p1
ip netns exec ls1p1 ip route add default via 10.0.0.1 dev ls1p1
ovs-vsctl set Interface ls1p1 external_ids:iface-id=ls1p1

ovs-vsctl add-port br-int ls1p2 -- set interface ls1p2 type=internal
ip netns add ls1p2
ip link set ls1p2 netns ls1p2
ip netns exec ls1p2 ip link set lo up
ip netns exec ls1p2 ip link set ls1p2 up
ip netns exec ls1p2 ip link set ls1p2 address 10:14:00:00:00:05
ip netns exec ls1p2 ip addr add 10.0.0.5/24 dev ls1p2
ip netns exec ls1p2 ip route add default via 10.0.0.1 dev ls1p2
ovs-vsctl set Interface ls1p2 external_ids:iface-id=ls1p2

# Ping the vips. Should work fine.
ip netns exec ls1p1 ping 10.0.0.10 -c3
ip netns exec ls1p1 ping 10.0.0.20 -c3

# Clear the vips of one of the load balancers on the switch
lb=$(ovn-nbctl --bare --columns load_balancer list logical_switch ls1 | cut -d ' ' -f2)
ovn-nbctl clear load_balancer $lb vips

# Now ping from ls1p1 to the load balancer vip which is still set
lb1=$(ovn-nbctl --bare --columns load_balancer list logical_switch ls1 | cut -d ' ' -f1)
ovn-nbctl get load_balancer $lb1 vips

If the vip set on $lb1 is 10.0.0.20, then:

Actual
[root@ovn-chassis-1 ~]# ip netns exec ls1p1 ping 10.0.0.20
PING 10.0.0.20 (10.0.0.20) 56(84) bytes of data.
64 bytes from 10.0.0.5: icmp_seq=1 ttl=64 time=1.13 ms
64 bytes from 10.0.0.5: icmp_seq=2 ttl=64 time=0.126 ms

This is wrong. The reply should be from the VIP - 10.0.0.20.

Expected
[root@ovn-chassis-1 ~]# ip netns exec ls1p1 ping 10.0.0.20
PING 10.0.0.20 (10.0.0.20) 56(84) bytes of data.
64 bytes from 10.0.0.20: icmp_seq=1 ttl=64 time=2.19 ms
64 bytes from 10.0.0.20: icmp_seq=2 ttl=64 time=1.30 ms
64 bytes from 10.0.0.20: icmp_seq=3 ttl=64 time=0.165 ms

I tagged the build into OCP 4.6 since we're still under development branch rules there.

Using the reproducer in comment 6, I can reproduce the issue on version:

# rpm -qa | grep ovn
ovn2.13-host-2.13.0-37.el8fdp.x86_64
ovn2.13-2.13.0-37.el8fdp.x86_64
ovn2.13-central-2.13.0-37.el8fdp.x86_64

About half of the time, the ping gets the wrong reply ip:

# ip netns exec ls1p1 ping 10.0.0.20
PING 10.0.0.20 (10.0.0.20) 56(84) bytes of data.
64 bytes from 10.0.0.5: icmp_seq=1 ttl=64 time=1.13 ms
64 bytes from 10.0.0.5: icmp_seq=2 ttl=64 time=0.126 ms

On the latest version:

# rpm -qa | grep ovn
ovn2.13-host-2.13.0-39.el8fdp.x86_64
ovn2.13-2.13.0-39.el8fdp.x86_64
ovn2.13-central-2.13.0-39.el8fdp.x86_64

I ran it many times, and the ping got the right reply ip every time:

# ip netns exec ls1p1 ping 10.0.0.20
PING 10.0.0.20 (10.0.0.20) 56(84) bytes of data.
64 bytes from 10.0.0.20: icmp_seq=1 ttl=64 time=2.19 ms
64 bytes from 10.0.0.20: icmp_seq=2 ttl=64 time=1.30 ms

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3150
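For readers verifying this on their own deployment, one hedged way to confirm that replies are actually being unDNATed is to watch the datapath conntrack table while pinging the VIP. This is only a sketch: the namespace and VIP (ls1p1, 10.0.0.20) come from the reproducer above, and the dump-conntrack output format varies by kernel and OVS version.

```shell
# Ping the VIP from the client namespace, then dump conntrack entries for it.
# A healthy DNATed flow shows the original tuple toward 10.0.0.20 with a
# reply tuple from the backend 10.0.0.5; if no entry appears, the traffic
# never went through conntrack and replies will come from 10.0.0.5 directly.
ip netns exec ls1p1 ping -c3 10.0.0.20
ovs-appctl dpctl/dump-conntrack | grep 10.0.0.20
```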