Description of problem:

Traffic is still snatted when
1- loadbalancer has skip_snat=true
2- logical router has lb_force_snat_ip=router_ip
3- loadbalancer has affinity_timeout

In the context of OVN-Kubernetes, traffic for services with externalTrafficPolicy: Local and sessionAffinity: ClientIP is still getting snatted.

Version-Release number of selected component (if applicable):
ovn main branch (commit 30952c248d4f804c25af9b1c9565f23c0045e915)

How reproducible:
all the time

Steps to Reproduce:
(greatly helped by reusing instructions from bz1995326)

1. in OVN sandbox:

# Create the first logical switch with one port
ovn-nbctl ls-add sw0
ovn-nbctl lsp-add sw0 sw0-port1
ovn-nbctl lsp-set-addresses sw0-port1 "50:54:00:00:00:01 192.168.0.2"
ovs-vsctl add-port br-int sw0-port1 -- set interface sw0-port1 type=internal external_ids:iface-id=sw0-port1
ip netns add sw0-port1
ip link set sw0-port1 netns sw0-port1
ip netns exec sw0-port1 ip link set sw0-port1 address 50:54:00:00:00:01
ip netns exec sw0-port1 ip link set sw0-port1 up
ip netns exec sw0-port1 ip addr add 192.168.0.2/24 dev sw0-port1
ip netns exec sw0-port1 ip route add default via 192.168.0.1

# Create the second logical switch with one port
ovn-nbctl ls-add sw1
ovn-nbctl lsp-add sw1 sw1-port1
ovn-nbctl lsp-set-addresses sw1-port1 "50:54:00:00:00:03 11.0.0.2"
ovs-vsctl add-port br-int sw1-port1 -- set interface sw1-port1 type=internal external_ids:iface-id=sw1-port1
ip netns add sw1-port1
ip link set sw1-port1 netns sw1-port1
ip netns exec sw1-port1 ip link set sw1-port1 address 50:54:00:00:00:03
ip netns exec sw1-port1 ip link set sw1-port1 up
ip netns exec sw1-port1 ip addr add 11.0.0.2/24 dev sw1-port1
ip netns exec sw1-port1 ip route add default via 11.0.0.1

# Create a logical router and attach both logical switches
ovn-nbctl lr-add lr0
ovn-nbctl lrp-add lr0 lrp0 00:00:00:00:ff:01 192.168.0.1/24
ovn-nbctl lsp-add sw0 lrp0-attachment
ovn-nbctl lsp-set-type lrp0-attachment router
ovn-nbctl lsp-set-addresses lrp0-attachment 00:00:00:00:ff:01
ovn-nbctl lsp-set-options lrp0-attachment router-port=lrp0
ovn-nbctl lrp-add lr0 lrp1 00:00:00:00:ff:02 11.0.0.1/24
ovn-nbctl lsp-add sw1 lrp1-attachment
ovn-nbctl lsp-set-type lrp1-attachment router
ovn-nbctl lsp-set-addresses lrp1-attachment 00:00:00:00:ff:02
ovn-nbctl lsp-set-options lrp1-attachment router-port=lrp1
ovn-nbctl set Logical_Router lr0 options:chassis=chassis-1
ovn-nbctl set Logical_Router lr0 options:lb_force_snat_ip=router_ip

ovn-nbctl lb-add lb0 11.0.0.200:1234 192.168.0.2:8080
ovn-nbctl set Load_Balancer lb0 options:skip_snat=true
ovn-nbctl set load_balancer lb0 options:affinity_timeout=1200
ovn-nbctl lr-lb-add lr0 lb0
ovn-sbctl dump-flows lr0 | grep lr_in_dnat
ovn-nbctl --wait=hv sync

ip netns exec sw0-port1 python3 -m http.server 8080 &
ip netns exec sw1-port1 curl 11.0.0.200:1234
ip netns exec sw1-port1 curl 11.0.0.200:1234

Actual results:
at least the second curl succeeds, but after SNAT:

192.168.0.1 - - [20/Jul/2023 09:24:39] "GET / HTTP/1.1" 200 -

Expected results:
curl succeeds with the proper IP:

11.0.0.2 - - [20/Jul/2023 09:27:27] "GET / HTTP/1.1" 200 -
11.0.0.2 - - [20/Jul/2023 09:27:27] "GET / HTTP/1.1" 200 -

(as is the case when removing the affinity_timeout with
ovn-nbctl remove load_balancer lb0 options affinity_timeout=1200
)

Additional info:
also RH case https://access.redhat.com/support/cases/#/case/03563137
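
The actual and expected results above differ only in the client address that http.server logs, so the check can be scripted. The following is a minimal sketch, not part of the original reproducer: check_client_ip is a hypothetical helper name, and it assumes the access-log lines (which python3 -m http.server writes to stderr) have been captured and are fed to it on stdin.

```shell
# Hypothetical helper (not from the original report): reads http.server
# access-log lines on stdin and compares the first field of each line
# (the client address) with the expected client IP given as $1.
check_client_ip() {
    expected="$1"
    while read -r src rest; do
        # Skip empty lines.
        if [ -z "$src" ]; then
            continue
        fi
        if [ "$src" = "$expected" ]; then
            echo "ok $src"
        else
            echo "snat $src"
        fi
    done
}
```

With the addresses from the reproducer, a log line starting with 192.168.0.1 passed to check_client_ip 11.0.0.2 would be reported as "snat 192.168.0.1", while a line starting with 11.0.0.2 would be reported as "ok 11.0.0.2".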
Patch posted upstream: https://patchwork.ozlabs.org/project/ovn/patch/20230720125708.132830-1-amusil@redhat.com/
I tried it and it works (thanks!), note that this fails now:

ip netns exec sw0-port1 curl 11.0.0.200:1234

(the fun case of the pod contacting the service for which it is its own endpoint, and thus requires the hairpin thing)
@amusil Thanks, can you make sure that this gets backported to whichever version of OVN is present in OCP 4.12?
(In reply to François Rigault from comment #2)
> I tried it and it works (thanks!), note that this fails now:
>
> ip netns exec sw0-port1 curl 11.0.0.200:1234
>
> (the fun case of the pod contacting the service for which it is its own
> endpoint, and thus requires the hairpin thing)

That also fails when you remove the affinity_timeout (on current main). AFAIK that's correct.

(In reply to Scott Dodson from comment #4)
> @amusil Thanks, can you make sure that this gets backported to
> whichever version of OVN is present in OCP 4.12?

Yeah, I'll make sure it gets backported.

Thanks,
Ales