Bug 2224260

Summary: LB skip_snat improperly applied with affinity_timeout
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: François Rigault <francois.rigault>
Component: ovn23.09
Assignee: Ales Musil <amusil>
Status: POST
QA Contact: ying xu <yinxu>
Severity: medium
Priority: unspecified
Version: FDP 23.K
CC: amusil, ctrautma, jiji, jishi, sdodson
Hardware: Unspecified
OS: Unspecified
Type: Bug

Description François Rigault 2023-07-20 09:42:22 UTC
Description of problem:
Traffic is still SNATed when all of the following hold:
1- the load balancer has skip_snat=true
2- the logical router has lb_force_snat_ip=router_ip
3- the load balancer has an affinity_timeout set
In the context of OVN-Kubernetes, traffic for services with externalTrafficPolicy: Local and sessionAffinity: ClientIP is still getting SNATed.

Version-Release number of selected component (if applicable):
ovn main branch (commit 30952c248d4f804c25af9b1c9565f23c0045e915)

How reproducible:
all the time

Steps to Reproduce:
(greatly helped by reusing instructions from bz1995326)

1. in OVN sandbox:
# Create the first logical switch with one port
ovn-nbctl ls-add sw0                                         
ovn-nbctl lsp-add sw0 sw0-port1                           
ovn-nbctl lsp-set-addresses sw0-port1 "50:54:00:00:00:01 192.168.0.2"
                                     
ovs-vsctl add-port br-int sw0-port1 -- set interface sw0-port1 type=internal external_ids:iface-id=sw0-port1
ip netns add sw0-port1                                       
ip link set sw0-port1 netns sw0-port1                     
ip netns exec sw0-port1 ip link set sw0-port1 address 50:54:00:00:00:01
ip netns exec sw0-port1 ip link set sw0-port1 up
ip netns exec sw0-port1 ip addr add 192.168.0.2/24 dev sw0-port1
ip netns exec sw0-port1 ip route add default via 192.168.0.1
                                                     
# Create the second logical switch with one port
ovn-nbctl ls-add sw1                                
ovn-nbctl lsp-add sw1 sw1-port1                                    
ovn-nbctl lsp-set-addresses sw1-port1 "50:54:00:00:00:03 11.0.0.2"
                                                     
ovs-vsctl add-port br-int sw1-port1 -- set interface sw1-port1 type=internal external_ids:iface-id=sw1-port1
ip netns add sw1-port1     
ip link set sw1-port1 netns sw1-port1
ip netns exec sw1-port1 ip link set sw1-port1 address 50:54:00:00:00:03
ip netns exec sw1-port1 ip link set sw1-port1 up
ip netns exec sw1-port1 ip addr add 11.0.0.2/24 dev sw1-port1
ip netns exec sw1-port1 ip route add default via 11.0.0.1

# Create a logical router and attach both logical switches
ovn-nbctl lr-add lr0                   
ovn-nbctl lrp-add lr0 lrp0 00:00:00:00:ff:01 192.168.0.1/24
ovn-nbctl lsp-add sw0 lrp0-attachment           
ovn-nbctl lsp-set-type lrp0-attachment router   
ovn-nbctl lsp-set-addresses lrp0-attachment 00:00:00:00:ff:01
ovn-nbctl lsp-set-options lrp0-attachment router-port=lrp0
ovn-nbctl lrp-add lr0 lrp1 00:00:00:00:ff:02 11.0.0.1/24
ovn-nbctl lsp-add sw1 lrp1-attachment
ovn-nbctl lsp-set-type lrp1-attachment router
ovn-nbctl lsp-set-addresses lrp1-attachment 00:00:00:00:ff:02
ovn-nbctl lsp-set-options lrp1-attachment router-port=lrp1

ovn-nbctl set Logical_Router lr0 options:chassis=chassis-1
ovn-nbctl set Logical_Router lr0 options:lb_force_snat_ip=router_ip
ovn-nbctl lb-add lb0 11.0.0.200:1234 192.168.0.2:8080
ovn-nbctl set Load_Balancer lb0 options:skip_snat=true
ovn-nbctl set load_balancer lb0 options:affinity_timeout=1200
ovn-nbctl lr-lb-add lr0 lb0

ovn-sbctl dump-flows lr0 | grep lr_in_dnat
ovn-nbctl --wait=hv sync
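Before running the curls, it can help to look at the logical flows that the affinity_timeout option generates. The commands below are a hedged sketch: the stage names lr_in_lb_aff_check/lr_in_lb_aff_learn and the flags.skip_snat_for_lb flag are the ones ovn-northd uses for load-balancer affinity and skip_snat in recent OVN releases, but the exact flow text varies by version.

```shell
# Flows added only when affinity_timeout is set on the load balancer.
ovn-sbctl dump-flows lr0 | grep -E 'lr_in_lb_aff_(check|learn)'

# On an affected build, the regular lr_in_dnat path sets
# flags.skip_snat_for_lb = 1 for this LB, while the affinity-hit
# path does not, which is why affinity traffic still gets SNATed.
ovn-sbctl dump-flows lr0 | grep lr_in_dnat | grep skip_snat
```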

ip netns exec sw0-port1 python3 -m http.server 8080 &

 
ip netns exec sw1-port1 curl 11.0.0.200:1234
ip netns exec sw1-port1 curl 11.0.0.200:1234


Actual results:
At least the second curl succeeds, but the traffic has been SNATed: the server logs the router IP (192.168.0.1) instead of the client IP:
192.168.0.1 - - [20/Jul/2023 09:24:39] "GET / HTTP/1.1" 200 -
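The unwanted SNAT can also be confirmed in conntrack. A hedged sketch: in the OVN sandbox the connections sit in the userspace datapath's conntrack table, so ovs-appctl is used here; on a real host `conntrack -L` would show the same thing.

```shell
# Dump conntrack entries for the backend port.
ovs-appctl dpctl/dump-conntrack | grep 8080

# A SNATed entry has its reply direction addressed to the router IP
# (192.168.0.1) rather than the original client IP 11.0.0.2.
```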

Expected results:
The curls succeed and the server sees the real client IP (11.0.0.2):
11.0.0.2 - - [20/Jul/2023 09:27:27] "GET / HTTP/1.1" 200 -
11.0.0.2 - - [20/Jul/2023 09:27:27] "GET / HTTP/1.1" 200 -

(as is the case when the affinity_timeout is removed with
ovn-nbctl remove load_balancer lb0 options affinity_timeout=1200
)


Additional info:
also RH case https://access.redhat.com/support/cases/#/case/03563137

Comment 2 François Rigault 2023-07-20 14:05:30 UTC
I tried it and it works (thanks!). Note, however, that this now fails:


ip netns exec sw0-port1 curl 11.0.0.200:1234

(the fun case of a pod contacting a service for which it is its own endpoint, which requires hairpin handling)

Comment 4 Scott Dodson 2023-07-20 14:08:09 UTC
@amusil Thanks, can you make sure that this gets backported to whichever version of OVN is present in OCP 4.12?

Comment 5 Ales Musil 2023-07-20 14:38:10 UTC
(In reply to François Rigault from comment #2)
> I tried it and it works (thanks!), note that this fails now:
> 
> 
> ip netns exec sw0-port1 curl 11.0.0.200:1234
> 
> (the fun case of the pod contacting the service for which it is its own
> endpoint, and thus requires the hairpin thing)

That also fails when you remove the affinity_timeout (on current main). AFAIK that's correct.

(In reply to Scott Dodson from comment #4)
> @amusil Thanks, can you make sure that this gets backported to
> whichever version of OVN is present in OCP 4.12?

Yeah, I'll make sure it gets backported. 

Thanks,
Ales