Description of problem:

Currently OVN load balancing works by selecting a backend (DNAT IP) according to a hash function (by default the 5-tuple, but it can also be changed to other hash functions, e.g., src-ip + dst-ip + proto).

Some CMS implementations might require more customization of the load balancing algorithm. For example, Kubernetes services have an option to select affinity based on the client's IP address. This translates to always selecting the same backend if the same client IP is used to access a service (a load balancer VIP). ovn-kubernetes implements this by overriding the default hash function of the load balancer and using the 3-tuple: src-ip + dst-ip + proto. On top of this, Kubernetes also allows users to specify a timeout for the "sticky" backend described above.

This BZ tracks adding native support for load balancer backend affinity (with a timeout) in OVN. A potential implementation is described below (this is likely not the only option and it might also not be the most efficient).

Assuming a load balancer:

  ovn-nbctl lb-add lb-test 42.42.42.42:4242 43.43.43.43:4343 tcp

applied to a logical switch:

  ovn-nbctl ls-lb-add sw0 lb-test

and to a logical router:

  ovn-nbctl lr-lb-add lr0 lb-test

the load balancer backend is selected for a new connection by the following logical flows:

  uuid=0xbb033afb, table=11(ls_in_lb ), priority=120 , match=(ct.new && ip4.dst == 42.42.42.42 && tcp.dst == 4242), action=(reg0[1] = 0; ct_lb_mark(backends=43.43.43.43:4343);)
  uuid=0xdeacd560, table=6 (lr_in_dnat ), priority=120 , match=(ct.new && ip4 && reg0 == 42.42.42.42 && tcp && reg9[16..31] == 4242), action=(ct_lb_mark(backends=43.43.43.43:4343);)

The ct_lb_mark action translates to the following OF flows/groups:

  # For the switch:
  cookie=0xbb033afb, duration=346.664s, table=19, n_packets=0, n_bytes=0, idle_age=346, priority=120,ct_state=+new+trk,tcp,metadata=0x1,nw_dst=42.42.42.42,tp_dst=4242 actions=load:0->NXM_NX_XXREG0[97],group:1
  group_id=1,type=select,selection_method=dp_hash,bucket=bucket_id:0,weight:100,actions=ct(commit,table=20,zone=NXM_NX_REG13[0..15],nat(dst=43.43.43.43:4343),exec(load:0x1->NXM_NX_CT_MARK[1]))

  # For the router:
  cookie=0xdeacd560, duration=284.447s, table=14, n_packets=0, n_bytes=0, idle_age=284, priority=120,ct_state=+new+trk,tcp,reg0=0x2a2a2a2a,reg9=0x10920000/0xffff0000,metadata=0x3 actions=group:2
  group_id=2,type=select,selection_method=dp_hash,bucket=bucket_id:0,weight:100,actions=ct(commit,table=15,zone=NXM_NX_REG11[0..15],nat(dst=43.43.43.43:4343),exec(load:0x1->NXM_NX_CT_MARK[1]))

In order to support affinity, if the load balancer is configured to implement it, we could add a new logical action, e.g., apply_lb_affinity(), and add logical flows in the stage immediately preceding ls_in_lb and lr_in_dnat that use this action and store its result in a register bit, e.g. "reg0[16]". Logically, the action would return true (1) if the packet matches an existing session from the same client IP to a load balancer with session affinity configured. If so, the backend to be used should be the one already used by the session with affinity configured.

To implement this we could use an additional OF table, e.g., OFTABLE_CHK_LB_AFFINITY, similar to OFTABLE_CHK_LB_HAIRPIN, OFTABLE_CHK_IN_PORT_SEC, etc. For load balancers with affinity configured we could change the group buckets associated with their backends and add a learn action to insert rows in this new OF table.
These rows could match on:

  ct.new && ip.src == <pkt.src_ip> && ip.dst == <lb.vip> && proto == <lb.proto> && l4.dst == <lb.port>

and would then set reg0[16] to 1 and load the group-id and bucket-id into some registers. The timeout of these learnt flows would be set to the load balancer's affinity timeout configuration (a rough sketch is given after the list below).

A new (higher priority) flow needs to be added to stages ls_in_lb and lr_in_dnat to check the result of the apply_lb_affinity() action and, if it is set to true (1), bypass regular load balancing and use the learnt backend.

Areas that need extra care:
- The learn action flows will likely be generated by ovn-controller, which means that we need to change the SB.Load_Balancer table and also insert rows there that correspond to load balancers applied to routers (we currently only add rows that correspond to LBs applied to switches).
- We need to ensure that the datapath flows (both when session affinity is hit and when it is not) are still generic enough and don't generate unnecessary upcalls.
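Purely as an illustration, here is a rough sketch of what the modified switch group bucket and the learnt affinity flow could look like for the example load balancer above. The table number (78), the register choices (reg0[16], REG4, REG8), the timeout and priority values are all hypothetical, and the sketch loads the learnt backend address/port into registers instead of a group/bucket id, just to keep the example concrete; this is not the merged implementation.

  # Hypothetical bucket for group_id=1: first learn an affinity flow in the
  # (hypothetical) OFTABLE_CHK_LB_AFFINITY table 78, then commit/DNAT the
  # connection to the selected backend exactly as today.
  group_id=1,type=select,selection_method=dp_hash,
    bucket=bucket_id:0,weight:100,actions=
      learn(table=78,idle_timeout=60,priority=100,
            eth_type=0x800,nw_proto=6,
            NXM_OF_IP_SRC[],
            nw_dst=42.42.42.42,tp_dst=4242,
            load:0x1->NXM_NX_REG0[16],
            load:0x2b2b2b2b->NXM_NX_REG4[],
            load:0x10f7->NXM_NX_REG8[0..15]),
      ct(commit,table=20,zone=NXM_NX_REG13[0..15],
         nat(dst=43.43.43.43:4343),exec(load:0x1->NXM_NX_CT_MARK[1]))

In this sketch the learnt flow matches the same client IP (NXM_OF_IP_SRC[] copies it from the packet that triggered the learn), the VIP 42.42.42.42 and port 4242, sets the "affinity hit" bit reg0[16] and stores the learnt backend 43.43.43.43:4343 (0x2b2b2b2b / 0x10f7) in registers, so a higher priority flow in ls_in_lb/lr_in_dnat could reuse that backend and bypass the select group.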
Some context on the Kubernetes Service requirements, from https://kubernetes.io/docs/concepts/services-networking/service/ :

  "If you want to make sure that connections from a particular client are passed to the same Pod each time, you can select the session affinity based on the client's IP addresses by setting service.spec.sessionAffinity to "ClientIP" (the default is "None"). You can also set the maximum session sticky time by setting service.spec.sessionAffinityConfig.clientIP.timeoutSeconds appropriately (the default value is 10800, which works out to be 3 hours)."
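For completeness, this is how the knob is set on the Kubernetes side, using the fields and default timeout quoted above (the Service name "my-service" is just a placeholder):

  # Enable client-IP session affinity on an existing Service with the
  # default 3 hour (10800 second) sticky timeout.
  kubectl patch service my-service -p \
    '{"spec":{"sessionAffinity":"ClientIP","sessionAffinityConfig":{"clientIP":{"timeoutSeconds":10800}}}}'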
upstream series: https://patchwork.ozlabs.org/project/ovn/list/?series=321324
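For reference, with the upstream series applied, my understanding is that affinity is requested per load balancer through an option in the NB Load_Balancer table; the option name and the timeout value below are based on the posted series and may differ in the final version:

  # Recreate the example load balancer and request client-IP affinity
  # with a 60 second timeout (value chosen arbitrarily).
  ovn-nbctl lb-add lb-test 42.42.42.42:4242 43.43.43.43:4343 tcp
  ovn-nbctl set load_balancer lb-test options:affinity_timeout=60
  ovn-nbctl ls-lb-add sw0 lb-test
  ovn-nbctl lr-lb-add lr0 lb-test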
This can be verified by the nat/lb_force_snat_ip case from https://bugzilla.redhat.com/show_bug.cgi?id=2150533:
https://beaker.engineering.redhat.com/recipes/14222802/tasks/162837788/logs/taskout.log
https://beaker.engineering.redhat.com/recipes/14222803/tasks/162837808/logs/taskout.log

Setting to verified. Verified on version: ovn22.12-22.12.0-108.el9fdp
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn22.12 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:4677