Description of problem: When a load balancer backend is removed as part of cleaning up the network resource, OVN should also clear any conntrack entries currently associated with the backend being removed.
OpenShift worked around this by clearing UDP conntrack entries in ovn-kubernetes. Note that we only care about UDP conntrack. We also found that we had to clear conntrack entries on *all* nodes when a logical port went away, because even if the logical port was bound to chassis A, there might be conntrack entries on chassis B if a pod on chassis B was trying to talk to the now-deleted pod on chassis A.
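For reference, the core of that workaround can be sketched in Go roughly as follows. This is a minimal sketch, assuming the vishvananda/netlink package with its conntrack filter support; the backend IP and the choice of matching on the reply source (the address traffic was DNATed to) are illustrative, not the exact ovn-kubernetes code.

```go
package main

import (
	"fmt"
	"net"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

// flushUDPConntrackForBackend deletes UDP conntrack entries whose reply
// source is the removed backend, i.e. entries that were DNATed to it.
// As noted above, this has to run on *every* node, because stale entries
// can exist on any chassis that was talking to the deleted pod.
func flushUDPConntrackForBackend(backendIP string) error {
	ip := net.ParseIP(backendIP)
	if ip == nil {
		return fmt.Errorf("invalid backend IP %q", backendIP)
	}

	filter := &netlink.ConntrackFilter{}
	if err := filter.AddIP(netlink.ConntrackReplySrcIP, ip); err != nil {
		return err
	}
	if err := filter.AddProtocol(unix.IPPROTO_UDP); err != nil {
		return err
	}

	deleted, err := netlink.ConntrackDeleteFilter(netlink.ConntrackTable, netlink.FAMILY_V4, filter)
	if err != nil {
		return err
	}
	fmt.Printf("deleted %d UDP conntrack entries for backend %s\n", deleted, backendIP)
	return nil
}

func main() {
	// Hypothetical pod IP of the load balancer backend that was removed.
	if err := flushUDPConntrackForBackend("10.244.1.5"); err != nil {
		fmt.Println("conntrack flush failed:", err)
	}
}
```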
Note we are also hitting this issue in OVNK CI now: https://bugzilla.redhat.com/show_bug.cgi?id=1990335#c3

I think the deletion of the CT entry needs to be triggered by ovn-controller. If ovnkube-node is responsible, it will have to periodically "sync" all of the endpoints in the cluster and compare that to all of the CT entries in the DNAT zone, because the conntrack entry must be removed only after the endpoint has been removed via OpenFlow. If ovn-controller is responsible for this action, it can do it right after it has deleted the endpoint from the OF group, and it knows specifically which endpoint it is.

Talking with Dumitru this morning we discussed several options. Since ovn-controller is unaware of which OVS datapath is in use, it cannot know whether kernel or userspace conntrack is being used, and therefore it does not call netlink directly. It typically issues an OF command to signal OVS to clear conntrack. However, there is no OF command to clear a single conntrack entry by filter, so this path would require OVS library changes as well as OVN changes.

Dumitru suggested a simpler fix where we could provide a script that ovn-controller would call when an endpoint is removed. A user could pass any script they want via an argument to the ovn-controller binary, but an example script that flushes the conntrack entry could be provided in the library. Then in CNO we could just set the argument on ovn-controller, e.g. --endpoint-delete-script, and that would call a bash script that uses conntrack-tools to delete the entry.
(In reply to Tim Rozet from comment #5)
[...]
> Dumitru suggested a simpler fix where we could provide a script that
> ovn-controller would call when an endpoint is removed. A user could pass any
> script they want via an argument to the ovn-controller binary, but an example
> script that flushes the conntrack entry could be provided in the library.
> Then in CNO we could just set the argument on ovn-controller, e.g.
> --endpoint-delete-script, and that would call a bash script that uses
> conntrack-tools to delete the entry.

Just a note on this part: Tim raised the question of potential performance impact (because of calling a bash script on every endpoint deletion). We could try to design a solution where a dedicated ovn-controller thread takes care of eventually flushing conntrack for removed endpoints by calling the external binary/script. These requests could be queued by the main thread when the deletion happens. Like this, any potential delays incurred by exec-ing an external tool should not affect the main processing loop in ovn-controller.
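ovn-controller itself is written in C, so the following is only an illustration of the pattern rather than a patch sketch, but the queued, single-worker design described above looks roughly like this Go sketch (the script path, its argument convention, and the --endpoint-delete-script idea are the hypothetical ones proposed in the comments above):

```go
package main

import (
	"log"
	"os/exec"
)

// endpointDeleteRequest describes one removed backend whose conntrack
// state should eventually be flushed by the external script.
type endpointDeleteRequest struct {
	backendIP string
}

// startConntrackFlusher starts a single worker that drains the queue and
// execs the user-supplied script, so a slow script never blocks the loop
// that detected the endpoint removal. It returns a channel that is closed
// once the queue has been drained (used only for this demo's shutdown).
func startConntrackFlusher(script string, queue <-chan endpointDeleteRequest) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		for req := range queue {
			// Hypothetical contract: the script receives the backend IP as
			// its only argument and flushes the matching conntrack entries,
			// e.g. via the conntrack utility from conntrack-tools.
			if out, err := exec.Command(script, req.backendIP).CombinedOutput(); err != nil {
				log.Printf("flush script failed for %s: %v (%s)", req.backendIP, err, out)
			}
		}
	}()
	return done
}

func main() {
	// Buffered queue so the main loop can enqueue without blocking.
	queue := make(chan endpointDeleteRequest, 1024)
	done := startConntrackFlusher("/usr/local/bin/flush-endpoint-conntrack.sh", queue)

	// The main loop would enqueue a request whenever it removes an endpoint.
	queue <- endpointDeleteRequest{backendIP: "10.244.1.5"}

	close(queue) // demo shutdown; a long-running daemon would keep the queue open
	<-done
}
```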
OVS RFE for the "long-term" solution of informing ovs-vswitchd to flush conntrack entries that correspond to a given backend: https://bugzilla.redhat.com/show_bug.cgi?id=2120546
After a follow-up discussion with Tim we decided to avoid the script "short term" solution and work towards a real fix:
- support filtered conntrack flushing in OVS via an OF extension (bug 2120546)
- use this new feature in ovn-controller when we detect backends being removed

If needed, until the fix lands in OVS/OVN, ovn-kubernetes can work around the problem by periodically checking and cleaning up conntrack state for pods that are local to the node.
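One possible shape of that interim ovn-kubernetes workaround is sketched below, again assuming the vishvananda/netlink package; the validBackends set and the use of the reply source IP as the backend address are simplifications of what the real controller would track.

```go
package main

import (
	"log"
	"net"
	"time"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

// reconcileUDPConntrack deletes UDP conntrack entries whose reply source is
// not a currently valid backend IP. validBackends is a stand-in for the
// node's real source of truth (its local pods / known endpoints).
func reconcileUDPConntrack(validBackends map[string]bool) error {
	flows, err := netlink.ConntrackTableList(netlink.ConntrackTable, netlink.FAMILY_V4)
	if err != nil {
		return err
	}

	// Collect backend IPs that still have UDP conntrack state but no longer exist.
	stale := map[string]bool{}
	for _, flow := range flows {
		if flow.Forward.Protocol != unix.IPPROTO_UDP {
			continue
		}
		backend := flow.Reverse.SrcIP.String()
		if !validBackends[backend] {
			stale[backend] = true
		}
	}

	// Flush the stale state, one filtered delete per stale backend.
	for backend := range stale {
		filter := &netlink.ConntrackFilter{}
		if err := filter.AddIP(netlink.ConntrackReplySrcIP, net.ParseIP(backend)); err != nil {
			return err
		}
		if err := filter.AddProtocol(unix.IPPROTO_UDP); err != nil {
			return err
		}
		if _, err := netlink.ConntrackDeleteFilter(netlink.ConntrackTable, netlink.FAMILY_V4, filter); err != nil {
			return err
		}
		log.Printf("flushed stale UDP conntrack entries for %s", backend)
	}
	return nil
}

func main() {
	// Hypothetical set of backend IPs that still exist on this node.
	validBackends := map[string]bool{"10.244.1.5": true}

	// Periodic sync loop; the interval is arbitrary for the sketch.
	for range time.Tick(30 * time.Second) {
		if err := reconcileUDPConntrack(validBackends); err != nil {
			log.Printf("conntrack reconciliation failed: %v", err)
		}
	}
}
```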
Patches posted u/s: https://patchwork.ozlabs.org/project/ovn/list/?series=338239