Description of problem: During scale testing we see that just executing the ovs-ofctl dump-flows command may take up to 8 seconds, which adds to our pod latency. We need to remove this check in favor of only using ovn-installed.
Numan proposed this patch upstream: http://patchwork.ozlabs.org/project/ovn/patch/20210818213511.3076974-1-numans@ovn.org/ When that hits OVN (and if nobody bike-sheds on the key names), we'll need ovnkube-master to set "vif-id=<macaddr>" in the pod's Logical Switch Port "options" column, like we already do with the requested-chassis option. Then the ovnkube-node bits need to set "vif-id=<macaddr>" in the OVS Interface record, like we already set iface-id. Then we can trust, when reading ovn-installed from OVS during container sandbox setup, that both OVN and ovnkube agree on which pod instance was actually set up.
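A rough sketch of the two writes described above. Note the key name that eventually landed upstream was "iface-id-ver" rather than "vif-id" (see the Port_Binding output later in this bug); the port name, veth name, chassis name, and UUID below are all placeholders:

```shell
# ovnkube-master side: put the unique ID in the pod's Logical Switch Port
# options column, alongside the existing requested-chassis option.
ovn-nbctl lsp-set-options my-namespace_my-pod \
    requested-chassis=worker-node-1 \
    iface-id-ver=28cabaa9-9340-4aad-8d76-3506a6bec573

# ovnkube-node/CNI side: set the same ID on the OVS Interface record,
# next to the iface-id external-id we already set today.
ovs-vsctl set Interface veth1234 \
    external_ids:iface-id=my-namespace_my-pod \
    external_ids:iface-id-ver=28cabaa9-9340-4aad-8d76-3506a6bec573
```

With both writes in place, ovn-controller only reports ovn-installed when the ID on the Interface matches the ID on the logical port, so a stale LSP from a same-named deleted pod can no longer satisfy the check.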
Adding some background here:
1. ovnkube-master is responsible for creating the logical switch port in OVN for a pod, while the ovnkube-node CNI is responsible for adding the veth to OVS.
2. ovn-controller recognizes when the veth is added to the switch and binds it to the matching logical switch port it finds (which ovnkube-master configured).
3. In the past, ovn-controller would bind the port, but our CNI needed some way to know all of the OpenFlow flows were present before indicating the CNI add was complete. To do this we checked that specific OpenFlow flows were present for the pod in OVS.
4. The OVN team came up with a new indicator, "ovn-installed", which ovn-controller writes into the OVS ovsdb when it binds the port.
5. We switched to using this method, but then realized it had a glaring issue, outlined here: https://github.com/ovn-org/ovn-kubernetes/commit/22bed6a10c669142fb13e612a8fe17cf8b895fd6
6. In that case, if a pod was deleted and recreated with the same name as a previous pod (stateful sets reuse pod names), ovn-controller might see the old logical switch port and consider the port bound, even though the flows were not there for the new port.
7. To work around this, I added back the flow checking in the above commit. However, during scale runs we see the ovs-ofctl dump-flows command take up to 8 seconds, hurting our pod latency.
8. The OVN team is adding a way to pass a unique identifier: when ovnkube-master creates the logical port, it adds this ID; the CNI also adds the ID when it adds the port to OVS; and ovn-controller only considers the port ovn-installed if the two match.
9. We need changes to ovnkube to pass this new ID during LSP creation and in the CNI port add to OVS.
10. To verify it works, we can use a stateful set and do add/delete/add. The stateful set pod can have an init container that immediately tries to curl github.com or similar; internet connectivity should work immediately.
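The verification in step 10 could be sketched roughly like this (all names, images, and the replica count are placeholders, not the exact test that was run). The init container curls the internet immediately, so a pod that comes up Ready proves networking was wired before the CNI add completed:

```shell
# Create a StatefulSet whose pod must reach the internet from its
# init container before the main container starts.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: conncheck
spec:
  serviceName: conncheck
  replicas: 1
  selector:
    matchLabels:
      app: conncheck
  template:
    metadata:
      labels:
        app: conncheck
    spec:
      initContainers:
      - name: curl-now
        image: curlimages/curl
        command: ["curl", "-sf", "--max-time", "10", "https://github.com"]
      containers:
      - name: sleeper
        image: registry.access.redhat.com/ubi8/ubi-minimal
        command: ["sleep", "infinity"]
EOF

# Exercise the delete/recreate race: the replacement pod reuses the
# name conncheck-0, which is exactly the case that used to break.
kubectl wait --for=condition=Ready pod/conncheck-0 --timeout=120s
kubectl delete pod conncheck-0
kubectl wait --for=condition=Ready pod/conncheck-0 --timeout=120s
```

If the recreated conncheck-0 reaches Ready without init-container restarts, ovn-installed was not reported against the stale port.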
Verified on 4.10.0-0.nightly-2021-09-14-070333 that the iface-id-ver option is there. I was unable to verify any performance improvements: I could not reproduce long dump-flows times while running cluster-density, and avg(rate(ovnkube_node_cni_request_duration_seconds_sum{command="ADD"}[2m])) was not significantly different for a 100 node, 400 iteration cluster-density run when comparing 4.9.latest to 4.10.latest.

4.10 snippet from ovn-sbctl list Port_Binding on the sbdb container:

_uuid               : 4f9f2fff-c30e-479f-a252-c35b95c8458d
chassis             : 20be2355-e17c-499a-b602-25d7e6a42bee
datapath            : 5f77a8f6-2c2b-4db8-a488-eac20483d9cf
encap               : []
external_ids        : {namespace=node-density-f3766076-0695-4962-8f29-2bc9a9089abf, pod="true"}
gateway_chassis     : []
ha_chassis_group    : []
logical_port        : node-density-f3766076-0695-4962-8f29-2bc9a9089abf_node-density-92
mac                 : ["0a:58:0a:83:14:0f 10.131.20.15"]
nat_addresses       : []
options             : {iface-id-ver="28cabaa9-9340-4aad-8d76-3506a6bec573", requested-chassis=ip-10-0-197-151.us-east-2.compute.internal}
parent_port         : []
tag                 : []
tunnel_key          : 8
type                : ""
up                  : true
virtual_parent      : []

@npinaeva @trozet Is this sufficient for verification, or do we need to see more concrete evidence on the performance side?
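For completeness, the node-side half of the handshake can be spot-checked too. A sketch (the veth name is a placeholder; run on the node hosting the pod):

```shell
# The CNI-written ID on the OVS Interface should match the iface-id-ver
# shown in the Port_Binding options above.
ovs-vsctl --columns=external_ids list Interface veth1234

# ovn-controller records the install status in the same external_ids map;
# it should read "true" once the port is bound and its flows are programmed.
ovs-vsctl get Interface veth1234 external_ids:ovn-installed
```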
Hey Mike, it will only matter when the node is under heavy stress. The reduced number of ovs-ofctl calls during CNI alleviates stress on OVS. I would say the performance benefit is already known and there is no need to validate it here. However, I would suggest that for this bz we verify that when a pod comes up from CNI ADD, it actually has internet access. Sai has previously tested this by bringing up an init container that immediately curls github to check whether the pod is truly wired for networking. I suggest running that same test to ensure port_binding is working correctly: spin up 100 pods on a node, all with init containers that curl something on the internet, and make sure none of them fail.
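One possible way to check the 100-pod run afterwards (namespace name is a placeholder, and this assumes the pods were created with the curl-style init container described above):

```shell
# Print each pod's name and the restart count of its first init container.
# A nonzero count means the pod came up without immediate connectivity
# and the init curl had to retry.
kubectl get pods -n conncheck-test -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.initContainerStatuses[0].restartCount}{"\n"}{end}'
```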
Verified on 4.10.0-0.nightly-2021-10-02-095441 per comment 6 - pods can reach internet immediately upon startup.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056