We would like to improve health checking for ovn-controller. In particular, we would like to catch the following circumstances:

1. ovn-controller hasn't applied flows for a long time, either because the sbdb or ovs-vswitchd is somehow down
2. ovn-controller is taking an increasingly long time to apply flows, and this is going to cause a problem
3. ovn-controller is otherwise unhealthy (this is intentionally poorly defined)

Chatting with Numan, it seems to make sense for ovn-controller to write this information to the local ovsdb, so that we can consume it without blocking the main thread. All of these will be exported as Prometheus metrics (via ovn-kubernetes). Numbers 1 and 2 can be accomplished by writing a list of sync timestamps and their start-to-finish durations; ovn-kubernetes can then build a histogram from them and estimate the 95th-percentile sync duration.
DCBW pointed out we've discussed something similar before - namely, an end-to-end signal that flows are being propagated correctly. We could write a timestamp into nb_cfg and watch it propagate to the sbdb and to ovn-controller, then publish its value as a metric for both the ovn-kube master and all the ovn-controllers. This is a good idea, and I would like to see it as part of this RFE as well.
After talking to the OVN team, we realized the nb_cfg propagation mechanism is already almost everything we need. Updating nb_cfg in the nbdb, e.g. with ovn-nbctl set NB_Global . nb_cfg=$(ts), causes that value to be propagated down to SB_Global and then, via ovn-controller, to the Chassis_Private table. That table already includes "nb_cfg_timestamp", the timestamp at which the latest nb_cfg value was set. Furthermore, nb_cfg is also available in the node's ovsdb, in the Bridge table. It would be useful, however, if the nb_cfg_timestamp were also available in the local ovsdb.

So, this RFE is for two small(?) additions to the node's local ovsdb:
1. nb_cfg_timestamp
2. ovn-controller process start timestamp

(The latter is to make it possible to alert on high e2e latency; otherwise it's hard to tell the difference between an ovn-controller restart and genuinely high running latency.) A rough verification sketch follows below.
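For reference, a minimal sketch of how the propagation could be checked from the CLI. It assumes the external_ids keys this RFE adds (ovn-nb-cfg, ovn-nb-cfg-ts, ovn-startup-ts, as seen in the verification further down) and uses $CHASSIS as a placeholder for the chassis name:

# Write the current time (ms) into nb_cfg in the NB database
ovn-nbctl set NB_Global . nb_cfg=$(date +%s%3N)

# Confirm the value reached the SB Chassis_Private row for this chassis
ovn-sbctl --columns=nb_cfg,nb_cfg_timestamp list Chassis_Private $CHASSIS

# Confirm ovn-controller reflected it into the node's local ovsdb
ovs-vsctl get Bridge br-int external_ids:ovn-nb-cfg
ovs-vsctl get Bridge br-int external_ids:ovn-nb-cfg-ts
ovs-vsctl get Bridge br-int external_ids:ovn-startup-ts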
Patch sent for review: http://patchwork.ozlabs.org/project/ovn/list/?series=249177&state=*
v2 posted for review: http://patchwork.ozlabs.org/project/ovn/list/?series=249334&state=*
Verified on:

# rpm -qa | grep -E 'ovn|openvswitch'
ovn-2021-host-21.09.0-12.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-28.el8fdp.noarch
openvswitch2.15-2.15.0-26.el8fdp.x86_64
ovn-2021-21.09.0-12.el8fdp.x86_64
ovn-2021-central-21.09.0-12.el8fdp.x86_64

# ovn-nbctl --wait=hv sync
# ovs-vsctl get Bridge br-int external_ids
{ct-zone-3d194f6b-c65f-4e4f-8867-b67f02208efa_dnat="4", ct-zone-3d194f6b-c65f-4e4f-8867-b67f02208efa_snat="5", ct-zone-495cb9cb-95a3-4c01-81d9-83977429b71d_dnat="8", ct-zone-495cb9cb-95a3-4c01-81d9-83977429b71d_snat="6", ct-zone-8f7b933f-d227-422d-a427-ac3a293ac19e_dnat="1", ct-zone-8f7b933f-d227-422d-a427-ac3a293ac19e_snat="2", ct-zone-ac15f654-9847-4331-9db2-1a920be4da7e_dnat="9", ct-zone-ac15f654-9847-4331-9db2-1a920be4da7e_snat="7", ct-zone-vm1="3", ct-zone-vm2="10", ct-zone-vm3="11", ovn-nb-cfg="1", ovn-nb-cfg-ts="1634816184014", ovn-startup-ts="1634814832761"}

====> ovn-nb-cfg-ts and ovn-startup-ts can be noted
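As a rough follow-on (a sketch, not part of the verification above), the two keys can be used to approximate end-to-end propagation latency. This assumes the NB node's and the chassis's clocks are reasonably in sync, that the commands run somewhere with access to both the NB database and the local ovsdb, and that ovn-nb-cfg-ts records (in ms) when ovn-controller processed the latest nb_cfg value, per the description above:

# Record the time just before forcing a round trip through the databases
before=$(date +%s%3N)
ovn-nbctl --wait=hv sync

# Compare against the timestamp ovn-controller wrote into the local ovsdb
applied=$(ovs-vsctl get Bridge br-int external_ids:ovn-nb-cfg-ts | tr -d '"')
echo "approximate nb -> chassis latency: $((applied - before)) ms"

# A recent ovn-startup-ts points to an ovn-controller restart rather than
# genuinely slow syncs
ovs-vsctl get Bridge br-int external_ids:ovn-startup-ts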
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:5059