We would like to improve health checking for ovn-controller. In particular, we would like to catch the following circumstances:

1. ovn-controller hasn't applied flows for a long time, either because the sbdb or ovs-vswitchd is somehow down
2. ovn-controller is taking an increasingly long time to apply flows, and this is going to cause a problem
3. ovn-controller is otherwise unhealthy (this is intentionally poorly defined)

Chatting with Numan, it seems to make sense for ovn-controller to write this information to the local ovsdb, so that we can consume it without blocking the main thread. All of these will be exported as Prometheus metrics (via ovn-kubernetes). Numbers 1 and 2 can be accomplished by writing a list of sync timestamps and their start-to-finish durations; ovn-kubernetes can then build a histogram from them and estimate the 95th-percentile sync duration.
DCBW pointed out we've discussed something similar before - namely, an end-to-end signal that flows are being propagated correctly. We could write a timestamp into nb_cfg and watch it propagate to the sbdb and to ovn-controller, then publish its value as a metric for both the ovn-kube master and all the ovn-controllers. This is a good idea, and I would like to see it as part of this RFE as well.
After talking to the OVN team, we realized the nb_cfg propagation mechanism is already almost everything we need. Updating nb_cfg in the nbdb, e.g. with ovn-nbctl set NB_Global . nb_cfg=$(ts), causes that value to be propagated down to SB_Global and then, via ovn-controller, to the Chassis_Private table. That table already includes "nb_cfg_timestamp", the timestamp at which the latest nb_cfg value was set. Furthermore, nb_cfg is also available in the node's ovsdb, in the Bridge table. It would be useful, however, if the nb_cfg_timestamp were also available in the local ovsdb.

So, this RFE is for two small(?) additions to the node's local ovsdb:
1. nb_cfg_timestamp
2. ovn-controller process start timestamp

(The latter is to make it possible to alert on high e2e latency; otherwise it's hard to tell the difference between an ovn-controller restart and genuinely high running latency.) A rough verification sketch follows below.
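For reference, a minimal sketch of how the propagation could be checked from the CLI. It assumes the external_ids keys this RFE adds (ovn-nb-cfg, ovn-nb-cfg-ts, ovn-startup-ts, as seen in the verification further down) and uses $CHASSIS as a placeholder for the chassis name:

# Write the current time (ms) into nb_cfg in the NB database
ovn-nbctl set NB_Global . nb_cfg=$(date +%s%3N)

# Confirm the value reached the SB Chassis_Private row for this chassis
ovn-sbctl --columns=nb_cfg,nb_cfg_timestamp list Chassis_Private $CHASSIS

# Confirm ovn-controller reflected it into the node's local ovsdb
ovs-vsctl get Bridge br-int external_ids:ovn-nb-cfg
ovs-vsctl get Bridge br-int external_ids:ovn-nb-cfg-ts
ovs-vsctl get Bridge br-int external_ids:ovn-startup-ts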
Patch sent for review: http://patchwork.ozlabs.org/project/ovn/list/?series=249177&state=*
v2 posted for review: http://patchwork.ozlabs.org/project/ovn/list/?series=249334&state=*
Verified on:

# rpm -qa | grep -E 'ovn|openvswitch'
ovn-2021-host-21.09.0-12.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-28.el8fdp.noarch
openvswitch2.15-2.15.0-26.el8fdp.x86_64
ovn-2021-21.09.0-12.el8fdp.x86_64
ovn-2021-central-21.09.0-12.el8fdp.x86_64

# ovn-nbctl --wait=hv sync
# ovs-vsctl get Bridge br-int external_ids
{ct-zone-3d194f6b-c65f-4e4f-8867-b67f02208efa_dnat="4", ct-zone-3d194f6b-c65f-4e4f-8867-b67f02208efa_snat="5", ct-zone-495cb9cb-95a3-4c01-81d9-83977429b71d_dnat="8", ct-zone-495cb9cb-95a3-4c01-81d9-83977429b71d_snat="6", ct-zone-8f7b933f-d227-422d-a427-ac3a293ac19e_dnat="1", ct-zone-8f7b933f-d227-422d-a427-ac3a293ac19e_snat="2", ct-zone-ac15f654-9847-4331-9db2-1a920be4da7e_dnat="9", ct-zone-ac15f654-9847-4331-9db2-1a920be4da7e_snat="7", ct-zone-vm1="3", ct-zone-vm2="10", ct-zone-vm3="11", ovn-nb-cfg="1", ovn-nb-cfg-ts="1634816184014", ovn-startup-ts="1634814832761"}

====> ovn-nb-cfg-ts and ovn-startup-ts can be noted
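As a rough follow-on (a sketch, not part of the verification above), the two keys can be used to approximate end-to-end propagation latency. This assumes the NB node's and the chassis's clocks are reasonably in sync, that the commands run somewhere with access to both the NB database and the local ovsdb, and that ovn-nb-cfg-ts records (in ms) when ovn-controller processed the latest nb_cfg value, per the description above:

# Record the time just before forcing a round trip through the databases
before=$(date +%s%3N)
ovn-nbctl --wait=hv sync

# Compare against the timestamp ovn-controller wrote into the local ovsdb
applied=$(ovs-vsctl get Bridge br-int external_ids:ovn-nb-cfg-ts | tr -d '"')
echo "approximate nb -> chassis latency: $((applied - before)) ms"

# A recent ovn-startup-ts points to an ovn-controller restart rather than
# genuinely slow syncs
ovs-vsctl get Bridge br-int external_ids:ovn-startup-ts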
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:5059