Bug 1662396

Summary: [RHOSP14] [OVN] No event generated in OVN Southbound DB when an ovn-controller is restarted on a transport node.
Product: Red Hat OpenStack
Reporter: Pradipta Kumar Sahoo <psahoo>
Component: python-networking-ovn
Assignee: Assaf Muller <amuller>
Status: CLOSED NOTABUG
QA Contact: Eran Kuris <ekuris>
Severity: high
Priority: unspecified
Version: 14.0 (Rocky)
CC: apevec, dalvarez, lhh, majopela, nyechiel
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Last Closed: 2019-01-08 23:10:37 UTC
Type: Bug

Description Pradipta Kumar Sahoo 2018-12-28 07:11:36 UTC
Description of problem:
In the OVN architectural model below, the OVN Southbound DB, which runs on the Controller node, manages the transport nodes through the ovn-controller instance that runs on every Compute node.

                                         CMS
                                          |
                                          |
                              +-----------|-----------+ \
                              |           |           |  \
                              |     OVN/CMS Plugin    |   \
                              |           |           |    \
                              |           |           |     \
                              |   OVN Northbound DB   |      \
                              |           |           |       --->> Overcloud Controller
                              |           |           |      /
                              |       ovn-northd      |     /
                              |           |           |    /
                              +-----------|-----------+   /
                                          |              /
                                          |             /
                                +-------------------+  /
                                | OVN Southbound DB | /
                                +-------------------+
                                          |
                                          |
                       +------------------+------------------+
                       |                  |                  | ------>> Overcloud Compute
         HV 1          |                  |    HV n          |
       +---------------|---------------+  .  +---------------|---------------+
       |               |               |  .  |               |               |
       |        ovn-controller         |  .  |        ovn-controller         |
       |         |          |          |  .  |         |          |          |
       |         |          |          |     |         |          |          |
       |  ovs-vswitchd   ovsdb-server  |     |  ovs-vswitchd   ovsdb-server  |
       |                               |     |                               |
       +-------------------------------+     +-------------------------------+

Ideally, the OVN control plane should have a mechanism that captures the health status of the southbound components (ovn-controller, ovs-vswitchd, ovsdb-server), so that any ovn-controller downtime on a Compute node is detected.
In RHOSP14 testing, when we restart the ovn-controller container on a Compute node, no event is captured in any of the OVN logs on the Controller node.

I am not sure whether this is a flaw in the OVN architecture model or whether we are hitting a known bug/RFE. Please help us understand how this monitoring mechanism works in OVN.


Version-Release number of selected component (if applicable):
Red Hat OpenStack 14


Steps to Reproduce:

1. Monitor all the OVN logs on the overcloud Controller node.
   # tailf /var/log/containers/openvswitch/ovn-northd.log /var/log/containers/openvswitch/ovsdb-server-nb.log /var/log/containers/openvswitch/ovsdb-server-sb.log

2. On the Compute node, we restarted the OVS systemd services and the Docker containers below. Log entries were captured only in ovn-controller.log on the Compute node.
	@compute-0 ~]# systemctl restart ovsdb-server.service ovs-vswitchd.service openvswitch.service | sleep 10 |docker restart ovn_controller | sleep 10 | docker restart ovn_metadata_agent
	
	@compute-0 ~]# tailf /var/log/containers/openvswitch/ovn-controller.log
	2018-12-28T06:59:35.900Z|00040|fatal_signal|WARN|terminating with signal 15 (Terminated)
	2018-12-28T06:59:36.263Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovn-controller.log
	2018-12-28T06:59:36.264Z|00002|reconnect|INFO|unix:/run/openvswitch/db.sock: connecting...
	2018-12-28T06:59:36.264Z|00003|reconnect|INFO|unix:/run/openvswitch/db.sock: connected
	2018-12-28T06:59:36.267Z|00004|reconnect|INFO|tcp:172.17.1.17:6642: connecting...
	2018-12-28T06:59:37.269Z|00005|reconnect|INFO|tcp:172.17.1.17:6642: connection attempt timed out
	2018-12-28T06:59:38.270Z|00006|reconnect|INFO|tcp:172.17.1.17:6642: connecting...
	2018-12-28T06:59:38.271Z|00007|reconnect|INFO|tcp:172.17.1.17:6642: connected
	2018-12-28T06:59:38.277Z|00008|jsonrpc|WARN|unix:/run/openvswitch/db.sock: send error: Broken pipe
	2018-12-28T06:59:38.277Z|00009|reconnect|WARN|unix:/run/openvswitch/db.sock: connection dropped (Broken pipe)
	2018-12-28T06:59:38.282Z|00010|dpif_netlink|INFO|The kernel module does not support meters.
	2018-12-28T06:59:38.284Z|00011|ofctrl|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
	2018-12-28T06:59:38.284Z|00012|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
	2018-12-28T06:59:38.284Z|00013|pinctrl|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
	2018-12-28T06:59:38.284Z|00014|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
	2018-12-28T06:59:38.287Z|00015|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
	2018-12-28T06:59:38.288Z|00016|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
	2018-12-28T06:59:39.277Z|00017|reconnect|INFO|unix:/run/openvswitch/db.sock: connecting...
	2018-12-28T06:59:39.277Z|00018|reconnect|INFO|unix:/run/openvswitch/db.sock: connected
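
Note that the `|` separators in the restart command above build a pipeline, so all of the commands start at the same time and the `sleep 10` calls do not delay anything. For a reproduction where the restarts happen strictly one after another, a minimal sketch could look like the following. The `run` wrapper, the `DRY_RUN` variable, and the function name are hypothetical helpers added for illustration; the service and container names are the ones from the step above.

```bash
#!/usr/bin/env bash

# Run a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Restart the OVS services, then ovn_controller, then ovn_metadata_agent,
# pausing between the steps. Stops at the first command that fails.
restart_ovn_compute_services() {
    local pause="${1:-10}"
    run systemctl restart ovsdb-server.service ovs-vswitchd.service openvswitch.service &&
    run sleep "$pause" &&
    run docker restart ovn_controller &&
    run sleep "$pause" &&
    run docker restart ovn_metadata_agent
}
```

Calling `DRY_RUN=1 restart_ovn_compute_services` only prints the five commands, which is a cheap way to check the order before running it for real on a Compute node.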

Expected results:
The concern is how effectively the OVN Northbound/Southbound DB or ovn-northd (on the Controller) can monitor the health of the ovn-controller instances running on the Compute nodes in a larger-scale environment.

Comment 1 Daniel Alvarez Sanchez 2019-01-08 23:10:37 UTC
From the OSP networking-ovn perspective, ovn-controller is just another agent. If it is down for 60 seconds (the default in the config settings), it will show as down when you list the agents through the API.
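
As a rough illustration of the agent listing, the sketch below wraps `openstack network agent list` and filters for hosts whose agent is not reported alive. It assumes admin credentials are already loaded in the shell. The function names and the `OPENSTACK` override are hypothetical, and the way the Alive column is rendered (`True`/`False` or `:-)`/`XXX`) is assumed to depend on the client version, so both forms are accepted.

```bash
#!/usr/bin/env bash

# Print one "host alive" pair per Neutron agent.
# OPENSTACK is a hypothetical override for the client binary.
list_agent_liveness() {
    "${OPENSTACK:-openstack}" network agent list -f value -c Host -c Alive
}

# Read "host alive" lines on stdin and print each host that has at least
# one agent not reported as alive. "True" and ":-)" both count as alive.
hosts_with_dead_agents() {
    awk 'NF >= 2 && $2 != "True" && $2 != ":-)" { print $1 }' | sort -u
}
```

For example, `list_agent_liveness | hosts_with_dead_agents` prints nothing while every agent is alive, and prints the Compute hostname once its agents have been marked down.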

From the pure OVN perspective, you can tell from the logs that it reconnected to the SB database. In addition, because you restarted ovn-controller in a controlled way, you should see its Chassis entry disappear from the SB database. You can confirm this with 'ovn-sbctl list Chassis'.
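
The Chassis check can be scripted along these lines. This is a minimal sketch, assuming `ovn-sbctl` is run somewhere it can reach the Southbound DB (for example on the Controller); the function names and the `OVN_SBCTL` override are hypothetical. It uses the `--bare` and `--columns` options so that only hostnames are printed.

```bash
#!/usr/bin/env bash

# Print the hostname of every Chassis currently registered in the SB DB.
# --bare separates records with blank lines, which sed removes.
# OVN_SBCTL is a hypothetical override for the ovn-sbctl binary.
list_chassis_hostnames() {
    "${OVN_SBCTL:-ovn-sbctl}" --bare --columns=hostname list Chassis | sed '/^$/d'
}

# Succeed if the given hostname has a Chassis entry, fail otherwise.
chassis_registered() {
    list_chassis_hostnames | grep -Fxq -- "$1"
}
```

Running `chassis_registered compute-0.localdomain && echo registered || echo missing` in a loop while ovn-controller is restarted on that node should show the entry going away and coming back; the hostname here is only an example.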

Please, feel free to reopen the bug if you feel that this info is not enough.