Description of problem:

Referring to the OVN architectural model below, the OVN Southbound DB, which runs on the controller, is responsible for managing the transport nodes via ovn-controller, which runs on every compute node.

                                  CMS
                                   |
                                   |
                       +-----------|-----------+ \
                       |           |           |  \
                       |     OVN/CMS Plugin    |   \
                       |           |           |    \
                       |           |           |     \
                       |   OVN Northbound DB   |      \
                       |           |           |       --->> Overcloud Controller
                       |           |           |      /
                       |       ovn-northd      |     /
                       |           |           |    /
                       +-----------|-----------+   /
                                   |               /
                                   |              /
                         +-------------------+   /
                         | OVN Southbound DB |  /
                         +-------------------+
                                   |
                                   |
                +------------------+------------------+
                |                                      |
                |         ------>> Overcloud Compute
  HV 1          |                                      |    HV n
+---------------|---------------+  .  +---------------|---------------+
|               |               |  .  |               |               |
|        ovn-controller         |  .  |        ovn-controller         |
|         |          |          |  .  |         |          |          |
|         |          |          |  .  |         |          |          |
|  ovs-vswitchd   ovsdb-server  |  .  |  ovs-vswitchd   ovsdb-server  |
|                               |     |                               |
+-------------------------------+     +-------------------------------+

Ideally, for any downtime of ovn-controller on a compute node, there should be a mechanism in the OVN control plane that captures the health status of the southbound components (ovn-controller, ovs-vswitchd, ovsdb-server). In RHOSP 14 testing, when we restart the ovn-controller container on a compute node, no event is captured in the OVN logs on the controller node. We are not sure whether this is a flaw in the OVN architectural model or whether we are hitting a known bug/RFE. Please guide us toward a better understanding of this monitoring mechanism in OVN.

Version-Release number of selected component (if applicable):
Red Hat OpenStack 14

Steps to Reproduce:
1. Monitor all the OVN logs on the overcloud controller node:

# tailf /var/log/containers/openvswitch/ovn-northd.log /var/log/containers/openvswitch/ovsdb-server-nb.log /var/log/containers/openvswitch/ovsdb-server-sb.log

2. On a compute node, restart the docker services and OVS systemd services below. We noticed that the restart was logged only in ovn-controller.log on the compute node itself:

@compute-0 ~]# systemctl restart ovsdb-server.service ovs-vswitchd.service openvswitch.service | sleep 10 |docker restart ovn_controller | sleep 10 | docker restart ovn_metadata_agent

@compute-0 ~]# tailf /var/log/containers/openvswitch/ovn-controller.log
2018-12-28T06:59:35.900Z|00040|fatal_signal|WARN|terminating with signal 15 (Terminated)
2018-12-28T06:59:36.263Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovn-controller.log
2018-12-28T06:59:36.264Z|00002|reconnect|INFO|unix:/run/openvswitch/db.sock: connecting...
2018-12-28T06:59:36.264Z|00003|reconnect|INFO|unix:/run/openvswitch/db.sock: connected
2018-12-28T06:59:36.267Z|00004|reconnect|INFO|tcp:172.17.1.17:6642: connecting...
2018-12-28T06:59:37.269Z|00005|reconnect|INFO|tcp:172.17.1.17:6642: connection attempt timed out
2018-12-28T06:59:38.270Z|00006|reconnect|INFO|tcp:172.17.1.17:6642: connecting...
2018-12-28T06:59:38.271Z|00007|reconnect|INFO|tcp:172.17.1.17:6642: connected
2018-12-28T06:59:38.277Z|00008|jsonrpc|WARN|unix:/run/openvswitch/db.sock: send error: Broken pipe
2018-12-28T06:59:38.277Z|00009|reconnect|WARN|unix:/run/openvswitch/db.sock: connection dropped (Broken pipe)
2018-12-28T06:59:38.282Z|00010|dpif_netlink|INFO|The kernel module does not support meters.
2018-12-28T06:59:38.284Z|00011|ofctrl|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
2018-12-28T06:59:38.284Z|00012|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2018-12-28T06:59:38.284Z|00013|pinctrl|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
2018-12-28T06:59:38.284Z|00014|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2018-12-28T06:59:38.287Z|00015|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
2018-12-28T06:59:38.288Z|00016|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
2018-12-28T06:59:39.277Z|00017|reconnect|INFO|unix:/run/openvswitch/db.sock: connecting...
2018-12-28T06:59:39.277Z|00018|reconnect|INFO|unix:/run/openvswitch/db.sock: connected

Expected results:
The concern is: in a larger-scale environment, how effectively can the OVN Northbound/Southbound DB or ovn-northd (on the controller) monitor the health of the ovn-controller instances running on the compute nodes?
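As an illustration of the kind of control-plane-side check being asked about, the southbound Chassis table can be inspected on the controller; every running ovn-controller registers a row there. This is only a sketch: it assumes ovn-sbctl is available on the overcloud controller, it reuses the SB DB address 172.17.1.17:6642 seen in the log above, and the controller prompt is illustrative.

@controller-0 ~]# ovn-sbctl --db=tcp:172.17.1.17:6642 show
@controller-0 ~]# ovn-sbctl --db=tcp:172.17.1.17:6642 list Chassis

A compute node whose ovn-controller has stopped and unregistered itself would be missing from this output.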
From the OSP networking-ovn perspective, ovn-controller is just another agent. If it is down for 60 seconds (the default value in the configuration settings), it will show as down when you list the agents from the API.

From the pure OVN perspective, you can tell from the logs that it reconnected to the SB database. Also, since you restarted ovn-controller in a controlled way, you should see its Chassis entry disappear from the SB database. You can confirm this with 'ovn-sbctl list Chassis'.

Please feel free to reopen the bug if you feel that this information is not enough.
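For reference, the agent-level check described above could look like the following. This is a minimal sketch, assuming the OpenStack client with overcloud admin credentials sourced; the prompt is illustrative.

(overcloud) $ openstack network agent list

An ovn-controller that has been down for longer than the configured agent-down timeout is reported as not alive for its compute host. The pure OVN check is 'ovn-sbctl list Chassis' run against the southbound DB on the controller node, as noted above.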