Description of problem:

On a healthy OSP13-HA-OVN setup, I stopped the
192.168.24.1:8787/rhosp13/openstack-ovn-northd:2018-03-02.2 container. The
expected behavior is that Pacemaker restarts the container and promotes it
back to master.

Errors seen in "pcs status":

(overcloud) [root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-0 (version 1.1.18-11.el7-2b07d5c5a9) - partition with quorum
Last updated: Tue Mar 27 08:08:02 2018
Last change: Sun Mar 25 09:22:30 2018 by hacluster via crmd on controller-0

15 nodes configured
46 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 ovn-dbs-bundle-0@controller-0 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):  Stopped controller-0
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):  Started controller-1
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):  Started controller-2
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):  Master controller-0
   galera-bundle-1      (ocf::heartbeat:galera):  Master controller-1
   galera-bundle-2      (ocf::heartbeat:galera):  Master controller-2
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp13/openstack-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis):  Master controller-0
   redis-bundle-1       (ocf::heartbeat:redis):  Slave controller-1
   redis-bundle-2       (ocf::heartbeat:redis):  Slave controller-2
 ip-192.168.24.8       (ocf::heartbeat:IPaddr2):  Started controller-0
 ip-10.0.0.108         (ocf::heartbeat:IPaddr2):  Started controller-1
 ip-172.17.1.19        (ocf::heartbeat:IPaddr2):  Started controller-2
 ip-172.17.1.15        (ocf::heartbeat:IPaddr2):  Started controller-2
 ip-172.17.3.15        (ocf::heartbeat:IPaddr2):  Started controller-0
 ip-172.17.4.18        (ocf::heartbeat:IPaddr2):  Started controller-1
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0  (ocf::heartbeat:docker):  Started controller-0
   haproxy-bundle-docker-1  (ocf::heartbeat:docker):  Started controller-1
   haproxy-bundle-docker-2  (ocf::heartbeat:docker):  Started controller-2
 Docker container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp13/openstack-ovn-northd:2018-03-02.2]
   ovn-dbs-bundle-0     (ocf::ovn:ovndb-servers):  Stopped controller-0
   ovn-dbs-bundle-1     (ocf::ovn:ovndb-servers):  Slave controller-1 (Monitoring)
   ovn-dbs-bundle-2     (ocf::ovn:ovndb-servers):  FAILED controller-2 (Monitoring)
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp13/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0  (ocf::heartbeat:docker):  Started controller-2

Failed Actions:
* ovn-dbs-bundle-docker-0_monitor_60000 on controller-0 'unknown error' (1): call=161, status=complete, exitreason='',
    last-rc-change='Tue Mar 27 07:57:47 2018', queued=0ms, exec=0ms
* rabbitmq_start_0 on rabbitmq-bundle-0 'unknown error' (1): call=77151, status=Timed Out, exitreason='',
    last-rc-change='Mon Mar 26 12:32:29 2018', queued=0ms, exec=200010ms
* ovndb_servers_monitor_10000 on ovn-dbs-bundle-2 'not running' (7): call=40771, status=complete, exitreason='',
    last-rc-change='Tue Mar 27 08:07:59 2018', queued=0ms, exec=1955ms
* ovndb_servers_demote_0 on ovn-dbs-bundle-1 'not running' (7): call=40815, status=complete, exitreason='',
    last-rc-change='Tue Mar 27 08:07:41 2018', queued=898ms, exec=1301ms
* ovndb_servers_start_0 on ovn-dbs-bundle-0 'unknown error' (1): call=8, status=Timed Out, exitreason='',
    last-rc-change='Tue Mar 27 07:59:16 2018', queued=0ms, exec=200456ms

Pacemaker logs:

Mar 27 08:02:37 [10] controller-0 pacemaker_remoted: notice: operation_finished: ovndb_servers_start_0:49:stderr [ ovs-appctl: cannot connect to "/var/run/openvswitch/ovnsb_db.ctl" (Connection refused) ]
Mar 27 08:02:37 [10] controller-0 pacemaker_remoted: notice: operation_finished: ovndb_servers_start_0:49:stderr [ 2018-03-27T08:02:36Z|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovnsb_db.ctl ]
Mar 27 08:02:37 [10] controller-0 pacemaker_remoted: notice: operation_finished: ovndb_servers_start_0:49:stderr [ ovs-appctl: cannot connect to "/var/run/openvswitch/ovnsb_db.ctl" (Connection refused) ]
Mar 27 08:02:37 [10] controller-0 pacemaker_remoted: info: log_finished: finished - rsc:ovndb_servers action:start call_id:8 pid:49 exit-code:1 exec-time:200456ms queue-time:0ms
Mar 27 08:02:40 [10] controller-0 pacemaker_remoted: info: log_execute: executing - rsc:ovndb_servers action:notify call_id:16
Mar 27 08:02:41 [10] controller-0 pacemaker_remoted: info: log_finished: finished - rsc:ovndb_servers action:notify call_id:16 pid:53569 exit-code:0 exec-time:1015ms queue-time:0ms
Mar 27 08:02:45 [10] controller-0 pacemaker_remoted: info: log_execute: executing - rsc:ovndb_servers action:notify call_id:17
Mar 27 08:02:46 [10] controller-0 pacemaker_remoted: info: log_finished: finished - rsc:ovndb_servers action:notify call_id:17 pid:53573 exit-code:0 exec-time:891ms queue-time:0ms
Mar 27 08:02:46 [10] controller-0 pacemaker_remoted: info: log_execute: executing - rsc:ovndb_servers action:stop call_id:18
Mar 27 08:02:47 [10] controller-0 pacemaker_remoted: notice: operation_finished: ovndb_servers_stop_0:53577:stderr [ 2018-03-27T08:02:47Z|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovn-northd.220.ctl ]
Mar 27 08:02:47 [10] controller-0 pacemaker_remoted: notice: operation_finished: ovndb_servers_stop_0:53577:stderr [ ovs-appctl: cannot connect to "/var/run/openvswitch/ovn-northd.220.ctl" (Connection refused) ]
Mar 27 08:02:47 [10] controller-0 pacemaker_remoted: notice: operation_finished: ovndb_servers_stop_0:53577:stderr [ 2018-03-27T08:02:47Z|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovnsb_db.ctl ]
Mar 27 08:02:47 [10] controller-0 pacemaker_remoted: notice: operation_finished: ovndb_servers_stop_0:53577:stderr [ ovs-appctl: cannot connect to "/var/run/openvswitch/ovnsb_db.ctl" (Connection refused) ]
Mar 27 08:02:47 [10] controller-0 pacemaker_remoted: info: log_finished: finished - rsc:ovndb_servers action:stop call_id:18 pid:53577 exit-code:0 exec-time:1117ms queue-time:794ms

/var/log/containers/openvswitch/ovn-controller.log:

2018-03-27T07:59:05.474Z|00265|reconnect|INFO|tcp:172.17.1.15:6642: connecting...
2018-03-27T07:59:05.474Z|00266|reconnect|INFO|tcp:172.17.1.15:6642: connected
2018-03-27T07:59:11.707Z|00267|jsonrpc|WARN|tcp:172.17.1.15:6642: receive error: Connection reset by peer
2018-03-27T07:59:11.708Z|00268|reconnect|WARN|tcp:172.17.1.15:6642: connection dropped (Connection reset by peer)
2018-03-27T07:59:11.708Z|00269|reconnect|INFO|tcp:172.17.1.15:6642: waiting 8 seconds before reconnect
2018-03-27T07:59:19.718Z|00270|reconnect|INFO|tcp:172.17.1.15:6642: connecting...
2018-03-27T07:59:19.718Z|00271|reconnect|INFO|tcp:172.17.1.15:6642: connection attempt failed (Connection refused)
2018-03-27T07:59:19.719Z|00272|reconnect|INFO|tcp:172.17.1.15:6642: waiting 8 seconds before reconnect
2018-03-27T07:59:27.729Z|00273|reconnect|INFO|tcp:172.17.1.15:6642: connecting...
2018-03-27T07:59:27.729Z|00274|reconnect|INFO|tcp:172.17.1.15:6642: connection attempt failed (Connection refused)
2018-03-27T07:59:27.729Z|00275|reconnect|INFO|tcp:172.17.1.15:6642: waiting 8 seconds before reconnect

Version-Release number of selected component (if applicable):
OSP13-HA-OVN 2018-03-02.2

How reproducible:
Always

Steps to Reproduce:
1. Deploy an OSP13-OVN HA setup.
2. On the OVN master node, stop the ovn-northd container (see the sketch after these steps).
3. Pacemaker does not restart and promote the OVN container after it is stopped.
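A minimal sketch of steps 2 and 3; the container name comes from the steps above, while the watch interval is an arbitrary illustrative choice. Run this on the controller that "pcs status" reports as Master for ovn-dbs-bundle:

    # Stop the OVN northd container out from under Pacemaker
    docker stop ovn-northd

    # Expected: Pacemaker restarts the container and promotes a replica
    # back to Master. Observed: the replicas stay Stopped/Slave/FAILED.
    watch -n 5 pcs status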
Quick question: how are you stopping the ovn-northd container? With "docker stop ovn-northd" or with "pcs resource disable ovn-dbs-bundle"?
(In reply to Damien Ciabrini from comment #1)
> Quick question: how are you stopping the ovn-northd container? With
> "docker stop ovn-northd" or with "pcs resource disable ovn-dbs-bundle"?

docker stop ovn-northd
At first sight I don't see Pacemaker misbehaving here:

* ovn-dbs-bundle-docker-0_monitor_60000 on controller-0 'unknown error' (1): call=161, status=complete, exitreason='',
    last-rc-change='Tue Mar 27 07:57:47 2018', queued=0ms, exec=0ms

This indicates that Pacemaker correctly figured out that the container it was managing had been stopped, which in turn made Pacemaker restart it.

* ovndb_servers_demote_0 on ovn-dbs-bundle-1 'not running' (7): call=40815, status=complete, exitreason='',
    last-rc-change='Tue Mar 27 08:07:41 2018', queued=898ms, exec=1301ms

This indicates that Pacemaker tried to stop the resource before restarting it; there was obviously nothing to stop, because the container was already gone ("not running").

* ovndb_servers_start_0 on ovn-dbs-bundle-0 'unknown error' (1): call=8, status=Timed Out, exitreason='',
    last-rc-change='Tue Mar 27 07:59:16 2018', queued=0ms, exec=200456ms

This error shows that Pacemaker restarted the container, but for some reason the start of the service never finished; it timed out after 200 seconds of trying.

From Eran's log in /var/log/containers/openvswitch/ovn-controller.log, we see that the service could never connect to tcp:172.17.1.15:6642. The OVN folks can probably tell more about why this could be happening.
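As an aside for anyone triaging a similar state, one possible way to check whether the southbound ovsdb-server is actually reachable (a diagnostic sketch, not part of the analysis above; the container name ovn-dbs-bundle-docker-0 and the socket path are taken from the pcs output and logs in this bug):

    # On the node holding ip-172.17.1.15: is anything listening on the SB port?
    ss -tnlp | grep 6642

    # Poke the same control socket the resource agent fails to reach; any
    # reply (here just the version string) means ovsdb-server is alive:
    docker exec ovn-dbs-bundle-docker-0 \
        ovs-appctl -t /var/run/openvswitch/ovnsb_db.ctl version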
Adding Numan from OVN-Dev team ^
I am looking into the issue
Hi,

A customer just opened a support case with the same issue. A quick summary:

- ovn-dbs-bundle seems to fail to promote any node to master.
- All controllers were rebooted and the issue persisted.
- Currently ovn-dbs-bundle is unmanaged from Pacemaker; the docker containers are running but don't seem to be working properly.

Sosreports and other logs are available.

Thanks.
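For the record, a possible sequence for handing the bundle back to Pacemaker once the underlying failure is fixed (a sketch using standard pcs subcommands, not taken from the case notes):

    # Return the bundle to Pacemaker's control and clear stale failures
    pcs resource manage ovn-dbs-bundle
    pcs resource cleanup ovn-dbs-bundle

    # Then check that one replica is promoted to Master again
    pcs status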
*** This bug has been marked as a duplicate of bug 1795697 ***