Bug 1560892 - Pacemaker does not restart & promote OVN container after stopping the container [NEEDINFO]
Summary: Pacemaker does not restart & promote OVN container after stopping the container
Keywords:
Status: CLOSED DUPLICATE of bug 1795697
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-ovn
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: urgent
Target Milestone: ---
Assignee: Numan Siddique
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On: 1795697
Blocks: ovncontainerization 1561591
 
Reported: 2018-03-27 08:37 UTC by Eran Kuris
Modified: 2020-02-20 14:31 UTC (History)
10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-20 14:31:31 UTC
Target Upstream Version:
twilson: needinfo? (ssigwald)



Description Eran Kuris 2018-03-27 08:37:51 UTC
Description of problem:
On a healthy OSP13-HA-OVN setup, I stopped the 192.168.24.1:8787/rhosp13/openstack-ovn-northd:2018-03-02.2 container.
The expected behavior is that Pacemaker restarts the Docker container and promotes it back.

The errors shown in "pcs status":
(overcloud) [root@controller-0 ~]#  pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-0 (version 1.1.18-11.el7-2b07d5c5a9) - partition with quorum
Last updated: Tue Mar 27 08:08:02 2018
Last change: Sun Mar 25 09:22:30 2018 by hacluster via crmd on controller-0

15 nodes configured
46 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 ovn-dbs-bundle-0@controller-0 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Stopped controller-0
   rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Started controller-1
   rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Started controller-2
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0	(ocf::heartbeat:galera):	Master controller-0
   galera-bundle-1	(ocf::heartbeat:galera):	Master controller-1
   galera-bundle-2	(ocf::heartbeat:galera):	Master controller-2
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp13/openstack-redis:pcmklatest]
   redis-bundle-0	(ocf::heartbeat:redis):	Master controller-0
   redis-bundle-1	(ocf::heartbeat:redis):	Slave controller-1
   redis-bundle-2	(ocf::heartbeat:redis):	Slave controller-2
 ip-192.168.24.8	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-10.0.0.108	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-172.17.1.19	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-172.17.1.15	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-172.17.3.15	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-172.17.4.18	(ocf::heartbeat:IPaddr2):	Started controller-1
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0	(ocf::heartbeat:docker):	Started controller-0
   haproxy-bundle-docker-1	(ocf::heartbeat:docker):	Started controller-1
   haproxy-bundle-docker-2	(ocf::heartbeat:docker):	Started controller-2
 Docker container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp13/openstack-ovn-northd:2018-03-02.2]
   ovn-dbs-bundle-0	(ocf::ovn:ovndb-servers):	Stopped controller-0
   ovn-dbs-bundle-1	(ocf::ovn:ovndb-servers):	Slave controller-1 (Monitoring)
   ovn-dbs-bundle-2	(ocf::ovn:ovndb-servers):	FAILED controller-2 (Monitoring)
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp13/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0	(ocf::heartbeat:docker):	Started controller-2

Failed Actions:
* ovn-dbs-bundle-docker-0_monitor_60000 on controller-0 'unknown error' (1): call=161, status=complete, exitreason='',
    last-rc-change='Tue Mar 27 07:57:47 2018', queued=0ms, exec=0ms
* rabbitmq_start_0 on rabbitmq-bundle-0 'unknown error' (1): call=77151, status=Timed Out, exitreason='',
    last-rc-change='Mon Mar 26 12:32:29 2018', queued=0ms, exec=200010ms
* ovndb_servers_monitor_10000 on ovn-dbs-bundle-2 'not running' (7): call=40771, status=complete, exitreason='',
    last-rc-change='Tue Mar 27 08:07:59 2018', queued=0ms, exec=1955ms
* ovndb_servers_demote_0 on ovn-dbs-bundle-1 'not running' (7): call=40815, status=complete, exitreason='',
    last-rc-change='Tue Mar 27 08:07:41 2018', queued=898ms, exec=1301ms
* ovndb_servers_start_0 on ovn-dbs-bundle-0 'unknown error' (1): call=8, status=Timed Out, exitreason='',
    last-rc-change='Tue Mar 27 07:59:16 2018', queued=0ms, exec=200456ms


Pacemaker logs:
Mar 27 08:02:37 [10] controller-0 pacemaker_remoted:   notice: operation_finished:      ovndb_servers_start_0:49:stderr [ ovs-appctl: cannot connect to "/var/run/openvswitch/ovnsb_db.ctl" (Connection refused) ]
Mar 27 08:02:37 [10] controller-0 pacemaker_remoted:   notice: operation_finished:      ovndb_servers_start_0:49:stderr [ 2018-03-27T08:02:36Z|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovnsb_db.ctl ]
Mar 27 08:02:37 [10] controller-0 pacemaker_remoted:   notice: operation_finished:      ovndb_servers_start_0:49:stderr [ ovs-appctl: cannot connect to "/var/run/openvswitch/ovnsb_db.ctl" (Connection refused) ]
Mar 27 08:02:37 [10] controller-0 pacemaker_remoted:     info: log_finished:    finished - rsc:ovndb_servers action:start call_id:8 pid:49 exit-code:1 exec-time:200456ms queue-time:0ms
Mar 27 08:02:40 [10] controller-0 pacemaker_remoted:     info: log_execute:     executing - rsc:ovndb_servers action:notify call_id:16
Mar 27 08:02:41 [10] controller-0 pacemaker_remoted:     info: log_finished:    finished - rsc:ovndb_servers action:notify call_id:16 pid:53569 exit-code:0 exec-time:1015ms queue-time:0ms
Mar 27 08:02:45 [10] controller-0 pacemaker_remoted:     info: log_execute:     executing - rsc:ovndb_servers action:notify call_id:17
Mar 27 08:02:46 [10] controller-0 pacemaker_remoted:     info: log_finished:    finished - rsc:ovndb_servers action:notify call_id:17 pid:53573 exit-code:0 exec-time:891ms queue-time:0ms
Mar 27 08:02:46 [10] controller-0 pacemaker_remoted:     info: log_execute:     executing - rsc:ovndb_servers action:stop call_id:18
Mar 27 08:02:47 [10] controller-0 pacemaker_remoted:   notice: operation_finished:      ovndb_servers_stop_0:53577:stderr [ 2018-03-27T08:02:47Z|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovn-northd.220.ctl ]
Mar 27 08:02:47 [10] controller-0 pacemaker_remoted:   notice: operation_finished:      ovndb_servers_stop_0:53577:stderr [ ovs-appctl: cannot connect to "/var/run/openvswitch/ovn-northd.220.ctl" (Connection refused) ]
Mar 27 08:02:47 [10] controller-0 pacemaker_remoted:   notice: operation_finished:      ovndb_servers_stop_0:53577:stderr [ 2018-03-27T08:02:47Z|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovnsb_db.ctl ]
Mar 27 08:02:47 [10] controller-0 pacemaker_remoted:   notice: operation_finished:      ovndb_servers_stop_0:53577:stderr [ ovs-appctl: cannot connect to "/var/run/openvswitch/ovnsb_db.ctl" (Connection refused) ]
Mar 27 08:02:47 [10] controller-0 pacemaker_remoted:     info: log_finished:    finished - rsc:ovndb_servers action:stop call_id:18 pid:53577 exit-code:0 exec-time:1117ms queue-time:794ms


/var/log/containers/openvswitch/ovn-controller.log 
2018-03-27T07:59:05.474Z|00265|reconnect|INFO|tcp:172.17.1.15:6642: connecting...
2018-03-27T07:59:05.474Z|00266|reconnect|INFO|tcp:172.17.1.15:6642: connected
2018-03-27T07:59:11.707Z|00267|jsonrpc|WARN|tcp:172.17.1.15:6642: receive error: Connection reset by peer
2018-03-27T07:59:11.708Z|00268|reconnect|WARN|tcp:172.17.1.15:6642: connection dropped (Connection reset by peer)
2018-03-27T07:59:11.708Z|00269|reconnect|INFO|tcp:172.17.1.15:6642: waiting 8 seconds before reconnect
2018-03-27T07:59:19.718Z|00270|reconnect|INFO|tcp:172.17.1.15:6642: connecting...
2018-03-27T07:59:19.718Z|00271|reconnect|INFO|tcp:172.17.1.15:6642: connection attempt failed (Connection refused)
2018-03-27T07:59:19.719Z|00272|reconnect|INFO|tcp:172.17.1.15:6642: waiting 8 seconds before reconnect
2018-03-27T07:59:27.729Z|00273|reconnect|INFO|tcp:172.17.1.15:6642: connecting...
2018-03-27T07:59:27.729Z|00274|reconnect|INFO|tcp:172.17.1.15:6642: connection attempt failed (Connection refused)
2018-03-27T07:59:27.729Z|00275|reconnect|INFO|tcp:172.17.1.15:6642: waiting 8 seconds before reconnect


Version-Release number of selected component (if applicable):
OSP13-HA-OVN
2018-03-02.2

How reproducible:
always

Steps to Reproduce:
1. Deploy an OSP13-OVN HA setup.
2. On the OVN master node, stop the ovn-northd container.
3. Pacemaker does not restart & promote the OVN container after it is stopped.
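The reproduction above can be sketched as a shell session (a sketch only; node and container names are taken from this report and may differ per deployment):

```shell
# On the controller currently running the OVN master (controller-0 here),
# stop the northd container out from under Pacemaker:
docker stop ovn-northd

# Watch Pacemaker's view of the bundle; the expectation is that the
# ovn-dbs-bundle replica is restarted and one replica is promoted to Master:
pcs status

# Failed actions for the bundle can be inspected in the output above and,
# once the root cause is fixed, cleared with:
pcs resource cleanup ovn-dbs-bundle
```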

Comment 1 Damien Ciabrini 2018-03-27 19:54:39 UTC
Quick question, how are you stopping the ovn-northd container? with "docker stop ovn-northd" or with "pcs resource disable ovn-dbs-bundle"?

Comment 2 Eran Kuris 2018-03-28 04:20:33 UTC
(In reply to Damien Ciabrini from comment #1)
> Quick question, how are you stopping the ovn-northd container? with "docker
> stop ovn-northd" or with "pcs resource disable ovn-dbs-bundle"?

docker stop ovn-northd
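For reference, the distinction Damien is drawing: stopping the container directly bypasses Pacemaker, while disabling the bundle goes through it. A sketch, assuming the resource names used elsewhere in this report:

```shell
# Bypasses the cluster manager; Pacemaker's monitor later notices the
# container is gone and tries to recover it:
docker stop ovn-northd

# Goes through Pacemaker: the bundle is stopped cleanly and stays stopped
# until explicitly re-enabled:
pcs resource disable ovn-dbs-bundle
pcs resource enable ovn-dbs-bundle
```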

Comment 3 Damien Ciabrini 2018-03-28 08:18:49 UTC
At first sight I don't see pacemaker misbehaving here:

* ovn-dbs-bundle-docker-0_monitor_60000 on controller-0 'unknown error' (1): call=161, status=complete, exitreason='',
    last-rc-change='Tue Mar 27 07:57:47 2018', queued=0ms, exec=0ms

This indicates that pacemaker correctly figured out that the container it was managing was stopped. This in turn made pacemaker restart it.

* ovndb_servers_demote_0 on ovn-dbs-bundle-1 'not running' (7): call=40815, status=complete, exitreason='',
    last-rc-change='Tue Mar 27 08:07:41 2018', queued=898ms, exec=1301ms

This indicates that pacemaker tried to stop the resource before restarting it. Obviously there was nothing to be stopped, because the container was already gone ("not running").

* ovndb_servers_start_0 on ovn-dbs-bundle-0 'unknown error' (1): call=8, status=Timed Out, exitreason='',
    last-rc-change='Tue Mar 27 07:59:16 2018', queued=0ms, exec=200456ms

This error shows that pacemaker restarted the container, but for some reason the restart of the service never finished. It timed out after 200 seconds of trying.

From Eran's log in /var/log/containers/openvswitch/ovn-controller.log, we see that the service could never connect to tcp:172.17.1.15:6642.

OVN folks can probably tell more on why this could be happening.
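A diagnostic sketch for the OVN side, following up on the failed ovs-appctl connections in the Pacemaker logs above (container name and socket paths are assumptions based on those logs):

```shell
# From the failing node, check whether the southbound ovsdb-server inside
# the ovn-dbs container is up and answering on its control socket:
docker exec -it ovn-dbs-bundle-docker-0 \
    ovs-appctl -t /var/run/openvswitch/ovnsb_db.ctl ovsdb-server/sync-status

# Check that something is actually listening on the SB DB port that
# ovn-controller keeps retrying (tcp:172.17.1.15:6642):
ss -tlnp | grep 6642
```

If the control socket connection is refused, as in the logs above, the ovsdb-server process inside the container likely never came up, which would explain both the start timeout and the failed reconnects from ovn-controller.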

Comment 4 Eran Kuris 2018-03-28 08:23:23 UTC
Adding Numan from OVN-Dev team ^

Comment 5 Numan Siddique 2018-03-28 11:10:45 UTC
I am looking into the issue

Comment 13 Siggy Sigwald 2020-02-01 18:00:12 UTC
Hi,
A customer just opened a support case with the same issue. A quick summary:
- ovn-dbs-bundle seems to fail to promote any node to master.
- All controllers were rebooted and the issue persisted.
- Currently ovn-dbs-bundle is unmanaged by pacemaker; the docker containers are running but don't seem to be working properly.
Sosreports and other logs are available.

Thanks.

Comment 18 Terry Wilson 2020-02-20 14:31:31 UTC

*** This bug has been marked as a duplicate of bug 1795697 ***

