Bug 1712896 - [OSP15] ovn-dbs-bundle-podman container is stuck in startup and can't load the OVN services after stopping the container
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 15.0 (Stein)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: 15.0 (Stein)
Assignee: Michele Baldessari
QA Contact: Roman Safronov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-05-22 12:37 UTC by Eran Kuris
Modified: 2019-09-26 10:51 UTC (History)
CC List: 11 users

Fixed In Version: openstack-tripleo-heat-templates-10.6.1-0.20190722170519.014b20c.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-21 11:22:28 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1834856 0 None None None 2019-07-01 12:59:07 UTC
OpenStack gerrit 668882 0 None MERGED Fix ovn dbs control port 2020-12-05 21:36:06 UTC
Red Hat Product Errata RHEA-2019:2811 0 None None None 2019-09-21 11:22:48 UTC

Description Eran Kuris 2019-05-22 12:37:32 UTC
Description of problem:
When I stop the ovn-dbs-bundle-podman container on the master node, pacemaker promotes one of the other controller nodes to be the master.

When I start the container that I just stopped, it hangs during startup and remains in "Stopped" status.
 

[root@controller-1 heat-admin]# podman logs ovn-dbs-bundle-podman-1
+ sudo -E kolla_set_configs
INFO:__main__:Loading config file at /var/lib/kolla/config_files/config.json
INFO:__main__:Validating config file
INFO:__main__:Kolla config strategy set to: COPY_ALWAYS
INFO:__main__:Copying service configuration files
INFO:__main__:Copying /dev/null to /etc/libqb/force-filesystem-sockets
INFO:__main__:Setting permission for /etc/libqb/force-filesystem-sockets
INFO:__main__:Writing out command to execute
++ cat /run_command
+ CMD=/usr/sbin/pacemaker_remoted
+ ARGS=
+ [[ ! -n '' ]]
+ . kolla_extend_start
++ [[ ! -d /var/log/kolla/openvswitch ]]
+++ stat -c %a /var/log/kolla/openvswitch
++ [[ 755 != \7\5\5 ]]
++ . /usr/local/bin/kolla_ovn_extend_start
+ echo 'Running command: '\''/usr/sbin/pacemaker_remoted'\'''
Running command: '/usr/sbin/pacemaker_remoted'
+ exec /usr/sbin/pacemaker_remoted
(crm_add_logfile) 	error: Directory '/var/log/pacemaker' does not exist: logging to '/var/log/pacemaker/pacemaker.log' is disabled
(crm_log_init) 	info: Changed active directory to /var/lib/pacemaker/cores
(qb_ipcs_us_publish) 	info: server name: lrmd
(lrmd_init_remote_tls_server) 	notice: Starting TLS listener on port 3125
(pcmk__init_tls_dh) 	info: Generating Diffie-Hellman parameters with 2048-bit prime for TLS
(qb_ipcs_us_publish) 	info: server name: cib_ro
(qb_ipcs_us_publish) 	info: server name: cib_rw
(qb_ipcs_us_publish) 	info: server name: cib_shm
(qb_ipcs_us_publish) 	info: server name: attrd
(qb_ipcs_us_publish) 	info: server name: stonith-ng
(qb_ipcs_us_publish) 	info: server name: crmd
(main) 	info: Starting
(crm_remote_accept) 	info: New remote connection from ::ffff:172.17.1.50
(lrmd_remote_listen) 	info: Remote client pending authentication | 0x55976d196560 id: 6086b9d6-18fb-484c-a508-5b20ce11c7d9
(remoted__read_handshake_data) 	notice: Remote client connection accepted
(process_lrmd_get_rsc_info) 	info: Resource 'ovndb_servers' not found (0 active resources)
(process_lrmd_get_rsc_info) 	info: Resource 'ovndb_servers:1' not found (0 active resources)
(process_lrmd_rsc_register) 	info: Added 'ovndb_servers' to the rsc list (1 active resources)
(log_execute) 	info: executing - rsc:ovndb_servers action:start call_id:8
(log_finished) 	info: finished - rsc:ovndb_servers action:start call_id:8 pid:99 exit-code:0 exec-time:1289ms queue-time:0ms
(log_execute) 	info: executing - rsc:ovndb_servers action:notify call_id:14
(log_finished) 	info: finished - rsc:ovndb_servers action:notify call_id:14 pid:246 exit-code:0 exec-time:439ms queue-time:0ms


[root@controller-1 heat-admin]# pcs status
Cluster name: tripleo_cluster                                                                                                                                                
Stack: corosync                                                                                                                                                              
Current DC: controller-0 (version 2.0.1-4.el8-0eb7991564) - partition with quorum                                                                                            
Last updated: Wed May 22 12:13:05 2019                                                                                                                                       
Last change: Wed May 22 12:11:56 2019 by ovn-dbs-bundle-0 via crm_attribute on controller-0                                                                                  
                                                                                                                                                                             
15 nodes configured                                                                                                                                                          
46 resources configured                                                                                                                                                      
                                                                                                                                                                             
Online: [ controller-0 controller-1 controller-2 ]                                                                                                                           
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 ovn-dbs-bundle-0@controller-0 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]
                                                                                                                                                                             
Full list of resources:                                                                                                                                                      
                                                                                                                                                                             
 podman container set: galera-bundle [192.168.24.1:8787/rhosp15/openstack-mariadb:pcmklatest]                                                                                
   galera-bundle-0      (ocf::heartbeat:galera):        Master controller-0                                                                                                  
   galera-bundle-1      (ocf::heartbeat:galera):        Master controller-1                                                                                                  
   galera-bundle-2      (ocf::heartbeat:galera):        Master controller-2                                                                                                  
 podman container set: rabbitmq-bundle [192.168.24.1:8787/rhosp15/openstack-rabbitmq:pcmklatest]                                                                             
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Started controller-0                                                                                         
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Started controller-1                                                                                         
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Started controller-2                                                                                         
 podman container set: redis-bundle [192.168.24.1:8787/rhosp15/openstack-redis:pcmklatest]                                                                                   
   redis-bundle-0       (ocf::heartbeat:redis): Master controller-0                                                                                                          
   redis-bundle-1       (ocf::heartbeat:redis): Slave controller-1                                                                                                           
   redis-bundle-2       (ocf::heartbeat:redis): Slave controller-2                                                                                                           
 ip-192.168.24.7        (ocf::heartbeat:IPaddr2):       Started controller-0                                                                                                 
 ip-10.0.0.103  (ocf::heartbeat:IPaddr2):       Started controller-1                                                                                                         
 ip-172.17.1.110        (ocf::heartbeat:IPaddr2):       Started controller-2
 ip-172.17.1.84 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.3.93 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.4.55 (ocf::heartbeat:IPaddr2):       Started controller-1
 podman container set: haproxy-bundle [192.168.24.1:8787/rhosp15/openstack-haproxy:pcmklatest]                                                                               
   haproxy-bundle-podman-0      (ocf::heartbeat:podman):        Started controller-0
   haproxy-bundle-podman-1      (ocf::heartbeat:podman):        Started controller-1
   haproxy-bundle-podman-2      (ocf::heartbeat:podman):        Started controller-2
 podman container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]                                                                            
   ovn-dbs-bundle-0     (ocf::ovn:ovndb-servers):       Master controller-0
   ovn-dbs-bundle-1     (ocf::ovn:ovndb-servers):       Stopped controller-1
   ovn-dbs-bundle-2     (ocf::ovn:ovndb-servers):       Slave controller-2
 podman container: openstack-cinder-volume [192.168.24.1:8787/rhosp15/openstack-cinder-volume:pcmklatest]                                                                    
   openstack-cinder-volume-podman-0     (ocf::heartbeat:podman):        Started controller-2                                                                                 

Failed Resource Actions:
* ovn-dbs-bundle-podman-1_monitor_60000 on controller-1 'not running' (7): call=181, status=complete, exitreason='',                                                         
    last-rc-change='Wed May 22 12:12:13 2019', queued=0ms, exec=0ms
* ovn-dbs-bundle-1_start_0 on controller-1 'unknown error' (1): call=31, status=Timed Out, exitreason='',                                                                    
    last-rc-change='Wed May 22 12:11:58 2019', queued=0ms, exec=0ms
* redis-bundle-podman-0_monitor_60000 on controller-0 'not running' (7): call=106, status=complete, exitreason='',                                                           
    last-rc-change='Wed May 22 11:45:56 2019', queued=0ms, exec=0ms
* redis-bundle-0_monitor_30000 on controller-0 'unknown error' (1): call=15, status=Error, exitreason='',                                                                    
    last-rc-change='Wed May 22 11:45:44 2019', queued=0ms, exec=0ms
* ovn-dbs-bundle-1_start_0 on controller-0 'unknown error' (1): call=32, status=Timed Out, exitreason='',                                                                    
    last-rc-change='Wed May 22 12:12:39 2019', queued=0ms, exec=0ms
* ovn-dbs-bundle-1_start_0 on controller-2 'unknown error' (1): call=33, status=Timed Out, exitreason='',                                                                    
    last-rc-change='Wed May 22 12:12:17 2019', queued=0ms, exec=0ms
Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


Version-Release number of selected component (if applicable):
OpenStack/15.0-RHEL-8/RHOS_TRUNK-15.0-RHEL-8-20190509.n.1
[root@controller-1 heat-admin]# rpm -qa | grep pacemaker
pacemaker-cli-2.0.1-4.el8.x86_64
ansible-pacemaker-1.0.4-0.20190418190349.0e4d7c0.el8ost.noarch
pacemaker-libs-2.0.1-4.el8.x86_64
pacemaker-remote-2.0.1-4.el8.x86_64
pacemaker-schemas-2.0.1-4.el8.noarch
pacemaker-cluster-libs-2.0.1-4.el8.x86_64
puppet-pacemaker-0.7.3-0.20190425160350.c6b01cc.el8ost.noarch
pacemaker-2.0.1-4.el8.x86_64
[root@controller-1 heat-admin]# rpm -qa | grep ovn
puppet-ovn-14.4.1-0.20190420043650.20090c6.el8ost.noarch
ovn2.11-2.11.0-16.el8fdp.x86_64
rhosp-openvswitch-ovn-common-2.11-0.3.el8ost.noarch


How reproducible:
100%

Steps to Reproduce:
1. pcs status (1 node is master, the other 2 are slaves)
2. Stop the OVN container on the master node: podman stop ovn-dbs-bundle-podman-0
3. pcs status (another node becomes the master; 1 slave, 1 in Stopped status)
4. Start the container that you stopped: podman start ovn-dbs-bundle-podman-0
5. pcs status (the status has not changed: 1 master, 1 slave, 1 in Stopped status)
6. podman logs ovn-dbs-bundle-podman-0 shows that the container started, but it cannot load the OVN services (see the consolidated sketch below).
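
A consolidated shell sketch of the reproduction steps above, run from a controller node; the replica names (ovn-dbs-bundle-podman-0, etc.) are the ones shown in pcs status and may differ per deployment:

  # 1. Note which replica is currently the OVN master.
  sudo pcs status | grep -A3 'ovn-dbs-bundle \['

  # 2. Stop the podman container backing the master replica (replica 0 here).
  sudo podman stop ovn-dbs-bundle-podman-0

  # 3. Confirm another replica gets promoted to master.
  sudo pcs status | grep -A3 'ovn-dbs-bundle \['

  # 4. Start the container again and check that its replica stays in Stopped status.
  sudo podman start ovn-dbs-bundle-podman-0
  sudo pcs status | grep -A3 'ovn-dbs-bundle \['

  # 5. The container log shows pacemaker_remoted starting, but OVN is never loaded.
  sudo podman logs ovn-dbs-bundle-podman-0 | tail -n 40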




Additional info:
sos-report will be added.

Comment 3 Damien Ciabrini 2019-05-24 08:36:51 UTC
So the test consists of killing the ovn container with podman and seeing how pacemaker reacts.
(FTR, pacemaker is in charge of the entire lifecycle of the ovn container.)

Internally, pacemaker uses three nested resources to manage the ovn service (see the inspection sketch after this list):

  * ovn-dbs-bundle-podman-1: the resource in charge of creating/starting/stopping the podman container that will host the ovn server
    podman actions are implemented by the podman resource agent in /usr/lib/ocf/resource.d/heartbeat/podman

  * ovn-dbs-bundle-1: the resource that is in charge of registering a pacemaker remote node in the cluster, from the pacemaker_remote process that runs in ovn's podman container.
    that action is managed internally by pacemaker, there's no external shell script involved

  * ovndb_servers: the ovn resource agent, in charge of starting/monitoring OVN when pacemaker tells it to.
    those actions are implemented in the ovn resource agent in /usr/share/openvswitch/scripts/ovndb-servers.ocf
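
A hedged sketch of how each of these three layers can be inspected on a controller node (resource/bundle names are taken from the pcs status output above; 'pcs resource config' may be 'pcs resource show --full' on older pcs releases):

  # Bundle definition: podman container options plus the remote connection settings.
  sudo pcs resource config ovn-dbs-bundle

  # The podman resource agent that creates/starts/stops the container.
  less /usr/lib/ocf/resource.d/heartbeat/podman

  # The OVN resource agent that is driven over the pacemaker_remote connection.
  less /usr/share/openvswitch/scripts/ovndb-servers.ocf

  # State of the container itself, outside of pacemaker's view.
  sudo podman ps --all --filter name=ovn-dbs-bundle-podman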


- The "podman stop" action occurs around 14:03, on the podman container that hosts the Master replica of the ovn A/P resource, controller-1


- From pacemaker's pengine logs at that time:

Pacemaker notices that the ovn resource needs to fail over to another node (controller-0), and schedules a clean stop of the nested resources on controller-1:
May 22 14:03:18 controller-0 pacemaker-schedulerd[2869]: notice:  * Move       ip-172.17.1.84                       (       controller-1 -> controller-0 )
May 22 14:03:18 controller-0 pacemaker-schedulerd[2869]: notice:  * Promote    ovndb_servers:0                      (   Slave -> Master ovn-dbs-bundle-0 )
May 22 14:03:18 controller-0 pacemaker-schedulerd[2869]: notice:  * Stop       ovn-dbs-bundle-1                     (                       controller-1 )   due to node availability
May 22 14:03:18 controller-0 pacemaker-schedulerd[2869]: notice:  * Stop       ovndb_servers:1                      (            Master ovn-dbs-bundle-1 )   due to unrunnable ovn-dbs-bundle-1 start

and it can't stop the pacemaker remote on controller-1 since the podman container is gone:

May 22 14:03:22 controller-0 pacemaker-attrd[2868]: notice: Removing all ovn-dbs-bundle-1 attributes for peer controller-1
May 22 14:03:22 controller-0 pacemaker-fenced[2866]: notice: Node ovn-dbs-bundle-1 state is now lost
May 22 14:03:22 controller-0 pacemaker-controld[2870]: notice: Initiating notify operation ovndb_servers_post_notify_stop_0 on ovn-dbs-bundle-2
May 22 14:03:22 controller-0 pacemaker-controld[2870]: notice: Node ovn-dbs-bundle-1 state is now lost

Then, it schedules a restart of the pacemaker_remote resource (and of the ocf resource), but strangely, it doesn't schedule a restart of the podman resource yet?

May 22 14:03:26 controller-0 pacemaker-controld[2870]: notice: Transition 143 (Complete=35, Pending=0, Fired=0, Skipped=1, Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-input-137.bz2): Stopped
May 22 14:03:26 controller-0 pacemaker-schedulerd[2869]: notice:  * Start      ovn-dbs-bundle-1                     (                       controller-1 )
May 22 14:03:26 controller-0 pacemaker-schedulerd[2869]: notice:  * Start      ovndb_servers:1                      (                   ovn-dbs-bundle-1 )

Eventually, this makes the pacemaker_remote resource fail to restart, because no container is running at that time on controller-1:

May 22 14:03:43 controller-0 pacemaker-controld[2870]: warning: Action 201 (ovn-dbs-bundle-1_start_0) on controller-1 failed (target: 0 vs. rc: 1): Error
May 22 14:03:43 controller-0 pacemaker-controld[2870]: notice: Transition 144 aborted by operation ovn-dbs-bundle-1_start_0 'modify' on controller-1: Event failed
May 22 14:03:43 controller-0 pacemaker-controld[2870]: notice: Transition 144 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=13, Source=/var/lib/pacemaker/pengine/pe-input-138.bz2): Complete

At that time, pacemaker schedules a restart of all resources on controller-1, including the podman container:
May 22 14:03:43 controller-0 pacemaker-schedulerd[2869]: warning: Processing failed start of ovn-dbs-bundle-1 on controller-1: unknown error
May 22 14:03:43 controller-0 pacemaker-schedulerd[2869]: warning: Processing failed start of ovn-dbs-bundle-1 on controller-1: unknown error
May 22 14:03:43 controller-0 pacemaker-schedulerd[2869]: notice:  * Restart    ovn-dbs-bundle-podman-1              (                       controller-1 )
May 22 14:03:43 controller-0 pacemaker-schedulerd[2869]: notice:  * Recover    ovn-dbs-bundle-1                     (                       controller-1 )
May 22 14:03:43 controller-0 pacemaker-schedulerd[2869]: notice:  * Start      ovndb_servers:1                      (                   ovn-dbs-bundle-1 )

but apparently it's already too late to recover the pacemaker_remote resource on controller-1:
May 22 14:03:43 controller-0 pacemaker-schedulerd[2869]: warning: Processing failed start of ovn-dbs-bundle-1 on controller-1: unknown error
May 22 14:03:43 controller-0 pacemaker-schedulerd[2869]: warning: Processing failed start of ovn-dbs-bundle-1 on controller-1: unknown error
May 22 14:03:43 controller-0 pacemaker-schedulerd[2869]: warning: Forcing ovn-dbs-bundle-1 away from controller-1 after 1000000 failures (max=1000000)
May 22 14:03:43 controller-0 pacemaker-schedulerd[2869]: notice:  * Restart    ovn-dbs-bundle-podman-1              (                       controller-1 )
May 22 14:03:43 controller-0 pacemaker-schedulerd[2869]: notice:  * Recover    ovn-dbs-bundle-1                     (       controller-1 -> controller-2 )
May 22 14:03:43 controller-0 pacemaker-schedulerd[2869]: notice:  * Start      ovndb_servers:1                      (                   ovn-dbs-bundle-1 )

Pacemaker then attempts to restart the pacemaker_remote resource on controller-2 (which looks like a bad transition to me), then on controller-0, and ultimately stops trying to restart it:
May 22 14:04:27 controller-0 pacemaker-schedulerd[2869]: warning: Forcing ovn-dbs-bundle-1 away from controller-0 after 1000000 failures (max=1000000)
May 22 14:04:27 controller-0 pacemaker-schedulerd[2869]: warning: Forcing ovn-dbs-bundle-1 away from controller-1 after 1000000 failures (max=1000000)
May 22 14:04:27 controller-0 pacemaker-schedulerd[2869]: warning: Forcing ovn-dbs-bundle-1 away from controller-2 after 1000000 failures (max=1000000)
May 22 14:04:27 controller-0 pacemaker-schedulerd[2869]: notice:  * Restart    ovn-dbs-bundle-podman-1              (                       controller-1 )
May 22 14:04:27 controller-0 pacemaker-schedulerd[2869]: notice:  * Stop       ovn-dbs-bundle-1                     (                       controller-0 )   due to node availability
May 22 14:04:27 controller-0 pacemaker-schedulerd[2869]: notice:  * Start      ovndb_servers:1                      (                   ovn-dbs-bundle-1 )   due to unrunnable ovn-dbs-bundle-1 start (blocked)

By that time, the podman container has been restarted on controller-1, but pacemaker has stopped trying to restart the pacemaker_remote resource on controller-1, which is why it's not showing up in 'pcs status'.

The problem looks to be pacemaker scheduling an unwanted/unexpected/invalid transition.

Comment 6 Andrew Beekhof 2019-05-30 11:57:11 UTC
This smells like a problem with the podman agent's monitor operation.
If it's not reporting the correct status, then this could happen.

Another possibility is that it's a symptom of the monitor interval being too long - causing other parts of pacemaker to recognise failures higher up the resource stack.

If it's a timing issue, then perhaps it makes sense to schedule a monitor of the podman resource when we detect the connection die, and to make that monitor op a prerequisite for recovering the connection.

Comment 7 Ken Gaillot 2019-05-31 19:51:02 UTC
Neither the logs nor the pe-inputs from Comment 3 are in the report in Comment 1 ... any chance of getting them?

The cluster has stonith disabled, which of course isn't supported and can cause problems in recovery.
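
For reference, a quick way to check whether fencing is configured and enabled on such an environment (a small sketch; the property name is the standard pacemaker one):

  # stonith-enabled defaults to true; 'false' here means fencing is disabled.
  sudo pcs property list --all | grep stonith-enabled

  # Any configured fence devices show up as stonith:* resources in the status output.
  sudo pcs status | grep stonith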

Comment 9 Roman Safronov 2019-06-11 08:14:20 UTC
(In reply to Ken Gaillot from comment #7)
> Neither the logs nor the pe-inputs from Comment 3 are in the report in
> Comment 1 ... any chance of getting them?
> 
> The cluster has stonith disabled, which of course isn't supported and can
> cause problems in recovery.

I encountered the same issue on my environment with stonith enabled.
Logs can be found at: http://rhos-release.virt.bos.redhat.com/log/bz1712896c
If any additional logs are needed, I can provide them, as well as access to the environment.

[heat-admin@controller-1 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-0 (version 2.0.1-4.el8-0eb7991564) - partition with quorum
Last updated: Tue Jun 11 08:03:51 2019
Last change: Mon Jun 10 20:25:06 2019 by ovn-dbs-bundle-0 via crm_attribute on controller-0

15 nodes configured
49 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 ovn-dbs-bundle-0@controller-0 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full list of resources:

 podman container set: galera-bundle [192.168.24.1:8787/rhosp15/openstack-mariadb:pcmklatest]
   galera-bundle-0	(ocf::heartbeat:galera):	Master controller-0
   galera-bundle-1	(ocf::heartbeat:galera):	Master controller-1
   galera-bundle-2	(ocf::heartbeat:galera):	Master controller-2
 podman container set: rabbitmq-bundle [192.168.24.1:8787/rhosp15/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Started controller-0
   rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Started controller-1
   rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Started controller-2
 podman container set: redis-bundle [192.168.24.1:8787/rhosp15/openstack-redis:pcmklatest]
   redis-bundle-0	(ocf::heartbeat:redis):	Master controller-0
   redis-bundle-1	(ocf::heartbeat:redis):	Slave controller-1
   redis-bundle-2	(ocf::heartbeat:redis):	Slave controller-2
 ip-192.168.24.20	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-10.0.0.125	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-172.17.1.21	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-172.17.1.28	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-172.17.3.122	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-172.17.4.115	(ocf::heartbeat:IPaddr2):	Started controller-0
 podman container set: haproxy-bundle [192.168.24.1:8787/rhosp15/openstack-haproxy:pcmklatest]
   haproxy-bundle-podman-0	(ocf::heartbeat:podman):	Started controller-0
   haproxy-bundle-podman-1	(ocf::heartbeat:podman):	Started controller-1
   haproxy-bundle-podman-2	(ocf::heartbeat:podman):	Started controller-2
 podman container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0	(ocf::ovn:ovndb-servers):	Master controller-0
   ovn-dbs-bundle-1	(ocf::ovn:ovndb-servers):	Stopped controller-1
   ovn-dbs-bundle-2	(ocf::ovn:ovndb-servers):	Slave controller-2
 podman container: openstack-cinder-volume [192.168.24.1:8787/rhosp15/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-podman-0	(ocf::heartbeat:podman):	Started controller-1
 stonith-fence_ipmilan-525400c73bc7	(stonith:fence_ipmilan):	Started controller-0
 stonith-fence_ipmilan-525400327d9c	(stonith:fence_ipmilan):	Started controller-0
 stonith-fence_ipmilan-52540065e8b5	(stonith:fence_ipmilan):	Started controller-1

Failed Resource Actions:
* ovn-dbs-bundle-1_start_0 on controller-0 'unknown error' (1): call=27, status=Timed Out, exitreason='',
    last-rc-change='Mon Jun 10 20:25:50 2019', queued=0ms, exec=0ms
* ovn-dbs-bundle-1_start_0 on controller-2 'unknown error' (1): call=24, status=Timed Out, exitreason='',
    last-rc-change='Mon Jun 10 20:25:27 2019', queued=0ms, exec=0ms
* ovn-dbs-bundle-podman-1_monitor_60000 on controller-1 'not running' (7): call=115, status=complete, exitreason='',
    last-rc-change='Mon Jun 10 20:25:14 2019', queued=0ms, exec=0ms
* ovn-dbs-bundle-1_start_0 on controller-1 'unknown error' (1): call=25, status=Timed Out, exitreason='',
    last-rc-change='Mon Jun 10 20:25:07 2019', queued=0ms, exec=0ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


Puddle
15.0-RHEL-8/RHOS_TRUNK-15.0-RHEL-8-20190527.n.0

[heat-admin@controller-1 ~]$ rpm -qa  | grep pacemaker
pacemaker-cli-2.0.1-4.el8.x86_64
ansible-pacemaker-1.0.4-0.20190418190349.0e4d7c0.el8ost.noarch
pacemaker-libs-2.0.1-4.el8.x86_64
pacemaker-remote-2.0.1-4.el8.x86_64
pacemaker-schemas-2.0.1-4.el8.noarch
pacemaker-cluster-libs-2.0.1-4.el8.x86_64
puppet-pacemaker-0.7.3-0.20190524140403.0c5d699.el8ost.noarch
pacemaker-2.0.1-4.el8.x86_64



[heat-admin@controller-1 ~]$ sudo podman logs ovn-dbs-bundle-podman-1
+ sudo -E kolla_set_configs
INFO:__main__:Loading config file at /var/lib/kolla/config_files/config.json
INFO:__main__:Validating config file
INFO:__main__:Kolla config strategy set to: COPY_ALWAYS
INFO:__main__:Copying service configuration files
INFO:__main__:Copying /dev/null to /etc/libqb/force-filesystem-sockets
INFO:__main__:Setting permission for /etc/libqb/force-filesystem-sockets
INFO:__main__:Writing out command to execute
++ cat /run_command
+ CMD=/usr/sbin/pacemaker_remoted
+ ARGS=
+ [[ ! -n '' ]]
+ . kolla_extend_start
++ [[ ! -d /var/log/kolla/openvswitch ]]
+++ stat -c %a /var/log/kolla/openvswitch
++ [[ 755 != \7\5\5 ]]
++ . /usr/local/bin/kolla_ovn_extend_start
+ echo 'Running command: '\''/usr/sbin/pacemaker_remoted'\'''
Running command: '/usr/sbin/pacemaker_remoted'
+ exec /usr/sbin/pacemaker_remoted
(crm_add_logfile) 	error: Directory '/var/log/pacemaker' does not exist: logging to '/var/log/pacemaker/pacemaker.log' is disabled
(crm_log_init) 	info: Changed active directory to /var/lib/pacemaker/cores
(qb_ipcs_us_publish) 	info: server name: lrmd
(lrmd_init_remote_tls_server) 	notice: Starting TLS listener on port 3125
(pcmk__init_tls_dh) 	info: Generating Diffie-Hellman parameters with 2048-bit prime for TLS
(qb_ipcs_us_publish) 	info: server name: cib_ro
(qb_ipcs_us_publish) 	info: server name: cib_rw
(qb_ipcs_us_publish) 	info: server name: cib_shm
(qb_ipcs_us_publish) 	info: server name: attrd
(qb_ipcs_us_publish) 	info: server name: stonith-ng
(qb_ipcs_us_publish) 	info: server name: crmd
(main) 	info: Starting

Comment 10 Ken Gaillot 2019-06-17 17:43:00 UTC
This is an interesting scenario. I believe this is a network or firewall configuration issue rather than a pacemaker bug, though it's possible pacemaker should behave differently.

The conditions under which the issue occurs are:

* The start-failure-is-fatal cluster property is true (which is the default).

* The ovn-dbs-bundle is not configured with its own IP, but instead uses network="host" and control-port="3125".

(BTW these are appropriate, and not problems in themselves.)
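
For reference, a small sketch of how both conditions can be confirmed on the cluster (exact output depends on the pcs version):

  # start-failure-is-fatal is a cluster property; unset means the default (true).
  sudo pcs property list --all | grep start-failure-is-fatal

  # The ovn-dbs bundle uses host networking and a fixed pacemaker_remote control port.
  sudo pcs resource config ovn-dbs-bundle | grep -Ei 'network|control-port'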


The sequence of events in the incident in Comment 9 is:

* The manual container stop (on controller-1) causes pacemaker-remoted to shut down gracefully.

* In response, Pacemaker stops the containerized resource and its remote connection. At this point, the container's recurring monitor has not yet detected the container is down, so pacemaker doesn't know it's down yet. Therefore, pacemaker also schedules the remote connection to start again. This is correct behavior; it's possible for a remote connection to fail without the container failing, so we can't assume the container is failed.

* The connection start times out, as would be expected since the container is actually down. This timeout is shown in the failed actions in pcs status. A key consequence is the connection is banned from controller-1 (since start-failure-is-fatal=true).

* The container monitor does fail on its next run (which is also shown in the failed actions). Pacemaker properly schedules recovery of the container and connection. However, the container is scheduled to be recovered on controller-1, while the connection is scheduled to be recovered on controller-2. This is not automatically a problem; Pacemaker is designed so that bundle containers and bundle connections can run on different nodes. That feature is present so that the container can run on a Pacemaker Remote node while the connection runs on a full cluster node, but that's not the situation here, so one could make an argument that it would be better to ban the entire bundle (and not just the connection) from controller-1, at least when start-failure-is-fatal=true. 

* The container is successfully restarted on controller-1, but controller-2 (and later controller-0) is unable to reach it for some reason:

Jun 10 20:25:47 controller-2 pacemaker-controld  [314912] (lrmd_tcp_connect_cb)         info: Could not connect to Pacemaker Remote at controller-1:3125

I suspect either controller-2 can't resolve the controller-1 name, or a firewall is blocking controller-2 from reaching the control-port on controller-1. Would it be possible to verify whether that's the case?
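
A pure-bash sketch for checking both hypotheses from another controller (hostname and port taken from the log line above):

  # Does this node resolve the controller-1 name at all?
  getent hosts controller-1

  # Can this node reach the pacemaker_remote control port on controller-1?
  timeout 2 bash -c 'exec 3<>/dev/tcp/controller-1/3125' \
    && echo 'port 3125 reachable' \
    || echo 'port 3125 unreachable (filtered or closed)'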

Comment 11 Roman Safronov 2019-06-18 15:12:18 UTC
Hi Ken,

It seems like you're right. I see that port 3125 is filtered.

[heat-admin@controller-2 ~]$ nmap -sT controller-1 -p 3125
Starting Nmap 7.70 ( https://nmap.org ) at 2019-06-18 15:00 UTC
Nmap scan report for controller-1 (172.17.1.88)
Host is up (0.0013s latency).
rDNS record for 172.17.1.88: controller-1.localdomain

PORT     STATE    SERVICE
3125/tcp filtered a13-an

Nmap done: 1 IP address (1 host up) scanned in 0.25 seconds

Comment 12 Ken Gaillot 2019-06-18 15:42:50 UTC
(In reply to Roman Safronov from comment #11)
> Hi Ken,
> 
> It seems like you're right. I see that port 3125 is filtered.
> 
> [heat-admin@controller-2 ~]$ nmap -sT controller-1 -p 3125
> Starting Nmap 7.70 ( https://nmap.org ) at 2019-06-18 15:00 UTC
> Nmap scan report for controller-1 (172.17.1.88)
> Host is up (0.0013s latency).
> rDNS record for 172.17.1.88: controller-1.localdomain
> 
> PORT     STATE    SERVICE
> 3125/tcp filtered a13-an
> 
> Nmap done: 1 IP address (1 host up) scanned in 0.25 seconds

That's good to hear, for confirmation. The ports specified as "control-port" in the Pacemaker configuration should be open on each cluster node to all other cluster nodes. You can see the Pacemaker configuration with "pcs config". I'll look into whether our documentation needs to be updated. Let me know if that doesn't resolve the issue.
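
As a hedged, manual workaround (assuming firewalld manages the host firewall), the control port could be opened on every controller node like this; on TripleO/OSP deployments the firewall rules are normally generated from the deployment templates, which is what the linked tripleo-heat-templates change ("Fix ovn dbs control port") addresses, so a manual change would only be temporary:

  # Run on each controller; 3125 is the control-port from the bundle configuration.
  sudo firewall-cmd --permanent --add-port=3125/tcp
  sudo firewall-cmd --reload

  # Confirm the rule is active.
  sudo firewall-cmd --list-ports | grep 3125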

Comment 16 Roman Safronov 2019-07-29 10:42:57 UTC
Verified on 15.0-RHEL-8/RHOS_TRUNK-15.0-RHEL-8-20190725.n.1 with openstack-tripleo-heat-templates-10.6.1-0.20190725000448.e49b8db.el8ost.noarch
Verified that TCP port 3125 is accessible between controller nodes.
Verified that after stopping the master ovn-dbs-bundle-podman container, another container becomes the master and the stopped container is restarted automatically.
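
A short sketch of this verification, assuming the same controller and bundle names as above:

  # Control port must be reachable between controllers.
  timeout 2 bash -c 'exec 3<>/dev/tcp/controller-1/3125' && echo reachable

  # Stop the container backing the current master replica...
  sudo podman stop ovn-dbs-bundle-podman-0

  # ...and confirm another replica is promoted and the stopped one recovers automatically.
  sudo pcs status | grep -A3 'ovn-dbs-bundle \['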

Comment 20 errata-xmlrpc 2019-09-21 11:22:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:2811

