Bug 1652613 - [OVN] slave nodes don't get promoted to master after force shutdown of the master node [regression]
Summary: [OVN] slave nodes don't get promoted to master after force shutdown of the master node [regression]
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-ovn
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Damien Ciabrini
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On: 1652752
Blocks: 1658631
 
Reported: 2018-11-22 13:19 UTC by Eran Kuris
Modified: 2019-09-09 13:30 UTC
CC: 14 users

Fixed In Version: pacemaker-1.1.19-8.el7_6.2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned To: 1658631
Environment:
Last Closed: 2019-01-10 14:55:36 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Eran Kuris 2018-11-22 13:19:14 UTC
Description of problem:
Force shut down controller-0, which is the master node, expecting one of the slave nodes to be promoted to master, but the promotion does not happen.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Deploy an OVN setup.
2. Shut down the master node (see the note on sysrq below):
   echo o > /proc/sysrq-trigger
3. Check pcs status.
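Note on step 2: writing "o" to /proc/sysrq-trigger makes the kernel power the machine off immediately, with no clean shutdown, which simulates a hard node failure. On some hosts the magic SysRq interface has to be enabled first; a minimal sketch (the sysctl step is an assumption about the test host, not part of the original report):

   # Enable all magic SysRq functions (may already be enabled on the host).
   echo 1 > /proc/sys/kernel/sysrq
   # Immediate power-off with no clean shutdown -- simulates a hard failure.
   echo o > /proc/sysrq-trigger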

Actual results:


Expected results:


Additional info:

Comment 2 Numan Siddique 2018-11-22 15:39:05 UTC
Here are my findings so far

Before resetting or stopping controller-0, here is the pcs status:

******
[root@controller-0 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 1.1.19-8.el7_6.1-c3c624ea3d) - partition with quorum
Last updated: Thu Nov 22 15:24:20 2018
Last change: Thu Nov 22 15:23:50 2018 by root via crm_resource on controller-0

15 nodes configured
46 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 ovn-dbs-bundle-0@controller-0 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Started controller-0
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Started controller-1
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Started controller-2
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp14/openstack-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Master controller-0
   galera-bundle-1      (ocf::heartbeat:galera):        Master controller-1
   galera-bundle-2      (ocf::heartbeat:galera):        Master controller-2
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp14/openstack-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Master controller-0
   redis-bundle-1       (ocf::heartbeat:redis): Slave controller-1
   redis-bundle-2       (ocf::heartbeat:redis): Slave controller-2
 ip-192.168.24.9        (ocf::heartbeat:IPaddr2):       Started controller-2
 ip-10.0.0.110  (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-172.17.1.17 (ocf::heartbeat:IPaddr2):       Started controller-2
 ip-172.17.1.12 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.3.13 (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-172.17.4.19 (ocf::heartbeat:IPaddr2):       Started controller-2
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp14/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Started controller-0
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Started controller-1
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Started controller-2
 Docker container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp14/openstack-ovn-northd:pcmknum]
   ovn-dbs-bundle-0     (ocf::ovn:ovndb-servers):       Master controller-0
   ovn-dbs-bundle-1     (ocf::ovn:ovndb-servers):       Slave controller-1
   ovn-dbs-bundle-2     (ocf::ovn:ovndb-servers):       Slave controller-2
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp14/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0     (ocf::heartbeat:docker):        Started controller-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
*****

When I run either of the commands below:

#pcs cluster stop controller-0
or
# echo o >/proc/sysrq-trigger

Pacemaker moves the VIP resource (ip-172.17.1.12) to controller-1, but it never promotes the ovn-dbs-bundle resource on controller-1.
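For anyone triaging a similar non-promotion, the scheduler's promotion scores and node attributes can be inspected with standard Pacemaker tooling (illustrative commands, not taken from this report):

   crm_simulate -sL | grep -i promotion   # show promotion scores from the live CIB
   crm_mon -1A                            # one-shot cluster status, including node attributes
   pcs resource show ovn-dbs-bundle       # dump the bundle's configuration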

Below is the output of pcs status on controller-1:

****
[root@controller-1 openvswitch]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 1.1.19-8.el7_6.1-c3c624ea3d) - partition with quorum
Last updated: Thu Nov 22 15:29:26 2018
Last change: Thu Nov 22 15:23:50 2018 by root via crm_resource on controller-0

15 nodes configured
46 resources configured

Online: [ controller-1 controller-2 ]
OFFLINE: [ controller-0 ]
GuestOnline: [ galera-bundle-1@controller-1 galera-bundle-2@controller-2 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Stopped
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Started controller-1
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Started controller-2
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp14/openstack-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Stopped
   galera-bundle-1      (ocf::heartbeat:galera):        Master controller-1
   galera-bundle-2      (ocf::heartbeat:galera):        Master controller-2
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp14/openstack-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Stopped
   redis-bundle-1       (ocf::heartbeat:redis): Slave controller-1
   redis-bundle-2       (ocf::heartbeat:redis): Slave controller-2
 ip-192.168.24.9        (ocf::heartbeat:IPaddr2):       Started controller-2
 ip-10.0.0.110  (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-172.17.1.17 (ocf::heartbeat:IPaddr2):       Started controller-2
 ip-172.17.1.12 (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-172.17.3.13 (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-172.17.4.19 (ocf::heartbeat:IPaddr2):       Started controller-2
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp14/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Stopped
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Started controller-1
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Started controller-2
 Docker container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp14/openstack-ovn-northd:pcmknum]
   ovn-dbs-bundle-0     (ocf::ovn:ovndb-servers):       Stopped
   ovn-dbs-bundle-1     (ocf::ovn:ovndb-servers):       Slave controller-1
   ovn-dbs-bundle-2     (ocf::ovn:ovndb-servers):       Slave controller-2
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp14/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0     (ocf::heartbeat:docker):        Started controller-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
*****

Before doing this testing, I had put some log messages in ovndb-servers.ocf, and I don't see any promote/start action called on controller-1. Not even the monitor action is called.
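For reference, the debug hook was along these lines (a reconstruction based on the log format below; the exact lines added to ovndb-servers.ocf are not preserved in this report):

   # Hypothetical reconstruction of the debug hook: append a timestamped
   # record of every RA invocation to a log file on the host.
   DEBUG_LOG=/var/log/containers/openvswitch/pcs_debug.txt
   {
       echo "******************************************************"
       date -u
       echo "ACTION = $__OCF_ACTION"   # set by ocf-shellfuncs from the RA's first argument
   } >> "$DEBUG_LOG"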

Below is what I see on controller-1:

###################
[root@controller-1 heat-admin]# tail -f /var/log/containers/openvswitch/pcs_debug.txt 
ACTION = monitor
ovsdb_server_check_status : sb_status = running/backup : nb_status = running/backup
******************************************************
Thu Nov 22 15:25:21 UTC 2018
ACTION = notify
ovsdb_serveR_notify  : type_op = pre-stop
******************************************************
Thu Nov 22 15:25:23 UTC 2018
ACTION = notify
ovsdb_serveR_notify  : type_op = post-stop

******************************************************
Thu Nov 22 15:25:42 UTC 2018
ACTION = notify
ovsdb_serveR_notify  : type_op = pre-promote


******************************************************
Thu Nov 22 15:28:12 UTC 2018
ACTION = notify
ovsdb_serveR_notify  : type_op = pre-promote
******************************************************
Thu Nov 22 15:30:43 UTC 2018
ACTION = notify
ovsdb_serveR_notify  : type_op = pre-promote
###########

Every 2 or 3 minutes, Pacemaker calls the pre-promote notify action. The OVN RA script just returns OCF_SUCCESS for notify actions; the only notify type it handles is post-promote.
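Paraphrased, the notify handling in ovndb-servers.ocf is structured roughly like this (a sketch, not the verbatim upstream script):

   # Only post-promote does real work; every other notify type/operation
   # pair falls through and reports success to Pacemaker.
   ovsdb_server_notify() {
       type_op="${OCF_RESKEY_CRM_meta_notify_type}-${OCF_RESKEY_CRM_meta_notify_operation}"
       case "$type_op" in
           post-promote)
               # repoint the local ovsdb-servers at the newly promoted master
               ;;
       esac
       return $OCF_SUCCESS
   }

So a pre-promote notify that is never followed by an actual promote call, as seen in the log above, leaves the slaves untouched.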


Below is what is seen on controller-2:

#######################
******************************************************
Thu Nov 22 15:25:21 UTC 2018
ACTION = notify
ovsdb_serveR_notify  : type_op = pre-stop
******************************************************
Thu Nov 22 15:25:23 UTC 2018
ACTION = notify
ovsdb_serveR_notify  : type_op = post-stop

******************************************************
Thu Nov 22 15:25:38 UTC 2018
ACTION = monitor
ovsdb_server_check_status : sb_status = running/backup : nb_status = running/backup

******************************************************
Thu Nov 22 15:25:42 UTC 2018
ACTION = notify
ovsdb_serveR_notify  : type_op = pre-promote


******************************************************
Thu Nov 22 15:26:09 UTC 2018
ACTION = monitor
ovsdb_server_check_status : sb_status = running/backup : nb_status = running/backup
******************************************************
Thu Nov 22 15:26:40 UTC 2018
ACTION = monitor
ovsdb_server_check_status : sb_status = running/backup : nb_status = running/backup
******************************************************
Thu Nov 22 15:27:11 UTC 2018
ACTION = monitor
ovsdb_server_check_status : sb_status = running/backup : nb_status = running/backup
******************************************************
Thu Nov 22 15:27:42 UTC 2018
ACTION = monitor
ovsdb_server_check_status : sb_status = running/backup : nb_status = running/backup
******************************************************
Thu Nov 22 15:28:12 UTC 2018
ACTION = notify
ovsdb_serveR_notify  : type_op = pre-promote
******************************************************
##################################


[root@controller-1 openvswitch]# rpm -qa | grep pcs
pcs-0.9.165-6.el7.x86_64
[root@controller-1 openvswitch]# rpm -qa | grep pace
pacemaker-cli-1.1.19-8.el7_6.1.x86_64
pacemaker-1.1.19-8.el7_6.1.x86_64
ansible-pacemaker-1.0.4-0.20180827141254.0e4d7c0.el7ost.noarch
pacemaker-libs-1.1.19-8.el7_6.1.x86_64
userspace-rcu-0.7.16-2.el7cp.x86_64
puppet-pacemaker-0.7.2-0.20181008172519.9a4bc2d.el7ost.noarch
pacemaker-cluster-libs-1.1.19-8.el7_6.1.x86_64
pacemaker-remote-1.1.19-8.el7_6.1.x86_64


Once I start Pacemaker on controller-0, everything is back to normal:

[root@controller-0 heat-admin]# pcs cluster start controller-0

[root@controller-0 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 1.1.19-8.el7_6.1-c3c624ea3d) - partition with quorum
Last updated: Thu Nov 22 15:37:15 2018
Last change: Thu Nov 22 15:23:50 2018 by root via crm_resource on controller-0

15 nodes configured
46 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 ovn-dbs-bundle-0@controller-0 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Started controller-0
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Started controller-1
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Started controller-2
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp14/openstack-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Master controller-0
   galera-bundle-1      (ocf::heartbeat:galera):        Master controller-1
   galera-bundle-2      (ocf::heartbeat:galera):        Master controller-2
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp14/openstack-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Master controller-0
   redis-bundle-1       (ocf::heartbeat:redis): Slave controller-1
   redis-bundle-2       (ocf::heartbeat:redis): Slave controller-2
 ip-192.168.24.9        (ocf::heartbeat:IPaddr2):       Started controller-2
 ip-10.0.0.110  (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-172.17.1.17 (ocf::heartbeat:IPaddr2):       Started controller-2
 ip-172.17.1.12 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.3.13 (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-172.17.4.19 (ocf::heartbeat:IPaddr2):       Started controller-2
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp14/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Started controller-0
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Started controller-1
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Started controller-2
 Docker container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp14/openstack-ovn-northd:pcmknum]
   ovn-dbs-bundle-0     (ocf::ovn:ovndb-servers):       Master controller-0
   ovn-dbs-bundle-1     (ocf::ovn:ovndb-servers):       Slave controller-1
   ovn-dbs-bundle-2     (ocf::ovn:ovndb-servers):       Slave controller-2
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp14/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0     (ocf::heartbeat:docker):        Started controller-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


It would be great if the pidone team could take a look.

Comment 3 Damien Ciabrini 2018-11-22 22:04:11 UTC
After initial investigation this seems to be a Pacemaker issue.
I've created a dedicated RHEL bz [1] with an isolated reproducer.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1652752

Comment 18 Eran Kuris 2019-01-10 12:45:30 UTC
Fix verified on:
OpenStack/14.0-RHEL-7/2019-01-07.1/
[root@controller-0 ~]# rpm -qa | grep pacemaker
pacemaker-cluster-libs-1.1.19-8.el7_6.2.x86_64
pacemaker-remote-1.1.19-8.el7_6.2.x86_64
pacemaker-libs-1.1.19-8.el7_6.2.x86_64
pacemaker-1.1.19-8.el7_6.2.x86_64
puppet-pacemaker-0.7.2-0.20181008172520.9a4bc2d.el7ost.noarch
ansible-pacemaker-1.0.4-0.20180827141254.0e4d7c0.el7ost.noarch
pacemaker-cli-1.1.19-8.el7_6.2.x86_64


   haproxy-bundle-docker-2	(ocf::heartbeat:docker):	Started controller-2
 Docker container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp14/openstack-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0	(ocf::ovn:ovndb-servers):	Stopped
   ovn-dbs-bundle-1	(ocf::ovn:ovndb-servers):	Master controller-1
   ovn-dbs-bundle-2	(ocf::ovn:ovndb-servers):	Slave controller-2
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp14/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0	(ocf::heartbeat:docker):	Started controller-2

