Description of problem:
Forced a shutdown of controller-0, the node hosting the master OVN DB server, and expected one of the slave nodes to be promoted to master, but this does not happen.

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Deploy an OVN setup.
2. Shut down the master node: echo o > /proc/sysrq-trigger
3. Check pcs status.

Actual results:
No surviving ovn-dbs-bundle replica is promoted to Master.

Expected results:
One of the remaining ovn-dbs-bundle replicas is promoted to Master.

Additional info:
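The check in step 3 can be scripted. Below is a minimal sketch (the `ovn_master_node` helper is hypothetical, not part of the deployment) that parses `pcs status` output and reports which node, if any, currently hosts the Master ovn-dbs-bundle replica:

```shell
#!/bin/sh
# Print the node hosting the Master ovn-dbs-bundle replica, or "none"
# when no replica has been promoted. Reads `pcs status` output from
# stdin so it can also be run against a captured status dump.
ovn_master_node() {
    awk '/ovn-dbs-bundle-[0-9]/ && /Master/ { print $NF; found=1 }
         END { if (!found) print "none" }'
}

# Typical use on a controller (hypothetical wrapper, for illustration):
#   pcs status | ovn_master_node
```

After the failover described in this bug, the helper would keep printing "none", since no replica ever leaves the Slave role.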
Here are my findings so far.

Before resetting or stopping cluster services on controller-0, this is the pcs status:

******
[root@controller-0 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 1.1.19-8.el7_6.1-c3c624ea3d) - partition with quorum
Last updated: Thu Nov 22 15:24:20 2018
Last change: Thu Nov 22 15:23:50 2018 by root via crm_resource on controller-0

15 nodes configured
46 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 ovn-dbs-bundle-0@controller-0 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started controller-0
   rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started controller-1
   rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started controller-2
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp14/openstack-mariadb:pcmklatest]
   galera-bundle-0 (ocf::heartbeat:galera): Master controller-0
   galera-bundle-1 (ocf::heartbeat:galera): Master controller-1
   galera-bundle-2 (ocf::heartbeat:galera): Master controller-2
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp14/openstack-redis:pcmklatest]
   redis-bundle-0 (ocf::heartbeat:redis): Master controller-0
   redis-bundle-1 (ocf::heartbeat:redis): Slave controller-1
   redis-bundle-2 (ocf::heartbeat:redis): Slave controller-2
 ip-192.168.24.9 (ocf::heartbeat:IPaddr2): Started controller-2
 ip-10.0.0.110 (ocf::heartbeat:IPaddr2): Started controller-1
 ip-172.17.1.17 (ocf::heartbeat:IPaddr2): Started controller-2
 ip-172.17.1.12 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-172.17.3.13 (ocf::heartbeat:IPaddr2): Started controller-1
 ip-172.17.4.19 (ocf::heartbeat:IPaddr2): Started controller-2
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp14/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0 (ocf::heartbeat:docker): Started controller-0
   haproxy-bundle-docker-1 (ocf::heartbeat:docker): Started controller-1
   haproxy-bundle-docker-2 (ocf::heartbeat:docker): Started controller-2
 Docker container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp14/openstack-ovn-northd:pcmknum]
   ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
   ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-1
   ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-2
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp14/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0 (ocf::heartbeat:docker): Started controller-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
*****

When I run either of the commands below:

# pcs cluster stop controller-0
or
# echo o > /proc/sysrq-trigger

pacemaker moves the VIP resource ip-172.17.1.12 to controller-1, but it never promotes the ovn-dbs-bundle resource on controller-1.
Below is the output of pcs status on controller-1:

****
[root@controller-1 openvswitch]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 1.1.19-8.el7_6.1-c3c624ea3d) - partition with quorum
Last updated: Thu Nov 22 15:29:26 2018
Last change: Thu Nov 22 15:23:50 2018 by root via crm_resource on controller-0

15 nodes configured
46 resources configured

Online: [ controller-1 controller-2 ]
OFFLINE: [ controller-0 ]
GuestOnline: [ galera-bundle-1@controller-1 galera-bundle-2@controller-2 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped
   rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started controller-1
   rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started controller-2
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp14/openstack-mariadb:pcmklatest]
   galera-bundle-0 (ocf::heartbeat:galera): Stopped
   galera-bundle-1 (ocf::heartbeat:galera): Master controller-1
   galera-bundle-2 (ocf::heartbeat:galera): Master controller-2
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp14/openstack-redis:pcmklatest]
   redis-bundle-0 (ocf::heartbeat:redis): Stopped
   redis-bundle-1 (ocf::heartbeat:redis): Slave controller-1
   redis-bundle-2 (ocf::heartbeat:redis): Slave controller-2
 ip-192.168.24.9 (ocf::heartbeat:IPaddr2): Started controller-2
 ip-10.0.0.110 (ocf::heartbeat:IPaddr2): Started controller-1
 ip-172.17.1.17 (ocf::heartbeat:IPaddr2): Started controller-2
 ip-172.17.1.12 (ocf::heartbeat:IPaddr2): Started controller-1
 ip-172.17.3.13 (ocf::heartbeat:IPaddr2): Started controller-1
 ip-172.17.4.19 (ocf::heartbeat:IPaddr2): Started controller-2
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp14/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0 (ocf::heartbeat:docker): Stopped
   haproxy-bundle-docker-1 (ocf::heartbeat:docker): Started controller-1
   haproxy-bundle-docker-2 (ocf::heartbeat:docker): Started controller-2
 Docker container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp14/openstack-ovn-northd:pcmknum]
   ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Stopped
   ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-1
   ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-2
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp14/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0 (ocf::heartbeat:docker): Started controller-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
*****

Before running this test, I had added some log messages to ovndb-servers.ocf, and I don't see any promote/start action called on controller-1. Not even the monitor action is called. Below is what I see on controller-1:

###################
[root@controller-1 heat-admin]# tail -f /var/log/containers/openvswitch/pcs_debug.txt
ACTION = monitor
ovsdb_server_check_status : sb_status = running/backup : nb_status = running/backup
******************************************************
Thu Nov 22 15:25:21 UTC 2018
ACTION = notify
ovsdb_serveR_notify : type_op = pre-stop
******************************************************
Thu Nov 22 15:25:23 UTC 2018
ACTION = notify
ovsdb_serveR_notify : type_op = post-stop
******************************************************
Thu Nov 22 15:25:42 UTC 2018
ACTION = notify
ovsdb_serveR_notify : type_op = pre-promote
******************************************************
Thu Nov 22 15:28:12 UTC 2018
ACTION = notify
ovsdb_serveR_notify : type_op = pre-promote
******************************************************
Thu Nov 22 15:30:43 UTC 2018
ACTION = notify
ovsdb_serveR_notify : type_op = pre-promote
###########

Every 2 or 3 minutes, pacemaker calls the pre-promote notify action. The OVN RA script just returns OCF_SUCCESS for notify actions; it handles only the post-promote type_op. Below is what is seen on controller-2:

#######################
******************************************************
Thu Nov 22 15:25:21 UTC 2018
ACTION = notify
ovsdb_serveR_notify : type_op = pre-stop
******************************************************
Thu Nov 22 15:25:23 UTC 2018
ACTION = notify
ovsdb_serveR_notify : type_op = post-stop
******************************************************
Thu Nov 22 15:25:38 UTC 2018
ACTION = monitor
ovsdb_server_check_status : sb_status = running/backup : nb_status = running/backup
******************************************************
Thu Nov 22 15:25:42 UTC 2018
ACTION = notify
ovsdb_serveR_notify : type_op = pre-promote
******************************************************
Thu Nov 22 15:26:09 UTC 2018
ACTION = monitor
ovsdb_server_check_status : sb_status = running/backup : nb_status = running/backup
******************************************************
Thu Nov 22 15:26:40 UTC 2018
ACTION = monitor
ovsdb_server_check_status : sb_status = running/backup : nb_status = running/backup
******************************************************
Thu Nov 22 15:27:11 UTC 2018
ACTION = monitor
ovsdb_server_check_status : sb_status = running/backup : nb_status = running/backup
******************************************************
Thu Nov 22 15:27:42 UTC 2018
ACTION = monitor
ovsdb_server_check_status : sb_status = running/backup : nb_status = running/backup
******************************************************
Thu Nov 22 15:28:12 UTC 2018
ACTION = notify
ovsdb_serveR_notify : type_op = pre-promote
******************************************************
##################################

[root@controller-1 openvswitch]# rpm -qa | grep pcs
pcs-0.9.165-6.el7.x86_64
[root@controller-1 openvswitch]# rpm -qa | grep pace
pacemaker-cli-1.1.19-8.el7_6.1.x86_64
pacemaker-1.1.19-8.el7_6.1.x86_64
ansible-pacemaker-1.0.4-0.20180827141254.0e4d7c0.el7ost.noarch
pacemaker-libs-1.1.19-8.el7_6.1.x86_64
userspace-rcu-0.7.16-2.el7cp.x86_64
puppet-pacemaker-0.7.2-0.20181008172519.9a4bc2d.el7ost.noarch
pacemaker-cluster-libs-1.1.19-8.el7_6.1.x86_64
pacemaker-remote-1.1.19-8.el7_6.1.x86_64

Once I start pacemaker on controller-0, everything is back to normal:

[root@controller-0 heat-admin]# pcs cluster start controller-0
[root@controller-0 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 1.1.19-8.el7_6.1-c3c624ea3d) - partition with quorum
Last updated: Thu Nov 22 15:37:15 2018
Last change: Thu Nov 22 15:23:50 2018 by root via crm_resource on controller-0

15 nodes configured
46 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 ovn-dbs-bundle-0@controller-0 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started controller-0
   rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started controller-1
   rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started controller-2
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp14/openstack-mariadb:pcmklatest]
   galera-bundle-0 (ocf::heartbeat:galera): Master controller-0
   galera-bundle-1 (ocf::heartbeat:galera): Master controller-1
   galera-bundle-2 (ocf::heartbeat:galera): Master controller-2
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp14/openstack-redis:pcmklatest]
   redis-bundle-0 (ocf::heartbeat:redis): Master controller-0
   redis-bundle-1 (ocf::heartbeat:redis): Slave controller-1
   redis-bundle-2 (ocf::heartbeat:redis): Slave controller-2
 ip-192.168.24.9 (ocf::heartbeat:IPaddr2): Started controller-2
 ip-10.0.0.110 (ocf::heartbeat:IPaddr2): Started controller-1
 ip-172.17.1.17 (ocf::heartbeat:IPaddr2): Started controller-2
 ip-172.17.1.12 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-172.17.3.13 (ocf::heartbeat:IPaddr2): Started controller-1
 ip-172.17.4.19 (ocf::heartbeat:IPaddr2): Started controller-2
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp14/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0 (ocf::heartbeat:docker): Started controller-0
   haproxy-bundle-docker-1 (ocf::heartbeat:docker): Started controller-1
   haproxy-bundle-docker-2 (ocf::heartbeat:docker): Started controller-2
 Docker container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp14/openstack-ovn-northd:pcmknum]
   ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
   ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-1
   ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-2
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp14/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0 (ocf::heartbeat:docker): Started controller-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

It would be great if the pidone team could take a look.
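For context on the notify behavior described above (every notify sub-operation returns OCF_SUCCESS, with real work done only for post-promote), a simplified sketch of such an OCF-style notify handler is shown below. This is not the actual ovndb-servers.ocf code, just an illustration of the pattern; the environment variable names are the standard ones pacemaker exports for clone notifications.

```shell
#!/bin/sh
# Simplified sketch of an OCF-style notify handler. Pacemaker exports the
# notify sub-operation via OCF_RESKEY_CRM_meta_notify_type (pre|post) and
# OCF_RESKEY_CRM_meta_notify_operation (start|stop|promote|demote).
OCF_SUCCESS=0

ovsdb_server_notify() {
    type_op="${OCF_RESKEY_CRM_meta_notify_type}-${OCF_RESKEY_CRM_meta_notify_operation}"
    echo "ovsdb_server_notify : type_op = $type_op"
    case "$type_op" in
        post-promote)
            # The real RA reconfigures replicas toward the new master here.
            ;;
        *)
            # All other sub-operations are acknowledged and ignored, which
            # is why the repeated pre-promote calls in the logs are no-ops.
            ;;
    esac
    return $OCF_SUCCESS
}
```

Since pre-promote is a no-op, the repeated pre-promote notifications seen in the debug log never advance the failover; the missing piece is that pacemaker never issues the actual promote operation on the surviving node.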
After initial investigation this seems to be a pacemaker issue. I've created a dedicated RHEL bz [1] with an isolated reproducer.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1652752
Fix verified: OpenStack/14.0-RHEL-7/2019-01-07.1/

[root@controller-0 ~]# rpm -qa | grep pacemaker
pacemaker-cluster-libs-1.1.19-8.el7_6.2.x86_64
pacemaker-remote-1.1.19-8.el7_6.2.x86_64
pacemaker-libs-1.1.19-8.el7_6.2.x86_64
pacemaker-1.1.19-8.el7_6.2.x86_64
puppet-pacemaker-0.7.2-0.20181008172520.9a4bc2d.el7ost.noarch
ansible-pacemaker-1.0.4-0.20180827141254.0e4d7c0.el7ost.noarch
pacemaker-cli-1.1.19-8.el7_6.2.x86_64

pcs status after failover (excerpt):

   haproxy-bundle-docker-2 (ocf::heartbeat:docker): Started controller-2
 Docker container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp14/openstack-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Stopped
   ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Master controller-1
   ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-2
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp14/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0 (ocf::heartbeat:docker): Started controller-2