Description of problem:
After running a minor update on an OSP12 HA setup with SRIOV support, I saw that rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster) was Stopped on controller-0. After a few minutes (~5) it started there and stopped on another controller node. After ~30 minutes it was started on all controller nodes.

Version-Release number of selected component (if applicable):
OSP-12, latest puddle 2017-12-01.4

[root@controller-0 ~]# rpm -qa | grep rabbit
rabbitmq-server-3.6.5-5.el7ost.noarch
puppet-rabbitmq-5.6.0-5.70ac6c5git.el7ost.noarch

[root@controller-0 ~]# rpm -qa | grep pacemaker
ansible-pacemaker-1.0.3-2.el7ost.noarch
pacemaker-libs-1.1.16-12.el7_4.5.x86_64
pacemaker-cluster-libs-1.1.16-12.el7_4.5.x86_64
pacemaker-1.1.16-12.el7_4.5.x86_64
puppet-pacemaker-0.6.0-2.el7ost.noarch
pacemaker-cli-1.1.16-12.el7_4.5.x86_64
pacemaker-remote-1.1.16-12.el7_4.5.x86_64

https://drive.google.com/open?id=1SjrZnYnrNTu6gw7iLRXRe8_J7aAprBGI

Steps to Reproduce:
1. Run a minor update on an OSP12 HA deployment.

Deployment files:
https://code.engineering.redhat.com/gerrit/gitweb?p=Neutron-QE.git;a=tree;f=BM_heat_template/ospd-12-multiple-nic-vlans-sriov-hybrid-ha;h=df451d66a38902ddfc97d7e7533c0b2b8926ed08;hb=refs/heads/master

Additional info:
The controllers are deployed with low RAM; according to the HA team, this may be the cause of the issue.
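One way to observe the flapping described above is to poll pcs status in a loop; a minimal sketch (not part of the original reproduction steps; the 10-second interval and grep pattern are arbitrary illustration choices):

# Poll the bundle resources and print state transitions with timestamps
while true; do
    date
    pcs status | grep -E 'rabbitmq-bundle|galera-bundle|redis-bundle'
    sleep 10
done

# Since low controller RAM is a suspected factor, memory pressure can be
# sampled alongside:
free -m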
The output of pcs status:

Stack: corosync
Current DC: controller-0 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Wed Dec  6 13:07:15 2017
Last change: Wed Dec  6 09:00:29 2017 by root via crm_resource on controller-2

12 nodes configured
37 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 rabbitmq-bundle-0@controller-2 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-0 redis-bundle-0@controller-0 ]

Active resources:

 ip-192.168.24.14       (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-10.35.166.47        (ocf::heartbeat:IPaddr2):       Started controller-2
 ip-10.35.1.12          (ocf::heartbeat:IPaddr2):       Started controller-2
 ip-10.35.1.8           (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-10.35.3.61          (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-10.35.4.86          (ocf::heartbeat:IPaddr2):       Started controller-1
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Started controller-0
 Docker container set: rabbitmq-bundle [rhos-qe-mirror-tlv.usersys.redhat.com:5000/rhosp12/openstack-rabbitmq-docker:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Started controller-2
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      FAILED controller-1
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Started controller-0
 Docker container set: galera-bundle [rhos-qe-mirror-tlv.usersys.redhat.com:5000/rhosp12/openstack-mariadb-docker:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Stopped controller-0
   galera-bundle-1      (ocf::heartbeat:galera):        Master controller-1
   galera-bundle-2      (ocf::heartbeat:galera):        Master controller-2
 Docker container set: redis-bundle [rhos-qe-mirror-tlv.usersys.redhat.com:5000/rhosp12/openstack-redis-docker:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Master controller-0
   redis-bundle-1       (ocf::heartbeat:redis): Stopped controller-1
   redis-bundle-2       (ocf::heartbeat:redis): Stopped

Every few minutes, pcs status shows that all resources are started on all nodes.
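In case it helps triage, a sketch of standard Pacemaker/RabbitMQ checks that could narrow down the bundle failure (the docker name filter is an assumption; adjust it to the local container naming):

# How many failures pacemaker has recorded for the rabbitmq bundle
pcs resource failcount show rabbitmq-bundle

# RabbitMQ's own view of cluster membership, from inside the bundle container
docker exec $(docker ps -q --filter name=rabbitmq-bundle) rabbitmqctl cluster_status

# After investigating, clear the recorded failures so pacemaker retries cleanly
pcs resource cleanup rabbitmq-bundle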
Likely a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1519765
*** This bug has been marked as a duplicate of bug 1519765 ***