Bug 1887606

Summary: rabbitmq pcs resource stuck in stopped after restart of a non-main-vip node, regression since puddle RHOS-16.1-RHEL-8-20201007.n.0
Product: Red Hat OpenStack
Reporter: pkomarov
Component: rabbitmq-server
Assignee: Peter Lemenkov <plemenko>
Status: CLOSED DUPLICATE
QA Contact: pkomarov
Severity: medium
Docs Contact:
Priority: medium
Version: 16.1 (Train)
CC: apevec, jeckersb, lhh, lmiccini, michele
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-08-19 08:24:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description pkomarov 2020-10-12 22:26:38 UTC
Description of problem:
The rabbitmq pcs resource is stuck in stopped after restart of a non-main-vip node. This is a regression since puddle RHOS-16.1-RHEL-8-20201007.n.0.

Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20201007.n.0

How reproducible:
100%

Steps to Reproduce:
Reproducer:
# Find the main-vip node:
        . /home/stack/overcloudrc && echo $OS_AUTH_URL | cut -d ':' -f2 | cut -d '/' -f3
# Hard reset the controllers that do not hold the main vip:
       ip a |grep "{{ hostvars['main_vip_uc']['value'] }}" ||
         (sleep 5s && echo b > /proc/sysrq-trigger)
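
For anyone re-running this by hand, the two reproducer steps can be sketched as small shell helpers. This is only a rough sketch: the function names `vip_from_auth_url` and `holds_vip` are hypothetical (they are not part of the test suite), the URL parsing assumes an IPv4 auth URL as in this job, and the match on `inet <ip>/` is slightly stricter than the plain grep above.

```shell
# Hypothetical helper: extract the host part of an IPv4 auth URL,
# using the same cut pipeline as the reproducer.
vip_from_auth_url() {
    printf '%s\n' "$1" | cut -d ':' -f2 | cut -d '/' -f3
}

# Hypothetical helper: read `ip a`-style output on stdin and succeed
# only if the given address is configured (matches "inet <ip>/").
holds_vip() {
    grep -qF "inet $1/"
}

# Possible usage on a controller (DESTRUCTIVE, left commented out):
#   vip=$(vip_from_auth_url "$OS_AUTH_URL")
#   ip a | holds_vip "$vip" || (sleep 5s && echo b > /proc/sysrq-trigger)
```

Matching on the trailing `/` avoids a false positive when one address is a prefix of another (for example 10.0.0.10 and 10.0.0.101).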


Actual results:
After the reboot, two rabbitmq pcs resources are stopped.

Expected results:
After the reboot, all rabbitmq pcs resources should be started.

Additional info:
Test run logs:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/Phase3/view/OSP%2016.1/view/PidOne/job/DFG-pidone-sanity-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-ansible-sts-sanity/18/artifact/ansible_sts_results/04_HARD_RESET_CONTROLLER_NON_VIP.log
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/Phase3/view/OSP%2016.1/view/PidOne/job/DFG-pidone-sanity-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-ansible-sts-sanity/17/artifact/ansible_sts_results/04_HARD_RESET_CONTROLLER_NON_VIP.log

Logs and files are here:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/Phase3/view/OSP%2016.1/view/PidOne/job/DFG-pidone-sanity-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-ansible-sts-sanity/17/artifact/

https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/Phase3/view/OSP%2016.1/view/PidOne/job/DFG-pidone-sanity-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-ansible-sts-sanity/18/artifact/

Comment 1 pkomarov 2020-10-12 22:30:23 UTC
After ~24 min, the resource is still stuck in this state:

  * Container bundle set: rabbitmq-bundle [cluster.common.tag/rhosp16-openstack-rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Starting controller-0
    * rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Stopped controller-1
    * rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Stopped controller-2
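
A quick way to check for this state from a script is to count the stopped rabbitmq-bundle replicas in the status output. The following is a minimal sketch, assuming the output format shown above; the function name `count_stopped_rabbitmq` is hypothetical and not part of pcs or the test suite.

```shell
# Hypothetical helper: read `pcs status`-style output on stdin and print
# the number of rabbitmq-bundle replicas reported as Stopped.
count_stopped_rabbitmq() {
    # Keep only replica lines (rabbitmq-bundle-<N>), then count "Stopped".
    grep -E 'rabbitmq-bundle-[0-9]+' | grep -c 'Stopped'
}

# Possible usage on a controller:
#   pcs status | count_stopped_rabbitmq
```

For the output in this comment the helper would print 2; on a healthy cluster it prints 0 (and, because of `grep -c`, returns a non-zero exit status in that case).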