Bug 1887606 - rabbitmq pcs resource stuck in stopped after a non-main-vip ip node restart, regression since puddle RHOS-16.1-RHEL-8-20201007.n.0
Summary: rabbitmq pcs resource stuck in stopped after a non-main-vip ip node restart,...
Keywords:
Status: CLOSED DUPLICATE of bug 1986998
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
: ---
Assignee: Peter Lemenkov
QA Contact: pkomarov
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-10-12 22:26 UTC by pkomarov
Modified: 2022-08-10 10:37 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-19 08:24:06 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-7209 0 None None None 2022-08-10 10:37:18 UTC

Description pkomarov 2020-10-12 22:26:38 UTC
Description of problem:
rabbitmq pcs resource stuck in stopped after a non-main-vip ip node restart, regression since puddle RHOS-16.1-RHEL-8-20201007.n.0

Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20201007.n.0


How reproducible:
100%

Steps to Reproduce:
Reproducer:
# Find the main-VIP node (the main VIP is the host part of the Keystone endpoint URL):
. /home/stack/overcloudrc && echo $OS_AUTH_URL | cut -d ':' -f2 | cut -d '/' -f3
# Hard-reset the non-main-VIP controllers (the Jinja expression below is resolved by the CI playbook to the VIP found above):
ip a | grep "{{ hostvars['main_vip_uc']['value'] }}" ||
  (sleep 5s && echo b > /proc/sysrq-trigger)
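For clarity, the two steps above can be sketched as a standalone script. This is a hedged sketch only: it assumes the standard overcloudrc on the undercloud, controllers reachable over ssh as heat-admin, and the hostnames controller-{0,1,2} are illustrative, not taken from this environment.

```shell
#!/bin/sh
# Sketch of the reproducer (assumptions noted above).

# The main VIP is the host component of the Keystone endpoint URL,
# e.g. http://10.0.0.5:5000/v3 -> 10.0.0.5
extract_vip() {
    echo "$1" | cut -d ':' -f2 | cut -d '/' -f3
}

if [ -f /home/stack/overcloudrc ]; then
    . /home/stack/overcloudrc
    main_vip=$(extract_vip "$OS_AUTH_URL")

    # Hard-reset every controller that does NOT hold the VIP.
    # 'echo b > /proc/sysrq-trigger' reboots the node immediately with
    # no clean shutdown, which is what exposes the rabbitmq recovery bug.
    for node in controller-0 controller-1 controller-2; do
        ssh heat-admin@"$node" \
            "ip a | grep -q '$main_vip' || { sleep 5; echo b | sudo tee /proc/sysrq-trigger; }"
    done
fi
```

After the reset, watch `pcs status` on a surviving controller to see whether the rabbitmq-bundle replicas recover.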


Actual results:
After the reboot, two rabbitmq pcs resources remain stopped.

Expected results:
After the reboot, all rabbitmq pcs resources should be started.

Additional info:
Test run logs:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/Phase3/view/OSP%2016.1/view/PidOne/job/DFG-pidone-sanity-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-ansible-sts-sanity/18/artifact/ansible_sts_results/04_HARD_RESET_CONTROLLER_NON_VIP.log
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/Phase3/view/OSP%2016.1/view/PidOne/job/DFG-pidone-sanity-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-ansible-sts-sanity/17/artifact/ansible_sts_results/04_HARD_RESET_CONTROLLER_NON_VIP.log

Logs and files are here:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/Phase3/view/OSP%2016.1/view/PidOne/job/DFG-pidone-sanity-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-ansible-sts-sanity/17/artifact/

https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/Phase3/view/OSP%2016.1/view/PidOne/job/DFG-pidone-sanity-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-ansible-sts-sanity/18/artifact/

Comment 1 pkomarov 2020-10-12 22:30:23 UTC
After ~24 minutes the resources are still stuck in this state:

  * Container bundle set: rabbitmq-bundle [cluster.common.tag/rhosp16-openstack-rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Starting controller-0
    * rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Stopped controller-1
    * rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Stopped controller-2

