Bug 1982460

Summary: osp16.2 rabbitmq resource stuck in stopped after main_vip holding controller crash test

Product: Red Hat OpenStack
Component: rabbitmq-server
Version: 16.2 (Train)
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: medium
Reporter: pkomarov
Assignee: Peter Lemenkov <plemenko>
QA Contact: pkomarov
CC: apevec, ekuris, jeckersb, lhh, lmiccini
Keywords: AutomationBlocker, Triaged
Target Milestone: ---
Target Release: ---
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2021-11-10 06:57:00 UTC
Bug Depends On: 1972209, 1990406, 1999264, 2000570, 2019335

Description pkomarov 2021-07-14 22:27:28 UTC
Description of problem:
OSP 16.2: one rabbitmq bundle resource remains stuck in the Stopped state after a crash test on the controller holding the main VIP.

Version-Release number of selected component (if applicable):
RHOS-16.2-RHEL-8-20210707.n.0

How reproducible:
100%

Steps to Reproduce:

# on the undercloud, resolve the main VIP from the overcloud auth URL:
export ip_main_vip_ip=$(. /home/stack/overcloudrc && echo $OS_AUTH_URL | cut -d ':' -f2 | cut -d '/' -f3)

# execute on the controllers (pass the VIP value along; the exported variable
# exists only on the undercloud) and crash the one holding that VIP:
ip a | grep -q "$ip_main_vip_ip" && (sleep 5s && echo c > /proc/sysrq-trigger)
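
To observe the reported symptom afterwards (not part of the original reproducer; a minimal check sketch, assuming pcs and podman are available on a surviving controller):

# run on any remaining controller; one rabbitmq-bundle replica left in
# "Stopped" matches the reported symptom
sudo pcs status --full | grep -A 3 rabbitmq-bundle
# no rabbitmq container should be left running on the crashed/fenced node
sudo podman ps --filter name=rabbitmq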

Actual results:
One rabbitmq resource is stuck in the Stopped state.

Expected results:
All rabbitmq resources recover and return to Started once the crashed controller is fenced and rejoins the cluster.

Additional info:

Comment 2 pkomarov 2021-07-14 22:46:20 UTC
sosreports, the stack user's home directory, and all overcloud /var/log contents are at: http://rhos-release.virt.bos.redhat.com/log/pkomarov_sosreports/BZ_1982460/

Comment 3 pkomarov 2021-07-14 22:58:28 UTC
From the logs:

Tobiko crashes the node at 12:10:10.220:

https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/gate-sanity-16.2_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/47/artifact/infrared/.workspaces/active/tobiko_faults/tobiko_faults_02_faults_faults.html

2021-07-14 12:10:10.220 287426 INFO tobiko.shell.sh._reboot [-] Executing reboot command on host '192.168.24.34' (command='sudo /bin/sh -c 'echo 1 > /proc/sys/kernel/sysrq && echo c > /proc/sysrq-trigger'')... 

From pacemaker.log:

# fencing right after the test crash:

Jul 14 12:10:19 controller-0 pacemaker-schedulerd[2891] (pe_fence_node)         warning: Guest node rabbitmq-bundle-0 will be fenced (by recovering its guest resource rabbitmq-bundle-podman-0): rabbitmq:0 is thought to be active there
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (pcmk__native_merge_weights)    info: rabbitmq-bundle-podman-0: Rolling back optional scores from rabbitmq-bundle-0
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (pcmk__native_allocate)         info: Resource rabbitmq-bundle-podman-0 cannot run anywhere
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (colocation_match)      info: rabbitmq-bundle-0: Rolling back scores from rabbitmq-bundle-podman-0 (no available nodes)
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (custom_action)         warning: rabbitmq-bundle-podman-0_stop_0 on controller-2 is unrunnable (node is offline)
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (custom_action)         warning: rabbitmq-bundle-podman-0_stop_0 on controller-2 is unrunnable (node is offline)
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (custom_action)         warning: rabbitmq-bundle-podman-0_stop_0 on controller-2 is unrunnable (node is offline)
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (custom_action)         warning: rabbitmq-bundle-podman-0_stop_0 on controller-2 is unrunnable (node is offline)
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (native_stop_constraints)       notice: Stop of failed resource rabbitmq-bundle-podman-0 is implicit after controller-2 is fenced
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (LogNodeActions)        notice:  * Fence (off) rabbitmq-bundle-0 (resource: rabbitmq-bundle-podman-0) 'guest is unclean'

# rabbitmq is not recovering; the bundle start remains blocked:

Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (unpack_rsc_op_failure)         warning: Unexpected result (error: podman failed to launch container) was recorded for start of rabbitmq-bundle-podman-0 on controller-2 at Jul 14 12:11:21 2021 | rc=1 id=rabbitmq-bundle-podman-0_last_failure_0
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (pe_get_failcount)      info: rabbitmq-bundle-podman-0 has failed INFINITY times on controller-2
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (check_migration_threshold)     warning: Forcing rabbitmq-bundle-podman-0 away from controller-2 after 1000000 failures (max=1000000)
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (pcmk__native_merge_weights)    info: rabbitmq-bundle-podman-0: Rolling back optional scores from rabbitmq-bundle-0
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (pcmk__native_allocate)         info: Resource rabbitmq-bundle-podman-0 cannot run anywhere
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (colocation_match)      info: rabbitmq-bundle-0: Rolling back scores from rabbitmq-bundle-podman-0 (no available nodes)
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (LogActions)    info: Leave   rabbitmq-bundle-podman-0  (Stopped)
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (LogAction)     notice:  * Start      rabbitmq-bundle-0                      (                    controller-2 )   due to unrunnable rabbitmq-bundle-podman-0 start (blocked)
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (LogAction)     notice:  * Start      rabbitmq:0                             (               rabbitmq-bundle-0 )   due to unrunnable rabbitmq-bundle-podman-0 start (blocked)
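
The failcount lines above show that rabbitmq-bundle-podman-0 reached its migration-threshold (1000000 failures) on controller-2, so the scheduler refuses to place it there until the failure record is cleared. Not part of the original report, but as a sketch of what inspecting and clearing that record looks like with standard pcs commands (run on any controller in the cluster):

# show the failure record that keeps the resource banned from controller-2
sudo pcs resource failcount show rabbitmq-bundle-podman-0
# clear the failure record so the scheduler can retry the start
sudo pcs resource cleanup rabbitmq-bundle-podman-0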

Comment 7 Eduardo Olivares 2021-07-28 08:31:49 UTC
*** Bug 1983952 has been marked as a duplicate of this bug. ***

Comment 8 Luca Miccini 2021-11-10 06:57:00 UTC
We are fairly confident that this is a duplicate of bz#1973035.

*** This bug has been marked as a duplicate of bug 1973035 ***