Bug 1982460 - osp16.2 rabbitmq resource stuck in stopped after main_vip holding controller crash test
Summary: osp16.2 rabbitmq resource stuck in stopped after main_vip holding controller crash test
Keywords:
Status: CLOSED DUPLICATE of bug 1973035
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 16.2 (Train)
Hardware: x86_64
OS: Linux
Importance: medium high
Target Milestone: ---
Assignee: Peter Lemenkov
QA Contact: pkomarov
URL:
Whiteboard:
Duplicates: 1983952
Depends On: 1972209 1990406 1999264 2000570 2019335
Blocks:
 
Reported: 2021-07-14 22:27 UTC by pkomarov
Modified: 2022-08-10 17:02 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-10 06:57:00 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links:
Red Hat Issue Tracker OSP-6248 (last updated 2022-08-10 17:02:52 UTC)

Description pkomarov 2021-07-14 22:27:28 UTC
Description of problem:
On osp16.2, after crashing the controller node that holds the main VIP, one rabbitmq bundle resource remains stuck in the Stopped state and does not recover.

Version-Release number of selected component (if applicable):
RHOS-16.2-RHEL-8-20210707.n.0

How reproducible:
100%

Steps to Reproduce:

# on the undercloud, extract the main VIP from the overcloud auth URL:
export ip_main_vip_ip=$(. /home/stack/overcloudrc && echo "$OS_AUTH_URL" | cut -d ':' -f2 | cut -d '/' -f3)

# execute on the controllers (with ip_main_vip_ip exported there as well)
# and crash the one holding that VIP:
ip a | grep -q "$ip_main_vip_ip" && (sleep 5s && echo c > /proc/sysrq-trigger)
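For reference, the same crash can be driven end to end from the undercloud; a minimal sketch, assuming the stack user on the undercloud, heat-admin SSH access to the controllers, and a CONTROLLERS variable listing their ctlplane IPs (all of these are assumptions, not taken from this report):

# hypothetical helper: crash the controller currently holding the main VIP
source /home/stack/overcloudrc
main_vip=$(echo "$OS_AUTH_URL" | cut -d ':' -f2 | cut -d '/' -f3)

for host in $CONTROLLERS; do
    # does this controller hold the VIP?
    if ssh heat-admin@"$host" "ip a | grep -q $main_vip"; then
        echo "crashing $host (holds VIP $main_vip)"
        # enable sysrq and trigger a kernel crash, mirroring the tobiko fault command
        ssh heat-admin@"$host" "sudo /bin/sh -c 'echo 1 > /proc/sys/kernel/sysrq && echo c > /proc/sysrq-trigger'"
        break
    fi
done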

Actual results:
One rabbitmq bundle resource (rabbitmq-bundle-podman-0 on controller-2) is stuck in the Stopped state.

Expected results:
All rabbitmq bundle resources return to Started once the crashed controller is fenced, reboots, and rejoins the cluster.

Additional info:

Comment 2 pkomarov 2021-07-14 22:46:20 UTC
sosreports, the stack user's home directory, and all overcloud /var/log contents are at: http://rhos-release.virt.bos.redhat.com/log/pkomarov_sosreports/BZ_1982460/

Comment 3 pkomarov 2021-07-14 22:58:28 UTC
From the logs:

Tobiko crashes the node at 12:10:10.220:

https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/gate-sanity-16.2_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/47/artifact/infrared/.workspaces/active/tobiko_faults/tobiko_faults_02_faults_faults.html

2021-07-14 12:10:10.220 287426 INFO tobiko.shell.sh._reboot [-] Executing reboot command on host '192.168.24.34' (command='sudo /bin/sh -c 'echo 1 > /proc/sys/kernel/sysrq && echo c > /proc/sysrq-trigger'')... 

From pacemaker.log:

# fencing right after the test crash:

Jul 14 12:10:19 controller-0 pacemaker-schedulerd[2891] (pe_fence_node)         warning: Guest node rabbitmq-bundle-0 will be fenced (by recovering its guest resource rabbitmq-bundle-podman-0): rabbitmq:0 is thought to be active there
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (pcmk__native_merge_weights)    info: rabbitmq-bundle-podman-0: Rolling back optional scores from rabbitmq-bundle-0
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (pcmk__native_allocate)         info: Resource rabbitmq-bundle-podman-0 cannot run anywhere
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (colocation_match)      info: rabbitmq-bundle-0: Rolling back scores from rabbitmq-bundle-podman-0 (no available nodes)
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (custom_action)         warning: rabbitmq-bundle-podman-0_stop_0 on controller-2 is unrunnable (node is offline)
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (custom_action)         warning: rabbitmq-bundle-podman-0_stop_0 on controller-2 is unrunnable (node is offline)
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (custom_action)         warning: rabbitmq-bundle-podman-0_stop_0 on controller-2 is unrunnable (node is offline)
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (custom_action)         warning: rabbitmq-bundle-podman-0_stop_0 on controller-2 is unrunnable (node is offline)
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (native_stop_constraints)       notice: Stop of failed resource rabbitmq-bundle-podman-0 is implicit after controller-2 is fenced
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (LogNodeActions)        notice:  * Fence (off) rabbitmq-bundle-0 (resource: rabbitmq-bundle-podman-0) 'guest is unclean'
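For context, the fencing decision above can be reviewed on a surviving controller; a minimal sketch, assuming pcs 0.10 as shipped with RHEL 8 (commands shown for orientation only, they are not taken from this report):

# overall cluster and bundle state, including failed resource actions
pcs status --full

# fencing history recorded by the cluster
pcs stonith history show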

# rabbitmq is not recovering:

Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (unpack_rsc_op_failure)         warning: Unexpected result (error: podman failed to launch container) was recorded for start of rabbitmq-bundle-podman-0 on controller-2 at Jul 14 12:11:21 2021 | rc=1 id=rabbitmq-bundle-podman-0_last_failure_0
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (pe_get_failcount)      info: rabbitmq-bundle-podman-0 has failed INFINITY times on controller-2
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (check_migration_threshold)     warning: Forcing rabbitmq-bundle-podman-0 away from controller-2 after 1000000 failures (max=1000000)
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (pcmk__native_merge_weights)    info: rabbitmq-bundle-podman-0: Rolling back optional scores from rabbitmq-bundle-0
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (pcmk__native_allocate)         info: Resource rabbitmq-bundle-podman-0 cannot run anywhere
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (colocation_match)      info: rabbitmq-bundle-0: Rolling back scores from rabbitmq-bundle-podman-0 (no available nodes)
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (LogActions)    info: Leave   rabbitmq-bundle-podman-0  (Stopped)
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (LogAction)     notice:  * Start      rabbitmq-bundle-0                      (                    controller-2 )   due to unrunnable rabbitmq-bundle-podman-0 start (blocked)
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (LogAction)     notice:  * Start      rabbitmq:0                             (               rabbitmq-bundle-0 )   due to unrunnable rabbitmq-bundle-podman-0 start (blocked)
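Once the failcount reaches INFINITY as above, pacemaker will not retry rabbitmq-bundle-podman-0 on controller-2 until the failure record is cleared. Purely as a point of reference (not verified as a workaround for this bug), the recorded failures can be inspected and reset with pcs:

# failure count pacemaker has recorded for the podman bundle resource
pcs resource failcount show rabbitmq-bundle-podman-0

# clear the failure record so pacemaker re-attempts the start
pcs resource cleanup rabbitmq-bundle-podman-0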

Comment 7 Eduardo Olivares 2021-07-28 08:31:49 UTC
*** Bug 1983952 has been marked as a duplicate of this bug. ***

Comment 8 Luca Miccini 2021-11-10 06:57:00 UTC
We are fairly confident that this is a duplicate of bz#1973035.

*** This bug has been marked as a duplicate of bug 1973035 ***

