Description of problem:
On OSP 16.2, a rabbitmq resource stays stuck in Stopped state after a crash test of the controller holding the main VIP.

Version-Release number of selected component (if applicable):
RHOS-16.2-RHEL-8-20210707.n.0

How reproducible:
100%

Steps to Reproduce:
export ip_main_vip_ip=$(. /home/stack/overcloudrc && echo $OS_AUTH_URL | cut -d ':' -f2 | cut -d '/' -f3)
# execute on controllers (with ip_main_vip_ip set to the value obtained above):
# crash the one holding that VIP:
ip a | grep "$ip_main_vip_ip" && (sleep 5s && echo c > /proc/sysrq-trigger)

Actual results:
One rabbitmq resource is stuck in Stopped state.

Expected results:

Additional info:
Found via a Tobiko job:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/gate-sanity-16.2_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/

Test report is here:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/gate-sanity-16.2_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/47/artifact/infrared/.workspaces/active/tobiko_faults/tobiko_faults_02_faults_faults.html
sosreports, the stack home directory, and /var/log from all overcloud nodes are at:
http://rhos-release.virt.bos.redhat.com/log/pkomarov_sosreports/BZ_1982460/
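The VIP lookup in the steps to reproduce can be sketched a little more defensively. This is only an illustration, not what the Tobiko job runs: the function names vip_from_auth_url and holds_vip are hypothetical, and the sketch assumes an IPv4 address or hostname in OS_AUTH_URL (bracketed IPv6 literals are not handled). The guard refuses an empty VIP, because grep with an empty pattern matches every line and would crash every controller.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original report): extract the host
# part of a Keystone auth URL such as http://10.0.0.101:5000/v3.
# Assumes an IPv4 address or hostname; bracketed IPv6 literals are not handled.
vip_from_auth_url() {
    url=$1
    rest=${url#*://}                 # drop the scheme, e.g. "http://"
    hostport=${rest%%/*}             # drop the path, keeping "host:port"
    printf '%s\n' "${hostport%%:*}"  # drop the port
}

# Hypothetical guard: succeed only when a non-empty VIP appears in the
# address listing passed as the second argument (e.g. the output of `ip a`).
holds_vip() {
    vip=$1
    addrs=$2
    [ -n "$vip" ] || return 1        # never match on an empty pattern
    printf '%s\n' "$addrs" | grep -qF "inet $vip/"
}
```

A possible use on a controller, with the VIP value passed in from the undercloud: holds_vip "$ip_main_vip_ip" "$(ip a)" && (sleep 5s && echo c > /proc/sysrq-trigger). Matching on "inet <vip>/" keeps 10.0.0.10 from matching 10.0.0.101.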
From the logs:

Tobiko crashes the node at 12:10:10.220:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/gate-sanity-16.2_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/47/artifact/infrared/.workspaces/active/tobiko_faults/tobiko_faults_02_faults_faults.html
2021-07-14 12:10:10.220 287426 INFO tobiko.shell.sh._reboot [-] Executing reboot command on host '192.168.24.34' (command='sudo /bin/sh -c 'echo 1 > /proc/sys/kernel/sysrq && echo c > /proc/sysrq-trigger'')...

From pacemaker.log:

# fencing right after the test crash
Jul 14 12:10:19 controller-0 pacemaker-schedulerd[2891] (pe_fence_node) warning: Guest node rabbitmq-bundle-0 will be fenced (by recovering its guest resource rabbitmq-bundle-podman-0): rabbitmq:0 is thought to be active there
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (pcmk__native_merge_weights) info: rabbitmq-bundle-podman-0: Rolling back optional scores from rabbitmq-bundle-0
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (pcmk__native_allocate) info: Resource rabbitmq-bundle-podman-0 cannot run anywhere
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (colocation_match) info: rabbitmq-bundle-0: Rolling back scores from rabbitmq-bundle-podman-0 (no available nodes)
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (custom_action) warning: rabbitmq-bundle-podman-0_stop_0 on controller-2 is unrunnable (node is offline)
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (custom_action) warning: rabbitmq-bundle-podman-0_stop_0 on controller-2 is unrunnable (node is offline)
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (custom_action) warning: rabbitmq-bundle-podman-0_stop_0 on controller-2 is unrunnable (node is offline)
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (custom_action) warning: rabbitmq-bundle-podman-0_stop_0 on controller-2 is unrunnable (node is offline)
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (native_stop_constraints) notice: Stop of failed resource rabbitmq-bundle-podman-0 is implicit after controller-2 is fenced
Jul 14 12:10:20 controller-0 pacemaker-schedulerd[2891] (LogNodeActions) notice: * Fence (off) rabbitmq-bundle-0 (resource: rabbitmq-bundle-podman-0) 'guest is unclean'

# rabbit is not recovering: start (blocked)
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (unpack_rsc_op_failure) warning: Unexpected result (error: podman failed to launch container) was recorded for start of rabbitmq-bundle-podman-0 on controller-2 at Jul 14 12:11:21 2021 | rc=1 id=rabbitmq-bundle-podman-0_last_failure_0
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (pe_get_failcount) info: rabbitmq-bundle-podman-0 has failed INFINITY times on controller-2
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (check_migration_threshold) warning: Forcing rabbitmq-bundle-podman-0 away from controller-2 after 1000000 failures (max=1000000)
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (pcmk__native_merge_weights) info: rabbitmq-bundle-podman-0: Rolling back optional scores from rabbitmq-bundle-0
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (pcmk__native_allocate) info: Resource rabbitmq-bundle-podman-0 cannot run anywhere
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (colocation_match) info: rabbitmq-bundle-0: Rolling back scores from rabbitmq-bundle-podman-0 (no available nodes)
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (LogActions) info: Leave rabbitmq-bundle-podman-0 (Stopped)
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (LogAction) notice: * Start rabbitmq-bundle-0 ( controller-2 ) due to unrunnable rabbitmq-bundle-podman-0 start (blocked)
Jul 14 22:41:56 controller-0 pacemaker-schedulerd[2891] (LogAction) notice: * Start rabbitmq:0 ( rabbitmq-bundle-0 ) due to unrunnable rabbitmq-bundle-podman-0 start (blocked)
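The scheduler lines that explain the blocked start (the recorded start failure, the INFINITY fail count, and the migration-threshold decision) can be pulled out of a large pacemaker.log with a small filter. This is a minimal sketch, not a tool used in this report: the function name blocked_reasons is hypothetical, and the function tags it matches are simply the ones that appear in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: print the pacemaker-schedulerd lines that explain why
# a resource is being kept away from a node.
#   $1  resource id, e.g. rabbitmq-bundle-podman-0
#   $2  path to a pacemaker log file
blocked_reasons() {
    resource=$1
    logfile=$2
    # First keep only lines mentioning the resource (fixed-string match),
    # then keep the failure-related scheduler functions seen in this report.
    grep -F "$resource" "$logfile" |
        grep -E '\((unpack_rsc_op_failure|pe_get_failcount|check_migration_threshold)\)'
}
```

Example invocation, assuming the log lives at /var/log/pacemaker/pacemaker.log on the controller: blocked_reasons rabbitmq-bundle-podman-0 /var/log/pacemaker/pacemaker.log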
*** Bug 1983952 has been marked as a duplicate of this bug. ***
We are fairly confident that this is a duplicate of bz#1973035.

*** This bug has been marked as a duplicate of bug 1973035 ***