Bug 1859971 - rabbitmq-cluster resources stopped after a controller hard reboot

| Field | Value |
|---|---|
| Product | Red Hat OpenStack |
| Component | rabbitmq-server |
| Version | 16.1 (Train) |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED DUPLICATE |
| Severity | high |
| Priority | high |
| Reporter | Eduardo Olivares <eolivare> |
| Assignee | John Eckersberg <jeckersb> |
| QA Contact | pkomarov |
| CC | akatz, apevec, ekuris, elicohen, jeckersb, lhh, lmiccini, michele, tfreger |
| Keywords | AutomationBlocker, Reopened, Triaged |
| Clones | 1983952 (view as bug list) |
| Last Closed | 2021-07-29 07:39:48 UTC |
| Type | Bug |
| Bug Blocks | 1983952 |
| Attachments | test logs (attachment 1702222) |

Description (Eduardo Olivares, 2020-07-23 12:16:30 UTC)

Created attachment 1702222 [details]: test logs
This bug was reproduced on RHOS-16.1-RHEL-8-20200714.n.0 by the tobiko test test_reboot_controller_non_main_vip. The commands the test uses to reboot the controller nodes are the following (they can also be used for manual reproduction):

    sudo chmod o+w /proc/sysrq-trigger;sudo echo b > /proc/sysrq-trigger   # on server: controller-2
    sudo chmod o+w /proc/sysrq-trigger;sudo echo b > /proc/sysrq-trigger   # on server: controller-0
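Note that in `sudo echo b > /proc/sysrq-trigger` the redirection is performed by the invoking, unprivileged shell, which is why the test loosens the permissions on /proc/sysrq-trigger first. A minimal manual-reproduction sketch from the undercloud could look like the following (assumptions: SSH access as heat-admin and the controller hostnames used in this job; `echo b | sudo tee ...` avoids the chmod step):

```bash
# Hard-reset two controllers at once. Writing 'b' to /proc/sysrq-trigger
# reboots the node immediately, without syncing or unmounting disks.
for node in controller-2 controller-0; do
    # the connection drops as soon as the node resets; ignore the error
    ssh heat-admin@"$node" 'echo b | sudo tee /proc/sysrq-trigger' || true
done

# afterwards, watch from the surviving controller whether the
# rabbitmq-cluster resources come back on their own
ssh heat-admin@controller-1 'sudo pcs status resources | grep rabbit'
```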
Looks like it isn't easily reproducible in our lab via the double reboot. I've restarted two controllers and things recovered (after a long-ish while, though). Would you mind trying to reproduce it? Please give us a ping once the environment is available for troubleshooting. Thanks

It happened again yesterday on RHOS-16.1-RHEL-8-20200723.n.0 with the same job (https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/48/):

    2020-07-28 04:22:24.466 328858 INFO tobiko.tests.faults.ha.cloud_disruptions [-] disrupt exec: sudo chmod o+w /proc/sysrq-trigger;sudo echo b > /proc/sysrq-trigger on server: controller-0
    2020-07-28 04:22:24.466 328858 DEBUG tobiko.common._fixture [-] Clean up fixture 'tobiko.shell.ssh._client.SSHClientFixture' cleanUp /home/stack/src/x/tobiko/tobiko/common/_fixture.py:379
    2020-07-28 04:22:24.471 328858 INFO tobiko.tests.faults.ha.cloud_disruptions [-] disrupt exec: sudo chmod o+w /proc/sysrq-trigger;sudo echo b > /proc/sysrq-trigger on server: controller-2
    2020-07-28 04:22:24.471 328858 DEBUG tobiko.common._fixture [-] Clean up fixture 'tobiko.shell.ssh._client.SSHClientFixture' cleanUp /home/stack/src/x/tobiko/tobiko/common/_fixture.py:379
    ...
    2020-07-28 04:34:08.796 328858 INFO tobiko.tripleo.pacemaker [-] Retrying pacemaker resource checks attempt 359 of 360
    2020-07-28 04:34:09.797 328858 DEBUG tobiko.common._fixture [-] Set up fixture 'tobiko.shell.sh._ssh.SSHShellProcessFixture' setUp /home/stack/src/x/tobiko/tobiko/common/_fixture.py:371
    2020-07-28 04:34:09.805 328858 DEBUG tobiko.shell.sh._ssh [-] Executing remote command: 'sudo pcs status resources |grep ocf' (login='heat-admin.24.51:22', timeout=None, environment={})... create_process /home/stack/src/x/tobiko/tobiko/shell/sh/_ssh.py:99
    ...
    * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped controller-0
    * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Stopped controller-1
    * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Stopped controller-2

The same issue made the subsequent tests fail as well; the rabbitmq resources were not recovered later on their own:

    2020-07-28 05:25:13.022 328858 INFO tobiko.tripleo.pacemaker [-] Retrying pacemaker resource checks attempt 359 of 360
    2020-07-28 05:25:14.023 328858 DEBUG tobiko.common._fixture [-] Set up fixture 'tobiko.shell.sh._ssh.SSHShellProcessFixture' setUp /home/stack/src/x/tobiko/tobiko/common/_fixture.py:371
    2020-07-28 05:25:14.024 328858 DEBUG tobiko.shell.sh._ssh [-] Executing remote command: 'sudo pcs status resources |grep ocf' (login='heat-admin.24.51:22', timeout=None, environment={})... create_process /home/stack/src/x/tobiko/tobiko/shell/sh/_ssh.py:99
    ...
    * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped controller-0
    * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Stopped controller-1
    * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Stopped controller-2

Test logs: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/48/artifact/.workspaces/active/test_results/tobiko.log

OC node logs can be downloaded from here: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/48/artifact/

Unfortunately this job ran on a server from the Jenkins pool and I don't have access to it (it was reinstalled after the job ended). I will try to reproduce it on my env.

Closing as we weren't able to reproduce this. Please reopen if it happens again and provide sosreports/access to the env so we can debug it.

This probably has something to do with attributes hanging around in the CIB. Rough timeline:

- Controllers 0 and 2 are killed
- Pacemaker stops rabbitmq on controller 1 (bug? a race in the minority-partition handling code?)
- Controller 0 is the first one to restart, but in the log file (had to dig a while to find it... http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/72/controller-0/var/log/extra/journal.txt.gz) I see:

    Apr 04 09:52:27 controller-0 rabbitmq-cluster(rabbitmq)[12305]: INFO: Forgetting rabbit@controller-0 via nodes [ rabbit@controller-0 ].
    ...
    Apr 04 09:52:29 controller-0 rabbitmq-cluster(rabbitmq)[12474]: ERROR: Failed to forget node rabbit@controller-0 via rabbit@controller-0.
    Apr 04 09:52:29 controller-0 rabbitmq-cluster(rabbitmq)[12482]: INFO: Joining existing cluster with [ rabbit@controller-0 ] nodes.
    Apr 04 09:52:29 controller-0 rabbitmq-cluster(rabbitmq)[12489]: INFO: Waiting for server to start
    ...
    Apr 04 09:52:52 controller-0 rabbitmq-cluster(rabbitmq)[14387]: INFO: Attempting to join cluster with target node rabbit@controller-0
    ...
    Apr 04 09:52:53 controller-0 rabbitmq-cluster(rabbitmq)[14565]: INFO: Join process incomplete, shutting down.
    Apr 04 09:52:53 controller-0 rabbitmq-cluster(rabbitmq)[14569]: WARNING: Failed to join the RabbitMQ cluster from nodes rabbit@controller-0. Stopping local unclustered rabbitmq
    ...
    Apr 04 09:52:58 controller-0 rabbitmq-cluster(rabbitmq)[15397]: WARNING: Re-detect available rabbitmq nodes and try to start again

And it basically just loops from there, bouncing up and down. The key takeaway is that controller-0 is trying to join an existing cluster with... controller-0. So something is clearly wrong there. I need to go re-read the CIB attribute code in the resource agent because it confuses me and I forget how it works 100% of the time.
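Since the suspicion above is stale attributes left in the CIB, it can help to look at what the resource agent has stored there. The following is only a sketch: the attribute names are taken from a reading of the ocf:heartbeat:rabbitmq-cluster script and may differ between resource-agents versions, and the resource instance is assumed to be named "rabbitmq", as in the journal excerpt above.

```bash
# Sketch: inspect the transient node attributes the rabbitmq-cluster resource
# agent keeps in the CIB. The attribute name is an assumption
# (rmq-node-attr-<resource instance>); adjust for your resource-agents version.
for node in controller-0 controller-1 controller-2; do
    echo "== $node =="
    # reboot-lifetime cookie the agent uses to decide which nodes to cluster
    # with; a stale value here is what can make a freshly rebooted node try
    # to join "itself"
    sudo crm_attribute --node "$node" --lifetime reboot \
        --name rmq-node-attr-rabbitmq --query || true
done

# or dump the transient-attribute (status) section of the CIB and look for
# leftover rmq-* entries
sudo cibadmin --query --scope status | grep -i rmq-node-attr
```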
So we spent quite some time on this, and we're now moderately sure that if rabbit ends up not starting and you get a few messages like the following:

    Apr 04 09:52:52 controller-0 rabbitmq-cluster(rabbitmq)[14387]: INFO: Attempting to join cluster with target node rabbit@controller-0

which basically means that rabbit on controller-0 is trying to form a cluster with itself, then we're in the presence of a pacemaker attrd bug: https://bugzilla.redhat.com/show_bug.cgi?id=1986998. The full analysis and explanation are on that BZ.

For simplicity's sake I will close this one as a duplicate of the pcmk BZ. Once the pcmk BZ is root-caused and fixed we can clone it to the RHEL releases we deem appropriate.

*** This bug has been marked as a duplicate of bug 1986998 ***
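For anyone landing on this bug who wants to confirm they are hitting the same symptom before following the duplicate, the telltale log line described above can be grepped out of the journal on the affected controller. A sketch, assuming the resource instance is named "rabbitmq" as in the excerpts in this report:

```bash
# Check for the "node tries to cluster with itself" symptom on the local
# controller; a match points at the attrd issue tracked in bug 1986998.
sudo journalctl --no-pager | grep 'rabbitmq-cluster(rabbitmq)' \
    | grep "Attempting to join cluster with target node rabbit@$(hostname -s)"
```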