Description of problem:

After hard rebooting two controllers, the rabbitmq-bundle-x pacemaker resources remain stopped. The test case keeps checking for more than 10 minutes after the nodes have been rebooted. The status of controller-2 is "Starting" during most of the retries:

* rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped controller-0
* rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Stopped controller-1
* rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Starting controller-2

controller-2 /var/log/messages shows the following errors:

Jul 22 18:25:38 controller-2 rabbitmq-cluster(rabbitmq)[268]: WARNING: Re-detect available rabbitmq nodes and try to start again
Jul 22 18:25:39 controller-2 rabbitmq-cluster(rabbitmq)[268]: INFO: RabbitMQ server could not get cluster status from mnesia
Jul 22 18:25:39 controller-2 rabbitmq-cluster(rabbitmq)[268]: INFO: wiping data directory before joining
Jul 22 18:25:42 controller-2 rabbitmq-cluster(rabbitmq)[268]: INFO: Forgetting rabbit@controller-2 via nodes [ rabbit@controller-0 ].
Jul 22 18:25:42 controller-2 rabbitmq-cluster(rabbitmq)[268]: ERROR: Failed to forget node rabbit@controller-2 via rabbit@controller-0.
Jul 22 18:25:42 controller-2 rabbitmq-cluster(rabbitmq)[268]: INFO: Joining existing cluster with [ rabbit@controller-0 ] nodes.
Jul 22 18:25:42 controller-2 rabbitmq-cluster(rabbitmq)[268]: INFO: Waiting for server to start ...
Jul 22 18:25:48 controller-2 pacemaker-remoted[8]: warning: rabbitmq_start_0 process (PID 268) timed out
Jul 22 18:25:48 controller-2 pacemaker-remoted[8]: warning: rabbitmq_start_0:268 - timed out after 200000ms
Jul 22 18:25:48 controller-2 pacemaker-remoted[8]: notice: rabbitmq_start_0:268:stderr [ Error: unable to perform an operation on node 'rabbit@controller-2'. Please see diagnostics information and suggestions below. ]
Jul 22 18:25:48 controller-2 pacemaker-remoted[8]: notice: rabbitmq_start_0:268:stderr [ ]
Jul 22 18:25:48 controller-2 pacemaker-remoted[8]: notice: rabbitmq_start_0:268:stderr [ Most common reasons for this are: ]
Jul 22 18:25:48 controller-2 pacemaker-remoted[8]: notice: rabbitmq_start_0:268:stderr [ ]
Jul 22 18:25:48 controller-2 pacemaker-remoted[8]: notice: rabbitmq_start_0:268:stderr [ * Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues) ]
Jul 22 18:25:48 controller-2 pacemaker-remoted[8]: notice: rabbitmq_start_0:268:stderr [ * CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server) ]
Jul 22 18:25:48 controller-2 pacemaker-remoted[8]: notice: rabbitmq_start_0:268:stderr [ * Target node is not running ]
Jul 22 18:25:48 controller-2 pacemaker-remoted[8]: notice: rabbitmq_start_0:268:stderr [ ]
Jul 22 18:25:48 controller-2 pacemaker-remoted[8]: notice: rabbitmq_start_0:268:stderr [ In addition to the diagnostics info below: ]
Jul 22 18:25:48 controller-2 pacemaker-remoted[8]: notice: rabbitmq_start_0:268:stderr [ ]
Jul 22 18:25:48 controller-2 pacemaker-remoted[8]: notice: rabbitmq_start_0:268:stderr [ * See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more ]
Jul 22 18:25:48 controller-2 pacemaker-remoted[8]: notice: rabbitmq_start_0:268:stderr [ * Consult server logs on node rabbit@controller-2 ]
Jul 22 18:25:48 controller-2 pacemaker-remoted[8]: notice: rabbitmq_start_0:268:stderr [ * If target node is configured to use long node names, don't forget to use --longnames with CLI tools ]
Jul 22 18:25:48 controller-2 pacemaker-remoted[8]: notice: rabbitmq_start_0:268:stderr [ ]
Jul 22 18:25:48 controller-2 pacemaker-remoted[8]: notice: rabbitmq_start_0:268:stderr [ DIAGNOSTICS ]
Jul 22 18:25:48 controller-2 pacemaker-remoted[8]: notice: rabbitmq_start_0:268:stderr [ =========== ]

Some error messages from /var/log/containers/rabbitmq/rabbit:

2020-07-22 18:22:55.483 [error] <0.916.0> Error on AMQP connection <0.916.0> (172.17.1.79:51694 -> 172.17.1.117:5672 - nova-scheduler:21:973a3f7f-cf31-495f-af4c-b9e07342794d, vhost: '/', user: 'guest', state: running), channel 0: operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"
2020-07-22 18:22:55.483 [error] <0.1501.0> Error on AMQP connection <0.1501.0> (172.17.1.148:54074 -> 172.17.1.117:5672 - heat-engine:24:aef1180b-ece5-4173-a5e8-c4fcd2ee58c2, vhost: '/', user: 'guest', state: running), channel 0: operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"
2020-07-22 18:22:55.483 [error] <0.2096.0> Error on AMQP connection <0.2096.0> (172.17.1.117:46070 -> 172.17.1.117:5672 - heat-engine:23:1b0ccabd-feb7-4b75-a4d2-de43a2c78061, vhost: '/', user: 'guest', state: running), channel 0: operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"

The job that reproduced this issue is:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/46/

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
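For reference, this is roughly the manual equivalent of the check the test keeps retrying (a minimal sketch only; the 60x10s loop is my assumption, the actual retry count and interval in tobiko differ):

  # run on any controller node; poll the rabbitmq bundle state for ~10 minutes
  for i in $(seq 1 60); do
      sudo pcs status resources | grep rabbitmq-bundle
      sleep 10
  done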
Created attachment 1702222 [details] test logs
This bug was reproduced on RHOS-16.1-RHEL-8-20200714.n.0 by the tobiko test test_reboot_controller_non_main_vip. The commands this test uses to reboot the controller nodes follow (they can be used for manual reproduction):

sudo chmod o+w /proc/sysrq-trigger;sudo echo b > /proc/sysrq-trigger   # on server: controller-2
sudo chmod o+w /proc/sysrq-trigger;sudo echo b > /proc/sysrq-trigger   # on server: controller-0
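Side note: the chmod is only needed because the redirection in "sudo echo b > /proc/sysrq-trigger" is performed by the unprivileged shell rather than under sudo. For manual reproduction, an equivalent one-liner that does not relax the file permissions (my variant, not what the test runs) is:

  # 'b' asks the kernel to reboot immediately, without syncing or unmounting
  sudo sh -c 'echo b > /proc/sysrq-trigger'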
Looks like it isn't easily reproducible in our lab via the double reboot. I've restarted two controllers and things recovered (after a long-ish while, though). Would you mind trying to reproduce it? Please give us a ping once the environment is available for troubleshooting. Thanks
It happened again yesterday on RHOS-16.1-RHEL-8-20200723.n.0 with the same job (https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/48/):

2020-07-28 04:22:24.466 328858 INFO tobiko.tests.faults.ha.cloud_disruptions [-] disrupt exec: sudo chmod o+w /proc/sysrq-trigger;sudo echo b > /proc/sysrq-trigger on server: controller-0
2020-07-28 04:22:24.466 328858 DEBUG tobiko.common._fixture [-] Clean up fixture 'tobiko.shell.ssh._client.SSHClientFixture' cleanUp /home/stack/src/x/tobiko/tobiko/common/_fixture.py:379
2020-07-28 04:22:24.471 328858 INFO tobiko.tests.faults.ha.cloud_disruptions [-] disrupt exec: sudo chmod o+w /proc/sysrq-trigger;sudo echo b > /proc/sysrq-trigger on server: controller-2
2020-07-28 04:22:24.471 328858 DEBUG tobiko.common._fixture [-] Clean up fixture 'tobiko.shell.ssh._client.SSHClientFixture' cleanUp /home/stack/src/x/tobiko/tobiko/common/_fixture.py:379
...
2020-07-28 04:34:08.796 328858 INFO tobiko.tripleo.pacemaker [-] Retrying pacemaker resource checks attempt 359 of 360
2020-07-28 04:34:09.797 328858 DEBUG tobiko.common._fixture [-] Set up fixture 'tobiko.shell.sh._ssh.SSHShellProcessFixture' setUp /home/stack/src/x/tobiko/tobiko/common/_fixture.py:371
2020-07-28 04:34:09.805 328858 DEBUG tobiko.shell.sh._ssh [-] Executing remote command: 'sudo pcs status resources |grep ocf' (login='heat-admin.24.51:22', timeout=None, environment={})... create_process /home/stack/src/x/tobiko/tobiko/shell/sh/_ssh.py:99
...
* rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped controller-0
* rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Stopped controller-1
* rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Stopped controller-2

The same issue made the subsequent tests fail as well; the rabbitmq resources did not recover on their own later:

2020-07-28 05:25:13.022 328858 INFO tobiko.tripleo.pacemaker [-] Retrying pacemaker resource checks attempt 359 of 360
2020-07-28 05:25:14.023 328858 DEBUG tobiko.common._fixture [-] Set up fixture 'tobiko.shell.sh._ssh.SSHShellProcessFixture' setUp /home/stack/src/x/tobiko/tobiko/common/_fixture.py:371
2020-07-28 05:25:14.024 328858 DEBUG tobiko.shell.sh._ssh [-] Executing remote command: 'sudo pcs status resources |grep ocf' (login='heat-admin.24.51:22', timeout=None, environment={})... create_process /home/stack/src/x/tobiko/tobiko/shell/sh/_ssh.py:99
...
* rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped controller-0
* rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Stopped controller-1
* rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Stopped controller-2

Test logs:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/48/artifact/.workspaces/active/test_results/tobiko.log

OC node logs can be downloaded from here:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/48/artifact/

Unfortunately this job ran on a server from the Jenkins pool and I don't have access to it anymore (it was reinstalled after the job ended). I will try to reproduce it on my env.
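In case the environment is still up the next time this reproduces, the generic pcs commands below are what I would try by hand to see whether the bundle can be nudged back into starting (just a sketch; it assumes the bundle resource id is rabbitmq-bundle, as the replica names above suggest):

  # clear the recorded start failures for the bundle and let pacemaker retry
  sudo pcs resource cleanup rabbitmq-bundle
  # or force a full stop/start cycle of the bundle
  sudo pcs resource restart rabbitmq-bundle
  # then watch whether the replicas leave the Stopped/Starting states
  sudo pcs status resources | grep rabbitmq-bundle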
Closing as we weren't able to reproduce this. Please reopen if it happens again and provide sosreports/access to the env so we can debug it.
This probably has something to do with attributes hanging around in the cib.

Rough timeline:
- Controllers 0 and 2 are killed
- Pacemaker stops rabbitmq on controller 1 (bug? race in minority partition handling code?)
- Controller 0 is the first one to restart, but in the log file (had to dig a while to find it... http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/72/controller-0/var/log/extra/journal.txt.gz) I see:

Apr 04 09:52:27 controller-0 rabbitmq-cluster(rabbitmq)[12305]: INFO: Forgetting rabbit@controller-0 via nodes [ rabbit@controller-0 ].
...
Apr 04 09:52:29 controller-0 rabbitmq-cluster(rabbitmq)[12474]: ERROR: Failed to forget node rabbit@controller-0 via rabbit@controller-0.
Apr 04 09:52:29 controller-0 rabbitmq-cluster(rabbitmq)[12482]: INFO: Joining existing cluster with [ rabbit@controller-0 ] nodes.
Apr 04 09:52:29 controller-0 rabbitmq-cluster(rabbitmq)[12489]: INFO: Waiting for server to start ...
Apr 04 09:52:52 controller-0 rabbitmq-cluster(rabbitmq)[14387]: INFO: Attempting to join cluster with target node rabbit@controller-0
...
Apr 04 09:52:53 controller-0 rabbitmq-cluster(rabbitmq)[14565]: INFO: Join process incomplete, shutting down.
Apr 04 09:52:53 controller-0 rabbitmq-cluster(rabbitmq)[14569]: WARNING: Failed to join the RabbitMQ cluster from nodes rabbit@controller-0. Stopping local unclustered rabbitmq
...
Apr 04 09:52:58 controller-0 rabbitmq-cluster(rabbitmq)[15397]: WARNING: Re-detect available rabbitmq nodes and try to start again

And it basically just loops from there, bouncing up and down. The key takeaway is that controller-0 is trying to join an existing cluster with... controller-0. So something is clearly wrong there. I need to go re-read the cib attribute code in the resource agent because it confuses me and I forget how it works 100% of the time.
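To check for the symptom and for leftover attributes on a live environment, something along these lines should do (a sketch; the rmq-node-attr-* names are what I remember the rabbitmq-cluster agent storing, so verify them against the agent script shipped on the nodes):

  # is the agent trying to join a cluster consisting only of itself?
  sudo grep 'Attempting to join cluster with target node' /var/log/messages

  # dump any rmq-related node attributes still present in the CIB
  sudo cibadmin --query --xpath "//nvpair[contains(@name,'rmq-node-attr')]"

  # or query one attribute for one node (attribute name assumed, see above)
  sudo crm_attribute --node controller-0 --lifetime forever \
       --name rmq-node-attr-last-known-rabbitmq --query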
So we spent quite some time on this and we're now moderately sure that if rabbit ends up not starting and you get a few messages like the following:

Apr 04 09:52:52 controller-0 rabbitmq-cluster(rabbitmq)[14387]: INFO: Attempting to join cluster with target node rabbit@controller-0

which basically means that rabbit on controller-0 is trying to form a cluster with itself, then we're in the presence of a pacemaker attrd bug: https://bugzilla.redhat.com/show_bug.cgi?id=1986998

The full analysis + explanation is on that BZ. For simplicity I will close this one as a dupe of the pcmk BZ. Once the pcmk BZ is root-caused and fixed we can clone it to the RHEL releases we deem appropriate.

*** This bug has been marked as a duplicate of bug 1986998 ***