Bug 1859971 - rabbitmq-cluster resources stopped after a controller hard reboot

| Field | Value |
|---|---|
| Product | Red Hat OpenStack |
| Component | rabbitmq-server |
| Version | 16.1 (Train) |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED DUPLICATE |
| Severity | high |
| Priority | high |
| Reporter | Eduardo Olivares <eolivare> |
| Assignee | John Eckersberg <jeckersb> |
| QA Contact | pkomarov |
| CC | akatz, apevec, ekuris, elicohen, jeckersb, lhh, lmiccini, michele, tfreger |
| Keywords | AutomationBlocker, Reopened, Triaged |
| Clones | 1983952 (view as bug list) |
| Last Closed | 2021-07-29 07:39:48 UTC |
| Type | Bug |
| Bug Blocks | 1983952 |
| Attachments | test logs (attachment 1702222) |

Description (Eduardo Olivares, 2020-07-23 12:16:30 UTC)

Created attachment 1702222 [details]: test logs
This bug was reproduced on RHOS-16.1-RHEL-8-20200714.n.0 by the tobiko test test_reboot_controller_non_main_vip. The commands the test uses to reboot the controller nodes are the following (they can also be used for manual reproduction):

    sudo chmod o+w /proc/sysrq-trigger;sudo echo b > /proc/sysrq-trigger   # on server: controller-2
    sudo chmod o+w /proc/sysrq-trigger;sudo echo b > /proc/sysrq-trigger   # on server: controller-0
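Note that in `sudo echo b > /proc/sysrq-trigger` the redirection is performed by the invoking, unprivileged shell, which is why the test loosens the permissions on /proc/sysrq-trigger first. A minimal manual-reproduction sketch from the undercloud could look like the following (assumptions: SSH access as heat-admin and the controller hostnames used in this job; `echo b | sudo tee ...` avoids the chmod step):

```bash
# Hard-reset two controllers at once. Writing 'b' to /proc/sysrq-trigger
# reboots the node immediately, without syncing or unmounting disks.
for node in controller-2 controller-0; do
    # the connection drops as soon as the node resets; ignore the error
    ssh heat-admin@"$node" 'echo b | sudo tee /proc/sysrq-trigger' || true
done

# afterwards, watch from the surviving controller whether the
# rabbitmq-cluster resources come back on their own
ssh heat-admin@controller-1 'sudo pcs status resources | grep rabbit'
```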
Looks like it isn't easily reproducible in our lab via the double reboot. I've restarted two controllers and things recovered (after a long-ish while, though). Would you mind trying to reproduce it? Please give us a ping once the environment is available for troubleshooting. Thanks

It happened again yesterday on RHOS-16.1-RHEL-8-20200723.n.0 with the same job (https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/48/):

    2020-07-28 04:22:24.466 328858 INFO tobiko.tests.faults.ha.cloud_disruptions [-] disrupt exec: sudo chmod o+w /proc/sysrq-trigger;sudo echo b > /proc/sysrq-trigger on server: controller-0
    2020-07-28 04:22:24.466 328858 DEBUG tobiko.common._fixture [-] Clean up fixture 'tobiko.shell.ssh._client.SSHClientFixture' cleanUp /home/stack/src/x/tobiko/tobiko/common/_fixture.py:379
    2020-07-28 04:22:24.471 328858 INFO tobiko.tests.faults.ha.cloud_disruptions [-] disrupt exec: sudo chmod o+w /proc/sysrq-trigger;sudo echo b > /proc/sysrq-trigger on server: controller-2
    2020-07-28 04:22:24.471 328858 DEBUG tobiko.common._fixture [-] Clean up fixture 'tobiko.shell.ssh._client.SSHClientFixture' cleanUp /home/stack/src/x/tobiko/tobiko/common/_fixture.py:379
    ...
    2020-07-28 04:34:08.796 328858 INFO tobiko.tripleo.pacemaker [-] Retrying pacemaker resource checks attempt 359 of 360
    2020-07-28 04:34:09.797 328858 DEBUG tobiko.common._fixture [-] Set up fixture 'tobiko.shell.sh._ssh.SSHShellProcessFixture' setUp /home/stack/src/x/tobiko/tobiko/common/_fixture.py:371
    2020-07-28 04:34:09.805 328858 DEBUG tobiko.shell.sh._ssh [-] Executing remote command: 'sudo pcs status resources |grep ocf' (login='heat-admin.24.51:22', timeout=None, environment={})... create_process /home/stack/src/x/tobiko/tobiko/shell/sh/_ssh.py:99
    ...
    * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped controller-0
    * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Stopped controller-1
    * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Stopped controller-2

The same issue made the subsequent tests fail as well; the rabbitmq resources were not recovered later on their own:

    2020-07-28 05:25:13.022 328858 INFO tobiko.tripleo.pacemaker [-] Retrying pacemaker resource checks attempt 359 of 360
    2020-07-28 05:25:14.023 328858 DEBUG tobiko.common._fixture [-] Set up fixture 'tobiko.shell.sh._ssh.SSHShellProcessFixture' setUp /home/stack/src/x/tobiko/tobiko/common/_fixture.py:371
    2020-07-28 05:25:14.024 328858 DEBUG tobiko.shell.sh._ssh [-] Executing remote command: 'sudo pcs status resources |grep ocf' (login='heat-admin.24.51:22', timeout=None, environment={})... create_process /home/stack/src/x/tobiko/tobiko/shell/sh/_ssh.py:99
    ...
    * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped controller-0
    * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Stopped controller-1
    * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Stopped controller-2

Test logs: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/48/artifact/.workspaces/active/test_results/tobiko.log

OC node logs can be downloaded from here: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/48/artifact/

Unfortunately this job ran on a server from the Jenkins pool and I don't have access to it (it was reinstalled after the job ended). I will try to reproduce it on my env.

Closing as we weren't able to reproduce this. Please reopen if it happens again and provide sosreports/access to the env so we can debug it.

This probably has something to do with attributes hanging around in the CIB. Rough timeline:

- Controllers 0 and 2 are killed
- Pacemaker stops rabbitmq on controller 1 (bug? a race in the minority-partition handling code?)
- Controller 0 is the first one to restart, but in the log file (had to dig a while to find it... http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/72/controller-0/var/log/extra/journal.txt.gz) I see:

    Apr 04 09:52:27 controller-0 rabbitmq-cluster(rabbitmq)[12305]: INFO: Forgetting rabbit@controller-0 via nodes [ rabbit@controller-0 ].
    ...
    Apr 04 09:52:29 controller-0 rabbitmq-cluster(rabbitmq)[12474]: ERROR: Failed to forget node rabbit@controller-0 via rabbit@controller-0.
    Apr 04 09:52:29 controller-0 rabbitmq-cluster(rabbitmq)[12482]: INFO: Joining existing cluster with [ rabbit@controller-0 ] nodes.
    Apr 04 09:52:29 controller-0 rabbitmq-cluster(rabbitmq)[12489]: INFO: Waiting for server to start
    ...
    Apr 04 09:52:52 controller-0 rabbitmq-cluster(rabbitmq)[14387]: INFO: Attempting to join cluster with target node rabbit@controller-0
    ...
    Apr 04 09:52:53 controller-0 rabbitmq-cluster(rabbitmq)[14565]: INFO: Join process incomplete, shutting down.
    Apr 04 09:52:53 controller-0 rabbitmq-cluster(rabbitmq)[14569]: WARNING: Failed to join the RabbitMQ cluster from nodes rabbit@controller-0. Stopping local unclustered rabbitmq
    ...
    Apr 04 09:52:58 controller-0 rabbitmq-cluster(rabbitmq)[15397]: WARNING: Re-detect available rabbitmq nodes and try to start again

And it basically just loops from there, bouncing up and down. The key takeaway is that controller-0 is trying to join an existing cluster with... controller-0. So something is clearly wrong there. I need to go re-read the CIB attribute code in the resource agent because it confuses me and I forget how it works 100% of the time.
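Since the suspicion above is stale attributes left in the CIB, it can help to look at what the resource agent has stored there. The following is only a sketch: the attribute names are taken from a reading of the ocf:heartbeat:rabbitmq-cluster script and may differ between resource-agents versions, and the resource instance is assumed to be named "rabbitmq", as in the journal excerpt above.

```bash
# Sketch: inspect the transient node attributes the rabbitmq-cluster resource
# agent keeps in the CIB. The attribute name is an assumption
# (rmq-node-attr-<resource instance>); adjust for your resource-agents version.
for node in controller-0 controller-1 controller-2; do
    echo "== $node =="
    # reboot-lifetime cookie the agent uses to decide which nodes to cluster
    # with; a stale value here is what can make a freshly rebooted node try
    # to join "itself"
    sudo crm_attribute --node "$node" --lifetime reboot \
        --name rmq-node-attr-rabbitmq --query || true
done

# or dump the transient-attribute (status) section of the CIB and look for
# leftover rmq-* entries
sudo cibadmin --query --scope status | grep -i rmq-node-attr
```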
So we spent quite some time on this, and we're now moderately sure that if rabbit ends up not starting and you get a few messages like the following:

    Apr 04 09:52:52 controller-0 rabbitmq-cluster(rabbitmq)[14387]: INFO: Attempting to join cluster with target node rabbit@controller-0

which basically means that rabbit on controller-0 is trying to form a cluster with itself, then we're in the presence of a pacemaker attrd bug: https://bugzilla.redhat.com/show_bug.cgi?id=1986998. The full analysis and explanation are on that BZ.

For simplicity's sake I will close this one as a duplicate of the pcmk BZ. Once the pcmk BZ is root-caused and fixed we can clone it to the RHEL releases we deem appropriate.

*** This bug has been marked as a duplicate of bug 1986998 ***
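For anyone landing on this bug who wants to confirm they are hitting the same symptom before following the duplicate, the telltale log line described above can be grepped out of the journal on the affected controller. A sketch, assuming the resource instance is named "rabbitmq" as in the excerpts in this report:

```bash
# Check for the "node tries to cluster with itself" symptom on the local
# controller; a match points at the attrd issue tracked in bug 1986998.
sudo journalctl --no-pager | grep 'rabbitmq-cluster(rabbitmq)' \
    | grep "Attempting to join cluster with target node rabbit@$(hostname -s)"
```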