Description of problem:
-----------------------
Some RMQ instances fail to join the cluster after their node is rebooted during a minor update.

From failed rmq containers:

notice: operation_finished: rabbitmq_start_0:489:stderr [ Error: unable to connect to node 'rabbit@controller-2': nodedown ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ DIAGNOSTICS ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ =========== ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ attempted to contact: ['rabbit@controller-2'] ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ rabbit@controller-2: ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ * connected to epmd (port 4369) on controller-2 ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ * epmd reports: node 'rabbit' not running at all ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ no other nodes on controller-2 ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ * suggestion: start the node ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ current node details: ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ - node name: 'rabbitmq-cli-42@controller-2' ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ - home dir: /var/lib/rabbitmq ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ - cookie hash: aUh+lqTjdvRSodaHhNKqEg== ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ Error: mnesia_not_running ]
info: log_finished: finished - rsc:rabbitmq action:start call_id:16 pid:489 exit-code:1 exec-time:20128ms queue-time:0ms
...
notice: operation_finished: rabbitmq_start_0:184:stderr [ ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ DIAGNOSTICS ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ =========== ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ attempted to contact: ['rabbit@controller-1'] ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ rabbit@controller-1: ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ * connected to epmd (port 4369) on controller-1 ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ * epmd reports: node 'rabbit' not running at all ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ no other nodes on controller-1 ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ * suggestion: start the node ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ current node details: ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ - node name: 'rabbitmq-cli-36@controller-1' ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ - home dir: /var/lib/rabbitmq ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ - cookie hash: aUh+lqTjdvRSodaHhNKqEg== ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ Error: {offline_node_no_offline_flag,"You are trying to remove a node from an offline node. That is dangerous, but can be done with the --offline flag. Please consult the manual for rabbitmqctl for more information."} ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ Error: {inconsistent_cluster,"Node 'rabbit@controller-0' thinks it's clustered with node 'rabbit@controller-1', but 'rabbit@controller-1' disagrees"} ]
info: log_finished: finished - rsc:rabbitmq action:start call_id:15 pid:184 exit-code:1 exec-time:22913ms queue-time:0ms

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
Images from 2017-12-01.4

Steps to Reproduce:
-------------------
1. Perform a minor update of UC/OC (from 2017-11-29.2 -> 2017-12-01.4).
2. Reboot the nodes one by one: if pcs is running on a node, stop it, reboot the node, and start pcs again once the node is back up (see the sketch at the end of this description).

Actual results:
---------------
RMQ fails to start on some nodes while it starts on others.

Expected results:
-----------------
RMQ starts on all nodes after the reboots.

Additional info:
----------------
Virtual setup: 3 controllers + 2 computes + 3 ceph; OC/UC with SSL + IPv6
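For step 2, the per-node sequence is roughly the following. This is a minimal sketch: it assumes the commands are run as root on each controller in turn (never more than one node at a time), and any ssh/orchestration wrapper around them is omitted.

# Stop the pacemaker cluster on this node only, then reboot it.
pcs cluster stop --request-timeout=300
shutdown -r now

# ...once the node is reachable again after the reboot:
pcs cluster start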
I just tried to reproduce this a few times and failed. What I did was, for each controller in sequence:

pcs cluster stop --request-timeout=300
shutdown -r now

Then I just waited for everything to come back online, which it did successfully.

In the absence of any obvious clue, let me braindump here what exactly happens in the resource agent when this procedure is followed, because it's not at all obvious and, quite frankly, I forget every time until I go back and read the code. So, selfishly, this will give me something to refer to in the future when these sorts of issues arise, and it should also help anyone interested in following along.

In this example, we start with everything up and clustered happily. First, stop pcs on controller-0 (c0) as described above:

# pcs cluster stop --request-timeout=300

As part of the cluster stop action, the rabbitmq resource on c0 is stopped. Immediately after that occurs, c0 is still a member of the rabbitmq cluster, but the node is offline. However, when the stop action completes on c0, a notification action runs on the other cluster nodes via the pacemaker notify mechanism. This action translates the pcmk node name to a rabbitmq nodename by querying the rmq-node-attr-last-known-rabbitmq attribute (a permanent attribute that persists even when the cluster is stopped on that node). Once we have the rabbit nodename, we run:

rabbitmqctl forget_cluster_node $other_node_name

At this point, the rabbitmq cluster only has two members.

Next, c0 is rebooted. Once it comes back up, the pcs cluster starts automatically and a start of the rabbitmq resource is attempted. How, exactly?

First we get the "join list", which is the list of other rabbitmq nodes that are currently up and running. If no other nodes are running, there is special code to bootstrap the cluster, but we will ignore that case here because the update procedure ensures that at most one controller is rebooted at a time. When the other nodes *are* running, we get the join list by querying the crm for nodes with the rmq-node-attr-rabbitmq attribute. Note that this is *not* the same attribute mentioned above; this one is transient and as such only exists for resources that are confirmed as up.

Once we have a valid list of nodes to join, we then:

- Explicitly stop rabbitmq (it shouldn't be running, but it doesn't hurt to be sure).
- Wipe the rabbitmq data directory; this ensures that mnesia will cluster correctly when joining. This is a literal rm -rf /var/lib/rabbitmq/mnesia.
- For each cluster node in the join list, invoke rabbitmqctl *remotely* against that node to make it forget the local node. This shouldn't strictly be necessary, because the notify action already did this when rabbit on c0 was stopped earlier, but if that failed for some reason (other node was down?), this tries again.
- Finally, join the existing cluster, which happens by:
  - starting the rabbit app;
  - checking if we are already clustered (we can't be, because we wiped mnesia, but it's harmless to check); if clustered, consider everything OK;
  - otherwise, stopping the rabbit app, iterating the join list, and attempting `rabbitmqctl join_cluster $node` for each in sequence. Once we successfully join, start the app and everything is up.

A rough sketch of this start sequence follows below.
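For reference, here is a shell sketch of that start-after-reboot sequence. This is not the actual OCF resource agent code: the join list and local node name are hard-coded placeholders (the agent derives them from the crm attributes described above), the "already clustered" check is skipped, and error handling is simplified.

# Placeholders: the agent computes these from the crm attributes.
join_list="rabbit@controller-1 rabbit@controller-2"
local_node="rabbit@controller-0"

# 1. Make sure no stray rabbit is running, then wipe the local state so
#    mnesia will cluster cleanly when we join.
rabbitmqctl stop >/dev/null 2>&1 || true
rm -rf /var/lib/rabbitmq/mnesia

# 2. Ask each running member to forget the local node, in case the
#    notify-time forget_cluster_node never happened.
for node in $join_list; do
    rabbitmqctl -n "$node" forget_cluster_node "$local_node" || true
done

# 3. Start the broker, then (since a freshly wiped node cannot already be
#    clustered) stop the app, join the first member that accepts us, and
#    start the app again.
rabbitmq-server -detached
rabbitmqctl stop_app
for node in $join_list; do
    rabbitmqctl join_cluster "$node" && break
done
rabbitmqctl start_app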
*** Bug 1568411 has been marked as a duplicate of this bug. ***