Description of problem:
-----------------------
Some RMQ instances fail to join the cluster after their node is rebooted during a minor update.

From failed rmq containers:

notice: operation_finished: rabbitmq_start_0:489:stderr [ Error: unable to connect to node 'rabbit@controller-2': nodedown ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ DIAGNOSTICS ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ =========== ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ attempted to contact: ['rabbit@controller-2'] ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ rabbit@controller-2: ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ * connected to epmd (port 4369) on controller-2 ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ * epmd reports: node 'rabbit' not running at all ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ no other nodes on controller-2 ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ * suggestion: start the node ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ current node details: ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ - node name: 'rabbitmq-cli-42@controller-2' ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ - home dir: /var/lib/rabbitmq ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ - cookie hash: aUh+lqTjdvRSodaHhNKqEg== ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ]
notice: operation_finished: rabbitmq_start_0:489:stderr [ Error: mnesia_not_running ]
info: log_finished: finished - rsc:rabbitmq action:start call_id:16 pid:489 exit-code:1 exec-time:20128ms queue-time:0ms
...
notice: operation_finished: rabbitmq_start_0:184:stderr [ ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ DIAGNOSTICS ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ =========== ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ attempted to contact: ['rabbit@controller-1'] ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ rabbit@controller-1: ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ * connected to epmd (port 4369) on controller-1 ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ * epmd reports: node 'rabbit' not running at all ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ no other nodes on controller-1 ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ * suggestion: start the node ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ current node details: ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ - node name: 'rabbitmq-cli-36@controller-1' ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ - home dir: /var/lib/rabbitmq ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ - cookie hash: aUh+lqTjdvRSodaHhNKqEg== ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ Error: {offline_node_no_offline_flag,"You are trying to remove a node from an offline node. That is dangerous, but can be done with the --offline flag. Please consult the manual for rabbitmqctl for more information."} ]
notice: operation_finished: rabbitmq_start_0:184:stderr [ Error: {inconsistent_cluster,"Node 'rabbit@controller-0' thinks it's clustered with node 'rabbit@controller-1', but 'rabbit@controller-1' disagrees"} ]
info: log_finished: finished - rsc:rabbitmq action:start call_id:15 pid:184 exit-code:1 exec-time:22913ms queue-time:0ms

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
Images from 2017-12-01.4

Steps to Reproduce:
-------------------
1. Perform a minor update of UC/OC (from 2017-11-29.2 -> 2017-12-01.4).
2. Reboot the nodes one by one: if pcs is running on a node, stop it, reboot the node, and start pcs again once the node is back up (see the sketch at the end of this description).

Actual results:
---------------
RMQ fails to start on some nodes while it starts on others.

Expected results:
-----------------
RMQ starts on all nodes after the reboots.

Additional info:
----------------
Virtual setup: 3 controllers + 2 computes + 3 ceph; OC/UC with SSL + IPv6
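For step 2, the per-node sequence is roughly the following. This is a minimal sketch: it assumes the commands are run as root on each controller in turn (never more than one node at a time), and any ssh/orchestration wrapper around them is omitted.

# Stop the pacemaker cluster on this node only, then reboot it.
pcs cluster stop --request-timeout=300
shutdown -r now

# ...once the node is reachable again after the reboot:
pcs cluster start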
I just tried to reproduce this a few times and failed. What I did was, for each controller in sequence:

pcs cluster stop --request-timeout=300
shutdown -r now

Then I just waited for everything to come back online, which it did successfully.

In the absence of any obvious clue, let me braindump here what exactly happens in the resource agent when this procedure is followed, because it's not at all obvious and, quite frankly, I forget every time until I go back and read the code. So, selfishly, this will give me something to refer to in the future when these sorts of issues arise, and it should also help anyone interested in following along.

In this example, we start with everything up and clustered happily. First, stop pcs on controller-0 (c0) as described above:

# pcs cluster stop --request-timeout=300

As part of the cluster stop action, the rabbitmq resource on c0 is stopped. Immediately after that occurs, c0 is still a member of the rabbitmq cluster, but the node is offline. However, when the stop action completes on c0, a notification action runs on the other cluster nodes via the pacemaker notify mechanism. This action translates the pcmk node name to a rabbitmq nodename by querying the rmq-node-attr-last-known-rabbitmq attribute (a permanent attribute that persists even when the cluster is stopped on that node). Once we have the rabbit nodename, we run:

rabbitmqctl forget_cluster_node $other_node_name

At this point, the rabbitmq cluster only has two members.

Next, c0 is rebooted. Once it comes back up, the pcs cluster starts automatically and a start of the rabbitmq resource is attempted. How, exactly?

First we get the "join list", which is the list of other rabbitmq nodes that are currently up and running. If no other nodes are running, there is special code to bootstrap the cluster, but we will ignore that case here because the update procedure ensures that at most one controller is rebooted at a time. When the other nodes *are* running, we get the join list by querying the crm for nodes with the rmq-node-attr-rabbitmq attribute. Note that this is *not* the same attribute mentioned above; this one is transient and as such only exists for resources that are confirmed as up.

Once we have a valid list of nodes to join, we then:

- Explicitly stop rabbitmq (it shouldn't be running, but it doesn't hurt to be sure).
- Wipe the rabbitmq data directory; this ensures that mnesia will cluster correctly when joining. This is a literal rm -rf /var/lib/rabbitmq/mnesia.
- For each cluster node in the join list, invoke rabbitmqctl *remotely* against that node to make it forget the local node. This shouldn't strictly be necessary, because the notify action already did this when rabbit on c0 was stopped earlier, but if that failed for some reason (other node was down?), this tries again.
- Finally, join the existing cluster, which happens by:
  - starting the rabbit app;
  - checking if we are already clustered (we can't be, because we wiped mnesia, but it's harmless to check); if clustered, consider everything OK;
  - otherwise, stopping the rabbit app, iterating the join list, and attempting `rabbitmqctl join_cluster $node` for each in sequence. Once we successfully join, start the app and everything is up.

A rough sketch of this start sequence follows below.
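For reference, here is a shell sketch of that start-after-reboot sequence. This is not the actual OCF resource agent code: the join list and local node name are hard-coded placeholders (the agent derives them from the crm attributes described above), the "already clustered" check is skipped, and error handling is simplified.

# Placeholders: the agent computes these from the crm attributes.
join_list="rabbit@controller-1 rabbit@controller-2"
local_node="rabbit@controller-0"

# 1. Make sure no stray rabbit is running, then wipe the local state so
#    mnesia will cluster cleanly when we join.
rabbitmqctl stop >/dev/null 2>&1 || true
rm -rf /var/lib/rabbitmq/mnesia

# 2. Ask each running member to forget the local node, in case the
#    notify-time forget_cluster_node never happened.
for node in $join_list; do
    rabbitmqctl -n "$node" forget_cluster_node "$local_node" || true
done

# 3. Start the broker, then (since a freshly wiped node cannot already be
#    clustered) stop the app, join the first member that accepts us, and
#    start the app again.
rabbitmq-server -detached
rabbitmqctl stop_app
for node in $join_list; do
    rabbitmqctl join_cluster "$node" && break
done
rabbitmqctl start_app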
*** Bug 1568411 has been marked as a duplicate of this bug. ***