Given a typical three-node OpenStack cluster with controllers 0, 1, 2:

1) Disconnect one of the three nodes from the network, say controller 0.
2) The resource agent runs its monitoring operation on one of the remaining healthy nodes, say controller 1.
3) The resource agent runs `rabbitmqctl cluster_status`, which hangs for some time while trying to contact node 0.
4) The monitoring operation times out, so pacemaker stops the rabbitmq service on node 1.
5) Rabbitmq on node 2 sees the 'DOWN' message from node 1 and realizes it is now in a minority partition, so it begins to pause itself as part of the pause_minority partition handling strategy (see the config sketch below).
6) Pacemaker starts rabbitmq on node 1.

Depending on the exact timings, this sometimes works, and other times results in a rabbitmq cluster where the nodes cannot form a majority partition, so neither node 1 nor node 2 will start.

It seems to me that the cluster_status query should fail faster than it does, given the way things are tuned and how long it should take to mark node 0 as unreachable within the cluster. This will require some more investigation into the exact calls being made, what sort of timeouts (if any) they have, and what the implications are of the call(s) landing before or after node 0 is determined to be down.
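For reference, the pause_minority strategy in step 5 is the cluster_partition_handling setting in RabbitMQ's classic Erlang-term config; a minimal sketch (the exact file and surrounding layout in this deployment may differ):

%% /etc/rabbitmq/rabbitmq.config
[
  {rabbit, [
    %% pause any node that finds itself in a minority partition
    {cluster_partition_handling, pause_minority}
  ]}
].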
Some timings of rabbit_mnesia:status(), which is what ultimately backs cluster_status. I've got a three-node cluster, with TCP_USER_TIMEOUT set to 15 seconds:

[root@overcloud-controller-0 ~]# grep ERL_ARGS /etc/rabbitmq/rabbitmq-env.conf
RABBITMQ_SERVER_ERL_ARGS="+K true +P 1048576 -kernel inet_default_connect_options [{nodelay,true},{raw,6,18,<<15000:64/native>>}] -kernel inet_default_listen_options [{raw,6,18,<<15000:64/native>>}]"

The test:
- Turn on debug tracing on node 1 (a sketch of the tracing calls follows this comment)
- Run rabbitmqctl cluster_status on node 1 to gather baseline timings
- Modify iptables on node 0 to drop all input/output traffic on 25672
- Run rabbitmqctl cluster_status again on node 1 to gather timing once we've effectively lost node 0
- Turn debug off

Results:

Under normal conditions, all nodes up:

(<6747.17828.9>) call rabbit_mnesia:status() (Timestamp: {1504,19004,753903})
...
(<6747.17828.9>) returned from rabbit_mnesia:status/0 ->
[{nodes,[{disc,['rabbit@overcloud-controller-0',
                'rabbit@overcloud-controller-1',
                'rabbit@overcloud-controller-2']}]},
 {running_nodes,['rabbit@overcloud-controller-0',
                 'rabbit@overcloud-controller-2',
                 'rabbit@overcloud-controller-1']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}] (Timestamp: {1504,19004,757283})

So this took 757283-753903 = 3380 microseconds, very fast.

Now, after adding the firewall rule:

(<6747.17848.9>) call rabbit_mnesia:status() (Timestamp: {1504,19007,176029})
...
(<6747.17848.9>) returned from rabbit_mnesia:status/0 ->
[{nodes,[{disc,['rabbit@overcloud-controller-0',
                'rabbit@overcloud-controller-1',
                'rabbit@overcloud-controller-2']}]},
 {running_nodes,['rabbit@overcloud-controller-0',
                 'rabbit@overcloud-controller-2',
                 'rabbit@overcloud-controller-1']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}] (Timestamp: {1504,19033,507165})

So this one took 19033-19007 = 26 seconds.

However, if you look at the timing of the whole command, there's some extra time involved:

[root@overcloud-controller-1 ~]# time rabbitmqctl cluster_status
Cluster status of node 'rabbit@overcloud-controller-1' ...
[{nodes,[{disc,['rabbit@overcloud-controller-0',
                'rabbit@overcloud-controller-1',
                'rabbit@overcloud-controller-2']}]},
 {running_nodes,['rabbit@overcloud-controller-0',
                 'rabbit@overcloud-controller-2',
                 'rabbit@overcloud-controller-1']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]},
 {alarms,[{'rabbit@overcloud-controller-0',[nodedown]},
          {'rabbit@overcloud-controller-2',[]},
          {'rabbit@overcloud-controller-1',[]}]}]

real    0m31.973s
user    0m0.407s
sys     0m0.166s

That's about an extra 6 second discrepancy (~32s vs 26s). Some amount of this is just overhead from starting/stopping the Erlang VM and setting up the distribution so the rabbitmqctl node can query the rabbit app itself; in fact the "good" cluster_status still takes about 1.8s of walltime. The rest of the time is likely spent inside rabbitmqctl collecting the output of the "alarms" status. It does so by querying each node directly in order to get the data from `rabbitmqctl status`, and it will waste a few seconds trying to contact the node we firewalled off since it's not recorded as down yet.
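The "turn on debug tracing" step isn't spelled out above; something along these lines from an Erlang shell attached to node 1 should produce the timestamped call/return records shown in the trace output (this is a reconstruction, the exact invocation used may have differed):

%% trace calls to rabbit_mnesia:status/0, with timestamps and return values
dbg:tracer(),
dbg:p(all, [call, timestamp]),
dbg:tpl(rabbit_mnesia, status, 0, [{'_', [], [{return_trace}]}]).

%% ...run rabbitmqctl cluster_status from another terminal...

%% turn tracing back off
dbg:stop_clear().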
One other thing to note: there is a suspicious 8 second lapse in the rabbitmq log between the time when controller 0 goes down and when queues start failing over:

=INFO REPORT==== 29-Aug-2017::15:03:53 ===
rabbit on node 'rabbit@overcloud-controller-0' down

=INFO REPORT==== 29-Aug-2017::15:03:53 ===
node 'rabbit@overcloud-controller-0' down: etimedout

=INFO REPORT==== 29-Aug-2017::15:04:01 ===
Mirrored queue 'neutron-vo-Port-1.1.overcloud-controller-2.localdomain' in vhost '/': Slave <rabbit.1365.0> saw deaths of mirrors <rabbit.1257.0>

I'm not sure what to make of that. I'm highly suspicious that short-circuiting node failure detection via TCP_USER_TIMEOUT acts weird compared to the typical behavior you get when tuning net_ticktime, but that will require more tests.
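For context on what the TCP_USER_TIMEOUT tuning from the previous comment actually does: as far as I can tell, the raw inet option in RABBITMQ_SERVER_ERL_ARGS decodes as follows (the option numbers are the Linux values, the interpretation is mine):

%% {raw, Protocol, OptionNum, ValueBin} is passed straight through to setsockopt():
%%   Protocol  6  = IPPROTO_TCP
%%   OptionNum 18 = TCP_USER_TIMEOUT (Linux)
%%   Value        = 15000 ms of unacknowledged data before the connection is timed out
{raw, 6, 18, <<15000:64/native>>}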
I ran the test again, except this time I:

- Commented out RABBITMQ_SERVER_ERL_ARGS entirely from rabbitmq-env.conf, effectively removing any TCP tunings
- Adjusted net_ticktime in rabbitmq.conf to 15 seconds (down from the 60s default)

And with that, cluster_status now takes about 23 seconds. Also note that net_ticktime has a +/- 25% variance built in, so the expected detection time can be anywhere from 11.25 seconds to 18.75 seconds. So 23 seconds is certainly better detection than 32 seconds, and it exercises more typical codepaths in the VM.
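For reference, net_ticktime is a kernel application parameter, so in the classic Erlang-term config format the change boils down to something like this (a sketch; the exact file name and layout depend on the RabbitMQ version and how the deployment manages its config):

%% rabbitmq.config
[
  {kernel, [
    %% lower the distribution tick time from the 60s default
    {net_ticktime, 15}
  ]}
].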
Upstream review to switch to net_ticktime - https://review.openstack.org/#/c/503788/
Switching this to OSP 13. The upstream fix introduces a new option and is not suitable for backport to Pike.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2086