Bug 1485989 - resource agent monitor action times out on cluster_status when peer node is down
Summary: resource agent monitor action times out on cluster_status when peer node is down
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: beta
Target Release: 13.0 (Queens)
Assignee: John Eckersberg
QA Contact: Udi Shkalim
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-08-28 15:56 UTC by John Eckersberg
Modified: 2018-06-27 13:34 UTC
CC List: 10 users

Fixed In Version: openstack-tripleo-heat-templates-8.0.0-0.20180304031147.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-27 13:34:14 UTC
Target Upstream Version:
Embargoed:




Links
OpenStack gerrit 503788 (MERGED): RabbitMQ should use net_ticktime (last updated 2020-12-06 04:44:52 UTC)
Red Hat Product Errata RHEA-2018:2086 (last updated 2018-06-27 13:34:53 UTC)

Description John Eckersberg 2017-08-28 15:56:14 UTC
Given a typical three-node OpenStack cluster with controllers 0, 1, and 2:

1) Disconnect one of the three nodes from the network, say controller 0.

2) The resource agent runs its monitor operation on one of the remaining healthy nodes, say controller 1.

3) The resource agent runs `rabbitmqctl cluster_status`, which hangs for some time while trying to contact node 0.

4) The monitoring operation times out, so pacemaker stops the rabbitmq service on node 1.

5) Rabbitmq on node 2 sees the 'DOWN' message from 1, and realizes it's now in a minority partition, so it begins to pause itself as part of the pause_minority partition handling strategy.

6) Pacemaker starts rabbitmq on node 1.

Depending on the exact timing, it seems like this will sometimes work, and other times it will result in a rabbitmq cluster where the nodes cannot form a majority partition, so neither node 1 nor node 2 will start.

It seems to me that the cluster_status query should fail faster than it does, given the way things are tuned and how long it should take to mark node 0 as unreachable within the cluster.  This will require some more investigation into the exact calls being made, what sort of timeouts (if any) they have, and what the implications are of the call(s) landing before or after node 0 is determined to be down.
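
For anyone reproducing this: the monitor timeout that fires in step 4 comes from the pacemaker resource configuration and can be checked with something along these lines (the resource name below is a guess; it differs between deployments):

[root@overcloud-controller-1 ~]# pcs resource show rabbitmq-clone   # prints the op monitor interval/timeout values
[root@overcloud-controller-1 ~]# pcs status --full                  # failed monitor actions show up here once the timeout hits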

Comment 1 John Eckersberg 2017-08-29 15:38:29 UTC
Some timings of rabbit_mnesia:status(), which is what ultimately backs cluster_status.

I've got a three-node cluster with TCP_USER_TIMEOUT set to 15 seconds:

[root@overcloud-controller-0 ~]# grep ERL_ARGS /etc/rabbitmq/rabbitmq-env.conf 
RABBITMQ_SERVER_ERL_ARGS="+K true +P 1048576 -kernel inet_default_connect_options [{nodelay,true},{raw,6,18,<<15000:64/native>>}] -kernel inet_default_listen_options [{raw,6,18,<<15000:64/native>>}]"
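
For reference, my reading of that raw inet option (worth cross-checking against the kernel headers; this is not from the RabbitMQ docs):

  {raw, 6, 18, <<15000:64/native>>}
        ^  ^   ^
        |  |   '-- 15000 ms, encoded as a native-endian 64-bit integer
        |  '------ socket option 18 = TCP_USER_TIMEOUT on Linux
        '--------- protocol level 6 = IPPROTO_TCP (SOL_TCP)

so every distribution socket should abort once transmitted data has been left unacknowledged for ~15 seconds.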

The test:

- Turn on debug tracing on node 1

- Run rabbitmqctl cluster_status on node 1 to gather baseline timings

- Modify iptables on node 0 to drop all input/output traffic on 25672

- Run rabbitmqctl cluster_status again on node 1 to gather timing once we've effectively lost node 0

- Turn debug off
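
Roughly what those steps look like in command form (a sketch, not necessarily the exact invocations used here).  The tracing, from an Erlang shell attached to the rabbit node on node 1:

  dbg:tracer(),
  dbg:p(all, [call, timestamp]),
  dbg:tpl(rabbit_mnesia, status, 0, [{'_', [], [{return_trace}]}]).

  %% ...run the cluster_status tests, then:
  dbg:stop_clear().

And the firewall rules on node 0, dropping the inter-node distribution traffic (port 25672) in both directions:

  iptables -I INPUT  -p tcp -m multiport --ports 25672 -j DROP
  iptables -I OUTPUT -p tcp -m multiport --ports 25672 -j DROP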

Results:

Under normal conditions, all nodes up:

(<6747.17828.9>) call rabbit_mnesia:status() (Timestamp: {1504,19004,753903})
...
(<6747.17828.9>) returned from rabbit_mnesia:status/0 -> [{nodes,
                                                           [{disc,
                                                             ['rabbit@overcloud-controller-0',
                                                              'rabbit@overcloud-controller-1',
                                                              'rabbit@overcloud-controller-2']}]},
                                                          {running_nodes,
                                                           ['rabbit@overcloud-controller-0',
                                                            'rabbit@overcloud-controller-2',
                                                            'rabbit@overcloud-controller-1']},
                                                          {cluster_name,
                                                           <<"rabbit">>},
                                                          {partitions,[]}] (Timestamp: {1504,
                                                                                        19004,
                                                                                        757283})

So this took 757283-753903=3380 microseconds, very fast.


Now, after adding the firewall rule:

(<6747.17848.9>) call rabbit_mnesia:status() (Timestamp: {1504,19007,176029})
...
(<6747.17848.9>) returned from rabbit_mnesia:status/0 -> [{nodes,
                                                           [{disc,
                                                             ['rabbit@overcloud-controller-0',
                                                              'rabbit@overcloud-controller-1',
                                                              'rabbit@overcloud-controller-2']}]},
                                                          {running_nodes,
                                                           ['rabbit@overcloud-controller-0',
                                                            'rabbit@overcloud-controller-2',
                                                            'rabbit@overcloud-controller-1']},
                                                          {cluster_name,
                                                           <<"rabbit">>},
                                                          {partitions,[]}] (Timestamp: {1504,
                                                                                        19033,
                                                                                        507165})

So this one took 19033-19007 = 26 seconds.

However if you look at the timing of the whole command, there's some extra time involved:

[root@overcloud-controller-1 ~]# time rabbitmqctl cluster_status       
Cluster status of node 'rabbit@overcloud-controller-1' ...           
[{nodes,[{disc,['rabbit@overcloud-controller-0',                       
                'rabbit@overcloud-controller-1',                       
                'rabbit@overcloud-controller-2']}]},                   
 {running_nodes,['rabbit@overcloud-controller-0',                      
                 'rabbit@overcloud-controller-2',                      
                 'rabbit@overcloud-controller-1']},                    
 {cluster_name,<<"rabbit">>},       
 {partitions,[]},                                                      
 {alarms,[{'rabbit@overcloud-controller-0',[nodedown]},                
          {'rabbit@overcloud-controller-2',[]},                        
          {'rabbit@overcloud-controller-1',[]}]}]                      

real    0m31.973s                                                      
user    0m0.407s                                                       
sys     0m0.166s                                                       

That's about an extra 6-second discrepancy (~32s vs ~26s).

Some of this is just overhead from starting/stopping the Erlang VM and setting up distribution so that the rabbitmqctl node can query the rabbit app itself.  In fact, the "good" cluster_status still takes about 1.8s of wall time.

The rest of the time is likely spent inside rabbitmqctl collecting the "alarms" portion of the output.  It does so by querying each node directly to get the data from `rabbitmqctl status`, and it wastes a few seconds trying to contact the node we firewalled off, since that node is not yet recorded as down.
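
One way to separate those pieces, assuming `rabbitmqctl eval` is usable here (it should be on this version):

  # baseline CLI + distribution setup overhead only
  time rabbitmqctl eval 'ok.'

  # just the rabbit_mnesia:status/0 call, without the per-node alarms collection
  time rabbitmqctl eval 'rabbit_mnesia:status().'

  # the full command, including alarms gathered from every node
  time rabbitmqctl cluster_status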

One other thing to note: there is a suspicious 8-second gap in the rabbitmq log between the time when controller 0 goes down and when queues start failing over:

=INFO REPORT==== 29-Aug-2017::15:03:53 ===
rabbit on node 'rabbit@overcloud-controller-0' down

=INFO REPORT==== 29-Aug-2017::15:03:53 ===
node 'rabbit@overcloud-controller-0' down: etimedout

=INFO REPORT==== 29-Aug-2017::15:04:01 ===
Mirrored queue 'neutron-vo-Port-1.1.overcloud-controller-2.localdomain' in vhost '/': Slave <rabbit.1365.0> saw deaths of mirrors <rabbit.1257.0>

I'm not sure what to make of that.  I'm highly suspicious that short-circuiting node failure detection via TCP_USER_TIMEOUT behaves oddly compared to the typical behavior you get when tuning net_ticktime, but confirming that will require more tests.

Comment 2 John Eckersberg 2017-08-29 18:41:07 UTC
I ran the test again, except this time I:

- Commented out RABBITMQ_SERVER_ERL_ARGS entirely from rabbitmq-env.conf, effectively removing any TCP tunings

- Adjusted net_ticktime in rabbitmq.conf to 15 seconds (down from the 60s default)

And with that, cluster_status now takes about 23 seconds.  Also note that net_ticktime has a +/- 25% variance built in, so the expected detection time can be anywhere from 11.25 to 18.75 seconds.  So 23 seconds is certainly better detection than 32 seconds, and it exercises more typical code paths in the VM.
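
For reference, the net_ticktime change amounts to adding a kernel section like this alongside the existing {rabbit, [...]} terms in the Erlang-terms config (path per the usual package layout; adjust if your config file differs):

  %% /etc/rabbitmq/rabbitmq.config
  [
   {kernel, [{net_ticktime, 15}]}
  ].

and the effective value can be confirmed at runtime with:

  rabbitmqctl eval 'net_kernel:get_net_ticktime().'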

Comment 3 John Eckersberg 2017-10-30 14:48:09 UTC
Upstream review to switch to net_ticktime - https://review.openstack.org/#/c/503788/

Comment 4 John Eckersberg 2017-11-14 20:31:59 UTC
Switching this to OSP 13.  The upstream fix introduces a new option and is not suitable for backport to Pike.

Comment 9 errata-xmlrpc 2018-06-27 13:34:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086

