Bug 1189480
| Summary: | Rabbitmq cluster remains partitioned after short network partition incident | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Bart van den Heuvel <bvandenh> |
| Component: | openstack-foreman-installer | Assignee: | John Eckersberg <jeckersb> |
| Status: | CLOSED ERRATA | QA Contact: | Leonid Natapov <lnatapov> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.0 (Juno) | CC: | aparsons, apevec, dmaley, felipe.alfaro, jeckersb, jguiditt, lhh, martin, mburns, morazi, nbarcet, oblaut, pcaruana, pneedle, racedoro, rhos-maint, sasha, sclewis, yeylon |
| Target Milestone: | z4 | Keywords: | TestOnly, ZStream |
| Target Release: | Installer | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-foreman-installer-3.0.17-1.el7ost | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-08-24 15:18:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1151756, 1189241 | | |
| Bug Blocks: | 1186672 | | |
Description
Bart van den Heuvel
2015-02-05 13:19:12 UTC
I'm going to move this to OFI and fix it by forcing the TCP timeout down to 5 seconds. If the other node drops off the net (or you firewall it away like above), then the inter-cluster connection will close with a timeout error on both sides, avoiding the asymmetrical disconnect noted in the linked forum thread.

Would it be possible to show how to implement the change on the command line so we can test its effectiveness? I believe the implementation of the solution is not 'fixed' in puppet configuration?

(In reply to Bart van den Heuvel from comment #7)
> Would it be possible to show how to implement the change on the command line
> so we can test its effectiveness? I believe the implementation of the
> solution is not 'fixed' in puppet configuration?

Add this line to /etc/rabbitmq/rabbitmq-env.conf:

RABBITMQ_SERVER_ERL_ARGS="+K true +A30 +P 1048576 -kernel inet_default_connect_options [{nodelay,true},{raw,6,18,<<5000:64/native>>}] -kernel inet_default_listen_options [{raw,6,18,<<5000:64/native>>}]"

Tested the proposed solution. It does not work as expected. See the results below. (I will update the bugzilla)
Three node rabbitmq cluster:
cat > /etc/rabbitmq/rabbitmq.config << EOF
% configure clustering with defaults, except:
% network partition response (pause_minority), management console on, management agent on
[
{rabbit, [
{cluster_nodes, {['rabbit@rabbit1', 'rabbit@rabbit2', 'rabbit@rabbit3'], disc}},
{cluster_partition_handling, pause_minority},
{default_user, <<"guest">>},
{default_pass, <<"guest">>}
]},
{rabbitmq_management, [{listener, [{port, 15672}]}]},
{rabbitmq_management_agent, [ {force_fine_statistics, true} ] },
{kernel, [ ]}
].
EOF
scp /etc/rabbitmq/rabbitmq.config rabbit2:/etc/rabbitmq/
scp /etc/rabbitmq/rabbitmq.config rabbit3:/etc/rabbitmq/
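One prerequisite worth stating explicitly: clustering only works if the Erlang cookie is identical on all three nodes. A quick check, as a sketch (not part of the original transcript):
# Sketch: the Erlang cookie must match on every node for clustering to work
md5sum /var/lib/rabbitmq/.erlang.cookie
ssh rabbit2 md5sum /var/lib/rabbitmq/.erlang.cookie
ssh rabbit3 md5sum /var/lib/rabbitmq/.erlang.cookie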
Did this on each of the cluster nodes (rabbit1, rabbit2, rabbit3)
echo 'RABBITMQ_SERVER_ERL_ARGS="+K true +A30 +P 1048576 -kernel inet_default_connect_options [{nodelay,true},{raw,6,18,<<5000:64/native>>}] -kernel inet_default_listen_options [{raw,6,18,<<5000:64/native>>}]"' >>/etc/rabbitmq/rabbitmq-env.conf
systemctl stop rabbitmq-server; rm -rf /var/lib/rabbitmq/mnesia/*; systemctl start rabbitmq-server
ssh rabbit2 "systemctl stop rabbitmq-server; rm -rf /var/lib/rabbitmq/mnesia/*; systemctl start rabbitmq-server"
ssh rabbit3 "systemctl stop rabbitmq-server; rm -rf /var/lib/rabbitmq/mnesia/*; systemctl start rabbitmq-server"
rabbitmqctl add_user admin pocroot
rabbitmqctl set_user_tags admin administrator
rabbitmqctl set_policy HA '^(?!amq\.).*' '{"ha-mode": "all"}'
rabbitmqctl environment | grep cluster
rabbitmqctl cluster_status
[root@rabbit1 ~]# rabbitmqctl environment | grep pause_minority
{cluster_partition_handling,pause_minority},
[root@rabbit1 ~]# ssh rabbit2 rabbitmqctl environment | grep pause_minority
{cluster_partition_handling,pause_minority},
[root@rabbit1 ~]# ssh rabbit3 rabbitmqctl environment | grep pause_minority
{cluster_partition_handling,pause_minority},
[root@rabbit1 ~]# rabbitmqctl cluster_status | grep partitions
{partitions,[]}]
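Before running the isolation test, the raw socket option from RABBITMQ_SERVER_ERL_ARGS can also be confirmed on a live node. Option numbers 6 and 18 correspond to IPPROTO_TCP and TCP_USER_TIMEOUT on Linux, and <<5000:64/native>> encodes the 5000 ms timeout. A minimal check, as a sketch (not from the original transcript):
# Sketch: dump the distribution socket defaults; expect {raw,6,18,<<5000:64/native>>}
rabbitmqctl eval 'application:get_env(kernel, inet_default_connect_options).'
rabbitmqctl eval 'application:get_env(kernel, inet_default_listen_options).'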
[root@rabbit3 rabbitmq]# date ; iptables -A INPUT -s rabbit1 -j DROP; iptables -A OUTPUT -d rabbit1 -j DROP ; iptables -A INPUT -s rabbit2 -j DROP; iptables -A OUTPUT -d rabbit2 -j DROP
Fri Feb 20 18:04:59 CET 2015
[root@rabbit3 rabbitmq]# sleep 60; systemctl restart firewalld
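Restarting firewalld reloads its ruleset and so discards the manually added rules; an equivalent, more explicit teardown would be to delete the same rules (a sketch, not what was run here):
# Sketch: remove the four DROP rules added above instead of restarting firewalld
iptables -D INPUT -s rabbit1 -j DROP; iptables -D OUTPUT -d rabbit1 -j DROP
iptables -D INPUT -s rabbit2 -j DROP; iptables -D OUTPUT -d rabbit2 -j DROP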
results:
[root@rabbit3 rabbitmq]# date; rabbitmqctl cluster_status
Fri Feb 20 18:08:16 CET 2015
Cluster status of node rabbit@rabbit3 ...
[{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2,rabbit@rabbit3]}]},
{running_nodes,[rabbit@rabbit3]},
{cluster_name,<<"rabbit@rabbit1">>},
{partitions,[{rabbit@rabbit3,[rabbit@rabbit2]}]}]
...done.
rabbit1 log
=INFO REPORT==== 20-Feb-2015::18:05:10 ===
rabbit on node rabbit@rabbit3 down
=INFO REPORT==== 20-Feb-2015::18:05:10 ===
node rabbit@rabbit3 down: etimedout
Rabbit2 log
=INFO REPORT==== 20-Feb-2015::18:05:08 ===
rabbit on node rabbit@rabbit3 down
=INFO REPORT==== 20-Feb-2015::18:05:08 ===
node rabbit@rabbit3 down: etimedout
=ERROR REPORT==== 20-Feb-2015::18:06:01 ===
Mnesia(rabbit@rabbit2): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@rabbit3}
Rabbit3 log
=INFO REPORT==== 20-Feb-2015::18:05:08 ===
rabbit on node rabbit@rabbit3 down
=INFO REPORT==== 20-Feb-2015::18:05:08 ===
node rabbit@rabbit3 down: etimedout
=ERROR REPORT==== 20-Feb-2015::18:06:01 ===
Mnesia(rabbit@rabbit2): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@rabbit3}
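For reference, the asymmetric view during the window can also be polled directly from each node while the rules are in place; a sketch, not part of the original transcript (nodes(). lists the Erlang nodes the local node still considers connected):
# Sketch: compare each node's view of its peers during the partition
rabbitmqctl eval 'nodes().'
ssh rabbit2 "rabbitmqctl eval 'nodes().'"
# run this one locally on rabbit3, since rabbit3 is firewalled off from rabbit1/rabbit2:
rabbitmqctl eval 'nodes().'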
I can reproduce what you're seeing. The good news is that the TCP timeout for the two "good" nodes is working: they detect and flag the isolated node down (with etimedout, as expected) within about 10 seconds of it being firewalled off. However, I would expect the "bad" node to notice that the other two are gone after about 10 seconds as well, and clearly that's not happening. I suspect there's some weird bug when iptables/netfilter gets involved, probably triggering the same behavior as bug 1189241. Going to try applying that fix and testing again. Stay tuned.

Installing my test kernel with the patch for bug 1189241 seems to fix this. With the new kernel:

Success (using iptables)
# Make sure we run the intended kernel
uname -a
ssh rabbit2 'uname -a'
ssh rabbit3 'uname -a'
Linux rabbit1.zokahn.thinkpad 3.10.0-229.el7.x86_64 #1 SMP Fri Feb 6 15:36:18 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
Linux rabbit2.zokahn.thinkpad 3.10.0-229.el7.x86_64 #1 SMP Fri Feb 6 15:36:18 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
Linux rabbit3.zokahn.thinkpad 3.10.0-229.el7.x86_64 #1 SMP Fri Feb 6 15:36:18 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
# Setup the cluster, reset the state of everything
systemctl stop rabbitmq-server; rm -rf /var/lib/rabbitmq/mnesia/*; systemctl start rabbitmq-server
ssh rabbit2 "systemctl stop rabbitmq-server; rm -rf /var/lib/rabbitmq/mnesia/*; systemctl start rabbitmq-server"
ssh rabbit3 "systemctl stop rabbitmq-server; rm -rf /var/lib/rabbitmq/mnesia/*; systemctl start rabbitmq-server"
rabbitmqctl add_user admin pocroot
rabbitmqctl set_user_tags admin administrator
rabbitmqctl set_policy HA '^(?!amq\.).*' '{"ha-mode": "all"}'
rabbitmqctl environment | grep cluster
rabbitmqctl cluster_status
# check environment is in sync with partition recovery
rabbitmqctl environment | grep pause_minority
ssh rabbit2 rabbitmqctl environment | grep pause_minority
ssh rabbit3 rabbitmqctl environment | grep pause_minority
[root@rabbit1 ~]# rabbitmqctl environment | grep pause_minority
{cluster_partition_handling,pause_minority},
[root@rabbit1 ~]# ssh rabbit2 rabbitmqctl environment | grep pause_minority
{cluster_partition_handling,pause_minority},
[root@rabbit1 ~]# ssh rabbit3 rabbitmqctl environment | grep pause_minority
{cluster_partition_handling,pause_minority},
# make sure no partitions to start
rabbitmqctl cluster_status | grep partitions
[root@rabbit1 ~]# rabbitmqctl cluster_status | grep partitions
{partitions,[]}]
# First test, isolate rabbit3 using iptables
date ; iptables -A INPUT -s rabbit1 -j DROP; iptables -A OUTPUT -d rabbit1 -j DROP ; iptables -A INPUT -s rabbit2 -j DROP; iptables -A OUTPUT -d rabbit2 -j DROP
sleep 60; systemctl restart firewalld
# Second test: isolate rabbit3 by disabling the NIC using libvirt (see the sketch below)
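A sketch of how that NIC isolation can be done from the KVM host; the domain name and interface device are assumptions, not taken from the report (check them with virsh domiflist rabbit3):
# Hypothetical commands on the hypervisor; "rabbit3" is the libvirt domain name
# and vnet2 its interface device as reported by: virsh domiflist rabbit3
virsh domif-setlink rabbit3 vnet2 down
sleep 60
virsh domif-setlink rabbit3 vnet2 up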
rabbit1 log
---------------------------
=INFO REPORT==== 24-Feb-2015::13:59:36 ===
rabbit on node rabbit@rabbit3 down
=INFO REPORT==== 24-Feb-2015::13:59:36 ===
node rabbit@rabbit3 down: etimedout
=INFO REPORT==== 24-Feb-2015::14:00:28 ===
rabbit on node rabbit@rabbit3 up
rabbit2 log
---------------------------
=INFO REPORT==== 24-Feb-2015::13:59:42 ===
rabbit on node rabbit@rabbit3 down
=INFO REPORT==== 24-Feb-2015::13:59:42 ===
node rabbit@rabbit3 down: etimedout
=INFO REPORT==== 24-Feb-2015::14:00:28 ===
rabbit on node rabbit@rabbit3 up
rabbit3 log
---------------------------
=INFO REPORT==== 24-Feb-2015::13:59:52 ===
rabbit on node rabbit@rabbit2 down
=INFO REPORT==== 24-Feb-2015::13:59:59 ===
node rabbit@rabbit2 down: etimedout
=WARNING REPORT==== 24-Feb-2015::13:59:59 ===
Cluster minority status detected - awaiting recovery
=INFO REPORT==== 24-Feb-2015::13:59:59 ===
rabbit on node rabbit@rabbit1 down
=INFO REPORT==== 24-Feb-2015::13:59:59 ===
Stopping RabbitMQ
=INFO REPORT==== 24-Feb-2015::13:59:59 ===
node rabbit@rabbit1 down: etimedout
=WARNING REPORT==== 24-Feb-2015::13:59:59 ===
Cluster minority status detected - awaiting recovery
=INFO REPORT==== 24-Feb-2015::14:00:06 ===
Statistics database started.
=INFO REPORT==== 24-Feb-2015::14:00:06 ===
stopped TCP Listener on 192.168.122.83:5672
=ERROR REPORT==== 24-Feb-2015::14:00:28 ===
Mnesia(rabbit@rabbit3): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, rabbit@rabbit1}
=ERROR REPORT==== 24-Feb-2015::14:00:28 ===
Mnesia(rabbit@rabbit3): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, rabbit@rabbit2}
=INFO REPORT==== 24-Feb-2015::14:00:28 ===
Starting RabbitMQ 3.3.5 on Erlang R16B03
Copyright (C) 2007-2014 GoPivotal, Inc.
Licensed under the MPL. See http://www.rabbitmq.com/
=INFO REPORT==== 24-Feb-2015::14:00:28 ===
node : rabbit@rabbit3
home dir : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.config
cookie hash : RflLlXitNm70/ikHN/7Tsw==
log : /var/log/rabbitmq/rabbit
sasl log : /var/log/rabbitmq/rabbit
database dir : /var/lib/rabbitmq/mnesia/rabbit@rabbit3
=INFO REPORT==== 24-Feb-2015::14:00:28 ===
Limiting to approx 924 file handles (829 sockets)
=INFO REPORT==== 24-Feb-2015::14:00:28 ===
Memory limit set to 397MB of 993MB total.
=INFO REPORT==== 24-Feb-2015::14:00:28 ===
Disk free limit set to 50MB
=INFO REPORT==== 24-Feb-2015::14:00:28 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index
=INFO REPORT==== 24-Feb-2015::14:00:28 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index
=INFO REPORT==== 24-Feb-2015::14:00:28 ===
started TCP Listener on 192.168.122.83:5672
=INFO REPORT==== 24-Feb-2015::14:00:28 ===
rabbit on node rabbit@rabbit1 up
=INFO REPORT==== 24-Feb-2015::14:00:28 ===
Management plugin started. Port: 15672
=INFO REPORT==== 24-Feb-2015::14:00:28 ===
rabbit on node rabbit@rabbit2 up
=WARNING REPORT==== 24-Feb-2015::14:00:28 ===
The on_load function for module sd_notify returned {error,
{upgrade,
"Upgrade not supported by this NIF library."}}
=INFO REPORT==== 24-Feb-2015::14:00:28 ===
Server startup complete; 6 plugins started.
* rabbitmq_management
* rabbitmq_web_dispatch
* webmachine
* mochiweb
* rabbitmq_management_agent
* amqp_client
Doing the tests, I noticed that there is a gap of several seconds between the isolation action and the etimedout detection, and the gap differs per node. Additional tests revealed the following when trying to hit that window with a reconnect:
date ; iptables -A INPUT -s rabbit1 -j DROP; iptables -A OUTPUT -d rabbit1 -j DROP ; iptables -A INPUT -s rabbit2 -j DROP; iptables -A OUTPUT -d rabbit2 -j DROP
sleep 30; systemctl restart firewalld
Rabbit1
=INFO REPORT==== 24-Feb-2015::14:08:55 ===
rabbit on node rabbit@rabbit3 down
=INFO REPORT==== 24-Feb-2015::14:08:55 ===
node rabbit@rabbit3 down: etimedout
Rabbit2
=INFO REPORT==== 24-Feb-2015::14:08:58 ===
rabbit on node rabbit@rabbit3 down
=INFO REPORT==== 24-Feb-2015::14:08:58 ===
node rabbit@rabbit3 down: etimedout
=ERROR REPORT==== 24-Feb-2015::14:09:15 ===
Mnesia(rabbit@rabbit2): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@rabbit3}
Rabbit3
=INFO REPORT==== 24-Feb-2015::14:09:07 ===
rabbit on node rabbit@rabbit2 down
=ERROR REPORT==== 24-Feb-2015::14:09:15 ===
Mnesia(rabbit@rabbit3): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@rabbit2}
=INFO REPORT==== 24-Feb-2015::14:09:16 ===
node rabbit@rabbit2 down: etimedout
=INFO REPORT==== 24-Feb-2015::14:09:16 ===
rabbit on node rabbit@rabbit1 down
=INFO REPORT==== 24-Feb-2015::14:09:16 ===
node rabbit@rabbit1 down: connection_closed
Result:
[root@rabbit1 ~]# rabbitmqctl cluster_status | grep partitions
{partitions,[{rabbit@rabbit2,[rabbit@rabbit3]}]}]
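When a partition like this persists, the standard manual recovery is to pick the partition you trust and restart the RabbitMQ application on the node(s) in the other partition; a sketch (not taken from this report), run here on rabbit3:
# Sketch: clear a lingering partition by restarting the app on the untrusted side
rabbitmqctl stop_app
rabbitmqctl start_app
rabbitmqctl cluster_status | grep partitions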
Merged

One important thing here, which we've kinda overlooked: OSP is not configuring cluster_partition_handling at all, which means it's using the default value of ignore. That means if the cluster gets partitioned for any reason, it will stay partitioned until an administrator explicitly takes action to correct the partition. I think this is crummy, and we should default to setting cluster_partition_handling to pause_minority. I'll throw together a pull request to do just that.

Merged

John, this BZ depends on BZ #1189241, which is not fixed yet. Can I verify this bug now, or should I wait until BZ #1189241 is fixed? Thanks, Leonid.

Verified:
Environment:
openstack-foreman-installer-3.0.26-1.el7ost.noarch
The cluster is restored after the outage; details are below:
This is from the rabbitmq log on the node where the iptables blocking rules were added:
=WARNING REPORT==== 18-Aug-2015::11:23:42 ===
Cluster minority status detected - awaiting recovery
=INFO REPORT==== 18-Aug-2015::11:23:56 ===
Mirrored queue 'cinder-volume' in vhost '/': Slave <rabbit.1108.0> saw deaths of mirrors <rabbit.991.0> <rabbit.1259.0>
=INFO REPORT==== 18-Aug-2015::11:23:56 ===
Mirrored queue 'cinder-volume' in vhost '/': Promoting slave <rabbit.1108.0> to master
=INFO REPORT==== 18-Aug-2015::11:23:56 ===
Mirrored queue 'engine_fanout_d175b2c76c7c4d6892c05249b3392344' in vhost '/': Slave <rabbit.1847.0> saw deaths of mirrors <rabbit.2003.0> <rabbit.1750.0>
=INFO REPORT==== 18-Aug-2015::11:23:56 ===
Mirrored queue 'engine_fanout_d175b2c76c7c4d6892c05249b3392344' in vhost '/': Promoting slave <rabbit.1847.0> to master
This is from the node where the blocking rules were added:
After adding the blocking rules:
rabbitmqctl cluster_status
Cluster status of node 'rabbit@lb-backend-maca25400702876' ...
[{nodes,[{disc,['rabbit@lb-backend-maca25400702875',
'rabbit@lb-backend-maca25400702876',
'rabbit@lb-backend-maca25400702877']}]}]
...done.
Right after restarting the firewall:
rabbitmqctl cluster_status
Cluster status of node 'rabbit@lb-backend-maca25400702876' ...
Error: unable to connect to node 'rabbit@lb-backend-maca25400702876': nodedown
DIAGNOSTICS
===========
attempted to contact: ['rabbit@lb-backend-maca25400702876']
rabbit@lb-backend-maca25400702876:
* connected to epmd (port 4369) on lb-backend-maca25400702876
* epmd reports: node 'rabbit' not running at all
other nodes on lb-backend-maca25400702876: [rabbitmqctl395]
* suggestion: start the node
current node details:
- node name: rabbitmqctl395@maca25400702876
- home dir: /var/lib/rabbitmq
- cookie hash: soeIWU2jk2YNseTyDSlsEA==
After restarting the firewall (restored):
rabbitmqctl cluster_status
Cluster status of node 'rabbit@lb-backend-maca25400702876' ...
[{nodes,[{disc,['rabbit@lb-backend-maca25400702875',
'rabbit@lb-backend-maca25400702876',
'rabbit@lb-backend-maca25400702877']}]},
{running_nodes,['rabbit@lb-backend-maca25400702875',
'rabbit@lb-backend-maca25400702877',
'rabbit@lb-backend-maca25400702876']},
{cluster_name,<<"rabbit.com">>},
{partitions,[]}]
...done.
This is on remote node:
After the blocking rules were added:
rabbitmqctl cluster_status
Cluster status of node 'rabbit@lb-backend-maca25400702875' ...
[{nodes,[{disc,['rabbit@lb-backend-maca25400702875',
'rabbit@lb-backend-maca25400702876',
'rabbit@lb-backend-maca25400702877']}]},
{running_nodes,['rabbit@lb-backend-maca25400702877',
'rabbit@lb-backend-maca25400702875']},
{cluster_name,<<"rabbit.com">>},
{partitions,[]}]
...done.
After the blocking rules were removed:
rabbitmqctl cluster_status
Cluster status of node 'rabbit@lb-backend-maca25400702875' ...
[{nodes,[{disc,['rabbit@lb-backend-maca25400702875',
'rabbit@lb-backend-maca25400702876',
'rabbit@lb-backend-maca25400702877']}]},
{running_nodes,['rabbit@lb-backend-maca25400702876',
'rabbit@lb-backend-maca25400702877',
'rabbit@lb-backend-maca25400702875']},
{cluster_name,<<"rabbit.com">>},
{partitions,[]}]
...done.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-1662.html

May I ask why not use 'autoheal'? Apparently, Mirantis is using 'autoheal' instead of 'pause_minority' as per https://review.openstack.org/#/c/115518/. The problem we are facing with a 3-controller OpenStack cluster deployed with Red Hat Director is that the RabbitMQ cluster does not survive when two nodes go down; it only survives losing a single node. That is, by default, we have N+1 instead of N+2, which is not optimal, IMHO.

(In reply to Felipe Alfaro Solana from comment #33)
> May I ask why not use 'autoheal'? Apparently, Mirantis is using 'autoheal'
> instead of 'pause_minority' as per https://review.openstack.org/#/c/115518/.
> The problem we are facing with a 3-controller OpenStack cluster deployed
> with Red Hat Director is that the RabbitMQ cluster does not survive when two
> nodes go down. It only survives losing a single node. That is, by default,
> we have N+1 instead of N+2, which is not optimal, IMHO.

Hello Felipe,

It's a CAP theorem question. Both settings give you partition tolerance. With pause_minority, you get consistency while sacrificing availability: the minority node(s) will pause and disconnect all clients, and those clients will reconnect to nodes in the majority half of the cluster and resume normal operation. With autoheal, you get availability while sacrificing consistency: the cluster becomes "split-brained", and the success of each RPC request is contingent on all connections participating in that request being on the same partition as one another, which is not very likely. So until the partition ends, the system will be in a degraded state and most things are going to fail. Basically, Red Hat engineering preferred pause_minority and chose it as the default configuration because the architecture of OpenStack RPC means a partitioned-but-inconsistent cluster is almost useless.

Regards,
Pablo.
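For reference, both behaviours discussed above map onto a single configuration key; a minimal check and the alternative value, as a sketch (not from this report):
# Sketch: confirm which partition-handling mode a running node was started with
rabbitmqctl eval 'application:get_env(rabbit, cluster_partition_handling).'
# The autoheal behaviour is the same rabbitmq.config key with a different value:
#   {cluster_partition_handling, autoheal}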