Bug 1326712 - Upgrading 7.3->8.0 fails and can't be recovered after network disconnections, RabbitMQ cluster broke
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: async
Assignee: Angus Thomas
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-04-13 10:58 UTC by Udi Kalifon
Modified: 2016-04-14 16:03 UTC (History)
7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-04-14 16:03:09 UTC
Target Upstream Version:



Description Udi Kalifon 2016-04-13 10:58:15 UTC
Description of problem:
I tried to upgrade a bare metal HA deployment: 3 controllers, 2 computes, and 1 ceph node. The test was to see whether the upgrade can recover when the network suffers momentary failures.

To simulate network outages, I connected to the console of one of the controllers and one of the computes with IPMI, and ran an infinite loop that briefly shut down the networking service every 2 minutes:

while true; do
    echo "network going down ...."
    systemctl stop network
    echo "restoring network...."
    systemctl start network
    echo "2 minute delay ..."
    sleep 120
done

I also simulated a few network hiccups on the undercloud node in a similar way, but manually and less frequently (not with a 2-minute loop as on the overcloud).

On the 2nd deployment step of the upgrade procedure (major-upgrade-pacemaker.yaml), a failure took down the cluster and it could not be recovered. Running journalctl -fn 100 on the controllers shows:

Apr 13 08:41:23 overcloud-controller-1.localdomain cinder-scheduler[21403]: 2016-04-13 08:41:23.962 21403 ERROR oslo.messaging._drivers.impl_rabbit [req-69c4ea23-c7dc-448c-91a6-fc4786d253b3 - - - - -] AMQP server on 10.35.191.14:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
Apr 13 08:41:24 overcloud-controller-1.localdomain su[2800]: pam_unix(su:session): session closed for user rabbitmq
Apr 13 08:41:24 overcloud-controller-1.localdomain cinder-scheduler[21403]: 2016-04-13 08:41:24.982 21403 ERROR oslo.messaging._drivers.impl_rabbit [req-69c4ea23-c7dc-448c-91a6-fc4786d253b3 - - - - -] AMQP server on 10.35.191.12:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
Apr 13 08:41:26 overcloud-controller-1.localdomain cinder-scheduler[21403]: 2016-04-13 08:41:26.001 21403 ERROR oslo.messaging._drivers.impl_rabbit [req-69c4ea23-c7dc-448c-91a6-fc4786d253b3 - - - - -] AMQP server on 10.35.191.13:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds.

Similar errors appear for neutron. It seems that AMQP is not reliable in a cluster when there are network disconnections, according to: http://stackoverflow.com/questions/8654053/rabbitmq-cluster-is-not-reconnecting-after-network-failure

I tried to follow the procedure to stop rabbit and re-join the cluster, according to the above instructions and also according to https://bugzilla.redhat.com/show_bug.cgi?id=1299923#c15, but the command "rabbitmqctl stop_app" hangs, and that's where I stopped.
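For reference, the re-join procedure from the linked instructions boils down to roughly the following (a sketch only; the peer node name is an example, and the first command is exactly the one that hung here):

```shell
# Re-join a RabbitMQ node to its cluster (illustrative sketch; the peer
# node name rabbit@overcloud-controller-0 is an example, not from this bug).
rabbitmqctl stop_app                                    # <- the command that hung
rabbitmqctl join_cluster rabbit@overcloud-controller-0  # re-join an existing peer
rabbitmqctl start_app
rabbitmqctl cluster_status                              # verify the node re-joined
```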


Steps to Reproduce:
1. Upgrade from 7.3 to 8.0 while causing brief network outages as described in the bug

Comment 2 John Eckersberg 2016-04-14 15:54:22 UTC
What you're trying to do is questionable.  RabbitMQ is likely to partition and/or crash with an unreliable underlying network.  That's just a fact of the architecture and implementation.  The clustering documentation even explicitly says:

"clustering is not recommended over a WAN or when network links between nodes are unreliable"

So the result of this test scenario doesn't surprise me at all.

Don't try to run rabbitmqctl commands to start/stop/modify the cluster while it's under pacemaker control unless you *really* know what you are doing.  I would recommend just killing (-9) any problematic server and letting pacemaker do the recovery.  It should be smart enough to handle the edge cases and get everything working again.  Also be sure to check rabbitmqctl cluster_status on each controller to see what view each node has of the cluster (it's probably partitioned given your test circumstances).
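The recovery approach described above could be sketched roughly like this (assumptions: the RabbitMQ resource is pacemaker-managed as in this deployment, and the server runs as a beam.smp Erlang VM process; the commands are illustrative, not a verified procedure):

```shell
# 1. On each controller, check what view this node has of the cluster
#    (under a partition, different nodes will report different memberships):
rabbitmqctl cluster_status

# 2. Hard-kill a wedged RabbitMQ server rather than using rabbitmqctl
#    stop/start commands, and let pacemaker handle the recovery:
pkill -9 -f beam.smp

# 3. Watch pacemaker restart the resource and re-form the cluster:
pcs status
```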

But to get at least something productive out of this, it could be worthwhile to look at the rabbitmq logs from the same time window, if you still have them.

Comment 3 Hugh Brock 2016-04-14 16:03:09 UTC
Closing this NOTABUG.

Udi, a retest with a network partition that stays down, and then an attempted recovery as Eck describes above, could be really useful.

