Bug 1326712

Summary: Upgrading 7.3->8.0 fails and can't be recovered after network disconnections, RabbitMQ cluster broke
Product: Red Hat OpenStack
Reporter: Udi Kalifon <ukalifon>
Component: rhosp-director
Assignee: Angus Thomas <athomas>
Status: CLOSED NOTABUG
QA Contact: Arik Chernetsky <achernet>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 8.0 (Liberty)
CC: dbecker, hbrock, jeckersb, mburns, mcornea, morazi, rhel-osp-director-maint
Target Milestone: async
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-04-14 16:03:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: ---
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host: ---
Cloudforms Team: ---
Target Upstream Version: ---

Description Udi Kalifon 2016-04-13 10:58:15 UTC
Description of problem:
I tried to upgrade a bare metal HA deployment: 3 controllers, 2 computes and 1 ceph. The test was to see whether the upgrade can recover when the network suffers momentary failures.

To simulate network outages, I connected to the console of one of the controllers and one of the computes with IPMI, and ran an infinite loop that stopped the network service and immediately restarted it, once every 2 minutes:

while true; do
    echo "network going down ...."
    systemctl stop network
    echo "restoring network...."
    systemctl start network
    echo "2 minute delay ..."
    sleep 120
done

I also simulated a few network hiccups on the undercloud node in a similar way, but manually and much less frequently (not with a 2-minute loop like I used on the overcloud).

During the second deployment step of the upgrade procedure (major-upgrade-pacemaker.yaml), a failure took down the cluster, and it could not be recovered. Running journalctl -fn 100 on the controllers shows:

Apr 13 08:41:23 overcloud-controller-1.localdomain cinder-scheduler[21403]: 2016-04-13 08:41:23.962 21403 ERROR oslo.messaging._drivers.impl_rabbit [req-69c4ea23-c7dc-448c-91a6-fc4786d253b3 - - - - -] AMQP server on 10.35.191.14:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
Apr 13 08:41:24 overcloud-controller-1.localdomain su[2800]: pam_unix(su:session): session closed for user rabbitmq
Apr 13 08:41:24 overcloud-controller-1.localdomain cinder-scheduler[21403]: 2016-04-13 08:41:24.982 21403 ERROR oslo.messaging._drivers.impl_rabbit [req-69c4ea23-c7dc-448c-91a6-fc4786d253b3 - - - - -] AMQP server on 10.35.191.12:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
Apr 13 08:41:26 overcloud-controller-1.localdomain cinder-scheduler[21403]: 2016-04-13 08:41:26.001 21403 ERROR oslo.messaging._drivers.impl_rabbit [req-69c4ea23-c7dc-448c-91a6-fc4786d253b3 - - - - -] AMQP server on 10.35.191.13:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds.

Similar errors appear for neutron. It seems that a RabbitMQ cluster is not reliable when there are network disconnections, according to: http://stackoverflow.com/questions/8654053/rabbitmq-cluster-is-not-reconnecting-after-network-failure

I tried to follow the procedure to stop rabbit and rejoin the cluster, per the above instructions and also per https://bugzilla.redhat.com/show_bug.cgi?id=1299923#c15, but the command "rabbitmqctl stop_app" hangs, and that is where I stopped.
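For reference, the manual rejoin procedure from those links amounts to something like the following sketch. The target node name is illustrative, not taken from this environment, and on a pacemaker-managed cluster this sequence should normally be left to the resource agent:

```shell
#!/bin/bash
# Hedged sketch of the manual RabbitMQ rejoin sequence. The node name
# below is an example; on a pacemaker-managed cluster, prefer letting
# the resource agent do this.
rejoin_rabbit_node() {
    local target="${1:-rabbit@overcloud-controller-0}"  # example name
    rabbitmqctl stop_app    || return 1  # this is the step that hung here
    rabbitmqctl force_reset || return 1  # discards local cluster state
    rabbitmqctl join_cluster "$target" || return 1
    rabbitmqctl start_app
}
```

If stop_app hangs, as it did in this case, the Erlang VM is usually wedged and the sequence above cannot make progress.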


Steps to Reproduce:
1. Upgrade from 7.3 to 8.0 while causing brief network outages as described above

Comment 2 John Eckersberg 2016-04-14 15:54:22 UTC
What you're trying to do is questionable.  RabbitMQ is likely to partition and/or crash with an unreliable underlying network.  That's just a fact of the architecture and implementation.  The clustering documentation even explicitly says:

"clustering is not recommended over a WAN or when network links between nodes are unreliable"

So the result of this test scenario doesn't surprise me at all.

Don't try to run rabbitmqctl commands to start/stop/modify the cluster while it is under pacemaker control unless you *really* know what you are doing.  I would recommend just killing (-9) any problematic server and letting pacemaker do the recovery.  It should be smart enough to handle the edge cases and get everything working again.  Also be sure to check rabbitmqctl cluster_status on each controller to see what view each node has of the cluster (it's probably partitioned given your test circumstances).
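That recovery flow can be sketched as below. The controller hostnames and the heat-admin user are assumptions based on director defaults and may differ in other environments:

```shell
#!/bin/bash
# Hedged sketch: compare each node's view of the cluster, then kill the
# Erlang VM on a wedged node and let pacemaker restart it. Hostnames and
# the heat-admin user are assumptions (director defaults), not taken
# from this bug report.
recover_wedged_rabbit() {
    local wedged="${1:?usage: recover_wedged_rabbit <wedged-host>}"
    # Partitioned nodes will disagree about cluster membership here.
    for host in overcloud-controller-0 overcloud-controller-1 overcloud-controller-2; do
        ssh "heat-admin@${host}" sudo rabbitmqctl cluster_status
    done
    # On the node that looks stuck: kill -9 the beam process and let the
    # pacemaker resource agent perform the recovery.
    ssh "heat-admin@${wedged}" 'sudo pkill -9 -f beam.smp'
}
```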

But to get at least something productive out of it, it could be worthwhile to look at the RabbitMQ logs from the same time window, if you still have them.

Comment 3 Hugh Brock 2016-04-14 16:03:09 UTC
Closing this NOTABUG.

Udi, a retest with a network partition that stays down, and then an attempted recovery as Eck describes above, could be really useful.