Bug 1122314

Summary: RabbitMQ clustering fails depending on which node has the VIP
Product: Red Hat OpenStack
Component: openstack-foreman-installer
Version: Foreman (RHEL 6)
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Target Milestone: ga
Target Release: Installer
Hardware: Unspecified
OS: Unspecified
Reporter: John Eckersberg <jeckersb>
Assignee: John Eckersberg <jeckersb>
QA Contact: Leonid Natapov <lnatapov>
CC: jguiditt, lnatapov, mburns, morazi, rhos-maint, yeylon
Fixed In Version: openstack-foreman-installer-2.0.17-1.el6ost
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-08-21 18:06:05 UTC

Description John Eckersberg 2014-07-22 22:35:30 UTC
I will defer to the comment I just stuck in the code to explain:

      # This is very subtle but important.  The node that is first in
      # lb_backend_server_names needs to come up first.  The names
      # array and the addrs array are ordered the same, e.g. names[i]
      # is the same host as addrs[i] for all i.  So the IP we pull off
      # the front of addrs will be on the first host in names.  This
      # matters because the names array is what generates the
      # cluster_nodes value in the rabbitmq config.  When a node
      # starts the first time and it is configured to cluster, it
      # tries to join each node in cluster_nodes in succession.
      # Whichever node is first to start will try to join a cluster
      # with the others, time out against each, and then start a new
      # cluster with only itself as a member.  Each additional host to
      # start will then try each host in order until it gets to a node
      # which has already been started, and join the cluster.
      #
      # However, there is a problem if the first node to start is not
      # the first node in the list.  Suppose the third node in the
      # list starts first, and then the first two nodes in the list
      # start up in parallel.  The first node will attempt to cluster
      # with the second node (it realizes that the first node is
      # itself and skips it).  The second node tries to cluster with
      # the first node.  Because neither host has an initialized
      # cluster, the clustering operation will fail on both nodes.
      #
      # By forcing the first node in the config to come up first, the
      # others can be started in parallel and be guaranteed to join
      # the cluster via the first node and its running cluster.

Presently RabbitMQ starts first on whatever node has the VIP.  If that node is not the first in the cluster_nodes list, the problem described above occurs.  The fix is to change the logic so that the service starts on the first node in the list before the others.
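
To make the ordering problem concrete, here is a small Python sketch of the join behaviour as the code comment describes it. It is a toy model, not RabbitMQ's actual implementation. The function name simulate, the node names n1/n2/n3, and the idea of grouping parallel starts into "waves" are all assumptions made for illustration. A peer that has not started is treated as a timeout, and a peer starting in the same wave is treated as a failed join because it has no initialized cluster yet.

```python
from typing import Dict, List, Set


def simulate(cluster_nodes: List[str], start_waves: List[List[str]]) -> Dict[str, str]:
    """Toy model of the clustering behaviour described in the comment.

    cluster_nodes is the ordered list from the config.  start_waves lists
    which nodes start together; nodes in the same wave start in parallel.
    Returns a per-node outcome string.
    """
    running: Set[str] = set()  # nodes with an initialized cluster
    outcome: Dict[str, str] = {}

    for wave in start_waves:
        starting = set(wave)
        wave_outcome: Dict[str, str] = {}
        for node in wave:
            result = "new cluster"  # default if every peer times out
            for peer in cluster_nodes:
                if peer == node:
                    continue  # a node skips itself in the list
                if peer in running:
                    result = "joined " + peer
                    break
                if peer in starting:
                    # Peer is up but has no initialized cluster yet.
                    result = "failed"
                    break
                # Otherwise the peer is down: time out and try the next one.
            wave_outcome[node] = result
        # Only nodes that formed or joined a cluster count as running.
        running.update(n for n, r in wave_outcome.items() if r != "failed")
        outcome.update(wave_outcome)
    return outcome


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3"]
    # Third node in the list starts first, then the first two in parallel.
    print(simulate(nodes, [["n3"], ["n1", "n2"]]))
    # First node in the list starts first, then the others in parallel.
    print(simulate(nodes, [["n1"], ["n2", "n3"]]))
```

In this model the first scenario leaves n1 and n2 failed, while the second has both remaining nodes join via n1, which is the ordering the fix enforces.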

Comment 2 John Eckersberg 2014-07-23 01:10:58 UTC
https://github.com/redhat-openstack/astapor/pull/326

Comment 7 Leonid Natapov 2014-08-18 10:53:09 UTC
openstack-foreman-installer-2.0.20-1.el6ost

Verified according to this:


    Review the puppet logs for the controllers. One host will not run the exec for i-am-first-rabbitmq-node-OR-rabbitmq-is-up-on-first-node. The other two will. Note which one does not.

    Look at the cluster_nodes list in /etc/rabbitmq/rabbitmq.config, and note the first node in the list. This should match the node above.
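
The second verification step can be scripted. The following Python sketch pulls the first entry out of the cluster_nodes list so it can be compared with the host noted from the puppet logs. It assumes the node names appear as single-quoted atoms (for example 'rabbit@host') after the cluster_nodes key. The helper name first_cluster_node and the sample host names are hypothetical and do not come from this bug.

```python
import re
from typing import Optional


def first_cluster_node(config_text: str) -> Optional[str]:
    """Return the first single-quoted node name after cluster_nodes, or None.

    Assumes node names are written as quoted Erlang atoms; unquoted
    names are not handled by this sketch.
    """
    _, found, rest = config_text.partition("cluster_nodes")
    if not found:
        return None
    match = re.search(r"'([^']+)'", rest)
    return match.group(1) if match else None


# Hypothetical sample; the host names are made up for illustration.
SAMPLE = """
[
  {rabbit, [
    {cluster_nodes, {['rabbit@ctrl1', 'rabbit@ctrl2', 'rabbit@ctrl3'], disc}}
  ]}
].
"""

if __name__ == "__main__":
    print(first_cluster_node(SAMPLE))
```

To check a real controller, read /etc/rabbitmq/rabbitmq.config into a string and pass it to the function instead of SAMPLE.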

Comment 8 John Eckersberg 2014-08-19 20:26:04 UTC
*** Bug 1120288 has been marked as a duplicate of this bug. ***

Comment 9 errata-xmlrpc 2014-08-21 18:06:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1090.html