Bug 1184280

Summary: rabbitmq cluster resource-agent for reliable bootstrap/recovery
Product: Red Hat Enterprise Linux 7
Reporter: Crag Wolfe <cwolfe>
Component: resource-agents
Assignee: Fabio Massimo Di Nitto <fdinitto>
Status: CLOSED ERRATA
QA Contact: Leonid Natapov <lnatapov>
Severity: urgent
Priority: urgent
Version: 7.0
CC: agk, cfeist, cluster-maint, djansa, fdinitto, jeckersb, jguiditt, jherrman, mburns, mnovacek, morazi, nbarcet, oalbrigt, oblaut, ohochman, rhos-maint, sasha, tlavigne, ushkalim, yeylon
Target Milestone: rc
Keywords: ZStream
Target Release: 7.2
Hardware: x86_64
OS: Linux
Fixed In Version: resource-agents-3.9.5-43.el7
Doc Type: Enhancement
Doc Text: This update introduces the rabbitmq-cluster resource agent for managing clustered RabbitMQ instances with the Pacemaker cluster manager.
Clone Of: 1168755
Last Closed: 2015-11-19 04:41:29 UTC
Type: Bug
Bug Depends On: 1185444, 1185907, 1185909    
Bug Blocks: 1168755, 1177026, 1185753, 1185754    

Comment 1 David Vossel 2015-01-23 18:13:38 UTC
I created a rabbitmq resource agent that reliably bootstraps the rabbitmq cluster. It also reliably recovers rabbitmq instances within the cluster.

https://github.com/davidvossel/resource-agents/blob/rabbitmq-cluster/heartbeat/rabbitmq-cluster

Below is a scenario file that outlines how to use the agent.

https://github.com/davidvossel/phd/blob/master/scenarios/rabbitmq-cluster.scenario

The agent must be used as a clone resource with the ordered=true meta attribute set. Also, the agent requires that cluster_nodes NOT be set in the /etc/rabbitmq/rabbitmq.config file: the agent determines how to bootstrap and join the cluster from the nodes it is actually running on, and explicitly setting this list will cause problems.
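
For illustration, the pcs configuration would look roughly like the following. This is only a sketch: it assumes the agent exposes a set_policy parameter and that pcs accepts the clone meta attribute after --clone; check the scenario file above for the exact commands, and treat the policy value as an example only.

# pcs resource create rabbitmq rabbitmq-cluster \
      set_policy='ha-all ^(?!amq\.).* {"ha-mode":"all"}' \
      --clone ordered=true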

There is currently an SELinux permissions problem that needs to be resolved before this agent can be used in enforcing mode. I am creating an issue to track the SELinux policy update.

-- David

Comment 2 David Vossel 2015-01-23 19:05:57 UTC
I want the agent's logic reviewed by someone who is more familiar with rabbitmq-server's usage in OSP than I am.  This agent is a hammer: when things don't work, it uses the brute-force technique of clearing the /var/lib/rabbitmq/mnesia directory and trying again.

There are two startup cases we need to be aware of.

1. Bootstrap. No other rabbitmq-server instances are up yet.

- the agent wipes /var/lib/rabbitmq/mnesia
- starts rabbitmq-server and does not join any other nodes.
- sets the ha policy 'rabbitmqctl set_policy ....'

2. Joining an existing rabbitmq cluster. This happens when the agent detects that other rabbitmq-server instances already exist elsewhere in the cluster.

- First the agent attempts to start rabbitmq-server and join the existing instances in the cluster.
- If that fails, we use the hammer and wipe /var/lib/rabbitmq/mnesia. Then we attempt the startup/join again.
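
Putting the two cases together, the start flow is roughly the following sketch (the helper names are hypothetical and the policy arguments are only an example; this is not the agent's actual code):

  MNESIA_DIR=/var/lib/rabbitmq/mnesia

  if no_other_rabbitmq_instances_detected; then
      # case 1: bootstrap -- wipe local state and start a fresh cluster
      rm -rf "$MNESIA_DIR"
      start_rabbitmq_standalone || exit 1
      rabbitmqctl set_policy ha-all "" '{"ha-mode":"all"}'
  else
      # case 2: join the existing cluster; wipe local state and retry on failure
      if ! start_rabbitmq_and_join_peers; then
          rm -rf "$MNESIA_DIR"
          start_rabbitmq_and_join_peers || exit 1
      fi
  fi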



The assumption here is that as long as a single rabbitmq-server instance remains up, data should stay consistent even if we have to re-initialize another instance so that it can successfully join the cluster.

If, however, all the rabbitmq-server instances are down and the agent has to bootstrap the rabbitmq cluster from scratch, messages will be lost.

Are we comfortable with this behavior?

-- David

Comment 3 John Eckersberg 2015-01-23 19:35:54 UTC
+1, that looks good to me.

One thing to note for the future, RabbitMQ >= 3.4.0 has a force_boot option which supposedly can be used when all of the nodes go down and you need to force one of them back on.  That's a much less "big hammer" recovery scenario.  Unfortunately, for now we're stuck on RabbitMQ 3.3.x so it doesn't do us much good today.
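
For reference, on RabbitMQ >= 3.4.0 that recovery would look roughly like this, run on whichever node is chosen to come back first while its rabbitmq-server is stopped (illustrative only, since we are on 3.3.x today):

# rabbitmqctl force_boot        (mark this node to boot without waiting for its peers)

then start rabbitmq-server again on that node, e.g. via the resource agent.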

Comment 4 David Vossel 2015-01-23 19:46:55 UTC
(In reply to John Eckersberg from comment #3)
> +1, that looks good to me.

excellent

> One thing to note for the future, RabbitMQ >= 3.4.0 has a force_boot option

I noticed that as well. In practice, I've read that people weren't having
much luck with force_boot. In the future we should definitely give it a
go, though.

> which supposedly can be used when all of the nodes go down and you need to
> force one of them back on.  That's a much less "big hammer" recovery
> scenario.  Unfortunately, for now we're stuck on RabbitMQ 3.3.x so it
> doesn't do us much good today.

Comment 10 michal novacek 2015-08-12 12:53:08 UTC
I have verified that the rabbitmq-cluster resource agent included in
resource-agents-3.9.5-50.el7.x86_64 is the one mentioned in comment #1.

--

[root@virt-150 ~]# wget -q -O rabbitmq-cluster.new \
https://raw.githubusercontent.com/davidvossel/resource-agents/rabbitmq-cluster/heartbeat/rabbitmq-cluster

[root@virt-150 ~]# diff rabbitmq-cluster.new /usr/lib/ocf/resource.d/heartbeat/rabbitmq-cluster

[root@virt-150 ~]# echo $?
0

Comment 13 Udi Shkalim 2015-10-08 15:24:43 UTC
Verified on resource-agents-3.9.5-54.el7.x86_64

RA:
Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]


# rabbitmqctl status
Status of node 'rabbit@overcloud-controller-2' ...
[{pid,16580},
 {running_applications,[{rabbit,"RabbitMQ","3.3.5"},
                        {os_mon,"CPO  CXC 138 46","2.2.14"},
                        {mnesia,"MNESIA  CXC 138 12","4.11"},
                        {xmerl,"XML parser","1.3.6"},
                        {sasl,"SASL  CXC 138 11","2.3.4"},
                        {stdlib,"ERTS  CXC 138 10","1.19.4"},
                        {kernel,"ERTS  CXC 138 10","2.16.4"}]},
 {os,{unix,linux}},
 {erlang_version,"Erlang R16B03-1 (erts-5.10.4) [source] [64-bit] [smp:12:12] [async-threads:30] [hipe] [kernel-poll:true]\n"},
 {memory,[{total,97288440},
          {connection_procs,4795688},
          {queue_procs,7458048},
          {plugins,0},
          {other_proc,14209672},
          {mnesia,1354472},
          {mgmt_db,0},
          {msg_index,294072},
          {other_ets,2382768},
          {binary,42621456},
          {code,16707845},
          {atom,891833},
          {other_system,6572586}]},
 {alarms,[]},
 {listeners,[{clustering,35672,"::"},{amqp,5672,"172.17.0.13"}]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,13423481651},
 {disk_free_limit,50000000},
 {disk_free,470248812544},
 {file_descriptors,[{total_limit,3996},
                    {total_used,112},
                    {sockets_limit,3594},
                    {sockets_used,110}]},
 {processes,[{limit,1048576},{used,1983}]},
 {run_queue,0},
 {uptime,31522}]
...done.

Comment 15 errata-xmlrpc 2015-11-19 04:41:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2190.html