Bug 1184280 - rabbitmq cluster resource-agent for reliable bootstrap/recovery
Summary: rabbitmq cluster resource-agent for reliable bootstrap/recovery
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: resource-agents
Version: 7.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 7.2
Assignee: Fabio Massimo Di Nitto
QA Contact: Leonid Natapov
URL:
Whiteboard:
Depends On: 1185444 1185907 1185909
Blocks: 1168755 1177026 1185753 1185754
 
Reported: 2015-01-20 22:52 UTC by Crag Wolfe
Modified: 2023-02-22 23:02 UTC
CC List: 20 users

Fixed In Version: resource-agents-3.9.5-43.el7
Doc Type: Enhancement
Doc Text:
This update introduces the rabbitmq-cluster resource agent for managing clustered RabbitMQ instances with the Pacemaker cluster manager.
Clone Of: 1168755
Clones: 1185444 1185753 1185754
Environment:
Last Closed: 2015-11-19 04:41:29 UTC
Target Upstream Version:
Embargoed:




Links
System: Red Hat Product Errata
ID: RHBA-2015:2190
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: resource-agents bug fix and enhancement update
Last Updated: 2015-11-19 08:06:48 UTC

Comment 1 David Vossel 2015-01-23 18:13:38 UTC
I created a rabbitmq resource agent that reliably bootstraps the rabbitmq cluster. It also reliably recovers rabbitmq instances within the cluster.

https://github.com/davidvossel/resource-agents/blob/rabbitmq-cluster/heartbeat/rabbitmq-cluster

Below is a scenario file that outlines how to use the agent.

https://github.com/davidvossel/phd/blob/master/scenarios/rabbitmq-cluster.scenario

The agent must be used as a clone resource with the ordered=true attribute set. Also, cluster_nodes must NOT be set in the /etc/rabbitmq/rabbitmq.config file. The agent knows how to join and bootstrap the cluster based on which nodes it is actually running on; explicitly setting this list will cause problems.
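
For illustration, a minimal pcs sketch of that setup (the resource name "rabbitmq" is an assumption for this example; the scenario file above is the authoritative reference):

# Hedged example only; 'rabbitmq' is an illustrative resource name and the
# scenario file linked above remains authoritative. Do NOT set cluster_nodes
# in /etc/rabbitmq/rabbitmq.config; the agent manages membership itself.
pcs resource create rabbitmq rabbitmq-cluster --clone ordered=true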

There is currently an selinux permissions problem that needs to be resolved before this agent can be used in enforcing mode. I am creating an issue to track the selinux policy update.

-- David

Comment 2 David Vossel 2015-01-23 19:05:57 UTC
I want the agent's logic reviewed by someone who is more familiar with rabbitmq-server's usage in OSP than I am.  This agent is a hammer: when things don't work, the agent uses a brute-force technique of clearing the /var/log/rabbitmq/mnesia directory and trying again.

There are two startup cases we need to be aware of; a rough shell sketch of both follows below.

1. Bootstrap. No other rabbitmq-server instances are up yet.

- the agent wipes /var/log/rabbitmq/mnesia
- starts rabbitmq-server and does not join any other nodes.
- sets the ha policy 'rabbitmqctl set_policy ....'

2. Joining existing rabbitmq cluster. This happens when the agent detects other rabbitmq-server instances exist elsewhere in the cluster.

- First the agent attempts to start rabbitmq-server and join the existing instances in the cluster.
- If that fails, we use the hammer and wipe /var/log/rabbitmq/mnesia. Then we attempt the startup/join again.
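
A rough shell sketch of the two paths, for review purposes only. The peer name, the exact commands, and the policy string are assumptions; the agent linked in comment 1 is the authoritative implementation.

#!/bin/sh
# Hedged sketch of the two startup paths above; the peer name, paths, and the
# set_policy arguments are illustrative, the agent itself is authoritative.

PEER=overcloud-controller-0   # hypothetical name of a node already running rabbitmq

bootstrap() {
    # Case 1: no other rabbitmq-server instance is up yet.
    rm -rf /var/log/rabbitmq/mnesia
    rabbitmq-server -detached
    # Illustrative HA policy; the exact set_policy arguments are elided above.
    rabbitmqctl set_policy ha-all "^(?!amq\.).*" '{"ha-mode":"all"}'
}

join_existing() {
    # Case 2: start and join the instances that are already up.
    rabbitmq-server -detached &&
    rabbitmqctl stop_app &&
    rabbitmqctl join_cluster "rabbit@$PEER" &&
    rabbitmqctl start_app
}

# Case 2 recovery ("the hammer"): if the join fails, wipe local state and retry.
if ! join_existing; then
    rabbitmqctl stop
    rm -rf /var/log/rabbitmq/mnesia
    join_existing
fi
# (In case 1 the agent would take the bootstrap() path instead.)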



The assumption here is that as long as at least one rabbitmq-server instance remains up, data should stay consistent even if we have to re-initialize a single instance for it to successfully join the cluster.

If, however, all rabbitmq-server instances are down and the agent has to bootstrap the rabbitmq cluster, messages will be lost.

Are we comfortable with this behavior?

-- David

Comment 3 John Eckersberg 2015-01-23 19:35:54 UTC
+1, that looks good to me.

One thing to note for the future: RabbitMQ >= 3.4.0 has a force_boot option, which supposedly can be used when all of the nodes go down and you need to force one of them back on.  That's a much less "big hammer" recovery scenario.  Unfortunately, for now we're stuck on RabbitMQ 3.3.x so it doesn't do us much good today.
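
For future reference, on a 3.4.0+ node the recovery would look roughly like this (hedged example; not applicable to the 3.3.x shipped here):

# RabbitMQ 3.4.0+ only; shown for future reference, not usable on 3.3.x.
rabbitmqctl force_boot        # tell the node to boot without waiting for its last-seen peers
rabbitmq-server -detached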

Comment 4 David Vossel 2015-01-23 19:46:55 UTC
(In reply to John Eckersberg from comment #3)
> +1, that looks good to me.

excellent

> One thing to note for the future, RabbitMQ >= 3.4.0 has a force_boot option

I noticed that as well. In practice, I read people weren't having much
luck with force_boot. In the future we should definitely give it a go
though.

> which supposedly can be used when all of the nodes go down and you need to
> force one of them back on.  That's a much less "big hammer" recovery
> scenario.  Unfortunately, for now we're stuck on RabbitMQ 3.3.x so it
> doesn't do us much good today.

Comment 10 michal novacek 2015-08-12 12:53:08 UTC
I have verified that the rabbitmq-cluster resource agent included in
resource-agents-3.9.5-50.el7.x86_64 is the one mentioned in comment #1.

--

[root@virt-150 ~]# wget -q -O rabbitmq-cluster.new \
https://raw.githubusercontent.com/davidvossel/resource-agents/rabbitmq-cluster/heartbeat/rabbitmq-cluster

[root@virt-150 ~]# diff rabbitmq-cluster.new /usr/lib/ocf/resource.d/heartbeat/rabbitmq-cluster

[root@virt-150 ~]# echo $?
0

Comment 13 Udi Shkalim 2015-10-08 15:24:43 UTC
Verified on resource-agents-3.9.5-54.el7.x86_64

RA:
Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]


# rabbitmqctl status
Status of node 'rabbit@overcloud-controller-2' ...
[{pid,16580},
 {running_applications,[{rabbit,"RabbitMQ","3.3.5"},
                        {os_mon,"CPO  CXC 138 46","2.2.14"},
                        {mnesia,"MNESIA  CXC 138 12","4.11"},
                        {xmerl,"XML parser","1.3.6"},
                        {sasl,"SASL  CXC 138 11","2.3.4"},
                        {stdlib,"ERTS  CXC 138 10","1.19.4"},
                        {kernel,"ERTS  CXC 138 10","2.16.4"}]},
 {os,{unix,linux}},
 {erlang_version,"Erlang R16B03-1 (erts-5.10.4) [source] [64-bit] [smp:12:12] [async-threads:30] [hipe] [kernel-poll:true]\n"},
 {memory,[{total,97288440},
          {connection_procs,4795688},
          {queue_procs,7458048},
          {plugins,0},
          {other_proc,14209672},
          {mnesia,1354472},
          {mgmt_db,0},
          {msg_index,294072},
          {other_ets,2382768},
          {binary,42621456},
          {code,16707845},
          {atom,891833},
          {other_system,6572586}]},
 {alarms,[]},
 {listeners,[{clustering,35672,"::"},{amqp,5672,"172.17.0.13"}]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,13423481651},
 {disk_free_limit,50000000},
 {disk_free,470248812544},
 {file_descriptors,[{total_limit,3996},
                    {total_used,112},
                    {sockets_limit,3594},
                    {sockets_used,110}]},
 {processes,[{limit,1048576},{used,1983}]},
 {run_queue,0},
 {uptime,31522}]
...done.

Comment 15 errata-xmlrpc 2015-11-19 04:41:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2190.html

