Bug 1184280
| Summary: | rabbitmq cluster resource-agent for reliable bootstrap/recovery |
|---|---|
| Product: | Red Hat Enterprise Linux 7 |
| Reporter: | Crag Wolfe <cwolfe> |
| Component: | resource-agents |
| Assignee: | Fabio Massimo Di Nitto <fdinitto> |
| Status: | CLOSED ERRATA |
| QA Contact: | Leonid Natapov <lnatapov> |
| Severity: | urgent |
| Docs Contact: | |
| Priority: | urgent |
| Version: | 7.0 |
| CC: | agk, cfeist, cluster-maint, djansa, fdinitto, jeckersb, jguiditt, jherrman, mburns, mnovacek, morazi, nbarcet, oalbrigt, oblaut, ohochman, rhos-maint, sasha, tlavigne, ushkalim, yeylon |
| Target Milestone: | rc |
| Keywords: | ZStream |
| Target Release: | 7.2 |
| Hardware: | x86_64 |
| OS: | Linux |
| Whiteboard: | |
| Fixed In Version: | resource-agents-3.9.5-43.el7 |
| Doc Type: | Enhancement |
| Doc Text: | This update introduces the rabbitmq-cluster resource agent for managing clustered RabbitMQ instances with the Pacemaker cluster manager. |
| Story Points: | --- |
| Clone Of: | 1168755 |
| Clones: | 1185444 1185753 1185754 (view as bug list) |
| Environment: | |
| Last Closed: | 2015-11-19 04:41:29 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | 1185444, 1185907, 1185909 |
| Bug Blocks: | 1168755, 1177026, 1185753, 1185754 |
Comment 1
David Vossel
2015-01-23 18:13:38 UTC
I want the agent's logic reviewed by someone who is more familiar with rabbitmq-server's usage in OSP than I am. This agent is a hammer: when things don't work, the agent uses a brute-force technique of clearing the /var/log/rabbitmq/mnesia directory and trying again.

There are two startup cases we need to be aware of:

1. Bootstrap. No other rabbitmq-server instances are up yet.
   - The agent wipes /var/log/rabbitmq/mnesia.
   - It starts rabbitmq-server and does not join any other nodes.
   - It sets the HA policy ('rabbitmqctl set_policy ....').

2. Joining an existing rabbitmq cluster. This happens when the agent detects other rabbitmq-server instances elsewhere in the cluster.
   - First the agent attempts to start rabbitmq-server and join the existing instances in the cluster.
   - If that fails, we use the hammer and wipe /var/log/rabbitmq/mnesia, then attempt the startup/join again.

The assumption here is that as long as a single rabbitmq-server instance exists, data should remain consistent even if we have to re-initialize a single instance to successfully join the cluster. If, however, all the rabbitmq-server instances are down and the agent has to bootstrap the rabbitmq cluster, messages will be lost.

Are we comfortable with this behavior?

-- David

Comment 3
John Eckersberg

+1, that looks good to me. One thing to note for the future: RabbitMQ >= 3.4.0 has a force_boot option which supposedly can be used when all of the nodes go down and you need to force one of them back on. That's a much less "big hammer" recovery scenario. Unfortunately, for now we're stuck on RabbitMQ 3.3.x, so it doesn't do us much good today.

David Vossel

(In reply to John Eckersberg from comment #3)
> +1, that looks good to me.

excellent

> One thing to note for the future, RabbitMQ >= 3.4.0 has a force_boot option

I noticed that as well. In practice, I read that people weren't having much luck with force_boot. In the future we should definitely give it a go, though.

> which supposedly can be used when all of the nodes go down and you need to
> force one of them back on. That's a much less "big hammer" recovery
> scenario. Unfortunately, for now we're stuck on RabbitMQ 3.3.x so it
> doesn't do us much good today.

I have verified that the rabbitmq-cluster resource agent included in resource-agents-3.9.5-50.el7.x86_64 is the one mentioned in comment #1.

--

[root@virt-150 ~]# wget -q -O rabbitmq-cluster.new \
  https://raw.githubusercontent.com/davidvossel/resource-agents/rabbitmq-cluster/heartbeat/rabbitmq-cluster
[root@virt-150 ~]# diff rabbitmq-cluster.new /usr/lib/ocf/resource.d/heartbeat/rabbitmq-cluster
[root@virt-150 ~]# echo $?
0

Verified on resource-agents-3.9.5-54.el7.x86_64
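For context, a clone set like the one verified below is typically created with a command along these lines. This is an illustrative sketch, not necessarily the exact command used on this overcloud; the set_policy value is only an example of the form such deployments use.

# Illustrative only; resource options on the verified deployment may differ.
pcs resource create rabbitmq ocf:heartbeat:rabbitmq-cluster \
    set_policy='ha-all ^(?!amq\.).* {"ha-mode":"all"}' \
    --clone ordered=true interleave=true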
RA:
Clone Set: rabbitmq-clone [rabbitmq]
Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
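The clone is started on all three controllers. As an additional cross-check (not part of the captured output), RabbitMQ-side membership can be confirmed on any controller; all three nodes should be listed under running_nodes:

# All three overcloud controllers should appear in running_nodes.
rabbitmqctl cluster_status

The rabbitmqctl status output from one of the controllers follows.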
# rabbitmqctl status
Status of node 'rabbit@overcloud-controller-2' ...
[{pid,16580},
{running_applications,[{rabbit,"RabbitMQ","3.3.5"},
{os_mon,"CPO CXC 138 46","2.2.14"},
{mnesia,"MNESIA CXC 138 12","4.11"},
{xmerl,"XML parser","1.3.6"},
{sasl,"SASL CXC 138 11","2.3.4"},
{stdlib,"ERTS CXC 138 10","1.19.4"},
{kernel,"ERTS CXC 138 10","2.16.4"}]},
{os,{unix,linux}},
{erlang_version,"Erlang R16B03-1 (erts-5.10.4) [source] [64-bit] [smp:12:12] [async-threads:30] [hipe] [kernel-poll:true]\n"},
{memory,[{total,97288440},
{connection_procs,4795688},
{queue_procs,7458048},
{plugins,0},
{other_proc,14209672},
{mnesia,1354472},
{mgmt_db,0},
{msg_index,294072},
{other_ets,2382768},
{binary,42621456},
{code,16707845},
{atom,891833},
{other_system,6572586}]},
{alarms,[]},
{listeners,[{clustering,35672,"::"},{amqp,5672,"172.17.0.13"}]},
{vm_memory_high_watermark,0.4},
{vm_memory_limit,13423481651},
{disk_free_limit,50000000},
{disk_free,470248812544},
{file_descriptors,[{total_limit,3996},
{total_used,112},
{sockets_limit,3594},
{sockets_used,110}]},
{processes,[{limit,1048576},{used,1983}]},
{run_queue,0},
{uptime,31522}]
...done.
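For illustration, here is a minimal shell sketch of the two startup cases David describes in comment 1. It is not the agent's actual code: the wiped directory, the peer argument, and the HA policy string are assumptions, and the real agent additionally handles OCF return codes, peer discovery from cluster membership, and timeouts.

#!/bin/sh
# Rough sketch of the bootstrap/join logic from comment 1 (not the real agent).
MNESIA_DIR=/var/lib/rabbitmq/mnesia   # assumption; use whatever data dir the installation wipes

wipe_data() {
    rm -rf "$MNESIA_DIR"/*
}

bootstrap_cluster() {
    # Case 1: no other rabbitmq-server instance is running anywhere.
    wipe_data
    rabbitmq-server -detached
    # Example HA policy; the actual policy string is configurable.
    rabbitmqctl set_policy ha-all "" '{"ha-mode":"all"}'
}

join_cluster() {
    # Case 2: other instances already exist; try to join one of them.
    peer="$1"    # e.g. rabbit@overcloud-controller-0
    rabbitmq-server -detached
    rabbitmqctl stop_app
    if ! rabbitmqctl join_cluster "$peer"; then
        # The "hammer": wipe local state and retry the join once.
        rabbitmqctl stop
        wipe_data
        rabbitmq-server -detached
        rabbitmqctl stop_app
        rabbitmqctl join_cluster "$peer"
    fi
    rabbitmqctl start_app
}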
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2190.html