Bug 1338657
Summary: | RabbitMQ resources fail to start on 3 controllers nodes with pacemaker deployment | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Marius Cornea <mcornea> |
Component: | rabbitmq-server | Assignee: | Peter Lemenkov <plemenko> |
Status: | CLOSED ERRATA | QA Contact: | Udi Shkalim <ushkalim> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 9.0 (Mitaka) | CC: | apevec, dbecker, jeckersb, jjoyce, lhh, mburns, morazi, ohochman, plemenko, rhel-osp-director-maint, royoung, rscarazz, sasha, srevivo, tvignaud |
Target Milestone: | ga | Keywords: | Automation, AutomationBlocker |
Target Release: | 9.0 (Mitaka) | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | rabbitmq-server-3.6.2-2.el7ost | Doc Type: | If docs needed, set a value |
Last Closed: | 2016-08-11 12:22:23 UTC | Type: | Bug |
Description
Marius Cornea
2016-05-23 08:24:45 UTC
I managed to get it started after running 'pcs resource debug-start rabbitmq' on each of the controller nodes and then 'pcs resource cleanup rabbitmq':

    [root@overcloud-controller-0 ~]# pcs resource debug-start rabbitmq
    Operation start for rabbitmq:0 (ocf:heartbeat:rabbitmq-cluster) returned 0
     > stdout: Waiting for 'rabbit@overcloud-controller-0' ...
     > stdout: pid is 27210 ...
     > stderr: ERROR: Unexpected return code from '/usr/sbin/rabbitmqctl cluster status' exit code: 69
     > stderr: INFO: Bootstrapping rabbitmq cluster
     > stderr: INFO: Waiting for server to start
     > stderr: DEBUG: RabbitMQ server is running normally
     > stderr: INFO: cluster bootstrapped
     > stderr: INFO: Policy set: ha-all ^(?!amq\.).* {"ha-mode":"all"}
     > stderr: DEBUG: rabbitmq:0 start : 0

Still, I wasn't able to proceed with the installation; the logs in /var/log/rabbitmq show errors like this one:

    =ERROR REPORT==== 23-May-2016::11:39:23 ===
    Error on AMQP connection <0.1000.0> (10.0.0.15:54125 -> 10.0.0.15:5672, state: starting):
    AMQPLAIN login refused: user 'guest' can only connect via localhost

This is how rabbitmq.config looks on one of the nodes:

    [root@overcloud-controller-0 heat-admin]# cat /etc/rabbitmq/rabbitmq.config
    % This file managed by Puppet
    % Template Path: rabbitmq/templates/rabbitmq.config
    [
      {rabbit, [
        {tcp_listen_options, [binary, {packet, raw}, {reuseaddr, true}, {backlog, 128}, {nodelay, true}, {exit_on_close, false}] },
        {cluster_partition_handling, pause_minority},
        {tcp_listen_options, [binary, {packet, raw}, {reuseaddr, true}, {backlog, 128}, {nodelay, true}, {exit_on_close, false}, {keepalive, true}]},
        {default_user, <<"guest">>},
        {default_pass, <<"zdbXdPQmW47Yfw9wewr3pJZjQ">>}
      ]},
      {kernel, [
        {inet_dist_listen_max, 35672},
        {inet_dist_listen_min, 35672}
      ]}
    , {rabbitmq_management, [
        {listener, [
          {port, 15672}
          ,{ip, "10.0.0.15"}
        ]}
      ]}
    ].
    % EOF

Current status: I've got a fix which changes the resource-agents script, but I'm not going to push it.
Instead, I'm working on a patch which changes rabbitmq-server only.

OK, I've found the root cause. Starting with version 3.6.x, RabbitMQ returns more error codes, and unfortunately some of the existing values were changed. I think it's safe to say that the API, or even the ABI, was changed. Here is a workaround for the current resource agent script:

https://github.com/lemenkov/resource-agents/commit/5bd3a0b

Meanwhile, expect a fixed RabbitMQ build (with the changes partially reverted) soon.

I've just made a build - please try:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11089199

OK, here is the news. This is still just a workaround, since it changes the API back to the old version by removing the fine-grained error reporting. This package returns just "2" as the error code instead of different codes that properly reflect the actual issue. Use this build for now, but I'm going to roll another one, coupled with a resource-agents build patched to use both the new error codes (from this build and up) and the old ones (ver. 3.3.5). There shouldn't be any visible change from the pacemaker user's point of view.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1597.html
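For readers following the exit-code discussion above, here is a minimal sketch of how a resource agent could accept both the old and the new rabbitmqctl exit codes. It is not the code from the linked commit. The helper name classify_rmq_rc is hypothetical, and treating both 2 (the single error code mentioned for the old behaviour) and 69 (the code seen in the debug-start output) as "not running" is an assumption made for illustration. The OCF values 0, 1 and 7 are the standard OCF_SUCCESS, OCF_ERR_GENERIC and OCF_NOT_RUNNING return codes.

```shell
#!/bin/sh
# Standard OCF return codes (normally provided by the ocf-shellfuncs library).
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# classify_rmq_rc (hypothetical helper): print the OCF return code that a
# monitor action could report for a given rabbitmqctl exit status.
classify_rmq_rc() {
    case "$1" in
        0)
            # rabbitmqctl succeeded: the node is up.
            echo "$OCF_SUCCESS" ;;
        2|69)
            # Assumption: 2 is the old generic failure code and 69 is the
            # code reported with 3.6.x in this bug; both are treated as
            # "server not running" instead of an unexpected error.
            echo "$OCF_NOT_RUNNING" ;;
        *)
            # Anything else stays an unexpected error.
            echo "$OCF_ERR_GENERIC" ;;
    esac
}

# Possible use inside a monitor function (left as a comment so the sketch
# does not require a RabbitMQ installation):
#   /usr/sbin/rabbitmqctl cluster_status >/dev/null 2>&1
#   exit "$(classify_rmq_rc "$?")"
```

Keeping the mapping in one case statement means further 3.6.x codes can be added to the same branch once their meaning is confirmed, without touching the callers.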