Description of problem:
RabbitMQ resources fail to start in an HA IPv6 deployment, so the overcloud deployment fails.

Version-Release number of selected component (if applicable):
rabbitmq-server-3.6.2-3.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-11.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy the overcloud with IPv6 isolated networks

Actual results:
Deployment fails:

 Clone Set: rabbitmq-clone [rabbitmq]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Failed Actions:
* rabbitmq_start_0 on overcloud-controller-1 'unknown error' (1): call=86, status=complete, exitreason='none',
    last-rc-change='Fri Jun 17 16:29:14 2016', queued=0ms, exec=1274ms
* rabbitmq_start_0 on overcloud-controller-2 'unknown error' (1): call=92, status=complete, exitreason='none',
    last-rc-change='Fri Jun 17 16:29:17 2016', queued=0ms, exec=1393ms
* rabbitmq_start_0 on overcloud-controller-0 'unknown error' (1): call=91, status=complete, exitreason='none',
    last-rc-change='Fri Jun 17 16:29:20 2016', queued=1ms, exec=2410ms

Expected results:
RabbitMQ resources get started.

Additional info:
The server appears to be listening on port 5672:

ss -ln | grep 5672
tcp    LISTEN     0      128     fd00:fd00:fd00:2000::13:5672    :::*
tcp    LISTEN     0      128     :::35672                        :::*

There are some errors showing up in rabbit@overcloud-controller-0.log; attaching the log. In keystone.log I can see many AMQP connection errors. Attaching that log as well.
Created attachment 1169177 [details] rabbit
Created attachment 1169178 [details] keystone.log
Created attachment 1169179 [details] MnesiaCore.rabbit@overcloud-controller-0_1466_181123_372835
This is probably due to...

[root@overcloud-controller-0 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@overcloud-controller-0' ...
Error: operation cluster_status used with invalid parameter: []

The resource agent relies heavily on the cluster_status command, so if it's broken I'm not surprised that nothing works in HA.
Actually, scratch that. The reason cluster_status doesn't work is that the mnesia app isn't running, so it can't contact the database to retrieve the cluster status. And the reason mnesia isn't running can be seen in the rabbit log:

** Reason for termination ==
** {{unparseable,"/usr/bin/df: '/var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-0': No such file or directory\n"},835")

...

** FATAL ** {error,{"Cannot rename disk_log file",latest_log,
                    "/var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-0/PREVIOUS.LOG",
                    {log_header,trans_log,"4.3","4.13.4",
                                'rabbit@overcloud-controller-0',
                                {1466,181123,357657}},
                    {file_error,"/var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-0/PREVIOUS.LOG",
                                enoent}}}

And I confirmed that there is no /var/lib/rabbitmq/mnesia directory on the filesystem. So something (probably the resource agent) is erroneously wiping that directory while the service is running.
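For anyone hitting the same state, a quick way to confirm it on a controller (a sketch, not commands taken from this comment; the path comes from the log above and rabbitmqctl eval is just one way to check whether mnesia is running):

# Hypothetical diagnostic sketch.
# Check whether the mnesia directory the broker expects actually exists:
ls -ld /var/lib/rabbitmq/mnesia/rabbit@$(hostname -s)

# Ask the running node whether the mnesia application is up; it should appear
# in the list returned by application:which_applications/0.
rabbitmqctl eval 'application:which_applications().' | grep mnesia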
I'm guessing this has something to do with:

http://pkgs.devel.redhat.com/cgit/rpms/rabbitmq-server/commit/?h=rhos-9.0-rhel-7&id=76db4bc4dd7312949c8c0415305c53822e15cd4c

Maybe the exit codes are still different from what the resource agent expects?
First thing - the network on that cluster looks unreliable. I see constant SSH freezes.
(In reply to John Eckersberg from comment #8) > I'm guessing this has something to do with: > > http://pkgs.devel.redhat.com/cgit/rpms/rabbitmq-server/commit/?h=rhos-9.0- > rhel-7&id=76db4bc4dd7312949c8c0415305c53822e15cd4c > > Maybe the exit codes are still different from what the resource agent > expects? Nope, I don't think so.
I looked at this a bit more on a different system today. It looks like one or more of the IPv6 patches were lost and/or broken during the rebase to Erlang 18. At the very least, the epmd client only tries to connect via IPv4, which fails, so pretty much nothing works with clustering.
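A couple of quick checks that make this visible from the shell (a sketch; these are stock tools, not commands from this comment):

# Hypothetical checks -- assumptions for illustration, not from the original report.
# See which address families EPMD is listening on (port 4369):
ss -lnt | grep 4369

# Ask the local EPMD which node names it has registered; if the rabbit node is
# missing here while the beam process is running, node registration is failing.
epmd -names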
*** Bug 1347432 has been marked as a duplicate of this bug. ***
Ok, I've finally found what's going on there.

0. You need to change the RabbitMQ config if running in an IPv6-only network.

1. We need to fix Erlang/OTP to work properly on IPv6 networks. A patch is available, so expect a new build soon:
   https://github.com/fedora-erlang/erlang/commit/6dc7add

2. We need to fix RabbitMQ to respect "proto_dist" as well.

Also I believe we should prevent further "IPv6-only environment" failures and cherry-pick this patch to allow skipping IPv4 localhost entirely:

* https://github.com/erlang/otp/pull/1075

My ETA for all that is July 18 or 19.
Almost forgot one more thing:

4. We need to use a more IP-protocol-agnostic EPMD service dependency than just epmd.0.0.socket. Something like epmd@::.socket?
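For illustration only, a sketch of how such a template unit could be parameterized so that the instance name carries the listen address (this is an assumption about the packaging, not the shipped unit file):

# Hypothetical epmd@.socket template -- an assumption, shown only to illustrate
# the idea above. "epmd@::.socket" would expand %i to "::", so EPMD gets
# socket-activated on the IPv6 wildcard address.
[Unit]
Description=Erlang Port Mapper Daemon Activation Socket (%i)

[Socket]
ListenStream=[%i]:4369
Accept=false

[Install]
WantedBy=sockets.target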
Status update. Regarding task no. 2 (see the list of tasks above) - a patch is available:

https://github.com/lemenkov/rabbitmq-common/commit/ee36f22

I haven't looked into task no. 3 yet.
The issues here are listed as 0, 1, 2, 4 - what is issue 3? Please confirm for us to review. - IPv6 team member
(In reply to rprakash from comment #19)
> The issues here are listed as 0, 1, 2, 4 - what is issue 3?
> Please confirm for us to review. - IPv6 team member

Forgot to number it - this is issue no. 3:

cherry-pick this patch to allow skipping IPv4 localhost entirely:
* https://github.com/erlang/otp/pull/1075

Status report. I've managed to get this working in various IPv6 configurations. You need to do the following things:

0. You need to create an /etc/rabbitmq/rabbitmq-env.conf file with the following contents:

SERVER_ERL_ARGS="-proto_dist inet6_tcp +P 1048576"
CTL_ERL_ARGS=${SERVER_ERL_ARGS}

1. Upgrade Erlang to erlang-18.3.4.1-1.el7ost
2. Upgrade RabbitMQ to rabbitmq-server-3.6.3-4.el7ost

This will allow you to run RabbitMQ under pacemaker. Please test and report any issues. I'm working on fixing the systemd service file.
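A quick sanity check after applying those steps (a sketch; these are standard rabbitmqctl invocations, not commands given in this comment):

# Hypothetical verification -- an assumption about how to confirm the change took effect.
# Confirm the node was started with the IPv6 distribution module:
rabbitmqctl eval 'init:get_argument(proto_dist).'
# Expected output along the lines of: {ok,[["inet6_tcp"]]}

# Confirm the broker reports its listeners (AMQP on 5672 plus the clustering/distribution listener):
rabbitmqctl status | grep -A3 listeners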
(In reply to Peter Lemenkov from comment #20) > 0. You need to create /etc/rabbitmq/rabbitmq-env.conf file with the > following contents: > > SERVER_ERL_ARGS="-proto_dist inet6_tcp +P 1048576" > CTL_ERL_ARGS=${SERVER_ERL_ARGS} Should we clone this BZ to implement this configuration change via OSPd?
(In reply to Marius Cornea from comment #21)
> (In reply to Peter Lemenkov from comment #20)
> > 0. You need to create /etc/rabbitmq/rabbitmq-env.conf file with the
> > following contents:
> > 
> > SERVER_ERL_ARGS="-proto_dist inet6_tcp +P 1048576"
> > CTL_ERL_ARGS=${SERVER_ERL_ARGS}
> 
> Should we clone this BZ to implement this configuration change via OSPd?

Yes, we should do that. Something like the pseudocode below should be added to the configurator application (Director?); a concrete shell sketch follows this comment:

if CLUSTERING_OVER_IPV4; then
    SERVER_ERL_ARGS="-proto_dist inet_tcp +P 1048576"
else  # CLUSTERING_OVER_IPV6
    SERVER_ERL_ARGS="-proto_dist inet6_tcp +P 1048576"
fi
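A minimal shell sketch of that idea, assuming the deployment tool knows the RabbitMQ bind address (the RABBITMQ_BIND_IP variable and the colon-based IPv6 detection are assumptions for illustration, not part of OSPd):

#!/bin/bash
# Hypothetical sketch: write /etc/rabbitmq/rabbitmq-env.conf based on whether
# the clustering address is IPv6. RABBITMQ_BIND_IP is an assumed input.
RABBITMQ_BIND_IP="${RABBITMQ_BIND_IP:-fd00:fd00:fd00:2000::13}"

if [[ "$RABBITMQ_BIND_IP" == *:* ]]; then
    proto_dist="inet6_tcp"   # clustering over IPv6
else
    proto_dist="inet_tcp"    # clustering over IPv4
fi

cat > /etc/rabbitmq/rabbitmq-env.conf <<EOF
SERVER_ERL_ARGS="-proto_dist ${proto_dist} +P 1048576"
CTL_ERL_ARGS=\${SERVER_ERL_ARGS}
EOF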
Btw, I'm curious - is it possible that even the IPv4 localhost (127.x.x.x) won't be available for RabbitMQ? In other words, should we consider the case of a fully IPv6-only environment as well?
I don't think we can do that at this point, since there is at least one NIC configured with an IPv4 address - the one on the OSPd provisioning network.
Is this issue fixed? When will the fix be picked up in the official Apex build (or Mitaka build)? Thank you.
Created attachment 1183743 [details]
rabbit@overcloud-controller-0.log

My deployment is still failing - from what I can see in the log, in the same way as initially reported:

[root@overcloud-controller-0 heat-admin]# rpm -qa | grep rabbitmq
rabbitmq-server-3.6.3-4.el7ost.noarch

 Clone Set: rabbitmq-clone [rabbitmq]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Failed Actions:
* rabbitmq_start_0 on overcloud-controller-2 'unknown error' (1): call=86, status=complete, exitreason='none',
    last-rc-change='Mon Jul 25 11:08:04 2016', queued=0ms, exec=2713ms
* rabbitmq_start_0 on overcloud-controller-0 'unknown error' (1): call=92, status=complete, exitreason='none',
    last-rc-change='Mon Jul 25 11:08:09 2016', queued=0ms, exec=2552ms
* rabbitmq_start_0 on overcloud-controller-1 'unknown error' (1): call=91, status=complete, exitreason='none',
    last-rc-change='Mon Jul 25 11:08:14 2016', queued=0ms, exec=2887ms

Attaching /var/log/rabbitmq/rabbit@overcloud-controller-0.log.
It seems that this issue has nothing to do with either IPv6 or RabbitMQ. I was able to fix the cluster without changing any configuration, only using rabbitmqctl itself. The only change I made was upgrading to the 3.6.3-5 build, which fixed a nasty timeout during cluster node removal. See bug 1356169 for further details.

This (upgrading to 3.6.3-5) might be considered a fix for this issue as well, since it prevents pcs from marking perfectly healthy RabbitMQ nodes as failed.
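The exact rabbitmqctl steps weren't posted here; purely as an illustration of what "using rabbitmqctl itself" can involve, a typical manual re-join of a broken node looks roughly like this (an assumption, not the actual recovery performed):

# Hypothetical example -- the actual commands used were not posted in this bug.
rabbitmqctl stop_app
rabbitmqctl reset                                        # wipes local mnesia state
rabbitmqctl join_cluster rabbit@overcloud-controller-0   # re-join via a healthy node
rabbitmqctl start_app
rabbitmqctl cluster_status                               # verify all nodes are listed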
Can QA verify that upgrading to rabbitmq-server-3.6.3-5.el7ost fixes this issue?

Thanks
Bin
(In reply to Bin Hu from comment #30)
> Can QA verify that upgrading to rabbitmq-server-3.6.3-5.el7ost fixes this
> issue?
> 
> Thanks
> Bin

Bin, you need to be patient. There are multiple overlapping bugs at the moment that make verification of bugs like this one difficult. Our QE engineers will verify this specific bug once the others are fixed and the system is more stable.
I tested this with rabbitmq-server-3.6.3-5.el7ost.noarch manually installed on the overcloud image, and on a fresh deployment I couldn't reproduce the issue, so I think Peter is right.
(In reply to Marius Cornea from comment #32)
> I tested this with rabbitmq-server-3.6.3-5.el7ost.noarch manually installed
> on the overcloud image, and on a fresh deployment I couldn't reproduce the
> issue, so I think Peter is right.

Ok, I am moving this bug back to ON_QA; once we can verify it properly, after the other underlying bug is fixed, we can move it to VERIFIED.
OK, I updated the images with the latest core packages and the deployment completed successfully with no failed resources. Moving this to verified.

[root@overcloud-controller-0 heat-admin]# rpm -qa | grep rabbitmq
rabbitmq-server-3.6.3-5.el7ost.noarch
(In reply to Marius Cornea from comment #35) > OK, I updated the images with the latest core packages and the deployment > completed successfully with no failed resources. Moving this to verified. > > [root@overcloud-controller-0 heat-admin]# rpm -qa | grep rabbitmq > rabbitmq-server-3.6.3-5.el7ost.noarch Btw what's your resource-agents versions?
(In reply to Peter Lemenkov from comment #36) > (In reply to Marius Cornea from comment #35) > > OK, I updated the images with the latest core packages and the deployment > > completed successfully with no failed resources. Moving this to verified. > > > > [root@overcloud-controller-0 heat-admin]# rpm -qa | grep rabbitmq > > rabbitmq-server-3.6.3-5.el7ost.noarch > > Btw what's your resource-agents versions? resource-agents-3.9.5-54.el7_2.10.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1597.html