Bug 1347802 - RabbitMQ resources fail to start in HA IPv6 deployment
Summary: RabbitMQ resources fail to start in HA IPv6 deployment
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ga
Target Release: 9.0 (Mitaka)
Assignee: Peter Lemenkov
QA Contact: Marius Cornea
URL:
Whiteboard:
Duplicates: 1347432 (view as bug list)
Depends On:
Blocks: 1344405 1358311
 
Reported: 2016-06-17 16:46 UTC by Marius Cornea
Modified: 2016-11-10 14:12 UTC (History)
CC: 23 users

Fixed In Version: rabbitmq-server-3.6.3-4.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1358311 (view as bug list)
Environment:
Last Closed: 2016-08-11 12:26:19 UTC
Target Upstream Version:
Embargoed:


Attachments
rabbit@overcloud-controller-0.log (47.94 KB, text/plain)
2016-06-17 16:49 UTC, Marius Cornea
keystone.log (26.78 KB, text/plain)
2016-06-17 16:50 UTC, Marius Cornea
MnesiaCore.rabbit@overcloud-controller-0_1466_181123_372835 (182.65 KB, application/octet-stream)
2016-06-17 16:50 UTC, Marius Cornea
rabbit\@overcloud-controller-0.log (48.82 KB, text/plain)
2016-07-25 11:19 UTC, Marius Cornea


Links
Github erlang/otp PR 1075 (open): epmd: require explicitly adding loopback address. Last updated 2021-02-03 09:39:17 UTC.
Github erlang/otp PR 1129 (closed): Respect -proto_dist switch while connecting to/registering at EPMD. Last updated 2021-02-03 09:39:17 UTC.
Red Hat Product Errata RHEA-2016:1597 (SHIPPED_LIVE): Red Hat OpenStack Platform 9 Release Candidate Advisory. Last updated 2016-08-11 16:06:52 UTC.

Description Marius Cornea 2016-06-17 16:46:24 UTC
Description of problem:
RabbitMQ resources fail to start in HA IPv6 deployment so the overcloud deployment fails.

Version-Release number of selected component (if applicable):
rabbitmq-server-3.6.2-3.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-11.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy overcloud with IPv6 isolated networks

Actual results:
Deployment fails:

 Clone Set: rabbitmq-clone [rabbitmq]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
Failed Actions:
* rabbitmq_start_0 on overcloud-controller-1 'unknown error' (1): call=86, status=complete, exitreason='none',
    last-rc-change='Fri Jun 17 16:29:14 2016', queued=0ms, exec=1274ms
* rabbitmq_start_0 on overcloud-controller-2 'unknown error' (1): call=92, status=complete, exitreason='none',
    last-rc-change='Fri Jun 17 16:29:17 2016', queued=0ms, exec=1393ms
* rabbitmq_start_0 on overcloud-controller-0 'unknown error' (1): call=91, status=complete, exitreason='none',
    last-rc-change='Fri Jun 17 16:29:20 2016', queued=1ms, exec=2410ms


Expected results:
RabbitMQ resources get started

Additional info:
The server appears to be listening on port 5672
ss -ln | grep 5672
tcp    LISTEN     0      128     fd00:fd00:fd00:2000::13:5672                 :::*                  
tcp    LISTEN     0      128      :::35672                :::*                  

There are some errors showing up in rabbit\@overcloud-controller-0.log, attaching the log.

In keystone.log I can see many amqp connection errors. Attaching that log as well.
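
A quick way to re-check AMQP reachability over IPv6 (a sketch, not from the original report): the probe below reuses the listener address from the ss output above and bash's /dev/tcp redirection, so it does not depend on any particular nc variant.

# Succeeds and prints a message only if the listener accepts a TCP connection
timeout 3 bash -c 'exec 3<>/dev/tcp/fd00:fd00:fd00:2000::13/5672' && echo "5672 reachable over IPv6"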

Comment 2 Marius Cornea 2016-06-17 16:49:54 UTC
Created attachment 1169177 [details]
rabbit@overcloud-controller-0.log

Comment 3 Marius Cornea 2016-06-17 16:50:21 UTC
Created attachment 1169178 [details]
keystone.log

Comment 4 Marius Cornea 2016-06-17 16:50:42 UTC
Created attachment 1169179 [details]
MnesiaCore.rabbit@overcloud-controller-0_1466_181123_372835

Comment 6 John Eckersberg 2016-06-17 18:25:10 UTC
This is probably due to...

[root@overcloud-controller-0 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@overcloud-controller-0' ...
Error: operation cluster_status used with invalid parameter: []

The resource agent relies heavily on the cluster_status command, so if it's broken I'm not surprised that nothing works in HA.
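
For reference, a minimal sketch of how to confirm whether the underlying applications are actually up (assumes rabbitmqctl can reach the node at all):

# List the Erlang applications running on the node; on a healthy broker
# both rabbit and mnesia should appear in the result
rabbitmqctl eval 'application:which_applications().'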

Comment 7 John Eckersberg 2016-06-17 18:38:39 UTC
Actually, scratch that. The reason cluster_status doesn't work is that the mnesia app isn't running, so it can't contact the database to retrieve the cluster status.

And the reason mnesia isn't running can be seen in the rabbit log:

** Reason for termination == 
** {{unparseable,"/usr/bin/df: '/var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-0': No such file or directory\n"},835")

...

 ** FATAL ** {error,{"Cannot rename disk_log file",latest_log,
                     "/var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-0/PREVIOUS.LOG",
                     {log_header,trans_log,"4.3","4.13.4",
                                 'rabbit@overcloud-controller-0',
                                 {1466,181123,357657}},
                     {file_error,"/var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-0/PREVIOUS.LOG",
                                 enoent}}}


And I confirmed that there is no /var/lib/rabbitmq/mnesia directory on the filesystem. So something (probably the resource agent) is erroneously wiping that directory while the service is running.
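
A minimal check for that, assuming the standard data directory layout:

# The directory named in the error should exist while rabbit is running
ls -ld /var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-0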

Comment 8 John Eckersberg 2016-06-17 18:48:19 UTC
I'm guessing this has something to do with:

http://pkgs.devel.redhat.com/cgit/rpms/rabbitmq-server/commit/?h=rhos-9.0-rhel-7&id=76db4bc4dd7312949c8c0415305c53822e15cd4c

Maybe the exit codes are still different from what the resource agent expects?

Comment 9 Peter Lemenkov 2016-06-21 15:04:11 UTC
First thing: the network on that cluster looks unreliable. I see constant ssh freezes.

Comment 10 Peter Lemenkov 2016-06-21 15:04:34 UTC
(In reply to John Eckersberg from comment #8)
> I'm guessing this has something to do with:
> 
> http://pkgs.devel.redhat.com/cgit/rpms/rabbitmq-server/commit/?h=rhos-9.0-
> rhel-7&id=76db4bc4dd7312949c8c0415305c53822e15cd4c
> 
> Maybe the exit codes are still different from what the resource agent
> expects?

Nope, I don't think so.

Comment 14 John Eckersberg 2016-06-22 20:45:18 UTC
I looked at this a bit more on a different system today. It looks like one or more of the IPv6 patches were lost and/or broken during the rebase to Erlang 18. At the very least, the epmd client is only trying to connect via IPv4, which fails, so pretty much nothing works with clustering.
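
A hedged way to see this from the shell (epmd listens on TCP 4369; the /dev/tcp probes below are a sketch):

# Show names registered with the local epmd
epmd -names

# Compare whether the epmd port answers over IPv4 vs IPv6
timeout 3 bash -c 'exec 3<>/dev/tcp/127.0.0.1/4369' && echo "epmd answers over IPv4"
timeout 3 bash -c 'exec 3<>/dev/tcp/::1/4369' && echo "epmd answers over IPv6"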

Comment 15 Jay Dobies 2016-06-23 13:10:36 UTC
*** Bug 1347432 has been marked as a duplicate of this bug. ***

Comment 16 Peter Lemenkov 2016-07-14 14:59:09 UTC
Ok, I've finally found what's going on there.

0. You need to change the rabbitmq config if running in an IPv6-only network.
1. We need to fix Erlang/OTP to work properly in IPv6 networks. A patch is available, so expect a new build soon:

https://github.com/fedora-erlang/erlang/commit/6dc7add

2. We need to fix RabbitMQ to respect "proto_dist" as well.

Also I believe we should prevent further "IPv6-only environment" failures and cherry-pick this patch to allow skipping IPv4 localhost entirely:

* https://github.com/erlang/otp/pull/1075

My ETA for all that is July 18 or 19.

Comment 17 Peter Lemenkov 2016-07-14 15:05:01 UTC
Almost forgot one more thing:

4. We need to use a more IP-protocol-version-agnostic EPMD service dependency than just epmd@0.0.0.0.socket. Something like epmd@::.socket?
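
Something along these lines, assuming the erlang package ships an epmd@.socket template unit (the unit names here are a sketch, to be confirmed):

# Move socket activation from the IPv4 wildcard to the IPv6 wildcard
systemctl disable epmd@0.0.0.0.socket
systemctl enable epmd@::.socket
systemctl start epmd@::.socket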

Comment 18 Peter Lemenkov 2016-07-15 14:36:26 UTC
Status update. Regarding task no.2 (see the list of tasks above), a patch is available:

https://github.com/lemenkov/rabbitmq-common/commit/ee36f22

I haven't looked into task no.3 yet.

Comment 19 rprakash 2016-07-15 19:31:23 UTC
The issues here are listed as 0, 1, 2, and 4; what is issue 3?
Please confirm so we can review. - IPv6 team member

Comment 20 Peter Lemenkov 2016-07-17 16:43:44 UTC
(In reply to rprakash from comment #19)
> The issues here are listed as 0, 1, 2, and 4; what is issue 3?
> Please confirm so we can review. - IPv6 team member

Forgot to number it; this is issue no.3:

cherry-pick this patch to allow skipping IPv4 localhost entirely:

* https://github.com/erlang/otp/pull/1075

Status report. I've managed to get it working on various IPv6 configurations. You need to do the following things:


0. You need to create /etc/rabbitmq/rabbitmq-env.conf file with the following contents:

SERVER_ERL_ARGS="-proto_dist inet6_tcp +P 1048576"
CTL_ERL_ARGS=${SERVER_ERL_ARGS}

1. Upgrade Erlang to erlang-18.3.4.1-1.el7ost
2. Upgrade RabbitMQ to rabbitmq-server-3.6.3-4.el7ost

This will allow you to run RabbitMQ under pacemaker. Please test and report any issues.

I'm working on fixing the systemd service file.
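
One hedged way to confirm that a node really came up with the IPv6 distribution after steps 0-2 (assumes CTL_ERL_ARGS is in place, so rabbitmqctl itself connects via inet6_tcp):

# Ask the running node which -proto_dist flag it was started with;
# on an IPv6 cluster this should return {ok,[["inet6_tcp"]]}
rabbitmqctl eval 'init:get_argument(proto_dist).'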

Comment 21 Marius Cornea 2016-07-19 11:20:24 UTC
(In reply to Peter Lemenkov from comment #20)
> 0. You need to create /etc/rabbitmq/rabbitmq-env.conf file with the
> following contents:
> 
> SERVER_ERL_ARGS="-proto_dist inet6_tcp +P 1048576"
> CTL_ERL_ARGS=${SERVER_ERL_ARGS}

Should we clone this BZ to implement this configuration change via OSPd?

Comment 23 Peter Lemenkov 2016-07-20 13:34:59 UTC
(In reply to Marius Cornea from comment #21)
> (In reply to Peter Lemenkov from comment #20)
> > 0. You need to create /etc/rabbitmq/rabbitmq-env.conf file with the
> > following contents:
> > 
> > SERVER_ERL_ARGS="-proto_dist inet6_tcp +P 1048576"
> > CTL_ERL_ARGS=${SERVER_ERL_ARGS}
> 
> Should we clone this BZ to implement this configuration change via OSPd?

Yes, we should do that. Something like

if [ "$CLUSTERING_OVER_IPV4" = "true" ]; then
  SERVER_ERL_ARGS="-proto_dist inet_tcp +P 1048576"
else  # clustering over IPv6
  SERVER_ERL_ARGS="-proto_dist inet6_tcp +P 1048576"
fi

should be added to the configurator application (Director?).

Comment 24 Peter Lemenkov 2016-07-20 14:09:36 UTC
Btw, I'm curious: is it possible that even IPv4 localhost (127.x.x.x) won't be available to RabbitMQ? I mean, should we consider the case of a fully IPv6-only environment as well?

Comment 25 Marius Cornea 2016-07-20 14:42:48 UTC
I don't think we can do that at this point, since there is at least one NIC configured with an IPv4 address, which is used for the OSPd provisioning network.

Comment 26 Bin Hu 2016-07-25 04:59:26 UTC
Is this issue fixed? When will the fix be picked up in an official Apex build (or Mitaka build)? Thank you.

Comment 27 Marius Cornea 2016-07-25 11:19:28 UTC
Created attachment 1183743 [details]
rabbit\@overcloud-controller-0.log

My deployment is still failing, and from what I can see in the log it fails in the same way as initially reported:

[root@overcloud-controller-0 heat-admin]# rpm -qa | grep rabbitmq
rabbitmq-server-3.6.3-4.el7ost.noarch

 Clone Set: rabbitmq-clone [rabbitmq]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Failed Actions:
* rabbitmq_start_0 on overcloud-controller-2 'unknown error' (1): call=86, status=complete, exitreason='none',
    last-rc-change='Mon Jul 25 11:08:04 2016', queued=0ms, exec=2713ms
* rabbitmq_start_0 on overcloud-controller-0 'unknown error' (1): call=92, status=complete, exitreason='none',
    last-rc-change='Mon Jul 25 11:08:09 2016', queued=0ms, exec=2552ms
* rabbitmq_start_0 on overcloud-controller-1 'unknown error' (1): call=91, status=complete, exitreason='none',
    last-rc-change='Mon Jul 25 11:08:14 2016', queued=0ms, exec=2887ms

Attaching /var/log/rabbitmq/rabbit\@overcloud-controller-0.log.

Comment 29 Peter Lemenkov 2016-07-25 15:45:09 UTC
It seems that this issue has nothing to do with either IPv6 or RabbitMQ. I was able to fix the cluster w/o changing the configuration, using only rabbitmqctl itself.

The only change I made was upgrading to the 3.6.3-5 build, which fixed a nasty timeout during cluster node removal. See bug 1356169 for further details.

This (upgrade to 3.6.3-5) might be considered a fix for this issue as well, since it prevents pcs from marking perfectly healthy RabbitMQ nodes as failed.
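
For an existing overcloud, the mechanical steps would be roughly the following on each controller (a sketch; the exact package version comes from the OSP channels):

# Upgrade to the fixed build, then clear the failed resource state so
# pacemaker retries the start
yum -y upgrade rabbitmq-server
pcs resource cleanup rabbitmq-clone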

Comment 30 Bin Hu 2016-07-26 04:39:58 UTC
Can QA verify that upgrading to rabbitmq-server-3.6.3-5.el7ost fixes this issue?

Thanks
Bin

Comment 31 Fabio Massimo Di Nitto 2016-07-26 08:27:53 UTC
(In reply to Bin Hu from comment #30)
> Can QA verify that upgrading to rabbitmq-server-3.6.3-5.el7ost fixes this
> issue?
> 
> Thanks
> Bin

Bin, you need to be patient. There are multiple overlapping bugs at the moment that make verification of bugs like this one difficult.

Our QE engineers will verify this specific bug once the others are fixed and the system is more stable.

Comment 32 Marius Cornea 2016-07-26 08:45:43 UTC
I tested this with a manually installed rabbitmq-server-3.6.3-5.el7ost.noarch on the overcloud image, and on a fresh deployment I couldn't reproduce the issue, so I think Peter is right.

Comment 33 Fabio Massimo Di Nitto 2016-07-26 12:00:00 UTC
(In reply to Marius Cornea from comment #32)
> I tested this with a manually installed rabbitmq-server-3.6.3-5.el7ost.noarch
> on the overcloud image, and on a fresh deployment I couldn't reproduce the
> issue, so I think Peter is right.

OK, I am moving this bug back to ON_QA, and once we can verify it properly, after the other underlying bug is fixed, we can move it to VERIFIED.

Comment 35 Marius Cornea 2016-07-27 11:10:39 UTC
OK, I updated the images with the latest core packages and the deployment completed successfully with no failed resources. Moving this to verified. 

[root@overcloud-controller-0 heat-admin]# rpm -qa | grep rabbitmq
rabbitmq-server-3.6.3-5.el7ost.noarch

Comment 36 Peter Lemenkov 2016-07-27 11:50:41 UTC
(In reply to Marius Cornea from comment #35)
> OK, I updated the images with the latest core packages and the deployment
> completed successfully with no failed resources. Moving this to verified. 
> 
> [root@overcloud-controller-0 heat-admin]# rpm -qa | grep rabbitmq
> rabbitmq-server-3.6.3-5.el7ost.noarch

Btw, what's your resource-agents version?

Comment 37 Marius Cornea 2016-07-27 11:51:28 UTC
(In reply to Peter Lemenkov from comment #36)
> (In reply to Marius Cornea from comment #35)
> > OK, I updated the images with the latest core packages and the deployment
> > completed successfully with no failed resources. Moving this to verified. 
> > 
> > [root@overcloud-controller-0 heat-admin]# rpm -qa | grep rabbitmq
> > rabbitmq-server-3.6.3-5.el7ost.noarch
> 
> Btw, what's your resource-agents version?

resource-agents-3.9.5-54.el7_2.10.x86_64

Comment 39 errata-xmlrpc 2016-08-11 12:26:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1597.html

