Bug 1358311 - RabbitMQ resources fail to start in HA IPv6 deployment
Summary: RabbitMQ resources fail to start in HA IPv6 deployment
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ga
Target Release: 9.0 (Mitaka)
Assignee: Michele Baldessari
QA Contact: Omri Hochman
URL:
Whiteboard:
Depends On: 1347802
Blocks: 1344405
 
Reported: 2016-07-20 13:38 UTC by Marius Cornea
Modified: 2016-07-25 07:29 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1347802
Environment:
Last Closed: 2016-07-25 07:29:22 UTC
Target Upstream Version:



Description Marius Cornea 2016-07-20 13:38:02 UTC
(In reply to Marius Cornea from comment #21)
> (In reply to Peter Lemenkov from comment #20)
> > 0. You need to create /etc/rabbitmq/rabbitmq-env.conf file with the
> > following contents:
> > 
> > SERVER_ERL_ARGS="-proto_dist inet6_tcp +P 1048576"
> > CTL_ERL_ARGS=${SERVER_ERL_ARGS}
> 
> Should we clone this BZ to implement this configuration change via OSPd?

Yes, we should do that. Something like 

if CLUSTERING_OVER_IPV4 then
  SERVER_ERL_ARGS="-proto_dist inet_tcp +P 1048576"
else // CLUSTERING_OVER_IPV6
  SERVER_ERL_ARGS="-proto_dist inet6_tcp +P 1048576"
fi

should be added to configurator application (Director?).
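
The pseudo-code above could be made runnable roughly as follows. This is only a sketch under stated assumptions: the function names proto_dist_for and render_rabbitmq_env are hypothetical, and the rule "an address containing ':' is IPv6" is a simplification chosen for illustration, not something Director is known to do. Since inet_tcp is described later in this report as the default, the IPv4 branch is kept only to mirror the proposal.

```shell
#!/bin/sh
# Hypothetical helper: choose the Erlang distribution protocol from the
# address used for clustering. Any address containing ':' is treated as IPv6.
proto_dist_for() {
    case "$1" in
        *:*) echo "inet6_tcp" ;;
        *)   echo "inet_tcp" ;;
    esac
}

# Hypothetical helper: print the two rabbitmq-env.conf lines quoted in this report,
# with the protocol selected from the given address.
render_rabbitmq_env() {
    proto=$(proto_dist_for "$1")
    printf 'SERVER_ERL_ARGS="-proto_dist %s +P 1048576"\n' "$proto"
    printf 'CTL_ERL_ARGS=${SERVER_ERL_ARGS}\n'
}

# Example: the IPv6 listener address that appears in the ss output of this report.
render_rabbitmq_env "fd00:fd00:fd00:2000::13"
```

The output is written to stdout rather than to /etc/rabbitmq/rabbitmq-env.conf so that it can be inspected before anything on the node is changed.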


+++ This bug was initially created as a clone of Bug #1347802 +++

Description of problem:
RabbitMQ resources fail to start in HA IPv6 deployment so the overcloud deployment fails.

Version-Release number of selected component (if applicable):
rabbitmq-server-3.6.2-3.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-11.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy overcloud with IPv6 isolated networks

Actual results:
Deployment fails:

 Clone Set: rabbitmq-clone [rabbitmq]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
Failed Actions:
* rabbitmq_start_0 on overcloud-controller-1 'unknown error' (1): call=86, status=complete, exitreason='none',
    last-rc-change='Fri Jun 17 16:29:14 2016', queued=0ms, exec=1274ms
* rabbitmq_start_0 on overcloud-controller-2 'unknown error' (1): call=92, status=complete, exitreason='none',
    last-rc-change='Fri Jun 17 16:29:17 2016', queued=0ms, exec=1393ms
* rabbitmq_start_0 on overcloud-controller-0 'unknown error' (1): call=91, status=complete, exitreason='none',
    last-rc-change='Fri Jun 17 16:29:20 2016', queued=1ms, exec=2410ms


Expected results:
RabbitMQ resources get started

Additional info:
The server appears to be listening on port 5672
ss -ln | grep 5672
tcp    LISTEN     0      128     fd00:fd00:fd00:2000::13:5672                 :::*                  
tcp    LISTEN     0      128      :::35672                :::*                  
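
Note that the grep above also matches port 35672. A stricter filter might look like the sketch below; listeners_on_port is a hypothetical helper, and it assumes the `ss -ln` column layout shown above, where the local address:port is the fifth field.

```shell
#!/bin/sh
# Read `ss -ln` style output on stdin and print the local addresses whose
# port is exactly the one requested (assumes local address:port is field 5).
listeners_on_port() {
    awk -v port="$1" '{
        n = split($5, parts, ":")
        if (parts[n] == port) print $5
    }'
}

# Example with the two lines from this report; only the first one matches 5672.
listeners_on_port 5672 <<'EOF'
tcp    LISTEN     0      128     fd00:fd00:fd00:2000::13:5672                 :::*
tcp    LISTEN     0      128      :::35672                :::*
EOF
```

On a live node the same function could be fed directly, e.g. `ss -ln | listeners_on_port 5672`.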

There are some errors showing up in rabbit\@overcloud-controller-0.log, attaching the log.

In keystone.log I can see many amqp connection errors. Attaching that log as well.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-06-17 12:46:29 EDT ---

Since this issue was entered in bugzilla without a release flag set, rhos-8.0.z? has been automatically added to ensure that it is properly evaluated for this release.

--- Additional comment from Marius Cornea on 2016-06-17 12:49 EDT ---



--- Additional comment from Marius Cornea on 2016-06-17 12:50 EDT ---



--- Additional comment from Marius Cornea on 2016-06-17 12:50 EDT ---



--- Additional comment from Marius Cornea on 2016-06-17 12:54:14 EDT ---

Undercloud is available @ ssh stack@10.35.117.132 -p 2900; pass stack

--- Additional comment from John Eckersberg on 2016-06-17 14:25:10 EDT ---

This is probably due to...

[root@overcloud-controller-0 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@overcloud-controller-0' ...
Error: operation cluster_status used with invalid parameter: []

The resource agent relies heavily on the cluster_status command, so if it's broken I'm not surprised that nothing works in HA.

--- Additional comment from John Eckersberg on 2016-06-17 14:38:39 EDT ---

Actually scratch that.  The reason cluster_status doesn't work is because the mnesia app isn't running, so it can't contact the database to retrieve the cluster status.

And the reason mnesia isn't running can be seen in the rabbit log:

** Reason for termination == 
** {{unparseable,"/usr/bin/df: '/var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-0': No such file or directory\n"},835")

...

 ** FATAL ** {error,{"Cannot rename disk_log file",latest_log,
                     "/var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-0/PREVIOUS.LOG",
                     {log_header,trans_log,"4.3","4.13.4",
                                 'rabbit@overcloud-controller-0',
                                 {1466,181123,357657}},
                     {file_error,"/var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-0/PREVIOUS.LOG",
                                 enoent}}}


And confirmed that there is no /var/lib/rabbitmq/mnesia directory on the filesystem.  So something (probably the resource agent) is erroneously wiping that directory while the service is running.
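
A quick way to repeat that check on each controller is sketched below. mnesia_dir_status is a hypothetical helper; its default base path and node name simply mirror the paths quoted in the log above and are assumptions for any other setup.

```shell
#!/bin/sh
# Report whether the Mnesia directory for a node exists.
# $1: base directory (defaults to the path seen in the log above)
# $2: node name (defaults to rabbit@<short hostname>, an assumption)
mnesia_dir_status() {
    base=${1:-/var/lib/rabbitmq/mnesia}
    node=${2:-rabbit@$(hostname -s)}
    if [ -d "$base/$node" ]; then
        echo "present: $base/$node"
    else
        echo "missing: $base/$node"
        return 1
    fi
}

# Example: check the node named in this report; ignore the exit status here.
mnesia_dir_status /var/lib/rabbitmq/mnesia "rabbit@overcloud-controller-0" || true
```

The non-zero return value on a missing directory makes the function usable in a loop over the controllers.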

--- Additional comment from John Eckersberg on 2016-06-17 14:48:19 EDT ---

I'm guessing this has something to do with:

http://pkgs.devel.redhat.com/cgit/rpms/rabbitmq-server/commit/?h=rhos-9.0-rhel-7&id=76db4bc4dd7312949c8c0415305c53822e15cd4c

Maybe the exit codes are still different from what the resource agent expects?

--- Additional comment from Peter Lemenkov on 2016-06-21 11:04:11 EDT ---

First thing - network on that cluster looks unreliable. I see constant ssh freezes.

--- Additional comment from Peter Lemenkov on 2016-06-21 11:04:34 EDT ---

(In reply to John Eckersberg from comment #8)
> I'm guessing this has something to do with:
> 
> http://pkgs.devel.redhat.com/cgit/rpms/rabbitmq-server/commit/?h=rhos-9.0-
> rhel-7&id=76db4bc4dd7312949c8c0415305c53822e15cd4c
> 
> Maybe the exit codes are still different from what the resource agent
> expects?

Nope, I don't think so.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-06-21 11:23:13 EDT ---

This bugzilla has had the "blocker?" flag added after it has already been approved for a release, so it needs to be reviewed again to determine if it truly is a blocker.

--- Additional comment from Peter Lemenkov on 2016-06-21 11:41:35 EDT ---

(In reply to Marius Cornea from comment #5)
> Undercloud is available @ ssh stack@10.35.117.132 -p 2900; pass stack

Mario, could you please fix networking issues (or redeploy it from scratch). I can barely use it - network is highly unreliable.

--- Additional comment from Marius Cornea on 2016-06-21 12:37:19 EDT ---

(In reply to Peter Lemenkov from comment #12)
> Mario, could you please fix networking issues (or redeploy it from scratch).
> I can barely use it - network is highly unreliable.

It's a virtual environment with all nodes running on the same machine. I couldn't see any issue with the connection; maybe it was something temporary in the TLV lab where this machine is hosted. Can you try again?

 mtr --report 10.35.117.132
Start: Tue Jun 21 18:36:18 2016
HOST: remoteur.brq.redhat.com     Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ovpn-204-1.brq.redhat.com  0.0%    10   44.2  45.5  43.7  48.4   1.5
  2.|-- 10.40.17.254               0.0%    10   48.0  50.8  45.7  82.1  11.0
  3.|-- 10.40.255.50               0.0%    10   45.2  46.9  44.9  59.5   4.5
  4.|-- 10.34.253.122              0.0%    10   46.4  47.2  45.2  53.1   2.8
  5.|-- 10.34.253.79               0.0%    10   46.6  46.6  44.9  53.1   2.4
  6.|-- 10.34.253.84               0.0%    10   44.5  46.1  44.5  49.0   1.2
  7.|-- 10.4.57.142                0.0%    10  118.6 116.7 115.3 118.6   0.9
  8.|-- 10.35.253.10               0.0%    10  120.9 130.1 116.6 221.9  32.4
  9.|-- 10.35.254.29               0.0%    10  117.2 123.3 115.9 158.9  14.0
 10.|-- seal30                     0.0%    10  116.6 119.1 115.4 132.2   6.1

--- Additional comment from John Eckersberg on 2016-06-22 16:45:18 EDT ---

I looked at this a bit more on a different system today.  It looks like one or more of the IPv6 patches were lost and/or broken during the rebase to Erlang 18.  At the very least, the epmd client only tries to connect via IPv4, which fails, so pretty much nothing works with clustering.

--- Additional comment from Jay Dobies on 2016-06-23 09:10:36 EDT ---



--- Additional comment from Peter Lemenkov on 2016-07-14 10:59:09 EDT ---

Ok, I've finally found what's going on there.

0. You need to change rabbitmq config if using in IPv6-only network.
1. We need to fix Erlang/OTP to properly work in IPv6 networks. Patch is available, so expect a new build soon:

https://github.com/fedora-erlang/erlang/commit/6dc7add

2. We need to fix RabbitMQ to respect "proto_dist" as well.

Also, I believe we should prevent further "IPv6-only environment" failures and cherry-pick this patch to allow skipping IPv4 localhost entirely:

* https://github.com/erlang/otp/pull/1075

My ETA for all that is July 18 or 19.

--- Additional comment from Peter Lemenkov on 2016-07-14 11:05:01 EDT ---

Almost forgot one more thing:

4. We need a more IP-version-agnostic EPMD service dependency than just epmd@0.0.0.0.socket. Something like epmd@::.socket?

--- Additional comment from Peter Lemenkov on 2016-07-15 10:36:26 EDT ---

Status update. Regarding task no.2 (see list of tasks above) - patch available:

https://github.com/lemenkov/rabbitmq-common/commit/ee36f22

I haven't looked at task no.3 yet.

--- Additional comment from rprakash on 2016-07-15 15:31:23 EDT ---

The issues here are listed as 0,1,2,4 what is issue 3?
Please confirm for us to review -IPv6 team member

--- Additional comment from Peter Lemenkov on 2016-07-17 12:43:44 EDT ---

(In reply to rprakash from comment #19)
> The issues here are listed as 0,1,2,4 what is issue 3?
> Please confirm for us to review -IPv6 team member

Forgot to number it - this is issue no.3

cherry pick this patch to allow skipping IPv4 localhost entirely:

* https://github.com/erlang/otp/pull/1075

Status report. I've managed to get it to work on various IPv6 configurations. You need to do the following things:


0. You need to create /etc/rabbitmq/rabbitmq-env.conf file with the following contents:

SERVER_ERL_ARGS="-proto_dist inet6_tcp +P 1048576"
CTL_ERL_ARGS=${SERVER_ERL_ARGS}

1. Upgrade Erlang to erlang-18.3.4.1-1.el7ost
2. Upgrade RabbitMQ to rabbitmq-server-3.6.3-4.el7ost

This will allow you to run RabbitMQ under pacemaker. Please test and report any issues.

I'm working on fixing systemd service file.
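
Step 0 above could be applied and sanity-checked along these lines. This is a minimal sketch: write_rabbitmq_env and verify_rabbitmq_env are hypothetical helpers, the target path is a parameter so the sketch can be tried on a scratch file first, and the check only confirms the file's contents, not that RabbitMQ actually clusters over IPv6.

```shell
#!/bin/sh
# Write the env file contents from step 0 to the given path.
write_rabbitmq_env() {
    target=$1
    cat > "$target" <<'EOF'
SERVER_ERL_ARGS="-proto_dist inet6_tcp +P 1048576"
CTL_ERL_ARGS=${SERVER_ERL_ARGS}
EOF
}

# Source the file in a subshell and confirm that both variables carry
# the IPv6 distribution setting.
verify_rabbitmq_env() {
    (
        . "$1" || exit 1
        case "$SERVER_ERL_ARGS" in
            *"-proto_dist inet6_tcp"*) ;;
            *) exit 1 ;;
        esac
        [ "$CTL_ERL_ARGS" = "$SERVER_ERL_ARGS" ]
    )
}

# Try it against a scratch file rather than /etc/rabbitmq/rabbitmq-env.conf.
scratch=$(mktemp)
write_rabbitmq_env "$scratch"
verify_rabbitmq_env "$scratch" && echo "env file OK"
rm -f "$scratch"
```

Pass /etc/rabbitmq/rabbitmq-env.conf as the argument (as root) once the scratch run looks right.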

--- Additional comment from Marius Cornea on 2016-07-19 07:20:24 EDT ---

(In reply to Peter Lemenkov from comment #20)
> 0. You need to create /etc/rabbitmq/rabbitmq-env.conf file with the
> following contents:
> 
> SERVER_ERL_ARGS="-proto_dist inet6_tcp +P 1048576"
> CTL_ERL_ARGS=${SERVER_ERL_ARGS}

Should we clone this BZ to implement this configuration change via OSPd?

--- Additional comment from errata-xmlrpc on 2016-07-19 09:37:44 EDT ---

Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHEA-2016:23271-01
https://errata.devel.redhat.com/advisory/23271

--- Additional comment from Peter Lemenkov on 2016-07-20 09:34:59 EDT ---

(In reply to Marius Cornea from comment #21)
> (In reply to Peter Lemenkov from comment #20)
> > 0. You need to create /etc/rabbitmq/rabbitmq-env.conf file with the
> > following contents:
> > 
> > SERVER_ERL_ARGS="-proto_dist inet6_tcp +P 1048576"
> > CTL_ERL_ARGS=${SERVER_ERL_ARGS}
> 
> Should we clone this BZ to implement this configuration change via OSPd?

Yes, we should do that. Something like 

if CLUSTERING_OVER_IPV4 then
  SERVER_ERL_ARGS="-proto_dist inet_tcp +P 1048576"
else // CLUSTERING_OVER_IPV6
  SERVER_ERL_ARGS="-proto_dist inet6_tcp +P 1048576"
fi

should be added to configurator application (Director?).

Comment 1 Michele Baldessari 2016-07-20 13:57:40 UTC
Hi Peter, 

I have a question about backwards compatibility. Suppose we somehow implement:
if CLUSTERING_OVER_IPV4 then
  SERVER_ERL_ARGS="-proto_dist inet_tcp +P 1048576"
else // CLUSTERING_OVER_IPV6
  SERVER_ERL_ARGS="-proto_dist inet6_tcp +P 1048576"
fi

What happens if we set those variables in director/tripleo and rabbit does not
have your patch included?

Just trying to understand the impact of how to backport this and how we need
to synchronize rabbit+director/tripleo here.

Thanks,
Michele

Comment 2 Peter Lemenkov 2016-07-20 14:06:37 UTC
First of all, rabbitmq w/o patches won't behave well in a fully IPv6 environment. So you may consider that cluster management over IPv6 is broken anyway. This means that the only affected case which might have some regressions is cluster management over IPv4 (RabbitMQ itself can open AMQP ports using IPv4 and/or IPv6, but clustering was always done over IPv4). And for IPv4 adding "-proto_dist inet_tcp" won't change anything since it's a default value.

Note that we're stepping into uncharted territory. IPv6 Erlang distribution is a very new thing; that is, people have *always* relied on IPv4 for cluster management until now. So we may see some other unknown effects.

Comment 3 Michele Baldessari 2016-07-21 18:57:08 UTC
(I wanted to test stuff in master but I am blocked by https://review.openstack.org/#/c/345542 and https://bugs.launchpad.net/tripleo/+bug/1605363)

Hi Peter,

just a small clarification. Today in mitaka we already have the following:
$rabbit_ipv6 = str2bool(hiera('rabbit_ipv6', false))
if $rabbit_ipv6 {
    $rabbit_env = merge(hiera('rabbitmq_environment'), {
      'RABBITMQ_SERVER_START_ARGS' => '"-proto_dist inet6_tcp"'   
    })
} else {
  $rabbit_env = hiera('rabbitmq_environment')
} 
  
class { '::rabbitmq':
  service_manage          => false,
  tcp_keepalive           => false,
  config_kernel_variables => hiera('rabbitmq_kernel_variables'),
  config_variables        => hiera('rabbitmq_config_variables'),
  environment_variables   => $rabbit_env,
} ....

So basically the only thing missing as far as I understand is the
addition of "-proto_dist inet_tcp" in the ipv4 case. But since you mentioned
before that this is the default, I think we should be good already, am I correct here or do we need to force "-proto_dist inet_tcp" in the ipv4 case?

Thanks,
Michele

Comment 4 Michele Baldessari 2016-07-21 19:13:20 UTC
If the answer is "yes, we need it", this would be the needed change:
diff --git a/puppet/manifests/overcloud_controller_pacemaker.pp b/puppet/manifests/overcloud_controller_pacemaker.pp
index ef54df2f2ce3..63d636afb108 100644
--- a/puppet/manifests/overcloud_controller_pacemaker.pp
+++ b/puppet/manifests/overcloud_controller_pacemaker.pp
@@ -122,7 +122,9 @@ if hiera('step') >= 1 {
         'RABBITMQ_SERVER_START_ARGS' => '"-proto_dist inet6_tcp"'
       })
   } else {
-    $rabbit_env = hiera('rabbitmq_environment')
+    $rabbit_env = merge(hiera('rabbitmq_environment'), {
+        'RABBITMQ_SERVER_START_ARGS' => '"-proto_dist inet_tcp"'
+    })
   }
 
   class { '::rabbitmq':

Comment 5 Peter Lemenkov 2016-07-22 08:55:18 UTC
(In reply to Michele Baldessari from comment #3)

> So basically the only thing missing as far as I understand is the
> addition of "-proto_dist inet_tcp" in the ipv4 case. But since you mentioned
> before that this is the default, I think we should be good already, am I
> correct here or do we need to force "-proto_dist inet_tcp" in the ipv4 case?


We don't need to add anything extra in case of IPv4. So no, there is no need to add a special case of IPv4.

Comment 6 Michele Baldessari 2016-07-25 07:29:22 UTC
Thanks Peter,

so I am closing this one since the code is in tripleo mitaka already.
Marius, if testing of https://bugzilla.redhat.com/show_bug.cgi?id=1347802 brings
up anything fishy do let us know.

thanks all,
Michele

