Bug 1104193

Summary: rabbitmqctl doesn't work
Product: Red Hat OpenStack
Reporter: Fabio Massimo Di Nitto <fdinitto>
Component: rabbitmq-server
Assignee: John Eckersberg <jeckersb>
Status: CLOSED ERRATA
QA Contact: Ami Jeain <ajeain>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 5.0 (RHEL 7)
CC: acathrow, apevec, jeckersb, lhh, yeylon
Target Milestone: rc
Target Release: 5.0 (RHEL 7)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: erlang-R16B-03.7min.1.el7ost, rabbitmq-server-3.1.5-6.3.el7ost
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1104843 (view as bug list)
Environment:
Last Closed: 2014-07-08 15:13:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1104843, 1083890

Description Fabio Massimo Di Nitto 2014-06-03 13:22:48 UTC
[root@rhos5-rabbitmq1 ~]# systemctl start rabbitmq-server
[root@rhos5-rabbitmq1 ~]# rabbitmqctl status
Status of node 'rabbit@rhos5-rabbitmq1' ...
Error: unable to connect to node 'rabbit@rhos5-rabbitmq1': nodedown

DIAGNOSTICS
===========

nodes in question: ['rabbit@rhos5-rabbitmq1']

hosts, their running nodes and ports:
- rhos5-rabbitmq1: [{rabbitmqctl2980,45861}]

current node details:
- node name: 'rabbitmqctl2980@rhos5-rabbitmq1'
- home dir: /var/lib/rabbitmq
- cookie hash: vzI5NUyusQhCrkRr6GAhpQ==


I am unable to use rabbitmqctl. This is a blocker for HA, since configuring RabbitMQ HA policies requires it.

Comment 2 John Eckersberg 2014-06-03 14:32:29 UTC
Try restarting rabbitmq-server and then run rabbitmqctl again.  I've seen this happen sometimes where for some reason rabbitmqctl cannot contact the node the very first time it is started, but restarting it fixes it.  I need to debug further to see why this happens.

Comment 3 Fabio Massimo Di Nitto 2014-06-03 14:34:02 UTC
(In reply to John Eckersberg from comment #2)
> Try restarting rabbitmq-server and then run rabbitmqctl again.  I've seen
> this happen sometimes where for some reason rabbitmqctl cannot contact the
> node the very first time it is started, but restarting it fixes it.  I need
> to debug further to see why this happens.

I already tested several restarts without success, and I can't use a workaround on a cluster where services are restarted as necessary. I need a fix, or I can't configure HA queues in rabbit: every how-to I have found requires rabbitmqctl to do the config bits.

Comment 4 John Eckersberg 2014-06-03 14:49:46 UTC
Here's the behavior I see, where a restart fixes it.  I will investigate the cause next, but wanted to post the symptom here first:


[root@control yum.repos.d]# service rabbitmq-server start
Redirecting to /bin/systemctl start  rabbitmq-server.service
[root@control yum.repos.d]# service rabbitmq-server status
Redirecting to /bin/systemctl status  rabbitmq-server.service
rabbitmq-server.service - RabbitMQ broker
   Loaded: loaded (/usr/lib/systemd/system/rabbitmq-server.service; disabled)
   Active: active (running) since Mon 2014-04-21 12:11:32 EDT; 2s ago
  Process: 19454 ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmqctl wait /var/run/rabbitmq/pid (code=exited, status=0/SUCCESS)
  Process: 19428 ExecStartPre=/bin/sh -c /usr/lib/rabbitmq/bin/rabbitmqctl status > /dev/null 2>&1 (code=exited, status=2)
 Main PID: 19453 (beam)
   CGroup: /system.slice/rabbitmq-server.service
           ├─19453 /usr/lib64/erlang/erts-5.10.4/bin/beam -W w -K true -A30 -P 1048576 -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.1.5/sbin...
           ├─19547 inet_gethost 4
           └─19548 inet_gethost 4

Apr 21 12:11:30 control.example.org rabbitmqctl[19454]: pid is 19453 ...
Apr 21 12:11:30 control.example.org rabbitmq-server[19453]: RabbitMQ 3.1.5. Copyright (C) 2007-2013 GoPivotal, Inc.
Apr 21 12:11:30 control.example.org rabbitmq-server[19453]: ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
Apr 21 12:11:30 control.example.org rabbitmq-server[19453]: ##  ##
Apr 21 12:11:30 control.example.org rabbitmq-server[19453]: ##########  Logs: /var/log/rabbitmq/rabbit
Apr 21 12:11:30 control.example.org rabbitmq-server[19453]: ######  ##        /var/log/rabbitmq/rabbit
Apr 21 12:11:30 control.example.org rabbitmq-server[19453]: ##########
Apr 21 12:11:32 control.example.org rabbitmq-server[19453]: Starting broker... completed with 0 plugins.
Apr 21 12:11:32 control.example.org rabbitmqctl[19454]: ...done.
Apr 21 12:11:32 control.example.org systemd[1]: Started RabbitMQ broker.
[root@control yum.repos.d]# rabbitmqctl status
Status of node rabbit@control ...
Error: unable to connect to node rabbit@control: nodedown

DIAGNOSTICS
===========

nodes in question: [rabbit@control]

hosts, their running nodes and ports:
- control: [{rabbitmqctl19571,56593}]

current node details:
- node name: rabbitmqctl19571@control
- home dir: /var/lib/rabbitmq
- cookie hash: vDcXSX+o4/EfW12EcE8zWA==

[root@control yum.repos.d]# service rabbitmq-server restart
Redirecting to /bin/systemctl restart  rabbitmq-server.service
[root@control yum.repos.d]# rabbitmqctl status
Status of node rabbit@control ...
[{pid,19666},
 {running_applications,[{rabbit,"RabbitMQ","3.1.5"},
                        {os_mon,"CPO  CXC 138 46","2.2.14"},
                        {xmerl,"XML parser","1.3.5"},
                        {mnesia,"MNESIA  CXC 138 12","4.11"},
                        {sasl,"SASL  CXC 138 11","2.3.4"},
                        {stdlib,"ERTS  CXC 138 10","1.19.4"},
                        {kernel,"ERTS  CXC 138 10","2.16.4"}]},
 {os,{unix,linux}},
 {erlang_version,"Erlang R16B03 (erts-5.10.4) [source] [64-bit] [async-threads:30] [hipe] [kernel-poll:true]\n"},
 {memory,[{total,34667592},
          {connection_procs,2720},
          {queue_procs,5440},
          {plugins,0},
          {other_proc,13357816},
          {mnesia,57376},
          {mgmt_db,0},
          {msg_index,21512},
          {other_ets,727296},
          {binary,7512},
          {code,16434406},
          {atom,594537},
          {other_system,3458977}]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,772199219},
 {disk_free_limit,1000000000},
 {disk_free,8061767680},
 {file_descriptors,[{total_limit,924},
                    {total_used,3},
                    {sockets_limit,829},
                    {sockets_used,1}]},
 {processes,[{limit,1048576},{used,122}]},
 {run_queue,0},
 {uptime,4}]
...done.

Comment 5 John Eckersberg 2014-06-03 16:40:23 UTC
I've verified this happens on F20 as well, so it's not related to the way we've packaged rabbit/erlang for OSP 5.

Comment 6 John Eckersberg 2014-06-03 20:54:14 UTC
Ok, I think I've got this figured out.  I'm going to file a more detailed bug against rabbitmq-server in F20, but the general problem is this.

- rabbitmqctl queries epmd (erlang port mapper daemon) to determine what port the rabbitmq server is listening on

- epmd is started by the erlang runtime when the rabbitmq-server service starts

- because of the specific way that rabbitmq-server fork/exec()s and the way the systemd unit file is configured, systemd sends SIGKILL to epmd shortly after it is spawned.

- Thus rabbitmqctl cannot contact epmd, and reports (somewhat incorrectly) that the server node is down.  Really the port mapper is down.

Until this is corrected, you can ensure that the epmd process is running by running `rabbitmqctl status` before starting the rabbitmq-server service.
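
In practice the workaround amounts to something like this (a minimal sketch; the `epmd -names` call is only an illustrative check and is not part of the workaround itself):

rabbitmqctl status        # expected to fail with nodedown, but it spawns an epmd outside of systemd's control
epmd -names               # confirm the port mapper is now up (it listens on port 4369 by default)
systemctl start rabbitmq-server
rabbitmqctl status        # succeeds, because the broker registered with the surviving epmd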

Comment 7 Fabio Massimo Di Nitto 2014-06-04 04:28:35 UTC
(In reply to John Eckersberg from comment #6)
> Ok, I think I've got this figured out.  I'm going to file a more detailed
> bug against rabbitmq-server in F20, but the general problem is this.
> 
> - rabbitmqctl queries epmd (erlang port mapper daemon) to determine what
> port the rabbitmq server is listening on
> 
> - epmd is started by the erlang runtime when the rabbitmq-server service
> starts
> 
> - because of the specific way that rabbitmq-server fork/exec()s and the way
> the systemd unit file is configured, systemd sends SIGKILL to epmd shortly
> after it is spawned.
> 
> - Thus rabbitmqctl cannot contact epmd, and reports (somewhat incorrectly)
> that the server node is down.  Really the port mapper is down.
> 
> Until this is corrected, you can ensure that the epmd process is running by
> running `rabbitmqctl status` before starting the rabbitmq-server service.

I can't execute that automatically in a cluster environment unfortunately. We need it fixed quickly. Workarounds are not an option :(

Comment 8 Alan Pevec 2014-06-04 10:52:42 UTC
(In reply to John Eckersberg from comment #6)
> Until this is corrected, you can ensure that the epmd process is running by
> running `rabbitmqctl status` before starting the rabbitmq-server service.

That should be executed in ExecStartPre (your fix for rhbz 1059913)

http://pkgs.fedoraproject.org/cgit/rabbitmq-server.git/tree/rabbitmq-server.service#n13

ExecStartPre=-/bin/sh -c "/usr/lib/rabbitmq/bin/rabbitmqctl status > /dev/null 2>&1"

Comment 9 John Eckersberg 2014-06-04 13:12:10 UTC
(In reply to Alan Pevec from comment #8)
> (In reply to John Eckersberg from comment #6)
> > Until this is corrected, you can ensure that the epmd process is running by
> > running `rabbitmqctl status` before starting the rabbitmq-server service.
> 
> That should be executed in ExecStartPre (your fix for rhbz 1059913)
> 
> http://pkgs.fedoraproject.org/cgit/rabbitmq-server.git/tree/rabbitmq-server.
> service#n13
> 
> ExecStartPre=-/bin/sh -c "/usr/lib/rabbitmq/bin/rabbitmqctl status >
> /dev/null 2>&1"

That doesn't help in this case, though.  Because that runs under the context of systemd, the epmd process that gets spawned by that command is also sent SIGKILL.  You need to get the daemon started outside of systemd, e.g. by running rabbitmqctl on the CLI beforehand.

In my example above where restarting it "fixes it", what actually happens is:

1. rabbitmq-server starts, registers to epmd, epmd is killed by systemd
2. Running `rabbitmqctl status` starts a new epmd process outside of systemd which survives
3. Restarting rabbitmq-server causes it to register with the epmd spawned by rabbitmqctl in step #2.  Now everything works.
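
That sequence can be observed with `epmd -names` at each step (purely an illustrative check, not part of any fix):

systemctl start rabbitmq-server
epmd -names                        # fails: the epmd spawned during startup was SIGKILLed by systemd
rabbitmqctl status                 # reports nodedown, but leaves behind an epmd started from the shell
epmd -names                        # epmd is now up, though no broker is registered with it yet
systemctl restart rabbitmq-server
epmd -names                        # now lists the rabbit node, and rabbitmqctl works again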

Comment 10 John Eckersberg 2014-06-05 19:06:29 UTC
Here's what my plan is for RHOS5:

- Patch the rabbitmq-server package to remove the ExecStartPre and ExecStartPost lines from the systemd unit file (a rough sketch of the equivalent change appears after this list).  This will fix the rabbitmqctl problem, but introduce a minor race condition where systemd will consider the service started before it is entirely up (since the `rabbitmqctl wait` in ExecStartPost is being removed).  This is an acceptable short-term tradeoff.  If an openstack service fails to connect due to the race, it will sleep and retry once the service is up.

- Bug 1103524 plans to switch to using the notify service type for rabbitmq.  It depends on package review bug 1104604 for erlang-sd_notify, which I am going to take after adding the workaround above.

- Once we have erlang-sd_notify available, we can update rabbitmq in Fedora to use it and switch the service type over to notify.  This will correct the problem properly.

- We'll grab the corrected Fedora packages and rebuild for RHOS5.
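
For the first item, a rough local equivalent of the packaging change is a drop-in that clears the two hooks (a sketch only; the actual fix is carried in the patched unit file, and the drop-in path here is just an example):

mkdir -p /etc/systemd/system/rabbitmq-server.service.d
cat > /etc/systemd/system/rabbitmq-server.service.d/no-start-hooks.conf <<'EOF'
[Service]
# Empty assignments clear the ExecStartPre= and ExecStartPost= lists that the
# packaged unit defines; dropping those hooks is the change described above
# (at the cost of the noted start-up race).
ExecStartPre=
ExecStartPost=
EOF
systemctl daemon-reload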

Comment 11 Fabio Massimo Di Nitto 2014-06-05 19:09:12 UTC
OK, how about you ship an initscript-style shell script that does everything necessary to start rabbitmq-server properly, and hook that into the systemd unit file?

Re-introducing a race is risky since pacemaker will rely on the service being up before starting other openstack services.

Comment 12 Alan Pevec 2014-06-05 19:18:59 UTC
OpenStack services keep retrying the connection to rabbitmq until it succeeds, so this race shouldn't be critical; it might just add some noise to the logs.

Comment 13 Fabio Massimo Di Nitto 2014-06-05 19:35:54 UTC
(In reply to Alan Pevec from comment #12)
> openstack services keep try reconnecting to rabbitmq until connect succeeds
> so this race shouldn't be critical, it might just add some noise in logs

I am more worried about pacemaker:

start -> monitor -> race? fault? -> restart -> loop.

I can add a monitor delay after a start, as I am doing now for some services, but it's not optimal.

Comment 14 John Eckersberg 2014-06-05 20:30:44 UTC
I'll try doing it with the shell script approach.  That will require switching the service type from "simple" to "forking", which may cause more problems.  We'll see.

Also, as a data point, I tested how long the race is with the other method; it is consistently about 0.9 seconds on my VM:

[root@control ~]# service rabbitmq-server start; while ! echo aaaaaaaaaaa | socat - tcp:localhost:amqp; do sleep 0.1; done
Redirecting to /bin/systemctl start  rabbitmq-server.service
2014/04/21 13:02:00 socat[21416] E connect(3, AF=2 127.0.0.1:5672, 16): Connection refused
2014/04/21 13:02:00 socat[21443] E connect(3, AF=2 127.0.0.1:5672, 16): Connection refused
2014/04/21 13:02:00 socat[21484] E connect(3, AF=2 127.0.0.1:5672, 16): Connection refused
2014/04/21 13:02:00 socat[21487] E connect(3, AF=2 127.0.0.1:5672, 16): Connection refused
2014/04/21 13:02:00 socat[21490] E connect(3, AF=2 127.0.0.1:5672, 16): Connection refused
2014/04/21 13:02:00 socat[21493] E connect(3, AF=2 127.0.0.1:5672, 16): Connection refused
2014/04/21 13:02:00 socat[21496] E connect(3, AF=2 127.0.0.1:5672, 16): Connection refused
2014/04/21 13:02:00 socat[21499] E connect(3, AF=2 127.0.0.1:5672, 16): Connection refused
2014/04/21 13:02:01 socat[21502] E connect(3, AF=2 127.0.0.1:5672, 16): Connection refused
AMQP    [root@control ~]#
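
For reference, the forking approach could take roughly this shape (a hypothetical sketch; the wrapper path, its contents and the -detached flag are illustrative, not necessarily what the patched package ships):

cat > /usr/lib/rabbitmq/bin/rabbitmq-start-wrapper <<'EOF'
#!/bin/sh
# Hypothetical wrapper: running rabbitmqctl here starts epmd from this
# script's own context rather than as a systemd control process, then the
# broker is launched detached so a Type=forking unit can track it.
/usr/lib/rabbitmq/bin/rabbitmqctl status > /dev/null 2>&1 || :
exec /usr/sbin/rabbitmq-server -detached
EOF
chmod 755 /usr/lib/rabbitmq/bin/rabbitmq-start-wrapper

The unit would then use Type=forking with ExecStart pointing at such a wrapper instead of starting the broker directly.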

Comment 15 John Eckersberg 2014-06-05 23:08:24 UTC
Ok, I've implemented the forking workaround, and verified it works for me:

[root@control ~]# rpm -q rabbitmq-server
rabbitmq-server-3.1.5-6.2.el7ost.noarch
[root@control ~]# service rabbitmq-server start && rabbitmqctl status
Redirecting to /bin/systemctl start  rabbitmq-server.service
Status of node rabbit@control ...
[{pid,25908},
 {running_applications,[{rabbit,"RabbitMQ","3.1.5"},
                        {os_mon,"CPO  CXC 138 46","2.2.14"},
                        {xmerl,"XML parser","1.3.5"},
                        {mnesia,"MNESIA  CXC 138 12","4.11"},
                        {sasl,"SASL  CXC 138 11","2.3.4"},
                        {stdlib,"ERTS  CXC 138 10","1.19.4"},
                        {kernel,"ERTS  CXC 138 10","2.16.4"}]},
 {os,{unix,linux}},
 {erlang_version,"Erlang R16B03 (erts-5.10.4) [source] [64-bit] [async-threads:30] [hipe] [kernel-poll:true]\n"},
 {memory,[{total,35430384},
          {connection_procs,5728},
          {queue_procs,5440},
          {plugins,0},
          {other_proc,13412048},
          {mnesia,57376},
          {mgmt_db,0},
          {msg_index,70424},
          {other_ets,721856},
          {binary,669856},
          {code,16434406},
          {atom,594537},
          {other_system,3458713}]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,772199219},
 {disk_free_limit,1000000000},
 {disk_free,8052510720},
 {file_descriptors,[{total_limit,924},
                    {total_used,3},
                    {sockets_limit,829},
                    {sockets_used,1}]},
 {processes,[{limit,1048576},{used,123}]},
 {run_queue,0},
 {uptime,1}]
...done.
[root@control ~]#

Comment 16 John Eckersberg 2014-06-13 14:04:25 UTC
Moving this back to ON_DEV since the built package is only a workaround, and there is more dev work to be done in order to have a proper package.

Comment 17 John Eckersberg 2014-06-17 19:59:47 UTC
New plan of action:

- Rebase the RHOS5 erlang spec against the latest in Fedora.  The Fedora one has some nifty systemd bits for socket activation of the epmd service (separate from the one that rabbitmq-server tries to launch).

- Update the rabbitmq-server unit to require epmd@0.0.0.0.socket (a drop-in sketch follows below).  This avoids an issue where epmd isn't listening on the correct interface (see https://bugzilla.redhat.com/show_bug.cgi?id=1104843#c3)
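
A local version of that dependency could look like the following drop-in (a sketch; the socket unit name epmd@0.0.0.0.socket is assumed from the template socket shipped by the newer Fedora erlang packages, and the drop-in path is just an example):

mkdir -p /etc/systemd/system/rabbitmq-server.service.d
cat > /etc/systemd/system/rabbitmq-server.service.d/epmd.conf <<'EOF'
[Unit]
# Make the standalone, socket-activated epmd (bound to all interfaces)
# available before the broker starts, instead of relying on the epmd that
# the Erlang runtime tries to launch itself.
Requires=epmd@0.0.0.0.socket
After=epmd@0.0.0.0.socket
EOF
systemctl daemon-reload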

Comment 21 Ami Jeain 2014-06-25 18:58:27 UTC
# rabbitmqctl status
Status of node rabbit@cougar13 ...
[{pid,16121},
 {running_applications,[{rabbit,"RabbitMQ","3.1.5"},
                        {mnesia,"MNESIA  CXC 138 12","4.11"},
                        {os_mon,"CPO  CXC 138 46","2.2.14"},
                        {xmerl,"XML parser","1.3.5"},
                        {sasl,"SASL  CXC 138 11","2.3.4"},
                        {stdlib,"ERTS  CXC 138 10","1.19.4"},
                        {kernel,"ERTS  CXC 138 10","2.16.4"}]},
 {os,{unix,linux}},
 {erlang_version,"Erlang R16B03 (erts-5.10.4) [source] [64-bit] [smp:6:6] [async-threads:30] [hipe] [kernel-poll:true]\n"},
 {memory,[{total,85658480},
          {connection_procs,3671616},
          {queue_procs,1366024},
          {plugins,0},
          {other_proc,13698456},
          {mnesia,309712},
          {mgmt_db,0},
          {msg_index,93048},
          {other_ets,897576},
          {binary,43758000},
          {code,16428070},
          {atom,594537},
          {other_system,4841441}]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,13425097113},
 {disk_free_limit,1000000000},
 {disk_free,444528570368},
 {file_descriptors,[{total_limit,924},
                    {total_used,79},
                    {sockets_limit,829},
                    {sockets_used,77}]},
 {processes,[{limit,1048576},{used,897}]},
 {run_queue,0},
 {uptime,21449}]
...done.

# rpm -qa |grep rabbit
rabbitmq-server-3.1.5-6.3.el7ost.noarch

Comment 23 errata-xmlrpc 2014-07-08 15:13:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0845.html