Bug 1104843

Summary:	RabbitMQ server shouldn't leave epmd running after being stopped
Product:	[Fedora] Fedora
Component:	rabbitmq-server
Version:	rawhide
Hardware:	Unspecified
OS:	Unspecified
Status:	NEW
Severity:	high
Priority:	urgent
Keywords:	FutureFeature
Type:	Bug
Doc Type:	Enhancement
Reporter:	John Eckersberg <jeckersb>
Assignee:	Peter Lemenkov <lemenkov>
QA Contact:	Fedora Extras Quality Assurance <extras-qa>
CC:	apevec, erlang, fdinitto, hubert.plociniczak, jeckersb, lemenkov, lhh, plemenko, rjones, s
Clone Of:	1104193
Bug Depends On:	1104193
Bug Blocks:	1104199

Description John Eckersberg 2014-06-04 19:22:17 UTC

+++ This bug was initially created as a clone of Bug #1104193 +++

[root@rhos5-rabbitmq1 ~]# systemctl start rabbitmq-server
[root@rhos5-rabbitmq1 ~]# rabbitmqctl status
Status of node 'rabbit@rhos5-rabbitmq1' ...
Error: unable to connect to node 'rabbit@rhos5-rabbitmq1': nodedown

DIAGNOSTICS
===========

nodes in question: ['rabbit@rhos5-rabbitmq1']

hosts, their running nodes and ports:
- rhos5-rabbitmq1: [{rabbitmqctl2980,45861}]

current node details:
- node name: 'rabbitmqctl2980@rhos5-rabbitmq1'
- home dir: /var/lib/rabbitmq
- cookie hash: vzI5NUyusQhCrkRr6GAhpQ==


I am unable to use rabbitmqctl. This is a blocker for HA, because rabbitmqctl is needed to configure rabbitmq ha policies.

--- Additional comment from John Eckersberg on 2014-06-03 10:32:29 EDT ---

Try restarting rabbitmq-server and then run rabbitmqctl again.  I've seen this happen sometimes where for some reason rabbitmqctl cannot contact the node the very first time it is started, but restarting it fixes it.  I need to debug further to see why this happens.

--- Additional comment from Fabio Massimo Di Nitto on 2014-06-03 10:34:02 EDT ---

(In reply to John Eckersberg from comment #2)
> Try restarting rabbitmq-server and then run rabbitmqctl again.  I've seen
> this happen sometimes where for some reason rabbitmqctl cannot contact the
> node the very first time it is started, but restarting it fixes it.  I need
> to debug further to see why this happens.

I already tested several restarts without success and I can't use a workaround on a cluster where services are restarted as necessary. I need a fix or I can't configure HA queues in rabbit, as every how-to I found requires rabbitmqctl to do the config bits.

--- Additional comment from John Eckersberg on 2014-06-03 10:49:46 EDT ---

Here's the behavior I see, where a restart fixes it.  I will investigate the cause next, but wanted to post the symptom here first:


[root@control yum.repos.d]# service rabbitmq-server start
Redirecting to /bin/systemctl start  rabbitmq-server.service
[root@control yum.repos.d]# service rabbitmq-server status
Redirecting to /bin/systemctl status  rabbitmq-server.service
rabbitmq-server.service - RabbitMQ broker
   Loaded: loaded (/usr/lib/systemd/system/rabbitmq-server.service; disabled)
   Active: active (running) since Mon 2014-04-21 12:11:32 EDT; 2s ago
  Process: 19454 ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmqctl wait /var/run/rabbitmq/pid (code=exited, status=0/SUCCESS)
  Process: 19428 ExecStartPre=/bin/sh -c /usr/lib/rabbitmq/bin/rabbitmqctl status > /dev/null 2>&1 (code=exited, status=2)
 Main PID: 19453 (beam)
   CGroup: /system.slice/rabbitmq-server.service
           ├─19453 /usr/lib64/erlang/erts-5.10.4/bin/beam -W w -K true -A30 -P 1048576 -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.1.5/sbin...
           ├─19547 inet_gethost 4
           └─19548 inet_gethost 4

Apr 21 12:11:30 control.example.org rabbitmqctl[19454]: pid is 19453 ...
Apr 21 12:11:30 control.example.org rabbitmq-server[19453]: RabbitMQ 3.1.5. Copyright (C) 2007-2013 GoPivotal, Inc.
Apr 21 12:11:30 control.example.org rabbitmq-server[19453]: ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
Apr 21 12:11:30 control.example.org rabbitmq-server[19453]: ##  ##
Apr 21 12:11:30 control.example.org rabbitmq-server[19453]: ##########  Logs: /var/log/rabbitmq/rabbit
Apr 21 12:11:30 control.example.org rabbitmq-server[19453]: ######  ##        /var/log/rabbitmq/rabbit
Apr 21 12:11:30 control.example.org rabbitmq-server[19453]: ##########
Apr 21 12:11:32 control.example.org rabbitmq-server[19453]: Starting broker... completed with 0 plugins.
Apr 21 12:11:32 control.example.org rabbitmqctl[19454]: ...done.
Apr 21 12:11:32 control.example.org systemd[1]: Started RabbitMQ broker.
[root@control yum.repos.d]# rabbitmqctl status
Status of node rabbit@control ...
Error: unable to connect to node rabbit@control: nodedown

DIAGNOSTICS
===========

nodes in question: [rabbit@control]

hosts, their running nodes and ports:
- control: [{rabbitmqctl19571,56593}]

current node details:
- node name: rabbitmqctl19571@control
- home dir: /var/lib/rabbitmq
- cookie hash: vDcXSX+o4/EfW12EcE8zWA==

[root@control yum.repos.d]# service rabbitmq-server restart
Redirecting to /bin/systemctl restart  rabbitmq-server.service
[root@control yum.repos.d]# rabbitmqctl status
Status of node rabbit@control ...
[{pid,19666},
 {running_applications,[{rabbit,"RabbitMQ","3.1.5"},
                        {os_mon,"CPO  CXC 138 46","2.2.14"},
                        {xmerl,"XML parser","1.3.5"},
                        {mnesia,"MNESIA  CXC 138 12","4.11"},
                        {sasl,"SASL  CXC 138 11","2.3.4"},
                        {stdlib,"ERTS  CXC 138 10","1.19.4"},
                        {kernel,"ERTS  CXC 138 10","2.16.4"}]},
 {os,{unix,linux}},
 {erlang_version,"Erlang R16B03 (erts-5.10.4) [source] [64-bit] [async-threads:30] [hipe] [kernel-poll:true]\n"},
 {memory,[{total,34667592},
          {connection_procs,2720},
          {queue_procs,5440},
          {plugins,0},
          {other_proc,13357816},
          {mnesia,57376},
          {mgmt_db,0},
          {msg_index,21512},
          {other_ets,727296},
          {binary,7512},
          {code,16434406},
          {atom,594537},
          {other_system,3458977}]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,772199219},
 {disk_free_limit,1000000000},
 {disk_free,8061767680},
 {file_descriptors,[{total_limit,924},
                    {total_used,3},
                    {sockets_limit,829},
                    {sockets_used,1}]},
 {processes,[{limit,1048576},{used,122}]},
 {run_queue,0},
 {uptime,4}]
...done.

--- Additional comment from John Eckersberg on 2014-06-03 12:40:23 EDT ---

I've verified this happens on F20 as well, so it's not something related to the way we've packaged rabbit/erlang for osp5.

--- Additional comment from John Eckersberg on 2014-06-03 16:54:14 EDT ---

Ok, I think I've got this figured out.  I'm going to file a more detailed bug against rabbitmq-server in F20, but the general problem is this.

- rabbitmqctl queries epmd (erlang port mapper daemon) to determine what port the rabbitmq server is listening on

- epmd is started by the erlang runtime when the rabbitmq-server service starts

- because of the specific way that rabbitmq-server fork/exec()s and the way the systemd unit file is configured, systemd sends SIGKILL to epmd shortly after it is spawned.

- Thus rabbitmqctl cannot contact epmd, and reports (somewhat incorrectly) that the server node is down.  Really the port mapper is down.

Until this is corrected, you can ensure that the epmd process is running by running `rabbitmqctl status` before starting the rabbitmq-server service.
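
One way to tell "node down" apart from "port mapper down" is to ask epmd directly which nodes it has registered. The following is a minimal sketch rather than anything from the original report: the helper name node_registered is made up for illustration, and it assumes the usual `epmd -names` output, in which each registered node appears on a line of the form "name <node> at port <port>".

```shell
#!/bin/sh
# Hypothetical helper (not from the bug report): reads `epmd -names` style
# output on stdin and succeeds only if the given short node name is listed.
# Assumption: each registered node is printed as "name <node> at port <port>".
node_registered() {
    grep -q "^name $1 at port "
}

# Intended use on a live system (left commented out so that the snippet has
# no side effects when it is sourced):
#   if epmd -names 2>/dev/null | node_registered rabbit; then
#       echo "rabbit is registered with epmd"
#   else
#       echo "epmd is down, or rabbit is not registered with it"
#   fi
```

If the broker process is alive but this check fails, the situation matches the one described above: the node itself is up, but nothing can look up its port.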

--- Additional comment from Fabio Massimo Di Nitto on 2014-06-04 00:28:35 EDT ---

(In reply to John Eckersberg from comment #6)
> Until this is corrected, you can ensure that the epmd process is running by
> running `rabbitmqctl status` before starting the rabbitmq-server service.

I can't execute that automatically in a cluster environment unfortunately. We need it fixed quickly. Workarounds are not an option :(

--- Additional comment from Alan Pevec on 2014-06-04 06:52:42 EDT ---

(In reply to John Eckersberg from comment #6)
> Until this is corrected, you can ensure that the epmd process is running by
> running `rabbitmqctl status` before starting the rabbitmq-server service.

That should be executed in ExecStartPre (your fix for rhbz 1059913)

http://pkgs.fedoraproject.org/cgit/rabbitmq-server.git/tree/rabbitmq-server.service#n13

ExecStartPre=-/bin/sh -c "/usr/lib/rabbitmq/bin/rabbitmqctl status > /dev/null 2>&1"

--- Additional comment from John Eckersberg on 2014-06-04 09:12:10 EDT ---

(In reply to Alan Pevec from comment #8)
> (In reply to John Eckersberg from comment #6)
> > Until this is corrected, you can ensure that the epmd process is running by
> > running `rabbitmqctl status` before starting the rabbitmq-server service.
> 
> That should be executed in ExecStartPre (your fix for rhbz 1059913)
> 
> http://pkgs.fedoraproject.org/cgit/rabbitmq-server.git/tree/rabbitmq-server.service#n13
> 
> ExecStartPre=-/bin/sh -c "/usr/lib/rabbitmq/bin/rabbitmqctl status > /dev/null 2>&1"

That doesn't help in this case, though.  Because that runs under the context of systemd, the epmd process that gets spawned by that command is also sent SIGKILL.  You need to get the daemon started outside of systemd, e.g. by running rabbitmqctl on the CLI beforehand.

In my example above where restarting it "fixes it", what actually happens is:

1. rabbitmq-server starts, registers to epmd, epmd is killed by systemd
2. Running `rabbitmqctl status` starts a new epmd process outside of systemd which survives
3. Restarting rabbitmq-server causes it to register with the epmd spawned by rabbitmqctl in step #2.  Now everything works.

Comment 1 John Eckersberg 2014-06-04 20:42:10 UTC

Some more details on this.

I used systemtap to prove that systemd is explicitly killing the epmd process:

Wed Jun  4 20:18:09 2014 : sh (16202) is exec'ing "/usr/lib64/erlang/erts-5.10.4/bin/epmd"
Wed Jun  4 20:18:09 2014 : epmd (16202) created 16204
Wed Jun  4 20:18:09 2014 : epmd (16202) is exiting
Wed Jun  4 20:18:09 2014 : epmd (16204) created 16205
Wed Jun  4 20:18:09 2014 : epmd (16204) is exiting
Wed Jun  4 20:18:09 2014 : sh (16203) is exec'ing "/usr/lib64/erlang/erts-5.10.4/bin/epmd"
Wed Jun  4 20:18:09 2014 : epmd (16203) created 16206
Wed Jun  4 20:18:09 2014 : epmd (16203) is exiting
Wed Jun  4 20:18:09 2014 : epmd (16206) created 16209
Wed Jun  4 20:18:09 2014 : epmd (16206) is exiting
Wed Jun  4 20:18:09 2014 : epmd (16209) is exiting
Wed Jun  4 20:18:09 2014 : sh (16247) is exec'ing "/usr/lib64/erlang/erts-5.10.4/bin/epmd"
Wed Jun  4 20:18:09 2014 : epmd (16247) created 16248
Wed Jun  4 20:18:09 2014 : epmd (16247) is exiting
Wed Jun  4 20:18:09 2014 : epmd (16248) created 16250
Wed Jun  4 20:18:09 2014 : epmd (16248) is exiting
Wed Jun  4 20:18:09 2014 : epmd (16250) is exiting
SIGKILL was sent to epmd (pid:16205) by systemd (pid:1) uid:0
Wed Jun  4 20:18:12 2014 : epmd (16205) is exiting


I think this results from a combination of:

- rabbitmq-server is of Type=simple in the systemd unit file
- It double forks off epmd, so it's no longer parented to the main process
- The unit file has an ExecStartPost line

Systemd does not allow long-running processes to be started from ExecStartPre/ExecStartPost so it will purposefully try to kill off anything that is hanging around.  It's kinda hard to see in the above log, but the epmd daemon gets started by rabbitmq-server[1], then the ExecStartPost runs to wait on the pidfile, and then finally systemd sends SIGKILL to the epmd process.  I believe systemd thinks the epmd process is from the ExecStartPost.  To further this theory, if I comment out the ExecStartPost line, systemd does *not* send SIGKILL and everything works as expected.

Fortunately, I think the correct fix is to make sure bug 1103524 gets fixed as soon as possible.  If the service type is changed to notify, then both the ExecStartPre and ExecStartPost lines can be removed, thus avoiding the undesirable kill behavior.

[1] I've explicitly commented out the ExecStartPre cookie race hack in my testing

Comment 2 Peter Lemenkov 2014-06-11 07:44:56 UTC

Hello All.
A small update on this - both F-20/Rawhide and EPEL7 Erlang builds now contain an epmd.socket file for socket-activated epmd. It's still not enabled by default though (just because I didn't dig into details regarding systemd presets).

I believe we should add the following lines to rabbitmq-server.service

After=epmd.socket
Requires=epmd.socket


Also I believe we can drop ExecStartPre now.
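
For local testing, the same two lines can be tried without editing the packaged unit file, by placing them in a drop-in. This is a sketch under assumptions, not the packaging change itself: the helper name write_epmd_dropin and the file name epmd.conf are made up, and whether the broker then starts still depends on the address epmd.socket listens on.

```shell
#!/bin/sh
# Hypothetical helper: writes a drop-in that adds the epmd.socket dependency
# to rabbitmq-server.service. The first argument is the systemd unit
# directory; /etc/systemd/system is the usual place for local overrides.
write_epmd_dropin() {
    dir="$1/rabbitmq-server.service.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/epmd.conf" <<'EOF'
[Unit]
After=epmd.socket
Requires=epmd.socket
EOF
}

# Example (as root), followed by a reload so that systemd picks it up:
#   write_epmd_dropin /etc/systemd/system && systemctl daemon-reload
```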

Comment 3 John Eckersberg 2014-06-13 21:30:19 UTC

(In reply to Peter Lemenkov from comment #2)
> Hello All.
> A small update on this - both F-20/Rawhide and EPEL7 Erlang builds now
> contain an epmd.socket file for socket-activated epmd. It's still not
> enabled by default though (just because I didn't dig into details regarding
> systemd presets).
> 
> I believe we should add the following lines to rabbitmq-server.service
> 
> After=epmd.socket
> Requires=epmd.socket
> 
> 
> Also I believe we can drop ExecStartPre now.

This does work, after making some tweaks.

As is, rabbitmq-server fails to start because epmd is set to bind only to localhost, and rabbit is trying to connect on the "public" interface.  Localhost is explicitly set in epmd.socket:

ListenStream=127.0.0.1:4369

And fails with the error:

Jun 13 21:03:08 jeckersb-f20 systemd[1]: Starting RabbitMQ broker...
Jun 13 21:03:09 jeckersb-f20 rabbitmqctl[1186]: Waiting for 'rabbit@jeckersb-f20' ...
Jun 13 21:03:09 jeckersb-f20 rabbitmqctl[1186]: pid is 1185 ...
Jun 13 21:03:09 jeckersb-f20 rabbitmq-server[1185]: ERROR: epmd error for host "jeckersb-f20": address (cannot connect to host/port)
Jun 13 21:03:09 jeckersb-f20 systemd[1]: rabbitmq-server.service: main process exited, code=exited, status=1/FAILURE
Jun 13 21:03:09 jeckersb-f20 rabbitmqctl[1186]: Error: process_not_running
Jun 13 21:03:09 jeckersb-f20 systemd[1]: rabbitmq-server.service: control process exited, code=exited status=2
Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: Stopping and halting node 'rabbit@jeckersb-f20' ...
Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: Error: unable to connect to node 'rabbit@jeckersb-f20': nodedown
Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: DIAGNOSTICS
Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: ===========
Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: nodes in question: ['rabbit@jeckersb-f20']
Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: hosts, their running nodes and ports:
Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: - unable to connect to epmd on jeckersb-f20: address (cannot connect to host/port)
Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: current node details:
Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: - node name: 'rabbitmqctl1253@jeckersb-f20'
Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: - home dir: /var/lib/rabbitmq
Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: - cookie hash: pjjhwhUNJ+O/cAmSUbP89w==
Jun 13 21:03:10 jeckersb-f20 systemd[1]: rabbitmq-server.service: control process exited, code=exited status=2
Jun 13 21:03:10 jeckersb-f20 systemd[1]: Failed to start RabbitMQ broker.
Jun 13 21:03:10 jeckersb-f20 systemd[1]: Unit rabbitmq-server.service entered failed state.


I tried changing this to be:

ListenStream=4369

Since that sounds like a sane dual-stack option, from systemd.socket(5):

"If the address string is a single number, it is read as port number to listen on via IPv6. Depending on the value of BindIPv6Only= (see below) this might result in the service being available via both IPv6 and IPv4 (default) or just via IPv6."

However this causes rabbitmq-server to crash like so:

Jun 13 21:10:20 jeckersb-f20 systemd[1]: Starting RabbitMQ broker...
Jun 13 21:10:21 jeckersb-f20 rabbitmqctl[1587]: {error_logger,{{2014,6,13},{21,10,21}},"Protocol: ~tp: register/listen error: ~tp~n",["inet_tcp",epmd_close]}
Jun 13 21:10:21 jeckersb-f20 rabbitmqctl[1587]: {error_logger,{{2014,6,13},{21,10,21}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.21.0>},{registered_name,[]},{error_info,{exit,{error,b
Jun 13 21:10:21 jeckersb-f20 rabbitmqctl[1587]: {error_logger,{{2014,6,13},{21,10,21}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,un
Jun 13 21:10:21 jeckersb-f20 rabbitmqctl[1587]: {error_logger,{{2014,6,13},{21,10,21}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,{shutdown,{failed_to_start_child,net_ker
Jun 13 21:10:21 jeckersb-f20 rabbitmqctl[1587]: {error_logger,{{2014,6,13},{21,10,21}},crash_report,[[{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{pid,<0.9.0>
Jun 13 21:10:21 jeckersb-f20 rabbitmqctl[1587]: {error_logger,{{2014,6,13},{21,10,21}},std_info,[{application,kernel},{exited,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{
Jun 13 21:10:21 jeckersb-f20 rabbitmq-server[1586]: {error_logger,{{2014,6,13},{21,10,21}},"Protocol: ~tp: register/listen error: ~tp~n",["inet_tcp",epmd_close]}
Jun 13 21:10:21 jeckersb-f20 rabbitmq-server[1586]: {error_logger,{{2014,6,13},{21,10,21}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.21.0>},{registered_name,[]},{error_info,{exit,{err
Jun 13 21:10:21 jeckersb-f20 rabbitmq-server[1586]: {error_logger,{{2014,6,13},{21,10,21}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pi
Jun 13 21:10:21 jeckersb-f20 rabbitmq-server[1586]: {error_logger,{{2014,6,13},{21,10,21}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,{shutdown,{failed_to_start_child,net
Jun 13 21:10:21 jeckersb-f20 rabbitmq-server[1586]: {error_logger,{{2014,6,13},{21,10,21}},crash_report,[[{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{pid,<0.
Jun 13 21:10:21 jeckersb-f20 rabbitmq-server[1586]: {error_logger,{{2014,6,13},{21,10,21}},std_info,[{application,kernel},{exited,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kern
Jun 13 21:10:22 jeckersb-f20 rabbitmqctl[1587]: {"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_ker
Jun 13 21:10:22 jeckersb-f20 rabbitmq-server[1586]: {"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net
Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1587]: Crash dump was written to: erl_crash.dump
Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1587]: Kernel pid terminated (application_controller) ({application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kern
Jun 13 21:10:23 jeckersb-f20 systemd[1]: rabbitmq-server.service: control process exited, code=exited status=1
Jun 13 21:10:23 jeckersb-f20 rabbitmq-server[1586]: Crash dump was written to: erl_crash.dump
Jun 13 21:10:23 jeckersb-f20 rabbitmq-server[1586]: Kernel pid terminated (application_controller) ({application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_
Jun 13 21:10:23 jeckersb-f20 systemd[1]: rabbitmq-server.service: main process exited, code=exited, status=1/FAILURE
Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1658]: {error_logger,{{2014,6,13},{21,10,23}},"Protocol: ~tp: register/listen error: ~tp~n",["inet_tcp",epmd_close]}
Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1658]: {error_logger,{{2014,6,13},{21,10,23}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.21.0>},{registered_name,[]},{error_info,{exit,{error,b
Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1658]: {error_logger,{{2014,6,13},{21,10,23}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,un
Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1658]: {error_logger,{{2014,6,13},{21,10,23}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,{shutdown,{failed_to_start_child,net_ker
Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1658]: {error_logger,{{2014,6,13},{21,10,23}},crash_report,[[{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{pid,<0.9.0>
Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1658]: {error_logger,{{2014,6,13},{21,10,23}},std_info,[{application,kernel},{exited,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{
Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1658]: {"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_ker
Jun 13 21:10:24 jeckersb-f20 rabbitmqctl[1658]: Crash dump was written to: erl_crash.dump
Jun 13 21:10:24 jeckersb-f20 rabbitmqctl[1658]: Kernel pid terminated (application_controller) ({application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kern
Jun 13 21:10:24 jeckersb-f20 systemd[1]: rabbitmq-server.service: control process exited, code=exited status=1
Jun 13 21:10:24 jeckersb-f20 systemd[1]: Failed to start RabbitMQ broker.
Jun 13 21:10:24 jeckersb-f20 systemd[1]: Unit rabbitmq-server.service entered failed state.

Finally, I forced it to listen IPv4 only, on all interfaces with:

ListenStream=0.0.0.0:4369

And this works as expected.  So I agree that we should use the systemd-managed epmd instance with all of its socket activation goodness, but only after the fix above.
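
The IPv4 wildcard setting can also be applied as a drop-in instead of editing the packaged epmd.socket. A minimal sketch, with assumptions labelled: the helper name write_epmd_listen_dropin and the file name listen.conf are made up, and the empty ListenStream= line relies on the behaviour documented in systemd.socket(5), where assigning an empty string resets the list of listening addresses. That reset may not be available on older systemd versions.

```shell
#!/bin/sh
# Hypothetical helper: writes a drop-in for epmd.socket that replaces the
# packaged ListenStream=127.0.0.1:4369 with the IPv4 wildcard address.
# The empty "ListenStream=" line is meant to reset the inherited list first;
# without it, the new address would be added to the packaged one rather
# than replace it.
write_epmd_listen_dropin() {
    dir="$1/epmd.socket.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/listen.conf" <<'EOF'
[Socket]
ListenStream=
ListenStream=0.0.0.0:4369
EOF
}

# Example (as root):
#   write_epmd_listen_dropin /etc/systemd/system
#   systemctl daemon-reload && systemctl restart epmd.socket
```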

Comment 4 John Eckersberg 2014-06-18 12:46:25 UTC

One other note on the standalone epmd.  I used the templated socket by adding:

Requires=epmd@0.0.0.0.socket

to the rabbitmq-server unit file, which works perfectly :)

Comment 5 Peter Lemenkov 2014-11-27 15:28:40 UTC

(In reply to John Eckersberg from comment #4)
> One other note on the standalone epmd.  I used the templated socket by
> adding:
> 
> Requires=epmd@0.0.0.0.socket
> 
> to the rabbitmq-server unit file, which works perfectly :)

I think we should finally add this to the service file. John, are you aware of any issues with this additional Requires so far?

My only concern is that if the administrator stops rabbitmq-server.service, epmd@0.0.0.0.socket still remains active, so this socket (tcp:0.0.0.0:4369) will remain open.

I suspect we could spice it up with some systemd magic (StopWhenUnneeded + BindsTo + RefuseManualStart) but so far I can't figure out how to do it properly. See my question on the devel@ list:

* http://thread.gmane.org/gmane.linux.redhat.fedora.devel/201512

I suppose this issue is described here:

* https://bugzilla.redhat.com/1104199

Citing my own private email:

====================
Although this might be related to a pre-systemd version of RabbitMQ, it
raises a valid question. It seems that people don't like the idea of
epmd remaining online after RabbitMQ shuts down. Frankly speaking, it
isn't a bug: what to do with another systemd service started as a
dependency is up to the systemd administrator. However, I believe this
should be adjusted by adding some combination of StopWhenUnneeded +
BindsTo + RefuseManualStart to epmd@.socket / epmd@.service. I tried
quickly but failed to implement the required functionality.

So far I have the following considerations:

* epmd@.service must refuse manual activation. It can be activated
only by the corresponding socket unit. So maybe we have to add
RefuseManualStart to the service. By contrast, the socket unit
counterpart can be manually activated.

* Every Erlang service (RabbitMQ) must require the socket unit. Something
like Requires=epmd@0.0.0.0.socket. Otherwise they will try to start
their own epmd instance.

* epmd@.service and epmd@.socket must be deactivated right after the
dependent services have stopped.

As I said earlier I failed to glue all of these together and would
love to hear any comments / suggestions and maybe receive a bit of
systemd-related help :)
====================

Comment 6 John Eckersberg 2014-12-03 20:23:33 UTC

I think Lennart described in the fedora-devel thread why we can't really shut down epmd after it has been started via socket activation.

The problem here is that epmd, however it gets launched, is going to be shared among all of the erlang VMs running on a server.  So you have one of two cases:

(1) is what we have now, where each service as part of booting its VM will check for epmd, and start it if it is not present.  That's a problem though.  Consider you have two erlang services, A and B.  You start A, it starts epmd as part of its startup.  Now the epmd process is running under the cgroup of service A.  Then you start service B, it detects epmd is already running and starts using it.  Then service A is stopped, along with all of the processes under its cgroup, including epmd.  Now service B doesn't have an epmd to talk to, and you will have to restart it in order for a new one to get spawned.

(2) is this proposed case, where epmd is its own standalone service, running in its own cgroup, under socket activation.  This avoids the mess from (1); the only downside that I see is that it does not automatically stop (it can't).

IMO, the downside to (1) is much worse than the downside to (2).  If the user/administrator does not want epmd to be running after rabbitmq-server is stopped, I don't think it's unreasonable to require it to be explicitly stopped.
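
To see which of the two cases a given host is in, it can help to check which unit's cgroup the running epmd belongs to. The following is a small sketch and not something from the thread: unit_from_cgroup is a made-up helper, and it assumes the /proc/<pid>/cgroup line format in which the systemd hierarchy is either labelled name=systemd (cgroup v1) or has an empty controller field (cgroup v2).

```shell
#!/bin/sh
# Hypothetical helper: reads the contents of /proc/<pid>/cgroup on stdin and
# prints the last path component of the systemd hierarchy, which is normally
# the unit (or scope) that the process belongs to.
unit_from_cgroup() {
    awk -F: '$2 == "name=systemd" || $2 == "" {
        n = split($3, parts, "/")
        print parts[n]
        exit
    }'
}

# Intended use on a live system (commented out; pgrep must find an epmd):
#   unit_from_cgroup < "/proc/$(pgrep -o -x epmd)/cgroup"
```

In case (1) this would be expected to name whichever service happened to start epmd first; in case (2) it should name the standalone epmd unit.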

Comment 7 Jaroslav Reznik 2015-03-03 15:52:13 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 22 development cycle.
Changing version to '22'.

More information and reason for this action is here:
https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora22

Comment 8 Peter Lemenkov 2015-10-21 15:07:53 UTC

Ok, adjusting the title of the ticket according to the remaining issue and adding a dependent RHOS-related ticket.