Bug 1104843
Summary: | RabbitMQ server shouldn't leave epmd running after being stopped | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | John Eckersberg <jeckersb> |
Component: | rabbitmq-server | Assignee: | Peter Lemenkov <lemenkov> |
Status: | NEW --- | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | high | Docs Contact: | |
Priority: | urgent | ||
Version: | rawhide | CC: | apevec, erlang, fdinitto, hubert.plociniczak, jeckersb, lemenkov, lhh, plemenko, rjones, s |
Target Milestone: | --- | Keywords: | FutureFeature |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Enhancement | |
Doc Text: | Story Points: | --- | |
Clone Of: | 1104193 | Environment: | |
Last Closed: | Type: | Bug | |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1104193 | ||
Bug Blocks: | 1104199 |
Description
John Eckersberg
2014-06-04 19:22:17 UTC
Some more details on this. I used systemtap to prove that systemd is explicitly killing the epmd process: Wed Jun 4 20:18:09 2014 : sh (16202) is exec'ing "/usr/lib64/erlang/erts-5.10.4/bin/epmd" Wed Jun 4 20:18:09 2014 : epmd (16202) created 16204 Wed Jun 4 20:18:09 2014 : epmd (16202) is exiting Wed Jun 4 20:18:09 2014 : epmd (16204) created 16205 Wed Jun 4 20:18:09 2014 : epmd (16204) is exiting Wed Jun 4 20:18:09 2014 : sh (16203) is exec'ing "/usr/lib64/erlang/erts-5.10.4/bin/epmd" Wed Jun 4 20:18:09 2014 : epmd (16203) created 16206 Wed Jun 4 20:18:09 2014 : epmd (16203) is exiting Wed Jun 4 20:18:09 2014 : epmd (16206) created 16209 Wed Jun 4 20:18:09 2014 : epmd (16206) is exiting Wed Jun 4 20:18:09 2014 : epmd (16209) is exiting Wed Jun 4 20:18:09 2014 : sh (16247) is exec'ing "/usr/lib64/erlang/erts-5.10.4/bin/epmd" Wed Jun 4 20:18:09 2014 : epmd (16247) created 16248 Wed Jun 4 20:18:09 2014 : epmd (16247) is exiting Wed Jun 4 20:18:09 2014 : epmd (16248) created 16250 Wed Jun 4 20:18:09 2014 : epmd (16248) is exiting Wed Jun 4 20:18:09 2014 : epmd (16250) is exiting SIGKILL was sent to epmd (pid:16205) by systemd (pid:1) uid:0 Wed Jun 4 20:18:12 2014 : epmd (16205) is exiting I think this is resulting from a combination of: - rabbitmq-server is of Type=simple in the systemd unit file - It double forks off epmd, so it's no longer parented to the main process - The unit file has an ExecStartPost line Systemd does not allow long-running processes to be started from ExecStartPre/ExecStartPost so it will purposefully try to kill off anything that is hanging around. It's kinda hard to see in the above log, but the epmd daemon gets started by rabbitmq-server[1], then the ExecStartPost runs to wait on the pidfile, and then finally systemd sends SIGKILL to the epmd process. I believe systemd thinks the epmd process is from the ExecStartPost. To further this theory, if I comment out the ExecStartPost line, systemd does *not* send SIGKILL and everything works as expected. Fortunately, I think the correct fix is to make sure bug 1103524 gets fixed as soon as possible. If the service type is changed to notify, then both the ExecStartPre and ExecStartPost lines can be removed, thus avoiding the undesirable kill behavior. [1] I've explicitly commented out the ExecStartPre cookie race hack in my testing Hello All. A small update on this - both F-20/Rawhide and EPEL7 Erlang builds are now containing epmd.socket file for socket-activated epmd. It's still not enabled by default though (just because I didn't dig into details regarding systemd presets). I believe we should add the following lines to rabbitmq-server.service After=epmd.socket Requires=epmd.socket Also I believe we can drop ExecStartPre now. (In reply to Peter Lemenkov from comment #2) > Hello All. > A small update on this - both F-20/Rawhide and EPEL7 Erlang builds are now > containing epmd.socket file for socket-activated epmd. It's still not > enabled by default though (just because I didn't dig into details regarding > systemd presets). > > I believe we should add the following lines to rabbitmq-server.service > > After=epmd.socket > Requires=epmd.socket > > > Also I believe we can drop ExecStartPre now. This does work, after making some tweaks. As is, rabbitmq-server fails to start because epmd is set to bind only to localhost, and rabbit is trying to connect on the "public" interface. Localhost is explicitly set in epmd.socket: ListenStream=127.0.0.1:4369 And fails with the error: Jun 13 21:03:08 jeckersb-f20 systemd[1]: Starting RabbitMQ broker... Jun 13 21:03:09 jeckersb-f20 rabbitmqctl[1186]: Waiting for 'rabbit@jeckersb-f20' ... Jun 13 21:03:09 jeckersb-f20 rabbitmqctl[1186]: pid is 1185 ... Jun 13 21:03:09 jeckersb-f20 rabbitmq-server[1185]: ERROR: epmd error for host "jeckersb-f20": address (cannot connect to host/port) Jun 13 21:03:09 jeckersb-f20 systemd[1]: rabbitmq-server.service: main process exited, code=exited, status=1/FAILURE Jun 13 21:03:09 jeckersb-f20 rabbitmqctl[1186]: Error: process_not_running Jun 13 21:03:09 jeckersb-f20 systemd[1]: rabbitmq-server.service: control process exited, code=exited status=2 Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: Stopping and halting node 'rabbit@jeckersb-f20' ... Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: Error: unable to connect to node 'rabbit@jeckersb-f20': nodedown Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: DIAGNOSTICS Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: =========== Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: nodes in question: ['rabbit@jeckersb-f20'] Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: hosts, their running nodes and ports: Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: - unable to connect to epmd on jeckersb-f20: address (cannot connect to host/port) Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: current node details: Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: - node name: 'rabbitmqctl1253@jeckersb-f20' Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: - home dir: /var/lib/rabbitmq Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: - cookie hash: pjjhwhUNJ+O/cAmSUbP89w== Jun 13 21:03:10 jeckersb-f20 systemd[1]: rabbitmq-server.service: control process exited, code=exited status=2 Jun 13 21:03:10 jeckersb-f20 systemd[1]: Failed to start RabbitMQ broker. Jun 13 21:03:10 jeckersb-f20 systemd[1]: Unit rabbitmq-server.service entered failed state. I tried changing this to be: ListenStream=4369 Since that sounds like a sane dual-stack option, from systemd.socket(5): "If the address string is a single number, it is read as port number to listen on via IPv6. Depending on the value of BindIPv6Only= (see below) this might result in the service being available via both IPv6 and IPv4 (default) or just via IPv6." However this causes rabbitmq-server to crash like so: Jun 13 21:10:20 jeckersb-f20 systemd[1]: Starting RabbitMQ broker... Jun 13 21:10:21 jeckersb-f20 rabbitmqctl[1587]: {error_logger,{{2014,6,13},{21,10,21}},"Protocol: ~tp: register/listen error: ~tp~n",["inet_tcp",epmd_close]} Jun 13 21:10:21 jeckersb-f20 rabbitmqctl[1587]: {error_logger,{{2014,6,13},{21,10,21}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.21.0>},{registered_name,[]},{error_info,{exit,{error,b Jun 13 21:10:21 jeckersb-f20 rabbitmqctl[1587]: {error_logger,{{2014,6,13},{21,10,21}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,un Jun 13 21:10:21 jeckersb-f20 rabbitmqctl[1587]: {error_logger,{{2014,6,13},{21,10,21}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,{shutdown,{failed_to_start_child,net_ker Jun 13 21:10:21 jeckersb-f20 rabbitmqctl[1587]: {error_logger,{{2014,6,13},{21,10,21}},crash_report,[[{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{pid,<0.9.0> Jun 13 21:10:21 jeckersb-f20 rabbitmqctl[1587]: {error_logger,{{2014,6,13},{21,10,21}},std_info,[{application,kernel},{exited,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{ Jun 13 21:10:21 jeckersb-f20 rabbitmq-server[1586]: {error_logger,{{2014,6,13},{21,10,21}},"Protocol: ~tp: register/listen error: ~tp~n",["inet_tcp",epmd_close]} Jun 13 21:10:21 jeckersb-f20 rabbitmq-server[1586]: {error_logger,{{2014,6,13},{21,10,21}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.21.0>},{registered_name,[]},{error_info,{exit,{err Jun 13 21:10:21 jeckersb-f20 rabbitmq-server[1586]: {error_logger,{{2014,6,13},{21,10,21}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pi Jun 13 21:10:21 jeckersb-f20 rabbitmq-server[1586]: {error_logger,{{2014,6,13},{21,10,21}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,{shutdown,{failed_to_start_child,net Jun 13 21:10:21 jeckersb-f20 rabbitmq-server[1586]: {error_logger,{{2014,6,13},{21,10,21}},crash_report,[[{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{pid,<0. Jun 13 21:10:21 jeckersb-f20 rabbitmq-server[1586]: {error_logger,{{2014,6,13},{21,10,21}},std_info,[{application,kernel},{exited,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kern Jun 13 21:10:22 jeckersb-f20 rabbitmqctl[1587]: {"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_ker Jun 13 21:10:22 jeckersb-f20 rabbitmq-server[1586]: {"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1587]: Crash dump was written to: erl_crash.dump Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1587]: Kernel pid terminated (application_controller) ({application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kern Jun 13 21:10:23 jeckersb-f20 systemd[1]: rabbitmq-server.service: control process exited, code=exited status=1 Jun 13 21:10:23 jeckersb-f20 rabbitmq-server[1586]: Crash dump was written to: erl_crash.dump Jun 13 21:10:23 jeckersb-f20 rabbitmq-server[1586]: Kernel pid terminated (application_controller) ({application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_ Jun 13 21:10:23 jeckersb-f20 systemd[1]: rabbitmq-server.service: main process exited, code=exited, status=1/FAILURE Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1658]: {error_logger,{{2014,6,13},{21,10,23}},"Protocol: ~tp: register/listen error: ~tp~n",["inet_tcp",epmd_close]} Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1658]: {error_logger,{{2014,6,13},{21,10,23}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.21.0>},{registered_name,[]},{error_info,{exit,{error,b Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1658]: {error_logger,{{2014,6,13},{21,10,23}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,un Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1658]: {error_logger,{{2014,6,13},{21,10,23}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,{shutdown,{failed_to_start_child,net_ker Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1658]: {error_logger,{{2014,6,13},{21,10,23}},crash_report,[[{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{pid,<0.9.0> Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1658]: {error_logger,{{2014,6,13},{21,10,23}},std_info,[{application,kernel},{exited,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{ Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1658]: {"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_ker Jun 13 21:10:24 jeckersb-f20 rabbitmqctl[1658]: Crash dump was written to: erl_crash.dump Jun 13 21:10:24 jeckersb-f20 rabbitmqctl[1658]: Kernel pid terminated (application_controller) ({application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kern Jun 13 21:10:24 jeckersb-f20 systemd[1]: rabbitmq-server.service: control process exited, code=exited status=1 Jun 13 21:10:24 jeckersb-f20 systemd[1]: Failed to start RabbitMQ broker. Jun 13 21:10:24 jeckersb-f20 systemd[1]: Unit rabbitmq-server.service entered failed state. Finally, I forced it to listen IPv4 only, on all interfaces with: ListenStream=0.0.0.0:4369 And this works as expected. So I agree that we should use the systemd-managed epmd instance with all of its socket activation goodness, but only after the fix above. One other note on the standalone epmd. I used the templated socket by adding: Requires: epmd.0.0.socket to the rabbitmq-server unit file, which works perfectly :) (In reply to John Eckersberg from comment #4) > One other note on the standalone epmd. I used the templated socket by > adding: > > Requires: epmd.0.0.socket > > to the rabbitmq-server unit file, which works perfectly :) I think we should finally add this to the service-file. John, are you aware of any issues with this additional Reqires so far? My only concern is that if the administrator stops rabbitmq-server.service then epmd.0.0.socket still remains active. So this socket (tcp:0.0.0.0:4369) will remain opened. I suspect we could spice it up with some systemd magic (StopWhenUnneeded + BindsTo + RefuseManualStart) but so far I cant figure out how to do it properly. See my question in devel@ list: * http://thread.gmane.org/gmane.linux.redhat.fedora.devel/201512 I suppose this issue is described here: * https://bugzilla.redhat.com/1104199 Citing my own private email: ==================== Although this might be related to pre-systemd version of RabbitMQ it raises a valid question. It seems that people don't like the idea of epmd remaining online after the RabbitMQ shutdown. Frankly speaking it isn't a bug - what to do with another systemd-service started as a dependency is up to systemd administrator. However I believe this should be adjusted by adding some combination of StopWhenUnneeded + BindsTo + RefuseManualStart to epmd@.socket / epmd@.service. I've tried quickly but failed to implement a required functionality. So far I have the following considerations: * epmd@.service must refuse manual activation. It can be activated only by the corresponding socket unit. So maybe we have to add RefuseManualStart to the service. Quite the contrary the socket unit counterpart can be manually activated. * Every Erlang service (RabbitMQ) must require socket-file. Something like Requires=epmd.0.0.socket. Otherwise they will try to start their own epmd instance. * epmd@.service and epmd@.socket must be deactivated right after the dependent services stopped. As I said earlier I failed to glue all of these together and would love to hear any comments / suggestions and maybe receive a bit of systemd-related help :) ==================== I think Lennart described in the fedora-devel thread why we can't really shut down epmd after it has been started via socket activation. The problem here is that epmd, however it gets launched, is going to be shared among all of the erlang VMs running on a server. So you have one of two cases: (1) is what we have now, where each service as part of booting its VM will check for epmd, and start it if it is not present. That's a problem though. Consider you have two erlang services, A and B. You start A, it starts epmd as part of its startup. Now the epmd process is running under the cgroup of service A. Then you start service B, it detects epmd is already running and starts using it. Then service A is stopped and all of the processes under the cgroup, including epmd. Now process B doesn't have an epmd to talk to, and you will have to restart it in order for a new one to get spawned. (2) is this proposed case, where epmd is its own standalone service, running in its own cgroup, under socket activation. This avoids the mess from (1), with the only downside that I see is that it does not automatically stop (it can't). IMO, the downside to (1) is much worse than the downside to (2). If the user/administrator does not want epmd to be running after rabbitmq-server is stopped, I don't think it's unreasonable to require it to be explicitly stopped. This bug appears to have been reported against 'rawhide' during the Fedora 22 development cycle. Changing version to '22'. More information and reason for this action is here: https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora22 Ok, adjusting title of the ticket according to the remainign issue and adding a dependent RHOS-related ticket. |