Bug 1104193
Summary: rabbitmqctl doesn't work

| Field | Value |
|---|---|
| Product | Red Hat OpenStack |
| Component | rabbitmq-server |
| Version | 5.0 (RHEL 7) |
| Hardware | Unspecified |
| OS | Unspecified |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | urgent |
| Reporter | Fabio Massimo Di Nitto <fdinitto> |
| Assignee | John Eckersberg <jeckersb> |
| QA Contact | Ami Jeain <ajeain> |
| CC | acathrow, apevec, jeckersb, lhh, yeylon |
| Target Milestone | rc |
| Target Release | 5.0 (RHEL 7) |
| Fixed In Version | erlang-R16B-03.7min.1.el7ost, rabbitmq-server-3.1.5-6.3.el7ost |
| Doc Type | Bug Fix |
| Type | Bug |
| Clones | 1104843 (view as bug list) |
| Bug Blocks | 1104843, 1083890 |
| Last Closed | 2014-07-08 15:13:51 UTC |
Description
Fabio Massimo Di Nitto
2014-06-03 13:22:48 UTC
John Eckersberg

Try restarting rabbitmq-server and then run rabbitmqctl again. I've seen this happen sometimes where, for some reason, rabbitmqctl cannot contact the node the very first time it is started, but restarting it fixes it. I need to debug further to see why this happens.

Fabio Massimo Di Nitto

(In reply to John Eckersberg from comment #2)
> Try restarting rabbitmq-server and then run rabbitmqctl again. I've seen
> this happen sometimes where for some reason rabbitmqctl cannot contact the
> node the very first time it is started, but restarting it fixes it. I need
> to debug further to see why this happens.

I already tested several restarts without success, and I can't use a workaround on a cluster where services are restarted as necessary. I need a fix, or I can't configure HA queues in rabbit, as all the how-tos I found require rabbitmqctl to do the configuration bits.

John Eckersberg

Here's the behavior I see, where a restart fixes it. I will investigate the cause next, but wanted to post the symptom here first:

    [root@control yum.repos.d]# service rabbitmq-server start
    Redirecting to /bin/systemctl start rabbitmq-server.service
    [root@control yum.repos.d]# service rabbitmq-server status
    Redirecting to /bin/systemctl status rabbitmq-server.service
    rabbitmq-server.service - RabbitMQ broker
       Loaded: loaded (/usr/lib/systemd/system/rabbitmq-server.service; disabled)
       Active: active (running) since Mon 2014-04-21 12:11:32 EDT; 2s ago
      Process: 19454 ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmqctl wait /var/run/rabbitmq/pid (code=exited, status=0/SUCCESS)
      Process: 19428 ExecStartPre=/bin/sh -c /usr/lib/rabbitmq/bin/rabbitmqctl status > /dev/null 2>&1 (code=exited, status=2)
     Main PID: 19453 (beam)
       CGroup: /system.slice/rabbitmq-server.service
               ├─19453 /usr/lib64/erlang/erts-5.10.4/bin/beam -W w -K true -A30 -P 1048576 -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.1.5/sbin...
               ├─19547 inet_gethost 4
               └─19548 inet_gethost 4

    Apr 21 12:11:30 control.example.org rabbitmqctl[19454]: pid is 19453 ...
    Apr 21 12:11:30 control.example.org rabbitmq-server[19453]: RabbitMQ 3.1.5. Copyright (C) 2007-2013 GoPivotal, Inc.
    Apr 21 12:11:30 control.example.org rabbitmq-server[19453]: ##  ##  Licensed under the MPL. See http://www.rabbitmq.com/
    Apr 21 12:11:30 control.example.org rabbitmq-server[19453]: ##  ##
    Apr 21 12:11:30 control.example.org rabbitmq-server[19453]: ##########  Logs: /var/log/rabbitmq/rabbit
    Apr 21 12:11:30 control.example.org rabbitmq-server[19453]: ######  ##        /var/log/rabbitmq/rabbit
    Apr 21 12:11:30 control.example.org rabbitmq-server[19453]: ##########
    Apr 21 12:11:32 control.example.org rabbitmq-server[19453]: Starting broker... completed with 0 plugins.
    Apr 21 12:11:32 control.example.org rabbitmqctl[19454]: ...done.
    Apr 21 12:11:32 control.example.org systemd[1]: Started RabbitMQ broker.

    [root@control yum.repos.d]# rabbitmqctl status
    Status of node rabbit@control ...
    Error: unable to connect to node rabbit@control: nodedown

    DIAGNOSTICS
    ===========

    nodes in question: [rabbit@control]

    hosts, their running nodes and ports:
    - control: [{rabbitmqctl19571,56593}]

    current node details:
    - node name: rabbitmqctl19571@control
    - home dir: /var/lib/rabbitmq
    - cookie hash: vDcXSX+o4/EfW12EcE8zWA==

    [root@control yum.repos.d]# service rabbitmq-server restart
    Redirecting to /bin/systemctl restart rabbitmq-server.service
    [root@control yum.repos.d]# rabbitmqctl status
    Status of node rabbit@control ...
    [{pid,19666},
     {running_applications,[{rabbit,"RabbitMQ","3.1.5"},
                            {os_mon,"CPO CXC 138 46","2.2.14"},
                            {xmerl,"XML parser","1.3.5"},
                            {mnesia,"MNESIA CXC 138 12","4.11"},
                            {sasl,"SASL CXC 138 11","2.3.4"},
                            {stdlib,"ERTS CXC 138 10","1.19.4"},
                            {kernel,"ERTS CXC 138 10","2.16.4"}]},
     {os,{unix,linux}},
     {erlang_version,"Erlang R16B03 (erts-5.10.4) [source] [64-bit] [async-threads:30] [hipe] [kernel-poll:true]\n"},
     {memory,[{total,34667592},
              {connection_procs,2720},
              {queue_procs,5440},
              {plugins,0},
              {other_proc,13357816},
              {mnesia,57376},
              {mgmt_db,0},
              {msg_index,21512},
              {other_ets,727296},
              {binary,7512},
              {code,16434406},
              {atom,594537},
              {other_system,3458977}]},
     {vm_memory_high_watermark,0.4},
     {vm_memory_limit,772199219},
     {disk_free_limit,1000000000},
     {disk_free,8061767680},
     {file_descriptors,[{total_limit,924},
                        {total_used,3},
                        {sockets_limit,829},
                        {sockets_used,1}]},
     {processes,[{limit,1048576},{used,122}]},
     {run_queue,0},
     {uptime,4}]
    ...done.

John Eckersberg

I've verified this happens on F20 as well, so it's not something related to the way we've packaged rabbit/erlang for OSP 5.

John Eckersberg

Ok, I think I've got this figured out. I'm going to file a more detailed bug against rabbitmq-server in F20, but the general problem is this:

- rabbitmqctl queries epmd (the Erlang port mapper daemon) to determine what port the rabbitmq server is listening on.
- epmd is started by the Erlang runtime when the rabbitmq-server service starts.
- Because of the specific way that rabbitmq-server fork/exec()s, and the way the systemd unit file is configured, systemd sends SIGKILL to epmd shortly after it is spawned.
- Thus rabbitmqctl cannot contact epmd, and reports (somewhat incorrectly) that the server node is down. Really, the port mapper is down.

Until this is corrected, you can ensure that the epmd process is running by running `rabbitmqctl status` before starting the rabbitmq-server service.

Fabio Massimo Di Nitto

(In reply to John Eckersberg from comment #6)
> Until this is corrected, you can ensure that the epmd process is running by
> running `rabbitmqctl status` before starting the rabbitmq-server service.

I can't execute that automatically in a cluster environment, unfortunately. We need this fixed quickly; workarounds are not an option :(

Alan Pevec

(In reply to John Eckersberg from comment #6)
> Until this is corrected, you can ensure that the epmd process is running by
> running `rabbitmqctl status` before starting the rabbitmq-server service.

That should be executed in ExecStartPre (your fix for rhbz 1059913):
http://pkgs.fedoraproject.org/cgit/rabbitmq-server.git/tree/rabbitmq-server.service#n13

    ExecStartPre=-/bin/sh -c "/usr/lib/rabbitmq/bin/rabbitmqctl status > /dev/null 2>&1"

John Eckersberg

(In reply to Alan Pevec from comment #8)
> That should be executed in ExecStartPre (your fix for rhbz 1059913)

That doesn't help in this case, though. Because that runs under the context of systemd, the epmd process that gets spawned by that command is also sent SIGKILL. You need to get the daemon started outside of systemd, e.g. by running rabbitmqctl on the CLI beforehand.

In my example above, where restarting it "fixes it", what actually happens is:

1. rabbitmq-server starts and registers with epmd; epmd is killed by systemd.
2. Running `rabbitmqctl status` starts a new epmd process outside of systemd, which survives.
3. Restarting rabbitmq-server causes it to register with the epmd spawned by rabbitmqctl in step 2. Now everything works.

Here's my plan for RHOS 5:

- Patch the rabbitmq-server package to remove the ExecStartPre and ExecStartPost lines from the systemd unit file. This will fix the rabbitmqctl problem, but introduce a minor race condition where systemd will consider the service started before it is entirely up (since the waitpid from ExecStartPost is being removed). This is an acceptable short-term tradeoff: if an OpenStack service fails to connect due to the race, it will sleep and retry again when the service is up.
- Bug 1103524 plans to switch to the notify service type for rabbitmq. It depends on package review bug 1104604 for erlang-sd_notify, which I am going to take after adding the workaround above.
- Once we have erlang-sd_notify available, we can update rabbitmq in Fedora to use it and switch the service type over to notify. This will correct the problem properly.
- We'll grab the corrected Fedora packages and rebuild for RHOS 5.

Fabio Massimo Di Nitto

Ok, how about you ship a shell script, initscript-style, that does everything necessary to start rabbitmq-server properly, and link that into the systemd unit file? Re-introducing a race is risky, since pacemaker will rely on the service being up before starting other OpenStack services.
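The first step of the plan above (dropping the ExecStartPre and ExecStartPost lines) amounts to a unit file along these lines. This is a sketch reconstructed from the paths visible in the systemctl status output earlier in this bug, not the unit actually shipped in the package:

```ini
# rabbitmq-server.service (sketch; ExecStart path and the two removed lines
# are taken from the status output above, other directives are assumed)
[Unit]
Description=RabbitMQ broker
After=network.target

[Service]
Type=simple
User=rabbitmq
ExecStart=/usr/lib/rabbitmq/bin/rabbitmq-server
# ExecStartPre=-/bin/sh -c "/usr/lib/rabbitmq/bin/rabbitmqctl status > /dev/null 2>&1"
# ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmqctl wait /var/run/rabbitmq/pid
# Both lines above are removed: each spawned a short-lived rabbitmqctl (and
# with it an epmd) inside the service's control group, and systemd then
# SIGKILLed the epmd left behind.

[Install]
WantedBy=multi-user.target
```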
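The "sleep and retry" behavior the tradeoff relies on can be sketched as a small POSIX shell helper. The function name, try budget, and back-off interval here are illustrative, not taken from any OpenStack package:

```shell
#!/bin/sh
# wait_for_cmd BUDGET CMD...: re-run CMD until it succeeds or BUDGET tries
# are exhausted, sleeping briefly between attempts -- the same shape as an
# OpenStack service retrying its AMQP connection during the startup race.
wait_for_cmd() {
    budget=$1; shift
    tries=0
    until "$@"; do
        tries=$((tries + 1))
        [ "$tries" -ge "$budget" ] && return 1  # give up: still unreachable
        sleep 0.1                               # back off, then retry
    done
    return 0
}

# A real probe would test the AMQP port, e.g.:
#   wait_for_cmd 50 nc -z localhost 5672
wait_for_cmd 5 true && echo "broker reachable"
```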
Alan Pevec

OpenStack services keep retrying the connection to rabbitmq until it succeeds, so this race shouldn't be critical; it might just add some noise in the logs.

Fabio Massimo Di Nitto

(In reply to Alan Pevec from comment #12)
> openstack services keep try reconnecting to rabbitmq until connect succeeds
> so this race shouldn't be critical, it might just add some noise in logs

I am more worried about pacemaker: start -> monitor -> race? fault? -> restart -> loop. I can add a monitor delay after a start, as I am doing now for some services, but it's not optimal.

John Eckersberg

I'll try doing it with the shell script approach. That will require switching the service type from "simple" to "forking", which may cause more problems. We'll see.

Also, a datapoint: I tested to see about how long the race is with the other method, and it is consistently about 0.9 seconds on my VM:

    [root@control ~]# service rabbitmq-server start; while ! echo aaaaaaaaaaa | socat - tcp:localhost:amqp; do sleep 0.1; done
    Redirecting to /bin/systemctl start rabbitmq-server.service
    2014/04/21 13:02:00 socat[21416] E connect(3, AF=2 127.0.0.1:5672, 16): Connection refused
    2014/04/21 13:02:00 socat[21443] E connect(3, AF=2 127.0.0.1:5672, 16): Connection refused
    2014/04/21 13:02:00 socat[21484] E connect(3, AF=2 127.0.0.1:5672, 16): Connection refused
    2014/04/21 13:02:00 socat[21487] E connect(3, AF=2 127.0.0.1:5672, 16): Connection refused
    2014/04/21 13:02:00 socat[21490] E connect(3, AF=2 127.0.0.1:5672, 16): Connection refused
    2014/04/21 13:02:00 socat[21493] E connect(3, AF=2 127.0.0.1:5672, 16): Connection refused
    2014/04/21 13:02:00 socat[21496] E connect(3, AF=2 127.0.0.1:5672, 16): Connection refused
    2014/04/21 13:02:00 socat[21499] E connect(3, AF=2 127.0.0.1:5672, 16): Connection refused
    2014/04/21 13:02:01 socat[21502] E connect(3, AF=2 127.0.0.1:5672, 16): Connection refused
    AMQP
    [root@control ~]#

John Eckersberg

Ok, I've implemented the forking workaround and verified it works for me:

    [root@control ~]# rpm -q rabbitmq-server
    rabbitmq-server-3.1.5-6.2.el7ost.noarch
    [root@control ~]# service rabbitmq-server start && rabbitmqctl status
    Redirecting to /bin/systemctl start rabbitmq-server.service
    Status of node rabbit@control ...
    [{pid,25908},
     {running_applications,[{rabbit,"RabbitMQ","3.1.5"},
                            {os_mon,"CPO CXC 138 46","2.2.14"},
                            {xmerl,"XML parser","1.3.5"},
                            {mnesia,"MNESIA CXC 138 12","4.11"},
                            {sasl,"SASL CXC 138 11","2.3.4"},
                            {stdlib,"ERTS CXC 138 10","1.19.4"},
                            {kernel,"ERTS CXC 138 10","2.16.4"}]},
     {os,{unix,linux}},
     {erlang_version,"Erlang R16B03 (erts-5.10.4) [source] [64-bit] [async-threads:30] [hipe] [kernel-poll:true]\n"},
     {memory,[{total,35430384},
              {connection_procs,5728},
              {queue_procs,5440},
              {plugins,0},
              {other_proc,13412048},
              {mnesia,57376},
              {mgmt_db,0},
              {msg_index,70424},
              {other_ets,721856},
              {binary,669856},
              {code,16434406},
              {atom,594537},
              {other_system,3458713}]},
     {vm_memory_high_watermark,0.4},
     {vm_memory_limit,772199219},
     {disk_free_limit,1000000000},
     {disk_free,8052510720},
     {file_descriptors,[{total_limit,924},
                        {total_used,3},
                        {sockets_limit,829},
                        {sockets_used,1}]},
     {processes,[{limit,1048576},{used,123}]},
     {run_queue,0},
     {uptime,1}]
    ...done.
    [root@control ~]#

John Eckersberg

Moving this back to ON_DEV, since the built package is only a workaround and there is more dev work to be done in order to have a proper package. New plan of action:

- Rebase the RHOS 5 erlang spec against the latest in Fedora. The Fedora one has some nifty systemd bits for socket activation of the epmd service (standalone from the one that rabbitmq-server tries to launch).
- Update the rabbitmq-server unit to require epmd.0.0.socket. This avoids an issue where epmd isn't listening on the correct interface (see https://bugzilla.redhat.com/show_bug.cgi?id=1104843#c3).

Verification with the fixed packages:

    # rabbitmqctl status
    Status of node rabbit@cougar13 ...
    [{pid,16121},
     {running_applications,[{rabbit,"RabbitMQ","3.1.5"},
                            {mnesia,"MNESIA CXC 138 12","4.11"},
                            {os_mon,"CPO CXC 138 46","2.2.14"},
                            {xmerl,"XML parser","1.3.5"},
                            {sasl,"SASL CXC 138 11","2.3.4"},
                            {stdlib,"ERTS CXC 138 10","1.19.4"},
                            {kernel,"ERTS CXC 138 10","2.16.4"}]},
     {os,{unix,linux}},
     {erlang_version,"Erlang R16B03 (erts-5.10.4) [source] [64-bit] [smp:6:6] [async-threads:30] [hipe] [kernel-poll:true]\n"},
     {memory,[{total,85658480},
              {connection_procs,3671616},
              {queue_procs,1366024},
              {plugins,0},
              {other_proc,13698456},
              {mnesia,309712},
              {mgmt_db,0},
              {msg_index,93048},
              {other_ets,897576},
              {binary,43758000},
              {code,16428070},
              {atom,594537},
              {other_system,4841441}]},
     {vm_memory_high_watermark,0.4},
     {vm_memory_limit,13425097113},
     {disk_free_limit,1000000000},
     {disk_free,444528570368},
     {file_descriptors,[{total_limit,924},
                        {total_used,79},
                        {sockets_limit,829},
                        {sockets_used,77}]},
     {processes,[{limit,1048576},{used,897}]},
     {run_queue,0},
     {uptime,21449}]
    ...done.

    # rpm -qa | grep rabbit
    rabbitmq-server-3.1.5-6.3.el7ost.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0845.html
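For reference, the socket-activation arrangement described in the final plan can be sketched as two unit fragments. The epmd.0.0.socket name comes from the plan above; the contents are assumptions modeled on standard systemd socket activation, not the exact files from the errata packages:

```ini
# epmd.0.0.socket (sketch): systemd owns the epmd listening socket on all
# interfaces, port 4369 (the standard epmd port), so epmd is activated
# independently of rabbitmq-server and listens where expected.
[Unit]
Description=Erlang Port Mapper Daemon Activation Socket

[Socket]
ListenStream=0.0.0.0:4369
Accept=false

[Install]
WantedBy=sockets.target

# rabbitmq-server.service additions (sketch): make the broker depend on the
# socket unit, addressing the wrong-interface issue referenced above.
#
#   [Unit]
#   Requires=epmd.0.0.socket
#   After=epmd.0.0.socket
```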