Bug 1264901 - rabbitmq-server fails to start on boot with systemd notify error
rabbitmq-server fails to start on boot with systemd notify error
Status: CLOSED DUPLICATE of bug 1348700
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director (Show other bugs)
7.0 (Kilo)
x86_64 Linux
medium Severity high
: ---
: 10.0 (Newton)
Assigned To: Dmitry Tantsur
Shai Revivo
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2015-09-21 08:28 EDT by Andrew Blum
Modified: 2016-08-30 08:58 EDT (History)
7 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-08-30 08:58:56 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
sosreport from RHEL-OSP7 director after rabbitmq-server timed out (6.82 MB, application/octet-stream)
2015-09-22 13:32 EDT, Andrew Blum
no flags Details
rabbit log (11.21 KB, text/plain)
2015-09-22 13:33 EDT, Andrew Blum
no flags Details

  None (edit)
Description Andrew Blum 2015-09-21 08:28:48 EDT
Description of problem:
rabbitmq-server fails to start on boot with the following:

Sep 18 08:22:00 director.uc.example.com systemd[1]: Starting RabbitMQ broker...
Sep 18 08:22:11 director.uc.example.com systemd[1]: rabbitmq-server.service: Got notification message from PID 1236, but reception only permitted for PID 1145
Sep 18 08:23:30 director.uc.example.com systemd[1]: rabbitmq-server.service operation timed out. Terminating.
Sep 18 08:23:30 director.uc.example.com systemd[1]: Failed to start RabbitMQ broker.
Sep 18 08:23:30 director.uc.example.com systemd[1]: Unit rabbitmq-server.service entered failed state.

Version-Release number of selected component (if applicable):
# rpm -q erlang-sd_notify
erlang-sd_notify-0.1-2.el7ost.x86_64

# rpm -q systemd
systemd-208-20.el7_1.6.x86_64

[root@director ~]# rpm -q rabbitmq-server
rabbitmq-server-3.3.5-4.el7ost.noarch


How reproducible:
Every time

Steps to Reproduce:
1. Install director per https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux_OpenStack_Platform/7/html/Director_Installation_and_Usage/chap-Installing_the_Undercloud.html
2. Reboot server


Additional info:
* restarting the rabbitmq-server.service unit after system boot successfully starts the service.  This isn't really a good workaround since other RHEL-OSP7 services depend upon rabbitmq on the director and they are left in a failed state.

* adding the NotifyAccess=all to a custom rabbitmq-server.service unit successfully works around the issue:

[root@director ~]# cat /etc/systemd/system/rabbitmq-server.service
[Unit]
Description=RabbitMQ broker
After=syslog.target network.target

[Service]
Type=notify
User=rabbitmq
Group=rabbitmq
WorkingDirectory=/var/lib/rabbitmq
ExecStart=/usr/lib/rabbitmq/bin/rabbitmq-server
ExecStop=/usr/lib/rabbitmq/bin/rabbitmqctl stop
NotifyAccess=all

[Install]
WantedBy=multi-user.target
Comment 2 Mike Burns 2015-09-21 09:02:31 EDT
Eck, any thoughts?
Comment 3 Andrew Blum 2015-09-21 15:40:35 EDT
I believe a slow system start (OVS bridge) is the cause of the problem.  I noticed there were several other notify services that were timing out.  So, I modified /etc/systemd/system.conf

[root@director ~]# grep ^DefaultTimeout /etc/systemd/system.conf
DefaultTimeoutStartSec=180s

After a reboot I found that only 1 service failed to start on boot - openstack-ironic-discoverd which was a Type=simple.  This service was failing to start with:

Sep 21 15:12:55 director.uc.example.com ironic-discoverd[549]: File "/usr/lib/python2.7/site-packages/keystoneclient/v2_0/client.py", line 196, in get_raw_token_from_identity_service
Sep 21 15:12:55 director.uc.example.com ironic-discoverd[549]: _("Authorization Failed: %s") % e)
Sep 21 15:12:55 director.uc.example.com ironic-discoverd[549]: keystoneclient.openstack.common.apiclient.exceptions.AuthorizationFailure: Authorization Failed: Unable to establish connection to http://10.200.0.1:5000/v2.0/tokens

To deal with openstack-ironic-discoverd I added the network-online.target dependency (since it doesn't make sense to try to connect to the API endpoint before networking is online: 

# mkdir -p /etc/systemd/system/openstack-ironic-discoverd.service.d
# cd /etc/systemd/system/openstack-ironic-discoverd.service.d
# echo -e '[Unit]\nAfter=network-online.target\nWants=network-online.target' > require-networking.conf
# systemctl daemon-reload
Comment 4 John Eckersberg 2015-09-22 12:26:05 EDT
Can you attach the rabbitmq log (/var/log/rabbitmq/rabbit@<node>.log) as well as the journal (journalctl -u rabbitmq-server)?
Comment 6 Andrew Blum 2015-09-22 13:31:04 EDT
Sure.  I'll attach an sosreport.  Here is the output from the journalctl command:

[root@director ~]# journalctl -u rabbitmq-server
-- Logs begin at Tue 2015-09-22 13:16:12 EDT, end at Tue 2015-09-22 13:29:50 EDT. --
Sep 22 13:17:13 director.uc.example.com systemd[1]: Starting RabbitMQ broker...
Sep 22 13:17:25 director.uc.example.com systemd[1]: rabbitmq-server.service: Got notification message from PID 1248, but reception only permitted for PID 1161
Sep 22 13:18:43 director.uc.example.com systemd[1]: rabbitmq-server.service operation timed out. Terminating.
Sep 22 13:18:44 director.uc.example.com systemd[1]: Failed to start RabbitMQ broker.
Sep 22 13:18:44 director.uc.example.com systemd[1]: Unit rabbitmq-server.service entered failed state.
[root@director ~]# 

Thanks for taking a look !

--
Andrew
Comment 7 Andrew Blum 2015-09-22 13:32 EDT
Created attachment 1075944 [details]
sosreport from RHEL-OSP7 director after rabbitmq-server timed out
Comment 8 Andrew Blum 2015-09-22 13:33 EDT
Created attachment 1075946 [details]
rabbit log
Comment 10 John Eckersberg 2015-09-22 15:39:35 EDT
After a brief glance, I'm not sure what the deal is.  There's nothing during the bootup in the rabbit log, and the journal normally shows the rabbitmq "splash screen" which is one of the first things the server does when it starts booting, but the splash is absent as well.  So either the process is not really being started at all by systemd, or it's hanging for some reason *very* early in the startup process.

I'll see if I can reproduce this in my dev environment.  Honestly I rarely if ever have rabbitmq-server set to start during bootup, instead relying on it to be managed by pacemaker once the system is up.
Comment 11 Mike Burns 2016-04-07 16:50:54 EDT
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.
Comment 13 Peter Lemenkov 2016-05-31 07:58:29 EDT
Just fyi - this was fixed in RabbitMQ service file since rabbitmq-server-3.6.1-2.el7ost (available in RHOS9). So no further override in /etc/systemd/system is required.

Could we just drop generating RabbitMQ service override in Director?
Comment 14 Dmitry Tantsur 2016-06-20 07:05:33 EDT
Hi, sorry, I don't know. Mike, John?
Comment 16 Fabio Massimo Di Nitto 2016-08-30 08:58:56 EDT

*** This bug has been marked as a duplicate of bug 1348700 ***

Note You need to log in before you can comment on or make changes to this bug.