Bug 1103524

Summary: Change rabbitmq-server systemd service to Type=notify
Product: [Fedora] Fedora
Reporter: Peter Lemenkov <lemenkov>
Component: rabbitmq-server
Assignee: Peter Lemenkov <lemenkov>
Status: CLOSED RAWHIDE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: rawhide
CC: apevec, erlang, hubert.plociniczak, jeckersb, lemenkov, ohadlevy, rjones, s
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: rabbitmq-server-3.1.5-9.fc21
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1059913
Environment:
Last Closed: 2014-07-02 16:05:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1059913, 1104604
Bug Blocks: 1086146
Attachments:
Description: patch for rawhide branch
Flags: none

Description Peter Lemenkov 2014-06-01 15:52:41 UTC
+++ This bug was initially created as a clone of Bug #1059913 +++

Description of problem:

There is a race condition when starting rabbitmq-server for the first time.

When the erlang runtime starts, it tries to read its cookie file (for rabbitmq, /var/lib/rabbitmq/.erlang.cookie) and if it doesn't already exist, it generates a new random cookie and creates the file.

The following two lines from the rabbitmq-server.service unit file are involved:

ExecStart=/usr/lib/rabbitmq/bin/rabbitmq-server
ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmqctl wait /var/run/rabbitmq/pid

The rabbitmq-server command returns before the service is up.  Therefore the additional rabbitmqctl wait is needed to make sure the service has started all the way.  However, both of these are erlang programs, and they share the cookie startup code described previously.

The order of events and the eventual error vary, but generally what happens is:

- ExecStart (rabbitmq-server) is run and exits.  The erlang runtime is now booting in the background.

- ExecStartPost (rabbitmqctl) is run.

- rabbitmq-server determines the cookie file is not present, and generates a new cookie.

- rabbitmqctl determines the cookie file is not present, and generates a new cookie.

- rabbitmq-server writes the new cookie to disk and sets the file to read-only

- rabbitmqctl tries to open the cookie file read/write in order to write its cookie, but errors with EACCES because the file already exists and is read-only.

- The erlang runtime for rabbitmqctl crashes and the command returns with a non-successful exit code.

- The entire service unit is marked as failed, and all of the processes are killed by systemd.
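
The sequence above is a check-then-create race. As a rough illustration only (this is Python, not the erlang runtime's actual cookie code, and the function name read_or_create_cookie is hypothetical), one way to make the "create if missing" step safe for two concurrent starters is to write the cookie to a temporary file and publish it with an atomic hard link, so that the loser of the race reads the winner's cookie instead of failing:

```python
import os
import secrets
import tempfile


def read_or_create_cookie(path):
    """Return the cookie stored at `path`, creating it atomically if missing.

    Hypothetical helper for illustration; not the erlang implementation.
    """
    try:
        with open(path, "r", encoding="ascii") as f:
            return f.read()
    except FileNotFoundError:
        pass  # no cookie yet; try to create one below

    directory = os.path.dirname(path) or "."
    # Write the candidate cookie to a private temporary file first.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="ascii") as f:
            f.write(secrets.token_hex(10).upper())
        os.chmod(tmp, 0o400)  # read-only, as in the sequence described above
        try:
            # link() is atomic and fails if `path` already exists, so only
            # one of several concurrent callers can publish its cookie.
            os.link(tmp, path)
        except FileExistsError:
            pass  # another process won the race; use its cookie
    finally:
        os.unlink(tmp)

    with open(path, "r", encoding="ascii") as f:
        return f.read()
```

Calling the helper twice on the same path returns the same cookie both times, which is the property the two erlang processes in this report lack.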


Version-Release number of selected component (if applicable):

rabbitmq-server-3.1.5-1.fc20.noarch
erlang-R16B-03.1.fc20.x86_64

How reproducible:

There is some variability since it's a race.  The reproducer below works 100% of the time for me inside an F20 VM.  In theory the same behavior should occur when starting the service from the CLI instead of rebooting, but I can't reproduce it that way.

Steps to Reproduce:
1. install rabbitmq-server
2. systemctl enable rabbitmq-server.service
3. reboot

Actual results:
Service fails to start, see attachment of journalctl output for error

Expected results:
Service starts cleanly

Additional info:

This is really an erlang bug, but the workaround for rabbit is simple (I'll post a patch in a followup).  I'll run down the erlang bit separately but it will take longer, so it makes sense to apply the workaround here until erlang is fixed.

--- Additional comment from John Eckersberg on 2014-01-30 18:06:26 EST ---



--- Additional comment from John Eckersberg on 2014-01-31 09:33:31 EST ---

This patch updates the systemd service to run `rabbitmqctl status` before starting the rabbitmq-server process.  This ensures the erlang cookie is created before starting the service.

--- Additional comment from Richard W.M. Jones on 2014-04-01 13:58:41 EDT ---

I performed the steps given in the bug description and the patch
works for me in a Rawhide VM.

The patch also looks reasonable to me and low risk, so I'm going to
backport it to Fedora 20.

--- Additional comment from Fedora Update System on 2014-04-01 14:51:47 EDT ---

rabbitmq-server-3.1.5-5.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/rabbitmq-server-3.1.5-5.fc20

--- Additional comment from Fedora Update System on 2014-04-03 00:08:25 EDT ---

Package rabbitmq-server-3.1.5-5.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing rabbitmq-server-3.1.5-5.fc20'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-4722/rabbitmq-server-3.1.5-5.fc20
then log in and leave karma (feedback).

--- Additional comment from Sam Kottler on 2014-04-03 09:07:54 EDT ---

I'll merge rawhide into EPEL7 when I'm at a machine with my certs on it.

--- Additional comment from Alan Pevec on 2014-04-08 13:02:47 EDT ---

Instead of messing with ExecStartPost, it would be more reliable to switch the systemd service to Type=notify.  We just need someone familiar with rabbitmq to tell us where to put sd_notify so that it sends the notification when the service is ready.
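
For readers unfamiliar with the mechanism: with Type=notify, systemd passes the path of a datagram socket in the NOTIFY_SOCKET environment variable, and the service reports readiness by sending READY=1 to it. A minimal sketch in Python follows, for illustration only (the rabbitmq change discussed in this bug uses an erlang sd_notify package; the function below is a hypothetical stand-in, not that code):

```python
import os
import socket


def sd_notify(state="READY=1"):
    """Send `state` to the socket named in NOTIFY_SOCKET.

    Returns False when NOTIFY_SOCKET is unset, e.g. when the process was
    not started by systemd as a Type=notify service.
    """
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    if addr.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket, which
        # is addressed with a leading NUL byte instead.
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode("utf-8"), addr)
    return True
```

The call belongs at the point where the broker is actually able to accept connections; systemd then treats the unit as started only after the message arrives, so no separate ExecStartPost wait is needed.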

--- Additional comment from Fedora Update System on 2014-04-14 18:47:32 EDT ---

rabbitmq-server-3.1.5-5.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 1 John Eckersberg 2014-06-16 19:01:30 UTC
Do you have a patch to rabbitmq for this yet?  I've taken a stab at it but it doesn't seem to be working.  I can barely function in erlang so I've probably done something wrong.

https://github.com/jeckersb/rabbitmq-server/commit/13c5215b6fb3c13e71748bfe4b1d6563aa606a9a

Comment 2 Peter Lemenkov 2014-06-17 09:33:09 UTC
(In reply to John Eckersberg from comment #1)
> Do you have a patch to rabbitmq for this yet?  I've taken a stab at it but
> it doesn't seem to be working.  I can barely function in erlang so I've
> probably done something wrong.
> 
> https://github.com/jeckersb/rabbitmq-server/commit/
> 13c5215b6fb3c13e71748bfe4b1d6563aa606a9a

I'll review your patch shortly.

Comment 3 Peter Lemenkov 2014-06-17 10:11:31 UTC
(In reply to John Eckersberg from comment #1)
> Do you have a patch to rabbitmq for this yet?  I've taken a stab at it but
> it doesn't seem to be working.  I can barely function in erlang so I've
> probably done something wrong.
> 
> https://github.com/jeckersb/rabbitmq-server/commit/
> 13c5215b6fb3c13e71748bfe4b1d6563aa606a9a

Looks good to me. I've tried it and it works fine. Perhaps you forgot to set Type=notify in rabbitmq-server.service and/or to reload the config with systemctl daemon-reload.

I've backported your patch to 3.1.5:

http://peter.fedorapeople.org/rabbitmq-server-0001-Add-systemd-notify-support.patch

Comment 4 John Eckersberg 2014-06-17 15:00:22 UTC
Thanks for the review.  I tried it again this morning on a fresh VM and it works fine for me now.  Must have been something wrong in my local setup.

Comment 5 John Eckersberg 2014-06-17 15:11:45 UTC
Created attachment 909630 [details]
patch for rawhide branch

Here's the patch for the spec, the patch, and the updated systemd service file for the master branch in git.  I don't have perms to push to it otherwise I'd do it myself.

Note that we'll need erlang-sd_notify built before we do a new rabbitmq-server build, since the sd_notify package is now a Requires for rabbitmq-server.

Comment 6 Peter Lemenkov 2014-06-17 15:17:58 UTC
(In reply to John Eckersberg from comment #5)
> Created attachment 909630 [details]
> patch for rawhide branch
> 
> Here's the patch for the spec, the patch, and the updated systemd service
> file for the master branch in git.  I don't have perms to push to it
> otherwise I'd do it myself.

Just request commit access and I'll approve asap:

https://admin.fedoraproject.org/pkgdb/package/rabbitmq-server/

I personally love when more people are involved!

> Note that we'll need erlang-sd_notify built before we do a new
> rabbitmq-server build, since the sd_notify package is now a Requires for
> rabbitmq-server.

Yes, I'm still waiting for the "process-git-requests" approval.

Comment 7 Peter Lemenkov 2014-06-17 16:02:59 UTC
(In reply to Peter Lemenkov from comment #6)
> (In reply to John Eckersberg from comment #5)
> > Created attachment 909630 [details]
> > patch for rawhide branch
> > 
> > Here's the patch for the spec, the patch, and the updated systemd service
> > file for the master branch in git.  I don't have perms to push to it
> > otherwise I'd do it myself.
> 
> Just request commit access and I'll approve asap:
> 
> https://admin.fedoraproject.org/pkgdb/package/rabbitmq-server/

Err, well, unexpected issue - it turned out that I'm not in charge there. So we have to have FAS approval first.

Comment 8 Peter Lemenkov 2014-07-01 18:52:58 UTC
(In reply to John Eckersberg from comment #5)
> Created attachment 909630 [details]
> patch for rawhide branch
> 
> Here's the patch for the spec, the patch, and the updated systemd service
> file for the master branch in git.  I don't have perms to push to it
> otherwise I'd do it myself.
> 
> Note that we'll need erlang-sd_notify built before we do a new
> rabbitmq-server build, since the sd_notify package is now a Requires for
> rabbitmq-server.

Good news everyone! We've just regained control over RabbitMQ!

John, you have a go!

https://www.youtube.com/watch?v=odiMeEhfi9I