Bug 1059913 - Race condition creating .erlang.cookie
Summary: Race condition creating .erlang.cookie
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: rabbitmq-server
Version: 20
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Richard W.M. Jones
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 1103524
Reported: 2014-01-30 23:03 UTC by John Eckersberg
Modified: 2014-06-01 22:23 UTC (History)

Fixed In Version: rabbitmq-server-3.1.5-5.fc20
Clone Of:
Clones: 1103524
Environment:
Last Closed: 2014-04-14 22:47:32 UTC
Type: Bug
Embargoed:


Attachments
journalctl output from failed rabbitmq-server startup (6.19 KB, text/plain)
2014-01-30 23:06 UTC, John Eckersberg
rabbitmq-server-cookie-race.patch (649 bytes, patch)
2014-01-31 14:33 UTC, John Eckersberg

Description John Eckersberg 2014-01-30 23:03:23 UTC
Description of problem:

There is a race condition when starting rabbitmq-server for the first time.

When the erlang runtime starts, it tries to read its cookie file (for rabbitmq, /var/lib/rabbitmq/.erlang.cookie) and if it doesn't already exist, it generates a new random cookie and creates the file.
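
To make the startup behaviour above concrete, here is a minimal Python sketch of a read-or-create routine. It is not the erlang runtime's actual code: the function name, the 20-character uppercase cookie, and the 0400 mode are illustrative assumptions. The point is that the existence check and the write are two separate steps.

```python
import os
import secrets
import string
import tempfile


def read_or_create_cookie(path):
    """Return the cookie stored at `path`, creating the file if it is missing.

    Deliberately naive: checking for the file and writing it are separate
    steps, so two processes can both conclude that the file is missing.
    """
    if os.path.exists(path):
        with open(path) as f:
            return f.read()

    # Assumed cookie format, for illustration only.
    cookie = "".join(secrets.choice(string.ascii_uppercase) for _ in range(20))
    with open(path, "w") as f:
        f.write(cookie)
    os.chmod(path, 0o400)  # read-only for the owner
    return cookie


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        cookie_path = os.path.join(d, ".erlang.cookie")
        first = read_or_create_cookie(cookie_path)   # creates the file
        second = read_or_create_cookie(cookie_path)  # reads it back
        print(first == second)
```

Run by a single process this is harmless; the trouble starts only when two programs execute it at the same time.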

The following two lines from the rabbitmq-server.service unit file are involved:

ExecStart=/usr/lib/rabbitmq/bin/rabbitmq-server
ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmqctl wait /var/run/rabbitmq/pid

The rabbitmq-server command returns before the service is up, so the additional rabbitmqctl wait is required to make sure the service has started completely.  However, both of these are erlang programs, and they share the cookie startup code described previously.

The order of events and the eventual error vary, but generally what happens is:

- ExecStart (rabbitmq-server) is run and exits.  The erlang runtime is now booting in the background.

- ExecStartPost (rabbitmqctl) is run.

- rabbitmq-server determines the cookie file is not present, and generates a new cookie.

- rabbitmqctl determines the cookie file is not present, and generates a new cookie.

- rabbitmq-server writes the new cookie to disk and sets the file to read-only.

- rabbitmqctl tries to open the cookie file read/write in order to write its cookie, but fails with EACCES because the file already exists and is read-only.

- The erlang runtime for rabbitmqctl crashes and the command returns with a non-successful exit code.

- The entire service unit is marked as failed, and all of the processes are killed by systemd.
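
The interleaving above can be replayed step by step in a single process. The following Python sketch is only an illustration of the sequence, not the erlang code path: the function name and the placeholder cookie values are made up, and the outcome depends on running as an unprivileged user (root bypasses file permission bits, so the second open would succeed).

```python
import os
import tempfile


def replay_race(cookie_path):
    """Replay the problematic ordering of checks and writes.

    Returns "EACCES" if the second writer is refused, or "overwritten"
    if it is allowed through (for example when run as root).
    """
    # Both programs look for the file before either has written it.
    server_sees_file = os.path.exists(cookie_path)
    ctl_sees_file = os.path.exists(cookie_path)
    if server_sees_file or ctl_sees_file:
        raise RuntimeError("cookie file must not exist before the replay")

    # Each one generates its own cookie (fixed placeholders here).
    server_cookie = "SERVERCOOKIE"
    ctl_cookie = "CTLCOOKIE"

    # The server gets there first: write the file, then make it read-only.
    with open(cookie_path, "w") as f:
        f.write(server_cookie)
    os.chmod(cookie_path, 0o400)

    # rabbitmqctl still believes the file is missing and opens it read/write.
    try:
        with open(cookie_path, "r+") as f:
            f.write(ctl_cookie)
    except PermissionError:
        return "EACCES"
    return "overwritten"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(replay_race(os.path.join(d, ".erlang.cookie")))
```

In the real service the refused open is what crashes rabbitmqctl and causes systemd to mark the unit as failed.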


Version-Release number of selected component (if applicable):

rabbitmq-server-3.1.5-1.fc20.noarch
erlang-R16B-03.1.fc20.x86_64

How reproducible:

There is some variability since it's a race.  I've provided my reproducer below that works 100% of the time for me, inside an F20 VM.  In theory this behavior should still exist if starting the service from the command line instead of rebooting, but I can't reproduce it that way.

Steps to Reproduce:
1. install rabbitmq-server
2. systemctl enable rabbitmq-server.service
3. reboot

Actual results:
Service fails to start; see the attached journalctl output for the error.

Expected results:
Service starts cleanly

Additional info:

This is really an erlang bug, but the workaround for rabbit is simple (I'll post a patch in a follow-up).  I'll run down the erlang bit separately, but that will take longer, so it makes sense to apply the workaround here until erlang is fixed.

Comment 1 John Eckersberg 2014-01-30 23:06:26 UTC
Created attachment 857656 [details]
journalctl output from failed rabbitmq-server startup

Comment 2 John Eckersberg 2014-01-31 14:33:31 UTC
Created attachment 857863 [details]
rabbitmq-server-cookie-race.patch

This patch updates the systemd service to run `rabbitmqctl status` before starting the rabbitmq-server process.  This ensures the erlang cookie is created before starting the service.

Comment 3 Richard W.M. Jones 2014-04-01 17:58:41 UTC
I performed the steps given in the bug description and the patch
works for me in a Rawhide VM.

The patch also looks reasonable to me and low risk, so I'm going to
backport it to Fedora 20.

Comment 4 Fedora Update System 2014-04-01 18:51:47 UTC
rabbitmq-server-3.1.5-5.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/rabbitmq-server-3.1.5-5.fc20

Comment 5 Fedora Update System 2014-04-03 04:08:25 UTC
Package rabbitmq-server-3.1.5-5.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing rabbitmq-server-3.1.5-5.fc20'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-4722/rabbitmq-server-3.1.5-5.fc20
then log in and leave karma (feedback).

Comment 6 Sam Kottler 2014-04-03 13:07:54 UTC
I'll merge rawhide into EPEL7 when I'm at a machine with my certs on it.

Comment 7 Alan Pevec 2014-04-08 17:02:47 UTC
Instead of messing with ExecStartPost, it would be more reliable to switch the systemd service to Type=notify.  We just need someone familiar with rabbitmq to tell us where to put the sd_notify call that sends the notification when the service is ready.
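
For reference, the readiness notification itself is small: with Type=notify, systemd passes a socket address in the NOTIFY_SOCKET environment variable and the service sends the datagram "READY=1" to it. The Python sketch below shows only that protocol step; it is not rabbitmq code (rabbitmq is written in erlang), and it does not answer the open question of where in the rabbitmq startup sequence such a call belongs.

```python
import os
import socket


def sd_notify(state="READY=1"):
    """Send a state message to the socket named in $NOTIFY_SOCKET.

    Returns True if a message was sent, or False if the variable is unset
    (for example when the process was not started by systemd with
    Type=notify).
    """
    address = os.environ.get("NOTIFY_SOCKET")
    if not address:
        return False

    addr = os.fsencode(address)
    if addr.startswith(b"@"):
        # A leading "@" denotes a Linux abstract-namespace socket,
        # which is addressed with a leading NUL byte instead.
        addr = b"\0" + addr[1:]

    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode("utf-8"), addr)
    return True


if __name__ == "__main__":
    # Prints False when run outside a Type=notify unit.
    print(sd_notify())
```

With this approach systemd would consider the unit started only once the message arrives, so no second erlang program would need to run alongside the server during startup.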

Comment 8 Fedora Update System 2014-04-14 22:47:32 UTC
rabbitmq-server-3.1.5-5.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.

