Bug 1076994 - Full /var prevents pid file being written, but daemon starts anyway
Summary: Full /var prevents pid file being written, but daemon starts anyway
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 4.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: z4
Target Release: 4.0
Assignee: Miguel Angel Ajo
QA Contact: Nir Magnezi
URL:
Whiteboard:
Duplicates: 1064109 1075570
Depends On:
Blocks:
Reported: 2014-03-17 00:47 UTC by Ian Wienand
Modified: 2022-07-09 06:16 UTC
CC List: 12 users

Fixed In Version: openstack-neutron-2013.2.3-4.el6ost
Doc Type: Bug Fix
Doc Text:
Cause: The init script uses the pid file to detect whether the service is already running, so that it can avoid starting it again, restart it, etc. The situation where the disk is full and the pid file cannot be written was not detected.
Consequence: Later executions of the init.d script with start or restart would start the service again, leaving several instances running. In most cases the agents do not open a listening port, so nothing prevents several instances from running together.
Fix: Check whether the actual process exists even if the pid file doesn't exist.
Result: Duplicate daemons are no longer started when the /var directory is full.
Clone Of:
Environment:
Last Closed: 2014-05-29 20:19:20 UTC
Target Upstream Version:
Embargoed:


Attachments
Check if daemon running before starting (9.32 KB, patch)
2014-03-17 03:16 UTC, Ian Wienand


Links
System: Red Hat Product Errata
ID: RHSA-2014:0516
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: openstack-neutron security, bug fix, and enhancement update
Last Updated: 2014-05-30 00:15:59 UTC

Description Ian Wienand 2014-03-17 00:47:37 UTC
Description of problem:

Several times in oslab we have ended up with multiple neutron daemons running at the same time.

I'm fairly sure the sequence of events starts with /var filling up, causing various odd high-level problems such as scheduling errors or networking failures, but nothing as blatantly obvious as an out-of-disk error unless you're digging in the right logs on the right host at the right time.

So without noticing this, as part of trouble-shooting the admin tries restarting various daemons.  

At that point you probably see something like

---
[root@rhel init.d]# service neutron-dhcp-agent start
Starting neutron-dhcp-agent: bash: line 0: echo: write error: No space left on device
                                                           [FAILED]
---

which will probably lead you to realise the full disk situation, even though the error output could be much more helpful.

However, the neutron-dhcp-agent daemon was actually started.  The problem is that the daemon gets started through the 'daemon' function, with a command string that backgrounds the process and only then writes its pid to the pid file, e.g.

---
    echo -n $"Starting $prog: "
    daemon --user neutron --pidfile $pidfile "$exec --log-file /var/log/$proj/$plugin.log ${configs[@]/#/--config-file } &>/dev/null & echo \$! > $pidfile"
    retval=$?
---

Even if $pidfile can't be written, the daemon has started anyway.

So you likely start the daemon again, which replaces one problem with another that is much harder to debug.  The two (or more) agents start racing with each other to consume RPC calls, etc.
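
The failure mode can be seen without filling a disk. The following is a minimal sketch, not taken from the init script: it assumes a Linux host where /dev/full exists (writes to it always fail with "No space left on device"), and it uses `sleep 30` as a stand-in for the agent. The variable names are illustrative only.

```shell
# Stand-in for a pid file on a full filesystem (assumption: Linux /dev/full).
pidfile=/dev/full

# Same shape as the init script's command string: start the process in the
# background, then try to record its pid. "sleep 30" stands in for the agent.
sleep 30 &
pid=$!
if echo "$pid" > "$pidfile" 2>/dev/null; then
    write_status=ok
else
    write_status=failed
fi

# The pid file write failed, yet the background process is still alive.
if kill -0 "$pid" 2>/dev/null; then
    process_status=running
else
    process_status=gone
fi
echo "pid file write: $write_status, process: $process_status"

# Clean up the stand-in process.
kill "$pid"
```

On such a host the write fails while the process keeps running, which is the state the init script then cannot see on the next start.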

Comment 1 Ian Wienand 2014-03-17 03:16:31 UTC
Created attachment 875311 [details]
Check if daemon running before starting
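
The attached patch is not reproduced in this report. As a rough sketch of the idea in its title only, and not the actual patch, a guard could look for the process itself when the pid file is missing or empty. The function name `start_guard` and its parameters are hypothetical; the sketch assumes `pgrep` from procps is installed and restricts the search to the daemon's user (the init script starts it with `--user neutron`) so that the init script's own command line is not matched.

```shell
# Hypothetical guard: return 1 (and explain why) when a matching process is
# running but there is no usable pid file; return 0 when it is safe to start.
start_guard() {
    local prog=$1 user=$2 pattern=$3 pidfile=$4

    # A non-empty pid file means the usual status check can do its job.
    [ -s "$pidfile" ] && return 0

    # pgrep -f matches the pattern against full command lines; -u limits the
    # search to processes owned by the daemon's user.
    if pgrep -u "$user" -f -- "$pattern" >/dev/null 2>&1; then
        echo "$prog was running, but no pid file, check disk space"
        return 1
    fi
    return 0
}
```

In the init script's start function this might be called as `start_guard "$prog" neutron "$exec" "$pidfile" || return 1` before the `daemon` line, assuming `$exec` appears in the running daemon's command line.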

Comment 2 Miguel Angel Ajo 2014-03-26 08:25:44 UTC
We should get this included, as the result of a full /var is several agents starting together and racing with each other.

Comment 5 Ofer Blaut 2014-04-22 16:06:12 UTC
Hi Livnat, we need reproduction steps here.

Ofer

Comment 6 Alan Pevec 2014-04-24 09:13:14 UTC
To simulate full /var for neutron try this:
 mount -o size=1 -t tmpfs tmpfs /var/run/neutron/
 cat /dev/zero > /var/run/neutron/FILLME

# service neutron-dhcp-agent start
Starting neutron-dhcp-agent: bash: line 0: echo: write error: No space left on device
                                                           [FAILED]
# service neutron-dhcp-agent start
neutron-dhcp-agent was running, but no pid file, check disk space

Comment 7 Amit Ugol 2014-04-24 11:31:00 UTC
Alan, wouldn't this be easier to simulate as follows, while the agent is running and as root:
chown root:root /var/run/neutron
chmod 0740 /var/run/neutron
mv /var/run/neutron/neutron-dhcp-client.pid /var/run/neutron/_neutron-dhcp-client.pid
and then try to run /etc/init.d/neutron-dhcp-client start?

Comment 8 Nir Magnezi 2014-04-24 14:51:37 UTC
Verified NVR: openstack-neutron-2013.2.3-4.el6ost.noarch

Followed Alan's steps to reproduce in Comment #6

Result:
=======
[root@rhel ~]# service neutron-dhcp-agent start
neutron-dhcp-agent was running, but no pid file, check disk space

Comment 9 Miguel Angel Ajo 2014-05-28 12:10:23 UTC
*** Bug 1064109 has been marked as a duplicate of this bug. ***

Comment 10 Miguel Angel Ajo 2014-05-29 08:24:09 UTC
*** Bug 1075570 has been marked as a duplicate of this bug. ***

Comment 12 errata-xmlrpc 2014-05-29 20:19:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0516.html

