Bug 1076994 - Full /var prevents pid file being written, but daemon starts anyway
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 4.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: z4
Target Release: 4.0
Assigned To: Miguel Angel Ajo
QA Contact: Nir Magnezi
Keywords: ZStream
Duplicates: 1064109 1075570
Depends On:
Blocks:
Reported: 2014-03-16 20:47 EDT by Ian Wienand
Modified: 2016-04-26 09:23 EDT
Fixed In Version: openstack-neutron-2013.2.3-4.el6ost
Doc Type: Bug Fix
Doc Text:
Cause: The init script uses the pid file to detect whether the service is already running, so that it can avoid starting it twice, restart it, and so on. The case where the disk is full and the pid file cannot be written was not detected.
Consequence: Later runs of the init.d script with start or restart would launch the service again, leaving several instances running. Most of these services do not open a listening port, so nothing prevented multiple instances from running side by side.
Fix: Check whether the actual process exists even if the pid file does not.
Result: Duplicate daemons are no longer started when the /var directory is full.
Last Closed: 2014-05-29 16:19:20 EDT
Type: Bug

Attachments
Check if daemon running before starting (9.32 KB, patch)
2014-03-16 23:16 EDT, Ian Wienand


External Trackers
Tracker: Red Hat Product Errata
ID: RHSA-2014:0516
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: openstack-neutron security, bug fix, and enhancement update
Last Updated: 2014-05-29 20:15:59 EDT
Description Ian Wienand 2014-03-16 20:47:37 EDT
Description of problem:

Several times in oslab we have ended up with multiple neutron daemons running at the same time.

I'm fairly sure the sequence of events starts with /var filling up, causing various odd high-level problems such as scheduling errors or networking failures, but nothing as blatantly obvious as an out-of-disk error unless you're digging in the right logs on the right host at the right time.

Without noticing this, the admin tries restarting various daemons as part of troubleshooting.

At that point you probably see something like

---
[root@rhel init.d]# service neutron-dhcp-agent start
Starting neutron-dhcp-agent: bash: line 0: echo: write error: No space left on device
                                                           [FAILED]
---

which will probably lead you to realise that the disk is full, even though the error output could be much more helpful.

However, the neutron-dhcp-agent daemon was actually started.  The problem is that the daemon gets started under a 'daemon' function that returns the pid, e.g.

---
    echo -n $"Starting $prog: "
    daemon --user neutron --pidfile $pidfile "$exec --log-file /var/log/$proj/$plugin.log ${configs[@]/#/--config-file } &>/dev/null & echo \$! > $pidfile"
    retval=$?
---

Even if $pidfile can't be written, the daemon has started anyway.
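
The reason is ordering: in a command of the form "cmd & echo $! > file", the shell launches the background job before it attempts the write. The following is a minimal sketch of that behaviour, not the init script itself. The helper name start_and_record and the unwritable path (standing in for a full /var) are assumptions made for illustration.

```shell
#!/bin/bash
# Sketch: the background job exists before the pid file write is attempted,
# so a failed write leaves the process running.

# Hypothetical helper with the same shape as the init script's command string.
start_and_record() {
    local pidfile=$1
    shift
    "$@" &>/dev/null &
    started_pid=$!                      # the process already exists here
    echo "$started_pid" > "$pidfile"    # may fail; the process keeps running
}

# Assumption: a path in a missing directory stands in for a full /var.
demo_pidfile=/nonexistent-demo-dir/demo.pid

if start_and_record "$demo_pidfile" sleep 30 2>/dev/null; then
    write_status=0
else
    write_status=$?
fi

if kill -0 "$started_pid" 2>/dev/null; then
    still_running=yes
else
    still_running=no
fi

echo "pid file write status: $write_status, process running: $still_running"

# Clean up the throwaway process.
kill "$started_pid" 2>/dev/null || true
```

The non-zero status comes only from the failed write, which is why the init script can print [FAILED] while a daemon is left behind.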

So you will likely restart the daemon again, which replaces one problem with another that is much harder to debug: the two (or more) agents start racing with each other to consume RPC calls, etc.
Comment 1 Ian Wienand 2014-03-16 23:16:31 EDT
Created attachment 875311 [details]
Check if daemon running before starting
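
The attached patch is not reproduced here. As a rough illustration of the idea in its title, a start guard might look for a live process even when the pid file is missing. Everything in this sketch is an assumption: the helper names process_exists and safe_to_start are hypothetical, the /proc scan is Linux-only, and a real init script would more likely rely on pidof or pgrep. The pattern should be the executable path rather than the service name, since the init script's own command line contains the service name.

```shell
#!/bin/bash
# Sketch of a start guard: trust a live process over a missing pid file.

# Hypothetical helper: succeeds if any command line under /proc contains $1.
process_exists() {
    local pattern=$1 f cmdline
    for f in /proc/[0-9]*/cmdline; do
        # Arguments are NUL-separated; processes may vanish while we scan.
        cmdline=$(tr '\0' ' ' 2>/dev/null < "$f") || continue
        case $cmdline in
            *"$pattern"*) return 0 ;;
        esac
    done
    return 1
}

# Hypothetical guard: returns 0 only when it looks safe to start the service.
# Arguments: display name, pattern identifying the executable, pid file path.
safe_to_start() {
    local prog=$1 pattern=$2 pidfile=$3 pid
    if [ -s "$pidfile" ]; then
        pid=$(cat "$pidfile")
        # kill -0 only tests for existence; assumes we may signal the process.
        if kill -0 "$pid" 2>/dev/null; then
            echo "$prog is already running (pid $pid)"
            return 1
        fi
    fi
    if process_exists "$pattern"; then
        echo "$prog was running, but no pid file, check disk space"
        return 1
    fi
    return 0
}

# Usage with a throwaway process; all names below are illustrative only.
marker="demo-agent-$$"
( exec -a "$marker" sleep 30 ) >/dev/null 2>&1 &
demo_pid=$!
sleep 1   # give the child time to exec

if safe_to_start demo-agent "$marker" /nonexistent-demo-dir/demo.pid; then
    guard_status=0
else
    guard_status=$?
fi

kill "$demo_pid" 2>/dev/null || true
```

In the usage above the pid file path cannot exist, so the guard has to fall back to the process scan, which is the case this bug is about.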
Comment 2 Miguel Angel Ajo 2014-03-26 04:25:44 EDT
We should get this included, as the result of a full /var is several agents starting together and racing with each other.
Comment 5 Ofer Blaut 2014-04-22 12:06:12 EDT
Hi Livnat, we need reproduction steps here.

Ofer
Comment 6 Alan Pevec 2014-04-24 05:13:14 EDT
To simulate a full /var for neutron, try this:
 mount -o size=1 -t tmpfs tmpfs /var/run/neutron/
 cat /dev/zero > /var/run/neutron/FILLME

# service neutron-dhcp-agent start
Starting neutron-dhcp-agent: bash: line 0: echo: write error: No space left on device
                                                           [FAILED]
# service neutron-dhcp-agent start
neutron-dhcp-agent was running, but no pid file, check disk space
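
The message above asks the admin to check disk space. A script could also probe for the condition directly before starting anything. This is only a sketch under stated assumptions: the helper name pidfile_dir_writable and the pid file path are hypothetical and are not taken from the shipped fix.

```shell
#!/bin/bash
# Hypothetical preflight: succeed only if the directory that will hold the
# pid file accepts a new file with some data in it.
pidfile_dir_writable() {
    local dir probe rc=0
    dir=$(dirname "$1")
    probe=$(mktemp "$dir/.pidprobe.XXXXXX" 2>/dev/null) || return 1
    # mktemp creates an empty file; writing data also catches a full filesystem.
    echo probe 2>/dev/null > "$probe" || rc=1
    rm -f "$probe"
    return "$rc"
}

# Illustrative use before starting a daemon; the path is an assumption.
pidfile=/var/run/neutron/neutron-dhcp-agent.pid
if ! pidfile_dir_writable "$pidfile"; then
    echo "cannot write $pidfile: check disk space and permissions" >&2
fi
```

Running the probe before the daemon is launched means a full or unwritable directory is reported without leaving a process behind.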
Comment 7 Amit Ugol 2014-04-24 07:31:00 EDT
Alan, wouldn't this be easier to simulate as follows, as root, while the agent is running?
chown root:root /var/run/neutron
chmod 0740 /var/run/neutron
mv /var/run/neutron/neutron-dhcp-client.pid /var/run/neutron/_neutron-dhcp-client.pid
Then try to run /etc/init.d/neutron-dhcp-client start.
Comment 8 Nir Magnezi 2014-04-24 10:51:37 EDT
Verified NVR: openstack-neutron-2013.2.3-4.el6ost.noarch

Followed Alan's steps to reproduce in Comment #6

Result:
=======
[root@rhel ~]# service neutron-dhcp-agent start
neutron-dhcp-agent was running, but no pid file, check disk space
Comment 9 Miguel Angel Ajo 2014-05-28 08:10:23 EDT
*** Bug 1064109 has been marked as a duplicate of this bug. ***
Comment 10 Miguel Angel Ajo 2014-05-29 04:24:09 EDT
*** Bug 1075570 has been marked as a duplicate of this bug. ***
Comment 12 errata-xmlrpc 2014-05-29 16:19:20 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0516.html