Bug 1076994

Summary: Full /var prevents pid file being written, but daemon starts anyway
Product: Red Hat OpenStack
Reporter: Ian Wienand <iwienand>
Component: openstack-neutron
Assignee: Miguel Angel Ajo <mangelajo>
Status: CLOSED ERRATA
QA Contact: Nir Magnezi <nmagnezi>
Severity: high
Docs Contact:
Priority: medium
Version: 4.0
CC: apevec, augol, breeler, chrisw, gdubreui, lpeer, majopela, mangelajo, mwagner, nyechiel, oblaut, yeylon
Target Milestone: z4
Keywords: ZStream
Target Release: 4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-neutron-2013.2.3-4.el6ost
Doc Type: Bug Fix
Doc Text:
Cause: The init script uses the pid file to detect whether the service is already running, so that it can avoid starting it twice, restart it, and so on. The case where the disk is full and the pid file cannot be written was not detected.
Consequence: Later runs of the init.d script with start or restart would start the service several times, because in most cases the agents do not open a listening port that would prevent several instances from running together.
Fix: Check whether the actual process exists even if the pid file does not.
Result: Duplicate daemons are no longer started when the /var directory is full.
Last Closed: 2014-05-29 20:19:20 UTC
Type: Bug
Attachments:
Check if daemon running before starting (flags: none)

Description Ian Wienand 2014-03-17 00:47:37 UTC
Description of problem:

Several times in oslab we have ended up with multiple neutron daemons running at the same time.

I'm fairly sure the sequence of events starts with /var filling up, which causes various odd high-level problems such as scheduling errors or networking failures, but nothing as blatantly obvious as an out-of-disk error unless you're digging in the right logs on the right host at the right time.

Without noticing this, the admin tries restarting various daemons as part of troubleshooting.

At that point you probably see something like

---
[root@rhel init.d]# service neutron-dhcp-agent start
Starting neutron-dhcp-agent: bash: line 0: echo: write error: No space left on device
                                                           [FAILED]
---

which will probably lead you to realise that the disk is full, even though the error output could be much more helpful.

However, the neutron-dhcp-agent daemon was actually started. The problem is that the init script starts it through the 'daemon' function with a command that backgrounds the process and only then writes its pid to the pid file, e.g.

---
    echo -n $"Starting $prog: "
    daemon --user neutron --pidfile $pidfile "$exec --log-file /var/log/$proj/$plugin.log ${configs[@]/#/--config-file } &>/dev/null & echo \$! > $pidfile"
    retval=$?
---

Even if $pidfile can't be written, the daemon has started anyway.
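
The failure mode can be reproduced in isolation. The following is a minimal sketch, not the neutron init script: it assumes a Linux system where /dev/full exists (every write to it fails with "No space left on device"), and the function name start_with_pidfile and the use of sleep as a stand-in daemon are made up for illustration.

```shell
#!/bin/bash
# Hypothetical demo: background a process, then try to record its pid
# in a "pid file" that cannot be written.
start_with_pidfile() {
    local pidfile=$1
    sleep 30 &                    # stand-in for the real daemon
    bg_pid=$!                     # kept global so the caller can inspect it
    echo "$bg_pid" > "$pidfile"   # this write fails on /dev/full
}

start_with_pidfile /dev/full 2>/dev/null
status=$?                         # status of the failed pid file write

# kill -0 sends no signal; it only tests whether the process exists.
if kill -0 "$bg_pid" 2>/dev/null; then
    alive=yes
else
    alive=no
fi

echo "start status: $status, process alive: $alive"
kill "$bg_pid"                    # clean up the stand-in daemon
```

The start reports failure because the write failed, yet the background process is already running, which is the same combination of "[FAILED]" and a live agent described above.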

So you will likely start the daemon again, which replaces one problem with another that is much harder to debug: the two (or more) agents start racing with each other to consume RPC calls, etc.

Comment 1 Ian Wienand 2014-03-17 03:16:31 UTC
Created attachment 875311 [details]
Check if daemon running before starting
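
The attachment itself is not reproduced on this page. As a rough sketch of the kind of check its title and the Doc Text describe (look for the process even when the pid file is missing), something like the following could be used. The function name daemon_state, its output strings, and the use of pgrep are assumptions for illustration, not the actual patch.

```shell
#!/bin/bash
# Hypothetical helper, not the attached patch.
# Usage: daemon_state PIDFILE PATTERN
# Prints one of: running, running-no-pidfile, stopped.
# PATTERN is an extended regex matched against full command lines; it
# must be specific enough not to match the calling init script itself.
daemon_state() {
    local pidfile=$1 pattern=$2 pid

    # Normal case: a non-empty pid file that points at a live process.
    if [ -s "$pidfile" ]; then
        read -r pid < "$pidfile"
        if kill -0 "$pid" 2>/dev/null; then
            echo running
            return 0
        fi
    fi

    # No usable pid file: fall back to searching the process table.
    if pgrep -f "$pattern" >/dev/null 2>&1; then
        echo running-no-pidfile
        return 0
    fi

    echo stopped
    return 1
}
```

A start function could call this first and refuse to launch a second copy when the result is running-no-pidfile, printing a hint to check disk space, similar to the message shown in the later comments.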

Comment 2 Miguel Angel Ajo 2014-03-26 08:25:44 UTC
We should get this included, as the result of a full /var is several agents starting together and racing with each other.

Comment 5 Ofer Blaut 2014-04-22 16:06:12 UTC
Hi Livnat, we need steps to reproduce here.

Ofer

Comment 6 Alan Pevec 2014-04-24 09:13:14 UTC
To simulate full /var for neutron try this:
 mount -o size=1 -t tmpfs tmpfs /var/run/neutron/
 cat /dev/zero > /var/run/neutron/FILLME

# service neutron-dhcp-agent start
Starting neutron-dhcp-agent: bash: line 0: echo: write error: No space left on device
                                                           [FAILED]
# service neutron-dhcp-agent start
neutron-dhcp-agent was running, but no pid file, check disk space

Comment 7 Amit Ugol 2014-04-24 11:31:00 UTC
Alan, wouldn't this be easier to simulate as follows, as root, while the agent is running?
chown root:root /var/run/neutron
chmod 0740 /var/run/neutron
mv /var/run/neutron/neutron-dhcp-client.pid /var/run/neutron/_neutron-dhcp-client.pid
Then try to run /etc/init.d/neutron-dhcp-client start.

Comment 8 Nir Magnezi 2014-04-24 14:51:37 UTC
Verified NVR: openstack-neutron-2013.2.3-4.el6ost.noarch

Followed Alan's steps to reproduce in Comment #6

Result:
=======
[root@rhel ~]# service neutron-dhcp-agent start
neutron-dhcp-agent was running, but no pid file, check disk space
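
To double-check that a repeated start really left a single agent running, one could count the matching processes. A minimal sketch, assuming pgrep is available; count_instances is a hypothetical helper, not part of the init script or the fix.

```shell
#!/bin/bash
# Hypothetical helper: print how many processes have a command line
# matching the given extended regular expression.
count_instances() {
    local pattern=$1
    # pgrep prints one pid per line; wc counts them and tr strips the
    # padding that some wc implementations add.
    pgrep -f "$pattern" | wc -l | tr -d ' '
}
```

Pass a pattern that matches the agent's command line but not the init script's own; a result greater than 1 would indicate duplicate daemons.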

Comment 9 Miguel Angel Ajo 2014-05-28 12:10:23 UTC
*** Bug 1064109 has been marked as a duplicate of this bug. ***

Comment 10 Miguel Angel Ajo 2014-05-29 08:24:09 UTC
*** Bug 1075570 has been marked as a duplicate of this bug. ***

Comment 12 errata-xmlrpc 2014-05-29 20:19:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0516.html