Cause: The init script uses the pid file to detect whether the service is already running, so that it can avoid starting it a second time, restart it, and so on. The case where the disk is full and the pid file cannot be written was not detected.
Consequence: Later runs of the init.d script with start or restart would start the service again, leaving several copies running. In most cases these daemons do not open a listening port, so nothing prevents several instances from running side by side.
Fix: Check if the actual process exists even if the pid file doesn't exist.
Result: No more duplicated daemons are started when the /var directory is full.
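The check described under Fix could look roughly like the sketch below. This is not the code shipped in the updated package; the function name check_running, its return codes, and the use of pgrep -f to search the process table are all assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not from the shipped init script).
# Usage: check_running PIDFILE PATTERN
# Returns 0 if the pid file names a live process,
#         2 if a matching process runs but no usable pid file exists,
#         1 if nothing matching is running.
check_running() {
    local pidfile=$1 pattern=$2 pid=

    if [ -s "$pidfile" ]; then
        read -r pid < "$pidfile"
        # kill -0 sends no signal; it only tests that the pid exists.
        if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
            return 0
        fi
    fi

    # No usable pid file: fall back to searching the process table.
    if pgrep -f -- "$pattern" >/dev/null 2>&1; then
        return 2
    fi
    return 1
}

# Example use inside a start() function (pattern is an assumption):
#   check_running "$pidfile" "neutron-dhcp-agent"
#   if [ $? -eq 2 ]; then
#       echo "$prog was running, but no pid file, check disk space"
#       return 1
#   fi
```

Matching on the command line is cruder than a pid file and can match unrelated processes with a similar name, so a real script would want a pattern specific enough to identify only the daemon.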
Description of problem:
Several times in oslab we have ended up with multiple neutron daemons running at the same time.
I'm fairly sure the sequence of events starts with /var filling up, causing various odd high-level problems such as scheduling errors or networking failures, but nothing as blatantly obvious as an out-of-disk error unless you're digging in the right logs on the right host at the right time.
So without noticing this, as part of trouble-shooting the admin tries restarting various daemons.
At that point you probably see something like
---
[root@rhel init.d]# service neutron-dhcp-agent start
Starting neutron-dhcp-agent: bash: line 0: echo: write error: No space left on device
[FAILED]
---
which will probably lead you to realise the disk is full, although the error output could be much more helpful.
However, the neutron-dhcp-agent daemon was actually started. The problem is that the daemon gets started under a 'daemon' function that returns the pid, e.g.
---
echo -n $"Starting $prog: "
daemon --user neutron --pidfile $pidfile "$exec --log-file /var/log/$proj/$plugin.log ${configs[@]/#/--config-file } &>/dev/null & echo \$! > $pidfile"
retval=$?
---
Even if $pidfile can't be written, the daemon has started anyway.
So you will likely restart the daemon again, which replaces one problem with another that is much harder to debug: the two (or more) agents start racing with each other to consume RPC calls, etc.
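To illustrate the point that the launch and the pid file write can fail independently, here is a minimal sketch that separates the two steps and reports a write failure explicitly. The function name start_and_record, its return code 3, and the message wording are hypothetical; the real init script runs the command through the 'daemon' function shown above instead.

```shell
#!/bin/bash
# Hypothetical helper (not from the shipped init script).
# Usage: start_and_record PIDFILE COMMAND [ARGS...]
# Returns 0 if the command was launched and its pid recorded,
#         3 if the command was launched but the pid file write failed.
start_and_record() {
    local pidfile=$1
    shift

    # Launch the command in the background, detached from our output.
    "$@" >/dev/null 2>&1 &
    local pid=$!

    # The write is tested separately, so a full disk or an unwritable
    # directory is reported instead of being mistaken for a failed start.
    if ! { echo "$pid" > "$pidfile"; } 2>/dev/null; then
        echo "started with pid $pid, but cannot write $pidfile; check disk space" >&2
        return 3
    fi
    return 0
}
```

A caller that sees return code 3 knows a process is already running and should not simply be started again.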
To simulate full /var for neutron try this:
mount -o size=1 -t tmpfs tmpfs /var/run/neutron/
cat /dev/zero > /var/run/neutron/FILLME
# service neutron-dhcp-agent start
Starting neutron-dhcp-agent: bash: line 0: echo: write error: No space left on device
[FAILED]
# service neutron-dhcp-agent start
neutron-dhcp-agent was running, but no pid file, check disk space
Alan, won't this be easier to simulate by:
while running and as root:
chown root:root /var/run/neutron
chmod 0740 /var/run/neutron
mv /var/run/neutron/neutron-dhcp-client.pid /var/run/neutron/_neutron-dhcp-client.pid
and try to run /etc/init.d/neutron-dhcp-client start
?
Verified NVR: openstack-neutron-2013.2.3-4.el6ost.noarch
Followed Alan's steps to reproduce in Comment #6
Result:
=======
[root@rhel ~]# service neutron-dhcp-agent start
neutron-dhcp-agent was running, but no pid file, check disk space
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHSA-2014-0516.html