Description of problem:

Several times in oslab we have ended up with multiple neutron daemons running at the same time. I'm fairly sure the sequence of events starts with /var filling up, causing various odd high-level problems such as scheduling errors or networking failures, but nothing as blatantly obvious as an out-of-disk error unless you're digging in the right logs on the right host at the right time.

Without noticing this, the admin tries restarting various daemons as part of troubleshooting. At that point you probably see something like

---
[root@rhel init.d]# service neutron-dhcp-agent start
Starting neutron-dhcp-agent: bash: line 0: echo: write error: No space left on device [FAILED]
---

which will probably lead you to realise that the disk is full, even though the error output could be much more helpful.

However, the neutron-dhcp-agent daemon was actually started. The problem is that the daemon gets started under a 'daemon' function, and its pid is only written to the pid file afterwards, e.g.

---
echo -n $"Starting $prog: "
daemon --user neutron --pidfile $pidfile "$exec --log-file /var/log/$proj/$plugin.log ${configs[@]/#/--config-file } &>/dev/null & echo \$! > $pidfile"
retval=$?
---

Even if $pidfile can't be written, the daemon has started anyway. So you will likely start the daemon again, which replaces one problem with another that is much harder to debug: the two (or more) agents start racing with each other to consume RPC calls, etc.
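To illustrate the failure mode, here is a minimal sketch of a start helper that only counts a start as successful if the pid was actually recorded, and stops the process otherwise. This is not the init script's code: the function name `start_and_record` and the `LAST_PID` variable are hypothetical, and the real script runs the agent through `daemon --user neutron`, which this sketch leaves out.

```shell
#!/bin/bash
# Hypothetical helper, not the init script's real code: start a command in the
# background and treat the start as successful only if its pid was recorded.
# Sets LAST_PID (an illustrative global) to the pid of the launched process.
start_and_record() {
    local pidfile=$1
    shift
    "$@" &
    LAST_PID=$!
    # The braces send both redirection errors and write errors to /dev/null.
    if ! { echo "$LAST_PID" > "$pidfile"; } 2>/dev/null; then
        # The pid could not be saved (e.g. full disk): stop the process so no
        # untracked copy is left behind, then reap it.
        kill "$LAST_PID" 2>/dev/null
        wait "$LAST_PID" 2>/dev/null
        return 1
    fi
    return 0
}
```

For example, `start_and_record /var/run/neutron/neutron-dhcp-agent.pid some-command` (path and command chosen for illustration) would return non-zero and leave nothing running when the pid file cannot be written, so a second start would not produce a duplicate.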
Created attachment 875311: Check if daemon running before starting
We should get this included, as the result of a full /var is several agents starting together and racing with each other.
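The attachment itself is not reproduced in this thread, so the following is only a rough sketch of what a "check if the daemon is running before starting" guard could look like. The function names `already_running` and `check_before_start` are hypothetical, and matching the process by name with `pgrep -f` is an assumption, not necessarily what the patch does.

```shell
#!/bin/bash
# Hypothetical helpers; the attached patch itself is not reproduced here.

# Return 0 if some process's full command line matches the given pattern.
already_running() {
    pgrep -f "$1" > /dev/null 2>&1
}

# Print a diagnostic and return 1 when starting would create a duplicate.
# Intended to be run as root, as init scripts are, so that kill -0 can
# probe processes owned by other users.
check_before_start() {
    local prog=$1 pidfile=$2
    # Normal case: the pid file names a live process.
    if [ -s "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
        echo "$prog is already running"
        return 1
    fi
    # Pid file missing or stale, but a matching process exists anyway.
    if already_running "$prog"; then
        echo "$prog was running, but no pid file, check disk space"
        return 1
    fi
    return 0
}
```

A start function could call `check_before_start "$prog" "$pidfile" || return 1` before launching anything, so that a second `service ... start` on a full disk reports the situation instead of spawning another agent.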
Hi Livnat,

We need reproduction steps here.

Ofer
To simulate a full /var for neutron, try this:

mount -o size=1 -t tmpfs tmpfs /var/run/neutron/
cat /dev/zero > /var/run/neutron/FILLME

# service neutron-dhcp-agent start
Starting neutron-dhcp-agent: bash: line 0: echo: write error: No space left on device [FAILED]
# service neutron-dhcp-agent start
neutron-dhcp-agent was running, but no pid file, check disk space
Alan, won't this be easier to simulate as follows? While the daemon is running, as root:

chown root:root /var/run/neutron
chmod 0740 /var/run/neutron
mv /var/run/neutron/neutron-dhcp-client.pid /var/run/neutron/_neutron-dhcp-client.pid

and then try to run

/etc/init.d/neutron-dhcp-client start
Verified

NVR: openstack-neutron-2013.2.3-4.el6ost.noarch

Followed Alan's steps to reproduce in Comment #6

Result:
=======
[root@rhel ~]# service neutron-dhcp-agent start
neutron-dhcp-agent was running, but no pid file, check disk space
*** Bug 1064109 has been marked as a duplicate of this bug. ***
*** Bug 1075570 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2014-0516.html