Description of problem:
If a child process (which provides services to the tenant network) dies or gets killed, neutron-dhcp-agent won't restart it. That means metadata or DHCP could stop working if something goes wrong with the child process, while neutron believes everything is up and running.

How reproducible:
Always

Steps to Reproduce:
1. Create an isolated network + subnet.
2. Attach a VM to this network only, and make sure it's started.
3. dnsmasq and neutron-ns-metadata-proxy child processes should now have been created on the host where neutron-dhcp-agent is running.
4. Kill those processes manually.

Actual results:
The dnsmasq and neutron-ns-metadata-proxy processes are not respawned.

Expected results:
The underlying processes should be respawned, or at least we should log / notice somewhere that the child is gone.

Additional info:
I have found nodes where dnsmasq had died for some reason, and the network was left unattended. However, if the child died once, there is a good chance that it will die again, so a respawn retry limit plus logging could make sense.
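To illustrate the "respawn retry limit + logging" idea, here is a minimal sketch. It is not the code from the attachments or from neutron itself: the ChildMonitor class, its max_respawns parameter, and the returned status strings are all hypothetical names chosen for this example. It also assumes the child is started directly with subprocess.Popen, whereas the real dnsmasq and neutron-ns-metadata-proxy processes are daemonized and tracked through pid files, so a real watcher would have to check liveness differently.

```python
import logging
import subprocess

LOG = logging.getLogger("child_monitor")


class ChildMonitor(object):
    """Hypothetical helper: respawn a child when it exits, up to a limit."""

    def __init__(self, cmd, max_respawns=3):
        self.cmd = cmd                    # argv list used to start the child
        self.max_respawns = max_respawns  # assumed knob, not a neutron option
        self.respawns = 0
        self.proc = subprocess.Popen(cmd)

    def check(self):
        """Return 'running', 'respawned' or 'gave_up'."""
        # poll() returns None while the child is still alive.
        if self.proc.poll() is None:
            return "running"

        LOG.error("child %s exited with code %s",
                  self.cmd, self.proc.returncode)

        # A child that keeps dying is not restarted forever.
        if self.respawns >= self.max_respawns:
            LOG.error("respawn limit (%d) reached for %s, giving up",
                      self.max_respawns, self.cmd)
            return "gave_up"

        self.respawns += 1
        self.proc = subprocess.Popen(self.cmd)
        LOG.warning("respawned %s (attempt %d of %d)",
                    self.cmd, self.respawns, self.max_respawns)
        return "respawned"
```

A periodic task in the agent could call check() for each child it started. Once 'gave_up' is returned, the error is in the log and the agent could report itself as unhealthy instead of silently leaving the network unattended.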
Created attachment 879889 Neutron agent watch code
Created attachment 879890 dhcp agent init script
Created attachment 879891 l3 agent init script
Created attachment 879892 agent watch config
Created attachment 879893 openstack-neutron.spec patch
Verified on RHEL 6.5:

python-neutron-2013.2.3-4.el6ost.noarch
python-neutronclient-2.3.4-1.el6ost.noarch
openstack-neutron-openvswitch-2013.2.3-4.el6ost.noarch
openstack-neutron-2013.2.3-4.el6ost.noarch

Behavior is as described in comment 15.

However, trying to delete the "damaged" network prints the following to agent_watch.log:

2014-04-23 09:35:51.169 114897 ERROR root [-] Unexpected exception occurred 50 time(s)... retrying.
2014-04-23 09:35:51.169 114897 TRACE root Traceback (most recent call last):
2014-04-23 09:35:51.169 114897 TRACE root   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/excutils.py", line 62, in inner_func
2014-04-23 09:35:51.169 114897 TRACE root     return infunc(*args, **kwargs)
2014-04-23 09:35:51.169 114897 TRACE root   File "/usr/bin/neutron-agent-watch", line 627, in run
2014-04-23 09:35:51.169 114897 TRACE root     watcher.run(context)
2014-04-23 09:35:51.169 114897 TRACE root   File "/usr/bin/neutron-agent-watch", line 426, in run
2014-04-23 09:35:51.169 114897 TRACE root     self._run() # run method implemented in child class
2014-04-23 09:35:51.169 114897 TRACE root   File "/usr/bin/neutron-agent-watch", line 515, in _run
2014-04-23 09:35:51.169 114897 TRACE root     self._remove_old_known_pidfiles(expected_pid_files)
2014-04-23 09:35:51.169 114897 TRACE root   File "/usr/bin/neutron-agent-watch", line 388, in _remove_old_known_pidfiles
2014-04-23 09:35:51.169 114897 TRACE root     self._remove_expected_pid_file(known)
2014-04-23 09:35:51.169 114897 TRACE root AttributeError: 'DhcpAgentWatcher' object has no attribute '_remove_expected_pid_file'

and

# /etc/init.d/neutron-dhcp-agent status ; echo $?

returns:

neutron-dhcp-agent (pid 20898) is running...
neutron-dhcp-agent health is not good
150

even after the network and router have been deleted. This is only resolved after restarting the agent-watch service.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2014-0516.html