Bug 1051444 - [neutron]: neutron-dhcp-agent and neutron-l3-agent won't respawn child processes if something goes wrong
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron (Show other bugs)
Version: 4.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z4
Target Release: 4.0
Assigned To: Miguel Angel Ajo
QA Contact: yfried
Keywords: ZStream
Depends On:
Blocks: RHEL-OSP_Neutron_HA
 
Reported: 2014-01-10 05:00 EST by Miguel Angel Ajo
Modified: 2016-04-26 11:02 EDT (History)
CC List: 6 users

See Also:
Fixed In Version: openstack-neutron-2013.2.2-9.el6ost
Doc Type: Bug Fix
Doc Text:
Cause: The neutron-*-agent code doesn't detect when a child process dies, and neither respawns it nor logs an error.
Consequence: Service to a tenant network could be interrupted with no way to supervise it or fix it automatically.
Fix: Created neutron-agent-watch to watch over those child processes until an upstream solution is merged.
Result: The agent status can now be polled to find out the general status of the agent, including the child process status.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-05-29 16:18:28 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Neutron agent watch code (25.12 KB, text/x-python), 2014-03-28 10:59 EDT, Miguel Angel Ajo
dhcp agent init script (2.16 KB, application/x-shellscript), 2014-03-28 11:00 EDT, Miguel Angel Ajo
l3 agent init script (2.19 KB, application/x-shellscript), 2014-03-28 11:00 EDT, Miguel Angel Ajo
agent watch config (996 bytes, text/plain), 2014-03-28 11:01 EDT, Miguel Angel Ajo
openstack-neutron.spec patch (4.17 KB, patch), 2014-03-28 11:02 EDT, Miguel Angel Ajo


External Trackers
Tracker ID Priority Status Summary Last Updated
Launchpad 1257524 None None None Never
Red Hat Product Errata RHSA-2014:0516 normal SHIPPED_LIVE Moderate: openstack-neutron security, bug fix, and enhancement update 2014-05-29 20:15:59 EDT

Description Miguel Angel Ajo 2014-01-10 05:00:10 EST
Description of problem:

  If a child process (which provides services to a tenant network) dies or gets killed, neutron-dhcp-agent won't restart the process.

  That means that metadata or DHCP could stop working if something goes wrong with the child process, while neutron believes everything is up and running.

How reproducible:

   Always  


Steps to Reproduce:
1. Create an isolated network + subnet
2. Attach a VM to this network only, and make sure it's started
3. dnsmasq and neutron-ns-metadata-proxy child processes should now have been created on the host where neutron-dhcp-agent is running.
4. Kill those processes manually

Actual results:

  The dnsmasq process and the neutron-ns-metadata-proxy are not respawned.

Expected results:

  The underlying processes should be respawned, or at least we should log or otherwise report somewhere that the child is gone.
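
As an illustration of the "at least notice that the child is gone" part, here is a minimal sketch of a pid-file liveness check. It is not the code from the attached neutron-agent-watch; the function names are hypothetical, and it assumes a POSIX host where each child writes its PID to a pid file.

```python
import errno
import logging
import os

LOG = logging.getLogger(__name__)


def read_pid(pid_file):
    """Return the integer PID stored in pid_file, or None if unreadable."""
    try:
        with open(pid_file) as f:
            return int(f.read().strip())
    except (IOError, OSError, ValueError):
        return None


def is_alive(pid):
    """Return True if a process with this PID currently exists."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 sends nothing, it only checks the PID
    except OSError as exc:
        # EPERM: the process exists but belongs to another user.
        return exc.errno == errno.EPERM
    return True


def check_children(pid_files):
    """Log and return the pid files whose process is gone (hypothetical helper)."""
    dead = [path for path in pid_files if not is_alive(read_pid(path))]
    for path in dead:
        LOG.error("child process for pid file %s is gone", path)
    return dead
```

A check like this only tells you that some process owns the PID; it does not prove the process is still dnsmasq or neutron-ns-metadata-proxy, since PIDs can be reused.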

Additional info:
Comment 1 Miguel Angel Ajo 2014-01-10 05:04:34 EST
I have found nodes where dnsmasq had died for some reason, and the network was left unattended. However, if the child died once, there is a good chance that it will die again.

So a respawn retry limit plus logging could make sense.
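
A respawn retry limit with logging could look roughly like the sketch below. This is only an assumption about the shape of such a supervisor, not the upstream or neutron-agent-watch implementation; the class name, the `max_respawns` parameter, and the returned state strings are made up for illustration.

```python
import logging
import subprocess

LOG = logging.getLogger(__name__)


class RespawningChild(object):
    """Supervise one child command, respawning it a bounded number of times."""

    def __init__(self, cmd, max_respawns=3):
        self.cmd = cmd                    # argv list, e.g. ["dnsmasq", ...]
        self.max_respawns = max_respawns  # hypothetical retry limit
        self.respawns = 0
        self.proc = None

    def start(self):
        self.proc = subprocess.Popen(self.cmd)

    def check(self):
        """Return 'running', 'respawned' or 'gave_up'."""
        if self.proc is not None and self.proc.poll() is None:
            return "running"
        code = None if self.proc is None else self.proc.returncode
        if self.respawns >= self.max_respawns:
            LOG.error("%s exited with %s; respawn limit %d reached, giving up",
                      self.cmd[0], code, self.max_respawns)
            return "gave_up"
        self.respawns += 1
        LOG.warning("%s exited with %s; respawning (%d/%d)",
                    self.cmd[0], code, self.respawns, self.max_respawns)
        self.start()
        return "respawned"
```

Calling `check()` periodically from the agent's main loop would restart a dead child up to the limit and leave a log trail either way, so an operator can see that the network was at risk even when the respawn succeeds.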
Comment 10 Miguel Angel Ajo 2014-03-28 10:59:43 EDT
Created attachment 879889 [details]
Neutron agent watch code
Comment 11 Miguel Angel Ajo 2014-03-28 11:00:15 EDT
Created attachment 879890 [details]
dhcp agent init script
Comment 12 Miguel Angel Ajo 2014-03-28 11:00:58 EDT
Created attachment 879891 [details]
l3 agent init script
Comment 13 Miguel Angel Ajo 2014-03-28 11:01:53 EDT
Created attachment 879892 [details]
agent watch config
Comment 14 Miguel Angel Ajo 2014-03-28 11:02:27 EDT
Created attachment 879893 [details]
openstack-neutron.spec patch
Comment 18 yfried 2014-04-23 02:43:09 EDT
verified on RHEL6.5

python-neutron-2013.2.3-4.el6ost.noarch
python-neutronclient-2.3.4-1.el6ost.noarch
openstack-neutron-openvswitch-2013.2.3-4.el6ost.noarch
openstack-neutron-2013.2.3-4.el6ost.noarch

behavior is as described in comment 15

However, trying to delete the "damaged" network prints to agent_watch.log:

2014-04-23 09:35:51.169 114897 ERROR root [-] Unexpected exception occurred 50 time(s)... retrying.
2014-04-23 09:35:51.169 114897 TRACE root Traceback (most recent call last):
2014-04-23 09:35:51.169 114897 TRACE root   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/excutils.py", line 62, in inner_func
2014-04-23 09:35:51.169 114897 TRACE root     return infunc(*args, **kwargs)
2014-04-23 09:35:51.169 114897 TRACE root   File "/usr/bin/neutron-agent-watch", line 627, in run
2014-04-23 09:35:51.169 114897 TRACE root     watcher.run(context)
2014-04-23 09:35:51.169 114897 TRACE root   File "/usr/bin/neutron-agent-watch", line 426, in run
2014-04-23 09:35:51.169 114897 TRACE root     self._run()  # run method implemented in child class
2014-04-23 09:35:51.169 114897 TRACE root   File "/usr/bin/neutron-agent-watch", line 515, in _run
2014-04-23 09:35:51.169 114897 TRACE root     self._remove_old_known_pidfiles(expected_pid_files)
2014-04-23 09:35:51.169 114897 TRACE root   File "/usr/bin/neutron-agent-watch", line 388, in _remove_old_known_pidfiles
2014-04-23 09:35:51.169 114897 TRACE root     self._remove_expected_pid_file(known)
2014-04-23 09:35:51.169 114897 TRACE root AttributeError: 'DhcpAgentWatcher' object has no attribute '_remove_expected_pid_file'

and
# /etc/init.d/neutron-dhcp-agent status ; echo $?
returns:
neutron-dhcp-agent (pid  20898) is running...
neutron-dhcp-agent health is not good
150

This persists even after the network and router have been deleted.

It is only resolved by restarting the agent-watch service.
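
For monitoring purposes, the exit code shown above can be interpreted programmatically. The sketch below is an assumption-laden illustration: the meaning of 150 ("health is not good") is taken from the output in this comment, codes 0 to 4 follow the LSB init script conventions for `status`, and the function names are hypothetical.

```python
import subprocess

# LSB-defined exit codes for the "status" action of an init script.
LSB_STATUS = {
    0: "running",
    1: "dead, but pid file exists",
    2: "dead, but lock file exists",
    3: "not running",
    4: "status unknown",
}

# Value observed in this report; 150-199 is the LSB range reserved
# for application use.
UNHEALTHY = 150


def describe_status(returncode):
    """Map an init script status exit code to a human-readable string."""
    if returncode == UNHEALTHY:
        return "running, but health is not good"
    if 150 <= returncode <= 199:
        return "application-specific status %d" % returncode
    return LSB_STATUS.get(returncode,
                          "reserved or unknown status %d" % returncode)


def agent_status(script="/etc/init.d/neutron-dhcp-agent"):
    """Run '<script> status' and describe the result (requires the script)."""
    return describe_status(subprocess.call([script, "status"]))
```

With the output from this comment, `describe_status(150)` reports the agent as running but unhealthy, which is the state that persisted until agent-watch was restarted.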
Comment 20 errata-xmlrpc 2014-05-29 16:18:28 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0516.html
