Bug 1051444 - [neutron]: neutron-dhcp-agent and neutron-l3-agent won't respawn child processes if something goes wrong
Summary: [neutron]: neutron-dhcp-agent and neutron-l3-agent won't respawn child proces...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 4.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z4
Target Release: 4.0
Assignee: Miguel Angel Ajo
QA Contact: yfried
URL:
Whiteboard:
Depends On:
Blocks: RHEL-OSP_Neutron_HA
Reported: 2014-01-10 10:00 UTC by Miguel Angel Ajo
Modified: 2022-07-09 06:16 UTC (History)

Fixed In Version: openstack-neutron-2013.2.2-9.el6ost
Doc Type: Bug Fix
Doc Text:
Cause: The neutron-*-agent code doesn't detect when a child process dies, and neither respawns it nor logs an error.
Consequence: Service to a tenant network could be interrupted with no way to supervise it or repair it automatically.
Fix: Created neutron-agent-watch to watch over those child processes until an upstream solution is merged.
Result: The agent status can now be polled to find out the general status of the agent, including the status of its child processes.
Clone Of:
Environment:
Last Closed: 2014-05-29 20:18:28 UTC
Target Upstream Version:
Embargoed:


Attachments
Neutron agent watch code (25.12 KB, text/x-python), 2014-03-28 14:59 UTC, Miguel Angel Ajo
dhcp agent init script (2.16 KB, application/x-shellscript), 2014-03-28 15:00 UTC, Miguel Angel Ajo
l3 agent init script (2.19 KB, application/x-shellscript), 2014-03-28 15:00 UTC, Miguel Angel Ajo
agent watch config (996 bytes, text/plain), 2014-03-28 15:01 UTC, Miguel Angel Ajo
openstack-neutron.spec patch (4.17 KB, patch), 2014-03-28 15:02 UTC, Miguel Angel Ajo


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Launchpad | 1257524 | 0 | None | None | None | Never
Red Hat Product Errata | RHSA-2014:0516 | 0 | normal | SHIPPED_LIVE | Moderate: openstack-neutron security, bug fix, and enhancement update | 2014-05-30 00:15:59 UTC

Description Miguel Angel Ajo 2014-01-10 10:00:10 UTC
Description of problem:

  If a child process (which provides services to a tenant network) dies or gets killed, neutron-dhcp-agent won't restart it.

  That means metadata or DHCP could stop working if something goes wrong with the child process, while neutron believes everything is up and running.

How reproducible:

   Always  


Steps to Reproduce:
1. Create an isolated network + subnet
2. Attach a VM to this network only, and make sure it's started
3. dnsmasq and neutron-ns-metadata-proxy child processes should now have been created on the host where neutron-dhcp-agent is running.
4. Kill those processes manually

Actual results:

  The dnsmasq process and the neutron-ns-metadata-proxy are not respawned.

Expected results:

  The underlying processes should be respawned, or at least the loss of the child should be logged or reported somewhere.
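
  For illustration only: a minimal Python 3 sketch of how a missing child could at least be noticed, assuming each child writes its PID to a pidfile. The helper names (read_pid, is_alive, dead_children) are hypothetical and are not taken from the neutron code.

```python
import os


def read_pid(pidfile):
    """Return the PID stored in pidfile, or None if it is missing or malformed."""
    try:
        with open(pidfile) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def is_alive(pid):
    """Return True if a process with this PID currently exists."""
    if pid is None or pid <= 0:
        return False
    try:
        # Signal 0 performs the existence/permission check without
        # delivering any signal to the target process.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def dead_children(pidfiles):
    """Return the pidfiles whose recorded process is no longer running."""
    return [path for path in pidfiles if not is_alive(read_pid(path))]
```

  Note that a PID can be reused by an unrelated process, so a real check would probably also want to compare the command line of the process against what is expected.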

Additional info:

Comment 1 Miguel Angel Ajo 2014-01-10 10:04:34 UTC
I have found nodes where dnsmasq had died for some reason and the network was left unattended. However, if the child died once, there is a good chance it will die again.

So a respawn retry limit plus logging could make sense.
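
A rough sketch of what a respawn retry limit plus logging could look like, in Python 3. This is not the neutron implementation: the supervise function, its parameters, and the logger name are made up for illustration, and it assumes the child stays in the foreground (a daemonizing child such as dnsmasq would need pidfile-based tracking instead).

```python
import logging
import subprocess
import time

LOG = logging.getLogger("child_respawn_sketch")  # hypothetical logger name


def supervise(cmd, max_respawns=3, delay=0.0):
    """Run cmd and respawn it after a failure, at most max_respawns times.

    Returns the total number of times the child was started.
    """
    starts = 0
    while True:
        starts += 1
        returncode = subprocess.Popen(cmd).wait()
        if returncode == 0:
            LOG.info("child %r exited cleanly after %d start(s)", cmd, starts)
            return starts
        LOG.error("child %r exited with code %s (start %d)",
                  cmd, returncode, starts)
        if starts > max_respawns:
            # The child keeps dying: stop retrying and leave a trace in the log.
            LOG.error("giving up on %r after %d respawn(s)", cmd, max_respawns)
            return starts
        time.sleep(delay)
```

With max_respawns=2, a child that always fails is started three times in total (one initial start plus two respawns) before the loop gives up.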

Comment 10 Miguel Angel Ajo 2014-03-28 14:59:43 UTC
Created attachment 879889 [details]
Neutron agent watch code

Comment 11 Miguel Angel Ajo 2014-03-28 15:00:15 UTC
Created attachment 879890 [details]
dhcp agent init script

Comment 12 Miguel Angel Ajo 2014-03-28 15:00:58 UTC
Created attachment 879891 [details]
l3 agent init script

Comment 13 Miguel Angel Ajo 2014-03-28 15:01:53 UTC
Created attachment 879892 [details]
agent watch config

Comment 14 Miguel Angel Ajo 2014-03-28 15:02:27 UTC
Created attachment 879893 [details]
openstack-neutron.spec patch

Comment 18 yfried 2014-04-23 06:43:09 UTC
verified on RHEL6.5

python-neutron-2013.2.3-4.el6ost.noarch
python-neutronclient-2.3.4-1.el6ost.noarch
openstack-neutron-openvswitch-2013.2.3-4.el6ost.noarch
openstack-neutron-2013.2.3-4.el6ost.noarch

behavior is as described in comment 15

However, trying to delete the "damaged" network prints to agent_watch.log:

2014-04-23 09:35:51.169 114897 ERROR root [-] Unexpected exception occurred 50 time(s)... retrying.
2014-04-23 09:35:51.169 114897 TRACE root Traceback (most recent call last):
2014-04-23 09:35:51.169 114897 TRACE root   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/excutils.py", line 62, in inner_func
2014-04-23 09:35:51.169 114897 TRACE root     return infunc(*args, **kwargs)
2014-04-23 09:35:51.169 114897 TRACE root   File "/usr/bin/neutron-agent-watch", line 627, in run
2014-04-23 09:35:51.169 114897 TRACE root     watcher.run(context)
2014-04-23 09:35:51.169 114897 TRACE root   File "/usr/bin/neutron-agent-watch", line 426, in run
2014-04-23 09:35:51.169 114897 TRACE root     self._run()  # run method implemented in child class
2014-04-23 09:35:51.169 114897 TRACE root   File "/usr/bin/neutron-agent-watch", line 515, in _run
2014-04-23 09:35:51.169 114897 TRACE root     self._remove_old_known_pidfiles(expected_pid_files)
2014-04-23 09:35:51.169 114897 TRACE root   File "/usr/bin/neutron-agent-watch", line 388, in _remove_old_known_pidfiles
2014-04-23 09:35:51.169 114897 TRACE root     self._remove_expected_pid_file(known)
2014-04-23 09:35:51.169 114897 TRACE root AttributeError: 'DhcpAgentWatcher' object has no attribute '_remove_expected_pid_file'

and
# /etc/init.d/neutron-dhcp-agent status ; echo $?
returns:
neutron-dhcp-agent (pid  20898) is running...
neutron-dhcp-agent health is not good
150

This persists even after the network and router have been deleted.

It is only resolved by restarting the agent-watch service.
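
For anyone polling this from a script, a minimal Python 3 sketch follows. The value 150 is the one observed above; 0 and 3 are the standard LSB init script status codes for "running" and "not running". Treating 0 as "healthy" is an assumption about this init script, and the function names are hypothetical.

```python
import subprocess

# Exit code seen in this report when the agent runs but its health check fails.
UNHEALTHY = 150


def describe_status(returncode):
    """Translate an init script 'status' exit code into a short description."""
    if returncode == 0:
        return "running and healthy"  # assumption: 0 implies a good health check
    if returncode == UNHEALTHY:
        return "running, health is not good"
    if returncode == 3:
        return "not running"
    return "unknown status %d" % returncode


def poll(service="neutron-dhcp-agent"):
    """Run the SysV init script status action and describe the result."""
    # Assumes the init script lives under /etc/init.d, as in the command above.
    result = subprocess.run(["/etc/init.d/" + service, "status"],
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    return describe_status(result.returncode)
```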

Comment 20 errata-xmlrpc 2014-05-29 20:18:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0516.html

