Bug 1051444
| Summary: | [neutron]: neutron-dhcp-agent and neutron-l3-agent won't respawn child processes if something goes wrong | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Miguel Angel Ajo <majopela> |
| Component: | openstack-neutron | Assignee: | Miguel Angel Ajo <mangelajo> |
| Status: | CLOSED ERRATA | QA Contact: | yfried |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.0 | CC: | breeler, chrisw, fdinitto, lpeer, majopela, yeylon |
| Target Milestone: | z4 | Keywords: | ZStream |
| Target Release: | 4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Fixed In Version: | openstack-neutron-2013.2.2-9.el6ost | Doc Type: | Bug Fix |
| Doc Text: |
Cause: The neutron-*-agent code does not detect when a child process dies, and neither respawns it nor logs an error.
Consequence: Service to a tenant network could be interrupted with no way to monitor the failure or repair it automatically.
Fix: The neutron-agent-watch service was created to watch over those child processes until an upstream solution is merged.
Result: The agent status can now be polled to determine the overall health of the agent, including the status of its child processes.
|
| Last Closed: | 2014-05-29 20:18:28 UTC | Type: | Bug |
| Bug Blocks: | 1080561 | | |
Description
Miguel Angel Ajo
2014-01-10 10:00:10 UTC
I have found nodes where dnsmasq had died for some reason, leaving the network unattended. However, if the child died once, there is a good chance that it will die again, so a respawn retry limit plus logging would make sense.
Created attachment 879889
Neutron agent watch code
Created attachment 879890
dhcp agent init script
Created attachment 879891
l3 agent init script
Created attachment 879892
agent watch config
Created attachment 879893
openstack-neutron.spec patch
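The "respawn retry limit + logging" idea from the description can be illustrated with a short sketch. This is not the attached neutron-agent-watch code; the function name `supervise` and its parameters `max_respawns` and `backoff_seconds` are hypothetical, and the sketch assumes that a zero exit status means the child stopped on purpose.

```python
import logging
import subprocess
import time

LOG = logging.getLogger("child_supervisor")


def supervise(cmd, max_respawns=3, backoff_seconds=1.0):
    """Run cmd and respawn it when it dies, up to max_respawns times.

    Hypothetical helper for illustration only. Returns the number of
    respawns that were performed before the child exited cleanly or
    the retry limit was reached.
    """
    respawns = 0
    while True:
        proc = subprocess.Popen(cmd)
        returncode = proc.wait()  # block until the child exits

        if returncode == 0:
            # Assumption: exit status 0 is a deliberate shutdown.
            LOG.info("child %s exited cleanly", cmd)
            return respawns

        # Log every death so the failure is visible to operators.
        LOG.error("child %s died with status %s", cmd, returncode)

        if respawns >= max_respawns:
            # A child that keeps dying will probably die again: stop.
            LOG.critical("child %s exceeded %d respawns, giving up",
                         cmd, max_respawns)
            return respawns

        respawns += 1
        time.sleep(backoff_seconds)  # brief pause before respawning
```

A caller might run `supervise(["dnsmasq", "--keep-in-foreground"], max_respawns=5)` in its own thread; the dnsmasq arguments here are only an example and would need to match the real agent configuration.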
Verified on RHEL6.5 with:
python-neutron-2013.2.3-4.el6ost.noarch
python-neutronclient-2.3.4-1.el6ost.noarch
openstack-neutron-openvswitch-2013.2.3-4.el6ost.noarch
openstack-neutron-2013.2.3-4.el6ost.noarch

Behavior is as described in comment 15. However, trying to delete the "damaged" network prints the following to agent_watch.log:

2014-04-23 09:35:51.169 114897 ERROR root [-] Unexpected exception occurred 50 time(s)... retrying.
2014-04-23 09:35:51.169 114897 TRACE root Traceback (most recent call last):
2014-04-23 09:35:51.169 114897 TRACE root File "/usr/lib/python2.6/site-packages/neutron/openstack/common/excutils.py", line 62, in inner_func
2014-04-23 09:35:51.169 114897 TRACE root return infunc(*args, **kwargs)
2014-04-23 09:35:51.169 114897 TRACE root File "/usr/bin/neutron-agent-watch", line 627, in run
2014-04-23 09:35:51.169 114897 TRACE root watcher.run(context)
2014-04-23 09:35:51.169 114897 TRACE root File "/usr/bin/neutron-agent-watch", line 426, in run
2014-04-23 09:35:51.169 114897 TRACE root self._run() # run method implemented in child class
2014-04-23 09:35:51.169 114897 TRACE root File "/usr/bin/neutron-agent-watch", line 515, in _run
2014-04-23 09:35:51.169 114897 TRACE root self._remove_old_known_pidfiles(expected_pid_files)
2014-04-23 09:35:51.169 114897 TRACE root File "/usr/bin/neutron-agent-watch", line 388, in _remove_old_known_pidfiles
2014-04-23 09:35:51.169 114897 TRACE root self._remove_expected_pid_file(known)
2014-04-23 09:35:51.169 114897 TRACE root AttributeError: 'DhcpAgentWatcher' object has no attribute '_remove_expected_pid_file'

and

# /etc/init.d/neutron-dhcp-agent status ; echo $?

returns:

neutron-dhcp-agent (pid 20898) is running...
neutron-dhcp-agent health is not good
150

even after the network and router have been deleted. This is only resolved after restarting the agent-watch service.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHSA-2014-0516.html
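
The traceback in the verification comment shows that the watcher tracks a set of expected pid files. As a rough illustration of how a pid-file health check of that kind can work on a POSIX system, here is a minimal sketch. It is not taken from neutron-agent-watch: the helper names `pid_is_alive`, `read_pid` and `unhealthy_pid_files` are hypothetical, and the real script may decide health differently.

```python
import errno
import os


def pid_is_alive(pid):
    """Return True if a process with this pid exists (POSIX only).

    Signal 0 performs the existence and permission checks without
    actually delivering a signal to the process.
    """
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except OSError as exc:
        # EPERM: the process exists but belongs to another user.
        return exc.errno == errno.EPERM
    return True


def read_pid(pid_file):
    """Return the pid stored in pid_file, or None if missing or invalid."""
    try:
        with open(pid_file) as handle:
            return int(handle.read().strip())
    except (IOError, OSError, ValueError):
        return None


def unhealthy_pid_files(expected_pid_files):
    """Return the expected pid files whose process is missing or dead."""
    return [path for path in expected_pid_files
            if not pid_is_alive(read_pid(path) or 0)]
```

A status command built on this idea could report a non-zero exit code whenever `unhealthy_pid_files(...)` returns a non-empty list. Entries for deleted networks would have to be dropped from the expected set first, which is the step that failed in the traceback above.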