Bug 1106489

Summary: neutron-*-agent child processes can die unnoticed
Product: Red Hat OpenStack Reporter: Miguel Angel Ajo <majopela>
Component: openstack-neutron Assignee: Miguel Angel Ajo <majopela>
Status: CLOSED ERRATA QA Contact: Ofer Blaut <oblaut>
Severity: medium Docs Contact:
Priority: high    
Version: 5.0 (RHEL 7) CC: chrisw, dron, lpeer, nyechiel, sclewis, stoner, yeylon
Target Milestone: z2 Keywords: Regression, ZStream
Target Release: 5.0 (RHEL 7)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-neutron-2014.1.3-4.el7ost Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-11-03 08:38:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1065172, 1106457    
Bug Blocks: 1083890    

Description Miguel Angel Ajo 2014-06-09 12:54:40 UTC
Description of problem:

  If a neutron-*-agent child process dies, the agent won't notice. On RHEL 6 we have neutron-agent-watch to handle this, but it can't be used with systemd.

  I'm pushing this implementation to oslo (https://review.openstack.org/#/c/97748/) to get systemd reporting back to neutron, but systemd appears to have ERRNO handling and reporting over NOTIFY_SOCKET unimplemented (bz#1106457).
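
  For readers unfamiliar with the NOTIFY_SOCKET mechanism mentioned above, here is a minimal sketch of how a service can send a state string to systemd. It is not the code from the oslo review. The function name sd_notify and the example message are assumptions made for illustration; the datagram socket, the NOTIFY_SOCKET environment variable, and the STATUS=/ERRNO= fields follow the documented sd_notify protocol.

```python
import os
import socket


def sd_notify(state: str) -> bool:
    """Send a state string to the socket named by NOTIFY_SOCKET.

    Returns False when NOTIFY_SOCKET is unset (for example, when the
    process was not started by systemd), True once the datagram is sent.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    addr = path
    # A leading "@" denotes a Linux abstract-namespace socket, which is
    # addressed with a leading NUL byte instead.
    if path.startswith("@"):
        addr = b"\0" + path[1:].encode("utf-8")
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode("utf-8"), addr)
    return True
```

  An agent that detects a dead child could then call something like sd_notify("STATUS=dnsmasq died\nERRNO=3"). Whether systemd acts on the ERRNO field is exactly what bz#1106457 is about.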

Version-Release number of selected component (if applicable):
openstack-neutron-2014.1-26.el7ost.noarch

How reproducible:

100%

Steps to Reproduce:
1. Log in to the neutron network node
2. killall dnsmasq

Actual results:

neutron-dhcp-agent doesn't notice that dnsmasq has died until the affected networks change and the dnsmasq child process is restarted to pick up a new configuration for a tenant network.

Expected results:

The neutron-*-agent reports an error condition via systemctl status, or quits.

Additional info:

Comment 4 Miguel Angel Ajo 2014-10-09 09:14:25 UTC
How to test this:

1) With a working deployment, modify l3_agent.ini and dhcp_agent.ini to include:

check_child_processes_action = respawn
check_child_processes_interval = 5

2) Restart the l3 and dhcp agents.

3) Spawn resources (a VM connected to a private tenant network)

4) tail -f /var/log/neutron/dhcp_agent.log & \
   tail -f /var/log/neutron/l3_agent.log &

5) sudo killall dnsmasq

You should then see something like:

2014-10-09 04:31:46.434 9651 ERROR neutron.agent.linux.external_process [-] dnsmasq for dhcp with uuid 67f3c1d9-5861-4466-899f-f166aa97a173 not found. The process should not have died
2014-10-09 04:31:46.434 9651 ERROR neutron.agent.linux.external_process [-] respawning dnsmasq for uuid 67f3c1d9-5861-4466-899f-f166aa97a173


6) sudo killall neutron-ns-metadata-proxy

You should see something like:

2014-10-09 04:33:06.564 9656 ERROR neutron.agent.linux.external_process [-] default-service for router with uuid a539a2f8-a6ec-41d1-91b0-bf2ca780b644 not found. The process should not have died
2014-10-09 04:33:06.564 9656 ERROR neutron.agent.linux.external_process [-] respawning None for uuid a539a2f8-a6ec-41d1-91b0-bf2ca780b644


7) Modify l3_agent.ini and dhcp_agent.ini to include:

check_child_processes_action = exit
check_child_processes_interval = 5

8) Repeat 4-6; in this case the agent should exit.

9) Repeat all of the above with check_child_processes_interval = 0. Nothing should happen: no service is restarted automatically and no message is logged.
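
The behaviour these steps exercise can be sketched as follows. This is an illustrative model only, not neutron's external_process module: the class name ChildMonitor and its methods are hypothetical, and it tracks children through subprocess.Popen rather than however neutron locates dnsmasq or neutron-ns-metadata-proxy. Only the two option names, their values (respawn, exit, interval 0) and the observed outcomes come from the steps above.

```python
import subprocess
import sys
import time


class ChildMonitor:
    """Hypothetical monitor that models the two config options."""

    def __init__(self, action="respawn", interval=5):
        self.action = action      # check_child_processes_action
        self.interval = interval  # check_child_processes_interval (seconds)
        self.children = {}        # uuid -> (argv, Popen)

    def spawn(self, uuid, argv):
        self.children[uuid] = (argv, subprocess.Popen(argv))

    def check_once(self):
        """Handle dead children once; return the uuids found dead."""
        dead = []
        if self.interval == 0:
            return dead  # interval 0: checking is disabled
        for uuid, (argv, proc) in list(self.children.items()):
            if proc.poll() is None:
                continue  # child is still running
            dead.append(uuid)
            if self.action == "respawn":
                self.spawn(uuid, argv)  # start a replacement
            elif self.action == "exit":
                sys.exit(1)  # let the service manager see the failure
        return dead

    def run(self):
        """Check periodically; returns immediately when interval is 0."""
        while self.interval > 0:
            time.sleep(self.interval)
            self.check_once()
```

With action="respawn" a killed child is replaced on the next check, with action="exit" the monitoring process itself terminates, and with interval=0 nothing is checked at all, which matches the three outcomes described in steps 5-9.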

Comment 7 Sean Toner 2014-10-09 15:52:25 UTC
In between step 7 and 8, it should say to restart the neutron-l3-agent and neutron-dhcp-agent.

Otherwise, I ran through these steps and verified the expected behavior.

Comment 8 Miguel Angel Ajo 2014-10-10 08:26:29 UTC
(In reply to Sean Toner from comment #7)
> In between step 7 and 8, it should say to restart the neutron-l3-agent and
> neutron-dhcp-agent.
> 
> Otherwise, I ran through these steps and verified the expected behavior.

Correct, I forgot to mention that step.

Thank you for testing!

Comment 10 errata-xmlrpc 2014-11-03 08:38:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2014-1786.html