Description of problem:

I am seeing a lot of dnsmasq processes being killed on my neutron dhcp-agent with zero context. The log file just has:

---
2014-03-12 11:19:16.058 4442 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'kill', '-9', '3907'] create_process /usr/lib/python2.6/site-packages/neutron/agent/linux/utils.py:47
---

In an effort to see what was going on, I put a debug traceback in before the kill. Here's a log extract:

---
2014-03-12 11:42:05.473 4442 DEBUG neutron.agent.linux.dhcp [-] [' File "/usr/lib/python2.6/site-packages/eventlet/greenpool.py", line 80, in _spawn_n_impl\n func(*args, **kwargs)\n', ' File "/usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py", line 196, in safe_configure_dhcp_for_network\n self.configure_dhcp_for_network(network)\n', ' File "/usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py", line 207, in configure_dhcp_for_network\n if self.call_driver(\'enable\', network):\n', ' File "/usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py", line 128, in call_driver\n getattr(driver, action)(**action_kwargs)\n', ' File "/usr/lib/python2.6/site-packages/neutron/agent/linux/dhcp.py", line 170, in enable\n self.restart()\n', ' File "/usr/lib/python2.6/site-packages/neutron/agent/linux/dhcp.py", line 126, in restart\n self.disable(retain_port=True)\n', ' File "/usr/lib/python2.6/site-packages/neutron/agent/linux/dhcp.py", line 180, in disable\n LOG.debug(traceback.format_stack())\n'] disable /usr/lib/python2.6/site-packages/neutron/agent/linux/dhcp.py:180
2014-03-12 11:42:05.473 4442 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'kill', '-9', '8614'] create_process /usr/lib/python2.6/site-packages/neutron/agent/linux/utils.py:47
---

It seems the path [1] is: _periodic_resync_helper calls sync_state(), which calls safe_configure_dhcp_for_network(), which ends up calling self.call_driver('enable', network), which ends up in [2], which restarts the network, which means killing and restarting dnsmasq.

Is this really the intention, to force a hard kill like this? Watching my logs, it was called at least 7 times within 20 seconds. It's doing *a lot* of restarting...

---
Wed Mar 12 11:51:20 UTC 2014
[root@host03 neutron]# cat dhcp-agent.log | grep -v qpid | grep traceback | wc -l
603
[root@host03 neutron]# date
Wed Mar 12 11:51:40 UTC 2014
[root@host03 neutron]# cat dhcp-agent.log | grep -v qpid | grep traceback | wc -l
610
---

[1] https://github.com/openstack/neutron/blob/master/neutron/agent/dhcp_agent.py#L173
[2] https://github.com/openstack/neutron/blob/master/neutron/agent/linux/dhcp.py#L163

Version-Release number of selected component (if applicable):

[root@host03 neutron]# rpm -qa | grep openstack-neutron
openstack-neutron-openvswitch-2013.2.2-5.el6ost.noarch
openstack-neutron-2013.2.2-5.el6ost.noarch
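To make the resync path above easier to follow, here is a condensed, hypothetical sketch of the chain the traceback shows. The class and method names mirror the traceback; the bodies are illustrative stand-ins, not the real neutron code:

---
# Hypothetical, condensed sketch of the call chain in the traceback above.
# Method names mirror the traceback; bodies are paraphrased stand-ins.

class FakeDnsmasqDriver(object):
    """Stands in for neutron.agent.linux.dhcp.Dnsmasq."""

    def __init__(self, pid):
        self.pid = pid
        self.active = True

    def enable(self, network):
        if self.active:
            self.restart()                 # dhcp.py:170 -> restart()

    def restart(self):
        self.disable(retain_port=True)     # dhcp.py:126 -> disable()
        self.spawn_process()

    def disable(self, retain_port=False):
        # In neutron this ends up as:
        #   sudo neutron-rootwrap /etc/neutron/rootwrap.conf kill -9 <pid>
        print('would run: kill -9 %s' % self.pid)

    def spawn_process(self):
        print('would start a fresh dnsmasq for the network')


class FakeDhcpAgent(object):
    """Stands in for neutron.agent.dhcp_agent.DhcpAgent."""

    def __init__(self, driver, networks):
        self.driver = driver
        self.networks = networks
        self.needs_resync = False

    def sync_state(self):
        # called from _periodic_resync_helper whenever needs_resync is set
        for network in self.networks:
            self.safe_configure_dhcp_for_network(network)

    def safe_configure_dhcp_for_network(self, network):
        try:
            self.configure_dhcp_for_network(network)
        except Exception:
            self.needs_resync = True       # any failure re-arms the next resync

    def configure_dhcp_for_network(self, network):
        self.driver.enable(network)        # dhcp_agent.py:128 call_driver('enable', ...)


agent = FakeDhcpAgent(FakeDnsmasqDriver(pid=3907), networks=['net-a'])
agent.sync_state()   # every resync pass hard-kills and respawns dnsmasq
---

The point being: 'enable' on an already-active network goes through restart() and disable(), and disable() is implemented as a kill -9, so every resync pass hard-kills dnsmasq.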
Further investigation has revealed several things.

bug#1077487 shows an issue with an IPv6 network: the failure of dnsmasq when enabling this network triggers the resync logic, which tries to resync the IPv6 network, which fails again, which triggers another resync ... and so on.

The other thing causing resyncs is some sort of DB error:

---
2014-03-18 00:42:34.052 20326 ERROR neutron.agent.dhcp_agent [-] Unable to sync network state (Remote error: DBDuplicateEntry (IntegrityError) (1062, "Duplicate entry 'ccbc77be-c388-4f3a-b4ee-8ae9cba59997-d3e3ac5b-9962-428a-a9f8-6b2' for key 'PRIMARY'") 'INSERT INTO networkdhcpagentbindings (network_id, dhcp_agent_id) VALUES (%s, %s)' ('ccbc77be-c388-4f3a-b4ee-8ae9cba59997', 'd3e3ac5b-9962-428a-a9f8-6b2a1aba48d8')
---

I still need to look into that.
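For reference, a minimal hypothetical sketch (sqlite3, not the real neutron schema or code) of what that error means. The duplicate key value in the log is both IDs joined together, i.e. the binding table's primary key is the (network_id, dhcp_agent_id) pair, so scheduling the same network onto the same agent twice makes the INSERT fail unless the caller treats it as "already bound":

---
# Hypothetical sqlite3 illustration of the duplicate-entry failure above.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE networkdhcpagentbindings (
                    network_id TEXT,
                    dhcp_agent_id TEXT,
                    PRIMARY KEY (network_id, dhcp_agent_id))""")

def bind(network_id, agent_id):
    try:
        conn.execute("INSERT INTO networkdhcpagentbindings VALUES (?, ?)",
                     (network_id, agent_id))
    except sqlite3.IntegrityError as exc:
        # sqlite's analogue of the MySQL 1062 error in the log; treating it
        # as "binding already exists" keeps a retried/concurrent schedule
        # from bubbling up as a sync failure that forces another resync.
        print('duplicate binding ignored: %s' % exc)

bind('ccbc77be-c388-4f3a-b4ee-8ae9cba59997', 'd3e3ac5b-9962-428a-a9f8-6b2a1aba48d8')
bind('ccbc77be-c388-4f3a-b4ee-8ae9cba59997', 'd3e3ac5b-9962-428a-a9f8-6b2a1aba48d8')
---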
Symptoms look like a mix of bz#1077484 (dnsmasq IPv6) and bz#1076994 (full /var). The duplicate key problem is another known neutron-server bug with race conditions on agent registration, which can happen for two reasons: 1) duplicated agents on the same host, and 2) another bug, fixed in A3, related to AMQP that affected setups running multiple neutron-servers (they all got the same AMQP message around the same time, sending them into a race condition against a DB table that has no UNIQUE constraint).

I will mark this as a duplicate of 1076994, which is probably the main reason for the pain here, but feel free to re-open if you see it again.

*** This bug has been marked as a duplicate of bug 1076994 ***
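For anyone landing here later, a hypothetical sketch (sqlite3, not the neutron schema or code) of the check-then-insert registration race mentioned above: two servers handle the same agent report at about the same time, both see no existing row, and both insert, because no UNIQUE constraint at the DB level stops the second one:

---
# Hypothetical illustration of a check-then-insert race on agent registration.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE agents (host TEXT, agent_type TEXT)")   # no UNIQUE constraint

def agent_exists(host, agent_type):
    row = conn.execute("SELECT 1 FROM agents WHERE host=? AND agent_type=?",
                       (host, agent_type)).fetchone()
    return row is not None

# Both servers evaluate the check before either has inserted:
server_a_sees_row = agent_exists('host03', 'DHCP agent')   # False
server_b_sees_row = agent_exists('host03', 'DHCP agent')   # False

if not server_a_sees_row:
    conn.execute("INSERT INTO agents VALUES ('host03', 'DHCP agent')")
if not server_b_sees_row:
    conn.execute("INSERT INTO agents VALUES ('host03', 'DHCP agent')")   # duplicate row

print(conn.execute("SELECT COUNT(*) FROM agents").fetchone()[0])    # prints 2
---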