Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1075570

Summary: Many dnsmasq kills in dhcp-agent log file
Product: Red Hat OpenStack
Reporter: Ian Wienand <iwienand>
Component: openstack-neutron
Assignee: RHOS Maint <rhos-maint>
Status: CLOSED DUPLICATE
QA Contact: Ofer Blaut <oblaut>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 4.0
CC: chrisw, majopela, nyechiel, yeylon
Target Milestone: ---
Target Release: 5.0 (RHEL 7)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-05-29 08:24:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1077487
Bug Blocks:

Description Ian Wienand 2014-03-12 11:55:53 UTC
Description of problem:

I am seeing a lot of dnsmasq processes being killed by my neutron dhcp-agent with zero context.  The log file just has

---
2014-03-12 11:19:16.058 4442 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'kill', '-9', '3907'] create_process /usr/lib/python2.6/site-packages/neutron/agent/linux/utils.py:47
---

In an effort to see what was going on, I put in a debug traceback before the kill.

Here's a log extract:

---

2014-03-12 11:42:05.473 4442 DEBUG neutron.agent.linux.dhcp [-] ['  File "/usr/lib/python2.6/site-packages/eventlet/greenpool.py", line 80, in _spawn_n_impl\n    func(*args, **kwargs)\n', '  File "/usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py", line 196, in safe_configure_dhcp_for_network\n    self.configure_dhcp_for_network(network)\n', '  File "/usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py", line 207, in configure_dhcp_for_network\n    if self.call_driver(\'enable\', network):\n', '  File "/usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py", line 128, in call_driver\n    getattr(driver, action)(**action_kwargs)\n', '  File "/usr/lib/python2.6/site-packages/neutron/agent/linux/dhcp.py", line 170, in enable\n    self.restart()\n', '  File "/usr/lib/python2.6/site-packages/neutron/agent/linux/dhcp.py", line 126, in restart\n    self.disable(retain_port=True)\n', '  File "/usr/lib/python2.6/site-packages/neutron/agent/linux/dhcp.py", line 180, in disable\n    LOG.debug(traceback.format_stack())\n'] disable /usr/lib/python2.6/site-packages/neutron/agent/linux/dhcp.py:180
2014-03-12 11:42:05.473 4442 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'kill', '-9', '8614'] create_process /usr/lib/python2.6/site-packages/neutron/agent/linux/utils.py:47
---

It seems the path is: _periodic_resync_helper() calls sync_state(), which calls safe_configure_dhcp_for_network() [1], which ends up calling self.call_driver('enable', network); that lands in [2], which restarts the network, which means killing and restarting dnsmasq.

Is this really the intention, to force a hard kill like this?  Watching my logs, it was called at least 7 times within 20 seconds.  It's doing *a lot* of restarting...
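To make the effect of that path concrete, here is a toy sketch of the behaviour described above (assumed names only; `DnsmasqDriver`, `sync_state`, and the kill counter are illustrative stand-ins, not the actual neutron source): any resync that 'enables' an already-running network goes through restart(), which hard-kills dnsmasq first.

```python
# Toy sketch of the resync path described above (assumed names,
# not the actual neutron source).

class DnsmasqDriver:
    """Stand-in for neutron.agent.linux.dhcp.Dnsmasq."""

    def __init__(self):
        self.kills = 0          # count of 'kill -9' commands issued
        self.running = False

    def enable(self):
        if self.running:
            self.restart()      # already active -> full restart
        else:
            self.spawn()

    def restart(self):
        # restart == disable (hard kill) followed by respawn
        self.disable(retain_port=True)
        self.spawn()

    def disable(self, retain_port=False):
        # the real agent shells out here:
        #   sudo neutron-rootwrap /etc/neutron/rootwrap.conf kill -9 <pid>
        self.kills += 1
        self.running = False

    def spawn(self):
        self.running = True


def sync_state(driver, networks):
    """Stand-in for DhcpAgent.sync_state(): enables every network."""
    for _ in networks:
        driver.enable()


driver = DnsmasqDriver()
driver.spawn()                   # dnsmasq already up for the network
for _ in range(7):               # seven resyncs in quick succession...
    sync_state(driver, ["net-1"])
print(driver.kills)              # ...means seven hard kills of dnsmasq
```

The point of the sketch is just that every resync of an active network implies a `kill -9`, so a resync loop turns directly into restart churn.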

---
Wed Mar 12 11:51:20 UTC 2014
[root@host03 neutron]# cat dhcp-agent.log | grep -v qpid | grep  traceback | wc -l
603
[root@host03 neutron]# date
Wed Mar 12 11:51:40 UTC 2014
[root@host03 neutron]# cat dhcp-agent.log | grep -v qpid | grep  traceback | wc -l
610
---

[1] https://github.com/openstack/neutron/blob/master/neutron/agent/dhcp_agent.py#L173
[2] https://github.com/openstack/neutron/blob/master/neutron/agent/linux/dhcp.py#L163

Version-Release number of selected component (if applicable):

[root@host03 neutron]# rpm -qa | grep openstack-neutron
openstack-neutron-openvswitch-2013.2.2-5.el6ost.noarch
openstack-neutron-2013.2.2-5.el6ost.noarch

Comment 2 Ian Wienand 2014-03-18 20:27:37 UTC
Further investigation has revealed several things.

Bug #1077487 shows an issue with an IPv6 network: the failure of dnsmasq when enabling this network results in the resync logic being triggered, which tries to resync the IPv6 network, which fails, which triggers another resync ... and so on.

The other thing causing resyncs is some sort of DB error:

---
2014-03-18 00:42:34.052 20326 ERROR neutron.agent.dhcp_agent [-] Unable to sync network state (Remote error: DBDuplicateEntry (IntegrityError) (1062, "Duplicate entry 'ccbc77be-c388-4f3a-b4ee-8ae9cba59997-d3e3ac5b-9962-428a-a9f8-6b2' for key 'PRIMARY'") 'INSERT INTO networkdhcpagentbindings (network_id, dhcp_agent_id) VALUES (%s, %s)' ('ccbc77be-c388-4f3a-b4ee-8ae9cba59997', 'd3e3ac5b-9962-428a-a9f8-6b2a1aba48d8')
---

I still need to look into that.

Comment 3 Miguel Angel Ajo 2014-05-29 08:24:09 UTC
Symptoms look like a mix of bz#1077484 (dnsmasq IPv6) with bz#1076994 (full var).

The duplicate key problem is also another known bug of neutron-server: race conditions on agent registration, which could happen for two reasons: 
1) duplicated agents on the same host, and 
2) another bug, fixed in A3, related to AMQP, which affected multiple running neutron-servers (they all got the same AMQP message around the same time, sending them into a race condition against a DB table that lacks a UNIQUE constraint).
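The check-then-insert race described above can be reproduced in miniature (a hedged sketch: the table shape is taken from the error message in comment 2, but `register_binding` and the single-connection setup are illustrative assumptions, not neutron's real DB layer):

```python
import sqlite3

# In-memory stand-in for the table named in the DBDuplicateEntry error.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE networkdhcpagentbindings (
                  network_id TEXT,
                  dhcp_agent_id TEXT,
                  PRIMARY KEY (network_id, dhcp_agent_id))""")

binding = ("ccbc77be-c388-4f3a-b4ee-8ae9cba59997",
           "d3e3ac5b-9962-428a-a9f8-6b2a1aba48d8")

def register_binding(network_id, agent_id):
    """Hypothetical check-then-insert, mimicking the racy registration."""
    exists = db.execute(
        "SELECT 1 FROM networkdhcpagentbindings "
        "WHERE network_id = ? AND dhcp_agent_id = ?",
        (network_id, agent_id)).fetchone()
    if exists is None:
        db.execute("INSERT INTO networkdhcpagentbindings VALUES (?, ?)",
                   (network_id, agent_id))

# Server A registers the binding normally:
register_binding(*binding)

# Server B handled the same AMQP message and passed its SELECT before
# server A committed, so it proceeds straight to the INSERT and trips
# the primary key -- the analogue of the DBDuplicateEntry in the log:
err = None
try:
    db.execute("INSERT INTO networkdhcpagentbindings VALUES (?, ?)", binding)
except sqlite3.IntegrityError as exc:
    err = exc
print(type(err).__name__)
```

The fix direction is the usual one for this pattern: let the database enforce uniqueness and treat the duplicate-entry error as "already registered", rather than relying on the SELECT.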

I will mark this as a duplicate of bug 1076994, which is probably the main cause of the pain here, but feel free to re-open if you see it again.

*** This bug has been marked as a duplicate of bug 1076994 ***