Description of problem:

I am seeing a lot of dnsmasq processes being killed on my neutron dhcp-agent with zero context. The log file just has:

---
2014-03-12 11:19:16.058 4442 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'kill', '-9', '3907'] create_process /usr/lib/python2.6/site-packages/neutron/agent/linux/utils.py:47
---

In an effort to see what was going on, I put a debug traceback in before the kill. Here's a log extract:

---
2014-03-12 11:42:05.473 4442 DEBUG neutron.agent.linux.dhcp [-] [' File "/usr/lib/python2.6/site-packages/eventlet/greenpool.py", line 80, in _spawn_n_impl\n func(*args, **kwargs)\n', ' File "/usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py", line 196, in safe_configure_dhcp_for_network\n self.configure_dhcp_for_network(network)\n', ' File "/usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py", line 207, in configure_dhcp_for_network\n if self.call_driver(\'enable\', network):\n', ' File "/usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py", line 128, in call_driver\n getattr(driver, action)(**action_kwargs)\n', ' File "/usr/lib/python2.6/site-packages/neutron/agent/linux/dhcp.py", line 170, in enable\n self.restart()\n', ' File "/usr/lib/python2.6/site-packages/neutron/agent/linux/dhcp.py", line 126, in restart\n self.disable(retain_port=True)\n', ' File "/usr/lib/python2.6/site-packages/neutron/agent/linux/dhcp.py", line 180, in disable\n LOG.debug(traceback.format_stack())\n'] disable /usr/lib/python2.6/site-packages/neutron/agent/linux/dhcp.py:180
2014-03-12 11:42:05.473 4442 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'kill', '-9', '8614'] create_process /usr/lib/python2.6/site-packages/neutron/agent/linux/utils.py:47
---

It seems the path [1] is: _periodic_resync_helper calls sync_state(), which calls safe_configure_dhcp_for_network(), which ends up calling self.call_driver('enable', network), which ends up in [2], which restarts the network, which means killing and restarting dnsmasq.

Is this really the intention, to force a hard kill like this? Watching my logs, it was called at least 7 times within 20 seconds. It's doing *a lot* of restarting...

---
Wed Mar 12 11:51:20 UTC 2014
[root@host03 neutron]# cat dhcp-agent.log | grep -v qpid | grep traceback | wc -l
603
[root@host03 neutron]# date
Wed Mar 12 11:51:40 UTC 2014
[root@host03 neutron]# cat dhcp-agent.log | grep -v qpid | grep traceback | wc -l
610
---

[1] https://github.com/openstack/neutron/blob/master/neutron/agent/dhcp_agent.py#L173
[2] https://github.com/openstack/neutron/blob/master/neutron/agent/linux/dhcp.py#L163

Version-Release number of selected component (if applicable):

[root@host03 neutron]# rpm -qa | grep openstack-neutron
openstack-neutron-openvswitch-2013.2.2-5.el6ost.noarch
openstack-neutron-2013.2.2-5.el6ost.noarch
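To make the resync path above easier to follow, here is a condensed, hypothetical sketch of the chain the traceback shows. The class and method names mirror the traceback; the bodies are illustrative stand-ins, not the real neutron code:

---
# Hypothetical, condensed sketch of the call chain in the traceback above.
# Method names mirror the traceback; bodies are paraphrased stand-ins.

class FakeDnsmasqDriver(object):
    """Stands in for neutron.agent.linux.dhcp.Dnsmasq."""

    def __init__(self, pid):
        self.pid = pid
        self.active = True

    def enable(self, network):
        if self.active:
            self.restart()                 # dhcp.py:170 -> restart()

    def restart(self):
        self.disable(retain_port=True)     # dhcp.py:126 -> disable()
        self.spawn_process()

    def disable(self, retain_port=False):
        # In neutron this ends up as:
        #   sudo neutron-rootwrap /etc/neutron/rootwrap.conf kill -9 <pid>
        print('would run: kill -9 %s' % self.pid)

    def spawn_process(self):
        print('would start a fresh dnsmasq for the network')


class FakeDhcpAgent(object):
    """Stands in for neutron.agent.dhcp_agent.DhcpAgent."""

    def __init__(self, driver, networks):
        self.driver = driver
        self.networks = networks
        self.needs_resync = False

    def sync_state(self):
        # called from _periodic_resync_helper whenever needs_resync is set
        for network in self.networks:
            self.safe_configure_dhcp_for_network(network)

    def safe_configure_dhcp_for_network(self, network):
        try:
            self.configure_dhcp_for_network(network)
        except Exception:
            self.needs_resync = True       # any failure re-arms the next resync

    def configure_dhcp_for_network(self, network):
        self.driver.enable(network)        # dhcp_agent.py:128 call_driver('enable', ...)


agent = FakeDhcpAgent(FakeDnsmasqDriver(pid=3907), networks=['net-a'])
agent.sync_state()   # every resync pass hard-kills and respawns dnsmasq
---

The point being: 'enable' on an already-active network goes through restart() and disable(), and disable() is implemented as a kill -9, so every resync pass hard-kills dnsmasq.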
Further investigation has revealed several things.

bug#1077487 shows an issue with an IPv6 network: the failure of dnsmasq when enabling this network triggers the resync logic, which tries to resync the IPv6 network, which fails again, which triggers another resync ... and so on.

The other thing causing resyncs is some sort of DB error:

---
2014-03-18 00:42:34.052 20326 ERROR neutron.agent.dhcp_agent [-] Unable to sync network state (Remote error: DBDuplicateEntry (IntegrityError) (1062, "Duplicate entry 'ccbc77be-c388-4f3a-b4ee-8ae9cba59997-d3e3ac5b-9962-428a-a9f8-6b2' for key 'PRIMARY'") 'INSERT INTO networkdhcpagentbindings (network_id, dhcp_agent_id) VALUES (%s, %s)' ('ccbc77be-c388-4f3a-b4ee-8ae9cba59997', 'd3e3ac5b-9962-428a-a9f8-6b2a1aba48d8')
---

I still need to look into that.
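For reference, a minimal hypothetical sketch (sqlite3, not the real neutron schema or code) of what that error means. The duplicate key value in the log is both IDs joined together, i.e. the binding table's primary key is the (network_id, dhcp_agent_id) pair, so scheduling the same network onto the same agent twice makes the INSERT fail unless the caller treats it as "already bound":

---
# Hypothetical sqlite3 illustration of the duplicate-entry failure above.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE networkdhcpagentbindings (
                    network_id TEXT,
                    dhcp_agent_id TEXT,
                    PRIMARY KEY (network_id, dhcp_agent_id))""")

def bind(network_id, agent_id):
    try:
        conn.execute("INSERT INTO networkdhcpagentbindings VALUES (?, ?)",
                     (network_id, agent_id))
    except sqlite3.IntegrityError as exc:
        # sqlite's analogue of the MySQL 1062 error in the log; treating it
        # as "binding already exists" keeps a retried/concurrent schedule
        # from bubbling up as a sync failure that forces another resync.
        print('duplicate binding ignored: %s' % exc)

bind('ccbc77be-c388-4f3a-b4ee-8ae9cba59997', 'd3e3ac5b-9962-428a-a9f8-6b2a1aba48d8')
bind('ccbc77be-c388-4f3a-b4ee-8ae9cba59997', 'd3e3ac5b-9962-428a-a9f8-6b2a1aba48d8')
---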
Symptoms look like a mix of bz#1077484 (dnsmasq IPv6) and bz#1076994 (full /var). The duplicate key problem is another known neutron-server bug with race conditions on agent registration, which can happen for two reasons: 1) duplicated agents on the same host, and 2) another bug, fixed in A3, related to AMQP that affected setups running multiple neutron-servers (they all got the same AMQP message around the same time, sending them into a race condition against a DB table that has no UNIQUE constraint).

I will mark this as a duplicate of 1076994, which is probably the main reason for the pain here, but feel free to re-open if you see it again.

*** This bug has been marked as a duplicate of bug 1076994 ***
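For anyone landing here later, a hypothetical sketch (sqlite3, not the neutron schema or code) of the check-then-insert registration race mentioned above: two servers handle the same agent report at about the same time, both see no existing row, and both insert, because no UNIQUE constraint at the DB level stops the second one:

---
# Hypothetical illustration of a check-then-insert race on agent registration.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE agents (host TEXT, agent_type TEXT)")   # no UNIQUE constraint

def agent_exists(host, agent_type):
    row = conn.execute("SELECT 1 FROM agents WHERE host=? AND agent_type=?",
                       (host, agent_type)).fetchone()
    return row is not None

# Both servers evaluate the check before either has inserted:
server_a_sees_row = agent_exists('host03', 'DHCP agent')   # False
server_b_sees_row = agent_exists('host03', 'DHCP agent')   # False

if not server_a_sees_row:
    conn.execute("INSERT INTO agents VALUES ('host03', 'DHCP agent')")
if not server_b_sees_row:
    conn.execute("INSERT INTO agents VALUES ('host03', 'DHCP agent')")   # duplicate row

print(conn.execute("SELECT COUNT(*) FROM agents").fetchone()[0])    # prints 2
---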