Bug 1112068 - wrong gateway_ip setting in subnet constantly resyncs dhcp-agent, causing DHCP NAK dropping connection in guests
Summary: wrong gateway_ip setting in subnet constantly resyncs dhcp-agent, causing DHC...
Keywords:
Status: CLOSED DUPLICATE of bug 1090553
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 4.0
Hardware: Unspecified
OS: Unspecified
medium
high
Target Milestone: ---
: 5.0 (RHEL 7)
Assignee: RHOS Maint
QA Contact: Ofer Blaut
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2014-06-23 05:54 UTC by Ian Wienand
Modified: 2016-04-26 14:47 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-06-23 08:33:50 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Ian Wienand 2014-06-23 05:54:39 UTC
Description of problem:

I have a CI test running that does a lot of git clones, pypi pulls, etc.  It has recently started failing randomly.

Various things will fail like this.  The symptom is usually something like

---
2014-06-23 07:54:43.760 | + timeout=300
2014-06-23 07:54:43.760 | + timeout -s SIGINT 300 git clone git://git.openstack.org/openstack/python-keystoneclient.git /opt/stack/python-keystoneclient
2014-06-23 07:54:43.765 | fatal: Unable to look up git.openstack.org (port 9418) (Name or service not known)
2014-06-23 07:54:43.766 | Initialized empty Git repository in /opt/stack/python-keystoneclient/.git/
---

In an effort figure this out, I ran the following test

---
[stack@host-192-168-0-2 tmp]$ cat test.sh
#!/bin/bash

while [ 1 ]
do
	sudo tcpdump -i eth0 -w /tmp/trace.log port not 22&
	pid=$!
	echo $pid
	sleep 2 

	rm -rf python-keystoneclient
	strace -f -o /tmp/trace.txt git clone git://git.openstack.org/openstack/python-keystoneclient.git 
	s=$?
	sleep 2
	sudo kill -INT $pid

	if [[ $s -ne 0 ]]; then
		exit
	fi

	sudo rm /tmp/trace.txt
	sudo rm /tmp/trace.log

done
---

eventually this failed.

---
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
Initialized empty Git repository in /tmp/python-keystoneclient/.git/
fatal: Unable to look up git.openstack.org (port 9418) (Name or service not known)
10 packets captured
10 packets received by filter
0 packets dropped by kernel
---

What I found was the git process trying to do the lookup for git.openstack.org and getting connection refused from the network

---
22597 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
22597 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.16.17.200")}, 16) = -1 ENETUNREACH (Network is unreachable)
22597 close(3)                          = 0
22597 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
22597 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.16.17.200")}, 16) = -1 ENETUNREACH (Network is unreachable)
22597 close(3)                          = 0
22597 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
22597 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.16.17.200")}, 16) = -1 ENETUNREACH (Network is unreachable)
22597 close(3)                          = 0
22597 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
22597 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.16.17.200")}, 16) = -1 ENETUNREACH (Network is unreachable)
22597 close(3)                          = 0
22597 write(2, "fatal: Unable to look up git.ope"..., 83) = 83
---

10.16.17.200 is the dns server

tcpdump during this time only capture 10 packets; all relating to a DHCP transaction

---
[iwienand@fedora19 bug2]$ tcpdump -v -r trace.out.txt.pcap
reading from file trace.out.txt.pcap, link-type EN10MB (Ethernet)
19:23:16.013752 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 328)
    192.168.0.2.bootpc > 192.168.0.32.bootps: BOOTP/DHCP, Request from fa:16:3e:74:e6:5c (oui Unknown), length 300, xid 0xfa7e0771, Flags [none]
	  Client-IP 192.168.0.2
	  Client-Ethernet-Address fa:16:3e:74:e6:5c (oui Unknown)
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message Option 53, length 1: Request
	    Parameter-Request Option 55, length 13: 
	      Subnet-Mask, BR, Time-Zone, Classless-Static-Route
	      Domain-Name, Domain-Name-Server, Hostname, YD
	      YS, NTP, MTU, Option 119
	      Default-Gateway
19:23:16.015167 IP (tos 0x0, ttl 64, id 13492, offset 0, flags [none], proto UDP (17), length 328)
    192.168.0.32.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 300, xid 0xfa7e0771, Flags [Broadcast]
	  Client-Ethernet-Address fa:16:3e:74:e6:5c (oui Unknown)
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message Option 53, length 1: NACK
	    Server-ID Option 54, length 4: 192.168.0.32
	    MSG Option 56, length 15: "lease not found"
19:23:16.073829 IP (tos 0x10, ttl 128, id 0, offset 0, flags [none], proto UDP (17), length 328)
    0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from fa:16:3e:74:e6:5c (oui Unknown), length 300, xid 0x66016b4b, Flags [none]
	  Client-Ethernet-Address fa:16:3e:74:e6:5c (oui Unknown)
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message Option 53, length 1: Discover
	    Parameter-Request Option 55, length 13: 
	      Subnet-Mask, BR, Time-Zone, Classless-Static-Route
	      Domain-Name, Domain-Name-Server, Hostname, YD
	      YS, NTP, MTU, Option 119
	      Default-Gateway
19:23:16.074798 IP (tos 0x0, ttl 64, id 47590, offset 0, flags [none], proto UDP (17), length 358)
    192.168.0.32.bootps > 192.168.0.2.bootpc: BOOTP/DHCP, Reply, length 330, xid 0x66016b4b, Flags [none]
	  Your-IP 192.168.0.2
	  Server-IP 192.168.0.32
	  Client-Ethernet-Address fa:16:3e:74:e6:5c (oui Unknown)
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message Option 53, length 1: Offer
	    Server-ID Option 54, length 4: 192.168.0.32
	    Lease-Time Option 51, length 4: 120
	    RN Option 58, length 4: 60
	    RB Option 59, length 4: 105
	    Subnet-Mask Option 1, length 4: 255.255.255.0
	    BR Option 28, length 4: 192.168.0.255
	    Domain-Name Option 15, length 14: "openstacklocal"
	    Hostname Option 12, length 16: "host-192-168-0-2"
	    Default-Gateway Option 3, length 4: 192.168.0.1
	    Domain-Name-Server Option 6, length 4: ip-10-16-17-200.oslab.openstack.engineering.redhat.com
	    MTU Option 26, length 2: 1454
19:23:16.074906 IP (tos 0x10, ttl 128, id 0, offset 0, flags [none], proto UDP (17), length 328)
    0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from fa:16:3e:74:e6:5c (oui Unknown), length 300, xid 0x66016b4b, Flags [none]
	  Client-Ethernet-Address fa:16:3e:74:e6:5c (oui Unknown)
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message Option 53, length 1: Request
	    Server-ID Option 54, length 4: 192.168.0.32
	    Requested-IP Option 50, length 4: 192.168.0.2
	    Parameter-Request Option 55, length 13: 
	      Subnet-Mask, BR, Time-Zone, Classless-Static-Route
	      Domain-Name, Domain-Name-Server, Hostname, YD
	      YS, NTP, MTU, Option 119
	      Default-Gateway
19:23:16.075204 IP (tos 0x0, ttl 64, id 47591, offset 0, flags [none], proto UDP (17), length 358)
    192.168.0.32.bootps > 192.168.0.2.bootpc: BOOTP/DHCP, Reply, length 330, xid 0x66016b4b, Flags [none]
	  Your-IP 192.168.0.2
	  Server-IP 192.168.0.32
	  Client-Ethernet-Address fa:16:3e:74:e6:5c (oui Unknown)
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message Option 53, length 1: ACK
	    Server-ID Option 54, length 4: 192.168.0.32
	    Lease-Time Option 51, length 4: 120
	    RN Option 58, length 4: 60
	    RB Option 59, length 4: 105
	    Subnet-Mask Option 1, length 4: 255.255.255.0
	    BR Option 28, length 4: 192.168.0.255
	    Domain-Name Option 15, length 14: "openstacklocal"
	    Hostname Option 12, length 16: "host-192-168-0-2"
	    Default-Gateway Option 3, length 4: 192.168.0.1
	    Domain-Name-Server Option 6, length 4: ip-10-16-17-200.oslab.openstack.engineering.redhat.com
	    MTU Option 26, length 2: 1454
19:23:16.106895 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.2 (Broadcast) tell 0.0.0.0, length 28
19:23:17.107078 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.2 (Broadcast) tell 0.0.0.0, length 28
19:23:19.351386 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.1 tell 192.168.0.2, length 28
19:23:19.352276 ARP, Ethernet (len 6), IPv4 (len 4), Reply 192.168.0.1 is-at fa:16:3e:6f:c3:20 (oui Unknown), length 28
---

So it looks like a request went out, a NAK came back "lease not found", the lease was requested again and we got back the same ip

This would seem fairly consistent with the symptoms of random failures every few minutes.  This is across several guest OS's (rhel6.5,centos6.5,f20,rhel7) so the problem isn't on the guest side.

On the dhcp-agent side, it looks like my dnsmasq is being restarted constantly

---
2014-06-23 05:46:11.606 2420 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2', 'env', 'NEUTRON_NETWORK_ID=20803109-5bea-4638-8b02-bcdbace3f0b2', 'dnsmasq', '--no-hosts', '--no-resolv', '--strict-order', '--bind-interfaces', '--interface=tap253af9f9-da', '--except-interface=lo', '--pid-file=/var/lib/neutron/dhcp/20803109-5bea-4638-8b02-bcdbace3f0b2/pid', '--dhcp-hostsfile=/var/lib/neutron/dhcp/20803109-5bea-4638-8b02-bcdbace3f0b2/host', '--dhcp-optsfile=/var/lib/neutron/dhcp/20803109-5bea-4638-8b02-bcdbace3f0b2/opts', '--leasefile-ro', '--dhcp-range=tag0,192.168.0.0,static,120s', '--dhcp-lease-max=256', '--conf-file=/etc/neutron/dnsmasq-neutron.conf', '--server=10.16.17.200', '--domain=openstacklocal'] create_process /usr/lib/python2.6/site-packages/neutron/agent/linux/utils.py:47
Command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2', 'env', 'NEUTRON_NETWORK_ID=20803109-5bea-4638-8b02-bcdbace3f0b2', 'dnsmasq', '--no-hosts', '--no-resolv', '--strict-order', '--bind-interfaces', '--interface=tap253af9f9-da', '--except-interface=lo', '--pid-file=/var/lib/neutron/dhcp/20803109-5bea-4638-8b02-bcdbace3f0b2/pid', '--dhcp-hostsfile=/var/lib/neutron/dhcp/20803109-5bea-4638-8b02-bcdbace3f0b2/host', '--dhcp-optsfile=/var/lib/neutron/dhcp/20803109-5bea-4638-8b02-bcdbace3f0b2/opts', '--leasefile-ro', '--dhcp-range=tag0,192.168.0.0,static,120s', '--dhcp-lease-max=256', '--conf-file=/etc/neutron/dnsmasq-neutron.conf', '--server=10.16.17.200', '--domain=openstacklocal']
2014-06-23 05:46:11.999 2420 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2', 'ip', 'route', 'list', 'dev', 'tap253af9f9-da'] create_process /usr/lib/python2.6/site-packages/neutron/agent/linux/utils.py:47
---

---
[root@host03 neutron]# cat dhcp-agent.log | grep 208031 | grep dnsmasq | awk '{print $1 $2}' | grep '2014'
014-06-2305:26:29.116
2014-06-2305:27:43.433
2014-06-2305:28:57.412
2014-06-2305:30:13.599
2014-06-2305:31:27.678
2014-06-2305:32:41.938
2014-06-2305:33:56.821
2014-06-2305:35:13.788
2014-06-2305:36:28.207
2014-06-2305:37:43.666
2014-06-2305:38:57.453
2014-06-2305:40:13.163
2014-06-2305:41:57.580
2014-06-2305:43:40.929
2014-06-2305:44:56.178
[root@host03 neutron]# date
Mon Jun 23 05:45:37 UTC 2014
---

looking in /var/log/messages, we can see the bad history for this mac address

---
Jun 23 05:46:32 host03 dnsmasq-dhcp[19481]: DHCPNAK(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c lease not found
Jun 23 05:46:32 host03 dnsmasq-dhcp[19481]: DHCPDISCOVER(tap253af9f9-da) fa:16:3e:74:e6:5c 
Jun 23 05:46:32 host03 dnsmasq-dhcp[19481]: DHCPOFFER(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c 
Jun 23 05:46:32 host03 dnsmasq-dhcp[19481]: DHCPREQUEST(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c 
Jun 23 05:46:32 host03 dnsmasq-dhcp[19481]: DHCPACK(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c host-192-168-0-2
Jun 23 05:47:31 host03 dnsmasq-dhcp[24363]: DHCPREQUEST(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c 
Jun 23 05:47:31 host03 dnsmasq-dhcp[24363]: DHCPNAK(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c lease not found
Jun 23 05:47:31 host03 dnsmasq-dhcp[24363]: DHCPDISCOVER(tap253af9f9-da) fa:16:3e:74:e6:5c 
Jun 23 05:47:31 host03 dnsmasq-dhcp[24363]: DHCPOFFER(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c 
Jun 23 05:47:31 host03 dnsmasq-dhcp[24363]: DHCPREQUEST(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c 
Jun 23 05:47:31 host03 dnsmasq-dhcp[24363]: DHCPACK(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c host-192-168-0-2
Jun 23 05:48:24 host03 dnsmasq-dhcp[24363]: DHCPREQUEST(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c 
Jun 23 05:48:24 host03 dnsmasq-dhcp[24363]: DHCPACK(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c host-192-168-0-2
Jun 23 05:49:09 host03 dnsmasq-dhcp[29218]: DHCPREQUEST(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c 
Jun 23 05:49:09 host03 dnsmasq-dhcp[29218]: DHCPNAK(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c lease not found
Jun 23 05:49:09 host03 dnsmasq-dhcp[29218]: DHCPDISCOVER(tap253af9f9-da) fa:16:3e:74:e6:5c 
Jun 23 05:49:09 host03 dnsmasq-dhcp[29218]: DHCPOFFER(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c 
Jun 23 05:49:09 host03 dnsmasq-dhcp[29218]: DHCPREQUEST(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c 
Jun 23 05:49:09 host03 dnsmasq-dhcp[29218]: DHCPACK(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c host-192-168-0-2
---

different pids, and the occasional "lease not found"

neutron processes are killing the box too with all this reloading

---
 2484 neutron   20   0  336m  95m 4640 S 37.2  0.6   1051:34 neutron-penvsw
 2420 neutron   20   0  320m  80m 4656 R 33.2  0.5   1005:43 neutron-dhcp-ag
 1285 root      10 -10 43732 4868 1228 S  4.0  0.0 286:19.97 ovsdb-server
---

Comment 2 Ian Wienand 2014-06-23 06:13:05 UTC
Ok, I applied my patch [1] to see what's going on with the resyncs


---
[root@host03 neutron]# cat dhcp-agent.log | grep -A5 resync:
2014-06-23 06:01:33.857 3112 DEBUG neutron.agent.dhcp_agent [-] resync: 
Command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-1c8f57e6-fa66-4bba-90e3-1a735cab79e2', 'dhcp_release', 'None', '192.168.1.24', 'fa:16:3e:a3:c9:af']
Exit code: 1
Stdout: ''
Stderr: 'cannot setup interface: No such device\n' _periodic_resync_helper /usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py:181
2014-06-23 06:01:33.858 3112 DEBUG neutron.agent.dhcp_agent [-] resync: 
Command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-604f42a4-2390-46fc-9d7e-47e20120c8a9', 'ip', 'route', 'replace', 'default', 'via', '192.168.1.1', 'dev', 'tap9331f355-0a']
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: No such process\n' _periodic_resync_helper /usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py:181
2014-06-23 06:01:33.859 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore "dhcp-agent" lock /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:168
--
2014-06-23 06:02:47.711 3112 DEBUG neutron.agent.dhcp_agent [-] resync: 
Command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-604f42a4-2390-46fc-9d7e-47e20120c8a9', 'ip', 'route', 'replace', 'default', 'via', '192.168.1.1', 'dev', 'tap9331f355-0a']
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: No such process\n' _periodic_resync_helper /usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py:181
2014-06-23 06:02:47.712 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore "dhcp-agent" lock /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:168
--
2014-06-23 06:03:18.223 3112 DEBUG neutron.agent.dhcp_agent [-] resync: Remote error: DBDuplicateEntry (IntegrityError) (1062, "Duplicate entry 'bf2fe64d-9bb6-44cd-baef-e277de16c0aa-d3e3ac5b-9962-428a-a9f8-6b2' for key 'PRIMARY'") 'INSERT INTO networkdhcpagentbindings (network_id, dhcp_agent_id) VALUES (%s, %s)' ('bf2fe64d-9bb6-44cd-baef-e277de16c0aa', 'd3e3ac5b-9962-428a-a9f8-6b2a1aba48d8')
[u'Traceback (most recent call last):\n', u'  File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/amqp.py", line 438, in _process_data\n    **args)\n', u'  File "/usr/lib/python2.6/site-packages/neutron/common/rpc.py", line 45, in dispatch\n    neutron_ctxt, version, method, namespace, **kwargs)\n', u'  File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/dispatcher.py", line 172, in dispatch\n    result = getattr(proxyobj, method)(ctxt, **kwargs)\n', u'  File "/usr/lib/python2.6/site-packages/neutron/db/dhcp_rpc_base.py", line 63, in get_active_networks_info\n    networks = self._get_active_networks(context, **kwargs)\n', u'  File "/usr/lib/python2.6/site-packages/neutron/db/dhcp_rpc_base.py", line 41, in _get_active_networks\n    plugin.auto_schedule_networks(context, host)\n', u'  File "/usr/lib/python2.6/site-packages/neutron/db/agentschedulers_db.py", line 211, in auto_schedule_networks\n    self.network_scheduler.auto_schedule_networks(self, context, host)\n', u'  File "/usr/lib/python2.6/site-packages/neutron/scheduler/dhcp_agent_scheduler.py", line 112, in auto_schedule_networks\n    context, [net_id], active=True)\n', u'  File "/usr/lib/python2.6/site-packages/neutron/db/agentschedulers_db.py", line 123, in get_dhcp_agents_hosting_networks\n    for binding in query\n', u'  File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg/sqlalchemy/orm/query.py", line 2226, in __iter__\n    self.session._autoflush()\n', u'  File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg/sqlalchemy/orm/session.py", line 1012, in _autoflush\n    self.flush()\n', u'  File "/usr/lib/python2.6/site-packages/neutron/openstack/common/db/sqlalchemy/session.py", line 541, in _wrap\n    _raise_if_duplicate_entry_error(e, get_engine().name)\n', u'  File "/usr/lib/python2.6/site-packages/neutron/openstack/common/db/sqlalchemy/session.py", line 492, in _raise_if_duplicate_entry_error\n    raise exception.DBDuplicateEntry(columns, integrity_error)\n', u'DBDuplicateEntry: (IntegrityError) (1062, "Duplicate entry \'bf2fe64d-9bb6-44cd-baef-e277de16c0aa-d3e3ac5b-9962-428a-a9f8-6b2\' for key \'PRIMARY\'") \'INSERT INTO networkdhcpagentbindings (network_id, dhcp_agent_id) VALUES (%s, %s)\' (\'bf2fe64d-9bb6-44cd-baef-e277de16c0aa\', \'d3e3ac5b-9962-428a-a9f8-6b2a1aba48d8\')\n']. _periodic_resync_helper /usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py:181
2014-06-23 06:03:18.224 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore "dhcp-agent" lock /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:168
2014-06-23 06:03:18.254 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore / lock "sync_state" inner /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:248
2014-06-23 06:03:18.254 3112 INFO neutron.agent.dhcp_agent [-] Synchronizing state
2014-06-23 06:03:18.254 3112 DEBUG neutron.openstack.common.rpc.amqp [-] Making synchronous call on q-plugin ... multicall /usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/amqp.py:530
--
2014-06-23 06:04:32.307 3112 DEBUG neutron.agent.dhcp_agent [-] resync: 
Command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-604f42a4-2390-46fc-9d7e-47e20120c8a9', 'ip', 'route', 'replace', 'default', 'via', '192.168.1.1', 'dev', 'tap9331f355-0a']
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: No such process\n' _periodic_resync_helper /usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py:181
2014-06-23 06:04:32.307 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore "dhcp-agent" lock /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:168
--
2014-06-23 06:05:48.008 3112 DEBUG neutron.agent.dhcp_agent [-] resync: 
Command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-604f42a4-2390-46fc-9d7e-47e20120c8a9', 'ip', 'route', 'replace', 'default', 'via', '192.168.1.1', 'dev', 'tap9331f355-0a']
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: No such process\n' _periodic_resync_helper /usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py:181
2014-06-23 06:05:48.008 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore "dhcp-agent" lock /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:168
--
2014-06-23 06:07:01.602 3112 DEBUG neutron.agent.dhcp_agent [-] resync: 
Command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-604f42a4-2390-46fc-9d7e-47e20120c8a9', 'ip', 'route', 'replace', 'default', 'via', '192.168.1.1', 'dev', 'tap9331f355-0a']
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: No such process\n' _periodic_resync_helper /usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py:181
2014-06-23 06:07:01.602 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore "dhcp-agent" lock /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:168
--
2014-06-23 06:08:15.339 3112 DEBUG neutron.agent.dhcp_agent [-] resync: 
Command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-604f42a4-2390-46fc-9d7e-47e20120c8a9', 'ip', 'route', 'replace', 'default', 'via', '192.168.1.1', 'dev', 'tap9331f355-0a']
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: No such process\n' _periodic_resync_helper /usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py:181
2014-06-23 06:08:15.340 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore "dhcp-agent" lock /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:168
--
2014-06-23 06:08:46.044 3112 DEBUG neutron.agent.dhcp_agent [-] resync: Remote error: DBDuplicateEntry (IntegrityError) (1062, "Duplicate entry '32e47848-764e-49e5-8d66-b44d90e45d73-d3e3ac5b-9962-428a-a9f8-6b2' for key 'PRIMARY'") 'INSERT INTO networkdhcpagentbindings (network_id, dhcp_agent_id) VALUES (%s, %s)' ('32e47848-764e-49e5-8d66-b44d90e45d73', 'd3e3ac5b-9962-428a-a9f8-6b2a1aba48d8')
[u'Traceback (most recent call last):\n', u'  File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/amqp.py", line 438, in _process_data\n    **args)\n', u'  File "/usr/lib/python2.6/site-packages/neutron/common/rpc.py", line 45, in dispatch\n    neutron_ctxt, version, method, namespace, **kwargs)\n', u'  File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/dispatcher.py", line 172, in dispatch\n    result = getattr(proxyobj, method)(ctxt, **kwargs)\n', u'  File "/usr/lib/python2.6/site-packages/neutron/db/dhcp_rpc_base.py", line 63, in get_active_networks_info\n    networks = self._get_active_networks(context, **kwargs)\n', u'  File "/usr/lib/python2.6/site-packages/neutron/db/dhcp_rpc_base.py", line 41, in _get_active_networks\n    plugin.auto_schedule_networks(context, host)\n', u'  File "/usr/lib/python2.6/site-packages/neutron/db/agentschedulers_db.py", line 211, in auto_schedule_networks\n    self.network_scheduler.auto_schedule_networks(self, context, host)\n', u'  File "/usr/lib/python2.6/site-packages/neutron/scheduler/dhcp_agent_scheduler.py", line 112, in auto_schedule_networks\n    context, [net_id], active=True)\n', u'  File "/usr/lib/python2.6/site-packages/neutron/db/agentschedulers_db.py", line 123, in get_dhcp_agents_hosting_networks\n    for binding in query\n', u'  File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg/sqlalchemy/orm/query.py", line 2226, in __iter__\n    self.session._autoflush()\n', u'  File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg/sqlalchemy/orm/session.py", line 1012, in _autoflush\n    self.flush()\n', u'  File "/usr/lib/python2.6/site-packages/neutron/openstack/common/db/sqlalchemy/session.py", line 541, in _wrap\n    _raise_if_duplicate_entry_error(e, get_engine().name)\n', u'  File "/usr/lib/python2.6/site-packages/neutron/openstack/common/db/sqlalchemy/session.py", line 492, in _raise_if_duplicate_entry_error\n    raise exception.DBDuplicateEntry(columns, integrity_error)\n', u'DBDuplicateEntry: (IntegrityError) (1062, "Duplicate entry \'32e47848-764e-49e5-8d66-b44d90e45d73-d3e3ac5b-9962-428a-a9f8-6b2\' for key \'PRIMARY\'") \'INSERT INTO networkdhcpagentbindings (network_id, dhcp_agent_id) VALUES (%s, %s)\' (\'32e47848-764e-49e5-8d66-b44d90e45d73\', \'d3e3ac5b-9962-428a-a9f8-6b2a1aba48d8\')\n']. _periodic_resync_helper /usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py:181
2014-06-23 06:08:46.044 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore "dhcp-agent" lock /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:168
2014-06-23 06:08:46.045 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore / lock "sync_state" inner /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:248
2014-06-23 06:08:46.045 3112 INFO neutron.agent.dhcp_agent [-] Synchronizing state
2014-06-23 06:08:46.045 3112 DEBUG neutron.openstack.common.rpc.amqp [-] Making synchronous call on q-plugin ... multicall /usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/amqp.py:530
--
2014-06-23 06:10:02.194 3112 DEBUG neutron.agent.dhcp_agent [-] resync: 
Command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-604f42a4-2390-46fc-9d7e-47e20120c8a9', 'ip', 'route', 'replace', 'default', 'via', '192.168.1.1', 'dev', 'tap9331f355-0a']
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: No such process\n' _periodic_resync_helper /usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py:181
2014-06-23 06:10:02.195 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore "dhcp-agent" lock /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:168
--
2014-06-23 06:11:16.144 3112 DEBUG neutron.agent.dhcp_agent [-] resync: 
Command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-604f42a4-2390-46fc-9d7e-47e20120c8a9', 'ip', 'route', 'replace', 'default', 'via', '192.168.1.1', 'dev', 'tap9331f355-0a']
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: No such process\n' _periodic_resync_helper /usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py:181
2014-06-23 06:11:16.144 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore "dhcp-agent" lock /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:168
---

looks like something has got out of sync?


[1] https://review.openstack.org/#/c/81173/

Comment 3 Ian Wienand 2014-06-23 06:20:52 UTC
not sure what's going on with the ip route failures in the above message -- it's trying to set the default route to 192.168.1.1 it seems, but the interface is 192.168.0.1

---
[root@host03 log]# ip netns exec qdhcp-604f42a4-2390-46fc-9d7e-47e20120c8a9 ip addr
737: tap9331f355-0a: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:0e:c9:28 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.1/24 brd 192.168.0.255 scope global tap9331f355-0a
    inet6 fe80::f816:3eff:fe0e:c928/64 scope link 
       valid_lft forever preferred_lft forever
738: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever

[root@host03 log]# ip netns exec qdhcp-604f42a4-2390-46fc-9d7e-47e20120c8a9 ip route
192.168.0.0/24 dev tap9331f355-0a  proto kernel  scope link  src 192.168.0.1 
---

Comment 4 Ian Wienand 2014-06-23 07:06:33 UTC
Here's the network details for that network (604...)

---
[root@host02 keystonerc(keystone_admin)]$ neutron net-show 604f42a4-2390-46fc-9d7e-47e20120c8a9
+---------------------------+--------------------------------------+
| Field                     | Value                                |
+---------------------------+--------------------------------------+
| admin_state_up            | True                                 |
| id                        | 604f42a4-2390-46fc-9d7e-47e20120c8a9 |
| name                      | internal1                            |
| provider:network_type     | gre                                  |
| provider:physical_network |                                      |
| provider:segmentation_id  | 187                                  |
| router:external           | False                                |
| shared                    | False                                |
| status                    | ACTIVE                               |
| subnets                   | e88bc379-6621-4f8a-aac2-244af27ae7a0 |
| tenant_id                 | 500ba2ec229740efbc8296dcc8462179     |
+---------------------------+--------------------------------------+
[root@host02 keystonerc(keystone_admin)]$ neutron subnet-show e88bc379-6621-4f8a-aac2-244af27ae7a0
+------------------+--------------------------------------------------+
| Field            | Value                                            |
+------------------+--------------------------------------------------+
| allocation_pools | {"start": "192.168.0.1", "end": "192.168.0.254"} |
| cidr             | 192.168.0.0/24                                   |
| dns_nameservers  |                                                  |
| enable_dhcp      | True                                             |
| gateway_ip       | 192.168.1.1                                      |
| host_routes      |                                                  |
| id               | e88bc379-6621-4f8a-aac2-244af27ae7a0             |
| ip_version       | 4                                                |
| name             | lan1                                             |
| network_id       | 604f42a4-2390-46fc-9d7e-47e20120c8a9             |
| tenant_id        | 500ba2ec229740efbc8296dcc8462179                 |
+------------------+--------------------------------------------------+
---

Comment 5 Miguel Angel Ajo 2014-06-23 07:55:51 UTC
Ian, your guess looks very reasonable, if I run ip route
locally trying to set a gw out of my subnet, I get a similar
error (different kernel).

And generally, an unexpected error like this triggers an agent resync.


[majopela@redcylon]$ sudo ip route replace default via 192.168.33.1 dev wlp3s0
RTNETLINK answers: Network is unreachable


I would

1) Fix that subnet e88bc379-6621-4f8a-aac2-244af27ae7a0 which is causing the problem (set the gateway to 192.168.0.1) to fix the deployment.

2) Investigate if this is fixed in RHOS5 / icehose.

  a) if it's not fixed: fill a bug upstream, and propose a solution, for example: not setting the gateway if it's out of the subnet, and/or preventing this kind of setting

  b) if it's fixed, consider backporting the fix to RHOS4 / havana.

Comment 6 lpeer 2014-06-23 08:33:50 UTC
Ian,
We are aware of this issue and it is being tracked in BZ 1090553.
You can turn the validation of the gateway by configuring 'force_gateway_on_subnet' to True.

The reason that this validation is not done by default is that IPv6 gateway doesn't have to be within the subnet.

Assaf should send a patch to u/s that validation would be done automatically if IPv4 is used, for tracking that we use BZ 1090553.

Livnat

*** This bug has been marked as a duplicate of bug 1090553 ***


Note You need to log in before you can comment on or make changes to this bug.