Description of problem: I have a CI test running that does a lot of git clones, pypi pulls, etc. It has recently started failing randomly. Various things will fail like this. The symptom is usually something like --- 2014-06-23 07:54:43.760 | + timeout=300 2014-06-23 07:54:43.760 | + timeout -s SIGINT 300 git clone git://git.openstack.org/openstack/python-keystoneclient.git /opt/stack/python-keystoneclient 2014-06-23 07:54:43.765 | fatal: Unable to look up git.openstack.org (port 9418) (Name or service not known) 2014-06-23 07:54:43.766 | Initialized empty Git repository in /opt/stack/python-keystoneclient/.git/ --- In an effort figure this out, I ran the following test --- [stack@host-192-168-0-2 tmp]$ cat test.sh #!/bin/bash while [ 1 ] do sudo tcpdump -i eth0 -w /tmp/trace.log port not 22& pid=$! echo $pid sleep 2 rm -rf python-keystoneclient strace -f -o /tmp/trace.txt git clone git://git.openstack.org/openstack/python-keystoneclient.git s=$? sleep 2 sudo kill -INT $pid if [[ $s -ne 0 ]]; then exit fi sudo rm /tmp/trace.txt sudo rm /tmp/trace.log done --- eventually this failed. --- tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes Initialized empty Git repository in /tmp/python-keystoneclient/.git/ fatal: Unable to look up git.openstack.org (port 9418) (Name or service not known) 10 packets captured 10 packets received by filter 0 packets dropped by kernel --- What I found was the git process trying to do the lookup for git.openstack.org and getting connection refused from the network --- 22597 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3 22597 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.16.17.200")}, 16) = -1 ENETUNREACH (Network is unreachable) 22597 close(3) = 0 22597 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3 22597 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.16.17.200")}, 16) = -1 ENETUNREACH (Network is unreachable) 22597 close(3) = 0 22597 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3 22597 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.16.17.200")}, 16) = -1 ENETUNREACH (Network is unreachable) 22597 close(3) = 0 22597 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3 22597 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.16.17.200")}, 16) = -1 ENETUNREACH (Network is unreachable) 22597 close(3) = 0 22597 write(2, "fatal: Unable to look up git.ope"..., 83) = 83 --- 10.16.17.200 is the dns server tcpdump during this time only capture 10 packets; all relating to a DHCP transaction --- [iwienand@fedora19 bug2]$ tcpdump -v -r trace.out.txt.pcap reading from file trace.out.txt.pcap, link-type EN10MB (Ethernet) 19:23:16.013752 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 328) 192.168.0.2.bootpc > 192.168.0.32.bootps: BOOTP/DHCP, Request from fa:16:3e:74:e6:5c (oui Unknown), length 300, xid 0xfa7e0771, Flags [none] Client-IP 192.168.0.2 Client-Ethernet-Address fa:16:3e:74:e6:5c (oui Unknown) Vendor-rfc1048 Extensions Magic Cookie 0x63825363 DHCP-Message Option 53, length 1: Request Parameter-Request Option 55, length 13: Subnet-Mask, BR, Time-Zone, Classless-Static-Route Domain-Name, Domain-Name-Server, Hostname, YD YS, NTP, MTU, Option 119 Default-Gateway 19:23:16.015167 IP (tos 0x0, ttl 64, id 13492, offset 0, flags [none], proto UDP (17), length 328) 192.168.0.32.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 300, xid 0xfa7e0771, Flags [Broadcast] Client-Ethernet-Address fa:16:3e:74:e6:5c (oui Unknown) Vendor-rfc1048 Extensions Magic Cookie 0x63825363 DHCP-Message Option 53, length 1: NACK Server-ID Option 54, length 4: 192.168.0.32 MSG Option 56, length 15: "lease not found" 19:23:16.073829 IP (tos 0x10, ttl 128, id 0, offset 0, flags [none], proto UDP (17), length 328) 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from fa:16:3e:74:e6:5c (oui Unknown), length 300, xid 0x66016b4b, Flags [none] Client-Ethernet-Address fa:16:3e:74:e6:5c (oui Unknown) Vendor-rfc1048 Extensions Magic Cookie 0x63825363 DHCP-Message Option 53, length 1: Discover Parameter-Request Option 55, length 13: Subnet-Mask, BR, Time-Zone, Classless-Static-Route Domain-Name, Domain-Name-Server, Hostname, YD YS, NTP, MTU, Option 119 Default-Gateway 19:23:16.074798 IP (tos 0x0, ttl 64, id 47590, offset 0, flags [none], proto UDP (17), length 358) 192.168.0.32.bootps > 192.168.0.2.bootpc: BOOTP/DHCP, Reply, length 330, xid 0x66016b4b, Flags [none] Your-IP 192.168.0.2 Server-IP 192.168.0.32 Client-Ethernet-Address fa:16:3e:74:e6:5c (oui Unknown) Vendor-rfc1048 Extensions Magic Cookie 0x63825363 DHCP-Message Option 53, length 1: Offer Server-ID Option 54, length 4: 192.168.0.32 Lease-Time Option 51, length 4: 120 RN Option 58, length 4: 60 RB Option 59, length 4: 105 Subnet-Mask Option 1, length 4: 255.255.255.0 BR Option 28, length 4: 192.168.0.255 Domain-Name Option 15, length 14: "openstacklocal" Hostname Option 12, length 16: "host-192-168-0-2" Default-Gateway Option 3, length 4: 192.168.0.1 Domain-Name-Server Option 6, length 4: ip-10-16-17-200.oslab.openstack.engineering.redhat.com MTU Option 26, length 2: 1454 19:23:16.074906 IP (tos 0x10, ttl 128, id 0, offset 0, flags [none], proto UDP (17), length 328) 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from fa:16:3e:74:e6:5c (oui Unknown), length 300, xid 0x66016b4b, Flags [none] Client-Ethernet-Address fa:16:3e:74:e6:5c (oui Unknown) Vendor-rfc1048 Extensions Magic Cookie 0x63825363 DHCP-Message Option 53, length 1: Request Server-ID Option 54, length 4: 192.168.0.32 Requested-IP Option 50, length 4: 192.168.0.2 Parameter-Request Option 55, length 13: Subnet-Mask, BR, Time-Zone, Classless-Static-Route Domain-Name, Domain-Name-Server, Hostname, YD YS, NTP, MTU, Option 119 Default-Gateway 19:23:16.075204 IP (tos 0x0, ttl 64, id 47591, offset 0, flags [none], proto UDP (17), length 358) 192.168.0.32.bootps > 192.168.0.2.bootpc: BOOTP/DHCP, Reply, length 330, xid 0x66016b4b, Flags [none] Your-IP 192.168.0.2 Server-IP 192.168.0.32 Client-Ethernet-Address fa:16:3e:74:e6:5c (oui Unknown) Vendor-rfc1048 Extensions Magic Cookie 0x63825363 DHCP-Message Option 53, length 1: ACK Server-ID Option 54, length 4: 192.168.0.32 Lease-Time Option 51, length 4: 120 RN Option 58, length 4: 60 RB Option 59, length 4: 105 Subnet-Mask Option 1, length 4: 255.255.255.0 BR Option 28, length 4: 192.168.0.255 Domain-Name Option 15, length 14: "openstacklocal" Hostname Option 12, length 16: "host-192-168-0-2" Default-Gateway Option 3, length 4: 192.168.0.1 Domain-Name-Server Option 6, length 4: ip-10-16-17-200.oslab.openstack.engineering.redhat.com MTU Option 26, length 2: 1454 19:23:16.106895 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.2 (Broadcast) tell 0.0.0.0, length 28 19:23:17.107078 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.2 (Broadcast) tell 0.0.0.0, length 28 19:23:19.351386 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.1 tell 192.168.0.2, length 28 19:23:19.352276 ARP, Ethernet (len 6), IPv4 (len 4), Reply 192.168.0.1 is-at fa:16:3e:6f:c3:20 (oui Unknown), length 28 --- So it looks like a request went out, a NAK came back "lease not found", the lease was requested again and we got back the same ip This would seem fairly consistent with the symptoms of random failures every few minutes. This is across several guest OS's (rhel6.5,centos6.5,f20,rhel7) so the problem isn't on the guest side. On the dhcp-agent side, it looks like my dnsmasq is being restarted constantly --- 2014-06-23 05:46:11.606 2420 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2', 'env', 'NEUTRON_NETWORK_ID=20803109-5bea-4638-8b02-bcdbace3f0b2', 'dnsmasq', '--no-hosts', '--no-resolv', '--strict-order', '--bind-interfaces', '--interface=tap253af9f9-da', '--except-interface=lo', '--pid-file=/var/lib/neutron/dhcp/20803109-5bea-4638-8b02-bcdbace3f0b2/pid', '--dhcp-hostsfile=/var/lib/neutron/dhcp/20803109-5bea-4638-8b02-bcdbace3f0b2/host', '--dhcp-optsfile=/var/lib/neutron/dhcp/20803109-5bea-4638-8b02-bcdbace3f0b2/opts', '--leasefile-ro', '--dhcp-range=tag0,192.168.0.0,static,120s', '--dhcp-lease-max=256', '--conf-file=/etc/neutron/dnsmasq-neutron.conf', '--server=10.16.17.200', '--domain=openstacklocal'] create_process /usr/lib/python2.6/site-packages/neutron/agent/linux/utils.py:47 Command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2', 'env', 'NEUTRON_NETWORK_ID=20803109-5bea-4638-8b02-bcdbace3f0b2', 'dnsmasq', '--no-hosts', '--no-resolv', '--strict-order', '--bind-interfaces', '--interface=tap253af9f9-da', '--except-interface=lo', '--pid-file=/var/lib/neutron/dhcp/20803109-5bea-4638-8b02-bcdbace3f0b2/pid', '--dhcp-hostsfile=/var/lib/neutron/dhcp/20803109-5bea-4638-8b02-bcdbace3f0b2/host', '--dhcp-optsfile=/var/lib/neutron/dhcp/20803109-5bea-4638-8b02-bcdbace3f0b2/opts', '--leasefile-ro', '--dhcp-range=tag0,192.168.0.0,static,120s', '--dhcp-lease-max=256', '--conf-file=/etc/neutron/dnsmasq-neutron.conf', '--server=10.16.17.200', '--domain=openstacklocal'] 2014-06-23 05:46:11.999 2420 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2', 'ip', 'route', 'list', 'dev', 'tap253af9f9-da'] create_process /usr/lib/python2.6/site-packages/neutron/agent/linux/utils.py:47 --- --- [root@host03 neutron]# cat dhcp-agent.log | grep 208031 | grep dnsmasq | awk '{print $1 $2}' | grep '2014' 014-06-2305:26:29.116 2014-06-2305:27:43.433 2014-06-2305:28:57.412 2014-06-2305:30:13.599 2014-06-2305:31:27.678 2014-06-2305:32:41.938 2014-06-2305:33:56.821 2014-06-2305:35:13.788 2014-06-2305:36:28.207 2014-06-2305:37:43.666 2014-06-2305:38:57.453 2014-06-2305:40:13.163 2014-06-2305:41:57.580 2014-06-2305:43:40.929 2014-06-2305:44:56.178 [root@host03 neutron]# date Mon Jun 23 05:45:37 UTC 2014 --- looking in /var/log/messages, we can see the bad history for this mac address --- Jun 23 05:46:32 host03 dnsmasq-dhcp[19481]: DHCPNAK(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c lease not found Jun 23 05:46:32 host03 dnsmasq-dhcp[19481]: DHCPDISCOVER(tap253af9f9-da) fa:16:3e:74:e6:5c Jun 23 05:46:32 host03 dnsmasq-dhcp[19481]: DHCPOFFER(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c Jun 23 05:46:32 host03 dnsmasq-dhcp[19481]: DHCPREQUEST(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c Jun 23 05:46:32 host03 dnsmasq-dhcp[19481]: DHCPACK(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c host-192-168-0-2 Jun 23 05:47:31 host03 dnsmasq-dhcp[24363]: DHCPREQUEST(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c Jun 23 05:47:31 host03 dnsmasq-dhcp[24363]: DHCPNAK(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c lease not found Jun 23 05:47:31 host03 dnsmasq-dhcp[24363]: DHCPDISCOVER(tap253af9f9-da) fa:16:3e:74:e6:5c Jun 23 05:47:31 host03 dnsmasq-dhcp[24363]: DHCPOFFER(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c Jun 23 05:47:31 host03 dnsmasq-dhcp[24363]: DHCPREQUEST(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c Jun 23 05:47:31 host03 dnsmasq-dhcp[24363]: DHCPACK(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c host-192-168-0-2 Jun 23 05:48:24 host03 dnsmasq-dhcp[24363]: DHCPREQUEST(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c Jun 23 05:48:24 host03 dnsmasq-dhcp[24363]: DHCPACK(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c host-192-168-0-2 Jun 23 05:49:09 host03 dnsmasq-dhcp[29218]: DHCPREQUEST(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c Jun 23 05:49:09 host03 dnsmasq-dhcp[29218]: DHCPNAK(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c lease not found Jun 23 05:49:09 host03 dnsmasq-dhcp[29218]: DHCPDISCOVER(tap253af9f9-da) fa:16:3e:74:e6:5c Jun 23 05:49:09 host03 dnsmasq-dhcp[29218]: DHCPOFFER(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c Jun 23 05:49:09 host03 dnsmasq-dhcp[29218]: DHCPREQUEST(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c Jun 23 05:49:09 host03 dnsmasq-dhcp[29218]: DHCPACK(tap253af9f9-da) 192.168.0.2 fa:16:3e:74:e6:5c host-192-168-0-2 --- different pids, and the occasional "lease not found" neutron processes are killing the box too with all this reloading --- 2484 neutron 20 0 336m 95m 4640 S 37.2 0.6 1051:34 neutron-penvsw 2420 neutron 20 0 320m 80m 4656 R 33.2 0.5 1005:43 neutron-dhcp-ag 1285 root 10 -10 43732 4868 1228 S 4.0 0.0 286:19.97 ovsdb-server ---
Ok, I applied my patch [1] to see what's going on with the resyncs --- [root@host03 neutron]# cat dhcp-agent.log | grep -A5 resync: 2014-06-23 06:01:33.857 3112 DEBUG neutron.agent.dhcp_agent [-] resync: Command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-1c8f57e6-fa66-4bba-90e3-1a735cab79e2', 'dhcp_release', 'None', '192.168.1.24', 'fa:16:3e:a3:c9:af'] Exit code: 1 Stdout: '' Stderr: 'cannot setup interface: No such device\n' _periodic_resync_helper /usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py:181 2014-06-23 06:01:33.858 3112 DEBUG neutron.agent.dhcp_agent [-] resync: Command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-604f42a4-2390-46fc-9d7e-47e20120c8a9', 'ip', 'route', 'replace', 'default', 'via', '192.168.1.1', 'dev', 'tap9331f355-0a'] Exit code: 2 Stdout: '' Stderr: 'RTNETLINK answers: No such process\n' _periodic_resync_helper /usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py:181 2014-06-23 06:01:33.859 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore "dhcp-agent" lock /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:168 -- 2014-06-23 06:02:47.711 3112 DEBUG neutron.agent.dhcp_agent [-] resync: Command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-604f42a4-2390-46fc-9d7e-47e20120c8a9', 'ip', 'route', 'replace', 'default', 'via', '192.168.1.1', 'dev', 'tap9331f355-0a'] Exit code: 2 Stdout: '' Stderr: 'RTNETLINK answers: No such process\n' _periodic_resync_helper /usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py:181 2014-06-23 06:02:47.712 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore "dhcp-agent" lock /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:168 -- 2014-06-23 06:03:18.223 3112 DEBUG neutron.agent.dhcp_agent [-] resync: Remote error: DBDuplicateEntry (IntegrityError) (1062, "Duplicate entry 'bf2fe64d-9bb6-44cd-baef-e277de16c0aa-d3e3ac5b-9962-428a-a9f8-6b2' for key 'PRIMARY'") 'INSERT INTO networkdhcpagentbindings (network_id, dhcp_agent_id) VALUES (%s, %s)' ('bf2fe64d-9bb6-44cd-baef-e277de16c0aa', 'd3e3ac5b-9962-428a-a9f8-6b2a1aba48d8') [u'Traceback (most recent call last):\n', u' File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/amqp.py", line 438, in _process_data\n **args)\n', u' File "/usr/lib/python2.6/site-packages/neutron/common/rpc.py", line 45, in dispatch\n neutron_ctxt, version, method, namespace, **kwargs)\n', u' File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/dispatcher.py", line 172, in dispatch\n result = getattr(proxyobj, method)(ctxt, **kwargs)\n', u' File "/usr/lib/python2.6/site-packages/neutron/db/dhcp_rpc_base.py", line 63, in get_active_networks_info\n networks = self._get_active_networks(context, **kwargs)\n', u' File "/usr/lib/python2.6/site-packages/neutron/db/dhcp_rpc_base.py", line 41, in _get_active_networks\n plugin.auto_schedule_networks(context, host)\n', u' File "/usr/lib/python2.6/site-packages/neutron/db/agentschedulers_db.py", line 211, in auto_schedule_networks\n self.network_scheduler.auto_schedule_networks(self, context, host)\n', u' File "/usr/lib/python2.6/site-packages/neutron/scheduler/dhcp_agent_scheduler.py", line 112, in auto_schedule_networks\n context, [net_id], active=True)\n', u' File "/usr/lib/python2.6/site-packages/neutron/db/agentschedulers_db.py", line 123, in get_dhcp_agents_hosting_networks\n for binding in query\n', u' File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg/sqlalchemy/orm/query.py", line 2226, in __iter__\n self.session._autoflush()\n', u' File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg/sqlalchemy/orm/session.py", line 1012, in _autoflush\n self.flush()\n', u' File "/usr/lib/python2.6/site-packages/neutron/openstack/common/db/sqlalchemy/session.py", line 541, in _wrap\n _raise_if_duplicate_entry_error(e, get_engine().name)\n', u' File "/usr/lib/python2.6/site-packages/neutron/openstack/common/db/sqlalchemy/session.py", line 492, in _raise_if_duplicate_entry_error\n raise exception.DBDuplicateEntry(columns, integrity_error)\n', u'DBDuplicateEntry: (IntegrityError) (1062, "Duplicate entry \'bf2fe64d-9bb6-44cd-baef-e277de16c0aa-d3e3ac5b-9962-428a-a9f8-6b2\' for key \'PRIMARY\'") \'INSERT INTO networkdhcpagentbindings (network_id, dhcp_agent_id) VALUES (%s, %s)\' (\'bf2fe64d-9bb6-44cd-baef-e277de16c0aa\', \'d3e3ac5b-9962-428a-a9f8-6b2a1aba48d8\')\n']. _periodic_resync_helper /usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py:181 2014-06-23 06:03:18.224 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore "dhcp-agent" lock /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:168 2014-06-23 06:03:18.254 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore / lock "sync_state" inner /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:248 2014-06-23 06:03:18.254 3112 INFO neutron.agent.dhcp_agent [-] Synchronizing state 2014-06-23 06:03:18.254 3112 DEBUG neutron.openstack.common.rpc.amqp [-] Making synchronous call on q-plugin ... multicall /usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/amqp.py:530 -- 2014-06-23 06:04:32.307 3112 DEBUG neutron.agent.dhcp_agent [-] resync: Command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-604f42a4-2390-46fc-9d7e-47e20120c8a9', 'ip', 'route', 'replace', 'default', 'via', '192.168.1.1', 'dev', 'tap9331f355-0a'] Exit code: 2 Stdout: '' Stderr: 'RTNETLINK answers: No such process\n' _periodic_resync_helper /usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py:181 2014-06-23 06:04:32.307 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore "dhcp-agent" lock /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:168 -- 2014-06-23 06:05:48.008 3112 DEBUG neutron.agent.dhcp_agent [-] resync: Command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-604f42a4-2390-46fc-9d7e-47e20120c8a9', 'ip', 'route', 'replace', 'default', 'via', '192.168.1.1', 'dev', 'tap9331f355-0a'] Exit code: 2 Stdout: '' Stderr: 'RTNETLINK answers: No such process\n' _periodic_resync_helper /usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py:181 2014-06-23 06:05:48.008 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore "dhcp-agent" lock /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:168 -- 2014-06-23 06:07:01.602 3112 DEBUG neutron.agent.dhcp_agent [-] resync: Command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-604f42a4-2390-46fc-9d7e-47e20120c8a9', 'ip', 'route', 'replace', 'default', 'via', '192.168.1.1', 'dev', 'tap9331f355-0a'] Exit code: 2 Stdout: '' Stderr: 'RTNETLINK answers: No such process\n' _periodic_resync_helper /usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py:181 2014-06-23 06:07:01.602 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore "dhcp-agent" lock /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:168 -- 2014-06-23 06:08:15.339 3112 DEBUG neutron.agent.dhcp_agent [-] resync: Command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-604f42a4-2390-46fc-9d7e-47e20120c8a9', 'ip', 'route', 'replace', 'default', 'via', '192.168.1.1', 'dev', 'tap9331f355-0a'] Exit code: 2 Stdout: '' Stderr: 'RTNETLINK answers: No such process\n' _periodic_resync_helper /usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py:181 2014-06-23 06:08:15.340 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore "dhcp-agent" lock /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:168 -- 2014-06-23 06:08:46.044 3112 DEBUG neutron.agent.dhcp_agent [-] resync: Remote error: DBDuplicateEntry (IntegrityError) (1062, "Duplicate entry '32e47848-764e-49e5-8d66-b44d90e45d73-d3e3ac5b-9962-428a-a9f8-6b2' for key 'PRIMARY'") 'INSERT INTO networkdhcpagentbindings (network_id, dhcp_agent_id) VALUES (%s, %s)' ('32e47848-764e-49e5-8d66-b44d90e45d73', 'd3e3ac5b-9962-428a-a9f8-6b2a1aba48d8') [u'Traceback (most recent call last):\n', u' File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/amqp.py", line 438, in _process_data\n **args)\n', u' File "/usr/lib/python2.6/site-packages/neutron/common/rpc.py", line 45, in dispatch\n neutron_ctxt, version, method, namespace, **kwargs)\n', u' File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/dispatcher.py", line 172, in dispatch\n result = getattr(proxyobj, method)(ctxt, **kwargs)\n', u' File "/usr/lib/python2.6/site-packages/neutron/db/dhcp_rpc_base.py", line 63, in get_active_networks_info\n networks = self._get_active_networks(context, **kwargs)\n', u' File "/usr/lib/python2.6/site-packages/neutron/db/dhcp_rpc_base.py", line 41, in _get_active_networks\n plugin.auto_schedule_networks(context, host)\n', u' File "/usr/lib/python2.6/site-packages/neutron/db/agentschedulers_db.py", line 211, in auto_schedule_networks\n self.network_scheduler.auto_schedule_networks(self, context, host)\n', u' File "/usr/lib/python2.6/site-packages/neutron/scheduler/dhcp_agent_scheduler.py", line 112, in auto_schedule_networks\n context, [net_id], active=True)\n', u' File "/usr/lib/python2.6/site-packages/neutron/db/agentschedulers_db.py", line 123, in get_dhcp_agents_hosting_networks\n for binding in query\n', u' File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg/sqlalchemy/orm/query.py", line 2226, in __iter__\n self.session._autoflush()\n', u' File "/usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg/sqlalchemy/orm/session.py", line 1012, in _autoflush\n self.flush()\n', u' File "/usr/lib/python2.6/site-packages/neutron/openstack/common/db/sqlalchemy/session.py", line 541, in _wrap\n _raise_if_duplicate_entry_error(e, get_engine().name)\n', u' File "/usr/lib/python2.6/site-packages/neutron/openstack/common/db/sqlalchemy/session.py", line 492, in _raise_if_duplicate_entry_error\n raise exception.DBDuplicateEntry(columns, integrity_error)\n', u'DBDuplicateEntry: (IntegrityError) (1062, "Duplicate entry \'32e47848-764e-49e5-8d66-b44d90e45d73-d3e3ac5b-9962-428a-a9f8-6b2\' for key \'PRIMARY\'") \'INSERT INTO networkdhcpagentbindings (network_id, dhcp_agent_id) VALUES (%s, %s)\' (\'32e47848-764e-49e5-8d66-b44d90e45d73\', \'d3e3ac5b-9962-428a-a9f8-6b2a1aba48d8\')\n']. _periodic_resync_helper /usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py:181 2014-06-23 06:08:46.044 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore "dhcp-agent" lock /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:168 2014-06-23 06:08:46.045 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore / lock "sync_state" inner /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:248 2014-06-23 06:08:46.045 3112 INFO neutron.agent.dhcp_agent [-] Synchronizing state 2014-06-23 06:08:46.045 3112 DEBUG neutron.openstack.common.rpc.amqp [-] Making synchronous call on q-plugin ... multicall /usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/amqp.py:530 -- 2014-06-23 06:10:02.194 3112 DEBUG neutron.agent.dhcp_agent [-] resync: Command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-604f42a4-2390-46fc-9d7e-47e20120c8a9', 'ip', 'route', 'replace', 'default', 'via', '192.168.1.1', 'dev', 'tap9331f355-0a'] Exit code: 2 Stdout: '' Stderr: 'RTNETLINK answers: No such process\n' _periodic_resync_helper /usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py:181 2014-06-23 06:10:02.195 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore "dhcp-agent" lock /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:168 -- 2014-06-23 06:11:16.144 3112 DEBUG neutron.agent.dhcp_agent [-] resync: Command: ['sudo', 'ip', 'netns', 'exec', 'qdhcp-604f42a4-2390-46fc-9d7e-47e20120c8a9', 'ip', 'route', 'replace', 'default', 'via', '192.168.1.1', 'dev', 'tap9331f355-0a'] Exit code: 2 Stdout: '' Stderr: 'RTNETLINK answers: No such process\n' _periodic_resync_helper /usr/lib/python2.6/site-packages/neutron/agent/dhcp_agent.py:181 2014-06-23 06:11:16.144 3112 DEBUG neutron.openstack.common.lockutils [-] Got semaphore "dhcp-agent" lock /usr/lib/python2.6/site-packages/neutron/openstack/common/lockutils.py:168 --- looks like something has got out of sync? [1] https://review.openstack.org/#/c/81173/
not sure what's going on with the ip route failures in the above message -- it's trying to set the default route to 192.168.1.1 it seems, but the interface is 192.168.0.1 --- [root@host03 log]# ip netns exec qdhcp-604f42a4-2390-46fc-9d7e-47e20120c8a9 ip addr 737: tap9331f355-0a: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN link/ether fa:16:3e:0e:c9:28 brd ff:ff:ff:ff:ff:ff inet 192.168.0.1/24 brd 192.168.0.255 scope global tap9331f355-0a inet6 fe80::f816:3eff:fe0e:c928/64 scope link valid_lft forever preferred_lft forever 738: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo inet6 ::1/128 scope host valid_lft forever preferred_lft forever [root@host03 log]# ip netns exec qdhcp-604f42a4-2390-46fc-9d7e-47e20120c8a9 ip route 192.168.0.0/24 dev tap9331f355-0a proto kernel scope link src 192.168.0.1 ---
Here's the network details for that network (604...) --- [root@host02 keystonerc(keystone_admin)]$ neutron net-show 604f42a4-2390-46fc-9d7e-47e20120c8a9 +---------------------------+--------------------------------------+ | Field | Value | +---------------------------+--------------------------------------+ | admin_state_up | True | | id | 604f42a4-2390-46fc-9d7e-47e20120c8a9 | | name | internal1 | | provider:network_type | gre | | provider:physical_network | | | provider:segmentation_id | 187 | | router:external | False | | shared | False | | status | ACTIVE | | subnets | e88bc379-6621-4f8a-aac2-244af27ae7a0 | | tenant_id | 500ba2ec229740efbc8296dcc8462179 | +---------------------------+--------------------------------------+ [root@host02 keystonerc(keystone_admin)]$ neutron subnet-show e88bc379-6621-4f8a-aac2-244af27ae7a0 +------------------+--------------------------------------------------+ | Field | Value | +------------------+--------------------------------------------------+ | allocation_pools | {"start": "192.168.0.1", "end": "192.168.0.254"} | | cidr | 192.168.0.0/24 | | dns_nameservers | | | enable_dhcp | True | | gateway_ip | 192.168.1.1 | | host_routes | | | id | e88bc379-6621-4f8a-aac2-244af27ae7a0 | | ip_version | 4 | | name | lan1 | | network_id | 604f42a4-2390-46fc-9d7e-47e20120c8a9 | | tenant_id | 500ba2ec229740efbc8296dcc8462179 | +------------------+--------------------------------------------------+ ---
Ian, your guess looks very reasonable, if I run ip route locally trying to set a gw out of my subnet, I get a similar error (different kernel). And generally, an unexpected error like this triggers an agent resync. [majopela@redcylon]$ sudo ip route replace default via 192.168.33.1 dev wlp3s0 RTNETLINK answers: Network is unreachable I would 1) Fix that subnet e88bc379-6621-4f8a-aac2-244af27ae7a0 which is causing the problem (set the gateway to 192.168.0.1) to fix the deployment. 2) Investigate if this is fixed in RHOS5 / icehose. a) if it's not fixed: fill a bug upstream, and propose a solution, for example: not setting the gateway if it's out of the subnet, and/or preventing this kind of setting b) if it's fixed, consider backporting the fix to RHOS4 / havana.
Ian, We are aware of this issue and it is being tracked in BZ 1090553. You can turn the validation of the gateway by configuring 'force_gateway_on_subnet' to True. The reason that this validation is not done by default is that IPv6 gateway doesn't have to be within the subnet. Assaf should send a patch to u/s that validation would be done automatically if IPv4 is used, for tracking that we use BZ 1090553. Livnat *** This bug has been marked as a duplicate of bug 1090553 ***