Description of problem: Periodically dhcp server node is not handing out the address to instances with below error. Sep 13 12:54:54 gs-prod-control dnsmasq-dhcp[3047]: DHCPDISCOVER(eth0) 172.18.81.102 fa:16:3e:2b:2d:e1 no address available Sep 13 12:54:54 gs-prod-control dnsmasq-dhcp[4225]: DHCPDISCOVER(ns-2cce40b9-5d) 172.18.81.102 fa:16:3e:2b:2d:e1 no address available Even though it is configured to do so [root@gs-prod-control ~]# grep fa:16:3e:2b:2d:e1 /var/lib/quantum/dhcp/5272bb12-becf-4056-93c6-f947c5c7c244/host fa:16:3e:2b:2d:e1,host-172-18-81-160.openstacklocal,172.18.81.160 Restarting quantum-dhcp-agent fixes the problem. This has been reported upstream already at https://bugs.launchpad.net/neutron/+bug/1227237 <snip> NB This isn't an issue in Havana since neutron doesn't use a DHCP lease update script (i.e., neutron runs dnsmasq without the --dhcp-script=xxx option). Every week or so on our OpenStack cluster of modest size (about a dozen nodes, dozens to hundreds of VMs being booted per day), instances stop receiving replies to their DHCP requests. When we observe the instances failing to communicate with the DHCP server, traces of network traffic show that the DHCP requests are making it to the tap device that dnsmasq is listening on but that dnsmasq isn't sending any replies. Furthermore, dnsmasq seems to be configured to send the replies (i.e., the appropriate entries are in its DHCP hosts file and dnsmasq has been sent SIGHUP). Once the dnsmasq process stops sending replies, it seems to have stopped indefinitely. Killing the dnsmasq process and restarting quantum-dhcp-agent, which starts a new dnsmasq process, gets replies going again. The dnsmasq process seems to have stopped sending replies because of its DHCP script quantum-dhcp-agent-dnsmasq-lease-update (i.e., from the option --dhcp-script=quantum-...). According to man dnsmasq(8), dnsmasq runs one copy of the lease update script at a time (i.e., executes it serially) and some request processing blocks on execution of the lease update script. Assuming the blocked request processing includes replying to DHCP requests, then a single deadlocked instance of quantum-dhcp-agent-dnsmasq-lease-update explains why no DHCP responses are being made. Indeed, the script has deadlocked. Here's a snippet of output from 'ps auxf' showing the relevant dnsmasq and quantum-dhcp-agent-dnsmasq-lease-update and quantum-dhcp-agent processes: quantum 16347 0.6 0.0 198756 56084 ? Ss Sep10 71:50 python /usr/bin/quantum-dhcp-agent --config-file=/etc/quantum/quantum.conf --config-file=/etc/quantum/dhcp_agent.ini --log-file=/var/log/quantum/dhcp-agent.log ... nobody 19945 0.0 0.0 28820 1156 ? S Sep10 6:47 dnsmasq --no-hosts --no-resolv --strict-order --bind-interfaces --interface=tap8591954d-46 --except-interface=lo --pid-file=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/pid --dhcp-hostsfile=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/host --dhcp-optsfile=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/opts --dhcp-script=/usr/bin/quantum-dhcp-agent-dnsmasq-lease-update --leasefile-ro --dhcp-range=set:tag0,172.16.0.0,static,120s --conf-file= --domain=openstacklocal root 19948 0.0 0.0 28792 472 ? S Sep10 3:02 \_ dnsmasq --no-hosts --no-resolv --strict-order --bind-interfaces --interface=tap8591954d-46 --except-interface=lo --pid-file=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/pid --dhcp-hostsfile=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/host --dhcp-optsfile=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/opts --dhcp-script=/usr/bin/quantum-dhcp-agent-dnsmasq-lease-update --leasefile-ro --dhcp-range=set:tag0,172.16.0.0,static,120s --conf-file= --domain=openstacklocal root 24722 0.0 0.0 4292 352 ? S Sep17 0:00 \_ quantum-dhcp-agent-dnsmasq-lease-update old fa:16:3e:17:5f:46 172.16.0.129 172-16-0-129 Strace shows that 19945 is continuing to do stuff, but 19948 is blocked in a wait4 system call waiting for its child, 24722, to exit: peter@gremlin ~/ssst % sudo strace -p 19948 Process 19948 attached - interrupt to quit wait4(-1, ^C <unfinished ...> But 24722 is blocked waiting to connect to the UDS that uses to communicate with quantum-dhcp-agent: peter@gremlin ~/ssst % sudo strace -p 24722 Process 24722 attached - interrupt to quit connect(3, {sa_family=AF_FILE, path="/var/lib/quantum/dhcp/lease_relay"}, 110 But quantum-dhcp-agent, pid 16347, is spinning epoll-ing on some epoll instance: peter@gremlin ~/ssst % sudo strace -p 16347 Process 16347 attached - interrupt to quit epoll_wait(5, {}, 1023, 1549) = 0 read(4, "2+\214~\346\235,\6\264\320\27\"\\\17?\354", 16) = 16 gettid() = 16347 epoll_wait(5, {}, 1023, 826) = 0 epoll_wait(5, {}, 1023, 3999) = 0 epoll_wait(5, {}, 1023, 0) = 0 epoll_wait(5, {}, 1023, 3999) = 0 epoll_wait(5, {}, 1023, 3999) = 0 epoll_wait(5, Clearly that epoll instance doesn't include the UDS that the lease update script is trying to connect to in spite of having the file open: peter@gremlin ~/ssst % sudo lsof | grep 16347 | grep lease_relay python 16347 quantum 8u unix 0xffff880070c86e80 0t0 224377723 /var/lib/quantum/dhcp/lease_relay I haven't looked into why quantum-dhcp-agent isn't epoll-ing on the UDS (fd 8). Presumably the epoll is the internal eventlet epoll. We've observed this on Canonical's 2013.1-0ubuntu2~cloud0 package (which is based on 2013.1 AFAIK) on Ubuntu 12.04 and 2013.1.2 on CentOS 6.4 from a recent RDO installation. Although I haven't tested 2013.1.3 or the latest stable/grizzly code, I suspect the bug still exists because no similar bugs. </snip>
Sadique, can you please attach /var/log/quantum/dhcp-agent.log from node that has running quantum-dhcp-agent?