Bug 1009906 - dnsqmasq process does not give out ip to instances due to deadlock
dnsqmasq process does not give out ip to instances due to deadlock
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron (Show other bugs)
Unspecified Unspecified
medium Severity medium
: async
: 3.0
Assigned To: Jakub Libosvar
Ofer Blaut
: ZStream
Depends On:
  Show dependency treegraph
Reported: 2013-09-19 09:21 EDT by Sadique Puthen
Modified: 2016-04-26 20:14 EDT (History)
10 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2013-11-14 03:05:10 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)

External Trackers
Tracker ID Priority Status Summary Last Updated
Launchpad 1227237 None None None Never

  None (edit)
Description Sadique Puthen 2013-09-19 09:21:54 EDT
Description of problem:

Periodically dhcp server node is not handing out the address to instances with below error.

Sep 13 12:54:54 gs-prod-control dnsmasq-dhcp[3047]: DHCPDISCOVER(eth0) fa:16:3e:2b:2d:e1 no address available
Sep 13 12:54:54 gs-prod-control dnsmasq-dhcp[4225]: DHCPDISCOVER(ns-2cce40b9-5d) fa:16:3e:2b:2d:e1 no address available

Even though it is configured to do so

[root@gs-prod-control ~]# grep fa:16:3e:2b:2d:e1 /var/lib/quantum/dhcp/5272bb12-becf-4056-93c6-f947c5c7c244/host

Restarting quantum-dhcp-agent fixes the problem.

This has been reported upstream already at https://bugs.launchpad.net/neutron/+bug/1227237


NB This isn't an issue in Havana since neutron doesn't use a DHCP lease update script (i.e., neutron runs dnsmasq without the --dhcp-script=xxx option).

Every week or so on our OpenStack cluster of modest size (about a dozen nodes, dozens to hundreds of VMs being booted per day), instances stop receiving replies to their DHCP requests. When we observe the instances failing to communicate with the DHCP server, traces of network traffic show that the DHCP requests are making it to the tap device that dnsmasq is listening on but that dnsmasq isn't sending any replies. Furthermore, dnsmasq seems to be configured to send the replies (i.e., the appropriate entries are in its DHCP hosts file and dnsmasq has been sent SIGHUP). Once the dnsmasq process stops sending replies, it seems to have stopped indefinitely. Killing the dnsmasq process and restarting quantum-dhcp-agent, which starts a new dnsmasq process, gets replies going again.

The dnsmasq process seems to have stopped sending replies because of its DHCP script quantum-dhcp-agent-dnsmasq-lease-update (i.e., from the option --dhcp-script=quantum-...). According to man dnsmasq(8), dnsmasq runs one copy of the lease update script at a time (i.e., executes it serially) and some request processing blocks on execution of the lease update script. Assuming the blocked request processing includes replying to DHCP requests, then a single deadlocked instance of quantum-dhcp-agent-dnsmasq-lease-update explains why no DHCP responses are being made.

Indeed, the script has deadlocked. Here's a snippet of output from 'ps auxf' showing the relevant dnsmasq and quantum-dhcp-agent-dnsmasq-lease-update and quantum-dhcp-agent processes:

quantum 16347 0.6 0.0 198756 56084 ? Ss Sep10 71:50 python /usr/bin/quantum-dhcp-agent --config-file=/etc/quantum/quantum.conf --config-file=/etc/quantum/dhcp_agent.ini --log-file=/var/log/quantum/dhcp-agent.log
nobody 19945 0.0 0.0 28820 1156 ? S Sep10 6:47 dnsmasq --no-hosts --no-resolv --strict-order --bind-interfaces --interface=tap8591954d-46 --except-interface=lo --pid-file=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/pid --dhcp-hostsfile=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/host --dhcp-optsfile=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/opts --dhcp-script=/usr/bin/quantum-dhcp-agent-dnsmasq-lease-update --leasefile-ro --dhcp-range=set:tag0,,static,120s --conf-file= --domain=openstacklocal
root 19948 0.0 0.0 28792 472 ? S Sep10 3:02 \_ dnsmasq --no-hosts --no-resolv --strict-order --bind-interfaces --interface=tap8591954d-46 --except-interface=lo --pid-file=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/pid --dhcp-hostsfile=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/host --dhcp-optsfile=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/opts --dhcp-script=/usr/bin/quantum-dhcp-agent-dnsmasq-lease-update --leasefile-ro --dhcp-range=set:tag0,,static,120s --conf-file= --domain=openstacklocal
root 24722 0.0 0.0 4292 352 ? S Sep17 0:00 \_ quantum-dhcp-agent-dnsmasq-lease-update old fa:16:3e:17:5f:46 172-16-0-129

Strace shows that 19945 is continuing to do stuff, but 19948 is blocked in a wait4 system call waiting for its child, 24722, to exit:

peter@gremlin ~/ssst % sudo strace -p 19948
Process 19948 attached - interrupt to quit
wait4(-1, ^C <unfinished ...>

But 24722 is blocked waiting to connect to the UDS that uses to communicate with quantum-dhcp-agent:

peter@gremlin ~/ssst % sudo strace -p 24722
Process 24722 attached - interrupt to quit
connect(3, {sa_family=AF_FILE, path="/var/lib/quantum/dhcp/lease_relay"}, 110

But quantum-dhcp-agent, pid 16347, is spinning epoll-ing on some epoll instance:

peter@gremlin ~/ssst % sudo strace -p 16347
Process 16347 attached - interrupt to quit
epoll_wait(5, {}, 1023, 1549) = 0
read(4, "2+\214~\346\235,\6\264\320\27\"\\\17?\354", 16) = 16
gettid() = 16347
epoll_wait(5, {}, 1023, 826) = 0
epoll_wait(5, {}, 1023, 3999) = 0
epoll_wait(5, {}, 1023, 0) = 0
epoll_wait(5, {}, 1023, 3999) = 0
epoll_wait(5, {}, 1023, 3999) = 0

Clearly that epoll instance doesn't include the UDS that the lease update script is trying to connect to in spite of having the file open:

peter@gremlin ~/ssst % sudo lsof | grep 16347 | grep lease_relay
python 16347 quantum 8u unix 0xffff880070c86e80 0t0 224377723 /var/lib/quantum/dhcp/lease_relay

I haven't looked into why quantum-dhcp-agent isn't epoll-ing on the UDS (fd 8). Presumably the epoll is the internal eventlet epoll.

We've observed this on Canonical's 2013.1-0ubuntu2~cloud0 package (which is based on 2013.1 AFAIK) on Ubuntu 12.04 and 2013.1.2 on CentOS 6.4 from a recent RDO installation. Although I haven't tested 2013.1.3 or the latest stable/grizzly code, I suspect the bug still exists because no similar bugs.

Comment 3 Jakub Libosvar 2013-10-15 10:53:15 EDT
Sadique, can you please attach /var/log/quantum/dhcp-agent.log from node that has running quantum-dhcp-agent?

Note You need to log in before you can comment on or make changes to this bug.