Bug 1275626 - dnsmasq crash with coredump on infiniband network with OpenStack
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: dnsmasq
Version: 7.1
Hardware: Unspecified Unspecified
Priority: high, Severity: medium
Target Milestone: beta
Target Release: 7.3
Assigned To: Pavel Šimerda (pavlix)
QA Contact: Vaclav Danek
Keywords: OtherQA, Patch
Depends On:
Blocks: 1171868 1255429 1313485 1289025 1289204 1295829 1364088
 
Reported: 2015-10-27 07:31 EDT by Moshe Levi
Modified: 2016-11-04 02:14 EDT (History)

See Also:
Fixed In Version: dnsmasq-2.66-17.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-11-04 02:14:30 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
dnsmasq config files (1010 bytes, application/zip)
2015-10-27 07:31 EDT, Moshe Levi
Description Moshe Levi 2015-10-27 07:31:04 EDT
Created attachment 1086826 [details]
dnsmasq config files

Description of problem:

I am seeing dnsmasq crash with a core dump in an OpenStack environment.
This is my setup:
1.	RHEL 7.1
2.	OpenStack Liberty release
3.	dnsmasq-utils-2.66-14.el7_1.x86_64
4.	DHCP on an InfiniBand network (hosts are identified by client ID; there is no MAC address)


Version-Release number of selected component (if applicable):
dnsmasq-utils-2.66-14.el7_1.x86_64

How reproducible:
1. install OpenStack Liberty
2. install mlnx plugin according to https://wiki.openstack.org/wiki/Mellanox-Neutron-Liberty-Redhat-InfiniBand
3. spawn a VM in OpenStack

Actual results:
dnsmasq coredump:
Oct 27 12:11:18 r-smg37 neutron-dhcp-agent: 2015-10-27 12:11:18.213 44868 ERROR neutron.agent.linux.external_process [-] respawning dnsmasq for uuid 82acf0a3-ec07-4009-84b5-74f750c89dc6
Oct 27 12:11:18 r-smg37 dnsmasq[41374]: started, version 2.66 cachesize 150
Oct 27 12:11:18 r-smg37 dnsmasq[41374]: compile time options: IPv6 GNU-getopt DBus no-i18n IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth
Oct 27 12:11:18 r-smg37 dnsmasq[41374]: warning: no upstream servers configured
Oct 27 12:11:18 r-smg37 dnsmasq-dhcp[41374]: DHCP, static leases only on 192.168.111.0, lease time 1d
Oct 27 12:11:18 r-smg37 dnsmasq[41374]: read /var/lib/neutron/dhcp/82acf0a3-ec07-4009-84b5-74f750c89dc6/addn_hosts - 6 addresses
Oct 27 12:11:18 r-smg37 dnsmasq-dhcp[41374]: read /var/lib/neutron/dhcp/82acf0a3-ec07-4009-84b5-74f750c89dc6/host
Oct 27 12:11:18 r-smg37 dnsmasq-dhcp[41374]: read /var/lib/neutron/dhcp/82acf0a3-ec07-4009-84b5-74f750c89dc6/opts
Oct 27 12:11:18 r-smg37 kernel: dnsmasq[41374]: segfault at 7a ip 00007f886e5501e8 sp 00007fff9c540b80 error 4 in dnsmasq[7f886e51e000+43000]



Expected results:
The VM should get an IP address

Additional info:

Sometimes dnsmasq crashes when a VM is spawned; see [1] and [2].
Note that when a VM is spawned, the neutron-dhcp-agent (which manages the dnsmasq instances for OpenStack) sends SIGHUP to dnsmasq to reload the config files.
After that I see the same messages as under "Actual results" above (repeated in [1]).

This is how the neutron-dhcp-agent spawns dnsmasq:
dnsmasq --no-hosts --no-resolv --strict-order --except-interface=lo --pid-file=/var/lib/neutron/dhcp/82acf0a3-ec07-4009-84b5-74f750c89dc6/pid --dhcp-hostsfile=/var/lib/neutron/dhcp/82acf0a3-ec07-4009-84b5-74f750c89dc6/host --addn-hosts=/var/lib/neutron/dhcp/82acf0a3-ec07-4009-84b5-74f750c89dc6/addn_hosts --dhcp-optsfile=/var/lib/neutron/dhcp/82acf0a3-ec07-4009-84b5-74f750c89dc6/opts --dhcp-leasefile=/var/lib/neutron/dhcp/82acf0a3-ec07-4009-84b5-74f750c89dc6/leases --dhcp-match=set:ipxe,175 --bind-interfaces --interface=tap04c60fe7-62 --dhcp-range=set:tag0,192.168.111.0,static,86400s --dhcp-lease-max=256 --conf-file= --domain=openstacklocal --dhcp-broadcast

I have also attached the config files: opts, leases, addn_hosts, host

Please note that this does not happen on an Ethernet network, only on InfiniBand, so I guess the crash is related to the client ID (as the logs also seem to suggest).
It would be great if you could help me debug this issue.

[1] - /var/log/messages
Oct 27 12:11:18 r-smg37 neutron-dhcp-agent: 2015-10-27 12:11:18.213 44868 ERROR neutron.agent.linux.external_process [-] respawning dnsmasq for uuid 82acf0a3-ec07-4009-84b5-74f750c89dc6
Oct 27 12:11:18 r-smg37 dnsmasq[41374]: started, version 2.66 cachesize 150
Oct 27 12:11:18 r-smg37 dnsmasq[41374]: compile time options: IPv6 GNU-getopt DBus no-i18n IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth
Oct 27 12:11:18 r-smg37 dnsmasq[41374]: warning: no upstream servers configured
Oct 27 12:11:18 r-smg37 dnsmasq-dhcp[41374]: DHCP, static leases only on 192.168.111.0, lease time 1d
Oct 27 12:11:18 r-smg37 dnsmasq[41374]: read /var/lib/neutron/dhcp/82acf0a3-ec07-4009-84b5-74f750c89dc6/addn_hosts - 6 addresses
Oct 27 12:11:18 r-smg37 dnsmasq-dhcp[41374]: read /var/lib/neutron/dhcp/82acf0a3-ec07-4009-84b5-74f750c89dc6/host
Oct 27 12:11:18 r-smg37 dnsmasq-dhcp[41374]: read /var/lib/neutron/dhcp/82acf0a3-ec07-4009-84b5-74f750c89dc6/opts
Oct 27 12:11:18 r-smg37 kernel: dnsmasq[41374]: segfault at 7a ip 00007f886e5501e8 sp 00007fff9c540b80 error 4 in dnsmasq[7f886e51e000+43000]


[2]: CoreDump
[root@r-smg37 ~(keystone_admin)]# gdb /usr/sbin/dnsmasq /root/core-dnsmasq-11-99-40-46692-1445940738
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-64.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/dnsmasq...Reading symbols from /usr/lib/debug/usr/sbin/dnsmasq.debug...done.
done.
[New LWP 46692]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `dnsmasq --no-hosts --no-resolv --strict-order --except-interface=lo --pid-file='.
Program terminated with signal 11, Segmentation fault.
#0  find_config (configs=0x7f4b5b014350, context=context@entry=0x0, clid=0x7f4b5b00bc00 "\377", clid_len=20, hwaddr=hwaddr@entry=0x7f4b5b00bb90 "", hw_len=0, hw_type=32, hostname=hostname@entry=0x0)
    at dhcp-common.c:319
319               if (!(context->flags & CONTEXT_V6) && *clid == 0 && config->clid_len == clid_len-1  &&
(gdb) bt
#0  find_config (configs=0x7f4b5b014350, context=context@entry=0x0, clid=0x7f4b5b00bc00 "\377", clid_len=20, hwaddr=hwaddr@entry=0x7f4b5b00bb90 "", hw_len=0, hw_type=32, hostname=hostname@entry=0x0)
    at dhcp-common.c:319
#1  0x00007f4b5952575b in lease_update_from_configs () at lease.c:193
#2  0x00007f4b59521c2c in clear_cache_and_reload (now=1445940738) at dnsmasq.c:1236
#3  0x00007f4b5950c334 in async_event (now=1445940738, pipe=10) at dnsmasq.c:1049
#4  main (argc=<optimized out>, argv=<optimized out>) at dnsmasq.c:852
Comment 2 Moshe Levi 2015-10-28 03:32:50 EDT
I compiled the latest dnsmasq:
commit 98079ea89851da1df4966dfdfa1852a98da02912
Author: Simon Kelley <simon@thekelleys.org.uk>
Date:   Tue Oct 13 20:30:32 2015 +0100

    Catch errors from sendmsg in DHCP code.
     Logs, eg,  iptables DROPS of dest 255.255.255.255

and we no longer see the core dump. Is it possible to build a newer version of dnsmasq for EL/CentOS 7, or at least in the RDO Delorean repository for OpenStack?
Comment 3 Jakub Libosvar 2015-11-05 12:41:53 EST
I suspect we need http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=53c4c5c85942d4733f4723531c4d325235448326 that is included in 2.67 upstream version.
Comment 6 Moshe Levi 2015-11-23 06:35:28 EST
Yes, this patch should solve the issue.
I also confirmed it with the dnsmasq community:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

This bug was fixed in the 2.67 release. The fix is here:

http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=53c4c5c85942d4733f4723531c4d325235448326

the patch should apply fine to the version you're using, if that suits you best.


Cheers,

Simon.
Comment 11 Ihar Hrachyshka 2015-11-30 08:36:39 EST
I won't describe exact steps to reproduce it, but I believe dnsmasq maintainers should be able to deduce them from the code.

So, the bug was introduced by 'Support IPv6 assignment based on MAC for DHCPv6' patch that we backported before for openstack neutron needs (dnsmasq 2.66-13).

Looking at the code, the following should occur to trigger the trace:

- DHCP should be enabled for the service:

if (daemon->dhcp || daemon->doing_dhcp6)
{
   ...
   lease_update_from_configs();
}

- there should be an existing lease with a client ID:

void lease_update_from_configs(void)
{
    for (lease = leases; lease; lease = lease->next)
    {
        if ((config = find_config(daemon->dhcp_conf, NULL, lease->clid, lease->clid_len, lease->hwaddr, lease->hwaddr_len, lease->hwaddr_type, NULL)) ...
    }
}

- no existing configuration should match for the client ID;

- there should be at least one lease configuration entry that matches based on client ID and that does NOT match an existing lease.
Comment 13 Ihar Hrachyshka 2015-11-30 09:06:57 EST
I also see we have config files attached to the bug. Has anyone actually tried to HUP the service when using the files? I suspect it could just reveal the issue.
Comment 14 Pavel Šimerda (pavlix) 2015-11-30 11:31:27 EST
After an analysis of the code I can say that both the problem and the fix are obvious. There is a single call to `find_config()` with context explicitly set to `NULL` in the code. And that `find_config()` leads to code that doesn't work for a `NULL` context.
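To make the failure mode concrete, here is a minimal sketch of a client-ID lookup that tolerates a `NULL` context by treating it as IPv4. It is not the upstream patch and not the RHEL backport: `sketch_context`, `sketch_config`, `sketch_find_by_clid()` and the flag values are hypothetical, simplified stand-ins, and the tail of the zero-prefix comparison (cut off after `&&` in the gdb output above) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the dnsmasq structures and flags.
   The flag values are illustrative only. */
#define CONTEXT_V6  (1u << 0)
#define CONFIG_CLID (1u << 1)

struct sketch_context {
    unsigned int flags;
};

struct sketch_config {
    unsigned int flags;
    const unsigned char *clid;
    int clid_len;
    struct sketch_config *next;
};

/* Find a config entry by client ID. A NULL context is treated as IPv4,
   so context->flags is never read through a NULL pointer. */
struct sketch_config *sketch_find_by_clid(struct sketch_config *configs,
                                          const struct sketch_context *context,
                                          const unsigned char *clid,
                                          int clid_len)
{
    struct sketch_config *config;
    /* Evaluate the context check once, guarding against NULL first. */
    int is_v4 = (context == NULL) || !(context->flags & CONTEXT_V6);

    assert(clid != NULL && clid_len > 0);

    for (config = configs; config; config = config->next) {
        if (!(config->flags & CONFIG_CLID))
            continue;

        /* Exact client-ID match. */
        if (config->clid_len == clid_len &&
            memcmp(config->clid, clid, (size_t)clid_len) == 0)
            return config;

        /* IPv4 only: also accept a client ID sent with a leading zero byte
           (assumed continuation of the condition at dhcp-common.c:319). */
        if (is_v4 && *clid == 0 && config->clid_len == clid_len - 1 &&
            memcmp(config->clid, clid + 1, (size_t)(clid_len - 1)) == 0)
            return config;
    }
    return NULL;
}
```

With this shape, a reload-style call such as `sketch_find_by_clid(configs, NULL, lease_clid, lease_clid_len)` simply returns `NULL` when no entry matches, instead of faulting on the first client-ID entry that fails the exact comparison.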

(In reply to Ihar Hrachyshka from comment #13)
> I also see we have config files attached to the bug. Has anyone actually
> tried to HUP the service when using the files? I suspect it could just
> reveal the issue.

I second that suspicion. Entering the code path seems to be feasible. Thanks for your help!
Comment 16 Pavel Šimerda (pavlix) 2015-12-08 20:18:57 EST
The following script is enough to show the issue. It manipulates the clid variable using gdb to avoid the need to test on infiniband.

#!/bin/bash -xe

interface=ens9

# prepare
mkdir -p tmp
cat > tmp/leases << EOF
2000000000 02:00:00:00:00:00 192.0.2.2 host *
EOF
cat > tmp/script << EOF
start
advance dhcp-common.c:308
set clid = "*"
continue
quit
EOF

# run
gdb -x tmp/script \
    --args dnsmasq \
        --no-daemon --no-hosts --no-resolv --conf-file= \
        --dhcp-leasefile=tmp/leases \
        --dhcp-range=192.0.2.0,static \
        --dhcp-host=fa:16:3e:3c:ac:55,id:ff:00:00:00:00:00:02:00:00:02:c9:00:fa:16:3e:00:00:3c:ac:55,host,192.0.2.2

# cleanup
rm tmp/leases
rm tmp/script
rmdir tmp
Comment 24 errata-xmlrpc 2016-11-04 02:14:30 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2421.html
