RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1887649 - dnsmasq --bind-dynamic stops replying to upstream queries
Summary: dnsmasq --bind-dynamic stops replying to upstream queries
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: dnsmasq
Version: 7.9
Hardware: x86_64
OS: Linux
urgent
urgent
Target Milestone: rc
: ---
Assignee: Petr Menšík
QA Contact: František Hrdina
URL:
Whiteboard:
Depends On:
Blocks: 1927973
TreeView+ depends on / blocked
 
Reported: 2020-10-13 05:13 UTC by Brendan Shirren
Modified: 2023-12-15 19:45 UTC (History)
9 users (show)

Fixed In Version: dnsmasq-2.76-17.el7_9.1
Doc Type: Bug Fix
Doc Text:
Cause: Netlink reading might be unreliable on busy systems with many interfaces, when more interfaces are reconfigured quickly on the same time. Consequence: dnsmasq with --bind-dynamic option might lose some message and cannnot read current state correctly. It then might create only partially configured listeners and not fix them on subsequent changes, not responding on every address as configured. Internal dnsmasq state might no longer match state of interfaces on the system. Fix: Improve handling of dropped messages. Ensure previous messages are all read before new request and reduce number or required re-reads. Significantly increase probability it would read current state successfully on retry. Result: Listeners are correctly reconfigured, even on busy systems with fast interface changes. dnsmasq adjust to system changes always, even they when mass reconfiguration is done.
Clone Of:
: 1927973 (view as bug list)
Environment:
Last Closed: 2021-04-27 11:34:08 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Brendan Shirren 2020-10-13 05:13:00 UTC
Description of problem: dnsmasq stops replying to queries forwarded upstream until service is restarted.

Version-Release number of selected component (if applicable):

dnsmasq-2.76-10.el7_7.1.x86_64

How reproducible: Intermittent.

Steps to Reproduce:
1. ??
2.
3.

Actual results: no reply to query forwarded upstream even though upstream DNS resolver listening.

Oct  5 01:52:24 dnsmasq[29698]: query[AAAA] docker-registry.default.svc.cluster.local from 10.52.247.118
Oct  5 01:52:24 dnsmasq[29698]: forwarded docker-registry.default.svc.cluster.local to 127.0.0.1
Oct  5 01:52:24 dnsmasq[29698]: query[A] docker-registry.default.svc.cluster.local from 10.52.247.118
Oct  5 01:52:24 dnsmasq[29698]: forwarded docker-registry.default.svc.cluster.local to 127.0.0.1
Oct  5 01:52:29 dnsmasq[29698]: query[AAAA] docker-registry.default.svc.cluster.local from 10.52.247.118
Oct  5 01:52:29 dnsmasq[29698]: forwarded docker-registry.default.svc.cluster.local to 127.0.0.1
Oct  5 01:52:29 dnsmasq[29698]: query[A] docker-registry.default.svc.cluster.local from 10.52.247.118
Oct  5 01:52:29 dnsmasq[29698]: forwarded docker-registry.default.svc.cluster.local to 127.0.0.1
Oct  5 01:52:34 dnsmasq[29698]: query[AAAA] docker-registry.default.svc.cluster.local from 10.52.247.118
Oct  5 01:52:34 dnsmasq[29698]: forwarded docker-registry.default.svc.cluster.local to 127.0.0.1
Oct  5 01:52:34 dnsmasq[29698]: query[A] docker-registry.default.svc.cluster.local from 10.52.247.118
Oct  5 01:52:34 dnsmasq[29698]: forwarded docker-registry.default.svc.cluster.local to 127.0.0.1
Oct  5 01:52:39 dnsmasq[29698]: query[AAAA] docker-registry.default.svc.cluster.local from 10.52.247.118
Oct  5 01:52:39 dnsmasq[29698]: forwarded docker-registry.default.svc.cluster.local to 127.0.0.1
Oct  5 01:52:39 dnsmasq[29698]: query[A] docker-registry.default.svc.cluster.local from 10.52.247.118
Oct  5 01:52:39 dnsmasq[29698]: forwarded docker-registry.default.svc.cluster.local to 127.0.0.1


Expected results:

Oct  5 01:48:52 dnsmasq[29698]: query[AAAA] docker-registry.default.svc.cluster.local from 10.52.247.118
Oct  5 01:48:52 dnsmasq[29698]: forwarded docker-registry.default.svc.cluster.local to 127.0.0.1
Oct  5 01:48:52 dnsmasq[29698]: query[A] docker-registry.default.svc.cluster.local from 10.52.247.118
Oct  5 01:48:52 dnsmasq[29698]: forwarded docker-registry.default.svc.cluster.local to 127.0.0.1
Oct  5 01:48:52 dnsmasq[29698]: reply docker-registry.default.svc.cluster.local is 172.30.200.79


Additional info:

*** placeholder bug while awaiting strace captures and feedback post-upgrade of dnsmasq package ***


Not certain it's duplicate of BZ#1745445 as the queries here are UDP not TCP.

Note1: valid hostnames under search domain from /etc/resolv.conf will fail with NXDOMAIN

Host docker-registry.default.svc not found: 3(NXDOMAIN)

Note2: dnsmasq stops logging its expected frequent "setting upstream servers from DBus" message to journal around time of issue when service is still running.

Comment 4 Petr Menšík 2020-10-13 09:41:18 UTC
Can it be verified by tcpdump on localhost, that all those queries receive replies from 127.0.0.1?

tcpdump -i lo -n port domain

It should be clear whether failure is on side of dnsmasq or server operating on address 127.0.0.1

Comment 5 Brandon Anderson 2020-10-13 21:37:07 UTC
(In reply to Petr Menšík from comment #4)
> Can it be verified by tcpdump on localhost, that all those queries receive
> replies from 127.0.0.1?
> 
> tcpdump -i lo -n port domain
> 
> It should be clear whether failure is on side of dnsmasq or server operating
> on address 127.0.0.1

Received word from the cu that errors are resulting from dnsmasq:

x094846# systemctl status dnsmasq -l
â— dnsmasq.service - DNS caching server.
   Loaded: loaded (/usr/lib/systemd/system/dnsmasq.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/dnsmasq.service.d
           └─override.conf
   Active: active (running) since Sat 2020-10-10 19:20:39 PDT; 2 days ago
 Main PID: 1544 (dnsmasq)
   Memory: 1.6G
   CGroup: /system.slice/dnsmasq.service
           └─1544 /usr/sbin/dnsmasq -k

Oct 13 14:22:20 x094846 dnsmasq[1544]: DBus error: Connection ":1.3" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policies in the configuration file
Oct 13 14:22:20 x094846 dnsmasq[1544]: DBus error: Connection ":1.3" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policies in the configuration file
Oct 13 14:22:20 x094846 dnsmasq[1544]: DBus error: Connection ":1.3" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policies in the configuration file
Oct 13 14:22:20 x094846 dnsmasq[1544]: DBus error: Connection ":1.3" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policies in the configuration file
Oct 13 14:22:20 x094846 dnsmasq[1544]: DBus error: Connection ":1.3" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policies in the configuration file
Oct 13 14:22:20 x094846 dnsmasq[1544]: DBus error: Connection ":1.3" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policies in the configuration file
Oct 13 14:22:20 x094846 dnsmasq[1544]: DBus error: Connection ":1.3" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policies in the configuration file
Oct 13 14:22:20 x094846 dnsmasq[1544]: DBus error: Connection ":1.3" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policies in the configuration file
Oct 13 14:22:20 x094846 dnsmasq[1544]: DBus error: Connection ":1.3" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policies in the configuration file
Oct 13 14:22:20 x094846 dnsmasq[1544]: DBus error: Connection ":1.3" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policies in the configuration file
Oct 13 14:22:21 x094846 dnsmasq[1544]: DBus error: Connection ":1.3" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policies in the configuration file
x094846#

Comment 7 Brendan Shirren 2020-10-16 06:36:01 UTC
dnsmasq-2.76-7.el7_6.1.x86_64 gcore just shows that dnsmasq is waiting in the do_poll() routine:

[New LWP 2457]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/dnsmasq'.
#0  0x00007fa02d833370 in __poll_nocancel ()
    at ../sysdeps/unix/syscall-template.S:81
81	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) bt
#0  0x00007fa02d833370 in __poll_nocancel ()
    at ../sysdeps/unix/syscall-template.S:81
#1  0x000055a4f41ededd in main (argc=<optimized out>, argv=<optimized out>)
    at dnsmasq.c:976
(gdb) 

     /* must do this just before select(), when we know no
	 more calls to my_syslog() can occur */
      set_log_writer();
      
      if (do_poll(timeout) < 0)
	continue;




  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  2457 root      20   0   52856   1912    400 S   0.0  0.0  72:45.27 /usr/sbin/dnsmasq -dqD --log-facility=/var/log/dnsmasq.log

Comment 15 Brendan Shirren 2020-10-25 04:19:47 UTC
# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.9 (Maipo)

# uname -a
Linux rhel7 3.10.0-1160.2.2.el7.x86_64 #1 SMP Sat Oct 17 05:06:47 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

# rpm -q dnsmasq
dnsmasq-2.76-16.el7.x86_64


# cat /etc/dnsmasq.d/origin-dns.conf 
no-resolv
domain-needed
no-negcache
max-cache-ttl=1
enable-dbus
dns-forward-max=10000
cache-size=10000
bind-dynamic
min-port=1024
except-interface=lo


# cat /etc/dnsmasq.d/origin-upstream-dns.conf 
server=192.168.122.1


Oct 25 13:58:31 rhel7 dnsmasq[981]: started, version 2.76 cachesize 10000
Oct 25 13:58:31 rhel7 dnsmasq[981]: compile time options: IPv6 GNU-getopt DBus no-i18n IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
Oct 25 13:58:31 rhel7 dnsmasq[981]: DBus support enabled: connected to system bus
Oct 25 13:58:31 rhel7 dnsmasq[981]: using nameserver 192.168.122.1#53
Oct 25 13:58:31 rhel7 dnsmasq[981]: read /etc/hosts - 2 addresses


# netstat -tunlp | grep ":53 "
tcp        0      0 172.17.0.1:53           0.0.0.0:*               LISTEN      981/dnsmasq         
tcp        0      0 192.168.122.11:53       0.0.0.0:*               LISTEN      981/dnsmasq         
tcp6       0      0 fe80::9e26:cdb7:7e9d:53 :::*                    LISTEN      981/dnsmasq         
udp        0      0 172.17.0.1:53           0.0.0.0:*                           981/dnsmasq         
udp        0      0 192.168.122.11:53       0.0.0.0:*                           981/dnsmasq         
udp6       0      0 fe80::9e26:cdb7:7e9d:53 :::*                                981/dnsmasq    


# host www.redhat.com
www.redhat.com is an alias for ds-www.redhat.com.edgekey.net.
ds-www.redhat.com.edgekey.net is an alias for ds-www.redhat.com.edgekey.net.globalredir.akadns.net.
ds-www.redhat.com.edgekey.net.globalredir.akadns.net is an alias for e3396.dscx.akamaiedge.net.
e3396.dscx.akamaiedge.net has address 104.98.41.96
e3396.dscx.akamaiedge.net has IPv6 address 2600:1415:11:4a7::d44
e3396.dscx.akamaiedge.net has IPv6 address 2600:1415:11:4ad::d44


# BZ#1887649
# for X in `seq 1 254`; do ifconfig eth0:${X} 100.123.1.${X} netmask 255.255.255.0; done
# for X in `seq 1 254`; do ifconfig eth0:${X} down; done
# for X in `seq 1 254`; do ifconfig eth0:${X} 100.123.1.${X} netmask 255.255.255.0; done
# for X in `seq 1 254`; do ifconfig eth0:${X} down; done


# host www.redhat.com
;; connection timed out; no servers could be reached

# host www.google.com
;; connection timed out; no servers could be reached


# strace -p `pidof dnsmasq`
strace: Process 981 attached
recvmsg(3, 


# netstat -tunlp | grep ":53 "
...
tcp        0      0 100.123.1.78:53         0.0.0.0:*               LISTEN      981/dnsmasq         
tcp        0      0 100.123.1.77:53         0.0.0.0:*               LISTEN      981/dnsmasq         
tcp        0      0 100.123.1.76:53         0.0.0.0:*               LISTEN      981/dnsmasq         
tcp        0      0 100.123.1.75:53         0.0.0.0:*               LISTEN      981/dnsmasq         
tcp        0      0 100.123.1.74:53         0.0.0.0:*               LISTEN      981/dnsmasq         
tcp        0      0 100.123.1.73:53         0.0.0.0:*               LISTEN      981/dnsmasq         
tcp        0      0 172.17.0.1:53           0.0.0.0:*               LISTEN      981/dnsmasq         
tcp        0      0 192.168.122.11:53       0.0.0.0:*               LISTEN      981/dnsmasq         
tcp6       0      0 fe80::9e26:cdb7:7e9d:53 :::*                    LISTEN      981/dnsmasq         
...      
udp        0      0 100.123.1.78:53         0.0.0.0:*                           981/dnsmasq         
udp        0      0 100.123.1.77:53         0.0.0.0:*                           981/dnsmasq         
udp        0      0 100.123.1.76:53         0.0.0.0:*                           981/dnsmasq         
udp        0      0 100.123.1.75:53         0.0.0.0:*                           981/dnsmasq         
udp        0      0 100.123.1.74:53         0.0.0.0:*                           981/dnsmasq         
udp        0      0 100.123.1.73:53         0.0.0.0:*                           981/dnsmasq         
udp        0      0 172.17.0.1:53           0.0.0.0:*                           981/dnsmasq         
udp     1536      0 192.168.122.11:53       0.0.0.0:*                           981/dnsmasq         
udp6       0      0 fe80::9e26:cdb7:7e9d:53 :::*                                981/dnsmasq

Comment 22 Petr Menšík 2021-02-12 00:51:41 UTC
Affected is not only RHEL 7, but it manifests also in RHEL 8. According to my tests of fedora versions, it was fixed in dnsmasq 2.81.

The best candidate changes are those:
http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=1627d577af03cdf747285e79fa747b6aaae8033f
http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=b2ed691eb3ca6488a8878f5f3dd950a07b14a9db

Comment 25 Petr Menšík 2021-02-18 09:53:58 UTC
When checking RHEL8 failures, I found some deficiencies in current dnsmasq handling of listeners. It is problematic even in latest upstream release.

Main problem lies in src/network.c, create_listeners function. There is race condition, when interface is appearing and disappearing fast. When interface appears, netlink code detects new address, runs newaddress(). Then it spawns create_listeners for every interface. It can happen udp socket is created successfully, but tcp fails to bind to address with error EADDRNOTAVAIL. It prints warning to log, which is nice. But it does not react otherwise and create listener with just UDP socket, not TCP. Because the listener for the address exists, it would never create TCP socket again, until the address is removed again and appears a new time.

With NETLINK_NO_ENOBUFS socket option applied to netlink socket, it does not happen usually on RHEL7. It needs tweak on RHEL8 to work, even latest release 2.84 does not work on RHEL8, but works on Fedora. The problem with race socket binding is present on RHEL7 as well.

Comment 26 Petr Menšík 2021-03-11 13:04:39 UTC
Changes were merged upstream, they improve ENOBUFS handling and prevent some situations for it.


Related commits:
http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=8b8a4148ec14363d75ad76c1fb9542c85b66b8c5
http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=a8c14745625c0862fd77b6307ea00c2f67c9e5e3

Main commit:
http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=4c0aecc68524c5fd74053244a605e905dc644228

Related commits contains just optimizations. Main commit changes ENOBUFS handling to correctly restart changes fetching after netlink socket got full.

Comment 39 errata-xmlrpc 2021-04-27 11:34:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (dnsmasq bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1391


Note You need to log in before you can comment on or make changes to this bug.