Bug 1887649
| Summary: | dnsmasq --bind-dynamic stops replying to upstream queries | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Brendan Shirren <bshirren> | |
| Component: | dnsmasq | Assignee: | Petr Menšík <pemensik> | |
| Status: | CLOSED ERRATA | QA Contact: | František Hrdina <fhrdina> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | urgent | |||
| Version: | 7.9 | CC: | aegorenk, braander, bshirren, fkrska, jmaxwell, jreznik, psklenar, rhowe, sjayapra | |
| Target Milestone: | rc | Keywords: | Patch, TestCaseProvided, Triaged, ZStream | |
| Target Release: | --- | Flags: | thozza:
mirror+
|
|
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | dnsmasq-2.76-17.el7_9.1 | Doc Type: | Bug Fix | |
| Doc Text: |
Cause: Netlink reading might be unreliable on busy systems with many interfaces, when more interfaces are reconfigured quickly on the same time.
Consequence: dnsmasq with --bind-dynamic option might lose some message and cannnot read current state correctly. It then might create only partially configured listeners and not fix them on subsequent changes, not responding on every address as configured. Internal dnsmasq state might no longer match state of interfaces on the system.
Fix: Improve handling of dropped messages. Ensure previous messages are all read before new request and reduce number or required re-reads. Significantly increase probability it would read current state successfully on retry.
Result: Listeners are correctly reconfigured, even on busy systems with fast interface changes. dnsmasq adjust to system changes always, even they when mass reconfiguration is done.
|
Story Points: | --- | |
| Clone Of: | ||||
| : | 1927973 (view as bug list) | Environment: | ||
| Last Closed: | 2021-04-27 11:34:08 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1927973 | |||
|
Description
Brendan Shirren
2020-10-13 05:13:00 UTC
Can it be verified by tcpdump on localhost, that all those queries receive replies from 127.0.0.1? tcpdump -i lo -n port domain It should be clear whether failure is on side of dnsmasq or server operating on address 127.0.0.1 (In reply to Petr Menšík from comment #4) > Can it be verified by tcpdump on localhost, that all those queries receive > replies from 127.0.0.1? > > tcpdump -i lo -n port domain > > It should be clear whether failure is on side of dnsmasq or server operating > on address 127.0.0.1 Received word from the cu that errors are resulting from dnsmasq: x094846# systemctl status dnsmasq -l â— dnsmasq.service - DNS caching server. Loaded: loaded (/usr/lib/systemd/system/dnsmasq.service; enabled; vendor preset: disabled) Drop-In: /etc/systemd/system/dnsmasq.service.d └─override.conf Active: active (running) since Sat 2020-10-10 19:20:39 PDT; 2 days ago Main PID: 1544 (dnsmasq) Memory: 1.6G CGroup: /system.slice/dnsmasq.service └─1544 /usr/sbin/dnsmasq -k Oct 13 14:22:20 x094846 dnsmasq[1544]: DBus error: Connection ":1.3" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policies in the configuration file Oct 13 14:22:20 x094846 dnsmasq[1544]: DBus error: Connection ":1.3" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policies in the configuration file Oct 13 14:22:20 x094846 dnsmasq[1544]: DBus error: Connection ":1.3" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policies in the configuration file Oct 13 14:22:20 x094846 dnsmasq[1544]: DBus error: Connection ":1.3" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policies in the configuration file Oct 13 14:22:20 x094846 dnsmasq[1544]: DBus error: Connection ":1.3" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policies in the configuration file Oct 13 14:22:20 x094846 dnsmasq[1544]: DBus error: Connection ":1.3" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policies in the configuration file Oct 13 14:22:20 x094846 dnsmasq[1544]: DBus error: Connection ":1.3" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policies in the configuration file Oct 13 14:22:20 x094846 dnsmasq[1544]: DBus error: Connection ":1.3" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policies in the configuration file Oct 13 14:22:20 x094846 dnsmasq[1544]: DBus error: Connection ":1.3" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policies in the configuration file Oct 13 14:22:20 x094846 dnsmasq[1544]: DBus error: Connection ":1.3" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policies in the configuration file Oct 13 14:22:21 x094846 dnsmasq[1544]: DBus error: Connection ":1.3" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policies in the configuration file x094846# dnsmasq-2.76-7.el7_6.1.x86_64 gcore just shows that dnsmasq is waiting in the do_poll() routine:
[New LWP 2457]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/dnsmasq'.
#0 0x00007fa02d833370 in __poll_nocancel ()
at ../sysdeps/unix/syscall-template.S:81
81 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) bt
#0 0x00007fa02d833370 in __poll_nocancel ()
at ../sysdeps/unix/syscall-template.S:81
#1 0x000055a4f41ededd in main (argc=<optimized out>, argv=<optimized out>)
at dnsmasq.c:976
(gdb)
/* must do this just before select(), when we know no
more calls to my_syslog() can occur */
set_log_writer();
if (do_poll(timeout) < 0)
continue;
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2457 root 20 0 52856 1912 400 S 0.0 0.0 72:45.27 /usr/sbin/dnsmasq -dqD --log-facility=/var/log/dnsmasq.log
# cat /etc/redhat-release Red Hat Enterprise Linux Server release 7.9 (Maipo) # uname -a Linux rhel7 3.10.0-1160.2.2.el7.x86_64 #1 SMP Sat Oct 17 05:06:47 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux # rpm -q dnsmasq dnsmasq-2.76-16.el7.x86_64 # cat /etc/dnsmasq.d/origin-dns.conf no-resolv domain-needed no-negcache max-cache-ttl=1 enable-dbus dns-forward-max=10000 cache-size=10000 bind-dynamic min-port=1024 except-interface=lo # cat /etc/dnsmasq.d/origin-upstream-dns.conf server=192.168.122.1 Oct 25 13:58:31 rhel7 dnsmasq[981]: started, version 2.76 cachesize 10000 Oct 25 13:58:31 rhel7 dnsmasq[981]: compile time options: IPv6 GNU-getopt DBus no-i18n IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify Oct 25 13:58:31 rhel7 dnsmasq[981]: DBus support enabled: connected to system bus Oct 25 13:58:31 rhel7 dnsmasq[981]: using nameserver 192.168.122.1#53 Oct 25 13:58:31 rhel7 dnsmasq[981]: read /etc/hosts - 2 addresses # netstat -tunlp | grep ":53 " tcp 0 0 172.17.0.1:53 0.0.0.0:* LISTEN 981/dnsmasq tcp 0 0 192.168.122.11:53 0.0.0.0:* LISTEN 981/dnsmasq tcp6 0 0 fe80::9e26:cdb7:7e9d:53 :::* LISTEN 981/dnsmasq udp 0 0 172.17.0.1:53 0.0.0.0:* 981/dnsmasq udp 0 0 192.168.122.11:53 0.0.0.0:* 981/dnsmasq udp6 0 0 fe80::9e26:cdb7:7e9d:53 :::* 981/dnsmasq # host www.redhat.com www.redhat.com is an alias for ds-www.redhat.com.edgekey.net. ds-www.redhat.com.edgekey.net is an alias for ds-www.redhat.com.edgekey.net.globalredir.akadns.net. ds-www.redhat.com.edgekey.net.globalredir.akadns.net is an alias for e3396.dscx.akamaiedge.net. e3396.dscx.akamaiedge.net has address 104.98.41.96 e3396.dscx.akamaiedge.net has IPv6 address 2600:1415:11:4a7::d44 e3396.dscx.akamaiedge.net has IPv6 address 2600:1415:11:4ad::d44 # BZ#1887649 # for X in `seq 1 254`; do ifconfig eth0:${X} 100.123.1.${X} netmask 255.255.255.0; done # for X in `seq 1 254`; do ifconfig eth0:${X} down; done # for X in `seq 1 254`; do ifconfig eth0:${X} 100.123.1.${X} netmask 255.255.255.0; done # for X in `seq 1 254`; do ifconfig eth0:${X} down; done # host www.redhat.com ;; connection timed out; no servers could be reached # host www.google.com ;; connection timed out; no servers could be reached # strace -p `pidof dnsmasq` strace: Process 981 attached recvmsg(3, # netstat -tunlp | grep ":53 " ... tcp 0 0 100.123.1.78:53 0.0.0.0:* LISTEN 981/dnsmasq tcp 0 0 100.123.1.77:53 0.0.0.0:* LISTEN 981/dnsmasq tcp 0 0 100.123.1.76:53 0.0.0.0:* LISTEN 981/dnsmasq tcp 0 0 100.123.1.75:53 0.0.0.0:* LISTEN 981/dnsmasq tcp 0 0 100.123.1.74:53 0.0.0.0:* LISTEN 981/dnsmasq tcp 0 0 100.123.1.73:53 0.0.0.0:* LISTEN 981/dnsmasq tcp 0 0 172.17.0.1:53 0.0.0.0:* LISTEN 981/dnsmasq tcp 0 0 192.168.122.11:53 0.0.0.0:* LISTEN 981/dnsmasq tcp6 0 0 fe80::9e26:cdb7:7e9d:53 :::* LISTEN 981/dnsmasq ... udp 0 0 100.123.1.78:53 0.0.0.0:* 981/dnsmasq udp 0 0 100.123.1.77:53 0.0.0.0:* 981/dnsmasq udp 0 0 100.123.1.76:53 0.0.0.0:* 981/dnsmasq udp 0 0 100.123.1.75:53 0.0.0.0:* 981/dnsmasq udp 0 0 100.123.1.74:53 0.0.0.0:* 981/dnsmasq udp 0 0 100.123.1.73:53 0.0.0.0:* 981/dnsmasq udp 0 0 172.17.0.1:53 0.0.0.0:* 981/dnsmasq udp 1536 0 192.168.122.11:53 0.0.0.0:* 981/dnsmasq udp6 0 0 fe80::9e26:cdb7:7e9d:53 :::* 981/dnsmasq Affected is not only RHEL 7, but it manifests also in RHEL 8. According to my tests of fedora versions, it was fixed in dnsmasq 2.81. The best candidate changes are those: http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=1627d577af03cdf747285e79fa747b6aaae8033f http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=b2ed691eb3ca6488a8878f5f3dd950a07b14a9db When checking RHEL8 failures, I found some deficiencies in current dnsmasq handling of listeners. It is problematic even in latest upstream release. Main problem lies in src/network.c, create_listeners function. There is race condition, when interface is appearing and disappearing fast. When interface appears, netlink code detects new address, runs newaddress(). Then it spawns create_listeners for every interface. It can happen udp socket is created successfully, but tcp fails to bind to address with error EADDRNOTAVAIL. It prints warning to log, which is nice. But it does not react otherwise and create listener with just UDP socket, not TCP. Because the listener for the address exists, it would never create TCP socket again, until the address is removed again and appears a new time. With NETLINK_NO_ENOBUFS socket option applied to netlink socket, it does not happen usually on RHEL7. It needs tweak on RHEL8 to work, even latest release 2.84 does not work on RHEL8, but works on Fedora. The problem with race socket binding is present on RHEL7 as well. Changes were merged upstream, they improve ENOBUFS handling and prevent some situations for it. Related commits: http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=8b8a4148ec14363d75ad76c1fb9542c85b66b8c5 http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=a8c14745625c0862fd77b6307ea00c2f67c9e5e3 Main commit: http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=4c0aecc68524c5fd74053244a605e905dc644228 Related commits contains just optimizations. Main commit changes ENOBUFS handling to correctly restart changes fetching after netlink socket got full. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (dnsmasq bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:1391 |