Bug 2186468 - dnsmasq crash when used for DNS caching with NetworkManager [fedora]
Summary: dnsmasq crash when used for DNS caching with NetworkManager [fedora]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: dnsmasq
Version: rawhide
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Petr Menšík
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On: 2185878
Blocks:
TreeView+ depends on / blocked
 
Reported: 2023-04-13 11:52 UTC by Petr Menšík
Modified: 2023-10-08 04:25 UTC (History)
7 users (show)

Fixed In Version: dnsmasq-2.89-5.fc38
Clone Of: 2185878
Environment:
Last Closed: 2023-06-09 02:00:23 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)

Description Petr Menšík 2023-04-13 11:52:01 UTC
+++ This bug was initially created as a clone of Bug #2185878 +++

Description of problem:
I'm using dnsmasq to provide local DNS caching with NetworkManager so NM is responsible for starting and managing dnsmasq. This is an IPv4-only system.

These are the related configuration files in the setup:

# cat /etc/resolv.conf 
nameserver 127.0.0.1
search example.com
options edns0 no-aaaa trust-ad
# cat /etc/NetworkManager/conf.d/50-dns.conf 
[main]
dns=dnsmasq
# cat /etc/NetworkManager/conf.d/99-rc-manager.conf 
[main]
rc-manager=unmanaged
# cat /etc/NetworkManager/dnsmasq.d/zz-local.conf 
addn-hosts=/etc/hosts
domain-needed
bogus-priv
local=/example.com/
local-ttl=10
min-cache-ttl=10
neg-ttl=10

When booting up the system or restarting NM I see new core files appearing:

# file /var/lib/systemd/coredump/core.5835 
/var/lib/systemd/coredump/core.5835: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/dnsmasq --no-resolv --keep-in-foreground --no-hosts --bind-interfaces', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/sbin/dnsmasq', platform: 'x86_64'

I'm not sure if this is the same issue as bug 2182342 but filing just in case and will attach a core dump. Thanks.

Version-Release number of selected component (if applicable):
dnsmasq-2.85-5.el9.x86_64


--- Additional comment from Marko Myllynen on 2023-04-11 15:39:38 CEST ---

I should add that, interestingly enough, the setup described above still works, I think after the crash NM restarts it and then it's running:

# ps auxwww | grep dnsmasq
dnsmasq     1666  0.0  0.0  10456  4096 ?        S    16:15   0:00 /usr/sbin/dnsmasq --no-resolv --keep-in-foreground --no-hosts --bind-interfaces --pid-file=/run/NetworkManager/dnsmasq.pid --listen-address=127.0.0.1 --cache-size=400 --clear-on-reload --conf-file=/dev/null --proxy-dnssec --enable-dbus=org.freedesktop.NetworkManager.dnsmasq --conf-dir=/etc/NetworkManager/dnsmasq.d

SELinux is Permissive, the only audit log entry is:

# grep -i dns /var/log/audit/audit.log
type=ANOM_ABEND msg=audit(1681218907.965:646): auid=4294967295 uid=985 gid=980 ses=4294967295 subj=system_u:system_r:dnsmasq_t:s0 pid=5835 comm="dnsmasq" exe="/usr/sbin/dnsmasq" sig=11 res=1AUID="unset" UID="dnsmasq" GID="dnsmasq"

The most relevant log entries in the system log are:

Apr 11 16:15:43 $HOST NetworkManager[1536]: <info>  [1681218943.9239] dnsmasq: starting /usr/sbin/dnsmasq
Apr 11 16:15:43 $HOST dnsmasq[1666]: started, version 2.85 cachesize 400
Apr 11 16:15:43 $HOST dnsmasq[1666]: compile time options: IPv6 GNU-getopt DBus no-UBus no-i18n IDN2 DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth cryptohash DNSSEC loop-detect inotify dumpfile
Apr 11 16:15:43 $HOST dnsmasq[1666]: chown of PID file /run/NetworkManager/dnsmasq.pid failed: Operation not permitted
Apr 11 16:15:43 $HOST dnsmasq[1666]: DBus support enabled: connected to system bus
Apr 11 16:15:43 $HOST dnsmasq[1666]: using only locally-known addresses for domain example.com
Apr 11 16:15:43 $HOST dnsmasq[1666]: read /etc/hosts - 27 addresses
Apr 11 16:15:43 $HOST dnsmasq[1666]: setting upstream servers from DBus
Apr 11 16:15:43 $HOST dnsmasq[1666]: using only locally-known addresses for domain example.com
Apr 11 16:15:43 $HOST dnsmasq[1666]: using nameserver 192.168.1.1#53(via wlan0)
Apr 11 16:15:43 $HOST dnsmasq[1666]: using nameserver 192.168.1.1#53 for domain 1.168.192.in-addr.arpa
Apr 11 16:15:43 $HOST dnsmasq[1666]: read /etc/hosts - 27 addresses

Thanks.

--- Additional comment from Marko Myllynen on 2023-04-11 15:52:05 CEST ---

I filed a bug about the tiny chown failure at https://bugzilla.redhat.com/show_bug.cgi?id=2185880.

--- Additional comment from Marko Myllynen on 2023-04-12 12:17:40 CEST ---

FWIW, I tried to reproduce this on a fresh all-defaults RHEL 9 VM installation+updates but there dnsmasq didn't crash. I can't say what's different on the system where it crashed so hopefully the coredump will contain some clues. Thanks.

--- Additional comment from Petr Menšík on 2023-04-13 00:40:34 CEST ---

Interesting issue, thanks for the core dump.

Core was generated by `/usr/sbin/dnsmasq --no-resolv --keep-in-foreground --no-hosts --bind-interfaces'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  dbus_watch_get_enabled (watch=0x555f992406ac) at ../../dbus/dbus-watch.c:707
707	  return watch->enabled;

(gdb) bt
#0  dbus_watch_get_enabled (watch=0x555f992406ac) at ../../dbus/dbus-watch.c:707
#1  0x0000555aca9c8c45 in check_dbus_listeners () at /usr/src/debug/dnsmasq-2.85-5.el9.x86_64/src/dbus.c:804
#2  main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/dnsmasq-2.85-5.el9.x86_64/src/dnsmasq.c:1202
(gdb) p watch
$1 = (DBusWatch *) 0x555f992406ac
(gdb) p *watch
Cannot access memory at address 0x555f992406ac

(gdb) p dnsmasq_daemon->watches
$8 = (struct watch *) 0x0

It is on unusual place. It seems the pointer got invalidated, but during the for cycle invalid data were used after that. Most likely remove_watch() were called in reaction to some dbus event, which made the pointer invalid.

Even upstream looks the same. I guess similar issue is possible also on Fedora. Are you able to trigger this in a reliable way? How often does it happen?

--- Additional comment from Marko Myllynen on 2023-04-13 07:10:54 CEST ---

Thanks for looking into this.

> Are you able to trigger this in a reliable way? How often does it happen?

This happens always when NM is started on two RHEL 9 systems I have.

Thanks.

--- Additional comment from Petr Menšík on 2023-04-13 13:09:13 CEST ---

Found out simple reproducer, restarting NetworkManager does not do anything to me. But restarting dbus.service crashes dnsmasq on very similar backtrace.

Breakpoint 1, remove_watch (watch=0x555c8ecc4a60, data=0x0) at /usr/src/debug/dnsmasq-2.85-6.el9.x86_64/src/dbus.c:122
122	{
(gdb) bt
#0  remove_watch (watch=0x555c8ecc4a60, data=0x0) at /usr/src/debug/dnsmasq-2.85-6.el9.x86_64/src/dbus.c:122
#1  0x00007f4261df2247 in _dbus_watch_list_remove_watch () from target:/lib64/libdbus-1.so.3
#2  0x00007f4261df246d in free_watches () from target:/lib64/libdbus-1.so.3
#3  0x00007f4261df24ed in socket_disconnect.lto_priv () from target:/lib64/libdbus-1.so.3
#4  0x00007f4261dee45b in do_io_error () from target:/lib64/libdbus-1.so.3
#5  0x00007f4261df85f8 in do_reading.part () from target:/lib64/libdbus-1.so.3
#6  0x00007f4261df89ff in socket_handle_watch.lto_priv () from target:/lib64/libdbus-1.so.3
#7  0x00007f4261dddf90 in _dbus_connection_handle_watch () from target:/lib64/libdbus-1.so.3
#8  0x0000555c8df5dc9e in check_dbus_listeners () at /usr/src/debug/dnsmasq-2.85-6.el9.x86_64/src/dbus.c:819
#9  main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/dnsmasq-2.85-6.el9.x86_64/src/dnsmasq.c:1202

It is obvious dbus_watch_handle happens to cause remove_watch call. Because there are two watches of dbus, the removal of the first one causes following pointer invalidated. But the loop does not catch any changes and tries to continue. It makes the next loop iteration to access invalid pointer.

Comment 1 Petr Menšík 2023-04-13 12:10:28 UTC
Interesting difference between RHEL9 and Rawhide is systemctl restart dbus no longer triggers call to remove_watch in dnsmasq. So the same way to reproduce it does not work. Code on dnsmasq side is the same as on RHEL9, so I think it could still be crashed, but not so easy.

Tested on:
dbus-libs-1.14.6-1.fc38.x86_64
dnsmasq-2.89-3.fc39.x86_64

Comment 2 Petr Menšík 2023-04-26 15:51:45 UTC
Pushed test into fedora:
https://src.fedoraproject.org/tests/dnsmasq/c/6169c32d0635b96582c2ba833f8e13a7d9b897ca

Comment 3 Petr Menšík 2023-04-27 16:01:18 UTC
Hmm, interesting thing were discovered in new written test. The change prevents crash of dnsmasq. But anyway it stops listening on dbus and after restart it does not continue to react on dbus messages. Dbus becomes disconnected and it does not fix until restarted.

Comment 4 Petr Menšík 2023-04-28 11:10:09 UTC
Hello David,

I have found you as a dbus maintainer. Do you know a good way to handle dbus socket disconnection? Is it even required to handle that? It seems to be restarting dbus-broker.service is prevented by its package. I saw in [1] that dbus restart is indeed not common. Is there any way dbus might disconnect sockets from end application, different than stopping and restarting dbus-broker.service?

Is there any best practice to handle such situation?

[1] https://wiki.ubuntu.com/DbusRestart

Comment 5 Petr Menšík 2023-04-28 11:13:34 UTC
Test prepared as regression test in Fedora:
https://src.fedoraproject.org/tests/dnsmasq//blob/main/f/Regression/dbus-watches

Comment 6 Fedora Update System 2023-06-07 15:23:39 UTC
FEDORA-2023-5efaca2641 has been submitted as an update to Fedora 38. https://bodhi.fedoraproject.org/updates/FEDORA-2023-5efaca2641

Comment 7 Fedora Update System 2023-06-08 01:44:50 UTC
FEDORA-2023-5efaca2641 has been pushed to the Fedora 38 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2023-5efaca2641`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2023-5efaca2641

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 8 Fedora Update System 2023-06-09 02:00:23 UTC
FEDORA-2023-5efaca2641 has been pushed to the Fedora 38 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 9 Red Hat Bugzilla 2023-10-08 04:25:07 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days


Note You need to log in before you can comment on or make changes to this bug.