Bug 2185878

Summary: dnsmasq crash when used for DNS caching with NetworkManager
Product: Red Hat Enterprise Linux 9 Reporter: Marko Myllynen <myllynen>
Component: dnsmasqAssignee: Petr Menšík <pemensik>
Status: CLOSED MIGRATED QA Contact: Petr Sklenar <psklenar>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 9.1CC: psklenar, walters
Target Milestone: rcKeywords: MigratedToJIRA, Triaged
Target Release: ---Flags: pm-rhel: mirror+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2186468 (view as bug list) Environment:
Last Closed: 2023-09-21 19:06:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2186468    
Attachments:
Description Flags
Candidate-dbus.patch none

Description Marko Myllynen 2023-04-11 13:31:15 UTC
Description of problem:
I'm using dnsmasq to provide local DNS caching with NetworkManager so NM is responsible for starting and managing dnsmasq. This is an IPv4-only system.

These are the related configuration files in the setup:

# cat /etc/resolv.conf 
nameserver 127.0.0.1
search example.com
options edns0 no-aaaa trust-ad
# cat /etc/NetworkManager/conf.d/50-dns.conf 
[main]
dns=dnsmasq
# cat /etc/NetworkManager/conf.d/99-rc-manager.conf 
[main]
rc-manager=unmanaged
# cat /etc/NetworkManager/dnsmasq.d/zz-local.conf 
addn-hosts=/etc/hosts
domain-needed
bogus-priv
local=/example.com/
local-ttl=10
min-cache-ttl=10
neg-ttl=10

When booting up the system or restarting NM I see new core files appearing:

# file /var/lib/systemd/coredump/core.5835 
/var/lib/systemd/coredump/core.5835: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/dnsmasq --no-resolv --keep-in-foreground --no-hosts --bind-interfaces', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/sbin/dnsmasq', platform: 'x86_64'

I'm not sure if this is the same issue as bug 2182342 but filing just in case and will attach a core dump. Thanks.

Version-Release number of selected component (if applicable):
dnsmasq-2.85-5.el9.x86_64

Comment 2 Marko Myllynen 2023-04-11 13:39:38 UTC
I should add that, interestingly enough, the setup described above still works, I think after the crash NM restarts it and then it's running:

# ps auxwww | grep dnsmasq
dnsmasq     1666  0.0  0.0  10456  4096 ?        S    16:15   0:00 /usr/sbin/dnsmasq --no-resolv --keep-in-foreground --no-hosts --bind-interfaces --pid-file=/run/NetworkManager/dnsmasq.pid --listen-address=127.0.0.1 --cache-size=400 --clear-on-reload --conf-file=/dev/null --proxy-dnssec --enable-dbus=org.freedesktop.NetworkManager.dnsmasq --conf-dir=/etc/NetworkManager/dnsmasq.d

SELinux is Permissive, the only audit log entry is:

# grep -i dns /var/log/audit/audit.log
type=ANOM_ABEND msg=audit(1681218907.965:646): auid=4294967295 uid=985 gid=980 ses=4294967295 subj=system_u:system_r:dnsmasq_t:s0 pid=5835 comm="dnsmasq" exe="/usr/sbin/dnsmasq" sig=11 res=1AUID="unset" UID="dnsmasq" GID="dnsmasq"

The most relevant log entries in the system log are:

Apr 11 16:15:43 $HOST NetworkManager[1536]: <info>  [1681218943.9239] dnsmasq: starting /usr/sbin/dnsmasq
Apr 11 16:15:43 $HOST dnsmasq[1666]: started, version 2.85 cachesize 400
Apr 11 16:15:43 $HOST dnsmasq[1666]: compile time options: IPv6 GNU-getopt DBus no-UBus no-i18n IDN2 DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth cryptohash DNSSEC loop-detect inotify dumpfile
Apr 11 16:15:43 $HOST dnsmasq[1666]: chown of PID file /run/NetworkManager/dnsmasq.pid failed: Operation not permitted
Apr 11 16:15:43 $HOST dnsmasq[1666]: DBus support enabled: connected to system bus
Apr 11 16:15:43 $HOST dnsmasq[1666]: using only locally-known addresses for domain example.com
Apr 11 16:15:43 $HOST dnsmasq[1666]: read /etc/hosts - 27 addresses
Apr 11 16:15:43 $HOST dnsmasq[1666]: setting upstream servers from DBus
Apr 11 16:15:43 $HOST dnsmasq[1666]: using only locally-known addresses for domain example.com
Apr 11 16:15:43 $HOST dnsmasq[1666]: using nameserver 192.168.1.1#53(via wlan0)
Apr 11 16:15:43 $HOST dnsmasq[1666]: using nameserver 192.168.1.1#53 for domain 1.168.192.in-addr.arpa
Apr 11 16:15:43 $HOST dnsmasq[1666]: read /etc/hosts - 27 addresses

Thanks.

Comment 3 Marko Myllynen 2023-04-11 13:52:05 UTC
I filed a bug about the tiny chown failure at https://bugzilla.redhat.com/show_bug.cgi?id=2185880.

Comment 4 Marko Myllynen 2023-04-12 10:17:40 UTC
FWIW, I tried to reproduce this on a fresh all-defaults RHEL 9 VM installation+updates but there dnsmasq didn't crash. I can't say what's different on the system where it crashed so hopefully the coredump will contain some clues. Thanks.

Comment 5 Petr Menšík 2023-04-12 22:40:34 UTC
Interesting issue, thanks for the core dump.

Core was generated by `/usr/sbin/dnsmasq --no-resolv --keep-in-foreground --no-hosts --bind-interfaces'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  dbus_watch_get_enabled (watch=0x555f992406ac) at ../../dbus/dbus-watch.c:707
707	  return watch->enabled;

(gdb) bt
#0  dbus_watch_get_enabled (watch=0x555f992406ac) at ../../dbus/dbus-watch.c:707
#1  0x0000555aca9c8c45 in check_dbus_listeners () at /usr/src/debug/dnsmasq-2.85-5.el9.x86_64/src/dbus.c:804
#2  main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/dnsmasq-2.85-5.el9.x86_64/src/dnsmasq.c:1202
(gdb) p watch
$1 = (DBusWatch *) 0x555f992406ac
(gdb) p *watch
Cannot access memory at address 0x555f992406ac

(gdb) p dnsmasq_daemon->watches
$8 = (struct watch *) 0x0

It is on unusual place. It seems the pointer got invalidated, but during the for cycle invalid data were used after that. Most likely remove_watch() were called in reaction to some dbus event, which made the pointer invalid.

Even upstream looks the same. I guess similar issue is possible also on Fedora. Are you able to trigger this in a reliable way? How often does it happen?

Comment 6 Marko Myllynen 2023-04-13 05:10:54 UTC
Thanks for looking into this.

> Are you able to trigger this in a reliable way? How often does it happen?

This happens always when NM is started on two RHEL 9 systems I have.

Thanks.

Comment 8 Petr Menšík 2023-04-13 11:09:13 UTC
Found out simple reproducer, restarting NetworkManager does not do anything to me. But restarting dbus.service crashes dnsmasq on very similar backtrace.

Breakpoint 1, remove_watch (watch=0x555c8ecc4a60, data=0x0) at /usr/src/debug/dnsmasq-2.85-6.el9.x86_64/src/dbus.c:122
122	{
(gdb) bt
#0  remove_watch (watch=0x555c8ecc4a60, data=0x0) at /usr/src/debug/dnsmasq-2.85-6.el9.x86_64/src/dbus.c:122
#1  0x00007f4261df2247 in _dbus_watch_list_remove_watch () from target:/lib64/libdbus-1.so.3
#2  0x00007f4261df246d in free_watches () from target:/lib64/libdbus-1.so.3
#3  0x00007f4261df24ed in socket_disconnect.lto_priv () from target:/lib64/libdbus-1.so.3
#4  0x00007f4261dee45b in do_io_error () from target:/lib64/libdbus-1.so.3
#5  0x00007f4261df85f8 in do_reading.part () from target:/lib64/libdbus-1.so.3
#6  0x00007f4261df89ff in socket_handle_watch.lto_priv () from target:/lib64/libdbus-1.so.3
#7  0x00007f4261dddf90 in _dbus_connection_handle_watch () from target:/lib64/libdbus-1.so.3
#8  0x0000555c8df5dc9e in check_dbus_listeners () at /usr/src/debug/dnsmasq-2.85-6.el9.x86_64/src/dbus.c:819
#9  main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/dnsmasq-2.85-6.el9.x86_64/src/dnsmasq.c:1202

It is obvious dbus_watch_handle happens to cause remove_watch call. Because there are two watches of dbus, the removal of the first one causes following pointer invalidated. But the loop does not catch any changes and tries to continue. It makes the next loop iteration to access invalid pointer.

Comment 9 Petr Menšík 2023-04-13 11:55:13 UTC
Created attachment 1957345 [details]
Candidate-dbus.patch

Candidate patch adjusted for latest dnsmasq version.

Comment 10 Petr Menšík 2023-04-19 12:25:24 UTC
Candidate patch merged into dnsmasq. Request:
https://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2023q2/017007.html

Comment 11 Petr Menšík 2023-05-22 14:26:28 UTC
Hi! would you know as a dbus maintainer, are there any cases where dbus watches might change during runtime? Is there any case, where watch can be removed but the dbus service will continue to provide dbus service? It seems to be dbus-broker does not have supported restart operation and services using dbus do not have any good behaviour for dbus service disconnection.

Is there any good behaviour I should implement in dnsmasq for those cases?

Comment 21 Petr Menšík 2023-09-07 15:24:04 UTC
I am not sure dbus disconnection needs to be handled.

I have found special handling of dbus service. it seems it is never restarted.
So there should not be reason for surviving its restart, because it breaks most of services. Including NM to my rough testing.

Is there a way to reproduce this issue without restarting dbus service? I haven't found some so far.

Comment 22 RHEL Program Management 2023-09-21 18:52:35 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 23 RHEL Program Management 2023-09-21 19:06:13 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated.  Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.

Comment 24 Red Hat Bugzilla 2024-01-20 04:25:44 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days