Bug 2185878
| Summary: | dnsmasq crash when used for DNS caching with NetworkManager | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Marko Myllynen <myllynen> | ||||
| Component: | dnsmasq | Assignee: | Petr Menšík <pemensik> | ||||
| Status: | ASSIGNED --- | QA Contact: | Petr Sklenar <psklenar> | ||||
| Severity: | unspecified | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 9.1 | CC: | psklenar, walters | ||||
| Target Milestone: | rc | Keywords: | Triaged | ||||
| Target Release: | --- | Flags: | pemensik:
needinfo?
(walters) |
||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | |||||||
| : | 2186468 (view as bug list) | Environment: | |||||
| Last Closed: | Type: | Bug | |||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 2186468 | ||||||
| Attachments: |
|
||||||
|
Description
Marko Myllynen
2023-04-11 13:31:15 UTC
I should add that, interestingly enough, the setup described above still works, I think after the crash NM restarts it and then it's running: # ps auxwww | grep dnsmasq dnsmasq 1666 0.0 0.0 10456 4096 ? S 16:15 0:00 /usr/sbin/dnsmasq --no-resolv --keep-in-foreground --no-hosts --bind-interfaces --pid-file=/run/NetworkManager/dnsmasq.pid --listen-address=127.0.0.1 --cache-size=400 --clear-on-reload --conf-file=/dev/null --proxy-dnssec --enable-dbus=org.freedesktop.NetworkManager.dnsmasq --conf-dir=/etc/NetworkManager/dnsmasq.d SELinux is Permissive, the only audit log entry is: # grep -i dns /var/log/audit/audit.log type=ANOM_ABEND msg=audit(1681218907.965:646): auid=4294967295 uid=985 gid=980 ses=4294967295 subj=system_u:system_r:dnsmasq_t:s0 pid=5835 comm="dnsmasq" exe="/usr/sbin/dnsmasq" sig=11 res=1AUID="unset" UID="dnsmasq" GID="dnsmasq" The most relevant log entries in the system log are: Apr 11 16:15:43 $HOST NetworkManager[1536]: <info> [1681218943.9239] dnsmasq: starting /usr/sbin/dnsmasq Apr 11 16:15:43 $HOST dnsmasq[1666]: started, version 2.85 cachesize 400 Apr 11 16:15:43 $HOST dnsmasq[1666]: compile time options: IPv6 GNU-getopt DBus no-UBus no-i18n IDN2 DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth cryptohash DNSSEC loop-detect inotify dumpfile Apr 11 16:15:43 $HOST dnsmasq[1666]: chown of PID file /run/NetworkManager/dnsmasq.pid failed: Operation not permitted Apr 11 16:15:43 $HOST dnsmasq[1666]: DBus support enabled: connected to system bus Apr 11 16:15:43 $HOST dnsmasq[1666]: using only locally-known addresses for domain example.com Apr 11 16:15:43 $HOST dnsmasq[1666]: read /etc/hosts - 27 addresses Apr 11 16:15:43 $HOST dnsmasq[1666]: setting upstream servers from DBus Apr 11 16:15:43 $HOST dnsmasq[1666]: using only locally-known addresses for domain example.com Apr 11 16:15:43 $HOST dnsmasq[1666]: using nameserver 192.168.1.1#53(via wlan0) Apr 11 16:15:43 $HOST dnsmasq[1666]: using nameserver 192.168.1.1#53 for domain 1.168.192.in-addr.arpa Apr 11 16:15:43 $HOST dnsmasq[1666]: read /etc/hosts - 27 addresses Thanks. I filed a bug about the tiny chown failure at https://bugzilla.redhat.com/show_bug.cgi?id=2185880. FWIW, I tried to reproduce this on a fresh all-defaults RHEL 9 VM installation+updates but there dnsmasq didn't crash. I can't say what's different on the system where it crashed so hopefully the coredump will contain some clues. Thanks. Interesting issue, thanks for the core dump. Core was generated by `/usr/sbin/dnsmasq --no-resolv --keep-in-foreground --no-hosts --bind-interfaces'. Program terminated with signal SIGSEGV, Segmentation fault. #0 dbus_watch_get_enabled (watch=0x555f992406ac) at ../../dbus/dbus-watch.c:707 707 return watch->enabled; (gdb) bt #0 dbus_watch_get_enabled (watch=0x555f992406ac) at ../../dbus/dbus-watch.c:707 #1 0x0000555aca9c8c45 in check_dbus_listeners () at /usr/src/debug/dnsmasq-2.85-5.el9.x86_64/src/dbus.c:804 #2 main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/dnsmasq-2.85-5.el9.x86_64/src/dnsmasq.c:1202 (gdb) p watch $1 = (DBusWatch *) 0x555f992406ac (gdb) p *watch Cannot access memory at address 0x555f992406ac (gdb) p dnsmasq_daemon->watches $8 = (struct watch *) 0x0 It is on unusual place. It seems the pointer got invalidated, but during the for cycle invalid data were used after that. Most likely remove_watch() were called in reaction to some dbus event, which made the pointer invalid. Even upstream looks the same. I guess similar issue is possible also on Fedora. Are you able to trigger this in a reliable way? How often does it happen? Thanks for looking into this.
> Are you able to trigger this in a reliable way? How often does it happen?
This happens always when NM is started on two RHEL 9 systems I have.
Thanks.
Found out simple reproducer, restarting NetworkManager does not do anything to me. But restarting dbus.service crashes dnsmasq on very similar backtrace.
Breakpoint 1, remove_watch (watch=0x555c8ecc4a60, data=0x0) at /usr/src/debug/dnsmasq-2.85-6.el9.x86_64/src/dbus.c:122
122 {
(gdb) bt
#0 remove_watch (watch=0x555c8ecc4a60, data=0x0) at /usr/src/debug/dnsmasq-2.85-6.el9.x86_64/src/dbus.c:122
#1 0x00007f4261df2247 in _dbus_watch_list_remove_watch () from target:/lib64/libdbus-1.so.3
#2 0x00007f4261df246d in free_watches () from target:/lib64/libdbus-1.so.3
#3 0x00007f4261df24ed in socket_disconnect.lto_priv () from target:/lib64/libdbus-1.so.3
#4 0x00007f4261dee45b in do_io_error () from target:/lib64/libdbus-1.so.3
#5 0x00007f4261df85f8 in do_reading.part () from target:/lib64/libdbus-1.so.3
#6 0x00007f4261df89ff in socket_handle_watch.lto_priv () from target:/lib64/libdbus-1.so.3
#7 0x00007f4261dddf90 in _dbus_connection_handle_watch () from target:/lib64/libdbus-1.so.3
#8 0x0000555c8df5dc9e in check_dbus_listeners () at /usr/src/debug/dnsmasq-2.85-6.el9.x86_64/src/dbus.c:819
#9 main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/dnsmasq-2.85-6.el9.x86_64/src/dnsmasq.c:1202
It is obvious dbus_watch_handle happens to cause remove_watch call. Because there are two watches of dbus, the removal of the first one causes following pointer invalidated. But the loop does not catch any changes and tries to continue. It makes the next loop iteration to access invalid pointer.
Created attachment 1957345 [details]
Candidate-dbus.patch
Candidate patch adjusted for latest dnsmasq version.
Candidate patch merged into dnsmasq. Request: https://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2023q2/017007.html Hi! would you know as a dbus maintainer, are there any cases where dbus watches might change during runtime? Is there any case, where watch can be removed but the dbus service will continue to provide dbus service? It seems to be dbus-broker does not have supported restart operation and services using dbus do not have any good behaviour for dbus service disconnection. Is there any good behaviour I should implement in dnsmasq for those cases? |