Bug 2185878
| Field | Value | Field | Value |
|---|---|---|---|
| Summary | dnsmasq crash when used for DNS caching with NetworkManager | | |
| Product | Red Hat Enterprise Linux 9 | Reporter | Marko Myllynen <myllynen> |
| Component | dnsmasq | Assignee | Petr Menšík <pemensik> |
| Status | CLOSED MIGRATED | QA Contact | Petr Sklenar <psklenar> |
| Severity | unspecified | Priority | unspecified |
| Version | 9.1 | CC | psklenar, walters |
| Target Milestone | rc | Keywords | MigratedToJIRA, Triaged |
| Target Release | --- | Flags | pm-rhel: mirror+ |
| Hardware | Unspecified | OS | Unspecified |
| Fixed In Version | | Doc Type | If docs needed, set a value |
| Story Points | --- | Type | Bug |
| Regression | --- | Last Closed | 2023-09-21 19:06:13 UTC |
| Cloned to | 2186468 (view as bug list) | Bug Blocks | 2186468 |
| Attachments | Candidate-dbus.patch (attachment 1957345) | | |
Description (Marko Myllynen, 2023-04-11 13:31:15 UTC)
I should add that, interestingly enough, the setup described above still works; I think after the crash NM restarts it and then it's running:

# ps auxwww | grep dnsmasq
dnsmasq 1666 0.0 0.0 10456 4096 ? S 16:15 0:00 /usr/sbin/dnsmasq --no-resolv --keep-in-foreground --no-hosts --bind-interfaces --pid-file=/run/NetworkManager/dnsmasq.pid --listen-address=127.0.0.1 --cache-size=400 --clear-on-reload --conf-file=/dev/null --proxy-dnssec --enable-dbus=org.freedesktop.NetworkManager.dnsmasq --conf-dir=/etc/NetworkManager/dnsmasq.d

SELinux is Permissive; the only audit log entry is:

# grep -i dns /var/log/audit/audit.log
type=ANOM_ABEND msg=audit(1681218907.965:646): auid=4294967295 uid=985 gid=980 ses=4294967295 subj=system_u:system_r:dnsmasq_t:s0 pid=5835 comm="dnsmasq" exe="/usr/sbin/dnsmasq" sig=11 res=1 AUID="unset" UID="dnsmasq" GID="dnsmasq"

The most relevant log entries in the system log are:

Apr 11 16:15:43 $HOST NetworkManager[1536]: <info> [1681218943.9239] dnsmasq: starting /usr/sbin/dnsmasq
Apr 11 16:15:43 $HOST dnsmasq[1666]: started, version 2.85 cachesize 400
Apr 11 16:15:43 $HOST dnsmasq[1666]: compile time options: IPv6 GNU-getopt DBus no-UBus no-i18n IDN2 DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth cryptohash DNSSEC loop-detect inotify dumpfile
Apr 11 16:15:43 $HOST dnsmasq[1666]: chown of PID file /run/NetworkManager/dnsmasq.pid failed: Operation not permitted
Apr 11 16:15:43 $HOST dnsmasq[1666]: DBus support enabled: connected to system bus
Apr 11 16:15:43 $HOST dnsmasq[1666]: using only locally-known addresses for domain example.com
Apr 11 16:15:43 $HOST dnsmasq[1666]: read /etc/hosts - 27 addresses
Apr 11 16:15:43 $HOST dnsmasq[1666]: setting upstream servers from DBus
Apr 11 16:15:43 $HOST dnsmasq[1666]: using only locally-known addresses for domain example.com
Apr 11 16:15:43 $HOST dnsmasq[1666]: using nameserver 192.168.1.1#53(via wlan0)
Apr 11 16:15:43 $HOST dnsmasq[1666]: using nameserver 192.168.1.1#53 for domain 1.168.192.in-addr.arpa
Apr 11 16:15:43 $HOST dnsmasq[1666]: read /etc/hosts - 27 addresses

Thanks. I filed a bug about the tiny chown failure at https://bugzilla.redhat.com/show_bug.cgi?id=2185880.

FWIW, I tried to reproduce this on a fresh all-defaults RHEL 9 VM installation plus updates, but there dnsmasq didn't crash. I can't say what's different on the system where it crashed, so hopefully the coredump will contain some clues. Thanks.

Interesting issue, thanks for the core dump.

Core was generated by `/usr/sbin/dnsmasq --no-resolv --keep-in-foreground --no-hosts --bind-interfaces'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 dbus_watch_get_enabled (watch=0x555f992406ac) at ../../dbus/dbus-watch.c:707
707 return watch->enabled;
(gdb) bt
#0 dbus_watch_get_enabled (watch=0x555f992406ac) at ../../dbus/dbus-watch.c:707
#1 0x0000555aca9c8c45 in check_dbus_listeners () at /usr/src/debug/dnsmasq-2.85-5.el9.x86_64/src/dbus.c:804
#2 main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/dnsmasq-2.85-5.el9.x86_64/src/dnsmasq.c:1202
(gdb) p watch
$1 = (DBusWatch *) 0x555f992406ac
(gdb) p *watch
Cannot access memory at address 0x555f992406ac
(gdb) p dnsmasq_daemon->watches
$8 = (struct watch *) 0x0

It crashes in an unusual place. It seems the pointer got invalidated, but the for loop kept using the invalid data afterwards. Most likely remove_watch() was called in reaction to some dbus event, which made the pointer invalid. Upstream code looks the same, so I guess a similar issue is possible on Fedora as well.
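To make the backtrace easier to follow, here is a simplified sketch of the pattern it points at, reconstructed from the frames and the dnsmasq_daemon->watches output above. This is a hedged approximation, not the literal dnsmasq 2.85 source; names like `watches`, `add_watch()` and `check_listeners()` are stand-ins.

```c
/* Simplified sketch of the pattern behind frames #0/#1: dnsmasq keeps the
 * DBusWatch objects handed out by libdbus in a singly linked list and walks
 * that list from its main loop. */
#include <dbus/dbus.h>
#include <stdlib.h>

struct watch {
  DBusWatch *watch;            /* owned by libdbus, not by dnsmasq */
  struct watch *next;
};

static struct watch *watches;  /* stand-in for dnsmasq_daemon->watches */

/* Registered with dbus_connection_set_watch_functions(); libdbus calls this
 * whenever it wants another fd monitored. */
static dbus_bool_t add_watch(DBusWatch *w, void *data)
{
  struct watch *node = malloc(sizeof(*node));

  (void)data;
  if (!node)
    return FALSE;
  node->watch = w;
  node->next = watches;
  watches = node;
  return TRUE;
}

/* Main-loop scan shaped like check_dbus_listeners() (frame #1).  The crash
 * happens when dbus_watch_handle() frees list nodes behind the loop's back
 * and the loop keeps walking the stale chain. */
static void check_listeners(void)
{
  struct watch *w;

  for (w = watches; w; w = w->next)
    if (dbus_watch_get_enabled(w->watch))   /* frame #0: w->watch already freed */
      {
        /* fd readiness checks elided for brevity */
        dbus_watch_handle(w->watch, DBUS_WATCH_READABLE);
        /* ^ may call back into the remove callback and free w or w->next */
      }
}
```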
Are you able to trigger this in a reliable way? How often does it happen?

Thanks for looking into this.

> Are you able to trigger this in a reliable way? How often does it happen?

This happens every time NM is started on the two RHEL 9 systems I have.

Thanks.
Found a simple reproducer. Restarting NetworkManager does not do anything for me, but restarting dbus.service crashes dnsmasq with a very similar backtrace.
Breakpoint 1, remove_watch (watch=0x555c8ecc4a60, data=0x0) at /usr/src/debug/dnsmasq-2.85-6.el9.x86_64/src/dbus.c:122
122 {
(gdb) bt
#0 remove_watch (watch=0x555c8ecc4a60, data=0x0) at /usr/src/debug/dnsmasq-2.85-6.el9.x86_64/src/dbus.c:122
#1 0x00007f4261df2247 in _dbus_watch_list_remove_watch () from target:/lib64/libdbus-1.so.3
#2 0x00007f4261df246d in free_watches () from target:/lib64/libdbus-1.so.3
#3 0x00007f4261df24ed in socket_disconnect.lto_priv () from target:/lib64/libdbus-1.so.3
#4 0x00007f4261dee45b in do_io_error () from target:/lib64/libdbus-1.so.3
#5 0x00007f4261df85f8 in do_reading.part () from target:/lib64/libdbus-1.so.3
#6 0x00007f4261df89ff in socket_handle_watch.lto_priv () from target:/lib64/libdbus-1.so.3
#7 0x00007f4261dddf90 in _dbus_connection_handle_watch () from target:/lib64/libdbus-1.so.3
#8 0x0000555c8df5dc9e in check_dbus_listeners () at /usr/src/debug/dnsmasq-2.85-6.el9.x86_64/src/dbus.c:819
#9 main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/dnsmasq-2.85-6.el9.x86_64/src/dnsmasq.c:1202
It is obvious that dbus_watch_handle() can end up causing a remove_watch() call. Because dbus has two watches registered, removing the first one invalidates the pointer to the one following it, but the loop does not notice any change and tries to continue, so the next loop iteration accesses an invalid pointer.
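Continuing the simplified sketch from the earlier comment (still an approximation, not the literal dnsmasq source), the removal side looks roughly like this: when dbus.service goes away, libdbus tears the connection down (frames #2-#4: do_io_error -> socket_disconnect -> free_watches) and invokes the registered remove callback for every watch, freeing list nodes while the scan in check_dbus_listeners() is still in progress.

```c
/* Removal callback, registered together with add_watch() from the earlier
 * sketch.  Freeing the node here is what leaves the main loop's w / w->next
 * pointers dangling. */
static void remove_watch(DBusWatch *w, void *data)
{
  struct watch **up = &watches, *node = watches, *tmp;

  (void)data;
  for (; node; node = tmp)
    {
      tmp = node->next;
      if (node->watch == w)
        {
          *up = tmp;
          free(node);   /* check_listeners() may still be holding this node */
        }
      else
        up = &node->next;
    }
}

/* One-time registration after connecting to the bus. */
static dbus_bool_t setup_bus_watches(DBusConnection *conn)
{
  return dbus_connection_set_watch_functions(conn,
                                             add_watch,     /* add */
                                             remove_watch,  /* remove */
                                             NULL,          /* toggled */
                                             NULL,          /* user data */
                                             NULL);         /* free data */
}
```

With two watches registered, freeing the first node while the scan still holds a pointer into the list means the next `w = w->next` (or the `dbus_watch_get_enabled()` call on the following iteration) reads freed memory, which matches both backtraces.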
Created attachment 1957345 [details]
Candidate-dbus.patch
Candidate patch adjusted for latest dnsmasq version.
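The attachment itself is not reproduced in this report, so the sketch below only illustrates one generic way to harden the scan; it is an assumption about the approach, not necessarily what Candidate-dbus.patch actually does. The idea, continuing the earlier sketch, is to detect that the watch list was mutated during dbus_watch_handle() and restart the walk instead of following a possibly freed next pointer.

```c
/* Illustration only, not the attached patch: bump a generation counter from
 * add_watch()/remove_watch() whenever the list is mutated, and restart the
 * walk if dbus_watch_handle() changed the list underneath us. */
static unsigned watches_generation;   /* incremented on every list mutation */

static void check_listeners_safe(void)
{
 restart:
  for (struct watch *w = watches; w; w = w->next)
    if (dbus_watch_get_enabled(w->watch))
      {
        unsigned gen = watches_generation;

        /* fd readiness checks elided, as before */
        dbus_watch_handle(w->watch, DBUS_WATCH_READABLE);

        if (gen != watches_generation)
          goto restart;   /* list changed: w and w->next may already be freed */
      }
}
```

Restarting means watches already handled in this pass can be visited again, so a real fix would also re-check fd readiness (elided here) before calling dbus_watch_handle(); the property that matters is that no pointer saved before the callback is dereferenced after the list may have changed.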
Candidate patch merged into dnsmasq. Request: https://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2023q2/017007.html

Hi! Would you know, as a dbus maintainer, whether there are any cases where dbus watches might change during runtime? Is there any case where a watch can be removed but the dbus service will continue to provide service? It seems dbus-broker does not support a restart operation, and services using dbus do not have any good behaviour for a dbus service disconnection. Is there any good behaviour I should implement in dnsmasq for those cases? I am not sure dbus disconnection needs to be handled.

I have found the special handling of the dbus service; it seems it is never restarted. So there should not be a reason to survive its restart, because a restart breaks most services, including NM in my rough testing. Is there a way to reproduce this issue without restarting the dbus service? I haven't found one so far.

Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there. Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information. To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of two footprints next to it and begin with "RHEL-" followed by an integer. You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like: "Bugzilla Bug" = 1234567. In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.