Bug 1451236
Summary: | [abrt] [faf] NetworkManager: g_logv(): /usr/sbin/NetworkManager killed by 5 | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Beniamino Galvani <bgalvani> |
Component: | NetworkManager | Assignee: | Beniamino Galvani <bgalvani> |
Status: | CLOSED ERRATA | QA Contact: | Desktop QE <desktop-qa-list> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 7.3 | CC: | aloughla, atragler, bgalvani, fgiudici, lrintel, msekleta, msugaya, rkhan, sukulkar, thaller, vbenes |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
URL: | http://faf.lab.eng.brq.redhat.com/faf/reports/bthash/babb88d25ffb46540bc0d5efa5454d6c8d901bc2/ | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2018-04-10 13:22:08 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Attachments: |
Description
Beniamino Galvani
2017-05-16 08:13:41 UTC
I think it's of similar nature as bug 1449833. Thread 1 (Thread 0x7f53fe79b700 (LWP 1106)): #0 0x00007f54088de643 in g_logv (log_domain=0x7f5408f1b358 "GLib-GIO", log_level=G_LOG_LEVEL_WARNING, format=<optimized out>, args=args@entry=0x7f53fe799f70) at gmessages.c:1086 domain = 0x0 data = 0x0 depth = 0 log_func = 0x7f54088ddde0 <g_log_default_handler> domain_fatal_mask = <optimized out> masquerade_fatal = <optimized out> test_level = <optimized out> was_fatal = <optimized out> was_recursion = <optimized out> msg = 0x7f53f4007ea0 "Error setting close-on-exec flag on incoming fd: Bad file descriptor" msg_alloc = 0x7f53f4007ea0 "Error setting close-on-exec flag on incoming fd: Bad file descriptor" i = 4 #1 0x00007f54088de7bf in g_log (log_domain=log_domain@entry=0x7f5408f1b358 "GLib-GIO", log_level=log_level@entry=G_LOG_LEVEL_WARNING, format=format@entry=0x7f5408f33880 "Error setting close-on-exec flag on incoming fd: %s") at gmessages.c:1119 args = {{gp_offset = 32, fp_offset = 48, overflow_arg_area = 0x7f53fe79a050, reg_save_area = 0x7f53fe799f90}} #2 0x00007f5408eb3ce2 in g_unix_fd_message_deserialize (level=<optimized out>, type=<optimized out>, size=<optimized out>, data=0x7f53fe79a100) at gunixfdmessage.c:119 message = <optimized out> list = <optimized out> n = 1 s = <optimized out> i = <optimized out> fds = 0x7f53fe79a100 #3 0x00007f5408e942ce in g_socket_control_message_deserialize (level=1, type=1, size=4, data=0x7f53fe79a100) at gsocketcontrolmessage.c:196 class = 0x7f53f4005fd0 message = 0x0 n_message_types = 2 i = 1 #4 0x00007f5408e8e45a in g_socket_receive_message (socket=socket@entry=0x7f53f0007130 [GSocket], address=address@entry=0x0, vectors=<optimized out>, vectors@entry=0x7f53fe79aab0, num_vectors=<optimized out>, num_vectors@entry=1, messages=0x7f53f000be90, num_messages=0x7f53f000be98, flags=flags@entry=0x0, cancellable=0x7f53f0001f70 [GCancellable], error=error@entry=0x7f53fe79aaa8) at gsocket.c:4588 message = <optimized out> my_messages = 0x0 cmsg = 0x7f53fe79a0f0 msg = {msg_name = 0x0, msg_namelen = 0, msg_iov = 0x7f53fe79aab0, msg_iovlen = 1, msg_control = 0x7f53fe79a0f0, msg_controllen = 24, msg_flags = 1073741824} result = <optimized out> one_sockaddr = {ss_family = 35344, __ss_align = 139998552661504, __ss_padding = "\000\000\000\000\000\000\000\000\264\300\275\bT\177\000\000pe\000\360S\177\000\000\230Q\215\bT\177", '\000' <repeats 11 times>, "\301\372rk\236E\365\002\000\000\000\000\000\000\000\332j\214\bT\177\000\000\020\212\357\bT\177\000\000\000\276\000\360S\177\000\000\000\f\000\364S\177\000\000\360\353\000\360S\177\000\000P4\374\rT\177\000\000`\305\000\360S\177\000"} one_vector = {buffer = 0x1, size = 0} one_byte = 0 '\000' __FUNCTION__ = "g_socket_receive_message" #5 0x00007f5408ef6b9e in _g_socket_read_with_control_messages_ready (socket=socket@entry=0x7f53f0007130 [GSocket], condition=condition@entry=G_IO_IN, user_data=user_data@entry=0x7f53f4003200) at gdbusprivate.c:128 task = 0x7f53f4003200 [GTask] data = 0x7f540de520e0 error = 0x0 result = <optimized out> vector = {buffer = 0x7f53f4001e00, size = 16} __FUNCTION__ = "_g_socket_read_with_control_messages_ready" #6 0x00007f5408ef71bf in _g_dbus_worker_do_read_unlocked (callback=0x7f5408ef8a10 <_g_dbus_worker_do_read_cb>, io_priority=0, user_data=<optimized out>, cancellable=0x7f53f0001f70 [GCancellable], num_messages=<optimized out>, messages=<optimized out>, count=16, buffer=0x7f53f4001e00, socket=0x7f53f0007130 [GSocket]) at gdbusprivate.c:180 task = 0x7f53f4003200 [GTask] data = 0x7f540de520e0 source = <optimized out> #7 0x00007f5408ef71bf in _g_dbus_worker_do_read_unlocked (worker=<optimized out>) at gdbusprivate.c:828 #8 0x00007f5408ef8ec2 in _g_dbus_worker_do_read_cb (input_stream=<optimized out>, res=<optimized out>, user_data=0x7f53f000be00) at gdbusprivate.c:774 message = 0x7f540dfc3450 worker = 0x7f53f000be00 error = 0x0 bytes_read = 206 #9 0x00007f5408e9af93 in g_task_return_now (task=0x7f53f40032d0 [GTask]) at gtask.c:1106 #10 0x00007f5408e9afc9 in complete_in_idle_cb (task=0x7f53f40032d0) at gtask.c:1120 #11 0x00007f54088d7d7a in g_main_context_dispatch (context=0x7f53f000c560) at gmain.c:3152 dispatch = 0x7f54088d4a00 <g_idle_dispatch> prev_source = 0x0 was_in_call = 0 user_data = 0x7f53f40032d0 callback = 0x7f5408e9afc0 <complete_in_idle_cb> cb_funcs = 0x7f5408bc38c0 <g_source_callback_funcs> cb_data = 0x7f53f4002ee0 need_destroy = <optimized out> source = 0x7f53f4003eb0 current = 0x7f53f000c380 i = 0 #12 0x00007f54088d7d7a in g_main_context_dispatch (context=context@entry=0x7f53f000c560) at gmain.c:3767 #13 0x00007f54088d80b8 in g_main_context_iterate (context=0x7f53f000c560, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3838 max_priority = 0 timeout = 0 some_ready = 1 nfds = <optimized out> allocated_nfds = 3 fds = 0x7f53f40010c0 #14 0x00007f54088d838a in g_main_loop_run (loop=0x7f53f000e820) at gmain.c:4032 __FUNCTION__ = "g_main_loop_run" #15 0x00007f5408ef6c66 in gdbus_shared_thread_func (user_data=0x7f53f000c530) at gdbusprivate.c:246 data = 0x7f53f000c530 #16 0x00007f54088fd0f5 in g_thread_proxy (data=0x7f540dde2ed0) at gthread.c:778 thread = 0x7f540dde2ed0 #17 0x00007f540a1e3dc5 in start_thread (arg=0x7f53fe79b700) at pthread_create.c:308 __res = <optimized out> pd = 0x7f53fe79b700 now = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139998728402688, 5735798018442000178, 0, 139998728403392, 139998728402688, 139998485595440, -5675000504269146318, -5678562126826440910}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> *** Bug 1451306 has been marked as a duplicate of this bug. *** From analysis of http://faf.lab.eng.brq.redhat.com/faf/reports/2955/ : (gdb) frame 3 #3 0x00007f9376833f32 in g_unix_fd_message_deserialize (level=<optimized out>, type=<optimized out>, size=<optimized out>, data=0x7f9368f3b1f0) at gunixfdmessage.c:119 (gdb) p fds[0] $x = 17 (gdb) frame 6 #6 0x00007f93767bef36 in g_socket_receive_message_with_timeout (socket=socket@entry=0x7f935c008130, address=address@entry=0x0, vectors=<optimized out>, vectors@entry=0x7f9368f3bb60, num_vectors=<optimized out>, num_vectors@entry=1, messages=0x7f935c00e6b0, num_messages=0x7f935c00e6b8, flags=flags@entry=0x0, timeout=0, cancellable=cancellable@entry=0x7f935c0023a0, error=error@entry=0x7f9368f3bb58) at gsocket.c:4845 (gdb) p *socket->priv $x = {family = G_SOCKET_FAMILY_UNIX, type = G_SOCKET_TYPE_STREAM, protocol = G_SOCKET_PROTOCOL_DEFAULT, fd = 8, listen_backlog = 10, timeout = 0, construct_error = 0x0, remote_address = 0x557854756ef0, inited = 1, blocking = 0, keepalive = 0, closed = 0, connected_read = 1, connected_write = 1, listening = 0, timed_out = 0, connect_pending = 0, recv_addr_cache = {{addr = 0x0, native = 0x0, native_len = 0, last_used = 0}, {addr = 0x0, native = 0x0, native_len = 0, last_used = 0}, {addr = 0x0, native = 0x0, native_len = 0, last_used = 0}, {addr = 0x0, native = 0x0, native_len = 0, last_used = 0}, {addr = 0x0, native = 0x0, native_len = 0, last_used = 0}, {addr = 0x0, native = 0x0, native_len = 0, last_used = 0}, {addr = 0x0, native = 0x0, native_len = 0, last_used = 0}, { addr = 0x0, native = 0x0, native_len = 0, last_used = 0}}} (gdb) p *((GUnixSocketAddress *)socket->priv->remote_address)->priv $x = {path = "/var/run/dbus/system_bus_socket", '\000' <repeats 76 times>, path_len = 31, address_type = G_UNIX_SOCKET_ADDRESS_PATH} $ grep ^17: open_fds 17:/proc/sys/net/ipv6/conf/eth6/forwarding (gdb) thread apply 4 bt Thread 4 (Thread 0x7f93782d1940 (LWP 19578)): #0 send () from /lib64/libc.so.6 #1 __vsyslog_chk () from /lib64/libc.so.6 #2 __syslog_chk () from /lib64/libc.so.6 #3 syslog (__fmt=0x557852a24914 "%s%-7s [%ld.%04ld] %s", __pri=<optimized out>) at /usr/include/bits/syslog.h:31 #4 _nm_log_impl (file=file@entry=0x557852a238f8 "src/platform/nm-linux-platform.c", line=line@entry=2643, func=func@entry=0x0, level=level@entry=LOGL_DEBUG, domain=domain@entry=LOGD_PLATFORM, error=error@entry=0, ifname=ifname@entry=0x0, conn_uuid=conn_uuid@entry=0x0, fmt=fmt@entry=0x557852a23c40 "%s: sysctl: setting '%s' to '%s' (current value is identical)") at src/nm-logging.c:725 #5 _log_dbg_sysctl_set_impl (value=0x557852a46f85 "0", path=0x557852cb15c0 <path.32401> "/proc/sys/net/ipv6/conf/eth6/forwarding", dirfd=<optimized out>, pathid=0x557852cb15c0 <path.32401> "/proc/sys/net/ipv6/conf/eth6/forwarding", platform=0x557854733850) at src/platform/nm-linux-platform.c:2643 #6 sysctl_set (platform=0x557854733850, pathid=0x557852cb15c0 <path.32401> "/proc/sys/net/ipv6/conf/eth6/forwarding", dirfd=<optimized out>, path=0x557852cb15c0 <path.32401> "/proc/sys/net/ipv6/conf/eth6/forwarding", value=0x557852a46f85 "0") at src/platform/nm-linux-platform.c:2712 #7 _set_state_full (self=self@entry=0x55785484ecf0, state=state@entry=NM_DEVICE_STATE_DISCONNECTED, reason=<optimized out>, quitting=quitting@entry=0) at src/devices/nm-device.c:12572 #8 nm_device_state_changed (self=self@entry=0x55785484ecf0, state=state@entry=NM_DEVICE_STATE_DISCONNECTED, reason=<optimized out>) at src/devices/nm-device.c:12772 #9 queued_state_set (user_data=user_data@entry=0x55785484ecf0) at src/devices/nm-device.c:12798 The process receives fd 17 as ancillary data in a message from the D-Bus system socket, but when it tries to set FD_CLOEXEC on it, the file descriptor is invalid. At the same time another thread opens "/proc/sys/net/ipv6/conf/eth6/forwarding" and obtains the same file descriptor (17). It seems to me that there is something wrong in the lower layers if we receive an invalid fd over D-Bus. I think this is related to the systemd sleep monitor, which is the only place in NM where we call a D-Bus method returning a Unix fd. Indeed, few lines above the crash: <debug> [1495478072.2589] sleep-monitor-sd: inhibit: taking sleep inhibitor... I'm reassigning this to systemd, not because I believe the bug is in systemd, but only to have someone more familiar with the inhibitor API look at the bug and try to identify which component (dbus-daemon, kernel, glib, ...) is causing the problem. What happens is that when NM calls the D-Bus "Inhibit" method of logind, the returned fd is invalid and GLib throws an assertion just after deserializing the message when it tries to set the FD_CLOEXEC on the file descriptor. We have multiple reports (with core dumps) of the same issue. Feel free to reassign to other components or back to NM if you think we may be doing something wrong. Thanks! I have discussed this issue with David Herrmann (as I was suspecting this is due to some DBus bug) and David is working a lot with DBus these days. However, as of now we think this is NM bug. Most likely close() was called on the fd in one thread, while other thread receives fd via unix socket and then tries to set O_CLOEXEC on fd that was closed by first thread, Possible parallel execution, leading to bug, Thread #1: Read gdbus get FD 17 <pause 1> cloexec on 17 <pause 2> continue Thread #2: <during pause 1> close(17) /* for *whatever* reason */ <during pause 2> open /proc/... and get 17 assigned This hasn't happened anymore since CI machines started using RHEL 7.4, so I guess the bug was fixed implicitly, maybe when the glib version was updated. I'll close this if the problem doesn't appear again in few weeks. *** Bug 1492660 has been marked as a duplicate of this bug. *** (In reply to Beniamino Galvani from comment #10) > This hasn't happened anymore since CI machines started using RHEL 7.4, so I > guess the bug was fixed implicitly, maybe when the glib version was updated. I take it back, it happened again. See details in bug 1492660. *** Bug 1493140 has been marked as a duplicate of this bug. *** *** Bug 1495112 has been marked as a duplicate of this bug. *** (In reply to Beniamino Galvani from comment #12) > (In reply to Beniamino Galvani from comment #10) > > This hasn't happened anymore since CI machines started using RHEL 7.4, so I > > guess the bug was fixed implicitly, maybe when the glib version was updated. > > I take it back, it happened again. See details in bug 1492660. I can reproduce this crash with https://github.com/NetworkManager/NetworkManager-ci/blob/master/nmcli/features/bond.feature#L459 running in cycle. Sooner or latter there is a crash (in less than 20 cycles). It has to run with real devices. With veth interfaces it does happen very very rarely. Could you please create a scratch build again so I can reverify? Created attachment 1339634 [details]
[PATCH] core: don't close input fd in nm_utils_fd_get_contents()
There was a double close(); usually the second close() would simply fail: open("/proc/sys/net/ipv6/conf/eth4/accept_ra", O_RDONLY|O_CLOEXEC) = 15 fcntl(15, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE) close(15) = 0 close(15) = -1 EBADF (Bad file descriptor) but sometimes it closed somebody else's file descriptor: open("/proc/sys/net/ipv6/conf/eth1/accept_ra_defrtr", O_WRONLY|O_TRUNC|O_CLOEXEC) = 16 open("/proc/sys/net/ipv6/conf/eth1/accept_ra_defrtr", O_RDONLY|O_CLOEXEC) = 17 fcntl(17, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE) close(17) = 0 # ... receive fd 17 via D-Bus ... close(17) = 0 # try to set close-on-exec on the new fd fcntl(17, F_SETFD, FD_CLOEXEC) = -1 EBADF (Bad file descriptor) Created attachment 1339638 [details]
[PATCH v2] core: don't close input fd in nm_utils_fd_get_contents()
Created attachment 1339681 [details]
[PATCH v3] core: don't close input fd in nm_utils_fd_get_contents()
v3: don't leak fd on fdopen() error.
(In reply to Beniamino Galvani from comment #20) > Created attachment 1339681 [details] > [PATCH v3] core: don't close input fd in nm_utils_fd_get_contents() > > v3: don't leak fd on fdopen() error. lgtm Applied to master: https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=597072296a19be051ea70b017fd19505ebee6530 and backported to nm-1-8, nm-1-6 branches. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0778 |