Bug 1451236

Summary: [abrt] [faf] NetworkManager: g_logv(): /usr/sbin/NetworkManager killed by 5
Product: Red Hat Enterprise Linux 7 Reporter: Beniamino Galvani <bgalvani>
Component: NetworkManagerAssignee: Beniamino Galvani <bgalvani>
Status: CLOSED ERRATA QA Contact: Desktop QE <desktop-qa-list>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.3CC: aloughla, atragler, bgalvani, fgiudici, lrintel, msekleta, msugaya, rkhan, sukulkar, thaller, vbenes
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
URL: http://faf.lab.eng.brq.redhat.com/faf/reports/bthash/babb88d25ffb46540bc0d5efa5454d6c8d901bc2/
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-04-10 13:22:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
[PATCH] core: don't close input fd in nm_utils_fd_get_contents()
none
[PATCH v2] core: don't close input fd in nm_utils_fd_get_contents()
none
[PATCH v3] core: don't close input fd in nm_utils_fd_get_contents() none

Description Beniamino Galvani 2017-05-16 08:13:41 UTC
This bug has been created based on an anonymous crash report requested by the package maintainer.

Report URL: http://faf.lab.eng.brq.redhat.com/faf/reports/bthash/babb88d25ffb46540bc0d5efa5454d6c8d901bc2/

 GLib-GIO-WARNING **: Error setting close-on-exec flag on incoming fd: Bad file descriptor

 #0  g_logv
 #1  g_log
 #2  g_unix_fd_message_deserialize
 #3  g_socket_control_message_deserialize
 #4  g_socket_receive_message
 #5  _g_socket_read_with_control_messages_ready
 #6  _g_dbus_worker_do_read_unlocked
 #7  _g_dbus_worker_do_read_unlocked
 #8  _g_dbus_worker_do_read_cb
 #9  g_task_return_now
 #10 complete_in_idle_cb
 #11 g_main_context_dispatch
 #12 g_main_context_dispatch
 #13 g_main_context_iterate
 #14 g_main_loop_run
 #15 gdbus_shared_thread_func
 #16 g_thread_proxy
 #17 start_thread
 #18 clone ()

Comment 2 Thomas Haller 2017-05-16 08:21:02 UTC
I think it's of similar nature as bug 1449833.

Comment 3 Thomas Haller 2017-05-16 13:54:03 UTC
Thread 1 (Thread 0x7f53fe79b700 (LWP 1106)):
#0  0x00007f54088de643 in g_logv (log_domain=0x7f5408f1b358 "GLib-GIO", log_level=G_LOG_LEVEL_WARNING, format=<optimized out>, args=args@entry=0x7f53fe799f70) at gmessages.c:1086
        domain = 0x0
        data = 0x0
        depth = 0
        log_func = 0x7f54088ddde0 <g_log_default_handler>
        domain_fatal_mask = <optimized out>
        masquerade_fatal = <optimized out>
        test_level = <optimized out>
        was_fatal = <optimized out>
        was_recursion = <optimized out>
        msg = 0x7f53f4007ea0 "Error setting close-on-exec flag on incoming fd: Bad file descriptor"
        msg_alloc = 0x7f53f4007ea0 "Error setting close-on-exec flag on incoming fd: Bad file descriptor"
        i = 4
#1  0x00007f54088de7bf in g_log (log_domain=log_domain@entry=0x7f5408f1b358 "GLib-GIO", log_level=log_level@entry=G_LOG_LEVEL_WARNING, format=format@entry=0x7f5408f33880 "Error setting close-on-exec flag on incoming fd: %s") at gmessages.c:1119
        args = {{gp_offset = 32, fp_offset = 48, overflow_arg_area = 0x7f53fe79a050, reg_save_area = 0x7f53fe799f90}}
#2  0x00007f5408eb3ce2 in g_unix_fd_message_deserialize (level=<optimized out>, type=<optimized out>, size=<optimized out>, data=0x7f53fe79a100) at gunixfdmessage.c:119
        message = <optimized out>
        list = <optimized out>
        n = 1
        s = <optimized out>
        i = <optimized out>
        fds = 0x7f53fe79a100
#3  0x00007f5408e942ce in g_socket_control_message_deserialize (level=1, type=1, size=4, data=0x7f53fe79a100) at gsocketcontrolmessage.c:196
        class = 0x7f53f4005fd0
        message = 0x0
        n_message_types = 2
        i = 1
#4  0x00007f5408e8e45a in g_socket_receive_message (socket=socket@entry=0x7f53f0007130 [GSocket], address=address@entry=0x0, vectors=<optimized out>, vectors@entry=0x7f53fe79aab0, num_vectors=<optimized out>, num_vectors@entry=1, messages=0x7f53f000be90, num_messages=0x7f53f000be98, flags=flags@entry=0x0, cancellable=0x7f53f0001f70 [GCancellable], error=error@entry=0x7f53fe79aaa8) at gsocket.c:4588
        message = <optimized out>
        my_messages = 0x0
        cmsg = 0x7f53fe79a0f0
        msg = {msg_name = 0x0, msg_namelen = 0, msg_iov = 0x7f53fe79aab0, msg_iovlen = 1, msg_control = 0x7f53fe79a0f0, msg_controllen = 24, msg_flags = 1073741824}
        result = <optimized out>
        one_sockaddr = {ss_family = 35344, __ss_align = 139998552661504, __ss_padding = "\000\000\000\000\000\000\000\000\264\300\275\bT\177\000\000pe\000\360S\177\000\000\230Q\215\bT\177", '\000' <repeats 11 times>, "\301\372rk\236E\365\002\000\000\000\000\000\000\000\332j\214\bT\177\000\000\020\212\357\bT\177\000\000\000\276\000\360S\177\000\000\000\f\000\364S\177\000\000\360\353\000\360S\177\000\000P4\374\rT\177\000\000`\305\000\360S\177\000"}
        one_vector = {buffer = 0x1, size = 0}
        one_byte = 0 '\000'
        __FUNCTION__ = "g_socket_receive_message"
#5  0x00007f5408ef6b9e in _g_socket_read_with_control_messages_ready (socket=socket@entry=0x7f53f0007130 [GSocket], condition=condition@entry=G_IO_IN, user_data=user_data@entry=0x7f53f4003200) at gdbusprivate.c:128
        task = 0x7f53f4003200 [GTask]
        data = 0x7f540de520e0
        error = 0x0
        result = <optimized out>
        vector = {buffer = 0x7f53f4001e00, size = 16}
        __FUNCTION__ = "_g_socket_read_with_control_messages_ready"
#6  0x00007f5408ef71bf in _g_dbus_worker_do_read_unlocked (callback=0x7f5408ef8a10 <_g_dbus_worker_do_read_cb>, io_priority=0, user_data=<optimized out>, cancellable=0x7f53f0001f70 [GCancellable], num_messages=<optimized out>, messages=<optimized out>, count=16, buffer=0x7f53f4001e00, socket=0x7f53f0007130 [GSocket]) at gdbusprivate.c:180
        task = 0x7f53f4003200 [GTask]
        data = 0x7f540de520e0
        source = <optimized out>
#7  0x00007f5408ef71bf in _g_dbus_worker_do_read_unlocked (worker=<optimized out>) at gdbusprivate.c:828
#8  0x00007f5408ef8ec2 in _g_dbus_worker_do_read_cb (input_stream=<optimized out>, res=<optimized out>, user_data=0x7f53f000be00) at gdbusprivate.c:774
        message = 0x7f540dfc3450
        worker = 0x7f53f000be00
        error = 0x0
        bytes_read = 206
#9  0x00007f5408e9af93 in g_task_return_now (task=0x7f53f40032d0 [GTask]) at gtask.c:1106
#10 0x00007f5408e9afc9 in complete_in_idle_cb (task=0x7f53f40032d0) at gtask.c:1120
#11 0x00007f54088d7d7a in g_main_context_dispatch (context=0x7f53f000c560) at gmain.c:3152
        dispatch = 0x7f54088d4a00 <g_idle_dispatch>
        prev_source = 0x0
        was_in_call = 0
        user_data = 0x7f53f40032d0
        callback = 0x7f5408e9afc0 <complete_in_idle_cb>
        cb_funcs = 0x7f5408bc38c0 <g_source_callback_funcs>
        cb_data = 0x7f53f4002ee0
        need_destroy = <optimized out>
        source = 0x7f53f4003eb0
        current = 0x7f53f000c380
        i = 0
#12 0x00007f54088d7d7a in g_main_context_dispatch (context=context@entry=0x7f53f000c560) at gmain.c:3767
#13 0x00007f54088d80b8 in g_main_context_iterate (context=0x7f53f000c560, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3838
        max_priority = 0
        timeout = 0
        some_ready = 1
        nfds = <optimized out>
        allocated_nfds = 3
        fds = 0x7f53f40010c0
#14 0x00007f54088d838a in g_main_loop_run (loop=0x7f53f000e820) at gmain.c:4032
        __FUNCTION__ = "g_main_loop_run"
#15 0x00007f5408ef6c66 in gdbus_shared_thread_func (user_data=0x7f53f000c530) at gdbusprivate.c:246
        data = 0x7f53f000c530
#16 0x00007f54088fd0f5 in g_thread_proxy (data=0x7f540dde2ed0) at gthread.c:778
        thread = 0x7f540dde2ed0
#17 0x00007f540a1e3dc5 in start_thread (arg=0x7f53fe79b700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f53fe79b700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139998728402688, 5735798018442000178, 0, 139998728403392, 139998728402688, 139998485595440, -5675000504269146318, -5678562126826440910}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>

Comment 4 Thomas Haller 2017-05-16 15:02:06 UTC
*** Bug 1451306 has been marked as a duplicate of this bug. ***

Comment 5 Beniamino Galvani 2017-06-07 12:35:24 UTC
From analysis of http://faf.lab.eng.brq.redhat.com/faf/reports/2955/ :

 (gdb) frame 3
 #3  0x00007f9376833f32 in g_unix_fd_message_deserialize (level=<optimized out>,        type=<optimized out>, size=<optimized out>, data=0x7f9368f3b1f0) at gunixfdmessage.c:119

 (gdb) p fds[0]
 $x = 17

 (gdb) frame 6
 #6  0x00007f93767bef36 in g_socket_receive_message_with_timeout (socket=socket@entry=0x7f935c008130, address=address@entry=0x0,        vectors=<optimized out>, vectors@entry=0x7f9368f3bb60,
     num_vectors=<optimized out>, num_vectors@entry=1, messages=0x7f935c00e6b0, num_messages=0x7f935c00e6b8, flags=flags@entry=0x0, timeout=0, cancellable=cancellable@entry=0x7f935c0023a0,
     error=error@entry=0x7f9368f3bb58) at gsocket.c:4845

 (gdb) p *socket->priv
 $x = {family = G_SOCKET_FAMILY_UNIX, type = G_SOCKET_TYPE_STREAM, protocol = G_SOCKET_PROTOCOL_DEFAULT, fd = 8, listen_backlog = 10, timeout = 0, construct_error = 0x0, remote_address = 0x557854756ef0,
       inited = 1, blocking = 0, keepalive = 0, closed = 0, connected_read = 1, connected_write = 1,    listening = 0, timed_out = 0, connect_pending = 0, recv_addr_cache = {{addr = 0x0, native = 0x0,
       native_len = 0, last_used = 0}, {addr = 0x0, native = 0x0, native_len = 0, last_used = 0}, {addr = 0x0, native = 0x0, native_len = 0, last_used = 0}, {addr = 0x0, native = 0x0, native_len = 0,
       last_used        = 0}, {addr = 0x0, native = 0x0, native_len = 0, last_used = 0}, {addr = 0x0, native = 0x0, native_len = 0, last_used = 0}, {addr = 0x0, native = 0x0, native_len = 0, last_used = 0}, {
       addr = 0x0, native = 0x0,        native_len = 0, last_used = 0}}}

 (gdb) p *((GUnixSocketAddress *)socket->priv->remote_address)->priv
 $x = {path = "/var/run/dbus/system_bus_socket", '\000' <repeats 76 times>, path_len = 31, address_type = G_UNIX_SOCKET_ADDRESS_PATH}

 $ grep ^17: open_fds
 17:/proc/sys/net/ipv6/conf/eth6/forwarding

 (gdb) thread apply 4 bt
 Thread 4 (Thread 0x7f93782d1940 (LWP 19578)):
 #0  send () from /lib64/libc.so.6
 #1  __vsyslog_chk () from /lib64/libc.so.6
 #2  __syslog_chk () from /lib64/libc.so.6
 #3  syslog (__fmt=0x557852a24914 "%s%-7s [%ld.%04ld] %s", __pri=<optimized out>) at /usr/include/bits/syslog.h:31
 #4  _nm_log_impl (file=file@entry=0x557852a238f8 "src/platform/nm-linux-platform.c", line=line@entry=2643, func=func@entry=0x0, level=level@entry=LOGL_DEBUG, domain=domain@entry=LOGD_PLATFORM,
     error=error@entry=0, ifname=ifname@entry=0x0, conn_uuid=conn_uuid@entry=0x0, fmt=fmt@entry=0x557852a23c40 "%s: sysctl: setting '%s' to '%s' (current value is identical)") at src/nm-logging.c:725
 #5  _log_dbg_sysctl_set_impl (value=0x557852a46f85 "0", path=0x557852cb15c0 <path.32401> "/proc/sys/net/ipv6/conf/eth6/forwarding", dirfd=<optimized out>,
     pathid=0x557852cb15c0 <path.32401> "/proc/sys/net/ipv6/conf/eth6/forwarding", platform=0x557854733850) at src/platform/nm-linux-platform.c:2643
 #6  sysctl_set (platform=0x557854733850, pathid=0x557852cb15c0 <path.32401> "/proc/sys/net/ipv6/conf/eth6/forwarding", dirfd=<optimized out>,
     path=0x557852cb15c0 <path.32401> "/proc/sys/net/ipv6/conf/eth6/forwarding", value=0x557852a46f85 "0") at src/platform/nm-linux-platform.c:2712
 #7  _set_state_full (self=self@entry=0x55785484ecf0, state=state@entry=NM_DEVICE_STATE_DISCONNECTED, reason=<optimized out>, quitting=quitting@entry=0) at src/devices/nm-device.c:12572
 #8  nm_device_state_changed (self=self@entry=0x55785484ecf0, state=state@entry=NM_DEVICE_STATE_DISCONNECTED, reason=<optimized out>) at src/devices/nm-device.c:12772
 #9  queued_state_set (user_data=user_data@entry=0x55785484ecf0) at src/devices/nm-device.c:12798


The process receives fd 17 as ancillary data in a message from the
D-Bus system socket, but when it tries to set FD_CLOEXEC on it, the
file descriptor is invalid. At the same time another thread opens
"/proc/sys/net/ipv6/conf/eth6/forwarding" and obtains the same file
descriptor (17).

It seems to me that there is something wrong in the lower layers if we
receive an invalid fd over D-Bus.

Comment 6 Beniamino Galvani 2017-06-08 14:27:07 UTC
I think this is related to the systemd sleep monitor, which is the only place in NM where we call a D-Bus method returning a Unix fd. Indeed, few lines above the crash:

<debug> [1495478072.2589] sleep-monitor-sd: inhibit: taking sleep inhibitor...

Comment 7 Beniamino Galvani 2017-06-09 09:49:23 UTC
I'm reassigning this to systemd, not because I believe the bug is in
systemd, but only to have someone more familiar with the inhibitor API
look at the bug and try to identify which component (dbus-daemon,
kernel, glib, ...) is causing the problem.

What happens is that when NM calls the D-Bus "Inhibit" method of
logind, the returned fd is invalid and GLib throws an assertion just
after deserializing the message when it tries to set the FD_CLOEXEC on
the file descriptor. We have multiple reports (with core dumps) of the
same issue.

Feel free to reassign to other components or back to NM if you think
we may be doing something wrong. Thanks!

Comment 9 Michal Sekletar 2017-08-22 11:56:19 UTC
I have discussed this issue with David Herrmann (as I was suspecting this is due to some DBus bug) and David is working a lot with DBus these days.

However, as of now we think this is NM bug. Most likely close() was called on the fd in one thread, while other thread receives fd via unix socket and then tries to set O_CLOEXEC on fd that was closed by first thread,

Possible parallel execution, leading to bug,

Thread #1:
   Read gdbus
   get FD 17
   <pause 1>
   cloexec on 17
   <pause 2>
   continue

Thread #2:
  <during pause 1>
  close(17) /* for *whatever* reason */
  <during pause 2>
  open /proc/...  and get 17 assigned

Comment 10 Beniamino Galvani 2017-09-13 08:41:21 UTC
This hasn't happened anymore since CI machines started using RHEL 7.4, so I guess the bug was fixed implicitly, maybe when the glib version was updated.

I'll close this if the problem doesn't appear again in few weeks.

Comment 11 Beniamino Galvani 2017-09-18 12:58:46 UTC
*** Bug 1492660 has been marked as a duplicate of this bug. ***

Comment 12 Beniamino Galvani 2017-09-18 13:00:47 UTC
(In reply to Beniamino Galvani from comment #10)
> This hasn't happened anymore since CI machines started using RHEL 7.4, so I
> guess the bug was fixed implicitly, maybe when the glib version was updated.

I take it back, it happened again. See details in bug 1492660.

Comment 13 Beniamino Galvani 2017-09-19 14:10:08 UTC
*** Bug 1493140 has been marked as a duplicate of this bug. ***

Comment 14 Beniamino Galvani 2017-09-25 20:17:16 UTC
*** Bug 1495112 has been marked as a duplicate of this bug. ***

Comment 15 Vladimir Benes 2017-10-16 22:06:50 UTC
(In reply to Beniamino Galvani from comment #12)
> (In reply to Beniamino Galvani from comment #10)
> > This hasn't happened anymore since CI machines started using RHEL 7.4, so I
> > guess the bug was fixed implicitly, maybe when the glib version was updated.
> 
> I take it back, it happened again. See details in bug 1492660.

I can reproduce this crash with https://github.com/NetworkManager/NetworkManager-ci/blob/master/nmcli/features/bond.feature#L459 running in cycle. Sooner or latter there is a crash (in less than 20 cycles). It has to run with real devices. With veth interfaces it does happen very very rarely.

Could you please create a scratch build again so I can reverify?

Comment 16 Beniamino Galvani 2017-10-17 09:23:21 UTC
Created attachment 1339634 [details]
[PATCH] core: don't close input fd in nm_utils_fd_get_contents()

Comment 18 Beniamino Galvani 2017-10-17 09:35:43 UTC
There was a double close(); usually the second close() would simply fail:

 open("/proc/sys/net/ipv6/conf/eth4/accept_ra", O_RDONLY|O_CLOEXEC) = 15
 fcntl(15, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE)
 close(15)       = 0
 close(15)       = -1 EBADF (Bad file descriptor)

but sometimes it closed somebody else's file descriptor:

 open("/proc/sys/net/ipv6/conf/eth1/accept_ra_defrtr", O_WRONLY|O_TRUNC|O_CLOEXEC) = 16
 open("/proc/sys/net/ipv6/conf/eth1/accept_ra_defrtr", O_RDONLY|O_CLOEXEC) = 17
 fcntl(17, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE)
 close(17)       = 0
 # ... receive fd 17 via D-Bus ...
 close(17)       = 0
 # try to set close-on-exec on the new fd
 fcntl(17, F_SETFD, FD_CLOEXEC) = -1 EBADF (Bad file descriptor)

Comment 19 Beniamino Galvani 2017-10-17 09:50:54 UTC
Created attachment 1339638 [details]
[PATCH v2] core: don't close input fd in nm_utils_fd_get_contents()

Comment 20 Beniamino Galvani 2017-10-17 12:33:32 UTC
Created attachment 1339681 [details]
[PATCH v3] core: don't close input fd in nm_utils_fd_get_contents()

v3: don't leak fd on fdopen() error.

Comment 21 Thomas Haller 2017-10-18 12:07:07 UTC
(In reply to Beniamino Galvani from comment #20)
> Created attachment 1339681 [details]
> [PATCH v3] core: don't close input fd in nm_utils_fd_get_contents()
> 
> v3: don't leak fd on fdopen() error.

lgtm

Comment 22 Beniamino Galvani 2017-10-19 07:12:17 UTC
Applied to master:

https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=597072296a19be051ea70b017fd19505ebee6530

and backported to nm-1-8, nm-1-6 branches.

Comment 26 errata-xmlrpc 2018-04-10 13:22:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0778