Bug 1784707

Summary: [Regression] [Memory leak] Has unclosed fd
Product: Red Hat Enterprise Linux 8
Component: nmstate
Version: 8.2
Hardware: x86_64
OS: All
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Gris Ge <fge>
Assignee: Gris Ge <fge>
QA Contact: Mingyu Shi <mshi>
CC: atragler, bgalvani, ferferna, jiji, jishi, lrintel, network-qe, rkhan, sukulkar, thaller, till
Keywords: Regression
Target Milestone: rc
Target Release: 8.0
Flags: pm-rhel: mirror+
Fixed In Version: nmstate-0.2.2-3.el8
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2020-04-28 16:00:10 UTC
Bug Blocks: 1738136    
Attachments:
  nm_bug_unclosed_fd.py
  python example
  reproducer

Description Gris Ge 2019-12-18 06:38:55 UTC
Created attachment 1646016 [details]
nm_bug_unclosed_fd.py

Description of problem:

Got error:

(process:17095): GLib-ERROR **: 14:28:39.189: Creating pipes for GWakeup: Too many open files

when running `libnmstate.show()` 1000 times.
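
The attached script is authoritative; a minimal sketch of what it does looks roughly like this:

  #!/usr/bin/python3
  # Rough approximation of nm_bug_unclosed_fd.py: every libnmstate.show()
  # spins up and tears down an NMClient; leaked fds pile up until GLib aborts.
  import libnmstate

  for i in range(1000):
      print("iteration", i)
      libnmstate.show()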

Version-Release number of selected component (if applicable):
NetworkManager-1.22.0-0.2.el8.x86_64
nmstate-0.2.0-2.el8

How reproducible:
100%

Steps to Reproduce:
1. ./nm_bug_unclosed_fd.py

Actual results:

Crash due to too many open files.

Expected results:
No crash.

Additional info:

The script is much faster when using NM 1.20 than with NM 1.22.

Comment 1 Gris Ge 2019-12-18 06:49:25 UTC
Got the same problem on NetworkManager-1.22.0-2.el8.x86_64 at iteration 509.

Comment 2 Thomas Haller 2019-12-18 09:08:10 UTC
This is expected behavior and a bug in nmstate. Here is why:

When you create an NMClient instance (that is, after calling either g_initable_init() or g_async_initable_init_async()), NMClient will register signals on D-Bus. It does so using GLib's GDBusConnection.

When somebody (e.g. NMClient) makes a D-Bus call with GDBusConnection (g_dbus_connection_call()) or subscribes a signal (g_dbus_connection_signal_subscribe()), then GDBusConnection remembers and references the caller's g_main_context_get_thread_default(). Later, the call's reply or the D-Bus signal will be reported to the caller by enqueueing the callback in that GMainContext. In other words: as long as there are pending requests/subscriptions, the GMainContext is kept alive by the GDBusConnection. Also: as long as NMClient is alive, it keeps D-Bus signals registered.
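
An illustrative sketch of that mechanism in Python (plain Gio calls, assumed here for illustration; none of this is from the report): the GMainContext that is thread-default at call time is the one that later dispatches the reply, and GDBusConnection keeps it referenced until then.

  from gi.repository import Gio, GLib

  ctx = GLib.MainContext.new()
  ctx.push_thread_default()        # replies for calls made now go to ctx
  bus = Gio.bus_get_sync(Gio.BusType.SYSTEM, None)
  done = []

  def on_reply(source, res, *args):
      source.call_finish(res)
      done.append(True)

  bus.call("org.freedesktop.DBus", "/org/freedesktop/DBus",
           "org.freedesktop.DBus", "ListNames",
           None, None, Gio.DBusCallFlags.NONE, -1, None, on_reply)
  ctx.pop_thread_default()

  # Until on_reply has run, GDBusConnection holds a reference on ctx:
  while not done:
      ctx.iteration(True)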

Also, since 1.22, NMClient is strongly associated with the g_main_context_get_thread_default() instance at the time when you create it. Previously, this was all a mess, and you could not effectively use NMClient with any main context except the g_main_context_get_default() instance [1]. Now this works.

[1] https://stackoverflow.com/questions/55937592/glib-processing-custom-gmaincontext-loop-on-another-thread-not-raising-signal

Later, when you destroy the NMClient, it will unregister the D-Bus signals and cancel the pending requests. Note that, like most asynchronous APIs in GLib, cancelling g_dbus_connection_call() will not immediately cancel everything. Instead, it will still invoke the callback (with a cancelled reason)! That means that although everything is cancelled, the GMainContext still stays alive to process the cancellation notification (and this is after the NMClient is already gone).
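
A sketch of that cancellation behavior (again plain Gio, assumed for illustration): even after cancelling, the callback is still dispatched, so the context must be iterated once more.

  from gi.repository import Gio, GLib

  bus = Gio.bus_get_sync(Gio.BusType.SYSTEM, None)
  cancellable = Gio.Cancellable()
  done = []

  def on_reply(source, res, *args):
      try:
          source.call_finish(res)
      except GLib.Error:
          pass                     # G_IO_ERROR_CANCELLED if the cancel won
      done.append(True)            # either way, the callback still fires

  bus.call("org.freedesktop.DBus", "/org/freedesktop/DBus",
           "org.freedesktop.DBus", "ListNames",
           None, None, Gio.DBusCallFlags.NONE, -1, cancellable, on_reply)
  cancellable.cancel()

  ctx = GLib.MainContext.default()
  while not done:                  # must keep iterating to drain the queue
      ctx.iteration(True)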

That means that after you destroy the NMClient, the GMainContext is still kept alive. You need to iterate the context a bit more to clean up all the data. How long exactly? Until the nm_client_get_context_busy_watcher() object gets destroyed. You can take that object, register a weak pointer on it, and iterate the context as long as the object is alive.
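
In Python, that pattern might look roughly like this (a sketch assuming PyGObject bindings; the attached example in comment 3 is the authoritative version):

  import gc
  import gi
  gi.require_version("NM", "1.0")
  from gi.repository import NM, GLib

  client = NM.Client.new(None)             # sync init
  # ... use the client ...

  gone = []
  watcher = client.get_context_busy_watcher()
  watcher.weak_ref(lambda *args: gone.append(True))
  ctx = GLib.MainContext.default()

  del watcher, client
  gc.collect()                             # make Python drop the wrappers
  while not gone:                          # iterate until fully cleaned up
      ctx.iteration(True)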



Now, there is more: as said, NMClient is associated with the g_main_context_get_thread_default() at the time of constructing it. However, when you use the synchronous initialization (nm_client_new(), g_initable_new()), NMClient will create an internal GMainContext. That is because the sync init must not iterate the caller's context (otherwise it wouldn't be "sync"). However, when using GDBusConnection, NMClient needs some GMainContext to iterate. Hence, it creates an internal one and integrates that internal context with the caller's. See [2]

[2]  https://gitlab.freedesktop.org/NetworkManager/NetworkManager/blob/9aa00a8a14cb64f851234d025498ff2f9ec18e94/libnm/nm-client.c#L7221
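
For contrast, a hedged sketch of the two init paths from Python (PyGObject assumed):

  import gi
  gi.require_version("NM", "1.0")
  from gi.repository import NM, GLib

  # Sync init: libnm creates an internal GMainContext behind the scenes
  # and integrates it with the caller's, so "sync" never iterates yours.
  client = NM.Client.new(None)

  # Async init: NMClient uses the caller's thread-default context directly.
  loop = GLib.MainLoop()

  def on_ready(source, res, *args):
      client2 = NM.Client.new_finish(res)  # fully initialized client
      loop.quit()

  NM.Client.new_async(None, on_ready)
  loop.run()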

Point being: if you don't iterate the context after destroying the NMClient, you leak something. If you used the sync init, you also leak a GMainContext, which has a file descriptor and is thus more expensive to leak than mere memory.


Maybe you should not use the sync init; it has the additional overhead of the internal context. However, as long as you don't iterate your main context, you leak just the same.

What is new in libnm 1.22 is the internal GMainContext with the sync init. This makes the leak much more expensive than it used to be.



TL;DR: fix your code and iterate the context accordingly. Possibly your example is just flawed; it depends on what libnmstate.show() is supposed to do. Commonly, a library is not supposed to iterate a context owned by the caller, so depending on how libnmstate.show() is supposed to work, it may well be a bug in the caller never to iterate the context. That depends...

Comment 3 Thomas Haller 2019-12-18 09:10:30 UTC
Created attachment 1646053 [details]
python example

The example shows the issue.

If you adjust `num_min` to > 1024, you will exhaust the available file descriptors. Also check /proc/$PID/fd to see the file descriptors in use.
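
For instance, a generic snippet (not part of the attachment) to watch the fd count from inside the process:

  import os
  print("open fds:", len(os.listdir("/proc/self/fd")))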

The example also shows the nm_client_get_context_busy_watcher() object and that iterating the context avoids the issue.

Comment 4 Gris Ge 2019-12-18 13:02:59 UTC
Changed to nmstate bug.

Comment 5 Thomas Haller 2019-12-18 15:24:44 UTC
Created attachment 1646117 [details]
reproducer

Comment 6 Thomas Haller 2019-12-18 15:27:57 UTC
Comment on attachment 1646117 [details]
reproducer

applies on top of https://github.com/nmstate/nmstate/commit/b306c3d3a8987b04b70ed3b701778a7084dc8642

Comment 7 Thomas Haller 2019-12-18 15:31:24 UTC
btw, creating a new NMClient instance all the time is very wasteful and unnecessary: it requires fetching all the information again.

Of course, for testing that might make sense. E.g. if you run multiple tests in the same process, you may want to reset all internal state so that a bug in one test does not affect the others. But otherwise, calling "nm.nmclient.client(refresh=True)" in normal operation is bad.
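
A sketch of the alternative (the caching helper below is illustrative, not nmstate's actual API): create the NMClient once and reuse it.

  import gi
  gi.require_version("NM", "1.0")
  from gi.repository import NM

  _CLIENT = None    # hypothetical module-level cache

  def get_client():
      global _CLIENT
      if _CLIENT is None:
          _CLIENT = NM.Client.new(None)   # fetch the D-Bus state once
      return _CLIENT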

Comment 11 errata-xmlrpc 2020-04-28 16:00:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1696