Bug 1784707
| Summary: | [Regression] [Memory leak] Has unclosed fd | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Gris Ge <fge> |
| Component: | nmstate | Assignee: | Gris Ge <fge> |
| Status: | CLOSED ERRATA | QA Contact: | Mingyu Shi <mshi> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 8.2 | CC: | atragler, bgalvani, ferferna, jiji, jishi, lrintel, network-qe, rkhan, sukulkar, thaller, till |
| Target Milestone: | rc | Keywords: | Regression |
| Target Release: | 8.0 | Flags: | pm-rhel: mirror+ |
| Hardware: | x86_64 | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | nmstate-0.2.2-3.el8 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-04-28 16:00:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1738136 | | |
| Attachments: | | | |
Got the same problem on NetworkManager-1.22.0-2.el8.x86_64 at loop iteration 509.

This is expected and a bug in nmstate. Here is why:

When you create an NMClient instance (that is, after calling either g_initable_init() or g_async_initable_init()), NMClient registers signals on D-Bus. It does so using glib's GDBusConnection.

When somebody (e.g. NMClient) makes a D-Bus call with GDBusConnection (g_dbus_connection_call()) or subscribes to a signal (g_dbus_connection_signal_subscribe()), GDBusConnection remembers and references the caller's g_main_context_get_thread_default(). Later, the call's reply or the D-Bus signal will be reported to the caller by enqueueing the callback in that GMainContext. In other words: as long as there are pending requests/subscriptions, the GMainContext is kept alive by the GDBusConnection. Also: as long as NMClient is alive, it keeps D-Bus signals registered.

Also, since 1.22, NMClient is strongly associated with the g_main_context_get_thread_default() instance at the time when you create it. Previously, this was all a mess, and you could not effectively use NMClient with any main context except the g_main_context_get_default() instance [1]. Now this works.

[1] https://stackoverflow.com/questions/55937592/glib-processing-custom-gmaincontext-loop-on-another-thread-not-raising-signal

Later, when you destroy the NMClient, it unregisters the D-Bus signals and cancels the pending requests. Note that, like most asynchronous API in glib, cancelling g_dbus_connection_call() does not immediately cancel everything. Instead, it still invokes the callback (with a cancelled reason)! That means that although everything is cancelled, the GMainContext still stays alive to process the cancelled notifications (and this is after NMClient is already gone).

So, after you destroy the NMClient, the GMainContext is still kept alive. You need to iterate the context a bit more to clean up all the data. How long exactly? Until the nm_client_get_context_busy_watcher() object gets destroyed. You can take that object, register a weak pointer on it, and iterate the context for as long as the object is alive.

Now, there is more: as said, NMClient is associated with the g_main_context_get_thread_default() at the time of constructing it. However, when you use the synchronous initialization (nm_client_new(), g_initable_new()), NMClient creates an internal GMainContext. That is because the sync init must not iterate the caller's context (otherwise it wouldn't be "sync"). However, when using GDBusConnection, NMClient needs some GMainContext to iterate. Hence, it creates an internal one and integrates that internal context with the caller's. See [2].

[2] https://gitlab.freedesktop.org/NetworkManager/NetworkManager/blob/9aa00a8a14cb64f851234d025498ff2f9ec18e94/libnm/nm-client.c#L7221

Point being: if you don't iterate the context after destroying the NMClient, you leak something. If you used the sync init, you also leak a GMainContext, which holds a file descriptor and is thus more expensive to leak than mere memory.

Maybe you should not use the sync init; it has the additional overhead of this internal context. However, as long as you don't iterate your main context, you leak stuff just the same. What is new in libnm 1.22 is the internal GMainContext with the sync init. This makes the leak much more expensive than it used to be.

TL;DR: fix your code, and iterate the context accordingly.
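A minimal sketch of this cleanup, assuming PyGObject (gi) bindings; this is not the attached reproducer. It drops the client, takes a weak reference on the nm_client_get_context_busy_watcher() object, and iterates the client's context until that watcher is finalized:

```python
# Sketch, assuming PyGObject: drain NMClient's context after destroying it.
import gi
gi.require_version("NM", "1.0")
from gi.repository import NM

def show_once():
    client = NM.Client.new(None)  # sync init: libnm creates an internal GMainContext

    # ... read whatever state is needed from `client` here ...

    context = client.get_main_context()
    watcher = client.get_context_busy_watcher()

    finalized = []
    weak = watcher.weak_ref(lambda: finalized.append(True))  # fires on destruction
    del watcher
    del client  # unregisters D-Bus signals, cancels pending calls

    # The cancelled callbacks are still dispatched in `context`; keep `weak`
    # alive and iterate until the busy watcher is finalized.
    while not finalized:
        context.iteration(True)
```

The same drain loop applies with the async init, except that the context to iterate is the caller's thread-default one rather than an internal one.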
Possibly your example is just flawed: it depends on what libnmstate.show() is supposed to do. Commonly, a library is not supposed to iterate a context owned by the caller. Depending on how libnmstate.show() is supposed to work, it may well be a bug of the caller never to iterate the context. That depends...

Created attachment 1646053 [details]
python example
The example shows the issue.
If you adjust `num_min` to > 1024, you will exhaust the allowed number of file descriptors. Also check /proc/$PID/fd to see the file descriptors in use.
The example also shows the context_busy_watcher() and that iterating the context avoids the issue.
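For illustration, here is a rough approximation of what the attached example demonstrates (the attachment itself is authoritative; the loop below and the use of `num_min` are assumptions): create and drop sync-initialized clients without ever iterating their context, and watch the fd count grow.

```python
# Rough sketch, assuming PyGObject; approximates the attached example.
import os
import gi
gi.require_version("NM", "1.0")
from gi.repository import NM

def open_fd_count():
    # Same information as inspecting /proc/$PID/fd by hand.
    return len(os.listdir(f"/proc/{os.getpid()}/fd"))

num_min = 100  # raise above 1024 to actually exhaust the default fd limit

for i in range(num_min):
    client = NM.Client.new(None)  # sync init: internal GMainContext + fd
    del client                    # context never iterated: the fd leaks
    print(i, open_fd_count())
```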
Changed to nmstate bug.

Created attachment 1646117 [details]
reproducer
Comment on attachment 1646117 [details] (reproducer): applies on top of https://github.com/nmstate/nmstate/commit/b306c3d3a8987b04b70ed3b701778a7084dc8642

Btw, creating a new NMClient instance all the time is very wasteful and unnecessary: it requires fetching all the information again. Of course, for testing that might make sense; e.g. if you run multiple tests in the same process, you may want to reset all internal state so that a bug in one test does not affect the others. But otherwise, calling "nm.nmclient.client(refresh=True)" in normal operation is bad.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1696
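A hypothetical sketch of the reuse pattern suggested in the comment above (the function name and caching scheme are illustrative, not nmstate's actual API): cache one NMClient and only rebuild it when a refresh is explicitly requested.

```python
# Hypothetical sketch: cache a single NMClient instead of creating one per
# call; the cached client keeps its D-Bus state, so nothing is refetched.
import gi
gi.require_version("NM", "1.0")
from gi.repository import NM

_cached_client = None

def get_client(refresh=False):
    """Return a shared NMClient; rebuild it only on explicit refresh."""
    global _cached_client
    if refresh and _cached_client is not None:
        # The old client must still be drained as sketched earlier,
        # or its internal GMainContext (and fd) leaks.
        _cached_client = None
    if _cached_client is None:
        _cached_client = NM.Client.new(None)
    return _cached_client
```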
Created attachment 1646016 [details]
nm_bug_unclosed_fd.py

Description of problem:
Got error:

    (process:17095): GLib-ERROR **: 14:28:39.189: Creating pipes for GWakeup: Too many open files

when running `libnmstate.show()` 1000 times.

Version-Release number of selected component (if applicable):
NetworkManager-1.22.0-0.2.el8.x86_64
nmstate-0.2.0-2.el8

How reproducible:
100%

Steps to Reproduce:
1. ./nm_bug_unclosed_fd.py

Actual results:
Crash as too many open files.

Expected results:
No crash.

Additional info:
With NM 1.22, the script is much faster than when using NM 1.20.