Bug 1784707 - [Regression] [Memory leak] Has unclosed fd
Summary: [Regression] [Memory leak] Has unclosed fd
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: nmstate
Version: 8.2
Hardware: x86_64
OS: All
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 8.0
Assignee: Gris Ge
QA Contact: Mingyu Shi
URL:
Whiteboard:
Depends On:
Blocks: nmstate-nm
 
Reported: 2019-12-18 06:38 UTC by Gris Ge
Modified: 2022-03-08 09:48 UTC
CC List: 11 users

Fixed In Version: nmstate-0.2.2-3.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-28 16:00:10 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
nm_bug_unclosed_fd.py (99 bytes, text/x-python), 2019-12-18 06:38 UTC, Gris Ge
python example (805 bytes, application/x-shellscript), 2019-12-18 09:10 UTC, Thomas Haller
reproducer (9.49 KB, application/text), 2019-12-18 15:24 UTC, Thomas Haller


Links
Red Hat Issue Tracker RHELPLAN-33235 (last updated 2022-03-08 09:48:27 UTC)
Red Hat Product Errata RHBA-2020:1696 (last updated 2020-04-28 16:00:26 UTC)

Description Gris Ge 2019-12-18 06:38:55 UTC
Created attachment 1646016 [details]
nm_bug_unclosed_fd.py

Description of problem:

Got error:

(process:17095): GLib-ERROR **: 14:28:39.189: Creating pipes for GWakeup: Too many open files

when running `libnmstate.show()` 1000 times.
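
The attached 99-byte script is not reproduced on this page; a rough sketch of such a loop, using the public libnmstate API (a guess at what the attachment does, not a copy of it), would look like:

#!/usr/bin/python3
# Hypothetical reconstruction of the reproducer: call libnmstate.show()
# in a loop until the process runs out of file descriptors.
import libnmstate

for i in range(1000):
    libnmstate.show()
    print(i)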

Version-Release number of selected component (if applicable):
NetworkManager-1.22.0-0.2.el8.x86_64
nmstate-0.2.0-2.el8

How reproducible:
100%

Steps to Reproduce:
1. ./nm_bug_unclosed_fd.py

Actual results:

Crash due to too many open files.

Expected results:
No crash.

Additional info:

With NM 1.22, the script is much faster than when using NM 1.20.

Comment 1 Gris Ge 2019-12-18 06:49:25 UTC
Got the same problem on NetworkManager-1.22.0-2.el8.x86_64 at the loop 509.

Comment 2 Thomas Haller 2019-12-18 09:08:10 UTC
This is expected, and it is a bug in nmstate. Here is why:

When you create an NMClient instance (that is, after calling either g_initable_init() or g_async_initable_init()), NMClient will register signals on D-Bus. It does so using glib's GDBusConnection.

When somebody (e.g. NMClient) makes a D-Bus call with GDBusConnection (g_dbus_connection_call()) or subscribes to a signal (g_dbus_connection_signal_subscribe()), then GDBusConnection remembers and references the caller's g_main_context_get_thread_default(). Later, the call's reply or the D-Bus signal will be reported to the caller by enqueueing the callback in that GMainContext. In other words: as long as there are pending requests/subscriptions, the GMainContext is kept alive by the GDBusConnection. Also: as long as NMClient is alive, it keeps D-Bus signals registered.

Also, since 1.22, NMClient is strongly associated with the g_main_context_get_thread_default() instance at the time when you create it. Previously, this was all a mess, and you could not effectively use NMClient with any main context except the g_main_context_get_default() instance [1]. Now this works.

[1] https://stackoverflow.com/questions/55937592/glib-processing-custom-gmaincontext-loop-on-another-thread-not-raising-signal

Later, when you destroy the NMClient, NMClient will unregister the D-Bus signals and cancel the pending requests. Note that, like most asynchronous APIs in glib, cancelling g_dbus_connection_call() will not immediately cancel everything. Instead, it will still invoke the callback (with a cancelled reason)! That means that although everything is cancelled, the GMainContext still stays alive to process the cancellation notifications (and this is after NMClient is already gone).

That means that after you destroy the NMClient, the GMainContext is still kept alive. You need to iterate the context a bit more to clean up all the data. How long exactly? Until the nm_client_get_context_busy_watcher() object gets destroyed. You can take that object, register a weak pointer to it, and iterate the context for as long as the object is alive.
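
A rough sketch of that pattern in Python (not part of this bug report; it assumes the PyGObject introspection bindings for libnm >= 1.22, whose NM.Client.new(), get_context_busy_watcher() and get_main_context() correspond to the C API mentioned here):

import gi
gi.require_version("NM", "1.0")
from gi.repository import NM

def query_devices():
    client = NM.Client.new(None)   # sync init: libnm creates an internal GMainContext
    ifaces = [d.get_iface() for d in client.get_devices()]

    # After dropping the client, keep iterating its context until the
    # context-busy-watcher object is finalized, i.e. until all pending
    # (cancelled) D-Bus callbacks have been drained.
    busy = [True]
    watcher = client.get_context_busy_watcher()
    # Keep the returned weak-ref wrapper referenced, otherwise the
    # notification is removed again before it can fire.
    watcher_ref = watcher.weak_ref(lambda *_: busy.__setitem__(0, False))
    context = client.get_main_context()
    del client, watcher
    while busy[0]:
        context.iteration(True)

    return ifaces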



Now, there is more: as said, NMClient is associated with the g_main_context_get_thread_default() at the time of constructing it. However, when you use the synchronous initialization (nm_client_new(), g_initable_new()), then NMClient will create an internal GMainContext. That is because the sync init must not iterate the caller's context (otherwise it wouldn't be "sync"). However, when using GDBusConnection, NMClient needs some GMainContext to iterate. Hence, it creates an internal one and integrates that internal context with the caller's. See [2]

[2]  https://gitlab.freedesktop.org/NetworkManager/NetworkManager/blob/9aa00a8a14cb64f851234d025498ff2f9ec18e94/libnm/nm-client.c#L7221

Point being: if you don't iterate the context after destroying the NMClient, you leak something. If you used the sync init, you also leak a GMainContext, which has a file descriptor and is thus more expensive to leak than mere memory.


Maybe you should not use the sync init; it has the additional overhead of this internal context. However, as long as you don't iterate your main context, you leak stuff just the same.

What is new in libnm 1.22 is the internal GMainContext with the sync init. This makes the leak much more expensive than it used to be.
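
For comparison, a hedged sketch of the async init path (again via the PyGObject bindings, not from this bug report): the client is bound to the caller's thread-default context and no internal context is created.

import gi
gi.require_version("NM", "1.0")
from gi.repository import NM, GLib

loop = GLib.MainLoop()

def on_client_ready(_source, result, *_):
    client = NM.Client.new_finish(result)   # async init: no internal GMainContext
    print([d.get_iface() for d in client.get_devices()])
    # Dropping the client still requires iterating this context until the
    # busy watcher is gone, as described above.
    loop.quit()

NM.Client.new_async(None, on_client_ready)
loop.run()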



TL;DR: fix your code, and iterate the context accordingly. Possibly your example is just flawed: it depends on what libnmstate.show() is supposed to do. Commonly a library is not supposed to iterate the context owned by the caller. Depending on how libnmstate.show() is supposed to work, it may well be a bug of the caller not to ever iterate the context. That depends...

Comment 3 Thomas Haller 2019-12-18 09:10:30 UTC
Created attachment 1646053 [details]
python example

The example shows the issue.

If you adjust `num_min` to > 1024, you will exhaust the number of file descriptors. Check also /proc/$PID/fd to see the file descriptors used.

The example also shows the context_busy_watcher() and that iterating the context avoids the issue.
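
A trivial way to watch that number from inside such a script (assuming Linux procfs; this helper is not part of the attachment):

import os

def open_fd_count():
    # Number of file descriptors currently open in this process.
    return len(os.listdir("/proc/self/fd"))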

Comment 4 Gris Ge 2019-12-18 13:02:59 UTC
Changed to nmstate bug.

Comment 5 Thomas Haller 2019-12-18 15:24:44 UTC
Created attachment 1646117 [details]
reproducer

Comment 6 Thomas Haller 2019-12-18 15:27:57 UTC
Comment on attachment 1646117 [details]
reproducer

applies on top of https://github.com/nmstate/nmstate/commit/b306c3d3a8987b04b70ed3b701778a7084dc8642

Comment 7 Thomas Haller 2019-12-18 15:31:24 UTC
btw, creating a new NMClient instance all the time is very wasteful and unnecessary. It requires fetching all the information again.

Of course, for testing that might make sense. E.g. if you run multiple tests in the same process, you may want to reset all internal state so that a bug in one test does not affect the others. But otherwise, calling "nm.nmclient.client(refresh=True)" in normal operation is bad.
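
In other words, something along these lines (a hedged sketch; the caching helper and its name are made up, not nmstate's actual code):

import gi
gi.require_version("NM", "1.0")
from gi.repository import NM

_CLIENT = None

def shared_client():
    # Create the NM.Client once and reuse it, instead of re-fetching the
    # whole NetworkManager state on every call.
    global _CLIENT
    if _CLIENT is None:
        _CLIENT = NM.Client.new(None)
    return _CLIENT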

Comment 11 errata-xmlrpc 2020-04-28 16:00:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1696

