This bug has been migrated to another issue tracking site. It has been closed here and may no longer be being monitored.

If you would like to get updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
Bug 2063216 - cleanup/prune stale data in /var/lib/NetworkManager (e.g. DHCP lease files)
Summary: cleanup/prune stale data in /var/lib/NetworkManager (e.g. DHCP lease files)
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: NetworkManager
Version: 8.6
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: NetworkManager Development Team
QA Contact: Desktop QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-03-11 14:22 UTC by Thomas Haller
Modified: 2023-08-16 18:33 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-08-16 18:33:32 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links
System                 ID               Private  Priority  Status    Summary  Last Updated
Red Hat Issue Tracker  NMT-70           0        None      None      None     2023-01-22 14:11:29 UTC
Red Hat Issue Tracker  RHEL-1386        0        None      Migrated  None     2023-08-23 14:45:00 UTC
Red Hat Issue Tracker  RHELPLAN-115353  0        None      None      None     2022-03-11 14:36:03 UTC

Description Thomas Haller 2022-03-11 14:22:38 UTC
/var/lib/NetworkManager contains data that never gets garbage collected. That is a problem in particular when lots of new profiles/devices get created, because we then leak DHCP lease files and similar state.



O) in the past, /var/lib/NetworkManager/seen-bssids and /var/lib/NetworkManager/bssids had similar problems. This was solved by [1]. Maybe take inspiration from that. But note that there the leaking happened inside one file. This bug is mostly about leaking whole files.

[1] https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/26f38b9ffafeab145150332f8a11eb94559c31bf



I) the lease files "/var/lib/NetworkManager/*.lease". We write these files per-connection and device, so they can pile up. We need to regularly clean files that:

  1) refer to a connection UUID which no longer exists

  2) it's harder to clean them up for unused devices. But maybe not too hard, and we can differentiate between the cases based on the `connection.type` of the profile:

    2a) if the profile specifies "connection.interface-name", then we can delete all leases that are for different interface-names. That's easy and correct.

    2b) otherwise, if the lease is for a hardware device (ethernet), then we can delete it if the interface currently does not exist. That is problematic, because if the user unplugs a USB device, we would delete its lease. Rule 4) will mitigate that problem.

    2c) otherwise, see the time-based cleanup below (3).

  3) leases have a timestamp for when we received them. Well, actually we no longer store the timestamp in the lease itself, but we could use the file timestamp for that. We also know the lease duration, although, again, we don't store that in the "intern*.lease" files. Anyway, it means that maybe we could also prune files that are older than a certain time. This would require the system clock to be accurate, which may not be the case. So only do this step if we have *too* many files (2000?) that are still in use.

  4) together with the previous rules, we can also keep an excess of e.g. 100 extra files. The goal is to not have an unlimited amount of stale files, but we are fine with keeping a few (100). So, from all the candidates above that we would want to delete, sort them by the file timestamp, and keep the 100 most recently used. This should cover the case where you just "temporarily" deleted a profile or unplugged a device: when you replug it, the file is still there... unless you have too many such files.

This means the timestamp of the lease files is important. Maybe make sure we touch them when we activate/deactivate a profile.
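
Rules 1), 2a) and 4) above can be sketched as follows. This is only an illustration in Python, not NetworkManager code. It assumes lease files are named "<plugin>[6]-<uuid>-<ifname>.lease" (an assumption about the naming scheme), and it assumes the intent of rule 4) is to retain the candidates with the newest file timestamps. The function names and parameters are hypothetical.

```python
import re

# ASSUMPTION: lease files are named "<plugin>[6]-<uuid>-<ifname>.lease".
LEASE_RE = re.compile(
    r"^(?:internal|dhclient)6?-"
    r"(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})-"
    r"(?P<ifname>.+)\.lease$"
)


def stale_lease_candidates(files, known_uuids, pinned_ifname):
    """Return the (filename, mtime) pairs that rules 1) and 2a) would delete.

    files:         iterable of (filename, mtime) pairs
    known_uuids:   set of connection UUIDs that still exist
    pinned_ifname: dict mapping UUID to "connection.interface-name",
                   only for profiles that set it
    """
    candidates = []
    for name, mtime in files:
        m = LEASE_RE.match(name)
        if m is None:
            continue  # not a lease file, leave it alone
        uuid, ifname = m["uuid"], m["ifname"]
        if uuid not in known_uuids:
            # Rule 1): the connection no longer exists.
            candidates.append((name, mtime))
        elif pinned_ifname.get(uuid) not in (None, ifname):
            # Rule 2a): the profile is pinned to a different interface.
            candidates.append((name, mtime))
    return candidates


def leases_to_delete(candidates, keep_extra=100):
    """Rule 4): spare the keep_extra most recently touched candidates."""
    newest_first = sorted(candidates, key=lambda f: f[1], reverse=True)
    return [name for name, _ in newest_first[keep_extra:]]
```

Rules 2b) and 3) are left out on purpose, since they depend on the current device state and on a trustworthy clock.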



Some of the suggested approaches seem unsafe, and we might delete leases that are still useful. Note that with the internal DHCP client, we use the lease file for almost nothing except the "requested-address" (see [2]). For the dhclient plugin that might be different, but in general, it seems the lease file is not very important to us, and we might be able to delete them more aggressively.

[2] https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/blob/ae0cc9618c49bb74bbe54a073dc337e9a3b0005b/src/core/dhcp/nm-dhcp-nettools.c#L1102




II) we usually write files with `g_file_set_contents()` or `nm_utils_file_set_contents()`. This creates a temporary file (like `mktemp ".XXXXXX"`) and does an atomic replace. If we crash in the middle, we might leak such temporary files. We should regularly clean such files, based on the filename. For example, check whether files matching the following patterns are created using the mktemp approach and have this problem:

   - internal-*.lease.XXXXXX
   - dhclient-*.lease.XXXXXX
   - dnsmasq-*.leases.XXXXXX
   - NetworkManager.state.XXXXXX
   - timestamps.XXXXXX
   - seen-bssids.XXXXXX
   - NetworkManager-intern.conf.XXXXXX
   - no-auto-default.state.XXXXXX
   - ** maybe others? **
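
A filename-based check for such leftovers could look roughly like this. It is a sketch in Python rather than NetworkManager code. It assumes the random suffix is exactly six ASCII letters or digits (the real alphabet may be narrower), and the function name is hypothetical. The base patterns are the ones listed above.

```python
import fnmatch
import re

# Base names from the list above, without the ".XXXXXX" suffix.
BASE_PATTERNS = (
    "internal-*.lease",
    "dhclient-*.lease",
    "dnsmasq-*.leases",
    "NetworkManager.state",
    "timestamps",
    "seen-bssids",
    "NetworkManager-intern.conf",
    "no-auto-default.state",
)

# ASSUMPTION: the mkstemp-style suffix is six ASCII alphanumerics.
SUFFIX_RE = re.compile(r"^[A-Za-z0-9]{6}$")


def is_leaked_temp_file(filename):
    """True if filename looks like "<known base name>.XXXXXX"."""
    base, dot, suffix = filename.rpartition(".")
    if not dot or not SUFFIX_RE.match(suffix):
        return False
    # Only match bases we know about, so unrelated files are never touched.
    return any(fnmatch.fnmatchcase(base, p) for p in BASE_PATTERNS)
```

Requiring a known base name means that a regular file such as "dnsmasq-eth0.leases" is not mistaken for a temporary file, even though "leases" happens to be six characters long.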




III) do something about the /var/lib/NetworkManager/dnsmasq-*.leases files. This probably works similarly to the DHCP lease files.



IV) check whether /var/lib/NetworkManager/no-auto-default.state has a problem. I think it doesn't.


---

What should happen is that we have a function prune_var_lib(), which iterates over all files in the directory, classifies them based on the well-known filenames, and checks whether they should be cleaned.

Then, ensure that we call this function at least once (e.g. during startup), and again after every X writes of files that can go stale (so we need to count the writes, and trigger a prune periodically).
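
The overall shape could be something like the following. Again this is a rough Python sketch and not the actual implementation. The class name, the `classifiers` callbacks and the `prune_every` threshold (the "X" above) are all hypothetical.

```python
import os


class VarLibPruner:
    """Prune a state directory at startup and after every N writes."""

    def __init__(self, state_dir, classifiers, prune_every=100):
        self.state_dir = state_dir
        # Each classifier takes a filename and returns True if the
        # file should be removed (hypothetical callback interface).
        self.classifiers = classifiers
        self.prune_every = prune_every
        self.writes_since_prune = 0

    def prune(self):
        """Iterate over the directory once and remove what is stale."""
        with os.scandir(self.state_dir) as it:
            names = [e.name for e in it if e.is_file(follow_symlinks=False)]
        removed = []
        for name in names:
            if any(should_remove(name) for should_remove in self.classifiers):
                os.unlink(os.path.join(self.state_dir, name))
                removed.append(name)
        self.writes_since_prune = 0
        return sorted(removed)

    def note_write(self):
        """Call after writing a state file; prunes once the count is reached."""
        self.writes_since_prune += 1
        if self.writes_since_prune >= self.prune_every:
            return self.prune()
        return []
```

The caller would invoke prune() once during startup and note_write() after each write to the directory. Filenames that no classifier recognizes are never removed.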

Comment 1 Thomas Haller 2022-03-22 11:27:08 UTC
see also nm_utils_find_mkstemp_files() in the sources.

Comment 3 RHEL Program Management 2023-08-16 18:28:21 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

