Bug 2063216

Summary: cleanup/prune stale data in /var/lib/NetworkManager (e.g. DHCP lease files)
Product: Red Hat Enterprise Linux 8 Reporter: Thomas Haller <thaller>
Component: NetworkManagerAssignee: NetworkManager Development Team <nm-team>
Status: CLOSED MIGRATED QA Contact: Desktop QE <desktop-qa-list>
Severity: unspecified Docs Contact:
Priority: low    
Version: 8.6CC: bgalvani, lrintel, rkhan, sfaye, sukulkar, till
Target Milestone: rcKeywords: MigratedToJIRA, Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-08-16 18:33:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Thomas Haller 2022-03-11 14:22:38 UTC
/var/lib/NetworkManager contains data, that does not get garbage collected. That is a problem, in particular when lots of new profiles/devices get created, and we leak DHCP lease files. etc.



O) in the past, /var/lib/NetworkManager/seen-bssids and /var/lib/NetworkManager/bssids had similar problems. This was solved by [1]. Maybe get inspired. But note that there the leaking happened inside one file. This bug is mostly about leaking while files files.

[1] https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/26f38b9ffafeab145150332f8a11eb94559c31bf



I) the lease files "/var/lib/NetworkManager/*.lease". We write these files per-connection and device, so they can pile up. We need to regularly clean files that:

  1) refer to a connection UUID which no longer exists

  2) it's harder to clean them up for unused devices. But maybe not too hard, and we can differentiate between the cases based on the `connection.type` of the profile:

    2a) if the profile specifies "connection.interface-name", then we can delete all leases that are for different interface-names. That's easy and correct.

    2b) otherwise, if the lease is for a hardware device (ethernet), then we can delete it if the interface currently does not exist. That is problematic, because if the user unplugs the USB device, we would delete the lease. 4) will mitigate that problem.

    2c) otherwise, see the timebase cleanup below (3).

  3) leases have a timestamp when we received them. Well, actually we no longer store the timestamp in the lease itself, but we could interpret the file timestamp for that. We also know the lease duration -- although, again, we don't store that in the "intern*.lease" files. Anyway, it means, maybe we could also prune files that are older than a certain time. This would require the system clock accurate, which may not be the case. So only do this step, if we have *too* many files (2000?) that are still in use.

  4) together with the previous rules, we can also keep an excess of e.g. 100 extra files. The goal is to not have an unlimited amount of stale files, but we are fine to keep a few (100). So, from all the candidates above that we would want to delete, sort them by the file timestamp, and keep the 100 least recently used. This should cover the case where you just "temporarily" deleted a profile or unplugged a device, but when you replug it, the file is still there... unless, you have too many of such files.

This means the timestamp of the lease files is important. Maybe make sure we touch them, when we activate/deactivate a profile.



Some of the suggestions approach seem unsafe and we might delete still useful leases. Note that with the internal DHCP client, we almost don't use the lease file for anything. Except for the "requested-address" (see [2]). For dhclient plugin that might be different, but in general, it seems the lease file is not very important to us. And we might be able to delete them more aggressively.

[2] https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/blob/ae0cc9618c49bb74bbe54a073dc337e9a3b0005b/src/core/dhcp/nm-dhcp-nettools.c#L1102




II) we usually write files with `g_file_set_contents()` or `nm_utils_file_set_contents()`. This creates a temporary file (like `mktemp ".XXXXXX"`) and does an atomit replace. If we crash in the middle, we might leak such temporary files. We should regularly clean such files, based on the filename. For example, check the following patterns whether they are created using the mktemp approach and have this problem:

   - internal-*.lease.XXXXXX
   - dhclient-*.lease.XXXXXX
   - dnsmasq-*.leases.XXXXXX
   - NetworkManager.state.XXXXXX
   - timestamps.XXXXXX
   - seen-bssids.XXXXXX
   - NetworkManager-intern.conf.XXXXXX
   - no-auto-default.state.XXXXXX
   - ** maybe others? **




III) do something about /var/lib/NetworkManager/dnsmasq-*.leases files. Probably works similar as lease files.



IV) check whether /var/lib/NetworkManager/no-auto-default.state has a problem. I think it doesn't.


---

What should happen, that we have a function prune_var_lib(), which iterates over all files in the directory, classifies them based on the well-known filenames, and checks whether they should be cleaned.

then, ensure that we call this function at least once (e.g. during startup). And after every X writes of stale files (so we need to count the writes, and trigger a prune periodically).

Comment 1 Thomas Haller 2022-03-22 11:27:08 UTC
see also nm_utils_find_mkstemp_files() in the sources.

Comment 3 RHEL Program Management 2023-08-16 18:28:21 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.