Bug 1884262 - Device properties (in particular the description) are recomputed every time /proc/1/mountinfo is processed
Summary: Device properties (in particular the description) are recomputed every time /pro...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: systemd
Version: 8.2
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
: 8.0
Assignee: systemd-maint
QA Contact: Frantisek Sumsal
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-01 13:31 UTC by Renaud Métrich
Modified: 2022-04-01 07:27 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-01 07:27:26 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github systemd systemd pull 17409 0 None open Mountinfo processing improvement 2021-03-22 08:36:39 UTC
Red Hat Bugzilla 1793533 0 urgent CLOSED systemd sends tons of useless PropertiesChanged messages when a mount happens 2024-03-25 15:38:52 UTC

Description Renaud Métrich 2020-10-01 13:31:59 UTC
Description of problem:

This is a continuation of BZ #1793533 (PropertiesChanged messages emitted for no good reason).
Every time systemd processes /proc/1/mountinfo, the device unit is updated by calling device_found_node() from mount_load_proc_self_mountinfo().

Usually the device has not changed, so this work is unnecessary. Since the device description is computed from the udev database, this creates a huge load on systemd and prevents it from doing other useful work in the meantime.

We need to find a way to avoid doing this constantly; it greatly affects all heavily loaded environments (OpenShift, etc.).

I implemented a prototype that caches /proc/1/mountinfo between calls and diffs it so that only "new" devices (related to new mount points) are processed.
Using the reproducer from BZ #1793533, early testing shows a 96% reduction in the number of syscalls issued.

Without the optimization (but with PR 15233 applied):
   726796 systemd.strace.PR15233

With the optimization:
    27917 systemd.strace.PR15233+mountinfo_cache
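
A minimal sketch of the caching idea (hypothetical code, not the prototype itself; mountinfo_changed() is an illustrative name): keep the previous contents of /proc/self/mountinfo and skip the expensive per-device work entirely when nothing has changed. A fixed-size buffer keeps the sketch short; real code would grow it as needed.

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
/* Illustrative only: cache the raw mountinfo text between calls and bail out
 * early when it is identical to the previous snapshot. */
#include <stdio.h>
#include <string.h>

#define MOUNTINFO_MAX (1024 * 1024)

static char cached[MOUNTINFO_MAX];

/* Returns 1 if mountinfo changed since the previous call (or on read error,
 * so callers fall back to full processing), 0 if it is identical. */
int mountinfo_changed(void) {
        static char current[MOUNTINFO_MAX];

        FILE *f = fopen("/proc/self/mountinfo", "re");
        if (!f)
                return 1;

        size_t n = fread(current, 1, sizeof(current) - 1, f);
        fclose(f);
        current[n] = '\0';

        if (strcmp(cached, current) == 0)
                return 0;       /* unchanged: no device/description updates needed */

        memcpy(cached, current, n + 1);
        return 1;               /* changed: reprocess mounts and update device units */
}
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

The actual prototype is finer-grained (it diffs the entries so that only genuinely new mount points are processed), but even this coarse check shows where the saving comes from: when the table has not changed, the udev database lookups simply never happen.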

Version-Release number of selected component (if applicable):

systemd-219 and later


How reproducible:

Always


Steps to Reproduce:
1. Use reproducer in BZ #1793533

Comment 4 Renaud Métrich 2020-10-08 09:10:28 UTC
Attaching Upstream PR (code slightly modified): https://github.com/systemd/systemd/pull/17283

Comment 11 Renaud Métrich 2020-10-12 09:53:00 UTC
New Upstream PR: https://github.com/systemd/systemd/pull/17321

The new implementation uses a Hashmap.
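
For illustration, a sketch of the keyed-lookup idea (hypothetical code, not the PR's implementation: process_mountinfo() and process_mount_entry() are made-up names, and POSIX hsearch() stands in for systemd's internal Hashmap type): index the previous snapshot by mount point so that only entries that are new, or whose mountinfo line changed, get reprocessed. Removed mount points are not handled here, since hsearch() cannot delete keys.

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
#define _GNU_SOURCE
#include <search.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder for the expensive per-device work (udev lookups, description
 * computation, ...). */
static void process_mount_entry(const char *mount_point) {
        printf("processing %s\n", mount_point);
}

static void process_mountinfo(void) {
        FILE *f = fopen("/proc/self/mountinfo", "re");
        if (!f)
                return;

        char *line = NULL;
        size_t len = 0;

        while (getline(&line, &len, f) > 0) {
                char *mount_point = NULL;

                /* The mount point is the fifth whitespace-separated field. */
                if (sscanf(line, "%*s %*s %*s %*s %ms", &mount_point) != 1)
                        continue;

                ENTRY item = { .key = mount_point, .data = NULL };
                ENTRY *found = hsearch(item, FIND);

                if (!found) {
                        /* New mount point: remember its line and do the real work. */
                        item.data = strdup(line);
                        hsearch(item, ENTER);
                        process_mount_entry(mount_point);
                } else if (strcmp(found->data, line) != 0) {
                        /* Known mount point whose entry changed (e.g. a remount). */
                        free(found->data);
                        found->data = strdup(line);
                        process_mount_entry(mount_point);
                        free(mount_point);
                } else {
                        /* Unchanged entry: skip the expensive work entirely. */
                        free(mount_point);
                }
        }

        free(line);
        fclose(f);
}

int main(void) {
        hcreate(4096);          /* previous-snapshot index, persists across passes */
        process_mountinfo();    /* first pass: every mount point is "new" */
        process_mountinfo();    /* second pass: nothing changed, nothing is processed */
        return 0;
}
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

A real implementation also has to notice mount points that disappear between passes, which this sketch deliberately skips.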

Comment 12 Renaud Métrich 2020-10-14 12:54:59 UTC
My PR is largely broken: I believe it optimizes too much and leads to issues in the unit states.
For example, there are cases where the device will remain as "loading tentative".

Fortunately, raven-au is proposing some enhancements and these seem to work pretty well.
The optimization is smaller, but profiling still shows an 8x decrease in mount_load_proc_self_mountinfo() latency.
He should open a PR tomorrow.

Just for the record, here is the profiling SystemTap script I wrote:

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
#!/usr/bin/env stap

global stats

# Measure the wall-clock time spent in each call (function entry to return)
probe process("/usr/lib/systemd/systemd").function("mount_load_proc_self_mountinfo").return {
	stats <<< gettimeofday_ns() - @entry(gettimeofday_ns())
}

# On exit, print a latency histogram of the collected samples
probe end {
	printf("Distribution of %s latencies (in nanoseconds) for %d samples\n", "mount_load_proc_self_mountinfo", @count(stats))
	printf("max/avg/min: %d/%d/%d\n", @max(stats), @avg(stats), @min(stats))
	print(@hist_log(stats))
}
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Usage:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
# stap ./systemd.stp -x 1 -v
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Comment 13 Ian Kent 2020-10-15 00:48:53 UTC
(In reply to Renaud Métrich from comment #12)
> My PR is largely broken, I believe it optimizes too much and leads to issues
> in the unit states.
> For example there are cases where device will remain as "loading tentative".
> 
> Hopefully raven-au is proposing some enhancements and these seem to work
> pretty well.
> The optimization is less important but still profiling shows a 8x times
> decrease in mount_load_proc_self_mountinfo().
> He should open a PR tomorrow.

That's right, thanks Renaud.

I will need to do some more testing of my own before I submit the PR.

I'm still getting familiar with the systemd PR process and what's expected
of submissions so it may take a little while to get that right.

The testing will be more focused on CPU usage under load, since that seems
to be the main concern in the reports we are seeing, and it's a bit different
from the mount table handling testing I've done before. That testing focused
purely on the rate of notifications and on mount table size, but the problem
here also needs to cover the effect of the resulting flow of notifications
over D-Bus.

Ian

Comment 16 Ian Kent 2020-10-20 10:32:48 UTC
(In reply to Ian Kent from comment #15)
> (In reply to Ian Kent from comment #14)
> > 
> > There is no way to test the effectiveness of any possible fix if the needed
> > parameters to simulate the problem aren't known.
> 
> This info. might not be needed, but would be useful if we can get
> it.
>  
> I've been doing some testing with the change I have and the result
> is not too good.

So it turns out that I had made a small change to my patches and that
resulted in negligible improvement.

I went back to the patches I originally sent to Renaud and tested again.
This time I got results closer to what I expected.

For these tests I've used the new mountbomber utility from Karel Zak
in his experimental util-linux branch with a small change to report
the number of mounts and umounts performed during the test run.

I think this is more representative of normal(ish) usage and so is
better for testing. 

When looking at the test results it's important to understand that
mountbomber will mount the number of mounts specified for the test
at the start and umount them at the end. That skews the results a
little, since we're concerned with measuring the effect the patches
have on changing mounts (new, remounted or umounted) during normal
activity, and mounting and umounting a bunch of mounts at the start
and end will register as changed mounts.

So I've kept the number of mounts used by mountbomber fairly low and
the number of random mounts and umounts fairly high.

First the system call count test.

mountbomber 200 mounts, with 10 seconds of random mount and umount.

Unpatched:

[root@localhost test]# wc -l /tmp/systemd.strace
146848 /tmp/systemd.strace

Patched:

[root@localhost test]# wc -l /tmp/systemd.strace
86915 /tmp/systemd.strace

This looks sensible: roughly a 40% reduction in system calls (from 146848
strace lines down to 86915), given the 200 mount actions at the start and
again at the finish, which will count as changed mounts.

Next I used Renaud's systemtap script to count the number of calls
to mount_load_proc_self_mountinfo() and hence the number of events
being caught.

mountbomber 200 mounts, with 10 seconds of random mount and umount.

Unpatched:

[root@localhost test]# stap ./mount_load_proc_self_mountinfo.stp -x 1 -v
Pass 1: parsed user script and 492 library scripts using 400468virt/170024res/13080shr/156964data kb, in 470usr/150sys/842real ms.
Pass 2: analyzed script: 3 probes, 2 functions, 1 embed, 3 globals using 414068virt/184864res/14092shr/170564data kb, in 180usr/10sys/223real ms.
Pass 3: translated to C into "/tmp/stapcF3vGC/stap_16e25d390818cd77a19693a6f5e4ec02_3054_src.c" using 414068virt/185080res/14308shr/170564data kb, in 20usr/380sys/1492real ms.
Pass 4: compiled C into "stap_16e25d390818cd77a19693a6f5e4ec02_3054.ko" in 2880usr/650sys/3786real ms.
Pass 5: starting run.
Distribution of mount_load_proc_self_mountinfo latencies (in nanoseconds) for 4000 samples
max/avg/min: 4250949/2144625/893277
   value |-------------------------------------------------- count
  131072 |                                                      0
  262144 |                                                      0
  524288 |                                                      2
 1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  2459
 2097152 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     1538
 4194304 |                                                      1
 8388608 |                                                      0
16777216 |                                                      0

Pass 5: run completed in 20usr/40sys/17907real ms.

17:29:11        CPU     %user     %nice   %system   %iowait    %steal     %idle
17:29:15        all      0.00      0.00      0.00      0.00      0.00    100.00
17:29:16        all     21.56      0.00     15.38      0.13      0.00     62.93
17:29:17        all     23.65      0.00     16.94      0.00      0.00     59.41
17:29:18        all     23.67      0.00     17.20      0.00      0.00     59.13
17:29:19        all     23.48      0.00     17.20      0.00      0.19     59.13
17:29:20        all     23.56      0.00     17.23      0.00      0.06     59.15
17:29:21        all     23.99      0.00     16.67      1.33      0.13     57.89
17:29:22        all     23.83      0.00     17.02      0.00      0.06     59.08
17:29:23        all     23.81      0.00     17.29      0.00      0.00     58.90
17:29:24        all     23.93      0.00     16.96      0.00      0.25     58.86
17:29:25        all     23.47      0.00     17.50      0.00      0.00     59.03
17:29:26        all      3.06      0.00      2.50      0.00      0.00     94.45
17:29:27        all      0.00      0.00      0.00      0.00      0.00    100.00
Patched:

[root@localhost test]# stap ./mount_load_proc_self_mountinfo.stp -x 1 -v
Pass 1: parsed user script and 492 library scripts using 400468virt/170124res/13176shr/156964data kb, in 460usr/150sys/821real ms.
Pass 2: analyzed script: 3 probes, 2 functions, 1 embed, 3 globals using 414072virt/185076res/14296shr/170568data kb, in 170usr/20sys/216real ms.
Pass 3: using cached /root/.systemtap/cache/1f/stap_1f5145cb146da316e6083966f10436fc_3054.c
Pass 4: using cached /root/.systemtap/cache/1f/stap_1f5145cb146da316e6083966f10436fc_3054.ko
Pass 5: starting run.
Distribution of mount_load_proc_self_mountinfo latencies (in nanoseconds) for 5409 samples
max/avg/min: 3184960/1662429/222757
  value |-------------------------------------------------- count
  32768 |                                                      0
  65536 |                                                      0
 131072 |                                                      2
 262144 |                                                      3
 524288 |                                                      6
1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  4909
2097152 |@@@@                                                489
4194304 |                                                      0
8388608 |                                                      0

Pass 5: run completed in 10usr/40sys/17412real ms.

17:29:11        CPU     %user     %nice   %system   %iowait    %steal     %idle
17:38:30        all      0.00      0.00      0.00      0.00      0.00    100.00
17:38:31        all      0.06      0.00      0.00      0.00      0.00     99.94
17:38:32        all     21.59      0.00     14.46      0.25      0.00     63.70
17:38:33        all     23.19      0.00     16.31      0.00      0.00     60.50
17:38:34        all     23.26      0.00     15.68      0.00      0.00     61.06
17:38:35        all     23.64      0.00     15.61      0.00      0.00     60.75
17:38:36        all     23.67      0.00     15.57      0.00      0.19     60.58
17:38:37        all     23.77      0.00     15.60      1.70      0.06     58.87
17:38:38        all     23.51      0.00     15.65      0.00      0.06     60.78
17:38:39        all     23.90      0.00     15.41      0.00      0.06     60.63
17:38:40        all     24.06      0.00     15.62      0.00      0.00     60.33
17:38:41        all     23.65      0.00     15.72      0.25      0.06     60.31
17:38:42        all      3.19      0.00      2.25      0.00      0.00     94.56
17:38:43        all      0.00      0.00      0.00      0.00      0.00    100.00

Not as good as we would like but, again, there are those mounts at the start
and end of the run.

The most interesting thing about these results is that the CPU usage is pretty
much identical for the patched and unpatched runs, yet we see around a 35%
increase in the number of calls to mount_load_proc_self_mountinfo() (5409
versus 4000) with the patched run.

That's because, as the mount unit processing is improved, the systemd event
handling sub-system processes more change events, negating the improvements
that have been made.

So, from a user perspective, there's been no improvement in resource usage!

I think this is a clear indication of the importance of Michal Sekletar's
notification rate-limiting proposal. Without the ability to manage the rate
at which notifications are serviced under load, improvements made
won't be realised.
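
For context, here is a minimal sketch of what rate-limiting an event source looks like with the sd-event API. It uses sd_event_source_set_ratelimit(), which exists in current systemd releases, and is only meant to illustrate the kind of mechanism such a proposal provides, not Michal's actual change; on_mountinfo_event() is an illustrative name.

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
/* Illustrative only: watch /proc/self/mountinfo (the kernel signals mount table
 * changes as EPOLLPRI on this fd) and cap how often the handler is dispatched.
 * Build with: cc ratelimit.c $(pkg-config --cflags --libs libsystemd) */
#include <fcntl.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>
#include <systemd/sd-event.h>

static int on_mountinfo_event(sd_event_source *s, int fd, uint32_t revents, void *userdata) {
        char buf[4096];

        /* Re-read the file so the poll condition is cleared; a real consumer
         * would re-parse the mount table here. */
        lseek(fd, 0, SEEK_SET);
        while (read(fd, buf, sizeof(buf)) > 0)
                ;

        printf("mount table changed, reprocessing\n");
        return 0;
}

int main(void) {
        sd_event *event = NULL;
        sd_event_source *source = NULL;

        if (sd_event_default(&event) < 0)
                return 1;

        int fd = open("/proc/self/mountinfo", O_RDONLY | O_CLOEXEC);
        if (fd < 0)
                return 1;

        if (sd_event_add_io(event, &source, fd, EPOLLPRI, on_mountinfo_event, NULL) < 0)
                return 1;

        /* Dispatch the handler at most 5 times per second; further events within
         * the interval are held back until the limit lifts. */
        if (sd_event_source_set_ratelimit(source, 1000000ULL, 5) < 0)
                return 1;

        int r = sd_event_loop(event);

        sd_event_source_unref(source);
        sd_event_unref(event);
        close(fd);
        return r < 0 ? 1 : 0;
}
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Whatever the exact mechanism, the point stands: without some cap on how often the change events are serviced, making each individual pass cheaper just means more passes get scheduled.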

I saw this with the newly proposed kernel API (that was ultimately rejected):
there was a significant improvement in resource usage, but to take advantage
of the notification queueing feature of the new API the notification rate
needed to be managed. Managing it then gave a further large improvement,
reaching the expected gain from the new kernel API (around 5-10% CPU per
mount table monitoring process under heavy load).

Anyway, it's an improvement, so I'll construct a PR, but it really also depends
on Michal's change to get us anywhere with this or any other improvement for
that matter.

Ian

Comment 17 Ian Kent 2020-10-24 07:48:20 UTC
Upstream PR (awaiting review): https://github.com/systemd/systemd/pull/17409

Comment 19 RHEL Program Management 2022-04-01 07:27:26 UTC
After evaluating this issue, we have no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

