Bug 1884262

Summary:	Device properties (in particular description) is recomputed every time a /proc/1/mountinfo is processed
Product:	Red Hat Enterprise Linux 8	Reporter:	Renaud Métrich <rmetrich>
Component:	systemd	Assignee:	systemd-maint
Status:	CLOSED WONTFIX	QA Contact:	Frantisek Sumsal <fsumsal>
Severity:	high	Docs Contact:
Priority:	high
Version:	8.2	CC:	dtardon, ikent, kwalker, kzak, msekleta, systemd-maint-list
Target Milestone:	rc	Keywords:	Triaged
Target Release:	8.0
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-04-01 07:27:26 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Renaud Métrich 2020-10-01 13:31:59 UTC

Description of problem:

This is a continuation of BZ #1793533 (PropertiesChanged messages emitted for no good reason).
Every time systemd processes /proc/1/mountinfo, the device unit is updated through calling device_found_node() in mount_load_proc_self_mountinfo().

Usually the device didn't change and this is useless. Knowing that the device description is computed based on udev database, this creates a huge load on systemd, which prevents systemd for doing other useful tasks in the mean time.

We need to find a way to prevent doing this constantly, this greatly affects all heavy loaded environments (OpenShift, etc.).

I implemented a prototype caching /proc/1/mountinfo between calls and finding the difference to only process "new" devices (related to new mount points).
Using the reproducer from BZ #1793533, early testing shows an improvement in number of syscalls being called by 96%.

Without the optimization (but PR 15233):
   726796 systemd.strace.PR15233

With the optimization:
    27917 systemd.strace.PR15233+mountinfo_cache

Version-Release number of selected component (if applicable):

systemd-219 and later


How reproducible:

Always


Steps to Reproduce:
1. Use reproducer in BZ #1793533

Comment 4 Renaud Métrich 2020-10-08 09:10:28 UTC

Attaching Upstream PR (code slightly modified): https://github.com/systemd/systemd/pull/17283

Comment 11 Renaud Métrich 2020-10-12 09:53:00 UTC

New Upstream PR: https://github.com/systemd/systemd/pull/17321

The new implementation uses a Hashmap.

Comment 12 Renaud Métrich 2020-10-14 12:54:59 UTC

My PR is largely broken, I believe it optimizes too much and leads to issues in the unit states.
For example there are cases where device will remain as "loading tentative".

Hopefully raven-au is proposing some enhancements and these seem to work pretty well.
The optimization is less important but still profiling shows a 8x times decrease in mount_load_proc_self_mountinfo().
He should open a PR tomorrow.

Just for the records, the profiling systemtap script I wrote:

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
#!/usr/bin/env stap

global stats

probe process("/usr/lib/systemd/systemd").function("mount_load_proc_self_mountinfo").return {
	stats <<< gettimeofday_ns() - @entry(gettimeofday_ns())
}

probe end {
	printf("Distribution of %s latencies (in nanoseconds) for %d samples\n", "mount_load_proc_self_mountinfo", @count(stats))
        printf("max/avg/min: %d/%d/%d\n", @max(stats), @avg(stats), @min(stats))
        print(@hist_log(stats))
}
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Usage:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
# stap ./systemd.stp -x 1 -v
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Comment 13 Ian Kent 2020-10-15 00:48:53 UTC

(In reply to Renaud Métrich from comment #12)
> My PR is largely broken, I believe it optimizes too much and leads to issues
> in the unit states.
> For example there are cases where device will remain as "loading tentative".
> 
> Hopefully raven-au is proposing some enhancements and these seem to work
> pretty well.
> The optimization is less important but still profiling shows a 8x times
> decrease in mount_load_proc_self_mountinfo().
> He should open a PR tomorrow.

That's right, thanks Renaud.

I will need to do some more testing of my own before I submit the PR.

I'm still getting familiar with the systemd PR process and what's expected
of submissions so it may take a little while to get that right.

The testing will be more focused on CPU usage under load since that seems
to be the main concern in the reports we are seeing, and it's a bit different
to mount table handling testing I've done before. That was isolated purely
on rate of notifications and mount table size testing but the problem here
needs to cover the affect of flow on notifications over the d-bus too.

Ian

Comment 16 Ian Kent 2020-10-20 10:32:48 UTC

(In reply to Ian Kent from comment #15)
> (In reply to Ian Kent from comment #14)
> > 
> > There is no way to test the effectiveness of any possible fix if the needed
> > parameters to simulate the problem aren't known.
> 
> This info. might not be needed, but would be useful if we can get
> it.
>  
> I've been doing some testing with the change I have and the result
> is not too good.

So it turns out that I had made a small change to my patches and that
resulted in negligible improvement.

I went back to the patches I originally sent to Renaud and tested again.
This time I got results closer to what I expected.

For these tests I've used the new mountbomber utility from Karel Zak
in his experimental util-linux branch with a small change to report
the number of mounts and umounts performed during the test run.

I think this is more representative of normal(ish) usage and so is
better for testing. 

When looking at the test results it's important to understand that
mountbomber will mount the number of mounts specified for the test
at the start and umount them at the end. That skews the results a
little since we're concerned with measuring the effect the patches
have for mounts changing such as new, remounted or umounnted mounts
during normal activity and mounting and umounting a bunch of mounts
will register as changed mounts.

So I've kept the number of mounts used by mount bomber fairly low and
the number of random mounts and umounts fairly high.

First the system call count test.

mountbomber 200 mounts, with 10 seconds of random mount and umount.

Unpatched:

[root@localhost test]# wc -l /tmp/systemd.strace
146848 /tmp/systemd.strace

Patched:

[root@localhost test]# wc -l /tmp/systemd.strace
86915 /tmp/systemd.strace

This looks sensible at around 40% reduction in system calls given
the 200 mount actions at the start and again at the finish which
will count as changed mounts.

Next I used Renaud's systemtap script to count the number of calls
to mount_load_proc_self_mountinfo() and hence the number of events
being caught.

mountbomber 200 mounts, with 10 seconds of random mount and umount.

Unpatched:

# stap ./mount_load_proc_self_mountinfo.stp -x 1 -v
[root@localhost test]# stap ./mount_load_proc_self_mountinfo.stp -x 1 -v
Pass 1: parsed user script and 492 library scripts using 400468virt/170024res/13080shr/156964data kb, in 470usr/150sys/842real ms.
Pass 2: analyzed script: 3 probes, 2 functions, 1 embed, 3 globals using 414068virt/184864res/14092shr/170564data kb, in 180usr/10sys/223real ms.
Pass 3: translated to C into "/tmp/stapcF3vGC/stap_16e25d390818cd77a19693a6f5e4ec02_3054_src.c" using 414068virt/185080res/14308shr/170564data kb, in 20usr/380sys/1492real ms.
Pass 4: compiled C into "stap_16e25d390818cd77a19693a6f5e4ec02_3054.ko" in 2880usr/650sys/3786real ms.
Pass 5: starting run.
Distribution of mount_load_proc_self_mountinfo latencies (in nanoseconds) for 4000 samples
max/avg/min: 4250949/2144625/893277
   value |-------------------------------------------------- count
  131072 |                                                      0
  262144 |                                                      0
  524288 |                                                      2
 1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  2459
 2097152 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     1538
 4194304 |                                                      1
 8388608 |                                                      0
16777216 |                                                      0

Pass 5: run completed in 20usr/40sys/17907real ms.

17:29:11        CPU     %user     %nice   %system   %iowait    %steal     %idle
17:29:15        all      0.00      0.00      0.00      0.00      0.00    100.00
17:29:16        all     21.56      0.00     15.38      0.13      0.00     62.93
17:29:17        all     23.65      0.00     16.94      0.00      0.00     59.41
17:29:18        all     23.67      0.00     17.20      0.00      0.00     59.13
17:29:19        all     23.48      0.00     17.20      0.00      0.19     59.13
17:29:20        all     23.56      0.00     17.23      0.00      0.06     59.15
17:29:21        all     23.99      0.00     16.67      1.33      0.13     57.89
17:29:22        all     23.83      0.00     17.02      0.00      0.06     59.08
17:29:23        all     23.81      0.00     17.29      0.00      0.00     58.90
17:29:24        all     23.93      0.00     16.96      0.00      0.25     58.86
17:29:25        all     23.47      0.00     17.50      0.00      0.00     59.03
17:29:26        all      3.06      0.00      2.50      0.00      0.00     94.45
17:29:27        all      0.00      0.00      0.00      0.00      0.00    100.00
Patched:

[root@localhost test]# stap ./mount_load_proc_self_mountinfo.stp -x 1 -v
Pass 1: parsed user script and 492 library scripts using 400468virt/170124res/13176shr/156964data kb, in 460usr/150sys/821real ms.
Pass 2: analyzed script: 3 probes, 2 functions, 1 embed, 3 globals using 414072virt/185076res/14296shr/170568data kb, in 170usr/20sys/216real ms.
Pass 3: using cached /root/.systemtap/cache/1f/stap_1f5145cb146da316e6083966f10436fc_3054.c
Pass 4: using cached /root/.systemtap/cache/1f/stap_1f5145cb146da316e6083966f10436fc_3054.ko
Pass 5: starting run.
Distribution of mount_load_proc_self_mountinfo latencies (in nanoseconds) for 5409 samples
max/avg/min: 3184960/1662429/222757
  value |-------------------------------------------------- count
  32768 |                                                      0
  65536 |                                                      0
 131072 |                                                      2
 262144 |                                                      3
 524288 |                                                      6
1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  4909
2097152 |@@@@                                                489
4194304 |                                                      0
8388608 |                                                      0

Pass 5: run completed in 10usr/40sys/17412real ms.

17:29:11        CPU     %user     %nice   %system   %iowait    %steal     %idle
17:38:30        all      0.00      0.00      0.00      0.00      0.00    100.00
17:38:31        all      0.06      0.00      0.00      0.00      0.00     99.94
17:38:32        all     21.59      0.00     14.46      0.25      0.00     63.70
17:38:33        all     23.19      0.00     16.31      0.00      0.00     60.50
17:38:34        all     23.26      0.00     15.68      0.00      0.00     61.06
17:38:35        all     23.64      0.00     15.61      0.00      0.00     60.75
17:38:36        all     23.67      0.00     15.57      0.00      0.19     60.58
17:38:37        all     23.77      0.00     15.60      1.70      0.06     58.87
17:38:38        all     23.51      0.00     15.65      0.00      0.06     60.78
17:38:39        all     23.90      0.00     15.41      0.00      0.06     60.63
17:38:40        all     24.06      0.00     15.62      0.00      0.00     60.33
17:38:41        all     23.65      0.00     15.72      0.25      0.06     60.31
17:38:42        all      3.19      0.00      2.25      0.00      0.00     94.56
17:38:43        all      0.00      0.00      0.00      0.00      0.00    100.00

Not as good as we would like but, again, there's those mounts at the start and
end of the run.

The thing that's most interesting about these results is the CPU usage is pretty
much identical for the patched and unpatched runs but we see around a 26% increase
in the number of calls to mount_load_proc_self_mountinfo() with the patched results.

That's because as the mount unit processing is improved the systemd event handling
sub-system processes more change events negating the improvements that have been
made.

So, from a user perspective, there's been no improvement in resource usage!

I think this is a clear indication of the importance of Michal Sekletar's
notifications rate limiting change proposal. Without the ability to mange
the rate of notifications that are serviced under load, improvements made
won't be realised.

I saw this with the newly proposed kernel API (that was ultimately rejected)
where there was a significant improvement in resource usage but to be able to
take advantage of the notification queueing feature of the new notifications
API the notification rate needed to be managed. That then gave a further large
improvement that resulted in reaching the expected improvement for the changes
(around 5-10% CPU per mount table monitoring process under heavy load) from
new kernel API.

Anyway, it's an improvement so I'll construct a PR, but really it also depends
on Michal's change to get us anywhere with this or any other improvements for
that matter.

Ian

Comment 17 Ian Kent 2020-10-24 07:48:20 UTC

Upstream PR (awaiting review): https://github.com/systemd/systemd/pull/17409

Comment 19 RHEL Program Management 2022-04-01 07:27:26 UTC

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.