1819868 – systemd excessively reads mountinfo and udev is dense OpenShift environments

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1819868 - systemd excessively reads mountinfo and udev is dense OpenShift environments

Summary: systemd excessively reads mountinfo and udev is dense OpenShift environments

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	systemd
Sub Component:
Version:	8.1
Hardware:	All
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	rc
Target Release:	8.0
Assignee:	Michal Sekletar
QA Contact:	Frantisek Sumsal
Docs Contact:
URL:
Whiteboard:
Duplicates (4):	1871106 1878470 1879152 1931535 (view as bug list)
Depends On:
Blocks:	1871029 1937315 1957726
TreeView+	depends on / blocked

Reported:	2020-04-01 18:58 UTC by Matthew Robson
Modified:	2024-12-20 19:02 UTC (History)
CC List:	31 users (show)
Fixed In Version:	systemd-239-45.el8
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1937315 (view as bug list)
Environment:
Last Closed:	2021-05-18 14:53:17 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:
Flags:	pm-rhel: mirror+

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	systemd systemd issues 15464	0	None	open	[RFE] Optimise /proc/self/mountinfo notification dispatch	2021-02-15 09:23:57 UTC
Github	systemd systemd pull 16537	0	None	closed	[WIP] Mount storms and sd-event rate-limiting	2021-02-15 09:23:57 UTC

Internal Links: 1957726

Description Matthew Robson 2020-04-01 18:58:55 UTC

Description of problem:

This is a continuation of issues found via: https://bugzilla.redhat.com/show_bug.cgi?id=1793527 and https://bugzilla.redhat.com/show_bug.cgi?id=1769928

A customer running OpenShift 3.11 is suffering from high CPU utilization on all OCP compute nodes running systemd-219-67.el7_7.1.

- Looking at 2 consecutive processing, we can see that systemd does the following:

  - reads /proc/self/mountinfo
  - reads /run/mount/utab
  - reads all block devices in /run/udev
  - send D-Bus messages for each device unit

The code being called when a mount event happens is shown below:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- 
1711 static int mount_dispatch_io(sd_event_source *source, int fd, uint32_t revents, void *userdata) {
1712         _cleanup_set_free_ Set *around = NULL, *gone = NULL;
1713         Manager *m = userdata;
1714         const char *what;
1715         Iterator i;
1716         Unit *u;
1717         int r;
1718 
1719         assert(m);
1720         assert(revents & (EPOLLPRI | EPOLLIN));
1721 
1722         /* The manager calls this for every fd event happening on the
1723          * /proc/self/mountinfo file, which informs us about mounting
1724          * table changes, and for /run/mount events which we watch
1725          * for mount options. */
1726 
1727         if (fd == m->utab_inotify_fd) {
 :
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

In our scenario, we can see that the "m->utab_inotify_fd" filedescriptor is modified continuously, as shown in the 2 strace excerpts below ("st_atime" fiel, the rest is unchanged):
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
stat("/run/mount/utab", {st_dev=makedev(0, 20), st_ino=816636239, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=244, st_atime=2019/10/31-09:30:29.822785431, st_mtime=2019/10/30-12:03:59.681429926, st_ctime=2019/10/30-12:03:59.681429926})

later, next epoll_wait() return:

stat("/run/mount/utab", {st_dev=makedev(0, 20), st_ino=816636239, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=244, st_atime=2019/10/31-09:30:30.293786124, st_mtime=2019/10/30-12:03:59.681429926, st_ctime=2019/10/30-12:03:59.681429926})
read(16</run/mount/utab>, "SRC=ociopf-p-150.dmz:/srv/nfs/oscbkp/app-prod-0z6f0qq0k2fz TARGET=/var/lib/origin/openshift.local.volumes/pods/28f79e7c-fb47-11e9-b012-005056832285/volumes/kubernetes.io~nfs/app-prod-0z6f0qq0k2fz ROOT=/ ATTRS=addr=10.10.10.10 \n", 4096)
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Every time systemd reads /proc/1/mountinfo (due to having a new mount or one disappearing), the whole udev database is read, device descriptions built based on properties from udev database, all this to end up with doing nothing because the device description didn't change at all.

Using systemtap, I found out (with my test build which doesn't send PropertiesChanged when systemd finds out device description didn't change, which is always the case in fact) that tons of strdup() and free() were still done for no good reason:

During the night, top MALLOC callers:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
--- MALLOC 144336200 samples recorded:

15727136 hits for __strdup+0x1a [libc-2.17.so]
udev_list_entry_add+0x96 [systemd]
udev_device_add_property_from_string+0x5d [systemd]
udev_device_read_db+0x19b [systemd]
udev_device_get_properties_list_entry+0x8f [systemd]
udev_device_get_property_value+0x13 [systemd]
device_setup_unit+0xfd [systemd]
device_found_node+0x174 [systemd]
mount_load_proc_self_mountinfo+0x147 [systemd]
mount_dispatch_io+0x67 [systemd]
source_dispatch+0x1c0 [systemd]
sd_event_dispatch+0x6a [systemd]
manager_loop+0x403 [systemd]
main+0x1e7b [system

15727136 hits for __strdup+0x1a [libc-2.17.so]
udev_list_entry_add+0xb0 [systemd]
udev_device_add_property_from_string+0x5d [systemd]
udev_device_read_db+0x19b [systemd]
udev_device_get_properties_list_entry+0x8f [systemd]
udev_device_get_property_value+0x13 [systemd]
device_setup_unit+0xfd [systemd]
device_found_node+0x174 [systemd]
mount_load_proc_self_mountinfo+0x147 [systemd]
mount_dispatch_io+0x67 [systemd]
source_dispatch+0x1c0 [systemd]
sd_event_dispatch+0x6a [systemd]
manager_loop+0x403 [systemd]
main+0x1e7b [system

12249906 hits for _IO_vfscanf+0x1ff4 [libc-2.17.so]
__vsscanf+0x67 [libc-2.17.so]
sscanf+0x87 [libc-2.17.so]
0x7f1098d6d25a [libmount.so.1.1.0+0x1e25a]

12249906 hits for _IO_vfscanf+0x1ff4 [libc-2.17.so]
__vsscanf+0x67 [libc-2.17.so]
sscanf+0x87 [libc-2.17.so]
0x7f1098d6d1fc [libmount.so.1.1.0+0x1e1fc]
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Everything there is related to udev, then mount processing.

Same from FREE:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
--- FREE 268964508 samples recorded:

27328937 hits for udev_list_entry_delete+0x6e [systemd]
udev_list_cleanup+0x56 [systemd]
udev_device_unref+0x73 [systemd]
device_found_node+0x12c [systemd]
mount_load_proc_self_mountinfo+0x147 [systemd]
mount_dispatch_io+0x67 [systemd]
source_dispatch+0x1c0 [systemd]
sd_event_dispatch+0x6a [systemd]
manager_loop+0x403 [systemd]
main+0x1e7b [systemd]
__libc_start_main+0xf5 [libc-2.17.so]
_start+0x29 [systemd]

27328937 hits for udev_list_cleanup+0x56 [systemd]
udev_device_unref+0x73 [systemd]
device_found_node+0x12c [systemd]
mount_load_proc_self_mountinfo+0x147 [systemd]
mount_dispatch_io+0x67 [systemd]
source_dispatch+0x1c0 [systemd]
sd_event_dispatch+0x6a [systemd]
manager_loop+0x403 [systemd]
main+0x1e7b [systemd]
__libc_start_main+0xf5 [libc-2.17.so]
_start+0x29 [systemd]

27328937 hits for udev_list_entry_delete+0x77 [systemd]
udev_list_cleanup+0x56 [systemd]
udev_device_unref+0x73 [systemd]
device_found_node+0x12c [systemd]
mount_load_proc_self_mountinfo+0x147 [systemd]
mount_dispatch_io+0x67 [systemd]
source_dispatch+0x1c0 [systemd]
sd_event_dispatch+0x6a [systemd]
manager_loop+0x403 [systemd]
main+0x1e7b [systemd]
__libc_start_main+0xf5 [libc-2.17.so]
_start+0x29 [systemd]
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------


Version-Release number of selected component (if applicable):

systemd-219-67.el7_7.1

How reproducible:

Easily - have reproduced in an internal lab

Steps to Reproduce:
1. In an OCP 3.11 environment, create 200 pods on a single node that use a combination of overlay2, persistent volumes, configmaps, secrets and hostmounts to get 1500-2000 entries in /proc/1/mountinfo
2. Randomly create and delete pods
3. Monitor changes with strace and system tap

Actual results:

Highg CPU utilization which can lead to systemd overload causing connection timeouts and system crashes.

Expected results:

Stability

Additional info:

Comment 2 Michal Sekletar 2020-06-30 10:34:29 UTC

I am working on a patch that will introduce possibility to set rate-limits on sd_event_source objects. We can then use the new API and apply rate-limiting on mountinfo event source. This should help us to workaround this issue until new fsinfo() kernel interface gives us possibility to receive only differential updates of the mount table.

Comment 3 Ian Kent 2020-07-16 02:09:51 UTC

(In reply to Michal Sekletar from comment #2)
> I am working on a patch that will introduce possibility to set rate-limits
> on sd_event_source objects. We can then use the new API and apply
> rate-limiting on mountinfo event source. This should help us to workaround
> this issue until new fsinfo() kernel interface gives us possibility to
> receive only differential updates of the mount table.

Right, the fsinfo() and kernelwatch notifications changes are proving difficult
to get merged upstream.

Linux wants to see that these changes will be used by user space and are
sufficient to help before merging the changes. We have assumed that means
writing libmount and systemd code that shows the proposed changes work to
resolve this and similar problems which is what I'm doing.

Now, I've found the systemd notifications code difficult to follow but I've
finally I've got to a point where I'm implementing the mount table handling
in systemd (Karel Zak has done a lot of work on the libmount side already)
with the new system. That is quickly leading to questions about what the unit
update handling functions do and there's also the detail of of the notifications
behaviour and matching them to each other to get the notified entry only updated.

I posted to the systemd-devel list describing what I'm doing and asked for help
but unfortunately, at that point, I couldn't be specific about what I was asking
so got virtually no replies.

So, if anyone on this bug is familiar with the unit processing function calls
under mount_dispatch_io() I would very much appreciate any information on what
they are meant to do, even an overview of the unit update process that occurs
here would be useful.

Ian

Comment 4 Ian Kent 2020-07-16 02:11:33 UTC

(In reply to Ian Kent from comment #3)
> 
> Linux wants to see that these changes will be used by user space and are
> sufficient to help before merging the changes. We have assumed that means
> writing libmount and systemd code that shows the proposed changes work to
> resolve this and similar problems which is what I'm doing.
> 

Of course that should be Linus not Linux, ;)

Comment 8 David Tardon 2020-09-02 10:46:28 UTC

*** Bug 1871106 has been marked as a duplicate of this bug. ***

Comment 9 Seth Jennings 2020-09-14 17:16:29 UTC

*** Bug 1878470 has been marked as a duplicate of this bug. ***

Comment 15 Seth Jennings 2020-09-28 15:59:33 UTC

*** Bug 1879152 has been marked as a duplicate of this bug. ***

Comment 39 David Tardon 2021-03-08 08:45:58 UTC

*** Bug 1931535 has been marked as a duplicate of this bug. ***

Comment 42 Lukáš Nykrýn 2021-03-10 11:41:55 UTC

fix merged to github master branch -> https://github.com/systemd-rhel/rhel-8/pull/154

Comment 53 Li Zhou 2021-04-20 02:21:11 UTC

Hi, Michal and all:
    We met this issue on systemd-219-67.el7 (centos7) too.
So is there any plan to port those patches back to systemd-219 (centos7)?
Thanks.

Comment 59 errata-xmlrpc 2021-05-18 14:53:17 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: systemd security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1611

Note You need to log in before you can comment on or make changes to this bug.

amarirom
cfergeau
chris.bowles
djuran
dmunneor
dornelas
dtardon
fdeutsch
fiezzi
fkrska
guchen
ikent
jamacku
jcrumple
jhopper
joedward
li_zhou_julia
lmiksik
mfojtik
michele
mpatel
msekleta
qguo
rcain
rmetrich
rperiyas
sbroz
sperezto
svelosol
systemd-maint-list
vhernand