Note: This bug is displayed in read-only format because
the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Description of problem:
This is a continuation of issues found via: https://bugzilla.redhat.com/show_bug.cgi?id=1793527 and https://bugzilla.redhat.com/show_bug.cgi?id=1769928
A customer running OpenShift 3.11 is suffering from high CPU utilization on all OCP compute nodes running systemd-219-67.el7_7.1.
- Looking at 2 consecutive processing, we can see that systemd does the following:
- reads /proc/self/mountinfo
- reads /run/mount/utab
- reads all block devices in /run/udev
- send D-Bus messages for each device unit
The code being called when a mount event happens is shown below:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
1711 static int mount_dispatch_io(sd_event_source *source, int fd, uint32_t revents, void *userdata) {
1712 _cleanup_set_free_ Set *around = NULL, *gone = NULL;
1713 Manager *m = userdata;
1714 const char *what;
1715 Iterator i;
1716 Unit *u;
1717 int r;
1718
1719 assert(m);
1720 assert(revents & (EPOLLPRI | EPOLLIN));
1721
1722 /* The manager calls this for every fd event happening on the
1723 * /proc/self/mountinfo file, which informs us about mounting
1724 * table changes, and for /run/mount events which we watch
1725 * for mount options. */
1726
1727 if (fd == m->utab_inotify_fd) {
:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
In our scenario, we can see that the "m->utab_inotify_fd" filedescriptor is modified continuously, as shown in the 2 strace excerpts below ("st_atime" fiel, the rest is unchanged):
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
stat("/run/mount/utab", {st_dev=makedev(0, 20), st_ino=816636239, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=244, st_atime=2019/10/31-09:30:29.822785431, st_mtime=2019/10/30-12:03:59.681429926, st_ctime=2019/10/30-12:03:59.681429926})
later, next epoll_wait() return:
stat("/run/mount/utab", {st_dev=makedev(0, 20), st_ino=816636239, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=244, st_atime=2019/10/31-09:30:30.293786124, st_mtime=2019/10/30-12:03:59.681429926, st_ctime=2019/10/30-12:03:59.681429926})
read(16</run/mount/utab>, "SRC=ociopf-p-150.dmz:/srv/nfs/oscbkp/app-prod-0z6f0qq0k2fz TARGET=/var/lib/origin/openshift.local.volumes/pods/28f79e7c-fb47-11e9-b012-005056832285/volumes/kubernetes.io~nfs/app-prod-0z6f0qq0k2fz ROOT=/ ATTRS=addr=10.10.10.10 \n", 4096)
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
Every time systemd reads /proc/1/mountinfo (due to having a new mount or one disappearing), the whole udev database is read, device descriptions built based on properties from udev database, all this to end up with doing nothing because the device description didn't change at all.
Using systemtap, I found out (with my test build which doesn't send PropertiesChanged when systemd finds out device description didn't change, which is always the case in fact) that tons of strdup() and free() were still done for no good reason:
During the night, top MALLOC callers:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
--- MALLOC 144336200 samples recorded:
15727136 hits for __strdup+0x1a [libc-2.17.so]
udev_list_entry_add+0x96 [systemd]
udev_device_add_property_from_string+0x5d [systemd]
udev_device_read_db+0x19b [systemd]
udev_device_get_properties_list_entry+0x8f [systemd]
udev_device_get_property_value+0x13 [systemd]
device_setup_unit+0xfd [systemd]
device_found_node+0x174 [systemd]
mount_load_proc_self_mountinfo+0x147 [systemd]
mount_dispatch_io+0x67 [systemd]
source_dispatch+0x1c0 [systemd]
sd_event_dispatch+0x6a [systemd]
manager_loop+0x403 [systemd]
main+0x1e7b [system
15727136 hits for __strdup+0x1a [libc-2.17.so]
udev_list_entry_add+0xb0 [systemd]
udev_device_add_property_from_string+0x5d [systemd]
udev_device_read_db+0x19b [systemd]
udev_device_get_properties_list_entry+0x8f [systemd]
udev_device_get_property_value+0x13 [systemd]
device_setup_unit+0xfd [systemd]
device_found_node+0x174 [systemd]
mount_load_proc_self_mountinfo+0x147 [systemd]
mount_dispatch_io+0x67 [systemd]
source_dispatch+0x1c0 [systemd]
sd_event_dispatch+0x6a [systemd]
manager_loop+0x403 [systemd]
main+0x1e7b [system
12249906 hits for _IO_vfscanf+0x1ff4 [libc-2.17.so]
__vsscanf+0x67 [libc-2.17.so]
sscanf+0x87 [libc-2.17.so]
0x7f1098d6d25a [libmount.so.1.1.0+0x1e25a]
12249906 hits for _IO_vfscanf+0x1ff4 [libc-2.17.so]
__vsscanf+0x67 [libc-2.17.so]
sscanf+0x87 [libc-2.17.so]
0x7f1098d6d1fc [libmount.so.1.1.0+0x1e1fc]
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
Everything there is related to udev, then mount processing.
Same from FREE:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
--- FREE 268964508 samples recorded:
27328937 hits for udev_list_entry_delete+0x6e [systemd]
udev_list_cleanup+0x56 [systemd]
udev_device_unref+0x73 [systemd]
device_found_node+0x12c [systemd]
mount_load_proc_self_mountinfo+0x147 [systemd]
mount_dispatch_io+0x67 [systemd]
source_dispatch+0x1c0 [systemd]
sd_event_dispatch+0x6a [systemd]
manager_loop+0x403 [systemd]
main+0x1e7b [systemd]
__libc_start_main+0xf5 [libc-2.17.so]
_start+0x29 [systemd]
27328937 hits for udev_list_cleanup+0x56 [systemd]
udev_device_unref+0x73 [systemd]
device_found_node+0x12c [systemd]
mount_load_proc_self_mountinfo+0x147 [systemd]
mount_dispatch_io+0x67 [systemd]
source_dispatch+0x1c0 [systemd]
sd_event_dispatch+0x6a [systemd]
manager_loop+0x403 [systemd]
main+0x1e7b [systemd]
__libc_start_main+0xf5 [libc-2.17.so]
_start+0x29 [systemd]
27328937 hits for udev_list_entry_delete+0x77 [systemd]
udev_list_cleanup+0x56 [systemd]
udev_device_unref+0x73 [systemd]
device_found_node+0x12c [systemd]
mount_load_proc_self_mountinfo+0x147 [systemd]
mount_dispatch_io+0x67 [systemd]
source_dispatch+0x1c0 [systemd]
sd_event_dispatch+0x6a [systemd]
manager_loop+0x403 [systemd]
main+0x1e7b [systemd]
__libc_start_main+0xf5 [libc-2.17.so]
_start+0x29 [systemd]
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
Version-Release number of selected component (if applicable):
systemd-219-67.el7_7.1
How reproducible:
Easily - have reproduced in an internal lab
Steps to Reproduce:
1. In an OCP 3.11 environment, create 200 pods on a single node that use a combination of overlay2, persistent volumes, configmaps, secrets and hostmounts to get 1500-2000 entries in /proc/1/mountinfo
2. Randomly create and delete pods
3. Monitor changes with strace and system tap
Actual results:
Highg CPU utilization which can lead to systemd overload causing connection timeouts and system crashes.
Expected results:
Stability
Additional info:
I am working on a patch that will introduce possibility to set rate-limits on sd_event_source objects. We can then use the new API and apply rate-limiting on mountinfo event source. This should help us to workaround this issue until new fsinfo() kernel interface gives us possibility to receive only differential updates of the mount table.
(In reply to Michal Sekletar from comment #2)
> I am working on a patch that will introduce possibility to set rate-limits
> on sd_event_source objects. We can then use the new API and apply
> rate-limiting on mountinfo event source. This should help us to workaround
> this issue until new fsinfo() kernel interface gives us possibility to
> receive only differential updates of the mount table.
Right, the fsinfo() and kernelwatch notifications changes are proving difficult
to get merged upstream.
Linux wants to see that these changes will be used by user space and are
sufficient to help before merging the changes. We have assumed that means
writing libmount and systemd code that shows the proposed changes work to
resolve this and similar problems which is what I'm doing.
Now, I've found the systemd notifications code difficult to follow but I've
finally I've got to a point where I'm implementing the mount table handling
in systemd (Karel Zak has done a lot of work on the libmount side already)
with the new system. That is quickly leading to questions about what the unit
update handling functions do and there's also the detail of of the notifications
behaviour and matching them to each other to get the notified entry only updated.
I posted to the systemd-devel list describing what I'm doing and asked for help
but unfortunately, at that point, I couldn't be specific about what I was asking
so got virtually no replies.
So, if anyone on this bug is familiar with the unit processing function calls
under mount_dispatch_io() I would very much appreciate any information on what
they are meant to do, even an overview of the unit update process that occurs
here would be useful.
Ian
(In reply to Ian Kent from comment #3)
>
> Linux wants to see that these changes will be used by user space and are
> sufficient to help before merging the changes. We have assumed that means
> writing libmount and systemd code that shows the proposed changes work to
> resolve this and similar problems which is what I'm doing.
>
Of course that should be Linus not Linux, ;)
Hi, Michal and all:
We met this issue on systemd-219-67.el7 (centos7) too.
So is there any plan to port those patches back to systemd-219 (centos7)?
Thanks.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: systemd security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:1611