Description of problem: This is a continuation of issues found via: https://bugzilla.redhat.com/show_bug.cgi?id=1793527 and https://bugzilla.redhat.com/show_bug.cgi?id=1769928 A customer running OpenShift 3.11 is suffering from high CPU utilization on all OCP compute nodes running systemd-219-67.el7_7.1. - Looking at 2 consecutive processing, we can see that systemd does the following: - reads /proc/self/mountinfo - reads /run/mount/utab - reads all block devices in /run/udev - send D-Bus messages for each device unit The code being called when a mount event happens is shown below: -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- 1711 static int mount_dispatch_io(sd_event_source *source, int fd, uint32_t revents, void *userdata) { 1712 _cleanup_set_free_ Set *around = NULL, *gone = NULL; 1713 Manager *m = userdata; 1714 const char *what; 1715 Iterator i; 1716 Unit *u; 1717 int r; 1718 1719 assert(m); 1720 assert(revents & (EPOLLPRI | EPOLLIN)); 1721 1722 /* The manager calls this for every fd event happening on the 1723 * /proc/self/mountinfo file, which informs us about mounting 1724 * table changes, and for /run/mount events which we watch 1725 * for mount options. */ 1726 1727 if (fd == m->utab_inotify_fd) { : -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- In our scenario, we can see that the "m->utab_inotify_fd" filedescriptor is modified continuously, as shown in the 2 strace excerpts below ("st_atime" fiel, the rest is unchanged): -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- stat("/run/mount/utab", {st_dev=makedev(0, 20), st_ino=816636239, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=244, st_atime=2019/10/31-09:30:29.822785431, st_mtime=2019/10/30-12:03:59.681429926, st_ctime=2019/10/30-12:03:59.681429926}) later, next epoll_wait() return: stat("/run/mount/utab", {st_dev=makedev(0, 20), st_ino=816636239, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=244, st_atime=2019/10/31-09:30:30.293786124, st_mtime=2019/10/30-12:03:59.681429926, st_ctime=2019/10/30-12:03:59.681429926}) read(16</run/mount/utab>, "SRC=ociopf-p-150.dmz:/srv/nfs/oscbkp/app-prod-0z6f0qq0k2fz TARGET=/var/lib/origin/openshift.local.volumes/pods/28f79e7c-fb47-11e9-b012-005056832285/volumes/kubernetes.io~nfs/app-prod-0z6f0qq0k2fz ROOT=/ ATTRS=addr=10.10.10.10 \n", 4096) -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- Every time systemd reads /proc/1/mountinfo (due to having a new mount or one disappearing), the whole udev database is read, device descriptions built based on properties from udev database, all this to end up with doing nothing because the device description didn't change at all. Using systemtap, I found out (with my test build which doesn't send PropertiesChanged when systemd finds out device description didn't change, which is always the case in fact) that tons of strdup() and free() were still done for no good reason: During the night, top MALLOC callers: -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- --- MALLOC 144336200 samples recorded: 15727136 hits for __strdup+0x1a [libc-2.17.so] udev_list_entry_add+0x96 [systemd] udev_device_add_property_from_string+0x5d [systemd] udev_device_read_db+0x19b [systemd] udev_device_get_properties_list_entry+0x8f [systemd] udev_device_get_property_value+0x13 [systemd] device_setup_unit+0xfd [systemd] device_found_node+0x174 [systemd] mount_load_proc_self_mountinfo+0x147 [systemd] mount_dispatch_io+0x67 [systemd] source_dispatch+0x1c0 [systemd] sd_event_dispatch+0x6a [systemd] manager_loop+0x403 [systemd] main+0x1e7b [system 15727136 hits for __strdup+0x1a [libc-2.17.so] udev_list_entry_add+0xb0 [systemd] udev_device_add_property_from_string+0x5d [systemd] udev_device_read_db+0x19b [systemd] udev_device_get_properties_list_entry+0x8f [systemd] udev_device_get_property_value+0x13 [systemd] device_setup_unit+0xfd [systemd] device_found_node+0x174 [systemd] mount_load_proc_self_mountinfo+0x147 [systemd] mount_dispatch_io+0x67 [systemd] source_dispatch+0x1c0 [systemd] sd_event_dispatch+0x6a [systemd] manager_loop+0x403 [systemd] main+0x1e7b [system 12249906 hits for _IO_vfscanf+0x1ff4 [libc-2.17.so] __vsscanf+0x67 [libc-2.17.so] sscanf+0x87 [libc-2.17.so] 0x7f1098d6d25a [libmount.so.1.1.0+0x1e25a] 12249906 hits for _IO_vfscanf+0x1ff4 [libc-2.17.so] __vsscanf+0x67 [libc-2.17.so] sscanf+0x87 [libc-2.17.so] 0x7f1098d6d1fc [libmount.so.1.1.0+0x1e1fc] -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- Everything there is related to udev, then mount processing. Same from FREE: -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- --- FREE 268964508 samples recorded: 27328937 hits for udev_list_entry_delete+0x6e [systemd] udev_list_cleanup+0x56 [systemd] udev_device_unref+0x73 [systemd] device_found_node+0x12c [systemd] mount_load_proc_self_mountinfo+0x147 [systemd] mount_dispatch_io+0x67 [systemd] source_dispatch+0x1c0 [systemd] sd_event_dispatch+0x6a [systemd] manager_loop+0x403 [systemd] main+0x1e7b [systemd] __libc_start_main+0xf5 [libc-2.17.so] _start+0x29 [systemd] 27328937 hits for udev_list_cleanup+0x56 [systemd] udev_device_unref+0x73 [systemd] device_found_node+0x12c [systemd] mount_load_proc_self_mountinfo+0x147 [systemd] mount_dispatch_io+0x67 [systemd] source_dispatch+0x1c0 [systemd] sd_event_dispatch+0x6a [systemd] manager_loop+0x403 [systemd] main+0x1e7b [systemd] __libc_start_main+0xf5 [libc-2.17.so] _start+0x29 [systemd] 27328937 hits for udev_list_entry_delete+0x77 [systemd] udev_list_cleanup+0x56 [systemd] udev_device_unref+0x73 [systemd] device_found_node+0x12c [systemd] mount_load_proc_self_mountinfo+0x147 [systemd] mount_dispatch_io+0x67 [systemd] source_dispatch+0x1c0 [systemd] sd_event_dispatch+0x6a [systemd] manager_loop+0x403 [systemd] main+0x1e7b [systemd] __libc_start_main+0xf5 [libc-2.17.so] _start+0x29 [systemd] -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- Version-Release number of selected component (if applicable): systemd-219-67.el7_7.1 How reproducible: Easily - have reproduced in an internal lab Steps to Reproduce: 1. In an OCP 3.11 environment, create 200 pods on a single node that use a combination of overlay2, persistent volumes, configmaps, secrets and hostmounts to get 1500-2000 entries in /proc/1/mountinfo 2. Randomly create and delete pods 3. Monitor changes with strace and system tap Actual results: Highg CPU utilization which can lead to systemd overload causing connection timeouts and system crashes. Expected results: Stability Additional info:
I am working on a patch that will introduce possibility to set rate-limits on sd_event_source objects. We can then use the new API and apply rate-limiting on mountinfo event source. This should help us to workaround this issue until new fsinfo() kernel interface gives us possibility to receive only differential updates of the mount table.
(In reply to Michal Sekletar from comment #2) > I am working on a patch that will introduce possibility to set rate-limits > on sd_event_source objects. We can then use the new API and apply > rate-limiting on mountinfo event source. This should help us to workaround > this issue until new fsinfo() kernel interface gives us possibility to > receive only differential updates of the mount table. Right, the fsinfo() and kernelwatch notifications changes are proving difficult to get merged upstream. Linux wants to see that these changes will be used by user space and are sufficient to help before merging the changes. We have assumed that means writing libmount and systemd code that shows the proposed changes work to resolve this and similar problems which is what I'm doing. Now, I've found the systemd notifications code difficult to follow but I've finally I've got to a point where I'm implementing the mount table handling in systemd (Karel Zak has done a lot of work on the libmount side already) with the new system. That is quickly leading to questions about what the unit update handling functions do and there's also the detail of of the notifications behaviour and matching them to each other to get the notified entry only updated. I posted to the systemd-devel list describing what I'm doing and asked for help but unfortunately, at that point, I couldn't be specific about what I was asking so got virtually no replies. So, if anyone on this bug is familiar with the unit processing function calls under mount_dispatch_io() I would very much appreciate any information on what they are meant to do, even an overview of the unit update process that occurs here would be useful. Ian
(In reply to Ian Kent from comment #3) > > Linux wants to see that these changes will be used by user space and are > sufficient to help before merging the changes. We have assumed that means > writing libmount and systemd code that shows the proposed changes work to > resolve this and similar problems which is what I'm doing. > Of course that should be Linus not Linux, ;)
*** Bug 1871106 has been marked as a duplicate of this bug. ***
*** Bug 1878470 has been marked as a duplicate of this bug. ***
*** Bug 1879152 has been marked as a duplicate of this bug. ***
*** Bug 1931535 has been marked as a duplicate of this bug. ***
fix merged to github master branch -> https://github.com/systemd-rhel/rhel-8/pull/154
Hi, Michal and all: We met this issue on systemd-219-67.el7 (centos7) too. So is there any plan to port those patches back to systemd-219 (centos7)? Thanks.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: systemd security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:1611