Bug 1947190

Summary: After upgrading to systemd 219-78.el7_9.3, intermittent excessive cpu usage and service timeouts
Product: Red Hat Enterprise Linux 7
Reporter: Matt Flusche <mflusche>
Component: systemd
Assignee: systemd-maint
Status: CLOSED DUPLICATE
QA Contact: Frantisek Sumsal <fsumsal>
Severity: high
Docs Contact:
Priority: high
Version: 7.9
CC: bdobreli, bsawyers, camorris, chaithco, cmuresan, dtardon, rmetrich, systemd-maint-list, tcrider
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-05-06 09:19:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Matt Flusche 2021-04-07 21:07:37 UTC
Description of problem:
After upgrading a set of OpenStack controllers to the latest RHEL 7 patches, systemd intermittently consumes excessive CPU time and calls to systemd time out.  This lasts for a period of time, then the system clears and functions normally.  It occurs on multiple systems after the upgrade.

CPU usage from sosreport:
$ cat sos_commands/process/ps_auxwww | head -2
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1 32.8  0.0 227780 36716 ?        Rs   15:00 161:22 /usr/lib/systemd/systemd --switched-root --system --deserialize 22


Failed systemctl commands from sosreport:
$ cat sos_commands/systemd/systemctl_status_--all 
Failed to read server status: Connection timed out
Failed to list units: Connection timed out

$ cat sos_commands/systemd/systemctl_list-units
Failed to list units: Connection timed out
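
For anyone checking a live controller rather than a sosreport, a couple of quick checks with standard RHEL 7 tools show the same symptom (the 10-second timeout is just an illustrative value):

$ top -b -n 1 -p 1 | tail -2                                # instantaneous CPU% of PID 1 (systemd)
$ timeout 10 systemctl is-system-running; echo "exit: $?"   # exit 124 = the manager did not answer within 10s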



Version-Release number of selected component (if applicable):
RHEL 7.9 - current
ocd03-controller-1]$ cat uname 
Linux ocd03-controller-1 3.10.0-1160.15.2.el7.x86_64 #1 SMP Thu Jan 21 16:15:07 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
systemd-219-78.el7_9.2.x86_64


How reproducible:
Intermittent in this specific environment.


Additional info:
I'll provide additional information in private comments.

Comment 5 David Tardon 2021-04-09 09:54:28 UTC
I am pretty sure this is caused (again) by excessive parsing of mountinfo (bug 1819868). I checked one of the sosreports and there were 1731 docker instances running and 4372 mounts. If the docker instances were all started at the same time, that would certainly cause a lot of changes of mountinfo and therefore a lot of churn in systemd. strace of PID 1 would show that.
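
A minimal way to confirm that (assuming strace can be attached to PID 1 on these controllers; the 30-second window and output path are arbitrary choices) would be to grab a per-syscall summary and look for a flood of read()/epoll_wait() cycles:

$ timeout 30 strace -c -p 1 -o /tmp/pid1-syscalls.txt   # attach to PID 1 for 30s, count syscalls
$ cat /tmp/pid1-syscalls.txt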

Comment 7 Charles Haithcock 2021-04-09 18:19:17 UTC
I contributed to the kernel side of this case and reviewed some of the perf data supplied. Note: 1. The perf data I analysed (output pasted below) is from perf2g.data attached to the case, which does not seem to be attached to this BZ. 2. This just profiles the kernelspace function names of the samples taken; i.e. it does not include parameter listings for the functions (which is fine, honestly). As such, some of the analysis makes logical assumptions inferred from other bits of the perf data.


Below is a copy/paste of the analysis I put into the case, though I am leaving out the perf data output for legibility reasons. If it is still needed, let me know and I can paste it here as well (though it can be kinda rough to read).

 [A.0] Roughly 83.62% of all stacks are specific to the systemd process 
     [A.0-a] 33.54% of all stacks are systemd in kernelspace 
     [A.0-b] 50.08% of all stacks are systemd in userspace
 [A.1] Systemd is reading from mountinfo a _lot_
     [A.1-a] 24.09% of all stacks are systemd with the following backtrace:
             -> __GI___libc_read (userspace)
               -> system_call_fastpath (kernelspace)
                 -> sys_read
                   -> vfs_read
                     -> seq_read
                       -> m_show
                         -> show_mountinfo
     [A.1-b] Given [A.0-a] and [A.1-a], this means ~71% of all systemd's kernelspace
             stacks are just reading /proc/*/mountinfo 
 [A.2] Perf also shows a number of other stacks in kernelspace doing file 
       read-adjacent operations
     [A.2-a] For example, the next most common kernelspace backtrace is 
             -> __lxstat64 (userspace)
               -> system_call_fastpath
                 -> sys_newlstat
                   -> SYSC_newlstat
                     -> vfs_fstatat
                       -> user_path_at
                         -> user_path_at_empty
                           -> filename_lookup
                             -> path_lookupat
                               -> path_init
     [A.2-b] it is entirely possible the stacks in [A.2-a] are systemd simply 
             lstat-ing /proc/1/mountinfo before reading it. 
     [A.2-c] That being said, this is not a common stack trace; 3.22% of all 
             kernelspace stacks for systemd are the noted trace. That's only ~10%
             of all its kernelspace stacks. 
     [A.2-d] Other stacks exist doing filesystem-adjacent things, however, so it
             is very likely systemd is simply reading /proc/*/mountinfo a _ton_.

Overall, the data strongly indicates systemd is running through /proc/*/mountinfo
extremely frequently. As suspected, an strace would likely tell us why. 
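
For context, a rough sketch of how a profile like this is typically collected and read (the 60-second window and file name mirror the attached perf2g.data but are otherwise illustrative, not necessarily the exact commands used on the case):

$ perf record -a -g -o perf2g.data -- sleep 60                        # sample call stacks system-wide for 60s
$ perf report -i perf2g.data --stdio --sort comm,dso,symbol | less    # per-process/function percentages like those above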


While I admittedly am _not_ familiar with the BZ noted in comment #5, the quick blurb about it in that comment seems to match what is above.

Comment 9 Renaud Métrich 2021-04-09 19:11:53 UTC
(In reply to David Tardon from comment #5)
> I am pretty sure this is caused (again) by excessive parsing of mountinfo
> (bug 1819868). I checked one of the sosreports and there were 1731 docker
> instances running and 4372 mounts. If the docker instances were all started
> at the same time, that would certainly cause a lot of changes of mountinfo
> and therefore a lot of churn in systemd. strace of PID 1 would show that.

If I'm not mistaken, docker instances mount/unmount some tmpfs filesystem every 3 seconds as a kind of health check.
With that many instances, systemd will clearly just spend its time reading /proc/1/mountinfo ...
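
Back-of-envelope: 1731 instances each remounting every ~3 seconds is roughly 1731 / 3 ≈ 580 mount/unmount cycles per second, and each change would make systemd re-read a mountinfo holding ~4372 entries. A minimal sketch for gauging the churn on a live controller (the one-second sampling and 60-sample duration are arbitrary):

$ for i in $(seq 1 60); do md5sum /proc/1/mountinfo | awk '{print $1}'; sleep 1; done | uniq | wc -l
  # counts the distinct consecutive mountinfo states seen in a minute;
  # ~1 on an idle box, approaching 60 if the mount table really changes every second or faster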

Comment 18 Renaud Métrich 2021-05-06 09:19:02 UTC
The root cause is exactly BZ #1828758, the difference being it's not users but scopes that trigger this (which amounts to the same thing).

*** This bug has been marked as a duplicate of bug 1828758 ***