| Summary: | systemd performance degradation with thousands of units (systemctl times out; pid1 high CPU usage when should be idle) | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Ryan Sawhill <rsawhill> |
| Component: | systemd | Assignee: | systemd-maint |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | qe-baseos-daemons |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 7.2 | CC: | dtardon, fsumsal, kwalker, systemd-maint-list |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-01-02 09:52:04 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | |||
| Bug Blocks: | 1203710, 1298243, 1398314, 1420851, 1451294 | ||
|
Description
Ryan Sawhill
2016-01-26 02:47:29 UTC
Looks like I have a correction to make. With:

> Experienced on latest (systemd-219-19.el7.x86_64)

Regarding this comment:

> Notice that even after all sockets are closed PID1 still pegs CPU

I've noticed that this particular point is no longer the case with systemd-219-19.el7 in RHEL 7.2 -- my reproducer script led me to believe this was the case because PID1 spends a lot of CPU handling all the instance units & their sockets (the mechanism I used to generate tons of units). A little while after running my reproducer script the systemd CPU usage settles back down to nothing. Furthermore, while systemctl commands certainly take considerably longer than normal, they do not fail. Going back and testing RHEL 7.1 now.

> Furthermore, while systemctl commands certainly take considerably longer than normal, they do not fail.
* They do not fail after systemd CPU usage settles back down to a normal level.
I tested this again today on the latest systemd available (219-19.el7_2.7) and I wasn't able to clearly reproduce it. Of course systemctl still starts slowing down when there are thousands and thousands of units, but it's not nearly as dramatic as it was with systemd pre-RHEL7.2.

For the record: I ran into other problems eventually (where systemd-logind and tons of other things on the system started complaining "Argument list too long") but that was after such a crazy-high number of failed units that I don't think we need to look into it. That said, it sure would be nice if the systemd project could put forth some official guidance for this kind of stuff. Or perhaps configure systemd to automatically trigger reset-failed when things get past a certain limit.

PS: The "Argument list too long" stuff starts happening after 65,500 connections are made to the sysd-failtester.socket in that reproducer script posted earlier (i.e., after 65k failed units were present).

I tried to reproduce this issue on both RHEL 7.2 and RHEL 7.5 with the following results:
Note: CPU usage settles at 0% a few seconds after each reproducer finishes.
Specs: 1 CPU system with 2 GB RAM
## Reproducer 1
Spawner script:
# for i in {1..25000}; do systemd-run --remain --unit "test-$i" /bin/false; done
RHEL 7.2 (systemd-219-19.el7.x86_64)
------------------------------------
Before reproducer:
# systemctl --all | wc -l
185
# time systemctl status > /dev/null
real 0m0.013s
user 0m0.001s
sys 0m0.011s
After reproducer:
# systemctl --all | wc -l
25186
# time systemctl status > /dev/null
real 0m0.648s
user 0m0.166s
sys 0m0.476s
RHEL 7.5 (systemd-219-57.el7.x86_64)
------------------------------------
Before reproducer:
# systemctl --all | wc -l
201
# time systemctl status > /dev/null
real 0m0.008s
user 0m0.001s
sys 0m0.005s
After reproducer:
# systemctl --all | wc -l
25202
# time systemctl status > /dev/null
real 0m0.920s
user 0m0.202s
sys 0m0.710s
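To make the before/after comparison easier to read, the "real" times above can be turned into a slowdown factor. This is only a sketch: the `slowdown` helper is a hypothetical name, not something from the report, and it just divides the "after" time by the "before" time.

```shell
# Illustrative helper (an assumption, not part of the original report):
# ratio of two wall-clock times given in seconds.
slowdown() {
    # $1 = "real" time before the reproducer, $2 = "real" time after it
    awk -v before="$1" -v after="$2" \
        'BEGIN { if (before + 0 <= 0) exit 1; printf "%.1f\n", after / before }'
}

slowdown 0.013 0.648   # RHEL 7.2 figures from above, prints 49.8
slowdown 0.008 0.920   # RHEL 7.5 figures from above, prints 115.0
```

Both baselines are single measurements of around 10 ms, so these ratios are rough. They mainly show that `systemctl status` gets one to two orders of magnitude slower with 25000 extra units, while still finishing in under a second.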
## Reproducer 2 (see comment 0)
Script:
curl -o sysd-failtester.sh http://people.redhat.com/rsawhill/sysd-failtester.sh
Note: the reproducer was manually stopped at 25000 units.
RHEL 7.2 (systemd-219-19.el7.x86_64)
------------------------------------
Before reproducer:
# systemctl --all | wc -l
185
# time systemctl status > /dev/null
real 0m0.014s
user 0m0.003s
sys 0m0.009s
After reproducer:
# systemctl --all | wc -l
25187
# time systemctl show > /dev/null
real 0m0.003s
user 0m0.002s
sys 0m0.000s
RHEL 7.5 (systemd-219-57.el7.x86_64)
------------------------------------
Before reproducer:
# systemctl --all | wc -l
202
# time systemctl show > /dev/null
real 0m0.004s
user 0m0.002s
sys 0m0.001s
After reproducer:
# systemctl --all | wc -l
25337
# time systemctl show > /dev/null
real 0m0.003s
user 0m0.001s
sys 0m0.001s
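The earlier suggestion that reset-failed could be triggered automatically past a certain limit can be approximated from outside systemd. This is a minimal sketch, not an official mechanism: the function names, the `SYSTEMCTL` override variable, and the limit passed in are all hypothetical. It uses only `systemctl list-units --state=failed` and `systemctl reset-failed`.

```shell
#!/bin/bash
# Sketch: clear failed units once their count exceeds a caller-chosen limit.
# SYSTEMCTL is a hypothetical override so the command can be stubbed in tests.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

count_failed_units() {
    # --no-legend drops the header and footer, leaving one line per unit.
    "$SYSTEMCTL" list-units --state=failed --no-legend --no-pager | wc -l | tr -d ' '
}

reset_failed_if_over() {
    # $1 = maximum number of failed units to tolerate (an assumed policy value)
    local limit="$1" count
    count="$(count_failed_units)"
    if [ "$count" -gt "$limit" ]; then
        "$SYSTEMCTL" reset-failed
        echo "reset $count failed units"
    else
        echo "kept $count failed units"
    fi
}
```

Something like `reset_failed_if_over 10000` could be run periodically as root, for example from a timer unit or cron job. Note that reset-failed discards the failure state of every failed unit, so it also hides failures that might be worth investigating.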
|