Description of problem:
Currently there is no easy way to display the effective resource limit that a given service (cgroup) cannot exceed. For example, the cgroup memory controller is hierarchical, and memory allocation is also constrained by limits set on the levels above the current one, so looking at a single cgroup is not enough.

Version-Release number of selected component (if applicable):
systemd-252-15.el9

How reproducible:
deterministic

Steps to Reproduce:
1. Log in as root
2. systemctl set-property user-0.slice MemoryLimit=500M

Actual results:
$ cat /sys/fs/cgroup/user.slice/memory.max
max
$ cat /sys/fs/cgroup/user.slice/user-0.slice/memory.max
524288000
$ cat /sys/fs/cgroup/user.slice/user-0.slice/session-43.scope/memory.max
max
$ systemd-cgls
...
├─user.slice (#1236)
│ → user.invocation_id: 52637e2cc24f4c659a09749680820dc5
│ → trusted.invocation_id: 52637e2cc24f4c659a09749680820dc5
│ └─user-0.slice (#42278)
│   → user.invocation_id: 0edf11261bac4d85975271fbd55f7714
│   → trusted.invocation_id: 0edf11261bac4d85975271fbd55f7714
│   ├─session-43.scope (#42574)
│   │ ├─794118 sshd: root [priv]
│   │ ├─794121 sshd: root@pts/0
│   │ ├─794122 -bash
│   │ ├─794472 systemd-cgls
│   │ └─794473 less

Expected results:
A user might not be aware that the actual maximum memory limit is set to 500MB on the cgroup level above their user session scope, and might think they can allocate more than that. We should introduce a new switch in systemd-cgls that would display the actual limits for each level.

Additional info:
Actually, it is already possible to get the maximum memory limit for a service in a way that respects the cgroup settings of parent units and the current memory consumption of sibling cgroups. The limit is exposed as the MemoryAvailable= unit property and is also displayed in the systemctl status output.

# systemd-run --unit c.service --property Slice=a-b.slice sleep infinity
# systemd-cgls /sys/fs/cgroup/a.slice/
Directory /sys/fs/cgroup/a.slice/:
└─a-b.slice (#4645)
  → user.invocation_id: 5c4e8aa08b604b959881d4dfb0efa90e
  → trusted.invocation_id: 5c4e8aa08b604b959881d4dfb0efa90e
  └─c.service (#4679)
    → user.invocation_id: 5049c3e6f19d44a9aaf4015c990e9846
    → trusted.invocation_id: 5049c3e6f19d44a9aaf4015c990e9846
    └─4382 /usr/bin/sleep infinity
# systemctl set-property c.service MemoryMax=500M
# systemctl set-property a-b.slice MemoryMax=300M
# systemctl set-property a.slice MemoryMax=100M
# systemctl status c.service
● c.service - /usr/bin/sleep infinity
     Loaded: loaded (/run/systemd/transient/c.service; transient)
  Transient: yes
    Drop-In: /run/systemd/transient/c.service.d
             └─50-MemoryMax.conf
     Active: active (running) since Wed 2023-07-05 08:00:31 EDT; 1min 12s ago
   Main PID: 4382 (sleep)
      Tasks: 1 (limit: 11116)
     Memory: 200.0K (max: 500.0M available: 99.7M)
        CPU: 2ms
     CGroup: /a.slice/a-b.slice/c.service
             └─4382 /usr/bin/sleep infinity

After executing the above commands, the effective amount of memory that c.service can still allocate is 99.7M: the tightest limit in the hierarchy is the 100M on a.slice, and some of that is already consumed by the sleep process.
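For readers who want to see where a figure like 99.7M can come from, here is a minimal Python sketch of the idea: walk from the unit's cgroup up to the cgroup v2 mount point and keep the smallest remaining headroom (memory.max minus memory.current). This is an illustration only, not systemd's code. The helper names read_value and memory_available are made up for this example, and it assumes that only memory.max matters, whereas the real MemoryAvailable= computation may take other settings into account.

```python
from pathlib import Path
from typing import Optional


def read_value(path: Path) -> Optional[int]:
    """Return the integer stored in a cgroup file, or None for 'max' or a missing file."""
    try:
        text = path.read_text().strip()
    except FileNotFoundError:
        return None
    return None if text == "max" else int(text)


def memory_available(cgroup_root: Path, cgroup: str) -> Optional[int]:
    """Smallest (memory.max - memory.current) over the cgroup and all its ancestors.

    Hypothetical helper: cgroup_root is the cgroup v2 mount point (usually
    /sys/fs/cgroup) and cgroup is a path such as /a.slice/a-b.slice/c.service.
    Returns None when no level sets a numeric limit.
    """
    best = None
    node = cgroup_root / cgroup.strip("/")
    while True:
        limit = read_value(node / "memory.max")
        current = read_value(node / "memory.current")
        if limit is not None and current is not None:
            headroom = max(limit - current, 0)
            best = headroom if best is None else min(best, headroom)
        if node == cgroup_root:
            break
        node = node.parent
    return best
```

With the limits from the example above (500M, 300M, 100M), memory_available(Path("/sys/fs/cgroup"), "/a.slice/a-b.slice/c.service") would be bounded by a.slice, because the 100M limit minus whatever is charged to a.slice is the smallest headroom on the path.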
Hi Michal,

You are right about cgroups v2; however, in RHEL 8 with cgroups v1 this is broken. I’m using Red Hat Enterprise Linux release 8.7 (Ootpa) and created a “limited.slice” in systemd which limits TasksMax to 5. Then I created “someprocs.service”, which runs in the limited.slice. The someprocs.service unit file doesn’t contain any limits at all.

If I query the service properties for someprocs:

[root@ip-172-31-33-24 ~]# systemctl status someprocs
● someprocs.service - Some processes as daemon
   Loaded: loaded (/etc/systemd/system/someprocs.service; disabled; vendor preset: disabled)
   Active: active (running) since Tue 2023-07-18 07:54:48 UTC; 8s ago
  Process: 5303 ExecStart=/bin/bash /home/ec2-user/test/startprocs (code=exited, status=0/SUCCESS)
    Tasks: 3 (limit: 23049)
   Memory: 668.0K
   CGroup: /limited.slice/someprocs.service
           ├─5305 /home/ec2-user/test/someproc 10
           ├─5307 /home/ec2-user/test/someproc 20
           └─5309 /home/ec2-user/test/someproc 30

It shows Tasks: 3 / Limit: 23049. The parent cgroup is displayed correctly; however, the tasks limit is wrong and doesn’t factor in the more stringent limit of the parent cgroup. Looking at the limited.slice we can see the effective limit:

[root@ip-172-31-33-24 ~]# systemctl status limited.slice
● limited.slice - Slice with limited resources
   Loaded: loaded (/etc/systemd/system/limited.slice; static; vendor preset: disabled)
   Active: active since Tue 2023-07-18 07:49:22 UTC; 4min 5s ago
    Tasks: 3 (limit: 5)
   Memory: 728.0K
      CPU: 19ms
   CGroup: /limited.slice
           └─someprocs.service
             ├─5305 /home/ec2-user/test/someproc 10
             ├─5307 /home/ec2-user/test/someproc 20
             └─5309 /home/ec2-user/test/someproc 30

Tasks: 3 / Limit: 5 !! And this limit is enforced for someprocs: as soon as I try to fork more than 5 processes I receive a “device busy” error. But the limit is invisible to my processes. Running getrlimit(RLIMIT_NPROC) in someproc returns “14406”, which is just as wrong as the “23049” derived from systemd properties...
As far as I can tell there are two use cases:

1. I’m a developer and want to determine effective limits for my application at runtime. This is especially tricky for memory (e.g. getrlimit and systemd properties report infinity/unlimited but malloc() + access result in OOM-kill)
2. I’m in support and ask my customer to run a support script collecting system information to determine if my application was running as intended or if the environment was somehow limited

I’m not at all implying that we recommend using limits - ideally our applications run unlimited... but you never know which hardening guides, policies and third party security tools were implemented in customer setups. So we’d like to have a generic interface independent of the underlying technology (cgroup_v1, cgroup_v2, whatever) to query effective limits and obviously the results should be accurate.

Thanks,
-Martin
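Until such an interface exists, a support script could approximate it by hand. The Python sketch below is one possible approach, not an existing systemd or kernel API: parse_proc_cgroup and effective_limit are hypothetical names introduced here. It relies on the documented /proc/<pid>/cgroup line format (hierarchy-ID:controller-list:cgroup-path, with an empty controller list for the cgroup v2 unified hierarchy) and assumes the limit file uses the literal "max" for "unlimited", as pids.max does.

```python
from pathlib import Path
from typing import Dict, Optional


def parse_proc_cgroup(text: str) -> Dict[str, str]:
    """Map each controller name to its cgroup path.

    The cgroup v2 unified hierarchy has an empty controller list and is
    therefore stored under the key ''.
    """
    result: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue  # skip malformed or empty lines
        _hierarchy_id, controllers, path = parts
        for controller in controllers.split(","):
            result[controller] = path
    return result


def effective_limit(mount: Path, cgroup: str, filename: str) -> Optional[int]:
    """Smallest numeric value of `filename` from the cgroup up to the mount point.

    Returns None when every level reports 'max' or lacks the file.
    """
    best = None
    node = mount / cgroup.strip("/")
    while True:
        try:
            text = (node / filename).read_text().strip()
        except FileNotFoundError:
            text = "max"  # e.g. the root cgroup has no limit file
        if text != "max":
            value = int(text)
            best = value if best is None else min(best, value)
        if node == mount:
            break
        node = node.parent
    return best
```

On a cgroup v1 host like the RHEL 8 one above, a script could feed the contents of /proc/self/cgroup to parse_proc_cgroup, take the path stored under "pids", and call effective_limit(Path("/sys/fs/cgroup/pids"), path, "pids.max"); on cgroup v2 the key would be "" and the mount point /sys/fs/cgroup. Memory on v1 would need extra care, since memory.limit_in_bytes reports a very large number rather than "max" when unlimited.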