Bug 2210307

Summary: RFE: Better visibility of currently configured cgroup limits
Product: Red Hat Enterprise Linux 9 Reporter: Michal Sekletar <msekleta>
Component: systemd Assignee: Michal Sekletar <msekleta>
Status: NEW --- QA Contact: Frantisek Sumsal <fsumsal>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 9.2 CC: alexander.hass, bfinger, fkrska, martin.tegtmeier, michael.trapp, systemd-maint-list
Target Milestone: rc Keywords: FutureFeature
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---

Description Michal Sekletar 2023-05-26 14:08:57 UTC
Description of problem:
Currently, it is not easy to display the effective resource limit that a given service (cgroup) cannot exceed. For example, the cgroup memory controller is hierarchical: memory allocation is also constrained by limits set on the levels above the current one, so looking at a single cgroup is not enough.
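
To illustrate the hierarchical behaviour described above, here is a minimal Python sketch that walks from a cgroup directory up to the cgroup root and takes the smallest memory.max found on the way. It assumes a cgroup v2 (unified) hierarchy mounted at /sys/fs/cgroup. The function names are made up for this example and are not part of systemd.

```python
import math
from pathlib import Path


def read_limit(path):
    """Parse a cgroup v2 limit file: the literal "max" means unlimited."""
    text = Path(path).read_text().strip()
    return math.inf if text == "max" else int(text)


def effective_limit(cgroup, root="/sys/fs/cgroup", filename="memory.max"):
    """Return the tightest limit on the path from `cgroup` up to `root`.

    Hypothetical helper: `root` is assumed to be the cgroup v2 mount point.
    The root cgroup itself carries no memory.max file, so it is skipped.
    """
    cgroup, root = Path(cgroup), Path(root)
    cgroup.relative_to(root)  # raises ValueError if cgroup is not below root
    limit = math.inf
    while cgroup != root:
        limit_file = cgroup / filename
        if limit_file.exists():
            limit = min(limit, read_limit(limit_file))
        cgroup = cgroup.parent
    return limit
```

With the values from this report, calling effective_limit("/sys/fs/cgroup/user.slice/user-0.slice/session-43.scope") should return 524288000, even though the scope's own memory.max reads "max".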

Version-Release number of selected component (if applicable):
systemd-252-15.el9


How reproducible:
deterministic

Steps to Reproduce:
1. Log in as root
2. systemctl set-property user-0.slice MemoryLimit=500M

Actual results:
$ cat /sys/fs/cgroup/user.slice/memory.max 
max
$ cat /sys/fs/cgroup/user.slice/user-0.slice/memory.max 
524288000
$ cat /sys/fs/cgroup/user.slice/user-0.slice/session-43.scope/memory.max 
max

$ systemd-cgls
...
├─user.slice (#1236)
│ → user.invocation_id: 52637e2cc24f4c659a09749680820dc5
│ → trusted.invocation_id: 52637e2cc24f4c659a09749680820dc5
│ └─user-0.slice (#42278)
│   → user.invocation_id: 0edf11261bac4d85975271fbd55f7714
│   → trusted.invocation_id: 0edf11261bac4d85975271fbd55f7714
│   ├─session-43.scope (#42574)
│   │ ├─794118 sshd: root [priv]
│   │ ├─794121 sshd: root@pts/0
│   │ ├─794122 -bash
│   │ ├─794472 systemd-cgls
│   │ └─794473 less


Expected results:
A user might not be aware that the actual maximum memory limit is set to 500M on the cgroup level above their session scope, and might assume they can allocate more than that. We should introduce a new switch in systemd-cgls that displays the actual limits for each level.


Additional info:

Comment 4 Michal Sekletar 2023-07-05 12:06:31 UTC
Actually, it is already possible to get the maximum amount of memory a service can allocate, taking into account both the cgroup settings of parent units and the current memory consumption of sibling cgroups. The limit is exposed as the MemoryAvailable= unit property and is also displayed in the systemctl status output.

# systemd-run --unit c.service --property Slice=a-b.slice sleep infinity

# systemd-cgls /sys/fs/cgroup/a.slice/
Directory /sys/fs/cgroup/a.slice/:
└─a-b.slice (#4645)
  → user.invocation_id: 5c4e8aa08b604b959881d4dfb0efa90e
  → trusted.invocation_id: 5c4e8aa08b604b959881d4dfb0efa90e
  └─c.service (#4679)
    → user.invocation_id: 5049c3e6f19d44a9aaf4015c990e9846
    → trusted.invocation_id: 5049c3e6f19d44a9aaf4015c990e9846
    └─4382 /usr/bin/sleep infinity

# systemctl set-property c.service MemoryMax=500M
# systemctl set-property a-b.slice MemoryMax=300M
# systemctl set-property a.slice MemoryMax=100M

# systemctl status c.service
● c.service - /usr/bin/sleep infinity
     Loaded: loaded (/run/systemd/transient/c.service; transient)
  Transient: yes
    Drop-In: /run/systemd/transient/c.service.d
             └─50-MemoryMax.conf
     Active: active (running) since Wed 2023-07-05 08:00:31 EDT; 1min 12s ago
   Main PID: 4382 (sleep)
      Tasks: 1 (limit: 11116)
     Memory: 200.0K (max: 500.0M available: 99.7M)
        CPU: 2ms
     CGroup: /a.slice/a-b.slice/c.service
             └─4382 /usr/bin/sleep infinity

After executing the above commands, the effective maximum that c.service can allocate is 99.7M: the tightest limit in the hierarchy is the 100M set on a.slice, and some of that memory is already consumed by the sleep process.
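
The arithmetic behind the "available" figure can be sketched as follows: for every level that has a limit, compute limit minus current usage, then take the minimum over all levels. This is only an illustration of the idea in Python, not systemd's actual implementation. The function name is hypothetical, and the usage figures for the two slices in the example are assumed values chosen to match the 99.7M shown above (the report only gives the 200.0K for c.service).

```python
def memory_available(levels):
    """Return the smallest remaining headroom over all limited levels.

    `levels` is an iterable of (limit_bytes, current_bytes) pairs, ordered
    from the unit itself up to the top-level slice. A limit of None means
    that no limit is configured on that level. Returns None if no level
    is limited at all.
    """
    available = None
    for limit, current in levels:
        if limit is None:
            continue
        headroom = max(limit - current, 0)  # never report negative headroom
        available = headroom if available is None else min(available, headroom)
    return available


K, M = 1024, 1024 * 1024
levels = [
    (500 * M, 200 * K),  # c.service: limit and usage as in the status output
    (300 * M, 300 * K),  # a-b.slice: usage is an assumed value
    (100 * M, 300 * K),  # a.slice: usage is an assumed value
]
print(round(memory_available(levels) / M, 1))  # prints 99.7
```

The 100M limit on a.slice wins here because it leaves the least headroom, regardless of the larger limits configured further down.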

Comment 5 Martin Tegtmeier 2023-07-20 08:42:48 UTC
Hi Michal,

You are right about cgroups v2; however, in RHEL 8 with cgroups v1 this is broken.

I’m using Red Hat Enterprise Linux release 8.7 (Ootpa) and created a “limited.slice” in systemd which limits TasksMax to 5.

Then I created “someprocs.service” which is running in the limited.slice. The someprocs.service unit file doesn’t contain any limits at all.
 
If I query the service properties for someprocs:
 
[root@ip-172-31-33-24 ~]# systemctl status someprocs
● someprocs.service - Some processes as daemon
   Loaded: loaded (/etc/systemd/system/someprocs.service; disabled; vendor preset: disabled)
   Active: active (running) since Tue 2023-07-18 07:54:48 UTC; 8s ago
  Process: 5303 ExecStart=/bin/bash /home/ec2-user/test/startprocs (code=exited, status=0/SUCCESS)
    Tasks: 3 (limit: 23049)
   Memory: 668.0K
   CGroup: /limited.slice/someprocs.service
           ├─5305 /home/ec2-user/test/someproc 10
           ├─5307 /home/ec2-user/test/someproc 20
           └─5309 /home/ec2-user/test/someproc 30
 
It shows Tasks: 3 (limit: 23049). The parent cgroup is displayed correctly; however, the tasks limit is wrong because it doesn’t factor in the more stringent limit of the parent cgroup.

Looking at the limited.slice we can see the effective limit:
 
[root@ip-172-31-33-24 ~]# systemctl status limited.slice
● limited.slice - Slice with limited resources
   Loaded: loaded (/etc/systemd/system/limited.slice; static; vendor preset: disabled)
   Active: active since Tue 2023-07-18 07:49:22 UTC; 4min 5s ago
    Tasks: 3 (limit: 5)
   Memory: 728.0K
      CPU: 19ms
   CGroup: /limited.slice
           └─someprocs.service
             ├─5305 /home/ec2-user/test/someproc 10
             ├─5307 /home/ec2-user/test/someproc 20
             └─5309 /home/ec2-user/test/someproc 30
 
Tasks: 3 / Limit: 5 !!
And this limit is enforced for someprocs: as soon as I try to fork more than 5 processes I receive a “device busy” error. But the limit is invisible to my processes. Running getrlimit(RLIMIT_NPROC) in someproc returns “14406”, which is just as wrong as the “23049” derived from systemd properties...

As far as I can tell there are two use cases: 
1. I’m a developer and want to determine effective limits for my application at runtime. This is especially tricky for memory (e.g. getrlimit and systemd properties report infinity/unlimited, but malloc() followed by access results in an OOM kill).
2. I’m in support and ask my customer to run a support script that collects system information, to determine whether my application was running as intended or the environment was somehow limited.

I’m not at all implying that we recommend using limits - ideally our applications run unlimited... but you never know which hardening guides, policies and third party security tools were implemented in customer setups.

So we’d like to have a generic interface, independent of the underlying technology (cgroup_v1, cgroup_v2, whatever), to query effective limits, and obviously the results should be accurate.
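
As a rough idea of what the technology-independent part of such a query involves, the Python sketch below parses the contents of /proc/<pid>/cgroup and returns the directory that holds a given controller's files on either hierarchy. It assumes the conventional layout where cgroup v1 controllers are mounted at /sys/fs/cgroup/<controller> and cgroup v2 at /sys/fs/cgroup. The function names are hypothetical, and the hierarchy IDs in the usage example are invented.

```python
from pathlib import PurePosixPath


def parse_proc_cgroup(text):
    """Map controller name -> cgroup path from /proc/<pid>/cgroup content.

    Each line has the form "hierarchy-ID:controller-list:path". The cgroup v2
    entry has an empty controller list ("0::/path") and is stored under "".
    """
    membership = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        _hierarchy_id, controllers, path = line.split(":", 2)
        for name in controllers.split(","):
            membership[name] = path
    return membership


def cgroup_dir(membership, controller, mount_root="/sys/fs/cgroup"):
    """Return the directory with `controller`'s files for this process.

    Assumption: v1 controllers live under <mount_root>/<controller>, and the
    v2 unified hierarchy is mounted directly at <mount_root>.
    """
    root = PurePosixPath(mount_root)
    if controller in membership:  # cgroup v1: one hierarchy per controller
        return root / controller / membership[controller].lstrip("/")
    if "" in membership:  # cgroup v2: single unified hierarchy
        return root / membership[""].lstrip("/")
    raise LookupError(f"no cgroup entry found for controller {controller!r}")
```

For example, with a v1 entry such as "11:pids:/limited.slice/someprocs.service" the pids files would be looked up under /sys/fs/cgroup/pids/limited.slice/someprocs.service, and from there the parent directories still have to be checked for tighter limits.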

Thanks,
   -Martin