Bug 1599387 - Slice with systemd-run does not set memory quota correctly
Summary: Slice with systemd-run does not set memory quota correctly
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 30
Hardware: x86_64
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-07-09 16:40 UTC by Severin Gehwolf
Modified: 2020-05-26 18:26 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-26 18:26:21 UTC
Type: Bug
Embargoed:


Links
System              ID           Private  Priority  Status  Summary  Last Updated
Red Hat Bugzilla    1649796      1        None      None    None     2021-02-20 07:54:57 UTC
openjdk bug system  JDK-8217338  0        None      None    None     2019-04-10 07:43:30 UTC

Internal Links: 1649796

Description Severin Gehwolf 2018-07-09 16:40:28 UTC
Description of problem:
When a slice is given a memory limit, the limit appears to propagate wrongly or incompletely through the cgroup tree under /sys/fs/cgroup.

According to man cgroups, section "/proc/[pid]/cgroup", the third field in that file is supposed to be:

"""
This field contains the pathname of the control group in the hierarchy to which the process belongs.  This pathname is relative to the mount point of the hierarchy.
"""

Now on a slice with a 10M memory limit I see this in the memory.limit_in_bytes file:

# cat /sys/fs/cgroup/memory/user.slice/user-cg.slice/run-r69d519a8b9cb4711824562802c752767.scope/memory.limit_in_bytes
9223372036854771712

So it appears to be unlimited, while it should be capped at 10485760.

Version-Release number of selected component (if applicable):
$ rpm -q systemd
systemd-238-8.git0e0aa59.fc28.x86_64

How reproducible:
100%

Steps to Reproduce:
$ cat /etc/systemd/system/user-cg.slice 
[Unit]
Description=Demo cgroup
Before=slices.target

[Slice]
MemoryAccounting=true
MemoryLimit=10M
$ sudo systemctl daemon-reload && sudo systemctl restart user-cg.slice
$ sudo systemd-run --slice user-cg.slice --scope bash
# basepath=$(grep cgroup /proc/self/mountinfo | grep memory | cut -d' ' -f5)
# hierarchypath=$(grep memory /proc/self/cgroup | cut -d':' -f3)
# cat $basepath/$hierarchypath/memory.limit_in_bytes

Actual results:
9223372036854771712

Expected results:
10485760

Additional info:
The file at /sys/fs/cgroup/memory/user.slice/user-cg.slice/memory.limit_in_bytes actually contains the correct value. If that is the intended location, then the path reported in /proc/self/cgroup, which ends in run-rea0c52af97a04d67aefcb93486fa5385.scope, is wrong.

The results are the same when the unit file uses MemoryMax=10M instead of the deprecated MemoryLimit=10M.
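
For reference, the cgroup path lookup from the reproduction steps can be sketched as a small helper. This is only an illustration: the function name memory_cgroup_path is hypothetical, and it assumes a cgroup v1 (or hybrid) layout in which one line of /proc/[pid]/cgroup lists the memory controller in its second field.

```shell
# Hypothetical helper: print the third field (the cgroup path) of the
# line whose controller list contains "memory".
# $1: a file in /proc/[pid]/cgroup format (defaults to /proc/self/cgroup)
memory_cgroup_path() {
    awk -F: '
        # Match "memory" as a whole entry in the comma-separated list,
        # so that e.g. "name=systemd" or an empty field never matches.
        $2 ~ /(^|,)memory(,|$)/ {
            # Drop the first two fields; the rest is the path.
            sub(/^[^:]*:[^:]*:/, "")
            print
            exit
        }
    ' "${1:-/proc/self/cgroup}"
}
```

Matching on the second field rather than grepping the whole line avoids picking up a line that merely has "memory" somewhere in its path.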

Comment 1 Severin Gehwolf 2018-08-24 09:20:02 UTC
Any thoughts about this from systemd maintainers? This affects OpenJDK when being run in a systemd slice.

Comment 2 Marko Myllynen 2018-10-04 14:46:59 UTC
This is still an issue on Fedora 29 Beta / systemd-239-3.fc29.x86_64.

Comment 3 Severin Gehwolf 2018-10-19 12:00:26 UTC
From Lukas Nykryn:

"""
I think the problem is only in how cgroups display the information. The limit is actually applied.
I ran something in the shell that allocated memory, and in the status output I saw Memory: 9.9M

So I would suggest reassigning this to the kernel.
"""

Comment 4 Severin Gehwolf 2018-10-19 12:01:47 UTC
Re-assigning to component kernel as per comment 3.

Comment 5 Jeremy Cline 2018-10-19 14:47:35 UTC
Hey folks,

This looks to work as expected in the latest 4.19 kernels. F29 should be rebased to 4.19 in a couple weeks, and since it is actually enforced I don't think it's worth tracking down the commit that fixed this.

Comment 6 Severin Gehwolf 2018-10-19 14:56:36 UTC
(In reply to Jeremy Cline from comment #5)
> Hey folks,
> 
> This looks to work as expected in the latest 4.19 kernels. F29 should be
> rebased to 4.19 in a couple weeks, and since it is actually enforced I don't
> think it's worth tracking down the commit that fixed this.

This breaks automatic container detection in language runtimes (e.g. OpenJDK). The runtime sizes its structures according to the container limit. If it cannot detect that it is in a container when it actually is, then all bets are off. To the user it looks as if an application gets killed seemingly at random, with no indication why. So to that extent, I'm not sure I agree with "it's not worth tracking down the commit that fixed this".

For the OpenJDK-in-a-systemd-slice use case, observability is important.
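
To illustrate why the reported value matters for detection: 9223372036854771712 equals 2^63 - 4096, so a reader of memory.limit_in_bytes sees no usable limit. A minimal sketch of such a check follows. The names V1_UNLIMITED and is_unlimited are hypothetical, and treating this exact value as the "no limit" marker is an assumption based on the x86_64 systems shown in this report.

```shell
# Assumption: the value observed in this report marks "no limit".
V1_UNLIMITED=9223372036854771712

# Hypothetical helper: succeed if the given limit (in bytes) is at or
# above the "no limit" marker, fail otherwise.
is_unlimited() {
    [ "$1" -ge "$V1_UNLIMITED" ]
}
```

With the values from this report, is_unlimited succeeds for the scope's own limit and fails for the slice's 10485760.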

Comment 7 Jeremy Cline 2018-10-19 15:28:31 UTC
> This breaks automatic container detection in language runtimes (e.g. OpenJDK). It sizes its structures according to the container limit. If it's not able to detect that it's in a container but actually is, then all bets are off. It looks to the user as if some application gets killed seemingly randomly without knowing why. So to that extent, I'm not sure I agree with "it's not worth tracking down the commit that fixed this".

I'm happy to backport a patch if you or someone else wants to bisect it, but it's already fixed upstream and it'll be fixed in stable Fedora in a few weeks when we rebase to 4.19.

Comment 8 Marko Myllynen 2018-10-19 17:12:24 UTC
(In reply to Jeremy Cline from comment #5)
> 
> This looks to work as expected in the latest 4.19 kernels. F29 should be
> rebased to 4.19 in a couple weeks, and since it is actually enforced I don't
> think it's worth tracking down the commit that fixed this.

How did you test this? Using the second-latest Koji kernel build on an up-to-date Fedora 29 I see:

root@localhost:~# uname -r                                 
4.19.0-0.rc8.git3.1.fc30.x86_64
root@localhost:~# cat /etc/systemd/system/user-cg.slice    
[Unit]
Description=Demo cgroup
Before=slices.target

[Slice]
MemoryAccounting=true
MemoryLimit=10M
root@localhost:~# systemctl daemon-reload                       
root@localhost:~# systemctl restart user-cg.slice               
root@localhost:~# systemd-run --slice user-cg.slice --scope bash
Running scope as unit: run-r2344f8a82b1641cb8f254a05689747cc.scope
root@localhost:~# basepath=$(grep cgroup /proc/self/mountinfo | grep memory | cut -d' ' -f5)
root@localhost:~# hierarchypath=$(grep memory /proc/self/cgroup | cut -d':' -f3)
root@localhost:~# cat $basepath/$hierarchypath/memory.limit_in_bytes
9223372036854771712
root@localhost:~# cat /sys/fs/cgroup/memory/user.slice/user-cg.slice/memory.limit_in_bytes
10485760
root@localhost:~# cat /proc/self/cgroup
11:freezer:/
10:pids:/user.slice/user-cg.slice/run-r2344f8a82b1641cb8f254a05689747cc.scope
9:cpu,cpuacct:/
8:devices:/user.slice
7:perf_event:/
6:blkio:/
5:memory:/user.slice/user-cg.slice/run-r2344f8a82b1641cb8f254a05689747cc.scope
4:cpuset:/
3:hugetlb:/
2:net_cls,net_prio:/
1:name=systemd:/user.slice/user-cg.slice/run-r2344f8a82b1641cb8f254a05689747cc.scope
0::/user.slice/user-cg.slice/run-r2344f8a82b1641cb8f254a05689747cc.scope
root@localhost:~#

Comment 9 Jeremy Cline 2018-10-19 18:33:54 UTC
Ah, you're right.

So, I dug into this a bit and looking at the code, there's an entry in memory.stat with "hierarchical_memory_limit" that correctly computes the limit taking the group hierarchy into account.

What I can't find is whether memory.limit_in_bytes is *supposed* to be the limit for that cgroup accounting for its parents when hierarchy is on, or whether it's supposed to show only the limit that particular cgroup imposes. Is there a reason to not use "hierarchical_memory_limit" from memory.stat?
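
Reading that entry could look roughly like the sketch below. The function name hierarchical_memory_limit is hypothetical, and the only assumption about memory.stat is that it consists of "key value" lines, as in the output quoted elsewhere in this report.

```shell
# Hypothetical helper: print the hierarchical_memory_limit value from
# a cgroup v1 memory.stat file.
# $1: path to memory.stat
hierarchical_memory_limit() {
    # Compare the whole first field so that similarly prefixed keys
    # (e.g. hierarchical_memsw_limit) are not matched by accident.
    awk '$1 == "hierarchical_memory_limit" { print $2; exit }' "$1"
}
```

It would be called with the memory.stat file of the process's own cgroup, for example "$basepath/$hierarchypath/memory.stat" from the reproduction steps.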

Comment 10 Severin Gehwolf 2018-10-30 11:09:01 UTC
(In reply to Jeremy Cline from comment #9)
> Is there a reason to not use "hierarchical_memory_limit" from
> memory.stat?

I don't understand this question. What is 'memory.stat'?

Comment 11 Jeremy Cline 2018-10-30 12:51:26 UTC
(In reply to Severin Gehwolf from comment #10)
> (In reply to Jeremy Cline from comment #9)
> > Is there a reason to not use "hierarchical_memory_limit" from
> > memory.stat?
> 
> I don't understand this question. What is 'memory.stat'?

The file in the cgroup, e.g. /sys/fs/cgroup/memory/user.slice/user-cg.slice/run-r3654855cc3cd48a392f045011d3086ad.scope/memory.stat

Comment 12 Severin Gehwolf 2018-10-30 15:51:19 UTC
(In reply to Jeremy Cline from comment #11)
> (In reply to Severin Gehwolf from comment #10)
> > (In reply to Jeremy Cline from comment #9)
> > > Is there a reason to not use "hierarchical_memory_limit" from
> > > memory.stat?
> > 
> > I don't understand this question. What is 'memory.stat'?
> 
> The file in the cgroup, e.g.
> /sys/fs/cgroup/memory/user.slice/user-cg.slice/run-
> r3654855cc3cd48a392f045011d3086ad.scope/memory.stat

OK, yes that seems to contain the right value. However, so does /sys/fs/cgroup/memory/user.slice/user-cg.slice/memory.limit_in_bytes. The reason we cannot just use some other value is that the same OpenJDK code that handles Docker/podman memory and CPU limits should work for systemd too. That code relies on the file 'memory.limit_in_bytes' containing the right value.

I see two ways to "fix" this:

a)
fix the memory controller hierarchy to report '/user.slice/user-cg.slice' instead of /user.slice/user-cg.slice/run-rea23ce64f69a439ab39f835f05853939.scope

b)
actually have the right value in /sys/fs/cgroup/memory/user.slice/user-cg.slice/run-rea23ce64f69a439ab39f835f05853939.scope/memory.limit_in_bytes instead of /sys/fs/cgroup/memory/user.slice/user-cg.slice/memory.limit_in_bytes


[root@p50-laptop hotspot]# grep hierarchical_memory_limit /sys/fs/cgroup/memory/user.slice/user-cg.slice/run-rea23ce64f69a439ab39f835f05853939.scope/memory.stat
hierarchical_memory_limit 10485760
[root@p50-laptop hotspot]# cat /sys/fs/cgroup/memory/user.slice/user-cg.slice/memory.limit_in_bytes 
10485760
[root@p50-laptop hotspot]# grep memory /proc/self/cgroup 
6:memory:/user.slice/user-cg.slice/run-rea23ce64f69a439ab39f835f05853939.scope
[root@p50-laptop hotspot]# cat /sys/fs/cgroup/memory/user.slice/user-cg.slice/run-rea23ce64f69a439ab39f835f05853939.scope/memory.limit_in_bytes 
9223372036854771712

Comment 15 Jeremy Cline 2018-10-30 20:39:14 UTC
(In reply to Severin Gehwolf from comment #12)
> (In reply to Jeremy Cline from comment #11)
> > (In reply to Severin Gehwolf from comment #10)
> > > (In reply to Jeremy Cline from comment #9)
> > > > Is there a reason to not use "hierarchical_memory_limit" from
> > > > memory.stat?
> > > 
> > > I don't understand this question. What is 'memory.stat'?
> > 
> > The file in the cgroup, e.g.
> > /sys/fs/cgroup/memory/user.slice/user-cg.slice/run-
> > r3654855cc3cd48a392f045011d3086ad.scope/memory.stat
> 
> OK, yes that seems to contain the right value. However, so does
> /sys/fs/cgroup/memory/user.slice/user-cg.slice/memory.limit_in_bytes. The
> reason we cannot just use some other value is that it's the same code in
> OpenJDK which handles Docker/podman memory CPU limits which should work for
> systemd too. That code relies on file 'memory.limit_in_bytes' containing the
> right value.
> 
> I see two ways to "fix" this:
> 
> a)
> fix the memory controller hierarchy to report '/user.slice/user-cg.slice'
> instead of
> /user.slice/user-cg.slice/run-rea23ce64f69a439ab39f835f05853939.scope

I don't think this is a practical approach for several reasons, the most important of which is that it's not true since the process really is in /user.slice/user-cg.slice/run-rea23ce64f69a439ab39f835f05853939.scope/.

> 
> b)
> actually have the right value in
> /sys/fs/cgroup/memory/user.slice/user-cg.slice/run-
> rea23ce64f69a439ab39f835f05853939.scope/memory.limit_in_bytes instead of
> /sys/fs/cgroup/memory/user.slice/user-cg.slice/memory.limit_in_bytes

This is the crux of the problem. The cgroup v1 memory interface is not, as far as I can tell, fully documented anywhere. There is a (in its own words) hopelessly outdated document[0], but it does not indicate the expected behavior of memory.limit_in_bytes. However, it's clear from the actual behavior that it defaults to max. This is true on RHEL 7 and Fedora 29.

I don't know the internals of docker/podman so I can't say what they're doing, but perhaps they are setting the child cgroup's limit to the same as its parent? If OpenJDK used "hierarchical_memory_limit" it should work for both, though.

All that said, I don't think this is so much a bug as it is a confusing and undocumented interface. I don't think memory.limit_in_bytes is _supposed_ to account for hierarchy. It defaults to max and only reports the limit that the cgroup itself applies. I also don't think there's any chance to change its behavior for two reasons:

* It's the legacy interface.

* It's a public API and this would be a breaking change in its behavior.

For what it's worth, the cgroup v2 documentation _does_ document the expected default of memory.max[1], which is max (unlimited).


[0] https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
[1] https://www.kernel.org/doc/html/v4.19/admin-guide/cgroup-v2.html#memory-interface-files
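
If each memory.limit_in_bytes only reports the limit of its own cgroup, a reader that wants the effective limit has to combine the values along the path itself. The following is a minimal sketch under that assumption, not a description of what the kernel, Docker/podman, or OpenJDK actually do. The function name effective_memory_limit is hypothetical, and the sketch assumes that limits of ancestor cgroups apply to their children.

```shell
# Hypothetical helper: walk from a cgroup up to the hierarchy root and
# print the smallest memory.limit_in_bytes found on the way.
# $1: mount point of the v1 memory hierarchy, e.g. /sys/fs/cgroup/memory
# $2: cgroup path relative to that mount point (third field of the
#     memory line in /proc/[pid]/cgroup)
# The body runs in a subshell so its variables stay local.
effective_memory_limit() (
    root=${1%/}
    dir="$root/${2#/}"
    dir=${dir%/}
    min=""
    while :; do
        if [ -r "$dir/memory.limit_in_bytes" ]; then
            val=$(cat "$dir/memory.limit_in_bytes")
            if [ -z "$min" ] || [ "$val" -lt "$min" ]; then
                min=$val
            fi
        fi
        # Stop once the hierarchy root has been examined.
        [ "$dir" = "$root" ] && break
        dir=${dir%/*}    # step up to the parent cgroup
    done
    printf '%s\n' "$min"
)
```

For the layout in this report, the scope's own value is the "max" default and the slice's value is 10485760, so the walk would print 10485760.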

Comment 17 Severin Gehwolf 2019-01-18 12:37:19 UTC
Apparently this issue surfaced between kernel 4.15 (good) and 4.18 (bad).

Comment 18 Laura Abbott 2019-04-09 20:44:32 UTC
We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 29 kernel bugs.
 
Fedora XX has now been rebased to 5.0.6. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you have moved on to Fedora 30, and are still experiencing this issue, please change the version to Fedora 30.
 
If you experience different issues, please open a new bug report for those.

Comment 19 Severin Gehwolf 2019-04-10 07:43:31 UTC
I don't have 5.0.6, but 5.0.5 still shows the issue.

# uname -a
Linux t580-laptop 5.0.5-200.fc29.x86_64 #1 SMP Wed Mar 27 20:58:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Issue still present with this kernel version.

I'm also adding the OpenJDK bug which, once fixed, would work around this issue.

Comment 20 Justin M. Forbes 2019-08-20 17:40:05 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 29 kernel bugs.

Fedora 29 has now been rebased to 5.2.9-100.fc29.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 30, and are still experiencing this issue, please change the version to Fedora 30.

If you experience different issues, please open a new bug report for those.

Comment 21 Justin M. Forbes 2019-09-17 20:03:21 UTC
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 3 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.

Comment 22 Severin Gehwolf 2019-09-18 08:03:39 UTC
Still reproduces with 5.2.13-200 on F30. Re-opening.

# uname -a
Linux t580-laptop 5.2.13-200.fc30.x86_64 #1 SMP Fri Sep 6 14:30:40 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

$ sudo systemctl daemon-reload && sudo systemctl restart user-cg.slice
$ sudo systemd-run --slice user-cg.slice --scope bash
Running scope as unit: run-rf1d61aa7408a46838acf6010b8993821.scope
# basepath=$(grep cgroup /proc/self/mountinfo | grep memory | cut -d' ' -f5)
# hierarchypath=$(grep memory /proc/self/cgroup | cut -d':' -f3)
# cat $basepath/$hierarchypath/memory.limit_in_bytes
9223372036854771712

FWIW:

# grep hierarchical_memory_limit $basepath/$hierarchypath/memory.stat 
hierarchical_memory_limit 10485760

Comment 23 Justin M. Forbes 2020-03-03 16:27:17 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 30 kernel bugs.

Fedora 30 has now been rebased to 5.5.7-100.fc30.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 31, and are still experiencing this issue, please change the version to Fedora 31.

If you experience different issues, please open a new bug report for those.

Comment 24 Severin Gehwolf 2020-03-12 09:43:29 UTC
Still reproducible with 5.5.8-100

$ uname -a
Linux t580-laptop 5.5.8-100.fc30.x86_64 #1 SMP Thu Mar 5 21:55:51 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ sudo systemctl daemon-reload && sudo systemctl restart user-cg.slice

$ sudo systemd-run --slice user-cg.slice --scope bash
Running scope as unit: run-r321de94d87de41fe95911726d08b68a6.scope
# basepath=$(grep cgroup /proc/self/mountinfo | grep memory | cut -d' ' -f5)
# hierarchypath=$(grep memory /proc/self/cgroup | cut -d':' -f3)
# cat $basepath/$hierarchypath/memory.limit_in_bytes
9223372036854771712

# grep hierarchical_memory_limit $basepath/$hierarchypath/memory.stat 
hierarchical_memory_limit 10485760

Comment 25 Ben Cotton 2020-04-30 20:21:25 UTC
This message is a reminder that Fedora 30 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 30 on 2020-05-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '30'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 30 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 26 Ben Cotton 2020-05-26 18:26:21 UTC
Fedora 30 changed to end-of-life (EOL) status on 2020-05-26. Fedora 30 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

