Description of problem:

When setting a slice with a memory limit, the limit does not appear to propagate correctly through the sysfs tree. According to man 7 cgroups, section "/proc/[pid]/cgroup", the third field in that file is supposed to be:

"""
This field contains the pathname of the control group in the hierarchy to which the process belongs. This pathname is relative to the mount point of the hierarchy.
"""

Now on a slice with a 10M memory limit I see this in the scope's memory.limit_in_bytes file:

# cat /sys/fs/cgroup/memory/user.slice/user-cg.slice/run-r69d519a8b9cb4711824562802c752767.scope/memory.limit_in_bytes
9223372036854771712

So it appears to be unlimited, while it should be capped at 10485760.

Version-Release number of selected component (if applicable):

$ rpm -q systemd
systemd-238-8.git0e0aa59.fc28.x86_64

How reproducible:

100%

Steps to Reproduce:

$ cat /etc/systemd/system/user-cg.slice
[Unit]
Description=Demo cgroup
Before=slices.target

[Slice]
MemoryAccounting=true
MemoryLimit=10M

$ sudo systemctl daemon-reload && sudo systemctl restart user-cg.slice
$ sudo systemd-run --slice user-cg.slice --scope bash
# basepath=$(grep cgroup /proc/self/mountinfo | grep memory | cut -d' ' -f5)
# hierarchypath=$(grep memory /proc/self/cgroup | cut -d':' -f3)
# cat $basepath/$hierarchypath/memory.limit_in_bytes

Actual results:

9223372036854771712

Expected results:

10485760

Additional info:

The file at /sys/fs/cgroup/memory/user.slice/user-cg.slice/memory.limit_in_bytes actually contains the correct value. If that is the file that matters, then the path in /proc/self/cgroup is misleading, since it contains the run-rea0c52af97a04d67aefcb93486fa5385.scope path instead. Results are the same when the unit file uses MemoryMax=10M instead of the deprecated MemoryLimit=10M.
Any thoughts about this from systemd maintainers? This affects OpenJDK when being run in a systemd slice.
This is still an issue on Fedora 29 Beta / systemd-239-3.fc29.x86_64.
From Lukas Nykryn:

"""
I think the problem is only how cgroups display the information. The limit actually is applied. I have run something in the shell that allocated the memory and in status I saw

Memory: 9.9M

So I would suggest reassigning this to the kernel.
"""
Re-assigning to component kernel as per comment 3.
Hey folks,

This looks to work as expected in the latest 4.19 kernels. F29 should be rebased to 4.19 in a couple of weeks, and since the limit is actually enforced I don't think it's worth tracking down the commit that fixed this.
(In reply to Jeremy Cline from comment #5)
> Hey folks,
>
> This looks to work as expected in the latest 4.19 kernels. F29 should be
> rebased to 4.19 in a couple weeks, and since it is actually enforced I don't
> think it's worth tracking down the commit that fixed this.

This breaks automatic container detection in language runtimes (e.g. OpenJDK), which size their data structures according to the container limit. If a runtime is running in a container but is unable to detect that, all bets are off: to the user it looks as if the application gets killed seemingly at random, with no indication why. So to that extent, I'm not sure I agree with "it's not worth tracking down the commit that fixed this". For the OpenJDK-in-a-systemd-slice use case, observability is important.
> This breaks automatic container detection in language runtimes (e.g.
> OpenJDK). It sizes its structures according to the container limit. If it's
> not able to detect that it's in a container but actually is, then all bets
> are off. It looks to the user as if some application gets killed seemingly
> randomly without knowing why. So to that extent, I'm not sure I agree with
> "it's not worth tracking down the commit that fixed this".

I'm happy to backport a patch if you or someone else wants to bisect it, but it's already fixed upstream and it'll be fixed in stable Fedora in a few weeks when we rebase to 4.19.
(In reply to Jeremy Cline from comment #5)
> This looks to work as expected in the latest 4.19 kernels. F29 should be
> rebased to 4.19 in a couple weeks, and since it is actually enforced I don't
> think it's worth tracking down the commit that fixed this.

How did you test this? Using the second-latest Koji kernel build on an up-to-date Fedora 29 I see:

root@localhost:~# uname -r
4.19.0-0.rc8.git3.1.fc30.x86_64
root@localhost:~# cat /etc/systemd/system/user-cg.slice
[Unit]
Description=Demo cgroup
Before=slices.target

[Slice]
MemoryAccounting=true
MemoryLimit=10M

root@localhost:~# systemctl daemon-reload
root@localhost:~# systemctl restart user-cg.slice
root@localhost:~# systemd-run --slice user-cg.slice --scope bash
Running scope as unit: run-r2344f8a82b1641cb8f254a05689747cc.scope
root@localhost:~# basepath=$(grep cgroup /proc/self/mountinfo | grep memory | cut -d' ' -f5)
root@localhost:~# hierarchypath=$(grep memory /proc/self/cgroup | cut -d':' -f3)
root@localhost:~# cat $basepath/$hierarchypath/memory.limit_in_bytes
9223372036854771712
root@localhost:~# cat /sys/fs/cgroup/memory/user.slice/user-cg.slice/memory.limit_in_bytes
10485760
root@localhost:~# cat /proc/self/cgroup
11:freezer:/
10:pids:/user.slice/user-cg.slice/run-r2344f8a82b1641cb8f254a05689747cc.scope
9:cpu,cpuacct:/
8:devices:/user.slice
7:perf_event:/
6:blkio:/
5:memory:/user.slice/user-cg.slice/run-r2344f8a82b1641cb8f254a05689747cc.scope
4:cpuset:/
3:hugetlb:/
2:net_cls,net_prio:/
1:name=systemd:/user.slice/user-cg.slice/run-r2344f8a82b1641cb8f254a05689747cc.scope
0::/user.slice/user-cg.slice/run-r2344f8a82b1641cb8f254a05689747cc.scope
Ah, you're right. So, I dug into this a bit, and looking at the code there's an entry in memory.stat, "hierarchical_memory_limit", that correctly computes the limit taking the group hierarchy into account.

What I can't find is whether memory.limit_in_bytes is *supposed* to be the limit for that cgroup accounting for its parents when hierarchy is on, or whether it's supposed to just show the limit that particular cgroup itself imposes.

Is there a reason not to use "hierarchical_memory_limit" from memory.stat?
(In reply to Jeremy Cline from comment #9)
> Is there a reason to not use "hierarchical_memory_limit" from
> memory.stat?

I don't understand this question. What is 'memory.stat'?
(In reply to Severin Gehwolf from comment #10)
> I don't understand this question. What is 'memory.stat'?

The file in the cgroup, e.g. /sys/fs/cgroup/memory/user.slice/user-cg.slice/run-r3654855cc3cd48a392f045011d3086ad.scope/memory.stat
(In reply to Jeremy Cline from comment #11)
> The file in the cgroup, e.g.
> /sys/fs/cgroup/memory/user.slice/user-cg.slice/run-
> r3654855cc3cd48a392f045011d3086ad.scope/memory.stat

OK, yes, that seems to contain the right value. However, so does /sys/fs/cgroup/memory/user.slice/user-cg.slice/memory.limit_in_bytes. The reason we cannot just use some other value is that the same code in OpenJDK handles Docker/podman memory and CPU limits, and it should work for systemd too. That code relies on the file 'memory.limit_in_bytes' containing the right value.

I see two ways to "fix" this:

a) fix the memory controller hierarchy to report '/user.slice/user-cg.slice' instead of /user.slice/user-cg.slice/run-rea23ce64f69a439ab39f835f05853939.scope

b) actually have the right value in /sys/fs/cgroup/memory/user.slice/user-cg.slice/run-rea23ce64f69a439ab39f835f05853939.scope/memory.limit_in_bytes instead of only in /sys/fs/cgroup/memory/user.slice/user-cg.slice/memory.limit_in_bytes

[root@p50-laptop hotspot]# grep hierarchical_memory_limit /sys/fs/cgroup/memory/user.slice/user-cg.slice/run-rea23ce64f69a439ab39f835f05853939.scope/memory.stat
hierarchical_memory_limit 10485760
[root@p50-laptop hotspot]# cat /sys/fs/cgroup/memory/user.slice/user-cg.slice/memory.limit_in_bytes
10485760
[root@p50-laptop hotspot]# grep memory /proc/self/cgroup
6:memory:/user.slice/user-cg.slice/run-rea23ce64f69a439ab39f835f05853939.scope
[root@p50-laptop hotspot]# cat /sys/fs/cgroup/memory/user.slice/user-cg.slice/run-rea23ce64f69a439ab39f835f05853939.scope/memory.limit_in_bytes
9223372036854771712
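For reference, the memory.stat lookup under discussion can be sketched as a small shell snippet. The snippet below fakes a memory.stat file in a temp directory so it is self-contained; on a real cgroup v1 system the path would instead be $basepath/$hierarchypath/memory.stat from the reproduction steps, and the field contents would come from the kernel.

```shell
# Fake a cgroup v1 memory.stat file in a temp dir (illustration only;
# the value mirrors the 10M limit from this report).
tmp=$(mktemp -d)
printf 'cache 0\nrss 0\nhierarchical_memory_limit 10485760\n' > "$tmp/memory.stat"

# Extract the effective (hierarchy-aware) limit from memory.stat.
limit=$(awk '$1 == "hierarchical_memory_limit" { print $2 }' "$tmp/memory.stat")
echo "$limit"

rm -rf "$tmp"
```

Running the same awk line against the scope's real memory.stat is what returns 10485760 in the transcripts above, where memory.limit_in_bytes still reports the unlimited default.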
(In reply to Severin Gehwolf from comment #12)
> I see two ways to "fix" this:
>
> a) fix the memory controller hierarchy to report '/user.slice/user-cg.slice'
> instead of
> /user.slice/user-cg.slice/run-rea23ce64f69a439ab39f835f05853939.scope

I don't think this is a practical approach, for several reasons, the most important of which is that it's not true: the process really is in /user.slice/user-cg.slice/run-rea23ce64f69a439ab39f835f05853939.scope/.

> b) actually have the right value in
> /sys/fs/cgroup/memory/user.slice/user-cg.slice/run-
> rea23ce64f69a439ab39f835f05853939.scope/memory.limit_in_bytes instead of
> /sys/fs/cgroup/memory/user.slice/user-cg.slice/memory.limit_in_bytes

This is the crux of the problem. The cgroup v1 memory interface is not, as far as I can tell, fully documented anywhere. There is a (in its own words) hopelessly outdated document[0], but it does not indicate the expected behavior of memory.limit_in_bytes. However, it's clear from the actual behavior that it defaults to max. This is true on RHEL 7 and Fedora 29.
I don't know the internals of docker/podman so I can't say what they're doing, but perhaps they set the child cgroup's limit to the same value as its parent's. If OpenJDK used "hierarchical_memory_limit" it should work for both, though.

All that said, I don't think this is so much a bug as a confusing and undocumented interface. I don't think memory.limit_in_bytes is _supposed_ to account for hierarchy. It defaults to max and only reports the limit that that particular cgroup itself applies. I also don't think there's any chance of changing its behavior, for two reasons:

* It's the legacy interface.
* It's a public API, and this would be a breaking change in its behavior.

For what it's worth, the cgroup v2 documentation _does_ document the expected default of memory.max[1], which is max (unlimited).

[0] https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
[1] https://www.kernel.org/doc/html/v4.19/admin-guide/cgroup-v2.html#memory-interface-files
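If the kernel behavior won't change, the hierarchy-aware value can also be recovered in userspace by taking the minimum of memory.limit_in_bytes over the cgroup and all of its ancestors, which is what hierarchical_memory_limit reports. A self-contained sketch over a synthetic tree follows; the "run-demo.scope" name is made up, and on a real system the walk would start at $basepath/$hierarchypath and stop at the memory controller mount point.

```shell
# Build a synthetic cgroup v1 tree in a temp dir (illustration only; a
# real system would use /sys/fs/cgroup/memory as the root). Only the
# slice level carries the 10M limit, as in this report.
root=$(mktemp -d)
mkdir -p "$root/user.slice/user-cg.slice/run-demo.scope"
echo 9223372036854771712 > "$root/memory.limit_in_bytes"
echo 9223372036854771712 > "$root/user.slice/memory.limit_in_bytes"
echo 10485760            > "$root/user.slice/user-cg.slice/memory.limit_in_bytes"
echo 9223372036854771712 > "$root/user.slice/user-cg.slice/run-demo.scope/memory.limit_in_bytes"

# Walk from the process's cgroup up to the controller root, keeping the
# smallest limit seen; that is the limit the kernel actually enforces.
dir="$root/user.slice/user-cg.slice/run-demo.scope"
min=$(cat "$dir/memory.limit_in_bytes")
while [ "$dir" != "$root" ]; do
    dir=$(dirname "$dir")
    cur=$(cat "$dir/memory.limit_in_bytes")
    if [ "$cur" -lt "$min" ]; then min=$cur; fi
done
echo "$min"

rm -rf "$root"
```

This prints 10485760 for the tree above, matching hierarchical_memory_limit, without relying on the per-cgroup memory.limit_in_bytes being hierarchy-aware.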
Apparently this issue surfaced between kernel 4.15 (good) and 4.18 (bad).
We apologize for the inconvenience. There are a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 29 kernel bugs.

Fedora 29 has now been rebased to 5.0.6. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 30 and are still experiencing this issue, please change the version to Fedora 30.

If you experience different issues, please open a new bug report for those.
I don't have 5.0.6, but 5.0.5 still shows the issue.

# uname -a
Linux t580-laptop 5.0.5-200.fc29.x86_64 #1 SMP Wed Mar 27 20:58:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

I'm also adding the OpenJDK bug which, once fixed, would work around this issue.
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There are a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 29 kernel bugs.

Fedora 29 has now been rebased to 5.2.9-100.fc29. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 30 and are still experiencing this issue, please change the version to Fedora 30.

If you experience different issues, please open a new bug report for those.
*********** MASS BUG UPDATE **************

This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 3 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.
Still reproduces with 5.2.13-200 on F30. Re-opening.

# uname -a
Linux t580-laptop 5.2.13-200.fc30.x86_64 #1 SMP Fri Sep 6 14:30:40 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

$ sudo systemctl daemon-reload && sudo systemctl restart user-cg.slice
$ sudo systemd-run --slice user-cg.slice --scope bash
Running scope as unit: run-rf1d61aa7408a46838acf6010b8993821.scope
# basepath=$(grep cgroup /proc/self/mountinfo | grep memory | cut -d' ' -f5)
# hierarchypath=$(grep memory /proc/self/cgroup | cut -d':' -f3)
# cat $basepath/$hierarchypath/memory.limit_in_bytes
9223372036854771712

FWIW:
# grep hierarchical_memory_limit $basepath/$hierarchypath/memory.stat
hierarchical_memory_limit 10485760
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There are a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 30 kernel bugs.

Fedora 30 has now been rebased to 5.5.7-100.fc30. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 31 and are still experiencing this issue, please change the version to Fedora 31.

If you experience different issues, please open a new bug report for those.
Still reproducible with 5.5.8-100.

$ uname -a
Linux t580-laptop 5.5.8-100.fc30.x86_64 #1 SMP Thu Mar 5 21:55:51 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

$ sudo systemctl daemon-reload && sudo systemctl restart user-cg.slice
$ sudo systemd-run --slice user-cg.slice --scope bash
Running scope as unit: run-r321de94d87de41fe95911726d08b68a6.scope
# basepath=$(grep cgroup /proc/self/mountinfo | grep memory | cut -d' ' -f5)
# hierarchypath=$(grep memory /proc/self/cgroup | cut -d':' -f3)
# cat $basepath/$hierarchypath/memory.limit_in_bytes
9223372036854771712
# grep hierarchical_memory_limit $basepath/$hierarchypath/memory.stat
hierarchical_memory_limit 10485760
This message is a reminder that Fedora 30 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 30 on 2020-05-26. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '30'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue; we are sorry that we were not able to fix it before Fedora 30 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version prior to this bug being closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 30 changed to end-of-life (EOL) status on 2020-05-26. Fedora 30 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.