Red Hat Bugzilla – Bug 1464909
Kubelet should support specifying the top level cgroups for KubeReserved/SystemReserved
Last modified: 2017-07-10 14:53:02 EDT
Description of problem:
After specifying the "system-reserved" and "system-reserved-cgroup" parameters, I did not find any top-level cgroup with the absolute name I specified (here, /system-reserved-test). If the "system-reserved-cgroup" flag was not provided, the reserved compute resources were not reflected in system.slice either.
The same happens with "kube-reserved".
Did I misunderstand something, or configure it incorrectly?
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Set system-reserved and system-reserved-cgroup in node-config.yaml, then restart the node service
# systemctl restart atomic-openshift-node
2. Check node capacity and allocatable
# oc describe node <node> | grep -A7 Capacity
3. Check memory limit in the cgroup
# ls /sys/fs/cgroup/memory
# cat /sys/fs/cgroup/memory/system.slice/memory.limit_in_bytes
4. Set only system-reserved in node-config.yaml, then restart the node service
# systemctl restart atomic-openshift-node
5. Check memory limit in the cgroup again
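For reference, the step 1 settings would be a node-config.yaml fragment along these lines (a sketch: the memory=400Mi value and the /system-reserved-test cgroup name are taken from this report, while the exact kubeletArguments layout is assumed):

```yaml
kubeletArguments:
  system-reserved:
    - "memory=400Mi"            # amount reserved for system daemons
  system-reserved-cgroup:
    - "/system-reserved-test"   # top-level cgroup expected to appear
```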
Actual results:
2. [root@ip-172-18-12-156 ~]# oc describe node | grep -A7 Capacity
3. [root@ip-172-18-3-157 ~]# ls /sys/fs/cgroup/memory | grep system-reserved-test
[root@ip-172-18-3-157 ~]# ls /sys/fs/cgroup/memory/system.slice | grep system-reserved-test
[root@ip-172-18-3-157 ~]# cat /sys/fs/cgroup/memory/system.slice/memory.limit_in_bytes
5. The same result as in step 3.
Expected results:
3. There should be a directory named system-reserved-test.
/system-reserved-test/memory.limit_in_bytes should be (kube-reserved) + (system-reserved) = 0 (not specified) + 400Mi = 400Mi = 419430400.
5. /sys/fs/cgroup/memory/system.slice/memory.limit_in_bytes should be equal to 400Mi.
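The byte value in the expected results follows from simple arithmetic, which can be checked in the shell:

```shell
# 400Mi = 400 * 1024 * 1024 bytes
echo $((400 * 1024 * 1024))   # prints 419430400
```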
According to the Document: https://github.com/openshift/openshift-docs/pull/4532/files,
Example 2. Node Cgroup Settings
kubeletArguments:
  cgroups-per-qos:
    - "true" (1)
  cgroup-driver:
    - "systemd" (2)
  enforce-node-allocatable:
    - "pods" (3)
3. A comma-delimited list of scopes where the node should enforce node resource constraints. Valid values are pods, system-reserved, and kube-reserved. The default is pods. We do not recommend users change this value.
Optionally, the node can be made to enforce kube-reserved and system-reserved by specifying those tokens in the enforce-node-allocatable flag. If specified, the corresponding --kube-reserved-cgroup or --system-reserved-cgroup needs to be provided. In future releases, the node and container runtime will be packaged in a common cgroup separate from system.slice. Until that time, we do not recommend users change the default value of enforce-node-allocatable flag.
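Tying the two flags together, a hypothetical node-config.yaml fragment that opts into system-reserved enforcement would look roughly like this (illustrative only, given the caveats below; the cgroup path is from this report and must be created ahead of time):

```yaml
kubeletArguments:
  enforce-node-allocatable:
    - "pods,system-reserved"    # opt in to enforcing the system reservation
  system-reserved:
    - "memory=400Mi"
  system-reserved-cgroup:
    - "/system-reserved-test"   # pre-created cgroup; the kubelet will not create it
```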
While setting enforce-node-allocatable to "pods" will create kubepods.slice and run pods inside it, "kube-reserved" and "system-reserved" work differently.
The kube-reserved-cgroup and system-reserved-cgroup flags tell the kubelet in which cgroups the kube and system services are _already running_. The cgroups must be created and set up ahead of time, likely by systemd, and the *-cgroup flags just inform the kubelet about which cgroups already contain the kube and system services. The kubelet does not create these.
In this model, system-reserved.slice (for example), kube-reserved.slice, and kubepods.slice would all be peers in the hierarchy each consuming disjoint sets of the total system resources.
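The pre-creation step can be sketched on a cgroup v1 host as plain mkdir calls in the cgroupfs (a sketch only: CGROUP_ROOT would be /sys/fs/cgroup on a real host, and in practice systemd would normally own these slices):

```shell
# Using a scratch directory so the sketch runs without root;
# on a real host this would be /sys/fs/cgroup and require root.
CGROUP_ROOT="${CGROUP_ROOT:-/tmp/cgroup-demo}"

# A mkdir in each controller hierarchy creates the cgroup the kubelet
# is then told about via --system-reserved-cgroup=/system-reserved.slice.
mkdir -p "$CGROUP_ROOT/memory/system-reserved.slice"
mkdir -p "$CGROUP_ROOT/cpu/system-reserved.slice"
ls "$CGROUP_ROOT/memory"
```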
Enabling a *-reserve means "save this amount of resource". Turning on enforce-node-allocatable for that *-reserve means "save me this amount of resource and don't let me use more", which is problematic if the users haven't profiled the kubelet/docker/system services to see how much resource they use. If they aren't careful, they can set the reserve too low and OOM kill the node, docker, or other system service.
That is why we don't recommend enabling it at this point.
In fact, I'm not sure we even have a procedure for enabling it at all, since it would require that kubepods.slice become a peer of system.slice and that the kubelet/node/docker be placed in a new kube.slice peer, for example.
At this time, we do not recommend setting `enforce-node-allocatable` to any value other than "pods". The `system-reserved` and `kube-reserved` values are only used to reduce the amount of allocatable space that can be scheduled to the node.
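That reduction follows the upstream node-allocatable formula (Allocatable = Capacity - kube-reserved - system-reserved, ignoring eviction thresholds here for simplicity). A quick sketch with the 400Mi reservation from this report and a hypothetical 8Gi node:

```shell
capacity=$((8 * 1024 * 1024 * 1024))     # hypothetical 8Gi node
kube_reserved=0                          # not specified in this report
system_reserved=$((400 * 1024 * 1024))   # 400Mi from this report
allocatable=$((capacity - kube_reserved - system_reserved))
echo "$allocatable"                      # prints 8170504192
```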
If/when we do support setting `enforce-node-allocatable` to values other than "pods", the `system-reserved-cgroup` would align with `/system.slice` and the `kube-reserved-cgroup` would align with `/runtime.slice`, which does not yet exist. The runtime.slice would be the cgroup that packages the openshift node and the container runtime (e.g., docker). To support this deployment model, more packaging and install work is needed.