Bug 1464909 - Kubelet should support specifying the top level cgroups for KubeReserved/SystemReserved
Status: CLOSED NOTABUG
Product: OpenShift Container Platform
Classification: Red Hat
Component: Pod
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Assigned To: Derek Carr
QA Contact: Qixuan Wang
Reported: 2017-06-26 04:43 EDT by Qixuan Wang
Modified: 2017-07-10 14:53 EDT
CC List: 4 users

Last Closed: 2017-07-10 14:53:02 EDT
Type: Bug


Description Qixuan Wang 2017-06-26 04:43:39 EDT
Description of problem:
I specified the "system-reserved" and "system-reserved-cgroup" parameters, but I could not find a top-level cgroup with the absolute name I configured (/system-reserved-test in this case). If the "system-reserved-cgroup" flag was not provided, the reserved compute resources were not reflected in system.slice either.
The same happens with "kube-reserved".
Did I misunderstand something or configure it incorrectly?


Version-Release number of selected component (if applicable):
openshift v3.6.121
kubernetes v1.6.1+5115d708d7
etcd 3.2.0

How reproducible:
Always

Steps to Reproduce:
1. Set system-reserved and system-reserved-cgroup in node-config.yaml, then restart the node service

[Node]
kubeletArguments:
  cgroups-per-qos:
  - "true"
  cgroup-driver:
  - "systemd"
  enforce-node-allocatable:
  - "system-reserved"
  system-reserved:
  - "cpu=200m,memory=400Mi"
  system-reserved-cgroup:
  - "/system-reserved-test"
  experimental-allocatable-ignore-eviction:
  - 'true'

# systemctl restart atomic-openshift-node

    
2. Check node capacity and allocatable
# oc describe node <node> | grep -A7 Capacity
 

3. Check memory limit in the cgroup 
# ls /sys/fs/cgroup/memory 
# cat /sys/fs/cgroup/memory/system.slice/memory.limit_in_bytes


4. Set only system-reserved in node-config.yaml, then restart the node service
[Node]
kubeletArguments:
  cgroups-per-qos:
  - "true"
  cgroup-driver:
  - "systemd"
  enforce-node-allocatable:
  - "system-reserved" 
  system-reserved:
  - "cpu=200m,memory=400Mi"
  experimental-allocatable-ignore-eviction:
  - 'true'

# systemctl restart atomic-openshift-node


5. Check memory limit in the cgroup again


Actual results:
2. [root@ip-172-18-12-156 ~]# oc describe node | grep -A7 Capacity
Capacity:
 cpu:        1
 memory:    3688620Ki
 pods:        250
Allocatable:
 cpu:        800m
 memory:    3279020Ki
 pods:        250


3. [root@ip-172-18-3-157 ~]# ls /sys/fs/cgroup/memory | grep system-reserved-test
[root@ip-172-18-3-157 ~]# ls /sys/fs/cgroup/memory/system.slice  | grep system-reserved-test
[root@ip-172-18-3-157 ~]# cat /sys/fs/cgroup/memory/system.slice/memory.limit_in_bytes
9223372036854771712

5. The same result as in step 3
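
Note on the value seen in step 3: 9223372036854771712 is the cgroup v1 "no limit" default, i.e. the largest signed 64-bit integer rounded down to the 4096-byte page size, so no memory limit was applied to system.slice. A quick way to confirm the number, assuming Python is available on the host:

# python -c 'print((2**63 - 1) // 4096 * 4096)'
9223372036854771712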


Expected results:
2. Correct
3. There should be a cgroup directory named system-reserved-test under /sys/fs/cgroup/memory.
/sys/fs/cgroup/memory/system-reserved-test/memory.limit_in_bytes should be (kube-reserved) + (system-reserved) = 0 (not specified) + 400Mi = 400Mi = 419430400 bytes.
5. /sys/fs/cgroup/memory/system.slice/memory.limit_in_bytes should be equal to 400Mi (419430400 bytes).
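
For reference, here is the arithmetic behind these expectations and the Allocatable values in step 2, shown as shell arithmetic for convenience (eviction thresholds are left out because experimental-allocatable-ignore-eviction is set to true):

400Mi expressed in bytes:
# echo $((400 * 1024 * 1024))
419430400

Allocatable memory = capacity - system-reserved = 3688620Ki - 409600Ki:
# echo $((3688620 - 400 * 1024))
3279020

Allocatable CPU = 1000m - 200m = 800m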


Additional info:
According to the documentation (https://github.com/openshift/openshift-docs/pull/4532/files):

Example 2. Node Cgroup Settings

kubeletArguments:
  cgroups-per-qos:
    - "true" (1)
  cgroup-driver:
    - "systemd" (2)
  enforce-node-allocatable:
    - "pods" (3)

3. A comma-delimited list of scopes for where the node should enforce node resource constraints. Valid values are pods, system-reserved, and kube-reserved. The default is pods. We do not recommend users change this value.

Optionally, the node can be made to enforce kube-reserved and system-reserved by specifying those tokens in the enforce-node-allocatable flag. If specified, the corresponding --kube-reserved-cgroup or --system-reserved-cgroup needs to be provided. In future releases, the node and container runtime will be packaged in a common cgroup separate from system.slice. Until that time, we do not recommend users change the default value of enforce-node-allocatable flag.
Comment 1 Seth Jennings 2017-06-26 11:50:23 EDT
While setting enforce-node-allocatable to "pods" will create kubepods.slice and run pods inside it, "kube-reserved" and "system-reserved" enforcement works differently.

The kube-reserved-cgroup and system-reserved-cgroup flags tell the kubelet in which cgroups the kube and system services are _already running_.  The cgroups must be created and set up ahead of time, likely by systemd, and the *-cgroup flags just inform the kubelet about which cgroups already contain the kube and system services.  The kubelet does not create these.

In this model, system-reserved.slice (for example), kube-reserved.slice, and kubepods.slice would all be peers in the hierarchy each consuming disjoint sets of the total system resources.

Enabling a *-reserve means "save this amount of resource".  Turning on enforce-node-allocatable for that *-reserve means "save me this amount of resource and don't let me use more", which is problematic if the users haven't profiled the kubelet/docker/system services to see how much resource they use.  If they aren't careful, they can set the reserve too low and OOM kill the node, docker, or other system service.

That is why we don't recommend enabling it at this point.

In fact, I'm not sure if we even have a procedure for enabling it at all since it would require kubepods.slice become a peer of system.slice and the kubelet/node/docker being placed in a new kube.slice peer, for example.
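
For illustration only, here is a minimal sketch of that model, assuming a hypothetical top-level slice named system-reserved.slice that systemd creates ahead of time and that the kubelet is then told about via system-reserved-cgroup (the unit name, the MemoryLimit value, and the kubelet wiring are assumptions, not a supported procedure):

# cat /etc/systemd/system/system-reserved.slice
[Unit]
Description=Slice for non-Kubernetes system daemons (hypothetical example)

[Slice]
# cgroup v1 memory cap; kept in line with the kubelet's system-reserved memory value
MemoryLimit=400M

# systemctl daemon-reload
# systemctl start system-reserved.slice
# ls -d /sys/fs/cgroup/memory/system-reserved.slice

The kubelet would then be configured with system-reserved-cgroup set to "/system-reserved.slice", and the system daemons themselves would have to be placed into that slice (e.g. via Slice= in their unit files); the kubelet only applies limits to the cgroup it is told about, it does not move processes into it.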
Comment 2 Derek Carr 2017-07-10 14:53:02 EDT
At this time, we do not recommend setting `enforce-node-allocatable` to any value other than "pods".  The `system-reserved` and `kube-reserved` values are only used to reduce the amount of allocatable space that can be scheduled to the node.

If/when we do support setting `enforce-node-allocatable` to values other than "pods", `system-reserved-cgroup` would align with `/system.slice` and `kube-reserved-cgroup` would align with `/runtime.slice`, which does not yet exist.  The runtime.slice would be the cgroup that packages the openshift node and the container runtime (i.e., docker).  To support this deployment model, more packaging and install work is needed.
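
In other words, the supported shape today keeps enforcement on "pods" and uses the reservations only to shrink Allocatable. A minimal node-config.yaml sketch of that setup (the reservation sizes here are illustrative, not recommendations):

kubeletArguments:
  cgroups-per-qos:
  - "true"
  cgroup-driver:
  - "systemd"
  enforce-node-allocatable:
  - "pods"
  system-reserved:
  - "cpu=200m,memory=400Mi"
  kube-reserved:
  - "cpu=200m,memory=500Mi"

With this configuration the reservations are only subtracted from Capacity when computing Allocatable; no cgroup limits are imposed on system or kube daemons.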
