Bug 1464367

Summary: Incorrect memory limit calculation for kubepods in the cgroup hierarchy
Product: OpenShift Container Platform Reporter: Qixuan Wang <qixuan.wang>
Component: Node    Assignee: Derek Carr <decarr>
Status: CLOSED NOTABUG QA Contact: Qixuan Wang <qixuan.wang>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 3.6.0    CC: aos-bugs, jokerman, mmccomas, sjenning
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-06-23 20:46:11 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Qixuan Wang 2017-06-23 09:12:56 UTC
Description of problem:
Comparing the value of /sys/fs/cgroup/memory/kubepods.slice/memory.limit_in_bytes with experimental-allocatable-ignore-eviction enabled and disabled, I found that the result is always equal to Node.Capacity, not Node.Allocatable (according to https://github.com/kubernetes/community/pull/348/files , line#170: "kubepods or kubepods.slice (Node Allocatable enforced here by Kubelet)").


Version-Release number of selected component (if applicable):
openshift v3.6.121
kubernetes v1.6.1+5115d708d7
etcd 3.2.0

How reproducible:
Always

Steps to Reproduce:
1. Although no eviction threshold is explicitly configured for the kubelet, one is enabled by default: memory.available<100Mi

2. Check memory limit in node description and cgroup
# oc describe node <node> | grep -A7 Capacity
# cat /sys/fs/cgroup/memory/kubepods.slice/memory.limit_in_bytes

3. Add the following to the node's node-config.yaml and restart the node service
kubeletArguments:
experimental-allocatable-ignore-eviction:
- 'true'
# systemctl restart atomic-openshift-node

4. Check memory limit in node description and cgroup again


Actual results:
2. [root@ip-172-18-12-156 ~]# oc describe node <node> | grep -A7 Capacity
Capacity:
 cpu:        1
 memory:    3688620Ki
 pods:        250
Allocatable:
 cpu:        1
 memory:    3586220Ki
 pods:        250

[root@ip-172-18-3-157 ~]# cat /sys/fs/cgroup/memory/kubepods.slice/memory.limit_in_bytes
3777146880

3777146880 bytes = 3688620Ki = Capacity   # I think this is wrong; it should equal Allocatable (3586220Ki)

4. [root@ip-172-18-12-156 ~]# oc describe node <node> | grep -A7 Capacity
Capacity:
 cpu:        1
 memory:    3688620Ki
 pods:        250
Allocatable:
 cpu:        1
 memory:    3688620Ki
 pods:        250

[root@ip-172-18-3-157 node]# cat /sys/fs/cgroup/memory/kubepods.slice/memory.limit_in_bytes
3777146880

3777146880 bytes = 3688620Ki = Capacity = Allocatable  # Correct


Expected results:
2. Taking the default eviction threshold (memory.available<100Mi) into consideration, the Capacity/Allocatable values in the node description are correct. However, /sys/fs/cgroup/memory/kubepods.slice/memory.limit_in_bytes should be equal to Node.Allocatable (3586220Ki = 3672289280 bytes).


Additional info:

Comment 1 Seth Jennings 2017-06-23 20:46:11 UTC
This is by design, though it is admittedly confusing.

For the hard eviction threshold to ever be triggered, the kubepods cgroup must be allowed to exceed that threshold. So eviction-hard is subtracted from capacity when calculating allocatable, but it is not included in the calculation of memory.limit_in_bytes. Kube-reserved and system-reserved, on the other hand, are subtracted from capacity when calculating both allocatable AND memory.limit_in_bytes.
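The arithmetic Seth describes can be sketched with the numbers from this report. This is an illustrative calculation, not kubelet code; kube-reserved and system-reserved are assumed to be 0 here since neither is configured in this environment:

```shell
# Node.Capacity from 'oc describe node', converted from Ki to bytes
capacity=$((3688620 * 1024))          # 3777146880
eviction_hard=$((100 * 1024 * 1024))  # default memory.available<100Mi
kube_reserved=0                       # assumed: not configured in this report
system_reserved=0                     # assumed: not configured in this report

# Allocatable subtracts the reserved values AND the eviction threshold:
allocatable=$((capacity - kube_reserved - system_reserved - eviction_hard))
echo "$allocatable"   # 3672289280 = 3586220Ki, matching 'oc describe node'

# kubepods.slice memory.limit_in_bytes subtracts only the reserved values,
# so pod memory usage can cross the eviction threshold and trigger eviction:
limit=$((capacity - kube_reserved - system_reserved))
echo "$limit"         # 3777146880, matching the cgroup file
```

This is why the cgroup limit equals Capacity rather than Allocatable when only the default eviction threshold is in play: if the limit were set to Allocatable, the cgroup's hard limit would be hit before the eviction threshold could ever fire.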