Bug 1999603
| Summary: | Memory Manager allows a Guaranteed QoS pod whose hugepages request is exactly equal to the leftover hugepages | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Niranjan Mallapadi Raghavender <mniranja> |
| Component: | Node | Assignee: | Artyom <alukiano> |
| Node sub component: | Memory manager | QA Contact: | Niranjan Mallapadi Raghavender <mniranja> |
| Status: | CLOSED ERRATA | Docs Contact: | Padraig O'Grady <pogrady> |
| Severity: | medium | Priority: | medium |
| CC: | aos-bugs, nagrawal, tsweeney | Version: | 4.10 |
| Target Milestone: | --- | Target Release: | 4.10.0 |
| Hardware: | x86_64 | OS: | Linux |
| Whiteboard: | | Fixed In Version: | |
| Doc Type: | Known Issue | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-03-10 16:06:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Doc Text (Known Issue):

Cause: A bug in the Memory Manager code.

Consequence: The Memory Manager can pin a container of a guaranteed pod, whose resources a single NUMA node could satisfy, to more than one NUMA node.

Workaround (if any): Do not start guaranteed pods whose containers request more memory than a single NUMA node can provide.

Result: The container can get undesired NUMA pinning, which can lead to latency or performance degradation.
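To apply the workaround in practice, check how many 2Mi hugepages each NUMA node still has free before sizing a guaranteed pod's hugepage request; a minimal sketch (the worker-0 node name from the reproduction below and the standard sysfs paths are assumed):

$ oc debug node/worker-0 -- chroot /host sh -c 'grep . /sys/devices/system/node/node*/hugepages/hugepages-2048kB/free_hugepages'

If the request is larger than the free pages of every single NUMA node, do not start the pod as guaranteed, per the workaround above.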
The bug should be fixed once OpenShift is rebased on top of Kubernetes 1.23. Not completed this sprint. The rebase is now complete and the bug should be fixed as a result of the rebase.

Versions:
========
oc version
Client Version: 4.10.0-0.nightly-2022-02-02-000921
Server Version: 4.10.0-0.nightly-2022-02-03-220350
Kubernetes Version: v1.23.3+b63be7f
PAO Version:
"msg": {
"architecture": "x86_64",
"build-date": "2022-02-02T19:59:27.762163",
"com.redhat.build-host": "cpt-1005.osbs.prod.upshift.rdu2.redhat.com",
"com.redhat.component": "performance-addon-operator-container",
"com.redhat.license_terms": "https://www.redhat.com/agreements",
"description": "performance-addon-operator",
"distribution-scope": "public",
"io.k8s.description": "performance-addon-operator",
"io.k8s.display-name": "performance-addon-operator",
"io.openshift.expose-services": "",
"io.openshift.maintainer.component": "Performance Addon Operator",
"io.openshift.maintainer.product": "OpenShift Container Platform",
"io.openshift.tags": "operator",
"maintainer": "openshift-operators",
"name": "openshift4/performance-addon-rhel8-operator",
"release": "28",
"summary": "performance-addon-operator",
"upstream-vcs-ref": "7e40c978acca61ea540fb10b34e826474d6a93cf",
"upstream-vcs-type": "git",
"upstream-version": "0.0.41001-2-g7e40c978",
"url": "https://access.redhat.com/containers/#/registry.access.redhat.com/openshift4/performance-addon-rhel8-operator/images/v4.10.0-28",
"vcs-ref": "8473aa2255f73db5523c2a665256ed6297a99025",
"vcs-type": "git",
"vendor": "Red Hat, Inc.",
"version": "v4.10.0"
Steps:
1. Create a performance profile as shown below:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    isolated: 5-19,45-59,20-39,60-79
    reserved: 0-4,40-44
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 20
      size: 2M
  net:
    userLevelNetworking: true
  nodeSelector:
    node-role.kubernetes.io/workercnf: ""
  numa:
    topologyPolicy: restricted
  realTimeKernel:
    enabled: true
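Before creating the pods, it is worth confirming that the profile actually created the 2Mi hugepages and that they are split across both NUMA nodes of the worker; a minimal sketch (worker-0 and the standard sysfs paths assumed):

$ oc debug node/worker-0 -- chroot /host sh -c 'grep . /sys/devices/system/node/node[01]/hugepages/hugepages-2048kB/nr_hugepages'

With the profile above (count: 20, size: 2M), each of the two NUMA nodes is expected to report 10 pages, i.e. 20Mi per node.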
2. Create a pod using the spec below:

apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  containers:
  - name: example-pod1
    image: fedora:latest
    command:
    - sleep
    - inf
    volumeMounts:
    - mountPath: /hugepages-2Mi
      name: hugepage-2mi
    resources:
      limits:
        hugepages-2Mi: 24Mi
        memory: "24Mi"
        cpu: "2"
      requests:
        hugepages-2Mi: 24Mi
        memory: "24Mi"
        cpu: "2"
  nodeSelector:
    kubernetes.io/hostname: "worker-0"
  volumes:
  - name: hugepage-2mi
    emptyDir:
      medium: HugePages-2Mi
3. Check pod status:
[root@registry bz-1999603]# oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod1 1/1 Running 0 3m2s 10.128.2.26 worker-0 <none> <none>
[root@registry bz-1999603]# oc debug node/worker-0
4. Get Memory Manager state file:
sh-4.4# cat memory_manager_state
{"policyName":"Static","machineState":{"0":{"numberOfAssignments":2,"memoryMap":{"hugepages-1Gi":{"total":0,"systemReserved":0,"allocatable":0,"reserved":0,"free":0},"hugepages-2Mi":{"total":20971520,"systemReserved":0,"allocatable":20971520,"reserved":20971520,"free":0},"memory":{"total":270146011136,"systemReserved":1153433600,"allocatable":268971606016,"reserved":25165824,"free":268946440192}},"cells":[0,1]},"1":{"numberOfAssignments":2,"memoryMap":{"hugepages-1Gi":{"total":0,"systemReserved":0,"allocatable":0,"reserved":0,"free":0},"hugepages-2Mi":{"total":20971520,"systemReserved":0,"allocatable":20971520,"reserved":4194304,"free":16777216},"memory":{"total":270531715072,"systemReserved":0,"allocatable":270510743552,"reserved":0,"free":270510743552}},"cells":[0,1]}},"entries":{"a9feb7f2-a1d4-4f7f-ae5f-ba5c2b60b254":{"example-pod1":[{"numaAffinity":[0,1],"type":"hugepages-2Mi","size":25165824},{"numaAffinity":[0,1],"type":"memory","size":25165824}]}},"checksum":279997132}sh-4.4#
5. Get the CPUs and NUMA nodes used by pod1:
[root@registry bz-1999603]# oc exec -ti pods/pod1 -- bash -c "cat /sys/fs/cgroup/cpuset/cpuset.cpus"
5,45
[root@registry bz-1999603]# oc exec -ti pods/pod1 -- bash -c "cat /sys/fs/cgroup/cpuset/cpuset.mems"
0-1
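The cpuset output shows the container's CPUs (5 and 45) and memory nodes (0-1). To see which NUMA node owns those CPUs, the CPU-to-node mapping can be listed from the debug shell on the worker; a minimal sketch using lscpu:

sh-4.4# lscpu -p=CPU,NODE | grep -E '^(5|45),'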
6. Create pod2 with the spec below:
[root@registry bz-1999603]# cat test2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod2
spec:
  containers:
  - name: example-pod2
    image: fedora:latest
    command:
    - sleep
    - inf
    volumeMounts:
    - mountPath: /hugepages-2Mi
      name: hugepage-2mi
    resources:
      limits:
        hugepages-2Mi: 16Mi
        memory: "16Mi"
        cpu: "2"
      requests:
        hugepages-2Mi: 16Mi
        memory: "16Mi"
        cpu: "2"
  nodeSelector:
    kubernetes.io/hostname: "worker-0"
  volumes:
  - name: hugepage-2mi
    emptyDir:
      medium: HugePages-2Mi
7. Creation of pod2 should fail:
[root@registry bz-1999603]# oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod1 1/1 Running 0 8m11s 10.128.2.26 worker-0 <none> <none>
pod2 0/1 ContainerStatusUnknown 0 13s 10.128.2.27 worker-0 <none> <none>
8. Check the pod2 status:
[root@registry bz-1999603]# oc describe pods/pod2
Name: pod2
Namespace: default
Priority: 0
Node: worker-0/10.46.80.2
Start Time: Fri, 04 Feb 2022 04:23:52 -0500
Labels: <none>
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["10.128.2.27/23"],"mac_address":"0a:58:0a:80:02:1b","gateway_ips":["10.128.2.1"],"ip_address":"10.128.2.27/23"...
k8s.v1.cni.cncf.io/network-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"10.128.2.27"
],
"mac": "0a:58:0a:80:02:1b",
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"10.128.2.27"
],
"mac": "0a:58:0a:80:02:1b",
"default": true,
"dns": {}
}]
Status: Failed
Reason: TopologyAffinityError
Message: Pod Resources cannot be allocated with Topology locality
IP: 10.128.2.27
IPs:
IP: 10.128.2.27
Containers:
example-pod2:
Container ID:
Image: fedora:latest
Image ID:
Port: <none>
Host Port: <none>
Command:
sleep
inf
State: Terminated
Reason: ContainerStatusUnknown
Message: The container could not be located when the pod was terminated
Exit Code: 137
Started: Mon, 01 Jan 0001 00:00:00 +0000
Finished: Mon, 01 Jan 0001 00:00:00 +0000
Ready: False
Restart Count: 0
Limits:
cpu: 2
hugepages-2Mi: 16Mi
memory: 16Mi
Requests:
cpu: 2
hugepages-2Mi: 16Mi
memory: 16Mi
Environment: <none>
Mounts:
/hugepages-2Mi from hugepage-2mi (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5529z (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
hugepage-2mi:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: HugePages-2Mi
SizeLimit: <unset>
kube-api-access-5529z:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: Guaranteed
Node-Selectors: kubernetes.io/hostname=worker-0
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 22s default-scheduler Successfully assigned default/pod2 to worker-0
Warning TopologyAffinityError 23s kubelet Resources cannot be allocated with Topology locality
Normal AddedInterface 20s multus Add eth0 [10.128.2.27/23] from ovn-kubernetes
Normal Pulling 20s kubelet Pulling image "fedora:latest"
Normal Pulled 16s kubelet Successfully pulled image "fedora:latest" in 3.597402945s
Warning Failed 15s kubelet Error: container create failed: parent closed synchronisation channel
As seen above, pod2 gets rejected with a TopologyAffinityError.
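The same rejection can also be picked up from the cluster events; a minimal sketch (field selectors on event reason and object name assumed to behave as in stock kubectl):

$ oc get events --field-selector reason=TopologyAffinityError
$ oc get events --field-selector involvedObject.name=pod2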
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
#### What happened:

Setup:
1. 3 master nodes and 2 worker nodes.
2. Kubernetes version: v1.22.0-rc.0+5c2f7cd
3. Memory Manager configured on Kubernetes v1.22.0-rc.0+5c2f7cd, with hugepages of size 2M configured on the 2 worker nodes.
4. Each worker node has 2 NUMA nodes (numa node0 and numa node1).
5. The number of hugepages of size 2M is 10 on each NUMA node:

sh-4.4# cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
10
sh-4.4# cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
10

After configuring memory, with 20M of hugepages (2M * 10) on each NUMA node, we create a Guaranteed QoS pod (pod1) which consumes 24Mi of hugepages (of size 2M).

Pod spec:
<snip>
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  containers:
  - name: example
    image: fedora:latest
    command:
    - sleep
    - inf
    volumeMounts:
    - mountPath: /hugepages-2Mi
      name: hugepage-2mi
    resources:
      limits:
        hugepages-2Mi: 24Mi
        memory: "24Mi"
        cpu: "2"
      requests:
        hugepages-2Mi: 24Mi
        memory: "24Mi"
        cpu: "2"
  volumes:
  - name: hugepage-2mi
    emptyDir:
      medium: HugePages-2Mi
</snip>

List pods:
<snip>
$ oc get pods
NAME   READY   STATUS    RESTARTS   AGE   IP             NODE                  NOMINATED NODE   READINESS GATES
pod1   1/1     Running   0          13s   10.128.2.109   worker1.example.org   <none>           <none>
</snip>

Memory Manager state file:
<snip>
{"policyName":"Static","machineState":{"0":{"numberOfAssignments":2,"memoryMap":{"hugepages-1Gi":{"total":0,"systemReserved":0,"allocatable":0,"reserved":0,"free":0},"hugepages-2Mi":{"total":20971520,"systemReserved":0,"allocatable":20971520,"reserved":20971520,"free":0},"memory":{"total":270146174976,"systemReserved":1153433600,"allocatable":268971769856,"reserved":104857600,"free":268866912256}},"cells":[0,1]},"1":{"numberOfAssignments":2,"memoryMap":{"hugepages-1Gi":{"total":0,"systemReserved":0,"allocatable":0,"reserved":0,"free":0},"hugepages-2Mi":{"total":20971520,"systemReserved":0,"allocatable":20971520,"reserved":4194304,"free":16777216},"memory":{"total":270531874816,"systemReserved":0,"allocatable":270510903296,"reserved":0,"free":270510903296}},"cells":[0,1]}},"entries":{"bbf8fd78-3c9d-4924-b4a2-450caaca6da3":{"example":[{"numaAffinity":[0,1],"type":"hugepages-2Mi","size":25165824},{"numaAffinity":[0,1],"type":"memory","size":104857600}]}},"checksum":1759502831}
</snip>

Now create a Guaranteed QoS pod (pod2) requesting 16Mi of hugepages on the same worker node.

Pod spec:
<snip>
apiVersion: v1
kind: Pod
metadata:
  name: pod2
spec:
  containers:
  - name: example
    image: fedora:latest
    command:
    - sleep
    - inf
    volumeMounts:
    - mountPath: /hugepages-2Mi
      name: hugepage-2mi
    resources:
      limits:
        hugepages-2Mi: "16Mi"
        memory: "100Mi"
        cpu: "2"
      requests:
        hugepages-2Mi: "16Mi"
        memory: "100Mi"
        cpu: "2"
  nodeSelector:
    kubernetes.io/hostname: "worker1.example.org"
  volumes:
  - name: hugepage-2mi
    emptyDir:
      medium: HugePages-2Mi
</snip>

List pods:
<snip>
NAME   READY   STATUS    RESTARTS   AGE     IP             NODE                  NOMINATED NODE   READINESS GATES
pod1   1/1     Running   0          4m39s   10.128.2.109   worker1.example.org   <none>           <none>
pod2   1/1     Running   0          11s     10.128.2.111   worker1.example.org   <none>           <none>
</snip>

Memory Manager state file after pod2 is deployed:
<snip>
{"policyName":"Static","machineState":{"0":{"numberOfAssignments":4,"memoryMap":{"hugepages-1Gi":{"total":0,"systemReserved":0,"allocatable":0,"reserved":0,"free":0},"hugepages-2Mi":{"total":20971520,"systemReserved":0,"allocatable":20971520,"reserved":20971520,"free":0},"memory":{"total":270146174976,"systemReserved":1153433600,"allocatable":268971769856,"reserved":209715200,"free":268762054656}},"cells":[0,1]},"1":{"numberOfAssignments":4,"memoryMap":{"hugepages-1Gi":{"total":0,"systemReserved":0,"allocatable":0,"reserved":0,"free":0},"hugepages-2Mi":{"total":20971520,"systemReserved":0,"allocatable":20971520,"reserved":20971520,"free":0},"memory":{"total":270531874816,"systemReserved":0,"allocatable":270510903296,"reserved":0,"free":270510903296}},"cells":[0,1]}},"entries":{"90a01c04-cfc4-401b-bdb8-85667384f002":{"example":[{"numaAffinity":[0,1],"type":"hugepages-2Mi","size":16777216},{"numaAffinity":[0,1],"type":"memory","size":104857600}]},"bbf8fd78-3c9d-4924-b4a2-450caaca6da3":{"example":[{"numaAffinity":[0,1],"type":"hugepages-2Mi","size":25165824},{"numaAffinity":[0,1],"type":"memory","size":104857600}]}},"checksum":3930981289}
</snip>

#### What you expected to happen:

Pod2 should be rejected.

#### How to reproduce it (as minimally and precisely as possible):

Steps provided above.

#### Anything else we need to know?:

#### Environment:

- Kubernetes version (use `kubectl version`): v1.22.0-rc.0+5c2f7cd
- Cloud provider or hardware configuration: OpenShift
- OS (e.g: `cat /etc/os-release`): Red Hat Enterprise Linux CoreOS release 4.9
- Kernel (e.g. `uname -a`): Linux helix02.lab.eng.tlv2.redhat.com 4.18.0-305.12.1.rt7.84.el8_4.x86_64 #1 SMP PREEMPT_RT Thu Jul 29 14:18:12 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

Kubelet configuration:
<snip>
spec:
  kubeletConfig:
    apiVersion: kubelet.config.k8s.io/v1beta1
    authentication:
      anonymous: {}
      webhook:
        cacheTTL: 0s
      x509: {}
    authorization:
      webhook:
        cacheAuthorizedTTL: 0s
        cacheUnauthorizedTTL: 0s
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
    evictionHard:
      memory.available: 100Mi
    evictionPressureTransitionPeriod: 0s
    fileCheckFrequency: 0s
    httpCheckFrequency: 0s
    imageMinimumGCAge: 0s
    kind: KubeletConfiguration
    kubeReserved:
      cpu: 1000m
      memory: 500Mi
    logging: {}
    memoryManagerPolicy: Static
    nodeStatusReportFrequency: 0s
    nodeStatusUpdateFrequency: 0s
    reservedMemory:
    - limits:
        memory: 1100Mi
      numaNode: 0
    reservedSystemCPUs: 0-4,40-44
    runtimeRequestTimeout: 0s
    shutdownGracePeriod: 0s
    shutdownGracePeriodCriticalPods: 0s
    streamingConnectionIdleTimeout: 0s
    syncFrequency: 0s
    systemReserved:
      cpu: 1000m
      memory: 500Mi
    topologyManagerPolicy: restricted
    volumeStatsAggPeriod: 0s
  machineConfigPoolSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker-cnf
</snip>