Bug 1999603 - Memory Manager allows a Guaranteed QoS Pod whose hugepages request is exactly equal to the leftover hugepages
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.10
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Artyom
QA Contact: Niranjan Mallapadi Raghavender
Docs Contact: Padraig O'Grady
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-08-31 12:12 UTC by Niranjan Mallapadi Raghavender
Modified: 2022-03-10 16:06 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: A bug in the Memory Manager code. Consequence: The Memory Manager can pin a container of a Guaranteed pod to more than one NUMA node, even when a single NUMA node could satisfy its resource requests. Workaround (if any): Do not start Guaranteed pods with container memory resource requests bigger than what a single NUMA node can provide. Result: The container can get undesired NUMA pinning, which can lead to latency or performance degradation.
Clone Of:
Environment:
Last Closed: 2022-03-10 16:06:30 UTC
Target Upstream Version:
Embargoed:


Links:
Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-10 16:06:50 UTC)

Description Niranjan Mallapadi Raghavender 2021-08-31 12:12:14 UTC
#### What happened:
Setup:
1. 3 master nodes and 2 worker nodes.
2. Kubernetes version: v1.22.0-rc.0+5c2f7cd
3. Memory Manager configured on Kubernetes v1.22.0-rc.0+5c2f7cd, with hugepages of size 2M configured on the 2 worker nodes.
4. Each worker node has 2 NUMA nodes (NUMA node0 and NUMA node1).
5. Each NUMA node has 10 hugepages of size 2M:

sh-4.4# cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages      
10
sh-4.4# cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages 
10
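
For completeness, the matching free counters can be read the same way (a small sketch; free_hugepages sits alongside nr_hugepages in the standard sysfs layout):

# print total and free 2M hugepages for every NUMA node
for n in /sys/devices/system/node/node*/hugepages/hugepages-2048kB; do
  echo "$n: total=$(cat "$n"/nr_hugepages) free=$(cat "$n"/free_hugepages)"
done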

After configuring the Memory Manager, with 20Mi of hugepages (10 * 2Mi) available on each NUMA node, we create a Guaranteed QoS pod (pod1) that requests 24Mi of 2Mi hugepages.
Pod Spec:

<snip>

apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  containers:
  - name: example
    image: fedora:latest
    command:
    - sleep
    - inf
    volumeMounts:
    - mountPath: /hugepages-2Mi
      name: hugepage-2mi
    resources:
      limits:
        hugepages-2Mi: 24Mi
        memory: "24Mi"
        cpu: "2"
      requests:
        hugepages-2Mi: 24Mi
        memory: "24Mi"
        cpu: "2"
  volumes:
  - name: hugepage-2mi
    emptyDir:
      medium: HugePages-2Mi

</snip>

List pods

<snip>

$ oc get pods

NAME   READY   STATUS    RESTARTS   AGE   IP             NODE                              NOMINATED NODE   READINESS GATES
pod1   1/1     Running   0          13s   10.128.2.109   worker1.example.org           <none>           <none>

</snip>

Memory Manager state file:

<snip>

{"policyName":"Static","machineState":{"0":{"numberOfAssignments":2,"memoryMap":{"hugepages-1Gi":{"total":0,"systemReserved":0,"allocatable":0,"reserved":0,"free":0},"hugepages-2Mi":{"total":20971520,"systemReserved":0,"allocatable":20971520,"reserved":20971520,"free":0},"memory":{"total":270146174976,"systemReserved":1153433600,"allocatable":268971769856,"reserved":104857600,"free":268866912256}},"cells":[0,1]},"1":{"numberOfAssignments":2,"memoryMap":{"hugepages-1Gi":{"total":0,"systemReserved":0,"allocatable":0,"reserved":0,"free":0},"hugepages-2Mi":{"total":20971520,"systemReserved":0,"allocatable":20971520,"reserved":4194304,"free":16777216},"memory":{"total":270531874816,"systemReserved":0,"allocatable":270510903296,"reserved":0,"free":270510903296}},"cells":[0,1]}},"entries":{"bbf8fd78-3c9d-4924-b4a2-450caaca6da3":{"example":[{"numaAffinity":[0,1],"type":"hugepages-2Mi","size":25165824},{"numaAffinity":[0,1],"type":"memory","size":104857600}]}},"checksum":1759502831}
</snip>
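
For reference, the per-container pinning can be extracted from the state file (the default location /var/lib/kubelet/memory_manager_state on the node is assumed, as is the availability of jq); for pod1's container this prints numaAffinity [0,1]:

<snip>
# list every memory block recorded per container, with its type, size and NUMA affinity
jq '.entries[][][] | {type, size, numaAffinity}' /var/lib/kubelet/memory_manager_state
</snip>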

Now create a Guaranteed QoS pod (pod2) requesting 16Mi of 2Mi hugepages on the same worker node.

Pod spec:

<snip>
apiVersion: v1
kind: Pod
metadata:
  name: pod2
spec:
  containers:
  - name: example
    image: fedora:latest
    command:
    - sleep
    - inf
    volumeMounts:
    - mountPath: /hugepages-2Mi
      name: hugepage-2mi
    resources:
      limits:
        hugepages-2Mi: "16Mi"
        memory: "100Mi"
        cpu: "2"
      requests:
        hugepages-2Mi: "16Mi"
        memory: "100Mi"
        cpu: "2"
  nodeSelector:
    kubernetes.io/hostname: "worker1.example.org"
  volumes:
  - name: hugepage-2mi
    emptyDir:
      medium: HugePages-2Mi
</snip>

List pods:

<snip>

NAME   READY   STATUS    RESTARTS   AGE     IP             NODE                              NOMINATED NODE   READINESS GATES
pod1   1/1     Running   0          4m39s   10.128.2.109   worker1.example.org              <none>                     <none>
pod2   1/1     Running   0          11s     10.128.2.111   worker1.example.org                 <none>                     <none>

</snip>

Memory Manager state file after pod2 is deployed.

<snip>
{"policyName":"Static","machineState":{"0":{"numberOfAssignments":4,"memoryMap":{"hugepages-1Gi":{"total":0,"systemReserved":0,"allocatable":0,"reserved":0,"free":0},"hugepages-2Mi":{"total":20971520,"systemReserved":0,"allocatable":20971520,"reserved":20971520,"free":0},"memory":{"total":270146174976,"systemReserved":1153433600,"allocatable":268971769856,"reserved":209715200,"free":268762054656}},"cells":[0,1]},"1":{"numberOfAssignments":4,"memoryMap":{"hugepages-1Gi":{"total":0,"systemReserved":0,"allocatable":0,"reserved":0,"free":0},"hugepages-2Mi":{"total":20971520,"systemReserved":0,"allocatable":20971520,"reserved":20971520,"free":0},"memory":{"total":270531874816,"systemReserved":0,"allocatable":270510903296,"reserved":0,"free":270510903296}},"cells":[0,1]}},"entries":{"90a01c04-cfc4-401b-bdb8-85667384f002":{"example":[{"numaAffinity":[0,1],"type":"hugepages-2Mi","size":16777216},{"numaAffinity":[0,1],"type":"memory","size":104857600}]},"bbf8fd78-3c9d-4924-b4a2-450caaca6da3":{"example":[{"numaAffinity":[0,1],"type":"hugepages-2Mi","size":25165824},{"numaAffinity":[0,1],"type":"memory","size":104857600}]}},"checksum":3930981289}

</snip>
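
Similarly, the leftover 2Mi hugepages per NUMA node can be read from machineState (same assumptions as above); after pod2 is admitted, both NUMA cells report free: 0:

<snip>
# show, per NUMA cell, how many bytes of 2Mi hugepages remain free
jq '.machineState | to_entries[] | {numaNode: .key, "hugepages-2Mi_free": .value.memoryMap["hugepages-2Mi"].free}' /var/lib/kubelet/memory_manager_state
</snip>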

#### What you expected to happen:
Pod2 should be rejected. 

#### How to reproduce it (as minimally and precisely as possible):
Steps are provided above.

#### Anything else we need to know?:

#### Environment:
- Kubernetes version (use `kubectl version`): v1.22.0-rc.0+5c2f7cd
- Cloud provider or hardware configuration: OpenShift
- OS (e.g. `cat /etc/os-release`): Red Hat Enterprise Linux CoreOS release 4.9
- Kernel (e.g. `uname -a`): Linux helix02.lab.eng.tlv2.redhat.com 4.18.0-305.12.1.rt7.84.el8_4.x86_64 #1 SMP PREEMPT_RT Thu Jul 29 14:18:12 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

Kubelet configuration:

<snip>
spec:
  kubeletConfig:
    apiVersion: kubelet.config.k8s.io/v1beta1
    authentication:
      anonymous: {}
      webhook:
        cacheTTL: 0s
      x509: {}
    authorization:
      webhook:
        cacheAuthorizedTTL: 0s
        cacheUnauthorizedTTL: 0s
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
    evictionHard:
      memory.available: 100Mi
    evictionPressureTransitionPeriod: 0s
    fileCheckFrequency: 0s
    httpCheckFrequency: 0s
    imageMinimumGCAge: 0s
    kind: KubeletConfiguration
    kubeReserved:
      cpu: 1000m
      memory: 500Mi
    logging: {}
    memoryManagerPolicy: Static
    nodeStatusReportFrequency: 0s
    nodeStatusUpdateFrequency: 0s
    reservedMemory:
    - limits:
        memory: 1100Mi
      numaNode: 0
    reservedSystemCPUs: 0-4,40-44
    runtimeRequestTimeout: 0s
    shutdownGracePeriod: 0s
    shutdownGracePeriodCriticalPods: 0s
    streamingConnectionIdleTimeout: 0s
    syncFrequency: 0s
    systemReserved:
      cpu: 1000m
      memory: 500Mi
    topologyManagerPolicy: restricted
    volumeStatsAggPeriod: 0s
  machineConfigPoolSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker-cnf

</snip>
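
If needed, the settings the kubelet actually runs with can be checked directly on the node (a sketch; /etc/kubernetes/kubelet.conf is assumed to be the rendered kubelet configuration path on the RHCOS node):

<snip>
# print the memory/topology manager related settings from the node's rendered kubelet config
oc debug node/worker1.example.org -- chroot /host \
  grep -iE 'memoryManagerPolicy|topologyManagerPolicy|reservedMemory|reservedSystemCPUs' /etc/kubernetes/kubelet.conf
</snip>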

Comment 2 Artyom 2021-12-19 09:49:45 UTC
The bug should be fixed once OpenShift is rebased on top of Kubernetes 1.23.

Comment 3 Tom Sweeney 2022-01-06 15:41:51 UTC
Not completed this sprint.

Comment 4 Artyom 2022-01-31 09:00:26 UTC
The rebase is completed and the bug should be fixed as a result of rebasing.

Comment 6 Niranjan Mallapadi Raghavender 2022-02-04 09:35:11 UTC
Versions:
========
oc version
Client Version: 4.10.0-0.nightly-2022-02-02-000921
Server Version: 4.10.0-0.nightly-2022-02-03-220350
Kubernetes Version: v1.23.3+b63be7f

PAO Version:

    "msg": {
        "architecture": "x86_64",
        "build-date": "2022-02-02T19:59:27.762163",
        "com.redhat.build-host": "cpt-1005.osbs.prod.upshift.rdu2.redhat.com",
        "com.redhat.component": "performance-addon-operator-container",
        "com.redhat.license_terms": "https://www.redhat.com/agreements",
        "description": "performance-addon-operator",
        "distribution-scope": "public",
        "io.k8s.description": "performance-addon-operator",
        "io.k8s.display-name": "performance-addon-operator",
        "io.openshift.expose-services": "",
        "io.openshift.maintainer.component": "Performance Addon Operator",
        "io.openshift.maintainer.product": "OpenShift Container Platform",
        "io.openshift.tags": "operator",
        "maintainer": "openshift-operators",
        "name": "openshift4/performance-addon-rhel8-operator",
        "release": "28",
        "summary": "performance-addon-operator",
        "upstream-vcs-ref": "7e40c978acca61ea540fb10b34e826474d6a93cf",
        "upstream-vcs-type": "git",
        "upstream-version": "0.0.41001-2-g7e40c978",
        "url": "https://access.redhat.com/containers/#/registry.access.redhat.com/openshift4/performance-addon-rhel8-operator/images/v4.10.0-28",
        "vcs-ref": "8473aa2255f73db5523c2a665256ed6297a99025",
        "vcs-type": "git",
        "vendor": "Red Hat, Inc.",
        "version": "v4.10.0"


Steps:

1. Create a performance profile as shown below (a quick check of the resulting node hugepages is sketched after the profile):

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    isolated: 5-19,45-59,20-39,60-79
    reserved: 0-4,40-44
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 20
      size: 2M
  net:
    userLevelNetworking: true
  nodeSelector:
    node-role.kubernetes.io/workercnf: ""
  numa:
    topologyPolicy: restricted
  realTimeKernel:
    enabled: true
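
As a quick sanity check (not part of the original steps), the node's advertised 2Mi hugepages can be confirmed once the profile has been applied:

# the node allocatable should reflect the 2M pages configured by the profile
oc describe node/worker-0 | grep -i hugepages-2Mi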

2. Create a pod using the spec below:

apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  containers:
  - name: example-pod1
    image: fedora:latest
    command:
    - sleep
    - inf
    volumeMounts:
    - mountPath: /hugepages-2Mi
      name: hugepage-2mi
    resources:
      limits:
        hugepages-2Mi: 24Mi
        memory: "24Mi"
        cpu: "2"
      requests:
        hugepages-2Mi: 24Mi
        memory: "24Mi"
        cpu: "2"
  nodeSelector:
    kubernetes.io/hostname: "worker-0"
  volumes:
  - name: hugepage-2mi
    emptyDir:
      medium: HugePages-2Mi

3. Check pod status:

[root@registry bz-1999603]# oc get pods -o wide
NAME   READY   STATUS    RESTARTS   AGE    IP            NODE       NOMINATED NODE   READINESS GATES
pod1   1/1     Running   0          3m2s   10.128.2.26   worker-0   <none>           <none>
[root@registry bz-1999603]# oc debug node/worker-0

4. Get Memory Manager state file:

sh-4.4# cat memory_manager_state
{"policyName":"Static","machineState":{"0":{"numberOfAssignments":2,"memoryMap":{"hugepages-1Gi":{"total":0,"systemReserved":0,"allocatable":0,"reserved":0,"free":0},"hugepages-2Mi":{"total":20971520,"systemReserved":0,"allocatable":20971520,"reserved":20971520,"free":0},"memory":{"total":270146011136,"systemReserved":1153433600,"allocatable":268971606016,"reserved":25165824,"free":268946440192}},"cells":[0,1]},"1":{"numberOfAssignments":2,"memoryMap":{"hugepages-1Gi":{"total":0,"systemReserved":0,"allocatable":0,"reserved":0,"free":0},"hugepages-2Mi":{"total":20971520,"systemReserved":0,"allocatable":20971520,"reserved":4194304,"free":16777216},"memory":{"total":270531715072,"systemReserved":0,"allocatable":270510743552,"reserved":0,"free":270510743552}},"cells":[0,1]}},"entries":{"a9feb7f2-a1d4-4f7f-ae5f-ba5c2b60b254":{"example-pod1":[{"numaAffinity":[0,1],"type":"hugepages-2Mi","size":25165824},{"numaAffinity":[0,1],"type":"memory","size":25165824}]}},"checksum":279997132}sh-4.4#

5. Get the CPUs and NUMA nodes used by pod1:

[root@registry bz-1999603]# oc exec -ti pods/pod1 -- bash -c "cat /sys/fs/cgroup/cpuset/cpuset.cpus"
5,45
[root@registry bz-1999603]# oc exec -ti pods/pod1 -- bash -c "cat /sys/fs/cgroup/cpuset/cpuset.mems"
0-1
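
The NUMA node behind each of the assigned CPUs can be cross-checked as well (a sketch, assuming lscpu from util-linux is available in the fedora image):

# map the CPUs reported in cpuset.cpus (5 and 45) to their NUMA nodes
oc exec -ti pods/pod1 -- bash -c "lscpu --parse=CPU,NODE | grep -E '^(5|45),'"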

6. Create pod2 with the spec below:

[root@registry bz-1999603]# cat test2.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: pod2
spec:
  containers:
  - name: example-pod2
    image: fedora:latest
    command:
    - sleep
    - inf
    volumeMounts:
    - mountPath: /hugepages-2Mi
      name: hugepage-2mi
    resources:
      limits:
        hugepages-2Mi: 16Mi
        memory: "16Mi"
        cpu: "2"
      requests:
        hugepages-2Mi: 16Mi
        memory: "16Mi"
        cpu: "2"
  nodeSelector:
    kubernetes.io/hostname: "worker-0"
  volumes:
  - name: hugepage-2mi
    emptyDir:
      medium: HugePages-2Mi

7. Creation of pod2 should fail:

[root@registry bz-1999603]# oc get pods -o wide
NAME   READY   STATUS                   RESTARTS   AGE     IP            NODE       NOMINATED NODE   READINESS GATES
pod1   1/1     Running                  0          8m11s   10.128.2.26   worker-0   <none>           <none>
pod2   0/1     ContainerStatusUnknown   0          13s     10.128.2.27   worker-0   <none>           <none>

8. Check the pod2 status:

[root@registry bz-1999603]# oc describe pods/pod2
Name:         pod2
Namespace:    default
Priority:     0
Node:         worker-0/10.46.80.2
Start Time:   Fri, 04 Feb 2022 04:23:52 -0500
Labels:       <none>
Annotations:  k8s.ovn.org/pod-networks:
                {"default":{"ip_addresses":["10.128.2.27/23"],"mac_address":"0a:58:0a:80:02:1b","gateway_ips":["10.128.2.1"],"ip_address":"10.128.2.27/23"...
              k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "ovn-kubernetes",
                    "interface": "eth0",
                    "ips": [
                        "10.128.2.27"
                    ],
                    "mac": "0a:58:0a:80:02:1b",
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "ovn-kubernetes",
                    "interface": "eth0",
                    "ips": [
                        "10.128.2.27"
                    ],
                    "mac": "0a:58:0a:80:02:1b",
                    "default": true,
                    "dns": {}
                }]
Status:       Failed
Reason:       TopologyAffinityError
Message:      Pod Resources cannot be allocated with Topology locality
IP:           10.128.2.27   
IPs:
  IP:  10.128.2.27
Containers:
  example-pod2:
    Container ID:
    Image:         fedora:latest
    Image ID:
    Port:          <none>   
    Host Port:     <none>   
    Command:
      sleep
      inf
    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False   
    Restart Count:  0
    Limits:
      cpu:            2
      hugepages-2Mi:  16Mi  
      memory:         16Mi  
    Requests:
      cpu:            2
      hugepages-2Mi:  16Mi  
      memory:         16Mi  
    Environment:      <none>
    Mounts:
      /hugepages-2Mi from hugepage-2mi (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5529z (ro)
Conditions:
  Type              Status  
  Initialized       True
  Ready             False   
  ContainersReady   False   
  PodScheduled      True
Volumes:
  hugepage-2mi:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     HugePages-2Mi
    SizeLimit:  <unset>
  kube-api-access-5529z:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Guaranteed
Node-Selectors:              kubernetes.io/hostname=worker-0
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                 Age   From               Message
  ----     ------                 ----  ----               -------
  Normal   Scheduled              22s   default-scheduler  Successfully assigned default/pod2 to worker-0
  Warning  TopologyAffinityError  23s   kubelet            Resources cannot be allocated with Topology locality
  Normal   AddedInterface         20s   multus             Add eth0 [10.128.2.27/23] from ovn-kubernetes
  Normal   Pulling                20s   kubelet            Pulling image "fedora:latest"
  Normal   Pulled                 16s   kubelet            Successfully pulled image "fedora:latest" in 3.597402945s
  Warning  Failed                 15s   kubelet            Error: container create failed: parent closed synchronisation channel

As seen above, pod2 gets rejected.
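
The rejection reason can also be read straight from the pod status fields shown in the describe output above:

# expected to print the Reason and Message fields (TopologyAffinityError)
oc get pod pod2 -o jsonpath='{.status.reason}: {.status.message}{"\n"}'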

Comment 8 errata-xmlrpc 2022-03-10 16:06:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

