Bug 1813567 - Topology Manager policy not respected when creating pods concurrently
Summary: Topology Manager policy not respected when creating pods concurrently
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.5
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Victor Pickard
QA Contact: Walid A.
URL:
Whiteboard:
Depends On:
Blocks: 1834979
 
Reported: 2020-03-14 13:38 UTC by zenghui.shi
Modified: 2023-10-06 19:25 UTC (History)
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1834979 (view as bug list)
Environment:
Last Closed: 2020-07-13 17:20:05 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:20:31 UTC

Description zenghui.shi 2020-03-14 13:38:29 UTC
Description of problem:

When several pods are created simultaneously on a node with the CPU Manager static policy and the Topology Manager single-numa-node policy enabled, device and CPU alignment on a single NUMA node is not guaranteed. A pod can still be admitted even when its topology affinity is not met.
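
For reference, enabling these two policies on an OCP worker pool is normally done through a KubeletConfig CR along the lines of the sketch below (the CR name and the machine config pool label are placeholders, not values from this cluster, and the exact feature-gate requirements differ between releases while Topology Manager is still alpha/beta):

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpumanager-enabled                 # placeholder name
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled   # assumes this label is set on the worker MCP
  kubeletConfig:
    cpuManagerPolicy: static               # static CPU Manager policy
    cpuManagerReconcilePeriod: 5s
    topologyManagerPolicy: single-numa-node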


Version-Release number of selected component (if applicable):

Client Version: 4.5.0-0.nightly-2020-03-12-003015
Server Version: 4.5.0-0.nightly-2020-03-12-003015
Kubernetes Version: v1.17.1

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:

pod can be admitted

Expected results:

topology affinity error when admitting pod


Additional info:

Test was done with SR-IOV device on baremetal OCP deployment.
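
For context, the `openshift.io/intelnics` resource and the `sriov-intel` net-attach-def referenced in the pod specs below are typically created by the SR-IOV Network Operator from CRs roughly like the following (a sketch only; the node selector, target namespace, and IPAM values are placeholders, not this cluster's actual configuration, while the PF name and VF count match the device shown below):

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: intelnics
  namespace: openshift-sriov-network-operator
spec:
  resourceName: intelnics                  # surfaces as openshift.io/intelnics
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"   # placeholder selector
  numVfs: 4                                # matches the 4 VFs shown below
  nicSelector:
    pfNames: ["ens787f1"]
  deviceType: netdevice
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-intel                        # becomes the net-attach-def name
  namespace: openshift-sriov-network-operator
spec:
  resourceName: intelnics
  networkNamespace: default                # placeholder target namespace
  ipam: '{ "type": "host-local", "subnet": "10.56.217.0/24" }'   # placeholder IPAM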

SRIOV Device on NUMA node 1:

[core@ci-worker-0 ~]$ ip link show ens787f1
5: ens787f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 3c:fd:fe:cd:35:49 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 1 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 2 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 3 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off

[core@ci-worker-0 ~]$ cat /sys/class/net/ens787f1/device/numa_node  
1
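
For completeness, the NUMA locality of the individual VFs can be checked the same way via the virtfn* links under the PF device (a quick sketch, run on the worker):

for vf in /sys/class/net/ens787f1/device/virtfn*; do cat "$vf/numa_node"; done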


CPUs on each NUMA node:

[core@ci-worker-0 ~]$ lscpu | grep NUMA
NUMA node(s):        2
NUMA node0 CPU(s):   0-17,36-53
NUMA node1 CPU(s):   18-35,54-71


Create two pods simultaneously with `oc create -f pod1.yaml -f pod2.yaml`:
1) pod1 requests 30 CPUs + 1 SR-IOV device
2) pod2 requests 10 CPUs + 1 SR-IOV device

pod1 spec:

apiVersion: v1
kind: Pod
metadata:
  name: testpod1
  annotations:
    k8s.v1.cni.cncf.io/networks: 'sriov-intel'  <== sriov net-attach-def
spec:
  nodeSelector:
    kubernetes.io/hostname: ci-worker-0         <== create pod on worker-0
  containers:
  - name: appcntr1
    image: zenghui/centos-dpdk
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        openshift.io/intelnics: 1              <== request SR-IOV device
        cpu: 30                                <== request 30 CPUs
        memory: 1000Mi
      limits:
        openshift.io/intelnics: 1
        cpu: 30
        memory: 1000Mi

pod2 spec:

apiVersion: v1
kind: Pod
metadata:
  name: testpod2
  annotations:
    k8s.v1.cni.cncf.io/networks: 'sriov-intel'  <== sriov net-attach-def
spec:
  nodeSelector:
    kubernetes.io/hostname: ci-worker-0         <== create pod on worker-0
  containers:
  - name: appcntr1
    image: zenghui/centos-dpdk
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        openshift.io/intelnics: 1              <== request SR-IOV device
        cpu: 10                                <== request 10 CPUs
        memory: 1000Mi
      limits:
        openshift.io/intelnics: 1
        cpu: 10
        memory: 1000Mi


Both pod1 and pod2 are admitted without hitting a topology affinity error. [unexpected]

# oc get pods -o wide
NAME       READY   STATUS    RESTARTS   AGE     IP            NODE          NOMINATED NODE   READINESS GATES
testpod1   1/1     Running   0          3m19s   10.128.2.39   ci-worker-0   <none>           <none>
testpod2   1/1     Running   0          3m19s   10.128.2.40   ci-worker-0   <none>           <none>

[root@nfvpe-03 templates]# oc exec testpod1 -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
1-2,23-35,37-38,59-71      <== pod1 created successfully with CPUs on both NUMA nodes
[root@nfvpe-03 templates]# oc exec testpod2 -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
18-22,54-58          
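
Comparing that cpuset against the kernel's per-node CPU lists makes the cross-NUMA allocation explicit: pod1's set contains 1-2 and 37-38 from node 0 plus 23-35 and 59-71 from node 1. The sysfs paths below are standard; their output should match the lscpu ranges shown earlier:

cat /sys/devices/system/node/node0/cpulist    # 0-17,36-53 per lscpu above
cat /sys/devices/system/node/node1/cpulist    # 18-35,54-71 per lscpu above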


Delete pod1 and re-create it:
1) `oc delete -f pod1.yaml`
2) `oc create -f pod1.yaml`

# oc get pods -o wide
NAME       READY   STATUS                    RESTARTS   AGE     IP            NODE          NOMINATED NODE   READINESS GATES
testpod1   0/1     Topology Affinity Error   0          6s      <none>        ci-worker-0   <none>           <none>
testpod2   1/1     Running                   0          7m37s   10.128.2.40   ci-worker-0   <none>           <none>


Pod1 admission failed with a topology affinity error. [expected]
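
The admission failure is also visible in the pod's events, for example (a sketch; the exact event text can vary by release):

oc describe pod testpod1 | grep -i -A 3 "topology"
oc get events --field-selector involvedObject.name=testpod1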

Comment 1 Deepthi Dharwar 2020-03-16 06:52:59 UTC
A bug was recently fixed upstream by adding a mutex, which should address this issue:
https://github.com/kubernetes/kubernetes/commit/e8538d9b76abe944a61eab10bfc2a580974f25fd
We should check whether that fix has been backported.
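
One quick way to check whether that commit has landed on a given branch, assuming a local clone of kubernetes/kubernetes (the release-1.17 branch below is just an example):

git fetch origin
git branch -r --contains e8538d9b76abe944a61eab10bfc2a580974f25fd
# or test one branch explicitly:
git merge-base --is-ancestor e8538d9b76abe944a61eab10bfc2a580974f25fd origin/release-1.17 \
  && echo "commit present" || echo "commit missing"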

Comment 2 Victor Pickard 2020-03-16 20:34:50 UTC
This appears to be an issue in CPU Manager. I tried this on u/s k8s against release-1.16 and also against the tip of master (1.18).

In 1.16, I see that the CPU assignment for testpod1 (below) has CPUs from both NUMA nodes. I did not see (could not reproduce) this issue in 1.18.

If I delete testpod1 and recreate it, the CPU assignment is correct (single NUMA node):

[root@nfvsdn-20 hack]# kubectl create -f ~/pod1.yaml -f ~/pod2.yaml
pod/testpod1 created
pod/testpod2 created
[root@nfvsdn-20 hack]# kubectl get pods --all-namespaces
NAMESPACE     NAME                        READY   STATUS    RESTARTS   AGE
default       testpod1                    1/1     Running   0          3s
default       testpod2                    1/1     Running   0          3s
kube-system   kube-dns-68496566b5-5bj98   3/3     Running   0          8m54s
[root@nfvsdn-20 hack]# kubectl exec testpod1 -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
1,3,5,7,9,12-13,15,17,19
[root@nfvsdn-20 hack]# kubectl exec testpod2 -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
2,4,6,8,10,14,16,18,20,22
[root@nfvsdn-20 hack]# lscpu |grep NUMA
NUMA node(s):          2
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23

[root@nfvsdn-20 hack]# cat ~/pod1.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: testpod1
spec:
  containers:
  - name: appcntr1
    image: zenghui/centos-dpdk
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        cpu: 10                                
        memory: 1000Mi
      limits:
        cpu: 10
        memory: 1000Mi
[root@nfvsdn-20 hack]# 

[root@nfvsdn-20 hack]# cat ~/pod2.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: testpod2
spec:
  containers:
  - name: appcntr1
    image: zenghui/centos-dpdk
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        cpu: 10
        memory: 1000Mi
      limits:
        cpu: 10
        memory: 1000Mi
[root@nfvsdn-20 hack]#

Comment 3 Victor Pickard 2020-03-19 17:33:12 UTC
I took a closer look at this issue to make sure I understood why it is broken in 1.16 and works in 1.18.

In 1.18, there were several fixes/enhancements to Topology Manager and the registered hint providers (CPU Manager, Device Manager). Specifically, https://github.com/kubernetes/kubernetes/pull/87759 resolves https://github.com/kubernetes/kubernetes/issues/83476, which is exactly the issue reported in this BZ.

The summary of that issue is:
Hint generation and resource allocation are unreliable for all but the first container (or pod).

So, we have to decide whether we want all of the changes related to https://github.com/kubernetes/kubernetes/pull/87759 (and dependent PRs) cherry-picked to OCP 4.3 (once Topology Manager is GA in OCP 4.5).

I'll include some logs here for reference. Basically, in 1.16, the sequence is:

1. Generate hint for testpod1
2. Generate hint for testpod2
3. Allocate CPUs for testpod1
4. Allocate CPUs for testpod2  <<== here, there are not enough CPUs on NUMA 0, so some are allocated from NUMA 1

1.16 logs for Topology Manager and CPU Manager when creating 2 pods at the same time (kubectl create -f pod1.yaml -f pod2.yaml).

Notice in this log sequence:

I0317 17:32:40.943414   46476 topology_manager.go:308] [topologymanager] Topology Admit Handler
I0317 17:32:40.943430   46476 topology_manager.go:317] [topologymanager] Pod QoS Level: Guaranteed
I0317 17:32:40.943446   46476 topology_manager.go:194] [topologymanager] Hint Provider has no preference for NUMA affinity with any resource
I0317 17:32:40.943532   46476 topology_hints.go:60] [cpumanager] TopologyHints generated for pod 'testpod1', container 'appcntr1': [{0000000000000000000000000000000000000000000000000000000000000001 true} {0000000000000000000000000000000000000000000000000000000000000010 true} {0000000000000000000000000000000000000000000000000000000000000011 false}]



The hint calculated for testpod1
********************************* 

I0317 17:32:40.943583   46476 topology_manager.go:285] [topologymanager] ContainerTopologyHint: {0000000000000000000000000000000000000000000000000000000000000001 true}


I0317 17:32:40.943707   46476 topology_manager.go:329] [topologymanager] Topology Affinity for Pod: d9725981-d53e-4bd7-8f6a-e2a28d0ff304 are map[appcntr1:{0000000000000000000000000000000000000000000000000000000000000001 true}]
I0317 17:32:40.946107   46476 topology_manager.go:308] [topologymanager] Topology Admit Handler
I0317 17:32:40.946123   46476 topology_manager.go:317] [topologymanager] Pod QoS Level: Guaranteed
I0317 17:32:40.946159   46476 topology_manager.go:194] [topologymanager] Hint Provider has no preference for NUMA affinity with any resource
I0317 17:32:40.946245   46476 topology_hints.go:60] [cpumanager] TopologyHints generated for pod 'testpod2', container 'appcntr1': [{0000000000000000000000000000000000000000000000000000000000000001 true} {0000000000000000000000000000000000000000000000000000000000000010 true} {0000000000000000000000000000000000000000000000000000000000000011 false}]



The hint calculated for testpod2. Notice it is the same as the hint for testpod1, which is the root of the problem:
there are not enough CPUs available on this NUMA node, so CPU Manager allocates CPUs from the other
NUMA node.
******************************************************************************************************
I0317 17:32:40.946279   46476 topology_manager.go:285] [topologymanager] ContainerTopologyHint: {0000000000000000000000000000000000000000000000000000000000000001 true}



I0317 17:32:40.946296   46476 topology_manager.go:329] [topologymanager] Topology Affinity for Pod: 91ef9e1f-eb23-40ee-b33e-77f154859e38 are map[appcntr1:{0000000000000000000000000000000000000000000000000000000000000001 true}]
I0317 17:32:42.197351   46476 policy_static.go:195] [cpumanager] static policy: AddContainer (pod: testpod1, container: appcntr1, container id: e6f74fdbf7fcc55adb85811e8a5d83c6bcbf9e74cc3db7d697cbcc4d92d82f9b)
I0317 17:32:42.197407   46476 policy_static.go:221] [cpumanager] Pod d9725981-d53e-4bd7-8f6a-e2a28d0ff304, Container appcntr1 Topology Affinity is: {0000000000000000000000000000000000000000000000000000000000000001 true}
I0317 17:32:42.197475   46476 policy_static.go:254] [cpumanager] allocateCpus: (numCPUs: 10, socket: 0000000000000000000000000000000000000000000000000000000000000001)
I0317 17:32:42.197645   46476 state_mem.go:84] [cpumanager] updated default cpuset: "0-1,3,5,7,9,11-13,15,17,19,21,23"
I0317 17:32:42.198877   46476 policy_static.go:287] [cpumanager] allocateCPUs: returning "2,4,6,8,10,14,16,18,20,22"


I0317 17:32:42.198910   46476 state_mem.go:76] [cpumanager] updated desired cpuset (container id: e6f74fdbf7fcc55adb85811e8a5d83c6bcbf9e74cc3db7d697cbcc4d92d82f9b, cpuset: "2,4,6,8,10,14,16,18,20,22")
I0317 17:32:42.227317   46476 policy_static.go:195] [cpumanager] static policy: AddContainer (pod: testpod2, container: appcntr1, container id: 41ee4e2904d1287bd705dc650462bdeea70a8c6865a1f3e8b2c5ba39e561a0e3)
I0317 17:32:42.227355   46476 policy_static.go:221] [cpumanager] Pod 91ef9e1f-eb23-40ee-b33e-77f154859e38, Container appcntr1 Topology Affinity is: {0000000000000000000000000000000000000000000000000000000000000001 true}
I0317 17:32:42.227374   46476 policy_static.go:254] [cpumanager] allocateCpus: (numCPUs: 10, socket: 0000000000000000000000000000000000000000000000000000000000000001)
I0317 17:32:42.227514   46476 state_mem.go:84] [cpumanager] updated default cpuset: "0,11,21,23"
I0317 17:32:42.228133   46476 policy_static.go:287] [cpumanager] allocateCPUs: returning "1,3,5,7,9,12-13,15,17,19"


I0317 17:32:42.228164   46476 state_mem.go:76] [cpumanager] updated desired cpuset (container id: 41ee4e2904d1287bd705dc650462bdeea70a8c6865a1f3e8b2c5ba39e561a0e3, cpuset: "1,3,5,7,9,12-13,15,17,19")



1.18 logs for the same scenario. Notice that the sequence here is:

1. Generate hints for pod1
2. Allocate resources for pod1
3. Generate hints for pod2
4. Allocate resources for pod2

I0317 20:19:35.992273   10554 topology_manager.go:233] [topologymanager] Topology Admit Handler
I0317 20:19:35.992326   10554 topology_manager.go:180] [topologymanager] TopologyHints for pod 'testpod1', container 'appcntr1': map[]
I0317 20:19:35.992488   10554 policy_static.go:329] [cpumanager] TopologyHints generated for pod 'testpod1', container 'appcntr1': [{01 true} {10 true} {11 false}]
I0317 20:19:35.992531   10554 topology_manager.go:180] [topologymanager] TopologyHints for pod 'testpod1', container 'appcntr1': map[cpu:[{01 true} {10 true} {11 false}]]
I0317 20:19:35.992581   10554 policy.go:70] [topologymanager] Hint Provider has no preference for NUMA affinity with any resource

Hint for testpod1
******************
I0317 20:19:35.992600   10554 topology_manager.go:199] [topologymanager] ContainerTopologyHint: {01 true}
I0317 20:19:35.992614   10554 topology_manager.go:258] [topologymanager] Topology Affinity for (pod: 4d4fe6a9-ce0b-42b8-b097-764cfdbe8440 container: appcntr1): {01 true}
I0317 20:19:35.993434   10554 policy_static.go:193] [cpumanager] static policy: Allocate (pod: testpod1, container: appcntr1)
I0317 20:19:35.993466   10554 policy_static.go:203] [cpumanager] Pod 4d4fe6a9-ce0b-42b8-b097-764cfdbe8440, Container appcntr1 Topology Affinity is: {01 true}
I0317 20:19:35.993503   10554 policy_static.go:239] [cpumanager] allocateCpus: (numCPUs: 10, socket: 01)

Allocate
********
I0317 20:19:35.993733   10554 state_mem.go:88] [cpumanager] updated default cpuset: "0-1,3,5,7,9,11-13,15,17,19,21,23"
I0317 20:19:35.994799   10554 policy_static.go:272] [cpumanager] allocateCPUs: returning "2,4,6,8,10,14,16,18,20,22"
I0317 20:19:35.995109   10554 state_mem.go:80] [cpumanager] updated desired cpuset (pod: 4d4fe6a9-ce0b-42b8-b097-764cfdbe8440, container: appcntr1, cpuset: "2,4,6,8,10,14,16,18,20,22")
I0317 20:19:35.997666   10554 topology_manager.go:233] [topologymanager] Topology Admit Handler
I0317 20:19:35.997770   10554 topology_manager.go:180] [topologymanager] TopologyHints for pod 'testpod2', container 'appcntr1': map[]
I0317 20:19:35.998118   10554 policy_static.go:329] [cpumanager] TopologyHints generated for pod 'testpod2', container 'appcntr1': [{10 true} {11 false}]
I0317 20:19:35.998218   10554 topology_manager.go:180] [topologymanager] TopologyHints for pod 'testpod2', container 'appcntr1': map[cpu:[{10 true} {11 false}]]
I0317 20:19:35.998278   10554 policy.go:70] [topologymanager] Hint Provider has no preference for NUMA affinity with any resource

Hint for testpod2
******************
I0317 20:19:35.998314   10554 topology_manager.go:199] [topologymanager] ContainerTopologyHint: {10 true}
I0317 20:19:35.998364   10554 topology_manager.go:258] [topologymanager] Topology Affinity for (pod: b8bc3582-f330-4da0-8c0e-fba14e278e7d container: appcntr1): {10 true}
I0317 20:19:35.999325   10554 policy_static.go:193] [cpumanager] static policy: Allocate (pod: testpod2, container: appcntr1)
I0317 20:19:35.999367   10554 policy_static.go:203] [cpumanager] Pod b8bc3582-f330-4da0-8c0e-fba14e278e7d, Container appcntr1 Topology Affinity is: {10 true}
I0317 20:19:35.999508   10554 policy_static.go:239] [cpumanager] allocateCpus: (numCPUs: 10, socket: 10)
I0317 20:19:36.000065   10554 state_mem.go:88] [cpumanager] updated default cpuset: "0,11-12,23"

Allocate
*********
I0317 20:19:36.000674   10554 policy_static.go:272] [cpumanager] allocateCPUs: returning "1,3,5,7,9,13,15,17,19,21"
I0317 20:19:36.000704   10554 state_mem.go:80] [cpumanager] updated desired cpuset (pod: b8bc3582-f330-4da0-8c0e-fba14e278e7d, container: appcntr1, cpuset: "1,3,5,7,9,13,15,17,19,21")

Comment 5 Victor Pickard 2020-05-12 16:36:42 UTC
This issue should be resolved in 4.5 now that Topology Manager has reached Beta status. Can you please retest to confirm?

Also, we need to clone this issue to 4.4, as it definitely exists in that release and will not be fixed without a fairly significant backport, which has been deferred for now.

Comment 10 Walid A. 2020-06-10 07:25:38 UTC
Verified on an OCP 4.5.0-0.nightly-2020-05-14-021132 baremetal IPI-installed cluster with 3 masters and 2 worker nodes.

# oc create -f pod1_20cpu_4sriov.yaml -f pod2_18cpu_4sriov.yaml
pod/pod1cpu20sriov4 created
pod/pod2cpu18sriov4 created


# oc get pods
NAME              READY   STATUS    RESTARTS   AGE
pod1cpu20sriov4   1/1     Running   0          7s
pod2cpu18sriov4   1/1     Running   0          7s

# oc exec pod2cpu18sriov4 -- cat /proc/self/status | grep "Cpus_allowed_list"
Cpus_allowed_list:	21,23,25,27,29,31,33,35,37,61,63,65,67,69,71,73,75,77

# oc exec pod1cpu20sriov4 -- cat /proc/self/status | grep "Cpus_allowed_list"
Cpus_allowed_list:	1,3,5,7,9,11,13,15,17,19,41,43,45,47,49,51,53,55,57,59
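
To confirm that each of these CPU lists falls entirely inside one NUMA node's range, the per-node cpulists on the worker can be dumped the same way as in the original report, for example via oc debug (a sketch; <worker-node> is a placeholder for the node that received the pods):

oc debug node/<worker-node> -- chroot /host sh -c 'cat /sys/devices/system/node/node*/cpulist'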


---- delete one pod: oc delete -f pod1_20cpu_4sriov.yaml

---- recreate the pod: oc create -f pod1_20cpu_4sriov.yaml

Both pod1cpu20sriov4 and pod2cpu20sriov4 were running and were allocated CPUs from NUMA node 1.

# oc get pods
NAME              READY   STATUS    RESTARTS   AGE
pod1cpu20sriov4   1/1     Running   0          6m48s
pod2cpu20sriov4   1/1     Running   0          5s

Comment 12 errata-xmlrpc 2020-07-13 17:20:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

