Bug 1834979

Summary:	Topology Manager policy not respected when creating pods concurrently
Product:	OpenShift Container Platform	Reporter:	Victor Pickard <vpickard>
Component:	Node	Assignee:	Victor Pickard <vpickard>
Status:	CLOSED WONTFIX	QA Contact:	Sunil Choudhary <schoudha>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.4	CC:	aos-bugs, asimonel, ddharwar, jokerman, mburke, rphillips, schoudha, vpickard, zshi
Target Milestone:	---
Target Release:	4.5.0
Hardware:	x86_64
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Known Issue
Doc Text:	Cause: An issue in Topology Manager could result in NUMA resources not being aligned to the same NUMA node if Guaranteed QoS pods are created simultaneously on the same node. Consequence: Resources requested in the pod spec may not be NUMA aligned. Workaround (if any): Do not create pods with Guaranteed QoS concurrently on the same node. If this does occur, delete and recreate the pod. Result: Pod resources should be NUMA aligned after deleting and recreating the pod with Guaranteed QoS resource requests.	Story Points:	---
Clone Of:	1813567	Environment:
Last Closed:	2020-05-12 19:35:05 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1813567
Bug Blocks:

Description Victor Pickard 2020-05-12 19:25:26 UTC

+++ This bug was initially created as a clone of Bug #1813567 +++

Description of problem:

When creating several pods "simultaneously" on a node with the static CPU Manager policy and the single-numa-node Topology Manager policy enabled, device and CPU alignment on a single NUMA node is not guaranteed. A pod can still be created even when topology affinity is not met.


Version-Release number of selected component (if applicable):

Client Version: 4.5.0-0.nightly-2020-03-12-003015
Server Version: 4.5.0-0.nightly-2020-03-12-003015
Kubernetes Version: v1.17.1

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:

pod can be admitted

Expected results:

topology affinity error when admitting pod


Additional info:

Test was done with SR-IOV device on baremetal OCP deployment.

SRIOV Device on NUMA node 1:

[core@ci-worker-0 ~]$ ip link show ens787f1
5: ens787f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 3c:fd:fe:cd:35:49 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 1 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 2 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 3 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off

[core@ci-worker-0 ~]$ cat /sys/class/net/ens787f1/device/numa_node  
1


CPUs on each NUMA node:

[core@ci-worker-0 ~]$ lscpu | grep NUMA
NUMA node(s):        2
NUMA node0 CPU(s):   0-17,36-53
NUMA node1 CPU(s):   18-35,54-71


Create two pods simultaneously with the command `oc create -f pod1.yaml -f pod2.yaml`:
1) pod1 requests 30 CPUs + 1 SR-IOV device
2) pod2 requests 10 CPUs + 1 SR-IOV device

pod1 spec:

apiVersion: v1
kind: Pod
metadata:
  name: testpod1
  annotations:
    k8s.v1.cni.cncf.io/networks: 'sriov-intel'  <== sriov net-attach-def
spec:
  nodeSelector:
    kubernetes.io/hostname: ci-worker-0         <== create pod on worker-0
  containers:
  - name: appcntr1
    image: zenghui/centos-dpdk
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        openshift.io/intelnics: 1              <== request SR-IOV device
        cpu: 30                                <== request 30 CPUs
        memory: 1000Mi
      limits:
        openshift.io/intelnics: 1
        cpu: 30
        memory: 1000Mi

pod2 spec:

apiVersion: v1
kind: Pod
metadata:
  name: testpod2
  annotations:
    k8s.v1.cni.cncf.io/networks: 'sriov-intel'  <== sriov net-attach-def
spec:
  nodeSelector:
    kubernetes.io/hostname: ci-worker-0         <== create pod on worker-0
  containers:
  - name: appcntr1
    image: zenghui/centos-dpdk
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        openshift.io/intelnics: 1              <== request SR-IOV device
        cpu: 10                                <== request 10 CPUs
        memory: 1000Mi
      limits:
        openshift.io/intelnics: 1
        cpu: 10
        memory: 1000Mi


Both pod1 and pod2 can be created without hitting topology affinity error. [unexpected]

# oc get pods -o wide
NAME       READY   STATUS    RESTARTS   AGE     IP            NODE          NOMINATED NODE   READINESS GATES
testpod1   1/1     Running   0          3m19s   10.128.2.39   ci-worker-0   <none>           <none>
testpod2   1/1     Running   0          3m19s   10.128.2.40   ci-worker-0   <none>           <none>

[root@nfvpe-03 templates]# oc exec testpod1 -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
1-2,23-35,37-38,59-71      <== pod1 created successfully with cpus on both NUMA nodes
[root@nfvpe-03 templates]# oc exec testpod2 -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
18-22,54-58          
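
The alignment check above can also be done mechanically. The following is a minimal sketch, not part of the original report: it expands the cpuset strings and the lscpu ranges shown above and reports which NUMA nodes each pod's CPUs come from. The function names and the CI_WORKER_0 table are illustrative assumptions.

```python
from typing import Dict, Set


def parse_cpu_list(text: str) -> Set[int]:
    """Expand a Linux CPU list such as "0-17,36-53" into a set of CPU ids."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def numa_nodes_used(cpuset: str, topology: Dict[int, str]) -> Set[int]:
    """Return the NUMA nodes that contribute at least one CPU to `cpuset`."""
    assigned = parse_cpu_list(cpuset)
    return {node for node, cpus in topology.items() if assigned & parse_cpu_list(cpus)}


# NUMA layout of ci-worker-0, copied from the lscpu output above.
CI_WORKER_0 = {0: "0-17,36-53", 1: "18-35,54-71"}

if __name__ == "__main__":
    pods = [("testpod1", "1-2,23-35,37-38,59-71"), ("testpod2", "18-22,54-58")]
    for pod, cpuset in pods:
        nodes = sorted(numa_nodes_used(cpuset, CI_WORKER_0))
        status = "aligned" if len(nodes) == 1 else "NOT aligned"
        print(pod, "uses NUMA nodes", nodes, status)
```

With the values from this report, testpod1 draws CPUs from both NUMA nodes while testpod2 stays on NUMA node 1, where the SR-IOV device lives.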


Delete pod1 and re-create it:
1) `oc delete -f pod1.yaml`
2) `oc create -f pod1.yaml`

# oc get pods -o wide
NAME       READY   STATUS                    RESTARTS   AGE     IP            NODE          NOMINATED NODE   READINESS GATES
testpod1   0/1     Topology Affinity Error   0          6s      <none>        ci-worker-0   <none>           <none>
testpod2   1/1     Running                   0          7m37s   10.128.2.40   ci-worker-0   <none>           <none>


Pod1 admission failed with a topology affinity error. [expected]

--- Additional comment from Deepthi Dharwar on 2020-03-16 06:52:59 UTC ---

A bug was recently fixed upstream by adding a mutex. This should fix this issue.
https://github.com/kubernetes/kubernetes/commit/e8538d9b76abe944a61eab10bfc2a580974f25fd
It would be good to check whether this bug fix was backported.

--- Additional comment from Victor Pickard on 2020-03-16 20:34:50 UTC ---


This appears to be an issue in CPU Manager. I tried this on u/s k8s against release-1.16 and also against tip of master (1.18). 

In 1.16, I see that the cpu assignment for testpod1 (below) has CPUs from both NUMA nodes. I did not see (could not reproduce) this issue in 1.18.

If I delete testpod1, and recreate it, CPU assignment is correct (single numa node)

[root@nfvsdn-20 hack]# kubectl create -f ~/pod1.yaml -f ~/pod2.yaml
pod/testpod1 created
pod/testpod2 created
[root@nfvsdn-20 hack]# kubectl get pods --all-namespaces
NAMESPACE     NAME                        READY   STATUS    RESTARTS   AGE
default       testpod1                    1/1     Running   0          3s
default       testpod2                    1/1     Running   0          3s
kube-system   kube-dns-68496566b5-5bj98   3/3     Running   0          8m54s
[root@nfvsdn-20 hack]# kubectl exec testpod1 -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
1,3,5,7,9,12-13,15,17,19
[root@nfvsdn-20 hack]# kubectl exec testpod2 -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
2,4,6,8,10,14,16,18,20,22
[root@nfvsdn-20 hack]# lscpu |grep NUMA
NUMA node(s):          2
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23

[root@nfvsdn-20 hack]# cat ~/pod1.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: testpod1
spec:
  containers:
  - name: appcntr1
    image: zenghui/centos-dpdk
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        cpu: 10                                
        memory: 1000Mi
      limits:
        cpu: 10
        memory: 1000Mi
[root@nfvsdn-20 hack]# 

[root@nfvsdn-20 hack]# cat ~/pod2.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: testpod2
spec:
  containers:
  - name: appcntr1
    image: zenghui/centos-dpdk
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        cpu: 10
        memory: 1000Mi
      limits:
        cpu: 10
        memory: 1000Mi
[root@nfvsdn-20 hack]#

--- Additional comment from Victor Pickard on 2020-03-19 17:33:12 UTC ---

I had a closer look at this issue to make sure I understood why this is broken in 1.16 and works in 1.18.

In 1.18, there were several fixes/enhancements to Topology Manager and the registered hint providers (CPU Manager, Device Manager). Specifically, https://github.com/kubernetes/kubernetes/pull/87759, which resolves issue https://github.com/kubernetes/kubernetes/issues/83476, is exactly the issue reported in this BZ.

The summary of the above issue is: 
Hint generation and resource allocation is unreliable for all but 1st container (or pod)

So, we have to decide if we want to get all of the changes related to https://github.com/kubernetes/kubernetes/pull/87759 (and dependent PRs) cherry-picked to OCP 4.3 (once Topology Manager is GA in OCP 4.5). 

I'll include some logs here for reference. Basically, in 1.16, the sequence is:

1. Generate hint for testpod1
2. Generate hint for testpod2
3. Allocate cpus for testpod1
4. Allocate cpus for testpod2  <<== here, there are not enough CPUs on NUMA 0, so some are allocated from NUMA 1

1.16 logs for Topology Manager and CPU Manager when creating 2 pods at the same time (kubectl create -f pod1.yaml -f pod2.yaml).

Notice in this log sequence that the hints for both pods are generated before CPUs are allocated for either pod.

I0317 17:32:40.943414   46476 topology_manager.go:308] [topologymanager] Topology Admit Handler
I0317 17:32:40.943430   46476 topology_manager.go:317] [topologymanager] Pod QoS Level: Guaranteed
I0317 17:32:40.943446   46476 topology_manager.go:194] [topologymanager] Hint Provider has no preference for NUMA affinity with any resource
I0317 17:32:40.943532   46476 topology_hints.go:60] [cpumanager] TopologyHints generated for pod 'testpod1', container 'appcntr1': [{0000000000000000000000000000000000000000000000000000000000000001 true} {0000000000000000000000000000000000000000000000000000000000000010 true} {0000000000000000000000000000000000000000000000000000000000000011 false}]



The hint calculated for testpod1
********************************* 

I0317 17:32:40.943583   46476 topology_manager.go:285] [topologymanager] ContainerTopologyHint: {0000000000000000000000000000000000000000000000000000000000000001 true}


I0317 17:32:40.943707   46476 topology_manager.go:329] [topologymanager] Topology Affinity for Pod: d9725981-d53e-4bd7-8f6a-e2a28d0ff304 are map[appcntr1:{0000000000000000000000000000000000000000000000000000000000000001 true}]
I0317 17:32:40.946107   46476 topology_manager.go:308] [topologymanager] Topology Admit Handler
I0317 17:32:40.946123   46476 topology_manager.go:317] [topologymanager] Pod QoS Level: Guaranteed
I0317 17:32:40.946159   46476 topology_manager.go:194] [topologymanager] Hint Provider has no preference for NUMA affinity with any resource
I0317 17:32:40.946245   46476 topology_hints.go:60] [cpumanager] TopologyHints generated for pod 'testpod2', container 'appcntr1': [{0000000000000000000000000000000000000000000000000000000000000001 true} {0000000000000000000000000000000000000000000000000000000000000010 true} {0000000000000000000000000000000000000000000000000000000000000011 false}]



The hint calculated for testpod2. Notice that it is the same as the hint for testpod1, which is the root
of the problem: there are not enough CPUs available on this NUMA node, so CPU Manager will allocate
CPUs from the other NUMA node.
******************************************************************************************************
I0317 17:32:40.946279   46476 topology_manager.go:285] [topologymanager] ContainerTopologyHint: {0000000000000000000000000000000000000000000000000000000000000001 true}



I0317 17:32:40.946296   46476 topology_manager.go:329] [topologymanager] Topology Affinity for Pod: 91ef9e1f-eb23-40ee-b33e-77f154859e38 are map[appcntr1:{0000000000000000000000000000000000000000000000000000000000000001 true}]
I0317 17:32:42.197351   46476 policy_static.go:195] [cpumanager] static policy: AddContainer (pod: testpod1, container: appcntr1, container id: e6f74fdbf7fcc55adb85811e8a5d83c6bcbf9e74cc3db7d697cbcc4d92d82f9b)
I0317 17:32:42.197407   46476 policy_static.go:221] [cpumanager] Pod d9725981-d53e-4bd7-8f6a-e2a28d0ff304, Container appcntr1 Topology Affinity is: {0000000000000000000000000000000000000000000000000000000000000001 true}
I0317 17:32:42.197475   46476 policy_static.go:254] [cpumanager] allocateCpus: (numCPUs: 10, socket: 0000000000000000000000000000000000000000000000000000000000000001)
I0317 17:32:42.197645   46476 state_mem.go:84] [cpumanager] updated default cpuset: "0-1,3,5,7,9,11-13,15,17,19,21,23"
I0317 17:32:42.198877   46476 policy_static.go:287] [cpumanager] allocateCPUs: returning "2,4,6,8,10,14,16,18,20,22"


I0317 17:32:42.198910   46476 state_mem.go:76] [cpumanager] updated desired cpuset (container id: e6f74fdbf7fcc55adb85811e8a5d83c6bcbf9e74cc3db7d697cbcc4d92d82f9b, cpuset: "2,4,6,8,10,14,16,18,20,22")
I0317 17:32:42.227317   46476 policy_static.go:195] [cpumanager] static policy: AddContainer (pod: testpod2, container: appcntr1, container id: 41ee4e2904d1287bd705dc650462bdeea70a8c6865a1f3e8b2c5ba39e561a0e3)
I0317 17:32:42.227355   46476 policy_static.go:221] [cpumanager] Pod 91ef9e1f-eb23-40ee-b33e-77f154859e38, Container appcntr1 Topology Affinity is: {0000000000000000000000000000000000000000000000000000000000000001 true}
I0317 17:32:42.227374   46476 policy_static.go:254] [cpumanager] allocateCpus: (numCPUs: 10, socket: 0000000000000000000000000000000000000000000000000000000000000001)
I0317 17:32:42.227514   46476 state_mem.go:84] [cpumanager] updated default cpuset: "0,11,21,23"
I0317 17:32:42.228133   46476 policy_static.go:287] [cpumanager] allocateCPUs: returning "1,3,5,7,9,12-13,15,17,19"


I0317 17:32:42.228164   46476 state_mem.go:76] [cpumanager] updated desired cpuset (container id: 41ee4e2904d1287bd705dc650462bdeea70a8c6865a1f3e8b2c5ba39e561a0e3, cpuset: "1,3,5,7,9,12-13,15,17,19")
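
For readers decoding the 1.16 logs above: each hint is printed as a binary mask followed by a boolean. The sketch below is not kubelet code. It assumes the rightmost bit stands for NUMA node 0, which matches the "socket:" value logged for the NUMA 0 allocation, and that the boolean is the hint's preferred flag. The function name is hypothetical.

```python
from typing import List


def numa_nodes_from_mask(mask: str) -> List[int]:
    """Decode a binary mask string; the rightmost character is assumed to be NUMA node 0."""
    return [index for index, bit in enumerate(reversed(mask)) if bit == "1"]


if __name__ == "__main__":
    # The 64-character mask from the 1.16 logs and the short masks from the 1.18 logs.
    for mask in ["0" * 63 + "1", "10", "11"]:
        print(mask[-4:].rjust(4, "."), "->", numa_nodes_from_mask(mask))
```

Under that assumption, both testpod1 and testpod2 were given an affinity for NUMA node 0 only, which is what the allocation lines then act on.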



1.18 logs for same scenario. Notice, the sequence here is:

1. Generate hints for pod1
2. Allocate resources for pod1
3. Generate hints for pod2
4. Allocate resources for pod2
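
The difference between the two orderings (1.16: both hints first, then both allocations; 1.18: hint and allocation per pod) can be reproduced with a toy model. This is a simplified sketch, not the kubelet's algorithm: it assumes a 2-node layout like nfvsdn-20 (even CPUs on NUMA 0, odd CPUs on NUMA 1), ignores reserved CPUs, and lets allocation spill over to the other node when the hinted node runs out. All function names are hypothetical.

```python
from typing import Dict, List, Set


def fresh_node() -> Dict[int, List[int]]:
    """Free CPUs per NUMA node: even ids on node 0, odd ids on node 1."""
    return {0: list(range(0, 24, 2)), 1: list(range(1, 24, 2))}


def generate_hint(free: Dict[int, List[int]], need: int) -> int:
    """Pick the lowest-numbered NUMA node that currently has `need` free CPUs."""
    for node in sorted(free):
        if len(free[node]) >= need:
            return node
    raise RuntimeError("no single NUMA node can satisfy the request")


def allocate(free: Dict[int, List[int]], need: int, hint: int) -> List[int]:
    """Take CPUs from the hinted node first, then spill over to the other nodes."""
    taken: List[int] = []
    for node in [hint] + [n for n in sorted(free) if n != hint]:
        while free[node] and len(taken) < need:
            taken.append(free[node].pop(0))
    return taken


def hints_first(requests: List[int]) -> List[List[int]]:
    """1.16-style ordering: all hints are computed before any allocation."""
    free = fresh_node()
    hints = [generate_hint(free, need) for need in requests]
    return [allocate(free, need, hint) for need, hint in zip(requests, hints)]


def hint_then_allocate(requests: List[int]) -> List[List[int]]:
    """1.18-style ordering: each pod's hint sees the previous pod's allocation."""
    free = fresh_node()
    return [allocate(free, need, generate_hint(free, need)) for need in requests]


def nodes_of(cpus: List[int]) -> Set[int]:
    """NUMA nodes touched by a CPU list in this toy layout."""
    return {cpu % 2 for cpu in cpus}


if __name__ == "__main__":
    for name, run in [("hints first", hints_first), ("hint then allocate", hint_then_allocate)]:
        print(name, [sorted(nodes_of(cpus)) for cpus in run([10, 10])])
```

In this model the second pod ends up on both NUMA nodes only when its hint was computed from stale state, which mirrors the behaviour described for 1.16.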

I0317 20:19:35.992273   10554 topology_manager.go:233] [topologymanager] Topology Admit Handler
I0317 20:19:35.992326   10554 topology_manager.go:180] [topologymanager] TopologyHints for pod 'testpod1', container 'appcntr1': map[]
I0317 20:19:35.992488   10554 policy_static.go:329] [cpumanager] TopologyHints generated for pod 'testpod1', container 'appcntr1': [{01 true} {10 true} {11 false}]
I0317 20:19:35.992531   10554 topology_manager.go:180] [topologymanager] TopologyHints for pod 'testpod1', container 'appcntr1': map[cpu:[{01 true} {10 true} {11 false}]]
I0317 20:19:35.992581   10554 policy.go:70] [topologymanager] Hint Provider has no preference for NUMA affinity with any resource

Hint for testpod1
******************
I0317 20:19:35.992600   10554 topology_manager.go:199] [topologymanager] ContainerTopologyHint: {01 true}
I0317 20:19:35.992614   10554 topology_manager.go:258] [topologymanager] Topology Affinity for (pod: 4d4fe6a9-ce0b-42b8-b097-764cfdbe8440 container: appcntr1): {01 true}
I0317 20:19:35.993434   10554 policy_static.go:193] [cpumanager] static policy: Allocate (pod: testpod1, container: appcntr1)
I0317 20:19:35.993466   10554 policy_static.go:203] [cpumanager] Pod 4d4fe6a9-ce0b-42b8-b097-764cfdbe8440, Container appcntr1 Topology Affinity is: {01 true}
I0317 20:19:35.993503   10554 policy_static.go:239] [cpumanager] allocateCpus: (numCPUs: 10, socket: 01)

Allocate
********
I0317 20:19:35.993733   10554 state_mem.go:88] [cpumanager] updated default cpuset: "0-1,3,5,7,9,11-13,15,17,19,21,23"
I0317 20:19:35.994799   10554 policy_static.go:272] [cpumanager] allocateCPUs: returning "2,4,6,8,10,14,16,18,20,22"
I0317 20:19:35.995109   10554 state_mem.go:80] [cpumanager] updated desired cpuset (pod: 4d4fe6a9-ce0b-42b8-b097-764cfdbe8440, container: appcntr1, cpuset: "2,4,6,8,10,14,16,18,20,22")
I0317 20:19:35.997666   10554 topology_manager.go:233] [topologymanager] Topology Admit Handler
I0317 20:19:35.997770   10554 topology_manager.go:180] [topologymanager] TopologyHints for pod 'testpod2', container 'appcntr1': map[]
I0317 20:19:35.998118   10554 policy_static.go:329] [cpumanager] TopologyHints generated for pod 'testpod2', container 'appcntr1': [{10 true} {11 false}]
I0317 20:19:35.998218   10554 topology_manager.go:180] [topologymanager] TopologyHints for pod 'testpod2', container 'appcntr1': map[cpu:[{10 true} {11 false}]]
I0317 20:19:35.998278   10554 policy.go:70] [topologymanager] Hint Provider has no preference for NUMA affinity with any resource

Hint for testpod2
******************
I0317 20:19:35.998314   10554 topology_manager.go:199] [topologymanager] ContainerTopologyHint: {10 true}
I0317 20:19:35.998364   10554 topology_manager.go:258] [topologymanager] Topology Affinity for (pod: b8bc3582-f330-4da0-8c0e-fba14e278e7d container: appcntr1): {10 true}
I0317 20:19:35.999325   10554 policy_static.go:193] [cpumanager] static policy: Allocate (pod: testpod2, container: appcntr1)
I0317 20:19:35.999367   10554 policy_static.go:203] [cpumanager] Pod b8bc3582-f330-4da0-8c0e-fba14e278e7d, Container appcntr1 Topology Affinity is: {10 true}
I0317 20:19:35.999508   10554 policy_static.go:239] [cpumanager] allocateCpus: (numCPUs: 10, socket: 10)
I0317 20:19:36.000065   10554 state_mem.go:88] [cpumanager] updated default cpuset: "0,11-12,23"

Allocate
*********
I0317 20:19:36.000674   10554 policy_static.go:272] [cpumanager] allocateCPUs: returning "1,3,5,7,9,13,15,17,19,21"
I0317 20:19:36.000704   10554 state_mem.go:80] [cpumanager] updated desired cpuset (pod: b8bc3582-f330-4da0-8c0e-fba14e278e7d, container: appcntr1, cpuset: "1,3,5,7,9,13,15,17,19,21")

--- Additional comment from Ryan Phillips on 2020-05-12 16:01:01 UTC ---

Victor: Can we close this bug?

--- Additional comment from Victor Pickard on 2020-05-12 16:36:42 UTC ---

This issue should be resolved in 4.5 now with Topology Manager getting to Beta status. Can you please retest to confirm?

Also, we need to clone this issue to 4.4, as this issue definitely exists in that release, and will not be fixed without a pretty significant backport, which, at this time, has been deferred.

--- Additional comment from Victor Pickard on 2020-05-12 16:37:50 UTC ---

(In reply to Ryan Phillips from comment #4)
> Victor: Can we close this bug?

Hi Ryan,
Yes, this issue should be resolved now in 4.5. However, per my last comment, we need to clone this to 4.4, so we have the issue documented.

Comment 1 Victor Pickard 2020-05-12 19:28:01 UTC

This issue is resolved in 4.5 with the k8s rebase that pulls in k8s 1.18. 

In order to resolve this in 4.4.z, a pretty significant backport would be required. We are not planning to fix this in 4.4, so we will add documentation on how to avoid this scenario, along with a workaround if the bug is encountered.

Comment 2 Victor Pickard 2020-05-12 19:35:05 UTC

The workaround for this bug is as follows:

1. Don't spin up multiple pods on a node simultaneously. This is likely to trigger this issue, and resources would not be NUMA aligned.


2. If NUMA resource alignment fails due to 1 above, the workaround is to delete the pod, then create the pod again.

Comment 3 Victor Pickard 2020-05-12 19:44:15 UTC

Let me adjust the workaround text to make it a little clearer.

1. Don't spin up multiple pods with a Guaranteed QoS on a node simultaneously. This is likely to trigger this issue, and resources may not be NUMA aligned as requested in the pod spec.

2. If this bug is encountered, the workaround is to delete and then recreate the pod.
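
One way to follow step 1 in a script is to create the pods one at a time and wait for each to become Ready before creating the next. The sketch below only builds the command lines. It assumes that waiting on the Ready condition is enough to ensure the previous pod's CPUs have been allocated, which this bug does not confirm. The helper name and the 120s timeout are hypothetical.

```python
import subprocess
from typing import List, Sequence, Tuple


def serial_create_commands(
    pods: Sequence[Tuple[str, str]], timeout: str = "120s"
) -> List[List[str]]:
    """Build `oc` commands that create each pod and wait for it before the next one.

    `pods` is a sequence of (manifest path, pod name) pairs.
    """
    commands: List[List[str]] = []
    for manifest, pod_name in pods:
        commands.append(["oc", "create", "-f", manifest])
        commands.append(
            ["oc", "wait", "--for=condition=Ready", f"pod/{pod_name}", f"--timeout={timeout}"]
        )
    return commands


if __name__ == "__main__":
    # Manifest and pod names are the ones used earlier in this report.
    for command in serial_create_commands([("pod1.yaml", "testpod1"), ("pod2.yaml", "testpod2")]):
        subprocess.run(command, check=True)
```

This replaces the single `oc create -f pod1.yaml -f pod2.yaml` invocation that triggered the problem with two separate, serialized creations.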

Comment 4 Michael Burke 2020-05-13 14:11:16 UTC

PR to add this to Known Issues in the 4.4 release notes with the workaround.
https://github.com/openshift/openshift-docs/pull/22058

Comment 5 Michael Burke 2020-05-13 14:12:08 UTC

Sunil -- Can you take a look at the release note, linked in comment 4?