+++ This bug was initially created as a clone of Bug #1813567 +++

Description of problem:
When creating several pods simultaneously on a node with the CPU Manager policy (static) and Topology Manager policy (single-numa-node) enabled, device and CPU alignment on a single NUMA node is not guaranteed. A pod can still be created even when topology affinity is not met.

Version-Release number of selected component (if applicable):
Client Version: 4.5.0-0.nightly-2020-03-12-003015
Server Version: 4.5.0-0.nightly-2020-03-12-003015
Kubernetes Version: v1.17.1

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:
pod can be admitted

Expected results:
topology affinity error when admitting pod

Additional info:
Test was done with an SR-IOV device on a baremetal OCP deployment.

SR-IOV device on NUMA node 1:

[core@ci-worker-0 ~]$ ip link show ens787f1
5: ens787f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 3c:fd:fe:cd:35:49 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 1 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 2 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 3 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
[core@ci-worker-0 ~]$ cat /sys/class/net/ens787f1/device/numa_node
1

CPUs on each NUMA node:

[core@ci-worker-0 ~]$ lscpu | grep NUMA
NUMA node(s):        2
NUMA node0 CPU(s):   0-17,36-53
NUMA node1 CPU(s):   18-35,54-71

Create two pods simultaneously with `oc create -f pod1.yaml -f pod2.yaml`:
1) pod1 requests 30 CPUs + 1 SR-IOV device
2) pod2 requests 10 CPUs + 1 SR-IOV device

pod1 spec:

apiVersion: v1
kind: Pod
metadata:
  name: testpod1
  annotations:
    k8s.v1.cni.cncf.io/networks: 'sriov-intel'   <== sriov net-attach-def
spec:
  nodeSelector:
    kubernetes.io/hostname: ci-worker-0          <== create pod on worker-0
  containers:
  - name: appcntr1
    image: zenghui/centos-dpdk
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        openshift.io/intelnics: 1                <== request SR-IOV device
        cpu: 30                                  <== request 30 CPUs
        memory: 1000Mi
      limits:
        openshift.io/intelnics: 1
        cpu: 30
        memory: 1000Mi

pod2 spec:

apiVersion: v1
kind: Pod
metadata:
  name: testpod2
  annotations:
    k8s.v1.cni.cncf.io/networks: 'sriov-intel'   <== sriov net-attach-def
spec:
  nodeSelector:
    kubernetes.io/hostname: ci-worker-0          <== create pod on worker-0
  containers:
  - name: appcntr1
    image: zenghui/centos-dpdk
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        openshift.io/intelnics: 1                <== request SR-IOV device
        cpu: 10                                  <== request 10 CPUs
        memory: 1000Mi
      limits:
        openshift.io/intelnics: 1
        cpu: 10
        memory: 1000Mi

Both pod1 and pod2 can be created without hitting a topology affinity error. [unexpected]

# oc get pods -o wide
NAME       READY   STATUS    RESTARTS   AGE     IP            NODE          NOMINATED NODE   READINESS GATES
testpod1   1/1     Running   0          3m19s   10.128.2.39   ci-worker-0   <none>           <none>
testpod2   1/1     Running   0          3m19s   10.128.2.40   ci-worker-0   <none>           <none>

[root@nfvpe-03 templates]# oc exec testpod1 -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
1-2,23-35,37-38,59-71   <== pod1 created successfully with CPUs on both NUMA nodes
[root@nfvpe-03 templates]# oc exec testpod2 -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
18-22,54-58

Delete pod1 and re-create it:
1) `oc delete -f pod1.yaml`
2) `oc create -f pod1.yaml`

# oc get pods -o wide
NAME       READY   STATUS                    RESTARTS   AGE     IP            NODE          NOMINATED NODE   READINESS GATES
testpod1   0/1     Topology Affinity Error   0          6s      <none>        ci-worker-0   <none>           <none>
testpod2   1/1     Running                   0          7m37s   10.128.2.40   ci-worker-0   <none>           <none>

Pod1 admit failed with a topology affinity error. [expected]

--- Additional comment from Deepthi Dharwar on 2020-03-16 06:52:59 UTC ---

Recently a bug was fixed upstream by adding a mutex. This should fix this issue.
https://github.com/kubernetes/kubernetes/commit/e8538d9b76abe944a61eab10bfc2a580974f25fd

It would be better to check whether this bug fix was backported.

--- Additional comment from Victor Pickard on 2020-03-16 20:34:50 UTC ---

This appears to be an issue in CPU Manager. I tried this on upstream k8s against release-1.16 and also against the tip of master (1.18).

In 1.16, I see that the CPU assignment for testpod1 (below) has CPUs from both NUMA nodes. I did not see (could not reproduce) this issue in 1.18.

If I delete testpod1 and recreate it, the CPU assignment is correct (single NUMA node).

[root@nfvsdn-20 hack]# kubectl create -f ~/pod1.yaml -f ~/pod2.yaml
pod/testpod1 created
pod/testpod2 created
[root@nfvsdn-20 hack]# kubectl get pods --all-namespaces
NAMESPACE     NAME                        READY   STATUS    RESTARTS   AGE
default       testpod1                    1/1     Running   0          3s
default       testpod2                    1/1     Running   0          3s
kube-system   kube-dns-68496566b5-5bj98   3/3     Running   0          8m54s
[root@nfvsdn-20 hack]# kubectl exec testpod1 -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
1,3,5,7,9,12-13,15,17,19
[root@nfvsdn-20 hack]# kubectl exec testpod2 -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
2,4,6,8,10,14,16,18,20,22
[root@nfvsdn-20 hack]# lscpu | grep NUMA
NUMA node(s):        2
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23
[root@nfvsdn-20 hack]# cat ~/pod1.yaml
apiVersion: v1
kind: Pod
metadata:
  name: testpod1
spec:
  containers:
  - name: appcntr1
    image: zenghui/centos-dpdk
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        cpu: 10
        memory: 1000Mi
      limits:
        cpu: 10
        memory: 1000Mi
[root@nfvsdn-20 hack]#
[root@nfvsdn-20 hack]# cat ~/pod2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: testpod2
spec:
  containers:
  - name: appcntr1
    image: zenghui/centos-dpdk
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        cpu: 10
        memory: 1000Mi
      limits:
        cpu: 10
        memory: 1000Mi
[root@nfvsdn-20 hack]#

--- Additional comment from Victor Pickard on 2020-03-19 17:33:12 UTC ---

I had a closer look at this issue to make sure I understood why this is broken in 1.16 and works in 1.18.

In 1.18, there were several fixes/enhancements to Topology Manager and the registered hint providers (CPU Manager, Device Manager). Specifically, https://github.com/kubernetes/kubernetes/pull/87759, which resolves issue https://github.com/kubernetes/kubernetes/issues/83476, is exactly the issue reported in this BZ. The summary of that issue is:

    Hint generation and resource allocation is unreliable for all but 1st container (or pod)

So, we have to decide if we want to get all of the changes related to https://github.com/kubernetes/kubernetes/pull/87759 (and dependent PRs) cherry-picked to OCP 4.3 (once Topology Manager is GA in OCP 4.5). I'll include some logs here for reference.

Basically, in 1.16, the sequence is:
1. Generate hint for testpod1
2. Generate hint for testpod2
3. Allocate CPUs for testpod1
4. Allocate CPUs for testpod2   <<== here, there are not enough CPUs on NUMA 0, so some are allocated from NUMA 1

1.16 logs for Topology Manager and CPU Manager when creating 2 pods at the same time (kubectl create -f pod1.yaml -f pod2.yaml).
Notice in this log sequence:

I0317 17:32:40.943414 46476 topology_manager.go:308] [topologymanager] Topology Admit Handler
I0317 17:32:40.943430 46476 topology_manager.go:317] [topologymanager] Pod QoS Level: Guaranteed
I0317 17:32:40.943446 46476 topology_manager.go:194] [topologymanager] Hint Provider has no preference for NUMA affinity with any resource
I0317 17:32:40.943532 46476 topology_hints.go:60] [cpumanager] TopologyHints generated for pod 'testpod1', container 'appcntr1': [{0000000000000000000000000000000000000000000000000000000000000001 true} {0000000000000000000000000000000000000000000000000000000000000010 true} {0000000000000000000000000000000000000000000000000000000000000011 false}]

The hint calculated for testpod1
*********************************

I0317 17:32:40.943583 46476 topology_manager.go:285] [topologymanager] ContainerTopologyHint: {0000000000000000000000000000000000000000000000000000000000000001 true}
I0317 17:32:40.943707 46476 topology_manager.go:329] [topologymanager] Topology Affinity for Pod: d9725981-d53e-4bd7-8f6a-e2a28d0ff304 are map[appcntr1:{0000000000000000000000000000000000000000000000000000000000000001 true}]
I0317 17:32:40.946107 46476 topology_manager.go:308] [topologymanager] Topology Admit Handler
I0317 17:32:40.946123 46476 topology_manager.go:317] [topologymanager] Pod QoS Level: Guaranteed
I0317 17:32:40.946159 46476 topology_manager.go:194] [topologymanager] Hint Provider has no preference for NUMA affinity with any resource
I0317 17:32:40.946245 46476 topology_hints.go:60] [cpumanager] TopologyHints generated for pod 'testpod2', container 'appcntr1': [{0000000000000000000000000000000000000000000000000000000000000001 true} {0000000000000000000000000000000000000000000000000000000000000010 true} {0000000000000000000000000000000000000000000000000000000000000011 false}]

The hint calculated for testpod2. Notice it is the same as the hint for testpod1, which is the root of the problem: there are not enough CPUs available on this NUMA node, so CPU Manager will allocate CPUs from the other NUMA node.
******************************************************************************************************

I0317 17:32:40.946279 46476 topology_manager.go:285] [topologymanager] ContainerTopologyHint: {0000000000000000000000000000000000000000000000000000000000000001 true}
I0317 17:32:40.946296 46476 topology_manager.go:329] [topologymanager] Topology Affinity for Pod: 91ef9e1f-eb23-40ee-b33e-77f154859e38 are map[appcntr1:{0000000000000000000000000000000000000000000000000000000000000001 true}]
I0317 17:32:42.197351 46476 policy_static.go:195] [cpumanager] static policy: AddContainer (pod: testpod1, container: appcntr1, container id: e6f74fdbf7fcc55adb85811e8a5d83c6bcbf9e74cc3db7d697cbcc4d92d82f9b)
I0317 17:32:42.197407 46476 policy_static.go:221] [cpumanager] Pod d9725981-d53e-4bd7-8f6a-e2a28d0ff304, Container appcntr1 Topology Affinity is: {0000000000000000000000000000000000000000000000000000000000000001 true}
I0317 17:32:42.197475 46476 policy_static.go:254] [cpumanager] allocateCpus: (numCPUs: 10, socket: 0000000000000000000000000000000000000000000000000000000000000001)
I0317 17:32:42.197645 46476 state_mem.go:84] [cpumanager] updated default cpuset: "0-1,3,5,7,9,11-13,15,17,19,21,23"
I0317 17:32:42.198877 46476 policy_static.go:287] [cpumanager] allocateCPUs: returning "2,4,6,8,10,14,16,18,20,22"
I0317 17:32:42.198910 46476 state_mem.go:76] [cpumanager] updated desired cpuset (container id: e6f74fdbf7fcc55adb85811e8a5d83c6bcbf9e74cc3db7d697cbcc4d92d82f9b, cpuset: "2,4,6,8,10,14,16,18,20,22")
I0317 17:32:42.227317 46476 policy_static.go:195] [cpumanager] static policy: AddContainer (pod: testpod2, container: appcntr1, container id: 41ee4e2904d1287bd705dc650462bdeea70a8c6865a1f3e8b2c5ba39e561a0e3)
I0317 17:32:42.227355 46476 policy_static.go:221] [cpumanager] Pod 91ef9e1f-eb23-40ee-b33e-77f154859e38, Container appcntr1 Topology Affinity is: {0000000000000000000000000000000000000000000000000000000000000001 true}
I0317 17:32:42.227374 46476 policy_static.go:254] [cpumanager] allocateCpus: (numCPUs: 10, socket: 0000000000000000000000000000000000000000000000000000000000000001)
I0317 17:32:42.227514 46476 state_mem.go:84] [cpumanager] updated default cpuset: "0,11,21,23"
I0317 17:32:42.228133 46476 policy_static.go:287] [cpumanager] allocateCPUs: returning "1,3,5,7,9,12-13,15,17,19"
I0317 17:32:42.228164 46476 state_mem.go:76] [cpumanager] updated desired cpuset (container id: 41ee4e2904d1287bd705dc650462bdeea70a8c6865a1f3e8b2c5ba39e561a0e3, cpuset: "1,3,5,7,9,12-13,15,17,19")

1.18 logs for the same scenario. Notice, the sequence here is:
1. Generate hints for pod1
2. Allocate resources for pod1
3. Generate hints for pod2
4. Allocate resources for pod2

I0317 20:19:35.992273 10554 topology_manager.go:233] [topologymanager] Topology Admit Handler
I0317 20:19:35.992326 10554 topology_manager.go:180] [topologymanager] TopologyHints for pod 'testpod1', container 'appcntr1': map[]
I0317 20:19:35.992488 10554 policy_static.go:329] [cpumanager] TopologyHints generated for pod 'testpod1', container 'appcntr1': [{01 true} {10 true} {11 false}]
I0317 20:19:35.992531 10554 topology_manager.go:180] [topologymanager] TopologyHints for pod 'testpod1', container 'appcntr1': map[cpu:[{01 true} {10 true} {11 false}]]
I0317 20:19:35.992581 10554 policy.go:70] [topologymanager] Hint Provider has no preference for NUMA affinity with any resource

Hint for testpod1
******************

I0317 20:19:35.992600 10554 topology_manager.go:199] [topologymanager] ContainerTopologyHint: {01 true}
I0317 20:19:35.992614 10554 topology_manager.go:258] [topologymanager] Topology Affinity for (pod: 4d4fe6a9-ce0b-42b8-b097-764cfdbe8440 container: appcntr1): {01 true}
I0317 20:19:35.993434 10554 policy_static.go:193] [cpumanager] static policy: Allocate (pod: testpod1, container: appcntr1)
I0317 20:19:35.993466 10554 policy_static.go:203] [cpumanager] Pod 4d4fe6a9-ce0b-42b8-b097-764cfdbe8440, Container appcntr1 Topology Affinity is: {01 true}
I0317 20:19:35.993503 10554 policy_static.go:239] [cpumanager] allocateCpus: (numCPUs: 10, socket: 01)

Allocate
********

I0317 20:19:35.993733 10554 state_mem.go:88] [cpumanager] updated default cpuset: "0-1,3,5,7,9,11-13,15,17,19,21,23"
I0317 20:19:35.994799 10554 policy_static.go:272] [cpumanager] allocateCPUs: returning "2,4,6,8,10,14,16,18,20,22"
I0317 20:19:35.995109 10554 state_mem.go:80] [cpumanager] updated desired cpuset (pod: 4d4fe6a9-ce0b-42b8-b097-764cfdbe8440, container: appcntr1, cpuset: "2,4,6,8,10,14,16,18,20,22")
I0317 20:19:35.997666 10554 topology_manager.go:233] [topologymanager] Topology Admit Handler
I0317 20:19:35.997770 10554 topology_manager.go:180] [topologymanager] TopologyHints for pod 'testpod2', container 'appcntr1': map[]
I0317 20:19:35.998118 10554 policy_static.go:329] [cpumanager] TopologyHints generated for pod 'testpod2', container 'appcntr1': [{10 true} {11 false}]
I0317 20:19:35.998218 10554 topology_manager.go:180] [topologymanager] TopologyHints for pod 'testpod2', container 'appcntr1': map[cpu:[{10 true} {11 false}]]
I0317 20:19:35.998278 10554 policy.go:70] [topologymanager] Hint Provider has no preference for NUMA affinity with any resource

Hint for testpod2
******************

I0317 20:19:35.998314 10554 topology_manager.go:199] [topologymanager] ContainerTopologyHint: {10 true}
I0317 20:19:35.998364 10554 topology_manager.go:258] [topologymanager] Topology Affinity for (pod: b8bc3582-f330-4da0-8c0e-fba14e278e7d container: appcntr1): {10 true}
I0317 20:19:35.999325 10554 policy_static.go:193] [cpumanager] static policy: Allocate (pod: testpod2, container: appcntr1)
I0317 20:19:35.999367 10554 policy_static.go:203] [cpumanager] Pod b8bc3582-f330-4da0-8c0e-fba14e278e7d, Container appcntr1 Topology Affinity is: {10 true}
I0317 20:19:35.999508 10554 policy_static.go:239] [cpumanager] allocateCpus: (numCPUs: 10, socket: 10)
I0317 20:19:36.000065 10554 state_mem.go:88] [cpumanager] updated default cpuset: "0,11-12,23"

Allocate
*********

I0317 20:19:36.000674 10554 policy_static.go:272] [cpumanager] allocateCPUs: returning "1,3,5,7,9,13,15,17,19,21"
I0317 20:19:36.000704 10554 state_mem.go:80] [cpumanager] updated desired cpuset (pod: b8bc3582-f330-4da0-8c0e-fba14e278e7d, container: appcntr1, cpuset: "1,3,5,7,9,13,15,17,19,21")

--- Additional comment from Ryan Phillips on 2020-05-12 16:01:01 UTC ---

Victor: Can we close this bug?

--- Additional comment from Victor Pickard on 2020-05-12 16:36:42 UTC ---

This issue should be resolved in 4.5 now with Topology Manager getting to Beta status. Can you please retest to confirm?

Also, we need to clone this issue to 4.4, as this issue definitely exists in that release and will not be fixed without a pretty significant backport, which, at this time, has been deferred.

--- Additional comment from Victor Pickard on 2020-05-12 16:37:50 UTC ---

(In reply to Ryan Phillips from comment #4)
> Victor: Can we close this bug?

Hi Ryan,
Yes, this issue should be resolved now in 4.5. However, per my last comment, we need to clone this to 4.4, so we have the issue documented.
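The difference between the two log sequences above (1.16 generates hints for both pods before any allocation; 1.18 interleaves hint generation and allocation per pod) can be sketched in a toy model. This is a simplification under stated assumptions, not the real kubelet code: the function names, the first-fit hint policy, and the 2x12-CPU layout (matching the 24-CPU upstream reproducer) are all illustrative.

```python
# Toy model of pod admission: a hint provider picks a NUMA node, then the
# CPU Manager carves CPUs out of the shared pool, spilling across nodes
# if the hinted node no longer has enough free CPUs.

NUM_NODES = 2
CPUS_PER_NODE = 12  # mirrors the 24-CPU upstream reproducer above


def generate_hint(free, request):
    """Prefer the first NUMA node that can satisfy the request by itself."""
    for node in range(NUM_NODES):
        if free[node] >= request:
            return node
    return None  # no single node fits


def allocate(free, node, request):
    """Take CPUs from the hinted node first, spilling to others if short."""
    used = {node: min(free[node], request)}
    free[node] -= used[node]
    request -= used[node]
    for other in range(NUM_NODES):  # the spill is what breaks alignment
        if request == 0:
            break
        if other != node and free[other]:
            take = min(free[other], request)
            free[other] -= take
            used[other] = take
            request -= take
    return used


def admit(pods, batch_hints):
    """batch_hints=True mimics 1.16 (hint both pods, then allocate both);
    False mimics 1.18 (hint and allocate one pod at a time)."""
    free = {n: CPUS_PER_NODE for n in range(NUM_NODES)}
    if batch_hints:
        hints = [generate_hint(free, req) for req in pods]  # stale snapshot
        return [allocate(free, h, req) for h, req in zip(hints, pods)]
    result = []
    for req in pods:
        h = generate_hint(free, req)  # fresh view after prior allocation
        result.append(allocate(free, h, req))
    return result


print(admit([10, 10], batch_hints=True))   # second pod spans both NUMA nodes
print(admit([10, 10], batch_hints=False))  # each pod fits on a single node
```

In the batch case both pods are hinted to node 0 against the same free-CPU snapshot, so the second allocation spills onto node 1, which is exactly the mixed cpuset seen in the 1.16 logs.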
This issue is resolved in 4.5 with the k8s rebase that pulls in k8s 1.18. In order to resolve this in 4.4.z, a pretty significant backport would be required. We are not planning to fix this in 4.4, so we will add documentation on how to avoid this scenario, and also a workaround if the bug is encountered.
The workaround for this bug is as follows:

1. Don't spin up multiple pods on a node simultaneously. Doing so is likely to trigger this issue, and resources would not be NUMA aligned.
2. If NUMA resource alignment fails due to 1 above, the workaround is to delete the pod, then create it again.
Let me adjust the workaround text to make it a little clearer.

1. Don't spin up multiple pods with Guaranteed QoS on a node simultaneously. Doing so is likely to trigger this issue, and resources may not be NUMA aligned as requested in the pod spec.
2. If this bug is encountered, the workaround is to delete and then recreate the pod.
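To tell whether workaround step 2 is needed, the cpuset a pod actually received can be compared against the node's NUMA topology. The sketch below is illustrative (the helper names are ours, not from any k8s API); the topology and cpuset strings are taken from the reproducer earlier in this bug, where on a live node they would come from `oc exec <pod> -- cat /sys/fs/cgroup/cpuset/cpuset.cpus` and `lscpu`.

```python
# Check whether a container's cpuset is confined to a single NUMA node.

def parse_cpu_list(s):
    """Expand a kernel cpu-list string like '1-2,23-35' into a set of ints."""
    cpus = set()
    for part in s.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def numa_nodes_used(cpuset, numa_topology):
    """Return the set of NUMA node ids that the cpuset touches."""
    cpus = parse_cpu_list(cpuset)
    return {node for node, node_cpus in numa_topology.items()
            if cpus & parse_cpu_list(node_cpus)}


# NUMA layout of ci-worker-0 from the lscpu output in this bug
topology = {0: "0-17,36-53", 1: "18-35,54-71"}

print(numa_nodes_used("1-2,23-35,37-38,59-71", topology))  # testpod1: misaligned
print(numa_nodes_used("18-22,54-58", topology))            # testpod2: aligned
```

A result with more than one node id means the pod hit this bug and should be deleted and recreated per the workaround.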
PR to add this to Known Issues in the 4.4 release notes with the workaround:
https://github.com/openshift/openshift-docs/pull/22058
Sunil -- Can you take a look at the release note, linked in comment 4?