Description of problem: The runc process responsible for creating a container runs with the CPU affinity provided in its config.json argument. Currently the container-creation flow calls runc without specifying a cpuset, then updates the cpuset in a subsequent call. In the window between those two calls, runc can run on any CPU, including CPUs assigned exclusively to Guaranteed Pods, potentially interfering with ultra-low-latency workloads. Backporting https://github.com/kubernetes/kubernetes/pull/98019 solves the issue by passing the cpuset directly during container creation.
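Per the description above, runc itself runs with the affinity taken from config.json, so the interference window can be observed on a node while a container create is in flight. A minimal sketch of the check, assuming util-linux's taskset is installed on the node (the loop is illustrative only, not part of the fix):

  # Print the CPU affinity of every live runc process. Before the fix this
  # spans all online CPUs, including those assigned exclusively to
  # Guaranteed Pods; with the backport, a create-time runc is already
  # confined to the container's cpuset.
  for pid in $(pgrep -x runc); do taskset -cp "$pid"; done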
Reproduction steps:
1. Use a kubelet config with:
     cpuManagerPolicy: static
     [...]
     reservedSystemCPUs: 0,1,...
2. Create a Guaranteed Pod (so some of the CPUs will be used).
3. Create another Pod (Guaranteed or not).
4. Inspect the new pod's config.json at create time (before the container is started): the cpuset is not set.

In order to "catch" the config.json at the time the container is created, one can use a cri-o wrapper:
- Change the runtime path in the crio config:
    [crio.runtime.runtimes.runc]
    runtime_path = "/usr/local/bin/runc-wrapper.sh"
- Use a wrapper like:
    if [ -n "$3" ] && [ "$3" == "create" ] && [ -f "$5/config.json" ]; then
      conf="$5/config.json"
      ...
    fi
    /bin/runc "$@"
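Once the wrapper is capturing config.json, the relevant field can be read out directly; a minimal sketch, assuming jq is installed on the node and $CONF points at a captured config.json (both are assumptions for illustration, not part of the reproduction above):

  # .linux.resources.cpu.cpus is the cpuset field of the OCI runtime spec.
  # null/absent at create time demonstrates the bug; the pod's assigned
  # cpuset demonstrates the backported fix.
  jq '.linux.resources.cpu.cpus' "$CONF"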
Tested and verified on OCP 4.7.0-0.nightly-2021-02-05-105159 on an AWS cluster with m5.4xlarge instances (16 vCPUs). Followed the reproduction steps in Comment 1.

kubeletconfig:
.
.
spec:
  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
    reservedSystemCPUs: 0,1,2,3,4

Edited the runtime path in the crio config:
  [crio.runtime.runtimes.runc]
  runtime_path = "/usr/local/bin/runc-wrapper.sh"
  systemctl restart crio

# cat /usr/local/bin/runc-wrapper.sh
#!/bin/bash
if [ -n "$3" ] && [ "$3" == "create" ] && [ -f "$5/config.json" ]; then
  conf="$5/config.json"
  cat $conf >> /root/create
fi
exec /bin/runc "$@"

------

Cordoned 2 of the 3 worker nodes to force the pods to be deployed on the CPU-manager-enabled worker node ip-10-0-128-206.us-east-2.compute.internal:

# oc get nodes
NAME                                         STATUS                     ROLES    AGE     VERSION
ip-10-0-128-206.us-east-2.compute.internal   Ready                      worker   6h25m   v1.20.0+68292b2
ip-10-0-138-89.us-east-2.compute.internal    Ready                      master   6h31m   v1.20.0+68292b2
ip-10-0-187-126.us-east-2.compute.internal   Ready                      master   6h30m   v1.20.0+68292b2
ip-10-0-188-63.us-east-2.compute.internal    Ready,SchedulingDisabled   worker   6h24m   v1.20.0+68292b2
ip-10-0-192-141.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   6h24m   v1.20.0+68292b2
ip-10-0-211-8.us-east-2.compute.internal     Ready                      master   6h30m   v1.20.0+68292b2

Deployed pod1 with 5 guaranteed CPUs:

oc create -f pod1_5cpu.yaml

# cat pod1_5cpu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod1cpu5
  annotations:
spec:
  nodeSelector:
  containers:
  - name: appcntr1
    image: zenghui/centos-dpdk
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        cpu: 5
        memory: 100Mi
      limits:
        cpu: 5
        memory: 100Mi

After pod1cpu5 was running:

# oc get pods
NAME                                            READY   STATUS    RESTARTS   AGE
ip-10-0-128-206us-east-2computeinternal-debug   1/1     Running   0          3m26s
pod1cpu5                                        1/1     Running   0          20s

Then deployed pod2cpu3:

# cat pod2_3cpu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod2cpu3
  annotations:
spec:
  nodeSelector:
  containers:
  - name: appcntr1
    image: zenghui/centos-dpdk
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        cpu: 3
        memory: 100Mi
      limits:
        cpu: 3
        memory: 100Mi

oc create -f pod2_3cpu.yaml

# oc get pods
NAME                                            READY   STATUS    RESTARTS   AGE
ip-10-0-128-206us-east-2computeinternal-debug   1/1     Running   0          16m
pod1cpu5                                        1/1     Running   0          13m
pod2cpu3                                        1/1     Running   0          13m

# oc debug node/ip-10-0-128-206.us-east-2.compute.internal
# chroot /host
sh-4.4# cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"0-4,10-12","entries":{"50f732a3-956f-404f-9b94-c49fece9bd5e":{"appcntr1":"7,9,15"},"7e5ce205-8fc5-4b59-9eea-80cd020044bc":{"appcntr1":"5-6,8,13-14"}},"checksum":320863805}

When pod2 was deployed after pod1, only the still-available CPUs were assigned to it, verifying the fix:

sh-4.4# cat /root/create | grep cpus:
.
.
.
"cpus": "0-4,10-12"
"cpus": "0-4,10-12"
"cpus": "5-6,8,13-14"   <=== Pod1 has 5 guaranteed CPUs
"cpus": "7,9,15"        <=== Pod2 has 3 guaranteed CPUs
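For completeness, the kubelet's CPU-manager view can be cross-checked against what runc received at create time; a minimal sketch, assuming jq is available on the node and the wrapper log from above (/root/create) is still in place:

  # Exclusive cpusets as recorded by the CPU manager, one per container.
  jq '.entries[] | .[]' /var/lib/kubelet/cpu_manager_state
  # Cpusets handed to runc at create time, as captured by the wrapper.
  grep '"cpus":' /root/create

The two views should agree: Guaranteed containers show their exclusive CPUs, and everything else shows the shared pool (0-4,10-12 in this run).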
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633