Bug 1920368 - Fix container creation issue resulting in runc running on Guaranteed Pod CPUs
Summary: Fix container creation issue resulting in runc running on Guaranteed Pod CPUs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Artyom
QA Contact: Walid A.
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-01-26 08:10 UTC by Marcel Apfelbaum
Modified: 2021-02-24 15:56 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:56:03 UTC
Target Upstream Version:
Embargoed:




Links:
- GitHub: openshift/kubernetes pull 541 (closed): "Bug 1920368: UPSTREAM: 98019: specify the container CPU set during the creation" (last updated 2021-02-02 02:48:21 UTC)
- Red Hat Product Errata: RHSA-2020:5633 (last updated 2021-02-24 15:56:27 UTC)

Description Marcel Apfelbaum 2021-01-26 08:10:24 UTC
Description of problem:
The runc process that is responsible for creating the container runs with the CPU affinity provided in the config.json argument.

Currently, the container creation flow first calls runc without specifying the cpuset, and only sets the cpuset in a subsequent update call.

In the meantime, runc can run on any CPU, including those assigned to Guaranteed Pods, potentially interfering with ultra-low-latency workloads.


Backporting https://github.com/kubernetes/kubernetes/pull/98019 solves the issue by passing the cpuset directly during container creation.
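
For reference, with the fix the cpuset is already present in the OCI config.json at create time, under linux.resources.cpu.cpus. A minimal illustrative fragment (the value shown is taken from the verification in Comment 5):

  {
    "linux": {
      "resources": {
        "cpu": {
          "cpus": "5-6,8,13-14"
        }
      }
    }
  }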

Comment 1 Marcel Apfelbaum 2021-01-26 16:21:47 UTC
Reproduction steps:

1. Use a kubelet config with:
    cpuManagerPolicy: static
    [...]
    reservedSystemCPUs: 0,1,...

2. Create a Guaranteed Pod (so some of the CPUs will be used)

3. Create a Pod (Guaranteed or not)

4. Verify the config.json of the new pod at create time (but before the container is started).
   It can be seen that the cpuset is not set.


In order to "catch" the config.json at the time the container is created, one can use a CRI-O runtime wrapper:
- Change the runtime path in the CRI-O config:
  [crio.runtime.runtimes.runc]
  runtime_path = "/usr/local/bin/runc-wrapper.sh"
- Use a wrapper like the following (full script in Comment 5):
  if [ -n "$3" ] && [ "$3" == "create" ] && [ -f "$5/config.json" ]; then
        conf="$5/config.json"
        ...
  fi

  /bin/runc "$@"
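
Wherever the wrapper writes the captured config.json (for example, appending it to a scratch file such as /root/create, as in the verification below), the presence or absence of the cpuset can then be checked with a simple grep:

  grep '"cpus"' /root/create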

Comment 5 Walid A. 2021-02-06 04:43:27 UTC
Tested and verified on OCP 4.7.0-0.nightly-2021-02-05-105159 on an AWS cluster with m5.4xlarge instances (16 vCPUs).

Followed reproduction steps in Comment 1.

kubeletconfig (excerpt):
...
  spec:
    kubeletConfig:
      cpuManagerPolicy: static
      cpuManagerReconcilePeriod: 5s
      reservedSystemCPUs: 0,1,2,3,4
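
For completeness, a minimal sketch of a full KubeletConfig CR along these lines (the metadata name and the machineConfigPoolSelector label are assumptions; adjust them to the worker pool being targeted):

  apiVersion: machineconfiguration.openshift.io/v1
  kind: KubeletConfig
  metadata:
    name: cpumanager-enabled
  spec:
    machineConfigPoolSelector:
      matchLabels:
        custom-kubelet: cpumanager-enabled
    kubeletConfig:
      cpuManagerPolicy: static
      cpuManagerReconcilePeriod: 5s
      reservedSystemCPUs: 0,1,2,3,4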

Edited the runtime path in the CRI-O config:
  [crio.runtime.runtimes.runc]
  runtime_path = "/usr/local/bin/runc-wrapper.sh"

systemctl restart crio

cat /usr/local/bin/runc-wrapper.sh
#!/bin/bash

# On "create", capture the config.json that cri-o passes to runc,
# before the container is actually created.
if [ -n "$3" ] && [ "$3" == "create" ] && [ -f "$5/config.json" ]; then
        conf="$5/config.json"
        cat $conf >> /root/create
fi

# Hand off to the real runc untouched.
exec /bin/runc "$@"

------

Cordoned 2 of the 3 worker nodes to force the pods to be scheduled on the CPU-manager-enabled worker node ip-10-0-128-206.us-east-2.compute.internal (cordon commands shown after the node listing below).
# oc get nodes
NAME                                         STATUS                     ROLES    AGE     VERSION
ip-10-0-128-206.us-east-2.compute.internal   Ready                      worker   6h25m   v1.20.0+68292b2
ip-10-0-138-89.us-east-2.compute.internal    Ready                      master   6h31m   v1.20.0+68292b2
ip-10-0-187-126.us-east-2.compute.internal   Ready                      master   6h30m   v1.20.0+68292b2
ip-10-0-188-63.us-east-2.compute.internal    Ready,SchedulingDisabled   worker   6h24m   v1.20.0+68292b2
ip-10-0-192-141.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   6h24m   v1.20.0+68292b2
ip-10-0-211-8.us-east-2.compute.internal     Ready                      master   6h30m   v1.20.0+68292b2
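
(The two workers were presumably cordoned with commands along these lines, using the node names listed above:)

  oc adm cordon ip-10-0-188-63.us-east-2.compute.internal
  oc adm cordon ip-10-0-192-141.us-east-2.compute.internal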

Deployed pod1 with 5 guaranteed CPUs:
oc create -f pod1_5cpu.yaml
# cat  pod1_5cpu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod1cpu5
  annotations:
spec:
  nodeSelector:
  containers:
  - name: appcntr1
    image: zenghui/centos-dpdk
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        cpu: 5
        memory: 100Mi
      limits:
        cpu: 5
        memory: 100Mi


After pod1cpu5 was running:

#  oc get pods
NAME                                            READY   STATUS    RESTARTS   AGE
ip-10-0-128-206us-east-2computeinternal-debug   1/1     Running   0          3m26s
pod1cpu5                                        1/1     Running   0          20s
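
As a quick sanity check, the pod's QoS class can be confirmed to be Guaranteed (requests equal limits for both CPU and memory):

  oc get pod pod1cpu5 -o jsonpath='{.status.qosClass}{"\n"}'
  Guaranteed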



Then deployed pod2cpu3:

# cat  pod2_3cpu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod2cpu3
  annotations:
spec:
  nodeSelector:
  containers:
  - name: appcntr1
    image: zenghui/centos-dpdk
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        cpu: 3
        memory: 100Mi
      limits:
        cpu: 3
        memory: 100Mi

oc create -f pod2_3cpu.yaml

# oc get pods
NAME                                            READY   STATUS    RESTARTS   AGE
ip-10-0-128-206us-east-2computeinternal-debug   1/1     Running   0          16m
pod1cpu5                                        1/1     Running   0          13m
pod2cpu3                                        1/1     Running   0          13m

# oc debug node/ip-10-0-128-206.us-east-2.compute.internal
# chroot /host
sh-4.4# cat /var/lib/kubelet/cpu_manager_state 
{"policyName":"static","defaultCpuSet":"0-4,10-12","entries":{"50f732a3-956f-404f-9b94-c49fece9bd5e":{"appcntr1":"7,9,15"},"7e5ce205-8fc5-4b59-9eea-80cd020044bc":{"appcntr1":"5-6,8,13-14"}},"checksum":320863805}sh-4.4# 

When pod2 was deployed after pod1, it was assigned only CPUs that were still available, verifying the fix.

cat /root/create | grep cpus:

...
				"cpus": "0-4,10-12"
				"cpus": "0-4,10-12"
				"cpus": "5-6,8,13-14"   <=== Pod1 has 5 guaranteed CPUs
				"cpus": "7,9,15"        <=== Pod2 has 3 guaranteed CPUs

Comment 8 errata-xmlrpc 2021-02-24 15:56:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

