Bug 1920368
Summary: | Fix containers creation issue resulting in runc running on Guaranteed Pod CPUs | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Marcel Apfelbaum <mapfelba>
Component: | Node | Assignee: | Artyom <alukiano>
Node sub component: | CPU manager | QA Contact: | Walid A. <wabouham>
Status: | CLOSED ERRATA | Severity: | high
Version: | 4.7 | Target Release: | 4.7.0
Priority: | unspecified | CC: | alukiano, aos-bugs, ddharwar, mifiedle, nagrawal, rpattath
Hardware: | Unspecified | OS: | Unspecified
Last Closed: | 2021-02-24 15:56:03 UTC | Type: | Bug
Description
Marcel Apfelbaum
2021-01-26 08:10:24 UTC
Reproduction steps:

1. Use a kubelet config with:

   ```yaml
   cpuManagerPolicy: static
   [...]
   reservedSystemCPUs: 0,1,...
   ```

2. Create a Guaranteed pod (so some of the CPUs will be used).
3. Create another pod (Guaranteed or not).
4. Inspect the config.json of the new pod at create time (but before the container is started): it can be seen that the cpuset is not set.

In order to "catch" the config.json at the time the container is created, one can use a cri-o wrapper:

- Change the runtime path in the crio config:

  ```toml
  [crio.runtime.runtimes.runc]
  runtime_path = "/usr/local/bin/runc-wrapper.sh"
  ```

- Use a wrapper like:

  ```bash
  if [ -n "$3" ] && [ "$3" == "create" ] && [ -f "$5/config.json" ]; then
    conf="$5/config.json"
    ...
  fi
  /bin/runc "$@"
  ```

Tested and verified on OCP 4.7.0-0.nightly-2021-02-05-105159, on an AWS cluster with m5.4xlarge instances (16 vCPUs). Followed the reproduction steps in Comment 1.

kubeletconfig:

```yaml
.
.
spec:
  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
    reservedSystemCPUs: 0,1,2,3,4
```

Edited the runtime path in the crio config and restarted cri-o:

```toml
[crio.runtime.runtimes.runc]
runtime_path = "/usr/local/bin/runc-wrapper.sh"
```

```
# systemctl restart crio
# cat /usr/local/bin/runc-wrapper.sh
#!/bin/bash
# Append each container's OCI config.json to /root/create at create
# time, then hand off to the real runtime.
if [ -n "$3" ] && [ "$3" == "create" ] && [ -f "$5/config.json" ]; then
  conf="$5/config.json"
  cat $conf >> /root/create
fi
exec /bin/runc "$@"
```

Cordoned 2 of the 3 worker nodes to force the pods onto the CPU-manager-enabled worker node ip-10-0-128-206.us-east-2.compute.internal:

```
# oc get nodes
NAME                                         STATUS                     ROLES    AGE     VERSION
ip-10-0-128-206.us-east-2.compute.internal   Ready                      worker   6h25m   v1.20.0+68292b2
ip-10-0-138-89.us-east-2.compute.internal    Ready                      master   6h31m   v1.20.0+68292b2
ip-10-0-187-126.us-east-2.compute.internal   Ready                      master   6h30m   v1.20.0+68292b2
ip-10-0-188-63.us-east-2.compute.internal    Ready,SchedulingDisabled   worker   6h24m   v1.20.0+68292b2
ip-10-0-192-141.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   6h24m   v1.20.0+68292b2
ip-10-0-211-8.us-east-2.compute.internal     Ready                      master   6h30m   v1.20.0+68292b2
```

Deployed pod1 with 5 guaranteed CPUs: oc create -f pod1_5cpu.yaml

```yaml
# cat pod1_5cpu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod1cpu5
  annotations:
spec:
  nodeSelector:
  containers:
  - name: appcntr1
    image: zenghui/centos-dpdk
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        cpu: 5
        memory: 100Mi
      limits:
        cpu: 5
        memory: 100Mi
```

After pod1cpu5 was running:

```
# oc get pods
NAME                                            READY   STATUS    RESTARTS   AGE
ip-10-0-128-206us-east-2computeinternal-debug   1/1     Running   0          3m26s
pod1cpu5                                        1/1     Running   0          20s
```

Then deployed pod2cpu3:

```yaml
# cat pod2_3cpu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod2cpu3
  annotations:
spec:
  nodeSelector:
  containers:
  - name: appcntr1
    image: zenghui/centos-dpdk
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        cpu: 3
        memory: 100Mi
      limits:
        cpu: 3
        memory: 100Mi
```

```
# oc create -f pod2_3cpu.yaml
# oc get pods
NAME                                            READY   STATUS    RESTARTS   AGE
ip-10-0-128-206us-east-2computeinternal-debug   1/1     Running   0          16m
pod1cpu5                                        1/1     Running   0          13m
pod2cpu3                                        1/1     Running   0          13m
```

Checked the CPU manager state on the node:

```
# oc debug node/ip-10-0-128-206.us-east-2.compute.internal
# chroot /host
sh-4.4# cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"0-4,10-12","entries":{"50f732a3-956f-404f-9b94-c49fece9bd5e":{"appcntr1":"7,9,15"},"7e5ce205-8fc5-4b59-9eea-80cd020044bc":{"appcntr1":"5-6,8,13-14"}},"checksum":320863805}
```

When pod2 was deployed after pod1, only the CPUs still available (neither reserved nor already assigned to pod1) were allocated to it, verifying the fix. The sketches below recap how the state file can be decoded, how the kubelet settings are applied on OpenShift, and how the cpusets can be pulled out of the wrapper's capture.
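The cpu_manager_state file is single-line JSON and awkward to read by eye. A minimal decoding sketch, assuming jq is available on the node (it is not guaranteed on RHCOS, so it may need to run from a toolbox or debug container):

```bash
# Print the shared default cpuset, then one line per pod-UID/container
# pair with its exclusively assigned cpuset.
jq -r '"defaultCpuSet: \(.defaultCpuSet)",
       (.entries | to_entries[]
        | .key as $pod
        | .value | to_entries[]
        | "\($pod)/\(.key): \(.value)")' /var/lib/kubelet/cpu_manager_state
```

Against the dump above this would print `defaultCpuSet: 0-4,10-12` plus the two appcntr1 assignments (`7,9,15` and `5-6,8,13-14`).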
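For reference, the kubeletConfig snippet shown earlier is delivered on OpenShift through a KubeletConfig custom resource tied to a MachineConfigPool. A minimal sketch; the pool label `custom-kubelet: cpumanager-enabled` is an assumption and has to match a label actually set on the target worker pool:

```bash
# Sketch: enable the static CPU manager policy via a KubeletConfig CR.
# The machineConfigPoolSelector label below is hypothetical; adjust it
# to match the MachineConfigPool that holds the test worker node.
oc create -f - <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpumanager-enabled
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled
  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
    reservedSystemCPUs: "0,1,2,3,4"
EOF
```

The Machine Config Operator then rolls the rendered kubelet configuration out to the selected nodes.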
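The grep in the next step works on the raw capture; an alternative is to have the wrapper log only the cpuset field of the OCI runtime spec (`.linux.resources.cpu.cpus`). A sketch of that variant, again assuming jq is installed on the host; with the bug present, creates of guaranteed pods would log `unset` here:

```bash
#!/bin/bash
# Variant of runc-wrapper.sh: at create time, log just the container's
# cpuset (or "unset" when the field is missing, i.e. the buggy case),
# then hand off to the real runtime.
if [ -n "$3" ] && [ "$3" == "create" ] && [ -f "$5/config.json" ]; then
  jq -r '.linux.resources.cpu.cpus // "unset"' "$5/config.json" >> /root/create
fi
exec /bin/runc "$@"
```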
The cpusets captured by the wrapper confirm this:

```
# cat /root/create | grep cpus:
.
.
.
"cpus": "0-4,10-12"
"cpus": "0-4,10-12"
"cpus": "5-6,8,13-14"   <=== Pod1 has 5 guaranteed CPUs
"cpus": "7,9,15"        <=== Pod2 has 3 guaranteed CPUs
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633