Description of problem:

When using a pod with the static CPU policy (cpu-manager), the CPUs allocated to the container are not actually exclusive, because the cgroup that runs /usr/bin/pod is never restricted: its cpuset.cpus always contains all CPUs. The pod process is therefore allowed to run on the same CPUs that are supposed to be exclusive to the container.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Deploy a pod with the static CPU policy.
2. On the worker node, cd to /sys/fs/cgroup/cpuset/kubepods.slice/.
3. Identify the directory for the pod with the static policy (not kubepods-besteffort.slice or kubepods-burstable.slice, but something like "kubepods-podb30fc002_2acc_431b_9ac7_506dbe82477f.slice") and chdir into that directory.
4. Run systemd-cgls:

```
[root@worker0 kubepods-podb30fc002_2acc_431b_9ac7_506dbe82477f.slice]# systemd-cgls
Working directory /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-podb30fc002_2acc_431b_9ac7_506dbe82477f.slice:
├─crio-22260041f7fde4194777b5ead26c3d433bc75e654822910daf413bc741b80774.scope
│ ├─197747 /root/dumb-init -- /root/start_cyclictest.sh
│ ├─197772 /bin/bash /root/start_cyclictest.sh
│ └─197829 cyclictest -q -D 30m -p 99 -t 12 -a 10,12,14,16,18,20,22,24,26,4,6,8 -h 30 -m -n
└─crio-3312495e15788821ad0f37b62378f8130f5a6099616381875f5d3243b74bcd1a.scope
  └─194850 /usr/bin/pod
```

5. cat the cpuset.cpus files:

```
[root@worker0 kubepods-podb30fc002_2acc_431b_9ac7_506dbe82477f.slice]# cat crio-*/cpuset.cpus
4,6,8,10,12,14,16,18,20,22,24,26
0-31
```

6. The value for the actual container is the list of exclusive CPUs (4,6,8, etc.), while the one for /usr/bin/pod contains all CPUs.

Note that this is not limited to this pod; it affects every pod/container on the host. To fix it, every time a new container with the static policy is created and the resulting shared CPU pool is reduced, every single container running /usr/bin/pod needs its cpuset.cpus updated to use the reduced shared pool.
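To make the overlap concrete, here is a small sketch (not kubelet code; the kubelet uses its own cpuset package for this) that parses the two cpuset.cpus values shown above, in the kernel's list format, and counts how many of the "exclusive" CPUs the /usr/bin/pod cgroup can still run on:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList parses a kernel cpuset list string such as "0-31" or
// "4,6,8,10" into a set of CPU ids. Simplified sketch only.
func parseCPUList(s string) (map[int]bool, error) {
	cpus := map[int]bool{}
	for _, part := range strings.Split(strings.TrimSpace(s), ",") {
		if part == "" {
			continue
		}
		lo, hi, isRange := strings.Cut(part, "-")
		a, err := strconv.Atoi(lo)
		if err != nil {
			return nil, err
		}
		b := a
		if isRange {
			if b, err = strconv.Atoi(hi); err != nil {
				return nil, err
			}
		}
		for c := a; c <= b; c++ {
			cpus[c] = true
		}
	}
	return cpus, nil
}

func main() {
	// The two values read from crio-*/cpuset.cpus above.
	exclusive, _ := parseCPUList("4,6,8,10,12,14,16,18,20,22,24,26")
	infra, _ := parseCPUList("0-31") // the /usr/bin/pod cgroup
	overlap := 0
	for c := range exclusive {
		if infra[c] {
			overlap++
		}
	}
	// prints: infra container overlaps 12 of 12 exclusive CPUs
	fmt.Printf("infra container overlaps %d of %d exclusive CPUs\n", overlap, len(exclusive))
}
```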
This looks like a kubelet issue, moving accordingly.
We will need the CRI to pass us this information at RunPodSandbox time. We will have to pursue this upstream. One more possible feature that will help with this is getting rid of pause container for non-pid namespace sharing pods (we will likely get that in for 4.4).
A LinuxPodSandboxConfig struct is passed to RemoteRuntimeService#RunPodSandbox, where runtimeClient.RunPodSandbox() is called. It seems the information on the CPU restriction may need to be passed through there.
Looking at Server#runPodSandbox in server/sandbox_run_linux.go in cri-o, I only see one CPU-related setting:

```
g.SetLinuxResourcesCPUShares(PodInfraCPUshares)
```
Quoting Kevin from the sig-node channel with some background information:

The CPUManager was only designed to provide CPU isolation between processes managed by Kubernetes (i.e. processes running inside containers). It was not designed to provide isolation between Kubernetes-managed processes and native processes running outside of Kubernetes control (including the kubelet, docker, systemd-managed processes, etc.). Moreover, it was not designed to exclude certain CPUs from being used to run Kubernetes-managed processes vs. native processes.

As such, the CPUManager always has full view of all of the CPUs on the system — only isolating Kubernetes-managed processes onto different CPUs based on their QoS class. Processes in the Guaranteed QoS class are given “exclusive” access to CPUs, isolated from all other Kubernetes-managed processes running in the system. Processes in other QoS classes share access to whatever CPUs haven’t yet been handed out for “exclusive” access.

Having “exclusive” access to a CPU does not prohibit non-Kubernetes-managed processes from running it (it only prohibits other Kubernetes-managed processes from running on it).
Here is a document on topology awareness and scheduling: https://docs.google.com/document/d/1gPknVIOiu-c_fpLm53-jUAm-AGQQ8XC0hQYE7hq0L4c/edit#heading=h.6rgokjh84vcl
(In reply to Ted Yu from comment #5)
> Quoting Kevin from sig-node channel with some background information:
>
> Having “exclusive” access to a CPU does not prohibit non-Kubernetes-managed
> processes from running it (it only prohibits other Kubernetes-managed
> processes from running on it).

Is crio not considered a kubernetes-managed process? If not, then what created the cpuset for it:

```
/sys/fs/cgroup/cpuset/kubepods.slice/kubepods-podb30fc002_2acc_431b_9ac7_506dbe82477f.slice:
├─crio-22260041f7fde4194777b5ead26c3d433bc75e654822910daf413bc741b80774.scope
│ ├─197747 /root/dumb-init -- /root/start_cyclictest.sh
│ ├─197772 /bin/bash /root/start_cyclictest.sh
│ └─197829 cyclictest -q -D 30m -p 99 -t 12 -a 10,12,14,16,18,20,22,24,26,4,6,8 -h 30 -m -n
└─crio-3312495e15788821ad0f37b62378f8130f5a6099616381875f5d3243b74bcd1a.scope
  └─194850 /usr/bin/pod
```

How is crio/pod run? Is it a fork+exec from the kubelet? Or was it launched by something like systemd? If crio/pod was run by something else like systemd, then I completely understand the logic, and we can ensure that whatever starts it has a restricted cpu list. But if the kubelet is the one that starts it, I can't possibly see how it's someone else's problem.
The cgroup shown above was created via runc. crio creates the config.json file specifying what cgroup name must be used. runc calls the dbus systemd api for creating the cgroup.
(In reply to Ted Yu from comment #8)
> The cgroup shown above was created via runc.
> crio creates the config.json file specifying what cgroup name must be used.
> runc calls the dbus systemd api for creating the cgroup.

Ted, in this process, is there any way to:
1) Inform runc which cpus to use for its cgroup?
2) When runc calls the dbus systemd api to create the cgroup, can it assert which cpus should be used?
Looking at https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy : From 1.17, the CPU reservation list can be specified explicitly by kubelet --reserved-cpus option.
(In reply to Ted Yu from comment #10)
> Looking at
> https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy :
>
> From 1.17, the CPU reservation list can be specified explicitly by kubelet
> --reserved-cpus option.

Ted, we are already doing that. That is the point of this BZ: using --reserved-cpus does not restrict the cpus for crio/runc.
For cri-o, I found the following in Server#createSandboxContainer:

```
specgen.SetLinuxResourcesCPUCpus(resources.GetCpusetCpus())
```

CpusetCpus is filled in by LinuxContainerResources#Unmarshal. This seems to be how cpus are passed from k8s.
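A simplified sketch of that hand-off, using local mirror structs rather than the real k8s.io/cri-api and runtime-spec types: the CRI-level cpuset string is copied verbatim into the OCI spec, which runc then writes to the cgroup's cpuset.cpus.

```go
package main

import "fmt"

// Mirror of the relevant CRI LinuxContainerResources fields (simplified).
type LinuxContainerResources struct {
	CpuShares  int64
	CpusetCpus string
}

// Mirror of the relevant OCI LinuxCPU fields (simplified).
type LinuxCPU struct {
	Shares *uint64
	Cpus   string
}

// toOCICPU sketches what createSandboxContainer / toOCIResources do with
// the cpuset: no translation, just a straight copy of the list string.
func toOCICPU(r *LinuxContainerResources) *LinuxCPU {
	shares := uint64(r.CpuShares)
	return &LinuxCPU{Shares: &shares, Cpus: r.CpusetCpus}
}

func main() {
	cri := &LinuxContainerResources{CpuShares: 2048, CpusetCpus: "4,6,8"}
	// prints: 4,6,8
	fmt.Println(toOCICPU(cri).Cpus)
}
```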
From the cpu manager, manager#updateContainerCPUSet calls:

```
return m.containerRuntime.UpdateContainerResources(
    containerID,
    &runtimeapi.LinuxContainerResources{
        CpusetCpus: cpus.String(),
```

For cri-o:

```
func (s *Server) UpdateContainerResources(ctx context.Context, req *pb.UpdateContainerResourcesRequest) (resp *pb.UpdateContainerResourcesResponse, err error) {
	...
	resources := toOCIResources(req.GetLinux())
```

toOCIResources() has:

```
return &rspec.LinuxResources{
	CPU: &rspec.LinuxCPU{
		Shares: proto.Uint64(uint64(r.GetCpuShares())),
		Quota:  proto.Int64(r.GetCpuQuota()),
		Period: proto.Uint64(uint64(r.GetCpuPeriod())),
		Cpus:   r.GetCpusetCpus(),
```
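The cpus.String() value above is the kernel's CPU list format ("0-3,8,10"). A minimal sketch of such a formatter follows; it is an assumption-level stand-in for the kubelet's cpuset package, not its actual code:

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// formatCPUList renders a set of CPU ids in the kernel's list format,
// collapsing consecutive ids into ranges. Simplified sketch.
func formatCPUList(cpus []int) string {
	if len(cpus) == 0 {
		return ""
	}
	sorted := append([]int(nil), cpus...)
	sort.Ints(sorted)
	var parts []string
	start, prev := sorted[0], sorted[0]
	flush := func() {
		if start == prev {
			parts = append(parts, strconv.Itoa(start))
		} else {
			parts = append(parts, fmt.Sprintf("%d-%d", start, prev))
		}
	}
	for _, c := range sorted[1:] {
		if c == prev || c == prev+1 {
			prev = c
			continue
		}
		flush()
		start, prev = c, c
	}
	flush()
	return strings.Join(parts, ",")
}

func main() {
	// prints: 4,6,8,10-12
	fmt.Println(formatCPUList([]int{4, 6, 8, 10, 11, 12}))
}
```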
4.7 version attached
4.7 version merged!
Verified on 4.7.0-0.nightly-2021-01-19-033533.

Configured infra_ctr_cpuset with the MachineConfig object below, following this doc:
https://docs.openshift.com/container-platform/4.6/post_installation_configuration/machine-configuration-tasks.html#using-machineconfigs-to-change-machines

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-19-033533   True        False         28h     Cluster version is 4.7.0-0.nightly-2021-01-19-033533

$ oc get nodes -o wide
NAME                                        STATUS   ROLES    AGE   VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-56-217.us-east-2.compute.internal   Ready    master   28h   v1.20.0+d9c52cc   10.0.56.217   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
ip-10-0-59-181.us-east-2.compute.internal   Ready    master   28h   v1.20.0+d9c52cc   10.0.59.181   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
ip-10-0-63-227.us-east-2.compute.internal   Ready    worker   28h   v1.20.0+d9c52cc   10.0.63.227   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
ip-10-0-69-79.us-east-2.compute.internal    Ready    master   28h   v1.20.0+d9c52cc   10.0.69.79    <none>        Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
ip-10-0-70-235.
```
```
$ cat << EOF | base64
> [crio.runtime]
> infra_ctr_cpuset = "0"
> EOF
W2NyaW8ucnVudGltZV0KaW5mcmFfY3RyX2NwdXNldCA9ICIwIgo=

$ cat worker-cfg.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: workers-infra-ctr-cpuset
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 3.1.0
    networkd: {}
    passwd: {}
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,W2NyaW8ucnVudGltZV0KaW5mcmFfY3RyX2NwdXNldCA9ICIwIgo=
        mode: 420
        overwrite: true
        path: /etc/crio/crio.conf.d/01-infra_ctr_cpuset
  osImageURL: ""

$ oc create -f worker-cfg.yaml

$ oc get mc workers-infra-ctr-cpuset -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  creationTimestamp: "2021-01-19T16:09:00Z"
  generation: 1
  labels:
    machineconfiguration.openshift.io/role: worker
  managedFields:
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:machineconfiguration.openshift.io/role: {}
      f:spec:
        .: {}
        f:config:
          .: {}
          f:ignition:
            .: {}
            f:config: {}
            f:security:
              .: {}
              f:tls: {}
            f:timeouts: {}
            f:version: {}
          f:networkd: {}
          f:passwd: {}
          f:storage:
            .: {}
            f:files: {}
        f:osImageURL: {}
    manager: oc
    operation: Update
    time: "2021-01-19T16:09:00Z"
  name: workers-infra-ctr-cpuset
  resourceVersion: "174254"
  selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigs/workers-infra-ctr-cpuset
  uid: 8e31ef87-a9d3-4d8d-b546-8e8a59fa748a
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 3.1.0
    networkd: {}
    passwd: {}
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,W2NyaW8ucnVudGltZV0KaW5mcmFfY3RyX2NwdXNldCA9ICIwIgo=
        mode: 420
        overwrite: true
        path: /etc/crio/crio.conf.d/01-infra_ctr_cpuset
  osImageURL: ""

$ oc debug node/ip-10-0-70-235.us-east-2.compute.internal
Starting
```
```
pod/ip-10-0-70-235us-east-2computeinternal-debug ...
...
sh-4.4# cat infra_ctr_cpuset
[crio.runtime]
infra_ctr_cpuset = "0"
sh-4.4# pod_id=$(crictl pods -q | tail -1)
sh-4.4# cgroup=$(systemctl status crio-$pod_id.scope | grep CGroup | awk '{ printf $2 }')
sh-4.4# cat /sys/fs/cgroup/cpuset$cgroup/cpuset.cpus
0
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633