Description of problem:

When using a pod with the static CPU policy (cpu-manager), the CPUs allocated to the container are not actually exclusive, because the cgroup that runs /usr/bin/pod is never restricted: its cpuset.cpus always contains all CPUs. The pod process is therefore allowed to run on the same CPUs that are supposed to be exclusive to the container.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Deploy a pod with the static CPU policy.
2. On the worker node, cd to /sys/fs/cgroup/cpuset/kubepods.slice/.
3. Identify the directory for the pod with the static policy (not kubepods-besteffort.slice or kubepods-burstable.slice, but something like "kubepods-podb30fc002_2acc_431b_9ac7_506dbe82477f.slice") and chdir into that directory.
4. Run systemd-cgls:

```
[root@worker0 kubepods-podb30fc002_2acc_431b_9ac7_506dbe82477f.slice]# systemd-cgls
Working directory /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-podb30fc002_2acc_431b_9ac7_506dbe82477f.slice:
├─crio-22260041f7fde4194777b5ead26c3d433bc75e654822910daf413bc741b80774.scope
│ ├─197747 /root/dumb-init -- /root/start_cyclictest.sh
│ ├─197772 /bin/bash /root/start_cyclictest.sh
│ └─197829 cyclictest -q -D 30m -p 99 -t 12 -a 10,12,14,16,18,20,22,24,26,4,6,8 -h 30 -m -n
└─crio-3312495e15788821ad0f37b62378f8130f5a6099616381875f5d3243b74bcd1a.scope
  └─194850 /usr/bin/pod
```

5. cat the cpuset.cpus files:

```
[root@worker0 kubepods-podb30fc002_2acc_431b_9ac7_506dbe82477f.slice]# cat crio-*/cpuset.cpus
4,6,8,10,12,14,16,18,20,22,24,26
0-31
```

6. The value for the actual container is the list of exclusive CPUs (4,6,8, etc.), while the one for /usr/bin/pod contains all CPUs.

Note that this is not limited to this pod; it affects every pod/container on the host. To fix it, every time a new container with the static policy is created and the resulting shared CPU pool is reduced, every single container running /usr/bin/pod needs its cpuset.cpus updated to use the reduced shared pool.
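To make the overlap concrete, here is a small sketch (not kubelet code; the kubelet uses its own cpuset package for this) that parses the two cpuset.cpus values shown above, in the kernel's list format, and counts how many of the "exclusive" CPUs the /usr/bin/pod cgroup can still run on:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList parses a kernel cpuset list string such as "0-31" or
// "4,6,8,10" into a set of CPU ids. Simplified sketch only.
func parseCPUList(s string) (map[int]bool, error) {
	cpus := map[int]bool{}
	for _, part := range strings.Split(strings.TrimSpace(s), ",") {
		if part == "" {
			continue
		}
		lo, hi, isRange := strings.Cut(part, "-")
		a, err := strconv.Atoi(lo)
		if err != nil {
			return nil, err
		}
		b := a
		if isRange {
			if b, err = strconv.Atoi(hi); err != nil {
				return nil, err
			}
		}
		for c := a; c <= b; c++ {
			cpus[c] = true
		}
	}
	return cpus, nil
}

func main() {
	// The two values read from crio-*/cpuset.cpus above.
	exclusive, _ := parseCPUList("4,6,8,10,12,14,16,18,20,22,24,26")
	infra, _ := parseCPUList("0-31") // the /usr/bin/pod cgroup
	overlap := 0
	for c := range exclusive {
		if infra[c] {
			overlap++
		}
	}
	// prints: infra container overlaps 12 of 12 exclusive CPUs
	fmt.Printf("infra container overlaps %d of %d exclusive CPUs\n", overlap, len(exclusive))
}
```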
This looks like a kubelet issue, moving accordingly.
We will need the CRI to pass us this information at RunPodSandbox time. We will have to pursue this upstream. One more possible feature that will help with this is getting rid of pause container for non-pid namespace sharing pods (we will likely get that in for 4.4).
A LinuxPodSandboxConfig struct is passed to RemoteRuntimeService#RunPodSandbox, where runtimeClient.RunPodSandbox() is called. It seems the information on the CPU restriction may need to be passed through there.
Looking at Server#runPodSandbox in server/sandbox_run_linux.go in cri-o, I only see one CPU-related setting:

```
g.SetLinuxResourcesCPUShares(PodInfraCPUshares)
```
Quoting Kevin from the sig-node channel with some background information:

The CPUManager was only designed to provide CPU isolation between processes managed by Kubernetes (i.e. processes running inside containers). It was not designed to provide isolation between Kubernetes-managed processes and native processes running outside of Kubernetes control (including the kubelet, docker, systemd-managed processes, etc.). Moreover, it was not designed to exclude certain CPUs from being used to run Kubernetes-managed processes vs. native processes.

As such, the CPUManager always has full view of all of the CPUs on the system — only isolating Kubernetes-managed processes onto different CPUs based on their QoS class. Processes in the Guaranteed QoS class are given “exclusive” access to CPUs, isolated from all other Kubernetes-managed processes running in the system. Processes in other QoS classes share access to whatever CPUs haven’t yet been handed out for “exclusive” access.

Having “exclusive” access to a CPU does not prohibit non-Kubernetes-managed processes from running it (it only prohibits other Kubernetes-managed processes from running on it).
Here is a document on topology awareness and scheduling: https://docs.google.com/document/d/1gPknVIOiu-c_fpLm53-jUAm-AGQQ8XC0hQYE7hq0L4c/edit#heading=h.6rgokjh84vcl
(In reply to Ted Yu from comment #5)
> Quoting Kevin from sig-node channel with some background information:
>
> Having “exclusive” access to a CPU does not prohibit non-Kubernetes-managed
> processes from running it (it only prohibits other Kubernetes-managed
> processes from running on it).

Is crio not considered a kubernetes-managed process? If not, then what created the cpuset for it:

```
/sys/fs/cgroup/cpuset/kubepods.slice/kubepods-podb30fc002_2acc_431b_9ac7_506dbe82477f.slice:
├─crio-22260041f7fde4194777b5ead26c3d433bc75e654822910daf413bc741b80774.scope
│ ├─197747 /root/dumb-init -- /root/start_cyclictest.sh
│ ├─197772 /bin/bash /root/start_cyclictest.sh
│ └─197829 cyclictest -q -D 30m -p 99 -t 12 -a 10,12,14,16,18,20,22,24,26,4,6,8 -h 30 -m -n
└─crio-3312495e15788821ad0f37b62378f8130f5a6099616381875f5d3243b74bcd1a.scope
  └─194850 /usr/bin/pod
```

How is crio/pod run? Is it a fork+exec from the kubelet? Or was it launched by something like systemd? If crio/pod was run by something else like systemd, then I completely understand the logic, and we can ensure that whatever starts it has a restricted cpu list. But if the kubelet is the one that starts it, I can't possibly see how it's someone else's problem.
The cgroup shown above was created via runc. crio creates the config.json file specifying what cgroup name must be used. runc calls the dbus systemd api for creating the cgroup.
(In reply to Ted Yu from comment #8)
> The cgroup shown above was created via runc.
> crio creates the config.json file specifying what cgroup name must be used.
> runc calls the dbus systemd api for creating the cgroup.

Ted, in this process, is there any way to:
1) Inform runc which cpus to use for its cgroup?
2) When runc calls the dbus systemd api to create the cgroup, can it assert which cpus should be used?
Looking at https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy : From 1.17, the CPU reservation list can be specified explicitly by kubelet --reserved-cpus option.
(In reply to Ted Yu from comment #10)
> Looking at
> https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy :
>
> From 1.17, the CPU reservation list can be specified explicitly by kubelet
> --reserved-cpus option.

Ted, we are already doing that. That is the point of this BZ: using --reserved-cpus does not restrict the cpus for crio/runc.
For cri-o, I found the following in Server#createSandboxContainer:

```
specgen.SetLinuxResourcesCPUCpus(resources.GetCpusetCpus())
```

CpusetCpus is filled in by LinuxContainerResources#Unmarshal. This seems to be how cpus are passed from k8s.
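A simplified sketch of that hand-off, using local mirror structs rather than the real k8s.io/cri-api and runtime-spec types: the CRI-level cpuset string is copied verbatim into the OCI spec, which runc then writes to the cgroup's cpuset.cpus.

```go
package main

import "fmt"

// Mirror of the relevant CRI LinuxContainerResources fields (simplified).
type LinuxContainerResources struct {
	CpuShares  int64
	CpusetCpus string
}

// Mirror of the relevant OCI LinuxCPU fields (simplified).
type LinuxCPU struct {
	Shares *uint64
	Cpus   string
}

// toOCICPU sketches what createSandboxContainer / toOCIResources do with
// the cpuset: no translation, just a straight copy of the list string.
func toOCICPU(r *LinuxContainerResources) *LinuxCPU {
	shares := uint64(r.CpuShares)
	return &LinuxCPU{Shares: &shares, Cpus: r.CpusetCpus}
}

func main() {
	cri := &LinuxContainerResources{CpuShares: 2048, CpusetCpus: "4,6,8"}
	// prints: 4,6,8
	fmt.Println(toOCICPU(cri).Cpus)
}
```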
From the cpu manager, manager#updateContainerCPUSet calls:

```
return m.containerRuntime.UpdateContainerResources(
    containerID,
    &runtimeapi.LinuxContainerResources{
        CpusetCpus: cpus.String(),
```

For cri-o:

```
func (s *Server) UpdateContainerResources(ctx context.Context, req *pb.UpdateContainerResourcesRequest) (resp *pb.UpdateContainerResourcesResponse, err error) {
	...
	resources := toOCIResources(req.GetLinux())
```

toOCIResources() has:

```
return &rspec.LinuxResources{
	CPU: &rspec.LinuxCPU{
		Shares: proto.Uint64(uint64(r.GetCpuShares())),
		Quota:  proto.Int64(r.GetCpuQuota()),
		Period: proto.Uint64(uint64(r.GetCpuPeriod())),
		Cpus:   r.GetCpusetCpus(),
```
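The cpus.String() value above is the kernel's CPU list format ("0-3,8,10"). A minimal sketch of such a formatter follows; it is an assumption-level stand-in for the kubelet's cpuset package, not its actual code:

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// formatCPUList renders a set of CPU ids in the kernel's list format,
// collapsing consecutive ids into ranges. Simplified sketch.
func formatCPUList(cpus []int) string {
	if len(cpus) == 0 {
		return ""
	}
	sorted := append([]int(nil), cpus...)
	sort.Ints(sorted)
	var parts []string
	start, prev := sorted[0], sorted[0]
	flush := func() {
		if start == prev {
			parts = append(parts, strconv.Itoa(start))
		} else {
			parts = append(parts, fmt.Sprintf("%d-%d", start, prev))
		}
	}
	for _, c := range sorted[1:] {
		if c == prev || c == prev+1 {
			prev = c
			continue
		}
		flush()
		start, prev = c, c
	}
	flush()
	return strings.Join(parts, ",")
}

func main() {
	// prints: 4,6,8,10-12
	fmt.Println(formatCPUList([]int{4, 6, 8, 10, 11, 12}))
}
```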
4.7 version attached
4.7 version merged!
Verified on 4.7.0-0.nightly-2021-01-19-033533.

Configured infra_ctr_cpuset with the MachineConfig object below, following this doc:
https://docs.openshift.com/container-platform/4.6/post_installation_configuration/machine-configuration-tasks.html#using-machineconfigs-to-change-machines

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-19-033533   True        False         28h     Cluster version is 4.7.0-0.nightly-2021-01-19-033533

$ oc get nodes -o wide
NAME                                        STATUS   ROLES    AGE   VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-56-217.us-east-2.compute.internal   Ready    master   28h   v1.20.0+d9c52cc   10.0.56.217   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
ip-10-0-59-181.us-east-2.compute.internal   Ready    master   28h   v1.20.0+d9c52cc   10.0.59.181   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
ip-10-0-63-227.us-east-2.compute.internal   Ready    worker   28h   v1.20.0+d9c52cc   10.0.63.227   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
ip-10-0-69-79.us-east-2.compute.internal    Ready    master   28h   v1.20.0+d9c52cc   10.0.69.79    <none>        Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
ip-10-0-70-235.
```
```
$ cat << EOF | base64
> [crio.runtime]
> infra_ctr_cpuset = "0"
> EOF
W2NyaW8ucnVudGltZV0KaW5mcmFfY3RyX2NwdXNldCA9ICIwIgo=

$ cat worker-cfg.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: workers-infra-ctr-cpuset
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 3.1.0
    networkd: {}
    passwd: {}
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,W2NyaW8ucnVudGltZV0KaW5mcmFfY3RyX2NwdXNldCA9ICIwIgo=
        mode: 420
        overwrite: true
        path: /etc/crio/crio.conf.d/01-infra_ctr_cpuset
  osImageURL: ""

$ oc create -f worker-cfg.yaml

$ oc get mc workers-infra-ctr-cpuset -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  creationTimestamp: "2021-01-19T16:09:00Z"
  generation: 1
  labels:
    machineconfiguration.openshift.io/role: worker
  managedFields:
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:machineconfiguration.openshift.io/role: {}
      f:spec:
        .: {}
        f:config:
          .: {}
          f:ignition:
            .: {}
            f:config: {}
            f:security:
              .: {}
              f:tls: {}
            f:timeouts: {}
            f:version: {}
          f:networkd: {}
          f:passwd: {}
          f:storage:
            .: {}
            f:files: {}
        f:osImageURL: {}
    manager: oc
    operation: Update
    time: "2021-01-19T16:09:00Z"
  name: workers-infra-ctr-cpuset
  resourceVersion: "174254"
  selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigs/workers-infra-ctr-cpuset
  uid: 8e31ef87-a9d3-4d8d-b546-8e8a59fa748a
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 3.1.0
    networkd: {}
    passwd: {}
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,W2NyaW8ucnVudGltZV0KaW5mcmFfY3RyX2NwdXNldCA9ICIwIgo=
        mode: 420
        overwrite: true
        path: /etc/crio/crio.conf.d/01-infra_ctr_cpuset
  osImageURL: ""

$ oc debug node/ip-10-0-70-235.us-east-2.compute.internal
Starting
```
```
pod/ip-10-0-70-235us-east-2computeinternal-debug ...
...
sh-4.4# cat infra_ctr_cpuset
[crio.runtime]
infra_ctr_cpuset = "0"
sh-4.4# pod_id=$(crictl pods -q | tail -1)
sh-4.4# cgroup=$(systemctl status crio-$pod_id.scope | grep CGroup | awk '{ printf $2 }')
sh-4.4# cat /sys/fs/cgroup/cpuset$cgroup/cpuset.cpus
0
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633