Description of problem:

IHAC who is not able to apply CPU pinning for some of the pods that OpenShift runs, such as coredns. We have set isolcpus to isolate a set of cores for the workloads; these workloads are sensitive to context switches. isolcpus is set both through a Tuned CR and with a MachineConfig, and the isolcpus change is visible on the node.

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-realtime
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift realtime profile
      include=openshift-node,realtime

      [variables]
      # isolated_cores take a list of ranges; e.g. isolated_cores=2,4-7
      isolated_cores=1,2
    name: openshift-realtime
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-cputest"
    priority: 30
    profile: openshift-realtime

[root@ipiocp45-tfcml-worker-wkdmm ~]# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-cf9735b7120f73237bff0e6a2b255b9c1b30e10189d4180152ef9e9e8402cc09/vmlinuz-4.18.0-193.14.3.el8_2.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ostree=/ostree/boot.1/rhcos/cf9735b7120f73237bff0e6a2b255b9c1b30e10189d4180152ef9e9e8402cc09/0 ignition.platform.id=openstack skew_tick=1 isolcpus=1,2 intel_pstate=disable nosoftlockup tsc=nowatchdog

[root@ipiocp45-tfcml-worker-wkdmm ~]# crictl ps -a | grep -i core
e104a5e5fab89  229dbdee63b061c21623adb8ae30dad4ad25ddf8f48b3ffca0509a95bf9b0d13  4 minutes ago  Running  coredns  0  7949e23d9be5a

[root@ipiocp45-tfcml-worker-wkdmm ~]# crictl inspect e104a5e5fab89 | grep -i pid
    "pid": 2172,
      "pids": {
        "type": "pid"

[root@ipiocp45-tfcml-worker-wkdmm ~]# taskset -acp 2172
pid 2172's current affinity list: 0-3

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Configure isolcpus through MachineConfig or Tuned.
2. Once the changes are applied and the pods restart, inspect one of the containers to check its CPU affinity.
3. Once that is done, deploy a container with the Guaranteed QoS class and check its CPU affinity.

Actual results:

Expected results:

Additional info:
These are the other processes that have access to all the CPUs (including the ones isolated for the workload); some names are truncated by ps: haproxy, coredns, openshift-route, prometheus-conf, kube-rbac-proxy, oauth-proxy, kube-state-metr, thanos, cm-adapter, prometheus, prom-kube-proxy, alertmanager, configmap-reloa, node_exporter, ovs-vswitchd
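Overlap between a process's affinity list and the isolated set is what the report is flagging. As an illustrative helper (not part of the original report), the kernel-style CPU lists printed by `taskset -cp` can be expanded and intersected like this:

```bash
#!/bin/bash
# Expand a kernel-style CPU list such as "1,2" or "0-3,8" to one CPU per line.
expand_cpulist() {
  local range
  for range in ${1//,/ }; do
    if [[ $range == *-* ]]; then
      seq "${range%-*}" "${range#*-}"
    else
      echo "$range"
    fi
  done
}

# Print the CPUs present in BOTH lists, e.g. a process affinity list from
# `taskset -cp` versus the isolcpus= set.
overlap() {
  comm -12 <(expand_cpulist "$1" | sort) <(expand_cpulist "$2" | sort)
}

overlap "0-3" "1,2"   # coredns affinity 0-3 vs isolated CPUs 1,2: prints 1 and 2
```

A non-empty result for a process like coredns against isolcpus=1,2 means the process can still be scheduled onto the isolated cores.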
Can you please provide information on how the kubelet was configured? Did you use --reserved-cpus? Did you also take a look at https://www.openshift.com/blog/node-tuning-operator-and-friends-in-openshift-4.5?
Sharing info from another environment. From the worker node:

cat /etc/kubernetes/kubelet.conf | grep -i reserved
  "systemReserved": {
  "kubeReserved": {
  "reservedSystemCPUs": "0,36,18,54"

isolcpus=1-17,19-35,37-53,55-71

cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-76a449d5c5374a21595629fed75f2051af40f4108b358d9c52219c4a9aca3750/vmlinuz-4.18.0-193.14.3.el8_2.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal rd.luks.options=discard ostree=/ostree/boot.0/rhcos/76a449d5c5374a21595629fed75f2051af40f4108b358d9c52219c4a9aca3750/0 default_hugepagesz=1G hugepagesz=1G isolcpus=1-17,19-35,37-53,55-71 hugepages=228 skew_tick=1 nohz=on nohz_full=1-17,19-35,37-53,55-71 rcu_nocbs=1-17,19-35,37-53,55-71 nosoftlockup intel_pstate=disable tsc=nowatchdog

oc describe no worker01 | grep -i cpu:
  cpu: 72   // Capacity //
  cpu: 68   // Allocatable //
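As a cross-check (an illustration, not from the thread): on this 72-CPU node the reserved set 0,18,36,54 and the isolcpus= set 1-17,19-35,37-53,55-71 should be exact complements. A small sketch to verify that:

```bash
#!/bin/bash
# List the CPUs of a 0..(n-1) node that are NOT in the reserved set.
# On the node above: 72 CPUs minus reserved 0,18,36,54 leaves the 68 CPUs
# covered by isolcpus=1-17,19-35,37-53,55-71.
non_reserved() {
  local ncpus=$1 reserved=",$2," cpu out=""
  for ((cpu = 0; cpu < ncpus; cpu++)); do
    [[ $reserved == *",$cpu,"* ]] || out+="$cpu "
  done
  echo "${out% }"
}

non_reserved 72 "0,36,18,54" | wc -w   # 68 CPUs left for workloads
```

The count of 68 is consistent with the Allocatable cpu value shown by `oc describe no` above.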
Hello Team,

Thanks for the update. Yes, I have configured --reserved-cpus using a KubeletConfig. As for the other link, I started off with a similar approach, using a custom tuned profile exactly like the one in comment 0, but could not see the containers picking up that CPU affinity. Let me know if you are looking for more details on it. Thank you.
I believe this bug report has nothing to do with isolcpus. The good news is I was able to reproduce this issue and file a Tuned daemon bug: https://bugzilla.redhat.com/show_bug.cgi?id=1890219

However, there is more to low-latency tuning than just using --reserved-cpus (reservedSystemCPUs) for the kubelet, using isolcpus and simply applying the realtime profile provided by the Node Tuning Operator (NTO). Also, using isolcpus on OpenShift may be limiting, as pointed out, for example, here: https://github.com/openshift-kni/performance-addon-operators/blob/master/docs/cpu-load-balancing.md

I'd encourage you to look at the Performance Addon Operator (PAO) https://github.com/openshift-kni/performance-addon-operators and the extra settings/tuning it does on top of using NTO. Failing that, I'd recommend using (a variation of) the cpu-partitioning profile, which sets the CPUAffinity via systemd and thus works around BZ1890219. Currently, you are likely to get the best latency on OpenShift for your workloads by taking full advantage of PAO.
Thanks for trying and providing the feedback. We tried isolcpus with a MachineConfig as well, without using tuned, and observed the same thing in both cases. That's why we filed the original report about isolcpus not being honored by OpenShift pods. A Guaranteed-class pod, however, does use CPUs from what is defined in isolcpus, i.e. 1-17,19-35,37-53,55-71. An example workload:

cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-pod2fc121f4_6155_4f43_9dd3_93d5b271fbec.slice/crio-0145439c468da093d4a887bbdd07611788b2727c269905273e4fad839521bc16.scope/cpuset.cpus
1,37

However, the OpenShift pod "node-exporter" has access to all the CPUs:

crictl ps |grep 0c83383
0c83383269c04  f6d2edb82cae9d243bde88e80ab5578d9f1aa133680a6cbcc3e95538b045b57f  22 hours ago  Running  node-exporter  0  7e8c0396775cd

cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod19171c12_416a_444e_a8cc_231a26653ad9.slice/cpuset.cpus
0-71

Can you confirm whether this behaviour would change once the tuned team makes changes? Or is this because systemd doesn't understand isolcpus, and hence the changes are not passed on to the non-guaranteed pods? I assume tuned also has to pass some info to systemd, hence setting the kernel args directly through MachineConfig shouldn't be done.

Thanks for the link about PAO. I understand it is expected to land in OCP 4.6, and currently we are on 4.5.
Ananth,

instead of looking at the pod cpuset, can you take a look at the container cpuset (one level down)? Also, please make sure it is an active container. Thanks.
Hi Ananth,

(In reply to Ananth from comment #8)
> Can you confirm whether this behaviour would change once tuned team makes
> changes?
No, BZ1890219 will only fix the issue with CPUAffinity not being set for some processes under heavy load. This is not related. But please see Jianzhu's comment #9 above.

> changes needed are not passed to the non guaranteed pods. I assume tuned
> also has to pass some info to systemd
It does, via the [systemd] plugin. You can use systemd to set CPUAffinity directly, or the Tuned daemon and one of the Tuned profiles that does this for you. One example of such a profile is the cpu-partitioning profile.
(In reply to jianzzha from comment #9)
> instead of looking at the pod cpuset, can you take a look of the container
> cpuset (one level down), also please make sure it is an active container.

Thanks for your reply. Do you think this is ok?

cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod19171c12_416a_444e_a8cc_231a26653ad9.slice/crio-edf2609c4e9944bd24d6e5b786a0ec1f28317f7d41a9849d939026ebecd6b3e6.scope/cpuset.cpus
0-71

ps aux|grep 4015|grep -v grep
nobody  4015  1.3  0.0 119736 36732 ?  Ssl  10:21  4:08 /bin/node_exporter --web.listen-address=127.0.0.1:9100 --path.procfs=/host/proc --path.sysfs=/host/sys --path.rootfs=/host/root --no-collector.wifi --no-collector.hwmon --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/) --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$ --collector.mountstats --collector.cpu.info --collector.textfile.directory=/var/node_exporter/textfile

grep edf2609c4e99 /proc/4015/cgroup |grep cpuset
12:cpuset:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod19171c12_416a_444e_a8cc_231a26653ad9.slice/crio-edf2609c4e9944bd24d6e5b786a0ec1f28317f7d41a9849d939026ebecd6b3e6.scope
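The lookup above (PID → cpuset cgroup → cpuset.cpus file) follows a mechanical path mapping under cgroup v1. A small sketch of that mapping (illustrative only; the shortened container ID is hypothetical):

```bash
#!/bin/bash
# Map one "N:cpuset:<path>" line from /proc/<pid>/cgroup (cgroup v1) to the
# container's cpuset.cpus file under /sys/fs/cgroup.
cpuset_file() {
  local line=$1
  # Strip everything through ":cpuset:" and append the cpuset mount prefix.
  echo "/sys/fs/cgroup/cpuset${line#*:cpuset:}/cpuset.cpus"
}

cpuset_file "12:cpuset:/kubepods.slice/kubepods-burstable.slice/crio-edf2609c.scope"
# prints: /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/crio-edf2609c.scope/cpuset.cpus
```

Reading the resulting file on the node gives the effective cpuset for that container, which is what comment #9 asks to check.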
(In reply to jmencak from comment #10)
> No, BZ1890219 will only fix the issue with CPUAffinity not being set for some
> processes under heavy load. This is not related. But please see Jianzhu's
> comment #9 above.
>
> It does via the [systemd] plugin. You can use systemd to set CPUAffinity
> directly or the Tuned daemon and one of Tuned profiles that does this
> for you. One example of such a profile is the cpu-partitioning profile.

Thanks for the clarification and support. This explains why the isolcpus kernel arg should never be set through MachineConfig. I was expecting that setting the "openshift-realtime" profile along with "isolated_cores" would set the affinity correctly through systemd, but it didn't happen.

Worker node:

crictl exec -ti 7491009adbc9e tuned-adm active
Cannot talk to Tuned daemon via DBus. Is Tuned daemon running?
Current active profile: openshift-realtime

Also changed to cpu-partitioning: // surprisingly, the nodes didn't move to the updating state in the MachineConfigPool while applying the change, and there was no reboot, but the profile got changed //

crictl exec -ti 54a78cc60bbf0 tuned-adm active
Cannot talk to Tuned daemon via DBus. Is Tuned daemon running?
Current active profile: cpu-partitioning

grep 6810878849b2ad2eea1e /proc/3841/cgroup |grep -i cpuse
8:cpuset:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod19171c12_416a_444e_a8cc_231a26653ad9.slice/crio-6810878849b2ad2eea1e7d3c632c2853b845d2e258401fe16d7fa95bc23b0a7d.scope

But:

cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod19171c12_416a_444e_a8cc_231a26653ad9.slice/crio-6810878849b2ad2eea1e7d3c632c2853b845d2e258401fe16d7fa95bc23b0a7d.scope/cpuset.cpus
0-71

cat /etc/systemd/system.conf|grep -i affi
#CPUAffinity=1 2

Tried to set systemd manually and rebooted the node, but that didn't help either:

cat /etc/systemd/system.conf|grep -i cpu
#CPUAffinity=1 2
CPUAffinity=0 36 18 54

After reboot:

crictl inspect 4d2ad1225ff77 |grep -i pid
    "pid": 3981,
      "pids"

taskset -acp 3981
pid 3981's current affinity list: 0-71

I think something is missing.
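One pitfall visible in the system.conf output above is the mix of commented and uncommented CPUAffinity lines. A hedged sketch (the helper name and the throwaway file are illustrative) to print only the assignments systemd will actually read:

```bash
#!/bin/bash
# Print uncommented CPUAffinity= assignments from a systemd config file;
# lines starting with '#' (like "#CPUAffinity=1 2") are inert comments.
effective_affinity() {
  awk -F= '/^[[:space:]]*CPUAffinity=/ { print $2 }' "$1"
}

# Demo against a throwaway copy of the settings shown above:
conf=$(mktemp)
printf '%s\n' '#CPUAffinity=1 2' 'CPUAffinity=0 36 18 54' > "$conf"
effective_affinity "$conf"   # prints: 0 36 18 54
rm -f "$conf"
```

This only shows what the file declares; whether the service manager applied it still needs checking on the node, e.g. with `taskset -cp 1` after a reboot.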
(In reply to Ananth from comment #11)
> Do you think this is ok
>
> cat
> /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/kubepods-
> burstable-pod19171c12_416a_444e_a8cc_231a26653ad9.slice/crio-
> edf2609c4e9944bd24d6e5b786a0ec1f28317f7d41a9849d939026ebecd6b3e6.scope/
> cpuset.cpus
> 0-71

Does this happen when a guaranteed container is actively running at the same time?
(In reply to jianzzha from comment #13)
> this happens when a guaranteed container is actively running at the same
> time?

Hello jianzzha, sorry for the delay in responding. Guaranteed pods are running on the same nodes. I can delete those and reboot to see if the burstable OpenShift pods get pinned to the 4 CPUs mentioned in the tuned file. Since the OpenShift service pods are burstable/besteffort, is it guaranteed that the cpusets of guaranteed and non-guaranteed pods will not overlap?
The testing below was done on a 4-CPU worker node:

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.14    True        False         24h     Cluster version is 4.5.14

$ oc get no
NAME                                                          STATUS   ROLES              AGE   VERSION
jmencak-rlkxh-master-0.c.openshift-gce-devel.internal         Ready    master             24h   v1.18.3+970c1b3
jmencak-rlkxh-master-1.c.openshift-gce-devel.internal         Ready    master             24h   v1.18.3+970c1b3
jmencak-rlkxh-master-2.c.openshift-gce-devel.internal         Ready    master             24h   v1.18.3+970c1b3
jmencak-rlkxh-worker-a-mrp87.c.openshift-gce-devel.internal   Ready    worker,worker-rt   24h   v1.18.3+970c1b3
jmencak-rlkxh-worker-b-4l6w9.c.openshift-gce-devel.internal   Ready    worker             24h   v1.18.3+970c1b3

$ oc get mcp
NAME        CONFIG                                                UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master      rendered-master-336ecec11e782f3f5ad97700156db165      True      False      False      3              3                   3                     0                      28h
worker      rendered-worker-6431cf3d87d8ecb07ade38b4b1c6a351      True      False      False      1              1                   1                     0                      28h
worker-rt   rendered-worker-rt-fc645d967f03f29c676a916d42fbc53e   True      False      False      1              1                   1                     0                      17h

For kubelet configuration, I've used the following CR:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: kubelet-worker-rt
spec:
  machineConfigPoolSelector:
    matchLabels:
      worker-rt: ""
  kubeletConfig:
    systemReserved:
      cpu: 1000m
      memory: 500Mi
    kubeReserved:
      cpu: 1000m
      memory: 500Mi
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
    reservedSystemCPUs: "0-1"

$ oc debug no/jmencak-rlkxh-worker-a-mrp87.c.openshift-gce-devel.internal
sh-4.4# cd /sys/fs/cgroup/cpuset/kubepods.slice
sh-4.4# find /sys/fs/cgroup/cpuset/kubepods.slice -name cpuset.cpus -exec grep -H 0-3 {} \;|wc -l
130
sh-4.4# find /sys/fs/cgroup/cpuset/kubepods.slice -name cpuset.cpus -exec grep -H . {} \;|wc -l
130

$ oc create -f- <<EOF
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-cpu-partitioning
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=A custom OpenShift profile
      include=openshift-node,cpu-partitioning

      [variables]
      # isolated_cores take a list of ranges; e.g. isolated_cores=2,4-7
      isolated_cores=2-3

      [bootloader]
      cmdline_ocp_cpu_partitioning=+systemd.cpu_affinity=${not_isolated_cores_expanded}
    name: openshift-cpu-partitioning
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-rt"
    priority: 30
    profile: openshift-cpu-partitioning
EOF

$ oc create -f- <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-role.kubernetes.io/worker-rt
            operator: In
            values:
            - ""
  containers:
  - image: quay.io/jmencak/nginx
    imagePullPolicy: IfNotPresent
    name: guaranteed
    resources:
      limits:
        cpu: 1000m
        memory: 40Mi
      requests:
        cpu: 1000m
        memory: 40Mi
  restartPolicy: Never
EOF

$ oc describe po/guaranteed|grep ^QoS
QoS Class:       Guaranteed

sh-4.4# find /sys/fs/cgroup/cpuset/kubepods.slice -name cpuset.cpus -exec grep -H 0-3 {} \;|wc -l
65
sh-4.4# find /sys/fs/cgroup/cpuset/kubepods.slice -name cpuset.cpus -exec grep -H . {} \;|wc -l
133

sh-4.4# grep ^CPUAffinity /etc/systemd/system.conf
CPUAffinity=0 1

sh-4.4# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-7876c063e224728537b3948c3e103aec521da4bb3836ec0a69fbbfadddfedade/vmlinuz-4.18.0-193.23.1.el8_2.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ostree=/ostree/boot.0/rhcos/7876c063e224728537b3948c3e103aec521da4bb3836ec0a69fbbfadddfedade/0 ignition.platform.id=gcp skew_tick=1 nohz=on intel_pstate=disable nosoftlockup nohz_full=2-3 rcu_nocbs=2-3 tuned.non_isolcpus=00000003 systemd.cpu_affinity=0,1

sh-4.4# pidof nginx
1808607 1808606

sh-4.4# cat /proc/1808607/cpuset
/kubepods.slice/kubepods-podb907505b_9679_4c41_a9a0_a4b567f058b7.slice/crio-cab79ffbce842dd8f6f9c30d02c8afba2a8a40dbdb1c8f8a63027b4b38236fd0.scope
sh-4.4# cat kubepods-podb907505b_9679_4c41_a9a0_a4b567f058b7.slice/crio-cab79ffbce842dd8f6f9c30d02c8afba2a8a40dbdb1c8f8a63027b4b38236fd0.scope/cpuset.cpus
2
sh-4.4# cat /proc/1808606/cpuset
/kubepods.slice/kubepods-podb907505b_9679_4c41_a9a0_a4b567f058b7.slice/crio-cab79ffbce842dd8f6f9c30d02c8afba2a8a40dbdb1c8f8a63027b4b38236fd0.scope
sh-4.4# cat kubepods-podb907505b_9679_4c41_a9a0_a4b567f058b7.slice/crio-cab79ffbce842dd8f6f9c30d02c8afba2a8a40dbdb1c8f8a63027b4b38236fd0.scope/cpuset.cpus
2
sh-4.4# taskset -cp 1808607
pid 1808607's current affinity list: 2
sh-4.4# taskset -cp 1808606
pid 1808606's current affinity list: 2

$ oc get po -o wide|grep jmencak-rlkxh-worker-a-mrp87.c.openshift-gce-devel.internal
guaranteed    1/1   Running   0   12m   10.131.0.26   jmencak-rlkxh-worker-a-mrp87.c.openshift-gce-devel.internal   <none>   <none>
tuned-tdjg4   1/1   Running   0   24h   10.0.32.2     jmencak-rlkxh-worker-a-mrp87.c.openshift-gce-devel.internal   <none>   <none>

$ oc exec tuned-tdjg4 -- cat /etc/tuned/active_profile
openshift-cpu-partitioning

sh-4.4# ps ax -o args,pid,psr|grep '0$'|wc -l
107
sh-4.4# ps ax -o args,pid,psr|grep '1$'|wc -l
99
sh-4.4# ps ax -o args,pid,psr|grep '2$'|wc -l
68
sh-4.4# ps ax -o args,pid,psr|grep '2$'
[cpuhp/2] 21 2
[migration/2] 23 2
[ksoftirqd/2] 24 2
[kworker/2:0-mm_percpu_wq] 25 2
[kworker/2:0H-events_highpr 26 2
[tpm_dev_wq] 49 2
[kworker/2:1-mm_percpu_wq] 90 2
[acpi_thermal_pm] 172 2
[kstrp] 178 2
[iscsi_eh] 357 2
[cnic_wq] 539 2
[bnx2i_thread/2] 542 2
[scsi_eh_0] 700 2
[scsi_tmf_0] 701 2
[nvme-wq] 703 2
[nvme-reset-wq] 704 2
[nvme-delete-wq] 706 2
[xfsalloc] 815 2
[xfs_mru_cache] 817 2
[xfs-reclaim/dm-] 823 2
[xfs-log/dm-0] 824 2
[xfs-eofblocks/d] 825 2
[nfit] 1060 2
[ib-comp-wq] 1110 2
[ib-comp-unb-wq] 1113 2
[ib_mcast] 1114 2
[ib_nl_sa_wq] 1115 2
[rdma_cm] 1123 2
[iw_cxgb4] 1131 2
[Register_iWARP_] 1132 2
[ext4-rsv-conver] 1154 2
[target_completi] 1206 2
[xcopy_wq] 1209 2
[rpciod] 1227 2
[xprtiod] 1228 2
/usr/bin/pod 1677 2
/usr/bin/pod 1743 2
/usr/bin/pod 1773 2
/usr/bin/machine-config-dae 1843 2
/bin/bash -c #!/bin/bash se 2097 2
openshift-tuned -v=0 2490 2
/usr/bin/pod 3256 2
/usr/bin/kube-rbac-proxy -- 3345 2
/usr/bin/telemeter-client - 4802 2
/usr/bin/pod 4900 2
/usr/bin/pod 5019 2
/usr/bin/pod 5078 2
/usr/bin/pod 5309 2
/usr/bin/pod 5327 2
/usr/bin/pod 5534 2
/usr/bin/pod 5926 2
/usr/bin/dockerregistry 6209 2
/usr/bin/kube-rbac-proxy -- 6334 2
/usr/bin/kube-rbac-proxy -- 6528 2
/usr/bin/kube-rbac-proxy -- 6615 2
/usr/bin/openshift-state-me 7283 2
/usr/bin/kube-rbac-proxy -- 7847 2
/usr/bin/prom-kube-proxy -- 8286 2
/usr/bin/configmap-reload - 8397 2
/usr/bin/prom-kube-proxy -- 8479 2
/usr/bin/prom-kube-proxy -- 8956 2
/usr/bin/prom-kube-proxy -- 9096 2
[kworker/2:1H-events_highpr 9529 2
/usr/bin/pod 1725257 2
/bin/sh ./docker-entrypoint 1808554 2
nginx: master process /sbin 1808606 2
nginx: worker process 1808607 2

sh-4.4# ps ax -o pid,psr|grep '2$'|awk '{print $1}' > /root/cpu2-pid.txt
sh-4.4# for p in $(cat /root/cpu2-pid.txt); do n=$(ps h -p $p -o comm); t=$(taskset -cp $p); echo "$n $t" | grep -E '(2|[0-2]-3)$'; done
cpuhp/2 pid 21's current affinity list: 2
migration/2 pid 23's current affinity list: 2
ksoftirqd/2 pid 24's current affinity list: 2
kworker/2:0-mm_percpu_wq pid 25's current affinity list: 2
kworker/2:0H-events_highpri pid 26's current affinity list: 2
tpm_dev_wq pid 49's current affinity list: 0-3
kworker/2:1-mm_percpu_wq pid 90's current affinity list: 2
acpi_thermal_pm pid 172's current affinity list: 0-3
kstrp pid 178's current affinity list: 0-3
iscsi_eh pid 357's current affinity list: 0-3
cnic_wq pid 539's current affinity list: 0-3
bnx2i_thread/2 pid 542's current affinity list: 2
scsi_tmf_0 pid 701's current affinity list: 0-3
nvme-wq pid 703's current affinity list: 0-3
nvme-reset-wq pid 704's current affinity list: 0-3
nvme-delete-wq pid 706's current affinity list: 0-3
xfsalloc pid 815's current affinity list: 0-3
xfs_mru_cache pid 817's current affinity list: 0-3
xfs-reclaim/dm- pid 823's current affinity list: 0-3
xfs-log/dm-0 pid 824's current affinity list: 0-3
xfs-eofblocks/d pid 825's current affinity list: 0-3
nfit pid 1060's current affinity list: 0-3
ib-comp-wq pid 1110's current affinity list: 0-3
ib-comp-unb-wq pid 1113's current affinity list: 0-3
ib_mcast pid 1114's current affinity list: 0-3
ib_nl_sa_wq pid 1115's current affinity list: 0-3
rdma_cm pid 1123's current affinity list: 0-3
iw_cxgb4 pid 1131's current affinity list: 0-3
Register_iWARP_ pid 1132's current affinity list: 0-3
ext4-rsv-conver pid 1154's current affinity list: 0-3
target_completi pid 1206's current affinity list: 0-3
xcopy_wq pid 1209's current affinity list: 0-3
rpciod pid 1227's current affinity list: 0-3
xprtiod pid 1228's current affinity list: 0-3
pod pid 4900's current affinity list: 0-3
pod pid 5019's current affinity list: 0-3
pod pid 5078's current affinity list: 0-3
pod pid 5309's current affinity list: 0-3
pod pid 5327's current affinity list: 0-3
pod pid 5534's current affinity list: 0-3
pod pid 5926's current affinity list: 0-3
kworker/2:1H-events_highpri pid 9529's current affinity list: 2
pod pid 1725257's current affinity list: 0-3
docker-entrypoi pid 1808554's current affinity list: 2
nginx pid 1808606's current affinity list: 2
nginx pid 1808607's current affinity list: 2

sh-4.4# for p in $(cat /root/cpu2-pid.txt); do find -name tasks -exec grep -H "^$p\$" {} \; | sed 's|/tasks.*$|/cpuset.cpus|' | xargs grep -H . | grep -E '(2|[0-2]-3)$' ; done | sort -u
./kubepods-besteffort.slice/kubepods-besteffort-pod10652218_3876_4c58_9b4e_6891ef20d6aa.slice/crio-a4171c120d05a1b2d45dd66d54e951f90d5bd0554e92b8232f3798fbf493dafe.scope/cpuset.cpus:0-3
./kubepods-burstable.slice/kubepods-burstable-pod0376ac8c_faeb_4821_896a_563cfde3afee.slice/crio-4afa5b0cda818e675f4a280f7f8e9801f6f9a935f3ce66c7909a4e42cfef1754.scope/cpuset.cpus:0-3
./kubepods-burstable.slice/kubepods-burstable-pod0bd2cb08_4a2c_4e93_9808_c48140bb20e3.slice/crio-bfe377753a754d507475911a5b43c66f89dd698c0be5692fa13f63c9045cb583.scope/cpuset.cpus:0-3
./kubepods-burstable.slice/kubepods-burstable-pod2c6377a7_7933_426d_9ef1_31ccd268fe79.slice/crio-0c6277d177552a8a70f4591732d0ae54b6c0d39de0b81d673ebfb3631b49ab99.scope/cpuset.cpus:0-3
./kubepods-burstable.slice/kubepods-burstable-pod32b427b2_d829_4770_b2a5_e85e22cadd92.slice/crio-6f57519a9dc6001d374ea7f3e346a1297deafc834e1dbd0d5daee090b888bb43.scope/cpuset.cpus:0-3
./kubepods-burstable.slice/kubepods-burstable-pod33159fe7_08e6_42b2_bf4c_94313209fa9c.slice/crio-02bffdcfefa0fd0b0b8021297e36982ac5e89b9261e6dbc135d4ffc3ab746739.scope/cpuset.cpus:0-3
./kubepods-burstable.slice/kubepods-burstable-pod60b9ca4e_6120_488e_9419_b00d23d5b4f3.slice/crio-61641f1783fa22c1ec570f8c36502791913265e907bb99cdd91e5e701b443348.scope/cpuset.cpus:0-3
./kubepods-burstable.slice/kubepods-burstable-pod7ffef005_403c_4836_95c6_1e33f343e869.slice/crio-59c57612c677dd435ce7dcd596beeff2b2bf955634b49880b32d3fe845dd25bb.scope/cpuset.cpus:0-3
./kubepods-burstable.slice/kubepods-burstable-pod8ddf21d1_45a8_4fc7_9545_6cbff9676149.slice/crio-be5becbd87291060a72a88d628b1eb6613e1c9cd47784773a25caa69a3d03e2a.scope/cpuset.cpus:0-3
./kubepods-burstable.slice/kubepods-burstable-pod925cda42_b225_4afd_9a51_70b833052ddc.slice/crio-36ca3f93bfd95645850a6529edec21b49308c4910806e964310ebd4b8e4632b8.scope/cpuset.cpus:0-3
./kubepods-burstable.slice/kubepods-burstable-podd080e476_6f8b_45cb_b700_f002e4508cdb.slice/crio-1f60f7646c0a7a62e5c1198640e8d2b01002d7c5b248e3efece100da361b8890.scope/cpuset.cpus:0-3
./kubepods-burstable.slice/kubepods-burstable-pode488d64b_0fe6_4b52_9df2_39fc0ea1e8b2.slice/crio-fca77f645c4d647fd63c40657f03517e4e73044b1470d7fc42b9ec8ceda7ac9a.scope/cpuset.cpus:0-3
./kubepods-podb907505b_9679_4c41_a9a0_a4b567f058b7.slice/crio-cab79ffbce842dd8f6f9c30d02c8afba2a8a40dbdb1c8f8a63027b4b38236fd0.scope/cpuset.cpus:2

The last line is our guaranteed container with nginx:

sh-4.4# cat ./kubepods-podb907505b_9679_4c41_a9a0_a4b567f058b7.slice/crio-cab79ffbce842dd8f6f9c30d02c8afba2a8a40dbdb1c8f8a63027b4b38236fd0.scope/tasks
1808554
1808606
1808607

I've checked all of the remaining cpuset.cpus above, and they all follow the same pattern:

# for t in $(cat /root/slice.txt); do cat $t; pid=$(head -n1 $t); pstree -p $pid -t; done

For example:

sh-4.4# cat ./kubepods-besteffort.slice/kubepods-besteffort-pod10652218_3876_4c58_9b4e_6891ef20d6aa.slice/crio-a4171c120d05a1b2d45dd66d54e951f90d5bd0554e92b8232f3798fbf493dafe.scope/tasks
1725257
1725273
1725274
1725275
1725276
1725277
1725279

sh-4.4# pstree -p 1725257 -t
pod(1725257)-+-{pod}(1725273)
             |-{pod}(1725274)
             |-{pod}(1725275)
             |-{pod}(1725276)
             |-{pod}(1725277)
             `-{pod}(1725279)

I wasn't able to find any active workload running on the isolated CPU2 apart from the guaranteed workload (nginx).
Thanks for trying this in your environment.

Can you tell me if 4.5.6 has any known limitations, or whether the changes you tried are available on 4.5.6 as well? Especially this section; can I try it as-is? I usually use a MachineConfig to make kernel arg changes. Thanks in advance for confirming.

[bootloader]
cmdline_ocp_cpu_partitioning=+systemd.cpu_affinity=${not_isolated_cores_expanded}

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.6     True        False         29d
Tried this:

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: cpu-partition
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift CPU profile
      include=openshift-node,cpu-partitioning

      [variables]
      # isolated_cores take a list of ranges; e.g. isolated_cores=2,4-7
      isolated_cores="1-17,19-35,37-53,55-71"

      [bootloader]
      cmdline_ocp_cpu_partitioning=+systemd.cpu_affinity=${not_isolated_cores_expanded}
    name: cpu-partitioning
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: worker
    priority: 30
    profile: cpu-partitioning

Tuned logs:

I1023 14:53:09.892100 3698 tuned.go:271] extracting tuned profiles
I1023 14:53:10.019864 3698 tuned.go:305] recommended tuned profile cpu-partitioning content changed
I1023 14:53:10.020369 3698 tuned.go:271] extracting tuned profiles
I1023 14:53:10.136854 3698 tuned.go:305] recommended tuned profile cpu-partitioning content changed
I1023 14:53:10.337344 3698 tuned.go:429] reloading tuned...
I1023 14:53:10.337357 3698 tuned.go:432] sending HUP to PID 7426
2020-10-23 14:53:10,337 INFO tuned.daemon.daemon: stopping tuning
2020-10-23 14:53:10,902 INFO tuned.daemon.daemon: terminating Tuned, rolling back all changes
2020-10-23 14:53:10,903 INFO tuned.plugins.plugin_bootloader: removing grub2 tuning previously added by Tuned
2020-10-23 14:53:10,903 INFO tuned.plugins.plugin_bootloader: cannot find grub.cfg to patch
2020-10-23 14:53:10,908 ERROR tuned.utils.commands: Executing grub2-editenv error: [Errno 2] No such file or directory
2020-10-23 14:53:10,908 WARNING tuned.plugins.plugin_bootloader: cannot update grubenv: ''
2020-10-23 14:53:10,908 INFO tuned.plugins.plugin_script: calling script '/usr/lib/tuned/cpu-partitioning/script.sh' with arguments '['stop', 'full_rollback']'
I1023 14:53:11.343658 3698 tuned.go:602] updated Profile auh-akb-nec-rk03-cisco-oc-worker02 with bootcmdline:
I1023 14:53:11.344038 3698 tuned.go:349] written "/etc/tuned/recommend.d/50-openshift.conf" to set tuned profile cpu-partitioning
2020-10-23 14:53:11,695 INFO tuned.plugins.plugin_systemd: removing 'CPUAffinity' systemd tuning previously added by Tuned
2020-10-23 14:53:11,695 CONSOLE tuned.plugins.plugin_systemd: you may need to manualy run 'dracut -f' to update the systemd configuration in initrd image

############ XXXXXXXXXXXXXX Skipped some logs about setting performance mode XXXXXXXXXXXXXX ##########

2020-10-23 14:53:12,607 INFO tuned.plugins.plugin_cpu: energy_perf_bias successfully set to 'performance' on cpu 'cpu22'
2020-10-23 14:53:12,608 INFO tuned.plugins.plugin_cpu: setting new cpu latency 2
2020-10-23 14:53:12,610 INFO tuned.plugins.plugin_sysctl: reapplying system sysctl
2020-10-23 14:53:12,630 INFO tuned.plugins.plugin_systemd: setting 'CPUAffinity' to '0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71' in the '/etc/systemd/system.conf'
2020-10-23 14:53:13,208 INFO tuned.plugins.plugin_script: calling script '/usr/lib/tuned/cpu-partitioning/script.sh' with arguments '['start']'
2020-10-23 14:53:14,054 INFO tuned.plugins.plugin_bootloader: installing additional boot command line parameters to grub2
2020-10-23 14:53:14,054 INFO tuned.plugins.plugin_bootloader: generating initrd image from directory '/tmp/tmp.ktO0gzuEFl'
2020-10-23 14:53:14,063 INFO tuned.plugins.plugin_bootloader: installing initrd image as '/boot/tuned-initrd.img'
2020-10-23 14:53:14,064 INFO tuned.plugins.plugin_bootloader: removing directory '/tmp/tmp.ktO0gzuEFl'
2020-10-23 14:53:14,119 INFO tuned.plugins.plugin_bootloader: cannot find grub.cfg to patch
2020-10-23 14:53:14,120 INFO tuned.daemon.daemon: static tuning from profile 'cpu-partitioning' applied
I1023 14:53:14.343832 3698 tuned.go:602] updated Profile auh-akb-nec-rk03-cisco-oc-worker02 with bootcmdline: skew_tick=1 nohz=on nohz_full="1-17,19-35,37-53,55-71" rcu_nocbs="1-17,19-35,37-53,55-71" tuned.non_isolcpus=000000ff,ffffffff,ffffffff intel_pstate=disable nosoftlockup systemd.cpu_affinity=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71
I1023 14:53:14.344086 3698 tuned.go:349] written "/etc/tuned/recommend.d/50-openshift.conf" to set tuned profile cpu-partitioning
I1023 14:53:15.446582 3698 tuned.go:530] active and recommended profile (cpu-partitioning) match; profile change will not trigger profile reload

Worker:
// these parameters are added through MachineConfig: default_hugepagesz=1G hugepagesz=1G isolcpus=1-17,19-35,37-53,55-71 hugepages=228 skew_tick=1 nohz=on nohz_full=1-17,19-35,37-53,55-71 rcu_nocbs=1-17,19-35,37-53,55-71 //
// intel_iommu=on iommu=pt added by the SR-IOV operator //

cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-76a449d5c5374a21595629fed75f2051af40f4108b358d9c52219c4a9aca3750/vmlinuz-4.18.0-193.14.3.el8_2.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal rd.luks.options=discard ostree=/ostree/boot.1/rhcos/76a449d5c5374a21595629fed75f2051af40f4108b358d9c52219c4a9aca3750/0 default_hugepagesz=1G hugepagesz=1G isolcpus=1-17,19-35,37-53,55-71 hugepages=228 skew_tick=1 nohz=on nohz_full=1-17,19-35,37-53,55-71 rcu_nocbs=1-17,19-35,37-53,55-71 nosoftlockup intel_pstate=disable tsc=nowatchdog intel_iommu=on iommu=pt

cat /etc/systemd/system.conf|grep -i cpu
#CPUAffinity=1 2
CPUAffinity=0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
(In reply to Ananth from comment #16)
> Can you tell me if 4.5.6 has any known limitation or the changes you have
> tried is available on 4.5.6 as well.
>
> Especially this section,can I try this as it is.I usually use machine config
> to make kernel args changes . Thanks in advance for confirming
>
> [bootloader]
> cmdline_ocp_cpu_partitioning=+systemd.cpu_affinity=${not_isolated_cores_expanded}

4.5.6 ships with systemd-239-31.el8_2.2.x86_64, which supports the systemd.cpu_affinity kernel command line parameter.
(In reply to Ananth from comment #17)
> Tried this:
>
> apiVersion: tuned.openshift.io/v1
> kind: Tuned
> metadata:
>   name: cpu-partition
>   namespace: openshift-cluster-node-tuning-operator
> spec:
>   profile:
>   - data: |
>       [main]
>       summary=Custom OpenShift CPU profile
>       include=openshift-node,cpu-partitioning
>       [variables]
>       # isolated_cores take a list of ranges; e.g. isolated_cores=2,4-7
>       isolated_cores="1-17,19-35,37-53,55-71"

The isolated_cores variable (in tuned) does not accept double quotes; could you please try without them? See the example above. I'll raise this with the tuned developers.
Thanks, that helped!

tuned logs:

updated Profile auh-akb-nec-rk03-cisco-oc-worker02 with bootcmdline: skew_tick=1 nohz=on nohz_full=1-17,19-35,37-53,55-71 rcu_nocbs=1-17,19-35,37-53,55-71 tuned.non_isolcpus=00400010,00040001 intel_pstate=disable nosoftlockup systemd.cpu_affinity=0,18,36,54
I1023 15:30:29.251940 4075 tuned.go:349] written "/etc/tuned/recommend.d/50-openshift.conf" to set tuned profile cpu-partitioning
I1023 15:30:30.334788 4075 tuned.go:530] active and recommended profile (cpu-partitioning) match; profile change will not trigger profile reload
^C

cat /etc/systemd/system.conf | grep -i ^cpu
CPUAffinity=0 18 36 54

The kernel args still show a different value, though; I will reboot and see if that changes. Can you tell me where CoreOS keeps its grub config files?

cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-76a449d5c5374a21595629fed75f2051af40f4108b358d9c52219c4a9aca3750/vmlinuz-4.18.0-193.14.3.el8_2.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal rd.luks.options=discard ostree=/ostree/boot.0/rhcos/76a449d5c5374a21595629fed75f2051af40f4108b358d9c52219c4a9aca3750/0 default_hugepagesz=1G hugepagesz=1G isolcpus=1-17,19-35,37-53,55-71 hugepages=228 skew_tick=1 nohz=on nohz_full=1-17,19-35,37-53,55-71 rcu_nocbs=1-17,19-35,37-53,55-71 nosoftlockup intel_pstate=disable intel_iommu=on iommu=pt tuned.non_isolcpus=000000ff,ffffffff,ffffffff systemd.cpu_affinity=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71
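The tuned.non_isolcpus value is a kernel cpumask: comma-separated 32-bit hex words, most significant word first. A small sketch (illustrative helper, not part of tuned) to decode such a mask into CPU numbers:

```python
def cpumask_to_cpus(mask):
    """Decode a kernel cpumask string (comma-separated 32-bit hex words,
    most significant word first) into a sorted list of CPU numbers."""
    value = int(mask.replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]

# The fixed mask from the tuned log matches systemd.cpu_affinity=0,18,36,54:
print(cpumask_to_cpus("00400010,00040001"))  # [0, 18, 36, 54]

# The stale mask still on the kernel command line covers all 72 CPUs,
# i.e. nothing is treated as isolated until the node is rebooted:
print(len(cpumask_to_cpus("000000ff,ffffffff,ffffffff")))  # 72
```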
Filed a BZ for the double-quote issue: https://bugzilla.redhat.com/show_bug.cgi?id=1891036

CoreOS uses BLS, and the entries are in /boot/loader/entries. Changing them by hand is likely not supportable, though; please check with support engineers.
Thanks, I just wanted to check whether isolcpus was added correctly to the kernel command line.

After the reboot, some of the parameters are duplicated (set by both machineconfig and tuned); I will update the machineconfig to remove the duplicate parameters.

cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-76a449d5c5374a21595629fed75f2051af40f4108b358d9c52219c4a9aca3750/vmlinuz-4.18.0-193.14.3.el8_2.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal rd.luks.options=discard ostree=/ostree/boot.1/rhcos/76a449d5c5374a21595629fed75f2051af40f4108b358d9c52219c4a9aca3750/0 default_hugepagesz=1G hugepagesz=1G isolcpus=1-17,19-35,37-53,55-71 hugepages=228 skew_tick=1 nohz=on nohz_full=1-17,19-35,37-53,55-71 rcu_nocbs=1-17,19-35,37-53,55-71 nosoftlockup intel_pstate=disable intel_iommu=on iommu=pt nohz_full=1-17,19-35,37-53,55-71 rcu_nocbs=1-17,19-35,37-53,55-71 tuned.non_isolcpus=00400010,00040001 systemd.cpu_affinity=0,18,36,54
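Duplicated parameters like these can be spotted mechanically. An illustrative sketch, using an abbreviated copy of the command line above (on a live node one would read /proc/cmdline instead):

```python
from collections import Counter

# Abbreviated tail of the kernel command line above, containing the duplicates.
cmdline = ("skew_tick=1 nohz=on nohz_full=1-17,19-35,37-53,55-71 "
           "rcu_nocbs=1-17,19-35,37-53,55-71 nosoftlockup intel_pstate=disable "
           "nohz_full=1-17,19-35,37-53,55-71 rcu_nocbs=1-17,19-35,37-53,55-71 "
           "tuned.non_isolcpus=00400010,00040001 systemd.cpu_affinity=0,18,36,54")

# Count each parameter name (the part before '='); anything seen twice is a duplicate.
counts = Counter(arg.split("=", 1)[0] for arg in cmdline.split())
duplicated = sorted(name for name, n in counts.items() if n > 1)
print(duplicated)  # ['nohz_full', 'rcu_nocbs']
```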
Ananth, could you please confirm that the suggested settings above work for you? I believe we are seeing a configuration problem rather than a bug here. Thanks!
Thanks for your support. After removing the quotes and rebooting the nodes, both the boot arguments and the containers are configured correctly.

Could you explain why the "conmon" process had the correct CPU affinity? Is it because the container's cgroup is managed by systemd, so for CPU affinity to apply to a container it has to be set at the cgroup level, and systemd itself must have the affinity set? For other host processes (kthreads and other system processes) it is not applied at the cgroup level.
Non-Kubernetes-managed system processes have their affinity set through systemd's CPUAffinity/systemd.cpu_affinity setting and via the tuned daemon in combination with its scheduler plugin.

Closing the BZ. If the solution does not work for you, please open a new bugzilla.