Created attachment 1795591 [details]
Must gather

Description of problem:

Attempt to launch a guaranteed CPU pod with annotation "cpu-load-balancing.crio.io: disable". The pod (name "cyclictest" in the openshift-monitoring namespace) failed to start with the message:

Error: failed to run pre-start hook for container "container-perf-tools": set CPU load balancing: readdirent /proc/sys/kernel/sched_domain/cpu66/domain0: no such file or directory

Using oc debug node/.... after the fact, validated that the path does exist on the node and that the kubelet and crio configurations for workload partitioning were valid for the node:

sh-4.4# cat /etc/kubernetes/openshift-workload-pinning
{
  "management": {
    "cpuset": "0-1,40-41"
  }
}

sh-4.4# cat /etc/crio/crio.conf.d/01-workload-partitioning
[crio.runtime.workloads.management]
activation_annotation = "target.workload.openshift.io/management"
annotation_prefix = "resources.workload.openshift.io"
resources = { "cpushares" = 0, "cpuset" = "0-1,40-41" }

sh-4.4# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              80
On-line CPU(s) list: 0-79
Thread(s) per core:  2
Core(s) per socket:  20
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
BIOS Vendor ID:      Intel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
BIOS Model name:     Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
Stepping:            7
CPU MHz:             2800.000
CPU max MHz:         3900.0000
CPU min MHz:         800.0000
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            28160K
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

Version-Release number of selected component (if applicable):
4.8

How reproducible:
Only once.

Steps to Reproduce:
1. oc apply -f pod.yaml

Actual results:

$ oc describe pod cyclictest
Name:         cyclictest
Namespace:    openshift-monitoring
Priority:     0
Node:         master-1.cluster1.savanna.lab.eng.rdu2.redhat.com/10.1.190.13
Start Time:   Mon, 28 Jun 2021 12:35:56 -0400
Labels:       <none>
Annotations:  cpu-load-balancing.crio.io: disable
<snip>
Status:       Failed
<snip>
Events:
  Type     Reason          Age  From               Message
  ----     ------          ---  ----               -------
  Normal   Scheduled       15m  default-scheduler  Successfully assigned openshift-monitoring/cyclictest to master-1.cluster1.savanna.lab.eng.rdu2.redhat.com
  Normal   AddedInterface  15m  multus             Add eth0 [10.128.0.186/23] from openshift-sdn
  Normal   Pulling         15m  kubelet            Pulling image "quay.io/jianzzha/perf-tools"
  Normal   Pulled          15m  kubelet            Successfully pulled image "quay.io/jianzzha/perf-tools" in 1.50961756s
  Normal   Created         15m  kubelet            Created container container-perf-tools
  Warning  Failed          15m  kubelet            Error: failed to run pre-start hook for container "container-perf-tools": set CPU load balancing: readdirent /proc/sys/kernel/sched_domain/cpu66/domain0: no such file or directory

Expected results:
Pod successfully deploys.

Additional info:
Subsequent attempts to deploy this same pod were successful.
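For reference, the pod.yaml used above is not attached to this comment; a minimal manifest of the kind described (guaranteed QoS plus the load-balancing annotation) might look like the sketch below. The pod name, namespace, container name, and image are taken from the events above; the CPU and memory values are illustrative only, and depending on the cluster setup a high-performance RuntimeClass may also be needed for the annotation to be honored:

apiVersion: v1
kind: Pod
metadata:
  name: cyclictest
  namespace: openshift-monitoring
  annotations:
    # Asks the CRI-O pre-start hook to remove this pod's exclusive
    # CPUs from kernel scheduler load balancing.
    cpu-load-balancing.crio.io: disable
spec:
  containers:
  - name: container-perf-tools
    image: quay.io/jianzzha/perf-tools
    resources:
      # requests == limits with a whole-number CPU count gives
      # Guaranteed QoS, so the static CPU manager pins the
      # container to exclusive CPUs (values are hypothetical).
      limits:
        cpu: "2"
        memory: "512Mi"
      requests:
        cpu: "2"
        memory: "512Mi"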
Artyom can you PTAL?
bugzilla is hard
Hi Ian, is this issue persistent in your environment, or did it happen only once?
Hi Artyom. It only occurred once and has not been seen since.
Not completed this sprint.
Ian, please feel free to re-open it if you encounter it again.
Issue seen again as part of pod create/delete in a loop.

Name:         dpdk-testpmd-1
Namespace:    default
Priority:     0
Node:         cnfocto2.ptp.lab.eng.bos.redhat.com/10.16.231.12
Start Time:   Fri, 01 Apr 2022 17:45:51 -0400
Labels:       <none>
Annotations:  cpu-load-balancing.crio.io: disable
              cpu-quota.crio.io: disable
              irq-load-balancing.crio.io: disable
....
Status:       Failed
....
Message:      failed to run pre-start hook for container "a0118c046214e70fdaa2c6216a429941e28737658423a225b5961611f415e5b4": set CPU load balancing: lstat /proc/sys/kernel/sched_domain/cpu22/domain1/flags: no such file or directory
It looks like the kernel recreates the sched_domain directories each time a new process needs to be re-balanced (this is my speculation and I may be wrong), but I can see the directory change when I create a new pod:

1. Create a debug pod for the node and under it run:

sh-4.4# stat -c '%y' /host/proc/sys/kernel/sched_domain/cpu2/
2022-04-05 08:17:44.067865761 +0000

2. Exit.

3. Create a new debug pod on the same node and check again with the command above:

stat -c '%y' /host/proc/sys/kernel/sched_domain/cpu2/
2022-04-05 08:22:41.960812800 +0000

So in general we have a race inside CRI-O between creating the pod and setting the sched_domain values. Occurrences are pretty rare, but we should still think about a way to fix them.
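For anyone trying to catch this live, a rough shell loop (run from an oc debug pod on the node, which mounts the host filesystem at /host; cpu2 is just an example CPU) can show the directory being rebuilt while pods are created and deleted in a loop:

while true; do
  # %y prints the last modification time; a changing timestamp, or a
  # momentary stat failure, suggests the kernel rebuilt the sched_domain tree.
  stat -c '%y' /host/proc/sys/kernel/sched_domain/cpu2/ 2>/dev/null \
    || echo "cpu2 sched_domain momentarily missing"
  sleep 0.5
done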
A very recent attempt to fix this was https://github.com/cri-o/cri-o/pull/5786, which went into cri-o v1.24.0. From what I see, it should indeed fix this issue (or at least reduce the probability of it happening). Since this bug is reported against OpenShift 4.8, I guess we need to backport the fix to cri-o v1.21. Will do.
1.21 backport: https://github.com/cri-o/cri-o/pull/5919
1.22 backport: https://github.com/cri-o/cri-o/pull/5920
1.23 backport: https://github.com/cri-o/cri-o/pull/5921
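To check whether a given node is already running a cri-o build that carries one of these backports, something like the following should work (the node name is a placeholder):

oc debug node/<node-name> -- chroot /host crio version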
% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-15-222801   True        False         47m     Cluster version is 4.11.0-0.nightly-2022-06-15-222801

% oc get nodes
NAME                                       STATUS   ROLES           AGE   VERSION
ip-10-0-75-50.us-east-2.compute.internal   Ready    master,worker   64m   v1.24.0+cb71478

% oc debug node/ip-10-0-75-50.us-east-2.compute.internal
Warning: would violate PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
Starting pod/ip-10-0-75-50us-east-2computeinternal-debug ...
…

sh-4.4# cat /etc/kubernetes/openshift-workload-pinning
{
  "management": {
    "cpuset": "0,1"
  }
}

sh-4.4# cat /etc/crio/crio.conf.d/01-workload-partitioning
[crio.runtime.workloads.management]
activation_annotation = "target.workload.openshift.io/management"
annotation_prefix = "resources.workload.openshift.io"
resources = { "cpushares" = 0, "cpuset" = "0-1,10-12" }

% cat epod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: twocontainers
  annotations:
    cpu-load-balancing.crio.io: disable
spec:
  containers:
  - name: sise
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:aa8d1daf3432d8dedc5c56d94aeb1f25301bce6ccd7d5406fb03a00be97374ad
    command:
    - "bin/bash"
    - "-c"
    - "sleep 10000"
    resources:
      limits:
        cpu: "500m"
        memory: "500Mi"
      requests:
        cpu: "400m"
        memory: "400Mi"

% oc create -f epod.yaml
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "sise" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "sise" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "sise" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "sise" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
pod/twocontainers created

% oc get pods
NAME            READY   STATUS    RESTARTS   AGE
twocontainers   1/1     Running   0          3s

% oc describe pod twocontainers
Name:         twocontainers
Namespace:    default
Priority:     0
Node:         ip-10-0-75-50.us-east-2.compute.internal/10.0.75.50
Start Time:   Fri, 17 Jun 2022 16:12:58 +0530
Labels:       <none>
Annotations:  cpu-load-balancing.crio.io: disable
              k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["10.128.0.46/23"],"mac_address":"0a:58:0a:80:00:2e","gateway_ips":["10.128.0.1"],"ip_address":"10.128.0.46/23"...
              k8s.v1.cni.cncf.io/network-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "10.128.0.46" ], "mac": "0a:58:0a:80:00:2e", "default": true, "dns": {} }]
              k8s.v1.cni.cncf.io/networks-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "10.128.0.46" ], "mac": "0a:58:0a:80:00:2e", "default": true, "dns": {} }]
Status:       Running
IP:           10.128.0.46
IPs:
  IP:  10.128.0.46
Containers:
  sise:
    Container ID:  cri-o://ef887311fad1f4d6b9d73ec399d7f7f40735b010ba241f79cd6637125d4a6fb0
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:aa8d1daf3432d8dedc5c56d94aeb1f25301bce6ccd7d5406fb03a00be97374ad
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:aa8d1daf3432d8dedc5c56d94aeb1f25301bce6ccd7d5406fb03a00be97374ad
    Port:          <none>
    Host Port:     <none>
    Command:
      bin/bash
      -c
      sleep 10000
    State:          Running
      Started:      Fri, 17 Jun 2022 16:13:00 +0530
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hp9j5 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-hp9j5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason          Age  From               Message
  ----    ------          ---  ----               -------
  Normal  Scheduled       12s  default-scheduler  Successfully assigned default/twocontainers to ip-10-0-75-50.us-east-2.compute.internal by ip-10-0-75-50
  Normal  AddedInterface  10s  multus             Add eth0 [10.128.0.46/23] from ovn-kubernetes
  Normal  Pulled          10s  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:aa8d1daf3432d8dedc5c56d94aeb1f25301bce6ccd7d5406fb03a00be97374ad" already present on machine
  Normal  Created         10s  kubelet            Created container sise
  Normal  Started         10s  kubelet            Started container sise
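As an extra sanity check beyond the pod reaching Running, one could verify on the node that the annotation actually took effect. This relies on an assumption not shown in this bug: on kernels that expose /proc/sys/kernel/sched_domain, a CPU removed from load balancing should end up with no domain* subdirectories. <N> is a placeholder for one of the pod's pinned CPUs:

# From an oc debug session on the node (host mounted at /host):
ls /host/proc/sys/kernel/sched_domain/cpu<N>/
# An empty listing suggests cpu<N> was taken out of the scheduler domains.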
Peter, can you please take a look?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
@pehunt please see the prior comment. Did this fix make it into OCP v4.9?