Bug 1977100 - Pod failed to start with message "set CPU load balancing: readdirent /proc/sys/kernel/sched_domain/cpu66/domain0: no such file or directory"
Summary: Pod failed to start with message "set CPU load balancing: readdirent /proc/sys/kernel/sched_domain/cpu66/domain0: no such file or directory"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Importance: medium low
Target Milestone: ---
Target Release: 4.11.0
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks: 2091524 2102803
 
Reported: 2021-06-28 21:56 UTC by Ian Miller
Modified: 2022-11-09 15:18 UTC (History)
21 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2091524 2102803 (view as bug list)
Environment:
Last Closed: 2022-08-10 10:36:52 UTC
Target Upstream Version:
Embargoed:


Attachments
Must gather (16.60 MB, application/gzip)
2021-06-28 21:56 UTC, Ian Miller


Links
System ID Private Priority Status Summary Last Updated
Github cri-o cri-o pull 5786 0 None Merged Bug 1977100: retry setting CPU load balancing 2022-06-01 15:44:04 UTC
Red Hat Issue Tracker OCPBUGS-3446 0 None None None 2022-11-09 15:18:53 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:37:19 UTC

Description Ian Miller 2021-06-28 21:56:08 UTC
Created attachment 1795591 [details]
Must gather


Description of problem:
Attempted to launch a guaranteed-CPU pod with the annotation "cpu-load-balancing.crio.io: disable". The pod (named "cyclictest", in the openshift-monitoring namespace) failed to start with the message:

Error: failed to run pre-start hook for container "container-perf-tools": set CPU load balancing: readdirent /proc/sys/kernel/sched_domain/cpu66/domain0: no such file or directory

Using oc debug node/.... after the fact, I validated that the path does exist on the node and that the kubelet and CRI-O configurations for workload partitioning were valid for the node:

sh-4.4# cat /etc/kubernetes/openshift-workload-pinning 
{                                
  "management": {
    "cpuset": "0-1,40-41"            
  }
}                                    

sh-4.4# cat /etc/crio/crio.conf.d/01-workload-partitioning 
[crio.runtime.workloads.management]  
activation_annotation = "target.workload.openshift.io/management"
annotation_prefix = "resources.workload.openshift.io"
resources = { "cpushares" = 0, "cpuset" = "0-1,40-41" }

sh-4.4# lscpu                                                                                                                                                                                                                                                 
Architecture:        x86_64                                                                                                                                                                                                                                   
CPU op-mode(s):      32-bit, 64-bit                                                                                                                                                                                                                           
Byte Order:          Little Endian                                                                                                                                                                                                                            
CPU(s):              80                                                                                                                                                                                                                                       
On-line CPU(s) list: 0-79                                                                                                                                                                                                                                     
Thread(s) per core:  2                                                                                                                                                                                                                                        
Core(s) per socket:  20                                                                                                                                                                                                                                       
Socket(s):           2                                                                                                                                                                                                                                        
NUMA node(s):        2
Vendor ID:           GenuineIntel                              
BIOS Vendor ID:      Intel         
CPU family:          6                          
Model:               85    
Model name:          Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
BIOS Model name:     Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
Stepping:            7                        
CPU MHz:             2800.000              
CPU max MHz:         3900.0000
CPU min MHz:         800.0000                                                                                                  
BogoMIPS:            4200.00                 
Virtualization:      VT-x         
L1d cache:           32K                              
L1i cache:           32K                     
L2 cache:            1024K       
L3 cache:            28160K
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities         


Version-Release number of selected component (if applicable): 4.8


How reproducible: only once


Steps to Reproduce:
1. oc apply -f pod.yaml


Actual results:
$ oc describe pod cyclictest                                                                 
Name:         cyclictest               
Namespace:    openshift-monitoring                          
Priority:     0                                                                                                                
Node:         master-1.cluster1.savanna.lab.eng.rdu2.redhat.com/10.1.190.13                                                    
Start Time:   Mon, 28 Jun 2021 12:35:56 -0400                                                                                  
Labels:       <none> 
Annotations:  cpu-load-balancing.crio.io: disable          
<snip>
Status:       Failed
<snip>
Events:                                                                                                                        
  Type     Reason          Age   From               Message
  ----     ------          ----  ----               -------
  Normal   Scheduled       15m   default-scheduler  Successfully assigned openshift-monitoring/cyclictest to master-1.cluster1.savanna.lab.eng.rdu2.redhat.com
  Normal   AddedInterface  15m   multus             Add eth0 [10.128.0.186/23] from openshift-sdn
  Normal   Pulling         15m   kubelet            Pulling image "quay.io/jianzzha/perf-tools"
  Normal   Pulled          15m   kubelet            Successfully pulled image "quay.io/jianzzha/perf-tools" in 1.50961756s
  Normal   Created         15m   kubelet            Created container container-perf-tools
  Warning  Failed          15m   kubelet            Error: failed to run pre-start hook for container "container-perf-tools": set CPU load balancing: readdirent /proc/sys/kernel/sched_domain/cpu66/domain0: no such file or directory


Expected results:
Pod successfully deploys.

Additional info:

Subsequent attempts to deploy this same pod were successful.

Comment 2 Peter Hunt 2021-08-05 16:42:26 UTC
Artyom can you PTAL?

Comment 3 Peter Hunt 2021-08-05 16:43:32 UTC
bugzilla is hard

Comment 4 Artyom 2021-08-08 08:56:47 UTC
Hi Ian, is this issue persistent in your environment, or did it happen only once?

Comment 5 Ian Miller 2021-08-25 12:14:29 UTC
Hi Artyom. It only occurred once and has not been seen since.

Comment 8 Tom Sweeney 2022-01-06 15:41:33 UTC
Not completed this sprint.

Comment 10 Artyom 2022-01-31 09:11:19 UTC
Ian, please feel free to re-open it if you encounter it again.

Comment 11 Dahir Osman 2022-04-04 16:09:24 UTC
Issue seen again as part of pod create/delete in a loop.

Name:         dpdk-testpmd-1
Namespace:    default
Priority:     0
Node:         cnfocto2.ptp.lab.eng.bos.redhat.com/10.16.231.12
Start Time:   Fri, 01 Apr 2022 17:45:51 -0400
Labels:       <none>
Annotations:  cpu-load-balancing.crio.io: disable
              cpu-quota.crio.io: disable
              irq-load-balancing.crio.io: disable
....
Status:       Failed
....
      Message:      failed to run pre-start hook for container "a0118c046214e70fdaa2c6216a429941e28737658423a225b5961611f415e5b4": set CPU load balancing: lstat /proc/sys/kernel/sched_domain/cpu22/domain1/flags: no such file or directory

Comment 13 Artyom 2022-04-05 08:24:57 UTC
It looks like the kernel recreates the sched_domain directories each time a process needs to be re-balanced (my speculation; I may be wrong). I can see this once I create a new pod:

1. Create a debug pod for the node and, inside it, run:
sh-4.4# stat -c '%y' /host/proc/sys/kernel/sched_domain/cpu2/
2022-04-05 08:17:44.067865761 +0000

2. exit

3. Create a new debug pod on the same node and run the command above again:
stat -c '%y' /host/proc/sys/kernel/sched_domain/cpu2/
2022-04-05 08:22:41.960812800 +0000

So in general we have a race in CRI-O between creating the pod and setting the sched_domain values.
Occurrences are pretty rare, but we should still think about a way to fix them.

Comment 21 Kir Kolyshkin 2022-05-31 17:10:08 UTC
A very recent attempt to fix this was in https://github.com/cri-o/cri-o/pull/5786, and it went into cri-o v1.24.0. From what I see, it should indeed fix this issue (or at least reduce the probability of it happening).

Since this bug is reported against openshift 4.8, I guess we need to backport the fix to cri-o v1.21. Will do.

Comment 22 Kir Kolyshkin 2022-05-31 23:39:33 UTC
1.21 backport: https://github.com/cri-o/cri-o/pull/5919

1.22 backport: https://github.com/cri-o/cri-o/pull/5920

1.23 backport: https://github.com/cri-o/cri-o/pull/5921

Comment 30 Sunil Choudhary 2022-06-17 10:46:00 UTC
% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-15-222801   True        False         47m     Cluster version is 4.11.0-0.nightly-2022-06-15-222801

% oc get nodes         
NAME                                       STATUS   ROLES           AGE   VERSION
ip-10-0-75-50.us-east-2.compute.internal   Ready    master,worker   64m   v1.24.0+cb71478

% oc debug node/ip-10-0-75-50.us-east-2.compute.internal
Warning: would violate PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
Starting pod/ip-10-0-75-50us-east-2computeinternal-debug ...
…

sh-4.4# cat /etc/kubernetes/openshift-workload-pinning 
{
  "management": {
    "cpuset": "0,1"
  }
}


sh-4.4# cat /etc/crio/crio.conf.d/01-workload-partitioning 
[crio.runtime.workloads.management]
activation_annotation = "target.workload.openshift.io/management"
annotation_prefix = "resources.workload.openshift.io"
resources = { "cpushares" = 0, "cpuset" = "0-1,10-12" }

% cat epod.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: twocontainers
  annotations:
    cpu-load-balancing.crio.io: disable
spec:
  containers:
  - name: sise
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:aa8d1daf3432d8dedc5c56d94aeb1f25301bce6ccd7d5406fb03a00be97374ad
    command:
      - "bin/bash"
      - "-c"
      - "sleep 10000"
  resources:
    limits:
      cpu: "500m"
      memory: "500Mi"
    requests:
      cpu: "400m"
      memory: "400Mi"


% oc create -f epod.yaml                                     
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "sise" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "sise" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "sise" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "sise" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
pod/twocontainers created

% oc get pods
NAME            READY   STATUS    RESTARTS   AGE
twocontainers   1/1     Running   0          3s

% oc describe pod twocontainers
Name:         twocontainers
Namespace:    default
Priority:     0
Node:         ip-10-0-75-50.us-east-2.compute.internal/10.0.75.50
Start Time:   Fri, 17 Jun 2022 16:12:58 +0530
Labels:       <none>
Annotations:  cpu-load-balancing.crio.io: disable
              k8s.ovn.org/pod-networks:
                {"default":{"ip_addresses":["10.128.0.46/23"],"mac_address":"0a:58:0a:80:00:2e","gateway_ips":["10.128.0.1"],"ip_address":"10.128.0.46/23"...
              k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "ovn-kubernetes",
                    "interface": "eth0",
                    "ips": [
                        "10.128.0.46"
                    ],
                    "mac": "0a:58:0a:80:00:2e",
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "ovn-kubernetes",
                    "interface": "eth0",
                    "ips": [
                        "10.128.0.46"
                    ],
                    "mac": "0a:58:0a:80:00:2e",
                    "default": true,
                    "dns": {}
                }]
Status:       Running
IP:           10.128.0.46
IPs:
  IP:  10.128.0.46
Containers:
  sise:
    Container ID:  cri-o://ef887311fad1f4d6b9d73ec399d7f7f40735b010ba241f79cd6637125d4a6fb0
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:aa8d1daf3432d8dedc5c56d94aeb1f25301bce6ccd7d5406fb03a00be97374ad
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:aa8d1daf3432d8dedc5c56d94aeb1f25301bce6ccd7d5406fb03a00be97374ad
    Port:          <none>
    Host Port:     <none>
    Command:
      bin/bash
      -c
      sleep 10000
    State:          Running
      Started:      Fri, 17 Jun 2022 16:13:00 +0530
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hp9j5 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  kube-api-access-hp9j5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason          Age   From               Message
  ----    ------          ----  ----               -------
  Normal  Scheduled       12s   default-scheduler  Successfully assigned default/twocontainers to ip-10-0-75-50.us-east-2.compute.internal by ip-10-0-75-50
  Normal  AddedInterface  10s   multus             Add eth0 [10.128.0.46/23] from ovn-kubernetes
  Normal  Pulled          10s   kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:aa8d1daf3432d8dedc5c56d94aeb1f25301bce6ccd7d5406fb03a00be97374ad" already present on machine
  Normal  Created         10s   kubelet            Created container sise
  Normal  Started         10s   kubelet            Started container sise

Comment 31 Kir Kolyshkin 2022-06-29 22:33:37 UTC
Peter, can you please take a look?

Comment 32 errata-xmlrpc 2022-08-10 10:36:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Comment 34 Tom Sweeney 2022-11-08 16:21:53 UTC
@pehunt please see the prior comment; did this make OCP v4.9?

