Description of problem:
While running scale testing on SNOs with 88 pods, many pods get stuck in the CreateContainerError state.

Version-Release number of selected component (if applicable):
4.18.0-305.30.1.rt7.102.el8_4.x86_64
cri-o://1.22.1-10.rhaos4.9.gitf1d2c6e.el8
OCP 4.9.17

SNO cluster with OpenShift SDN, kernel-rt enabled, and 4 CPUs reserved for housekeeping pods through a performance profile. The node has 128 GiB of available memory.

Started 88 Guaranteed QoS pods on the SNO (50m CPU requests and limits, 100Mi memory requests and limits set per container). Multiple pods got stuck in the CreateContainerError state. The issue eventually cleared itself, but it took over 3 hours for all the pods to reach the Running state.

Pod events for the failed pods look like this:

Events:
  Type     Reason          Age                   From               Message
  ----     ------          ----                  ----               -------
  Normal   Scheduled       82m                   default-scheduler  Successfully assigned boatload-25/boatload-25-1-boatload-c4897845b-zn47f to nchhabra-baremetal06
  Normal   AddedInterface  81m                   multus             Add eth0 [10.128.3.204/21] from openshift-sdn
  Warning  Failed          57m                   kubelet            Error: ImageInspectError
  Warning  Failed          35m                   kubelet            Error: Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded: error reserving ctr name k8s_boatload-1_boatload-25-1-boatload-c4897845b-zn47f_boatload-25_81361b94-64a5-4e75-9f81-0e055603d99b_13 for id 73bbd2dce6d74b6e754a1d9f2186291776c02c5a623a7f31fc781fda7243fdda: name is reserved
  Normal   Pulled          20m (x21 over 80m)    kubelet            Container image "quay.io/redhat-performance/test-gohttp-probe:v0.0.2" already present on machine
  Warning  Failed          8m57s (x23 over 78m)  kubelet            Error: context deadline exceeded
  Warning  InspectFailed   4m48s (x3 over 57m)   kubelet            Failed to inspect image "quay.io/redhat-performance/test-gohttp-probe:v0.0.2": rpc error: code = DeadlineExceeded desc = context deadline exceeded

Journalctl recorded errors such as the following for the failed pods:

Feb 09 17:31:00 nchhabra-baremetal06 crio[9134]: time="2022-02-09 17:31:00.776106307Z" level=warning msg="error reserving ctr name k8s_boatload-1_boatload-87-1-boatload-8844bc57b-9frcn_boatload-87_24f14f88-3a31-4bcf-a1c7-6dd6bd063b8c_51 for id 3e94a15b3dce98062ac85e91f1ef4b804f20e7038fbd023657c65e0d1d2da857: name is reserved"

How reproducible:
Unclear at the moment. Another newly deployed SNO (with the same versions and profile) ran the same tests without issues.

Steps to Reproduce:
1.
2.
3.

Actual results:
Pods stuck in the CreateContainerError state for an unacceptably long duration.

Expected results:
Pods run without errors.

Additional info:
Complete pod description:

[root@nchhabra-baremetal01 logs]# cat pod-describe-crio.log
Name:           boatload-25-1-boatload-c4897845b-zn47f
Namespace:      boatload-25
Priority:       0
Node:           nchhabra-baremetal06/10.95.147.199
Start Time:     Wed, 09 Feb 2022 08:34:01 -0600
Labels:         app=boatload-25-1
                pod-template-hash=c4897845b
Annotations:    k8s.v1.cni.cncf.io/network-status:
                  [{
                      "name": "openshift-sdn",
                      "interface": "eth0",
                      "ips": [
                          "10.128.3.204"
                      ],
                      "default": true,
                      "dns": {}
                  }]
                k8s.v1.cni.cncf.io/networks-status:
                  [{
                      "name": "openshift-sdn",
                      "interface": "eth0",
                      "ips": [
                          "10.128.3.204"
                      ],
                      "default": true,
                      "dns": {}
                  }]
                openshift.io/scc: restricted
Status:         Pending
IP:             10.128.3.204
IPs:
  IP:           10.128.3.204
Controlled By:  ReplicaSet/boatload-25-1-boatload-c4897845b
Containers:
  boatload-1:
    Container ID:
    Image:          quay.io/redhat-performance/test-gohttp-probe:v0.0.2
    Image ID:
    Port:           8000/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CreateContainerError
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     50m
      memory:  100Mi
    Requests:
      cpu:     50m
      memory:  100Mi
    Environment:
      PORT:                         8000
      LISTEN_DELAY_SECONDS:         0
      LIVENESS_DELAY_SECONDS:       0
      READINESS_DELAY_SECONDS:      0
      RESPONSE_DELAY_MILLISECONDS:  0
      LIVENESS_SUCCESS_MAX:         0
      READINESS_SUCCESS_MAX:        0
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hxzkq (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-hxzkq:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Guaranteed
Node-Selectors:              jetlag=true
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason          Age                   From               Message
  ----     ------          ----                  ----               -------
  Normal   Scheduled       82m                   default-scheduler  Successfully assigned boatload-25/boatload-25-1-boatload-c4897845b-zn47f to nchhabra-baremetal06
  Normal   AddedInterface  81m                   multus             Add eth0 [10.128.3.204/21] from openshift-sdn
  Warning  Failed          57m                   kubelet            Error: ImageInspectError
  Warning  Failed          35m                   kubelet            Error: Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded: error reserving ctr name k8s_boatload-1_boatload-25-1-boatload-c4897845b-zn47f_boatload-25_81361b94-64a5-4e75-9f81-0e055603d99b_13 for id 73bbd2dce6d74b6e754a1d9f2186291776c02c5a623a7f31fc781fda7243fdda: name is reserved
  Normal   Pulled          20m (x21 over 80m)    kubelet            Container image "quay.io/redhat-performance/test-gohttp-probe:v0.0.2" already present on machine
  Warning  Failed          8m57s (x23 over 78m)  kubelet            Error: context deadline exceeded
  Warning  InspectFailed   4m48s (x3 over 57m)   kubelet            Failed to inspect image "quay.io/redhat-performance/test-gohttp-probe:v0.0.2": rpc error: code = DeadlineExceeded desc = context deadline exceeded

List of pods in CreateContainerError state:

[root@nchhabra-baremetal01 logs]# oc get po -A --field-selector=status.phase=Pending | grep -iv running | grep -iv compl | grep boatload
boatload-25   boatload-25-1-boatload-c4897845b-zn47f    0/1   CreateContainerError   0   114m
boatload-26   boatload-26-1-boatload-6d798584c9-bf8rl   0/1   CreateContainerError   0   114m
boatload-28   boatload-28-1-boatload-7b7dcb4545-4x4vk   0/1   CreateContainerError   0   114m
boatload-30   boatload-30-1-boatload-7767fd88c4-fd4bb   0/1   CreateContainerError   0   114m
boatload-31   boatload-31-1-boatload-695c47dcf-f6djb    0/1   CreateContainerError   0   114m
boatload-32   boatload-32-1-boatload-6459557d48-n5d7f   0/1   CreateContainerError   0   114m
boatload-33   boatload-33-1-boatload-6cfb4f76b5-7hszq   0/1   CreateContainerError   0   114m
boatload-34   boatload-34-1-boatload-56484746d6-c58cw   0/1   CreateContainerError   0   114m
boatload-35   boatload-35-1-boatload-6bb8c9f8ff-ffbsg   0/1   CreateContainerError   0   114m
boatload-36   boatload-36-1-boatload-69f8bccb8d-xlnph   0/1   CreateContainerError   0   114m
boatload-38   boatload-38-1-boatload-84bb75dc94-949cb   0/1   CreateContainerError   0   114m
boatload-39   boatload-39-1-boatload-847644979f-rrdmv   0/1   CreateContainerError   0   114m
boatload-40   boatload-40-1-boatload-79dd786d55-d25tr   0/1   CreateContainerError   0   114m
boatload-42   boatload-42-1-boatload-5d7fc7f8c7-zgzn5   0/1   CreateContainerError   0   114m
boatload-43   boatload-43-1-boatload-5bb757dcd8-4twv7   0/1   CreateContainerError   0   114m
boatload-44   boatload-44-1-boatload-6f8bc64c6c-d5gzd   0/1   CreateContainerError   0   114m
boatload-45   boatload-45-1-boatload-6d99fdb6cd-87pn2   0/1   CreateContainerError   0   114m
boatload-46   boatload-46-1-boatload-66ff4c9d49-2p5dt   0/1   CreateContainerError   0   114m
boatload-47   boatload-47-1-boatload-55bdcfd8f6-xm7nw   0/1   CreateContainerError   0   114m
boatload-49   boatload-49-1-boatload-5bdf9966-8p86t     0/1   CreateContainerError   0   114m
boatload-50   boatload-50-1-boatload-6567bd4d64-p5cw2   0/1   CreateContainerError   0   114m
boatload-51   boatload-51-1-boatload-c7b7fdf-4grnz      0/1   CreateContainerError   0   114m
boatload-52   boatload-52-1-boatload-7dc4f6666-lg8d4    0/1   CreateContainerError   0   114m
boatload-53   boatload-53-1-boatload-585458f4c5-zfqtt   0/1   CreateContainerError   0   114m
boatload-54   boatload-54-1-boatload-78cfcb8cb4-7vs7n   0/1   CreateContainerError   0   114m
boatload-55   boatload-55-1-boatload-5bb87958b9-srnwj   0/1   CreateContainerError   0   114m
boatload-56   boatload-56-1-boatload-7cd8c474f4-xx5ww   0/1   CreateContainerError   0   114m
boatload-57   boatload-57-1-boatload-56db586d79-k9cdl   0/1   CreateContainerError   0   114m
boatload-58   boatload-58-1-boatload-874455f8d-rh54s    0/1   CreateContainerError   0   114m
boatload-59   boatload-59-1-boatload-fcd9ddfc7-4q6xl    0/1   CreateContainerError   0   114m
boatload-60   boatload-60-1-boatload-5966fc45bf-s6l8h   0/1   CreateContainerError   0   114m
boatload-61   boatload-61-1-boatload-6b6f86d964-nxd2d   0/1   CreateContainerError   0   114m
boatload-62   boatload-62-1-boatload-7f887885f9-ggq9r   0/1   CreateContainerError   0   114m
boatload-63   boatload-63-1-boatload-5575ff7c74-ltgw6   0/1   CreateContainerError   0   114m
boatload-64   boatload-64-1-boatload-849588d49-stzbv    0/1   CreateContainerError   0   114m
boatload-65   boatload-65-1-boatload-5b4475f744-mvmsp   0/1   CreateContainerError   0   114m
boatload-66   boatload-66-1-boatload-85c9798587-fptb4   0/1   CreateContainerError   0   114m
boatload-67   boatload-67-1-boatload-548c95f754-fqcpf   0/1   CreateContainerError   0   114m
boatload-69   boatload-69-1-boatload-5bc965f6-q5td6     0/1   CreateContainerError   0   114m
boatload-70   boatload-70-1-boatload-667b9cb8bb-pdbl7   0/1   CreateContainerError   0   114m
boatload-71   boatload-71-1-boatload-5dbcd84655-2tjx5   0/1   CreateContainerError   0   114m
boatload-72   boatload-72-1-boatload-8559b897b-6hqtv    0/1   CreateContainerError   0   114m
boatload-73   boatload-73-1-boatload-66886d5989-pwmwh   0/1   CreateContainerError   0   114m
boatload-74   boatload-74-1-boatload-6d4d649466-fpk5b   0/1   CreateContainerError   0   114m
boatload-75   boatload-75-1-boatload-77b6b67cf5-stj4h   0/1   CreateContainerError   0   114m
boatload-76   boatload-76-1-boatload-86b99b5b64-jsx7x   0/1   CreateContainerError   0   114m
boatload-77   boatload-77-1-boatload-8495bfbb6f-knxq4   0/1   CreateContainerError   0   114m
boatload-78   boatload-78-1-boatload-6f759c77db-mh8zw   0/1   CreateContainerError   0   114m
boatload-80   boatload-80-1-boatload-84fccd74f9-jfkrh   0/1   CreateContainerError   0   114m
boatload-81   boatload-81-1-boatload-57c7d66f5d-kxd5s   0/1   CreateContainerError   0   114m
boatload-82   boatload-82-1-boatload-75f657467c-r4z4d   0/1   CreateContainerError   0   114m
boatload-84   boatload-84-1-boatload-7f9c84b775-j56qg   0/1   CreateContainerError   0   114m
boatload-85   boatload-85-1-boatload-66599bffcd-6rwvs   0/1   CreateContainerError   0   114m
boatload-86   boatload-86-1-boatload-77bc986765-sqfph   0/1   CreateContainerError   0   114m
boatload-87   boatload-87-1-boatload-8844bc57b-9frcn    0/1   CreateContainerError   0   114m
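
The "name is reserved" warnings quoted in the description carry a trailing counter in the reserved container name (_13 in the pod event, _51 in the journal line), which appears to be the creation attempt number. As a rough aid for triage, the sketch below pulls that counter out of journal lines and reports the highest attempt seen per pod. It assumes the name layout visible in the logs above (k8s_<container>_<pod>_<namespace>_<pod UID>_<attempt>); the function and class names are hypothetical and not part of CRI-O or the kubelet.

```python
import re
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class ReservedName(NamedTuple):
    """Fields recovered from a 'name is reserved' warning (illustrative type)."""
    container: str
    pod: str
    namespace: str
    pod_uid: str
    attempt: int


# Assumed layout, taken from the log lines in this report:
#   k8s_<container>_<pod>_<namespace>_<pod UID>_<attempt>
# Container, pod and namespace names cannot contain underscores,
# so "_" is treated as an unambiguous field separator.
_RESERVED = re.compile(
    r"error reserving ctr name "
    r"k8s_(?P<container>[^_\s]+)"
    r"_(?P<pod>[^_\s]+)"
    r"_(?P<namespace>[^_\s]+)"
    r"_(?P<uid>[0-9a-f-]+)"
    r"_(?P<attempt>\d+)"
)


def parse_reserved_name(line: str) -> Optional[ReservedName]:
    """Return the parsed fields, or None if the line is not such a warning."""
    match = _RESERVED.search(line)
    if match is None:
        return None
    return ReservedName(
        container=match["container"],
        pod=match["pod"],
        namespace=match["namespace"],
        pod_uid=match["uid"],
        attempt=int(match["attempt"]),
    )


def max_attempts(lines: Iterable[str]) -> Dict[Tuple[str, str], int]:
    """Map (namespace, pod) to the highest attempt counter seen in the lines."""
    highest: Dict[Tuple[str, str], int] = {}
    for line in lines:
        parsed = parse_reserved_name(line)
        if parsed is None:
            continue
        key = (parsed.namespace, parsed.pod)
        highest[key] = max(highest.get(key, 0), parsed.attempt)
    return highest
```

Feeding it the output of journalctl for the crio unit (one line per list element) would give a quick view of which pods were retried most often, which may help correlate the retries with the 3-hour recovery window.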
Reopening after discussing with @pehunt. I was able to reproduce the issue with 170 pods after the custom crio binary was loaded on the node. Journalctl and crio routine logs have been collected and uploaded to the same location.
Hi @pehunt, I am seeing this issue across both OpenShift SDN and OVN SNOs. It is currently a blocker for me. Please let me know if you need any more logs. Thank you, Noreen
For posterity: we worked together offline and found an issue with the workload pods being put into their own cpuset, as is expected with SNO.
@pehunt any update on this? Thank you, Noreen
Hey Noreen, what was the result of checking whether the containers were being put into the correct cpuset? Last I heard, they were not.
Hi Peter, before running these tests I used workload pinning, which constrains all of the OCP overhead to the "reserved cores" as configured by PAO. This should ensure that all of the non-reserved cores of the server are available to the user for their workloads. I am trying to recall how it was discovered that the workload pods were utilizing reserved CPUs. Did you notice something in the logs to indicate this? Also, I don't recall noticing high CPU utilization on CRI-O or kubelet. If you would like, I can provide you access to the node so we can take a second look. Thank you, Noreen
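
One way to re-check the cpuset question on the node is to compare the CPUs a container process is allowed to run on with the reserved set. The sketch below is only an illustration: it reads the Cpus_allowed_list field from /proc/<pid>/status (a standard Linux field) and intersects it with a reserved CPU list. The helper names are hypothetical, and the reserved list has to be supplied by the caller, since the actual reserved CPU IDs are not stated in this report.

```python
from pathlib import Path
from typing import Set


def parse_cpu_list(spec: str) -> Set[int]:
    """Expand a kernel-style CPU list such as '0-3,8' into a set of CPU IDs."""
    cpus: Set[int] = set()
    for part in spec.strip().split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def allowed_cpus(pid: int, proc_root: str = "/proc") -> Set[int]:
    """Read the CPUs a process may run on from its /proc status file."""
    status = Path(proc_root) / str(pid) / "status"
    for line in status.read_text().splitlines():
        if line.startswith("Cpus_allowed_list:"):
            return parse_cpu_list(line.split(":", 1)[1])
    raise ValueError(f"no Cpus_allowed_list entry in {status}")


def reserved_overlap(allowed: Set[int], reserved: Set[int]) -> Set[int]:
    """Return the reserved CPUs that the process is also allowed to use."""
    return allowed & reserved
```

For example, with a hypothetical reserved set of "0-3", calling reserved_overlap(allowed_cpus(pid), parse_cpu_list("0-3")) for a workload container's PID would return an empty set if that process is kept off the reserved cores, and the shared CPU IDs otherwise.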
Sorry this fell through the cracks. Are you still working on this, Noreen?
I think we should close this one, and open a new one if similar issues pop up. It doesn't look like anyone is actively looking at it and since we've all moved on, I don't see a point in keeping it around.