Bug 2052789 - [OCP 4.9] SNO: Kubelet may be retrying requests that are timing out in CRI-O due to system load
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-02-10 03:45 UTC by Noreen
Modified: 2022-05-13 20:07 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-13 20:07:46 UTC
Target Upstream Version:
Embargoed:


Links
System  ID                     Private  Priority  Status  Summary                                       Last Updated
Github  cri-o cri-o pull 5602  0        None      open    [release-1.22] server: fix race with kubelet  2022-02-10 14:20:00 UTC

Description Noreen 2022-02-10 03:45:08 UTC
Description of problem:
While running scale testing on SNOs with 88 pods, many pods get stuck in the CreateContainerError state.


Version-Release number of selected component (if applicable):
4.18.0-305.30.1.rt7.102.el8_4.x86_64
cri-o://1.22.1-10.rhaos4.9.gitf1d2c6e.el8
OCP 4.9.17

SNO cluster with OpenShift SDN, kernel-rt enabled, and 4 CPUs reserved for housekeeping pods through a performance profile.
Node has 128 GiB of available memory.
Started 88 Guaranteed pods on the SNO (50m CPU requests and limits, 100Mi memory requests and limits set per container).
Multiple pods got stuck in the CreateContainerError state. The issue eventually cleared itself, but it took over 3 hours for all the pods to reach the Running state.
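
For reference, the sizing above can be checked with a short sketch. The classification function is a simplified restatement of the Kubernetes QoS rules, not the kubelet's code, and all helper names are illustrative. It assumes one container per pod, as in the pod description in this report, and it compares quantities as strings, so "50m" and "0.05" would not be treated as equal.

```python
def parse_cpu_millis(q: str) -> int:
    """Convert a CPU quantity such as '50m' or '2' to millicores."""
    return int(q[:-1]) if q.endswith("m") else int(float(q) * 1000)

def parse_mem_mib(q: str) -> float:
    """Convert a memory quantity with a Mi or Gi suffix to MiB."""
    if q.endswith("Mi"):
        return float(q[:-2])
    if q.endswith("Gi"):
        return float(q[:-2]) * 1024
    raise ValueError(f"unsupported memory quantity: {q}")

def qos_class(containers):
    """Simplified QoS classification for a list of container resource dicts."""
    any_set = False
    guaranteed = True
    for c in containers:
        req, lim = c.get("requests", {}), c.get("limits", {})
        if req or lim:
            any_set = True
        for res in ("cpu", "memory"):
            # Guaranteed needs a limit for both resources, with the request
            # equal to it (an unset request defaults to the limit).
            if res not in lim or req.get(res, lim[res]) != lim[res]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"

# Values taken from the report: 88 pods, 50m CPU and 100Mi memory each.
container = {"requests": {"cpu": "50m", "memory": "100Mi"},
             "limits": {"cpu": "50m", "memory": "100Mi"}}
pods = 88
total_cpu = pods * parse_cpu_millis(container["requests"]["cpu"])
total_mem = pods * parse_mem_mib(container["requests"]["memory"])
print(qos_class([container]))
print(f"{total_cpu}m CPU, {total_mem / 1024:.2f} GiB memory")
```

The aggregate request works out to 4400m CPU and roughly 8.59 GiB of memory, a small fraction of the node's 128 GiB.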

Pod events for the Failed pods look like this:
Events:
  Type     Reason          Age                   From               Message
  ----     ------          ----                  ----               -------
  Normal   Scheduled       82m                   default-scheduler  Successfully assigned boatload-25/boatload-25-1-boatload-c4897845b-zn47f to nchhabra-baremetal06                                                                         
  Normal   AddedInterface  81m                   multus             Add eth0 [10.128.3.204/21] from openshift-sdn
  Warning  Failed          57m                   kubelet            Error: ImageInspectError
  Warning  Failed          35m                   kubelet            Error: Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded: error reserving ctr name k8s_boatload-1_boatload-25-1-boatload-c4897845b-zn47f_boatload-25_81361b94-64a5-4e75-9f81-0e055603d99b_13 for id 73bbd2dce6d74b6e754a1d9f2186291776c02c5a623a7f31fc781fda7243fdda: name is reserved                                                                       
  Normal   Pulled          20m (x21 over 80m)    kubelet            Container image "quay.io/redhat-performance/test-gohttp-probe:v0.0.2" already present on machine                                                                         
  Warning  Failed          8m57s (x23 over 78m)  kubelet            Error: context deadline exceeded
  Warning  InspectFailed   4m48s (x3 over 57m)   kubelet            Failed to inspect image "quay.io/redhat-performance/test-gohttp-probe:v0.0.2": rpc error: code = DeadlineExceeded desc = context deadline exceeded 
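
The container name in the "error reserving ctr name" event can be split into its parts when correlating events with journal entries. This is a minimal sketch that assumes the usual k8s_<container>_<pod>_<namespace>_<pod UID>_<attempt> layout; `CtrName` and `parse_ctr_name` are illustrative names, not part of any tool.

```python
from typing import NamedTuple

class CtrName(NamedTuple):
    container: str
    pod: str
    namespace: str
    pod_uid: str
    attempt: int

def parse_ctr_name(name: str) -> CtrName:
    """Split a kubelet-style container name into its fields."""
    # Assumes no field contains an underscore, which holds for
    # Kubernetes object names and UIDs.
    parts = name.split("_")
    if len(parts) != 6 or parts[0] != "k8s":
        raise ValueError(f"unexpected container name: {name}")
    _, container, pod, namespace, uid, attempt = parts
    return CtrName(container, pod, namespace, uid, int(attempt))

name = ("k8s_boatload-1_boatload-25-1-boatload-c4897845b-zn47f_"
        "boatload-25_81361b94-64a5-4e75-9f81-0e055603d99b_13")
print(parse_ctr_name(name))
```

For the name above, the trailing field is 13, which would mean this was not the first create attempt for the container.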

Journalctl recorded errors such as the following for the failed pods:
Feb 09 17:31:00 nchhabra-baremetal06 crio[9134]: time="2022-02-09 17:31:00.776106307Z" level=warning msg="error reserving ctr name k8s_boatload-1_boatload-87-1-boatload-8844bc57b-9frcn_boatload-87_24f14f88-3a31-4bcf-a1c7-6dd6bd063b8c_51 for id 3e94a15b3dce98062ac85e91f1ef4b804f20e7038fbd023657c65e0d1d2da857: name is reserved"
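
One reading of the event message and this warning is that a retry arrives while an earlier create request for the same name is still in flight. The toy model below illustrates only that pattern. `NameRegistry` and its methods are hypothetical and are not CRI-O's implementation; the container name and IDs are made up, and the retry is assumed to get a fresh ID.

```python
class NameReservedError(Exception):
    pass

class NameRegistry:
    """Toy registry mapping a container name to the ID that reserved it."""

    def __init__(self):
        self._names = {}

    def reserve(self, name: str, ctr_id: str) -> None:
        owner = self._names.get(name)
        if owner is not None and owner != ctr_id:
            raise NameReservedError(
                f"error reserving ctr name {name} for id {ctr_id}: name is reserved")
        self._names[name] = ctr_id

    def release(self, name: str) -> None:
        self._names.pop(name, None)

registry = NameRegistry()
name = "k8s_example_pod_ns_uid_0"  # hypothetical name

# The first request reserves the name, then stalls (for example, under load).
registry.reserve(name, "id-1")

# The client times out and retries. The retry gets a new ID (assumption),
# but the name is still held by the first request.
try:
    registry.reserve(name, "id-2")
except NameReservedError as err:
    print(err)

# Once the first request finishes or is cleaned up, a retry can succeed.
registry.release(name)
registry.reserve(name, "id-3")
```

In this model the retries keep failing until the original request releases the name, which would match pods that recover on their own after a long delay.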

How reproducible:
Unclear at the moment. Another newly deployed SNO (with same versions and profile) ran the same tests without issues.

Steps to Reproduce:
1.
2.
3.

Actual results:
Pods stuck in CreateContainerError state for an unacceptably long duration

Expected results:
Pods to run without errors

Additional info:

complete pod description:

[root@nchhabra-baremetal01 logs]# cat pod-describe-crio.log                                                                                                                                                                                   
Name:         boatload-25-1-boatload-c4897845b-zn47f                                                                                                                                                                                          
Namespace:    boatload-25                                                                                                                                                                                                                     
Priority:     0                                                                                                                                                                                                                               
Node:         nchhabra-baremetal06/10.95.147.199                                                                                                                                                                                              
Start Time:   Wed, 09 Feb 2022 08:34:01 -0600                                                                                                                                                                                                 
Labels:       app=boatload-25-1                                                                                                                                                                                                               
              pod-template-hash=c4897845b                                                                                                                                                                                                     
Annotations:  k8s.v1.cni.cncf.io/network-status:                                                                                                                                                                                              
                [{                                                                                                                                                                                                                            
                    "name": "openshift-sdn",                                                                                                                                                                                                  
                    "interface": "eth0",                                                                                                                                                                                                      
                    "ips": [                                                                                                                                                                                                                  
                        "10.128.3.204"                                                                                                                                                                                                        
                    ],                                                                                                                                                                                                                        
                    "default": true,                                                                                                                                                                                                          
                    "dns": {}                                                                                                                                                                                                                 
                }]                                                                                                                                                                                                                            
              k8s.v1.cni.cncf.io/networks-status:                                                                                                                                                                                             
                [{                                                                                                                                                                                                                            
                    "name": "openshift-sdn",                                                                                                                                                                                                  
                    "interface": "eth0",                                                                                                                                                                                                      
                    "ips": [                                                                                                                                                                                                                  
                        "10.128.3.204"                                                                                                                                                                                                        
                    ],                                                                                                                                                                                                                        
                    "default": true,                                                                                                                                                                                                          
                    "dns": {}                                                                                                                                                                                                                 
                }]                                                                                                                                                                                                                            
              openshift.io/scc: restricted                                                                                                                                                                                                    
Status:       Pending                                                                                                                                                                                                                         
IP:           10.128.3.204                                                                                                                                                                                                                    
IPs:                                                                                                                                                                                                                                          
  IP:           10.128.3.204                                                                                                                                                                                                                  
Controlled By:  ReplicaSet/boatload-25-1-boatload-c4897845b                                                                                                                                                                                   
Containers:                                                                                                                                                                                                                                   
  boatload-1:                                                                                                                                                                                                                                 
    Container ID:                                                                                                                                                                                                                             
    Image:          quay.io/redhat-performance/test-gohttp-probe:v0.0.2                                                                                                                                                                       
    Image ID:                                                                                                                                                                                                                                 
    Port:           8000/TCP                                                                                                                                                                                                                  
    Host Port:      0/TCP                                                                                                                                                                                                                     
    State:          Waiting                                                                                                                                                                                                                   
      Reason:       CreateContainerError
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     50m
      memory:  100Mi
    Requests:
      cpu:     50m
      memory:  100Mi
    Environment:
      PORT:                         8000
      LISTEN_DELAY_SECONDS:         0
      LIVENESS_DELAY_SECONDS:       0
      READINESS_DELAY_SECONDS:      0
      RESPONSE_DELAY_MILLISECONDS:  0
      LIVENESS_SUCCESS_MAX:         0
      READINESS_SUCCESS_MAX:        0
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hxzkq (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-hxzkq:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Guaranteed
Node-Selectors:              jetlag=true
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason          Age                   From               Message
  ----     ------          ----                  ----               -------
  Normal   Scheduled       82m                   default-scheduler  Successfully assigned boatload-25/boatload-25-1-boatload-c4897845b-zn47f to nchhabra-baremetal06
  Normal   AddedInterface  81m                   multus             Add eth0 [10.128.3.204/21] from openshift-sdn
  Warning  Failed          57m                   kubelet            Error: ImageInspectError
  Warning  Failed          35m                   kubelet            Error: Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded: error reserving ctr name k8s_boatload-1_boatload-25-1-boatload-c4897845b-zn47f_boatload-25_81361b94-64a5-4e75-9f81-0e055603d99b_13 for id 73bbd2dce6d74b6e754a1d9f2186291776c02c5a623a7f31fc781fda7243fdda: name is reserved
  Normal   Pulled          20m (x21 over 80m)    kubelet            Container image "quay.io/redhat-performance/test-gohttp-probe:v0.0.2" already present on machine
  Warning  Failed          8m57s (x23 over 78m)  kubelet            Error: context deadline exceeded
  Warning  InspectFailed   4m48s (x3 over 57m)   kubelet            Failed to inspect image "quay.io/redhat-performance/test-gohttp-probe:v0.0.2": rpc error: code = DeadlineExceeded desc = context deadline exceeded

List of pods in CreateContainerError state:
[root@nchhabra-baremetal01 logs]# oc get po -A --field-selector=status.phase=Pending | grep -iv running | grep -iv compl | grep boatload                                                                                                      
boatload-25                            boatload-25-1-boatload-c4897845b-zn47f    0/1     CreateContainerError   0          114m                                                                                                               
boatload-26                            boatload-26-1-boatload-6d798584c9-bf8rl   0/1     CreateContainerError   0          114m                                                                                                               
boatload-28                            boatload-28-1-boatload-7b7dcb4545-4x4vk   0/1     CreateContainerError   0          114m                                                                                                               
boatload-30                            boatload-30-1-boatload-7767fd88c4-fd4bb   0/1     CreateContainerError   0          114m                                                                                                               
boatload-31                            boatload-31-1-boatload-695c47dcf-f6djb    0/1     CreateContainerError   0          114m                                                                                                               
boatload-32                            boatload-32-1-boatload-6459557d48-n5d7f   0/1     CreateContainerError   0          114m                                                                                                               
boatload-33                            boatload-33-1-boatload-6cfb4f76b5-7hszq   0/1     CreateContainerError   0          114m                                                                                                               
boatload-34                            boatload-34-1-boatload-56484746d6-c58cw   0/1     CreateContainerError   0          114m                                                                                                               
boatload-35                            boatload-35-1-boatload-6bb8c9f8ff-ffbsg   0/1     CreateContainerError   0          114m                                                                                                               
boatload-36                            boatload-36-1-boatload-69f8bccb8d-xlnph   0/1     CreateContainerError   0          114m                                                                                                               
boatload-38                            boatload-38-1-boatload-84bb75dc94-949cb   0/1     CreateContainerError   0          114m                                                                                                               
boatload-39                            boatload-39-1-boatload-847644979f-rrdmv   0/1     CreateContainerError   0          114m                                                                                                               
boatload-40                            boatload-40-1-boatload-79dd786d55-d25tr   0/1     CreateContainerError   0          114m                                                                                                               
boatload-42                            boatload-42-1-boatload-5d7fc7f8c7-zgzn5   0/1     CreateContainerError   0          114m                                                                                                               
boatload-43                            boatload-43-1-boatload-5bb757dcd8-4twv7   0/1     CreateContainerError   0          114m                                                                                                               
boatload-44                            boatload-44-1-boatload-6f8bc64c6c-d5gzd   0/1     CreateContainerError   0          114m                                                                                                               
boatload-45                            boatload-45-1-boatload-6d99fdb6cd-87pn2   0/1     CreateContainerError   0          114m                                                                                                               
boatload-46                            boatload-46-1-boatload-66ff4c9d49-2p5dt   0/1     CreateContainerError   0          114m                                                                                                               
boatload-47                            boatload-47-1-boatload-55bdcfd8f6-xm7nw   0/1     CreateContainerError   0          114m                                                                                                               
boatload-49                            boatload-49-1-boatload-5bdf9966-8p86t     0/1     CreateContainerError   0          114m                                                                                                               
boatload-50                            boatload-50-1-boatload-6567bd4d64-p5cw2   0/1     CreateContainerError   0          114m                                                                                                               
boatload-51                            boatload-51-1-boatload-c7b7fdf-4grnz      0/1     CreateContainerError   0          114m                                                                                                               
boatload-52                            boatload-52-1-boatload-7dc4f6666-lg8d4    0/1     CreateContainerError   0          114m                                                                                                               
boatload-53                            boatload-53-1-boatload-585458f4c5-zfqtt   0/1     CreateContainerError   0          114m                                                                                                               
boatload-54                            boatload-54-1-boatload-78cfcb8cb4-7vs7n   0/1     CreateContainerError   0          114m                                                                                                               
boatload-55                            boatload-55-1-boatload-5bb87958b9-srnwj   0/1     CreateContainerError   0          114m                                                                                                               
boatload-56                            boatload-56-1-boatload-7cd8c474f4-xx5ww   0/1     CreateContainerError   0          114m                                                                                                               
boatload-57                            boatload-57-1-boatload-56db586d79-k9cdl   0/1     CreateContainerError   0          114m                                                                                                               
boatload-58                            boatload-58-1-boatload-874455f8d-rh54s    0/1     CreateContainerError   0          114m                                                                                                               
boatload-59                            boatload-59-1-boatload-fcd9ddfc7-4q6xl    0/1     CreateContainerError   0          114m                                                                                                               
boatload-60                            boatload-60-1-boatload-5966fc45bf-s6l8h   0/1     CreateContainerError   0          114m                                                                                                               
boatload-61                            boatload-61-1-boatload-6b6f86d964-nxd2d   0/1     CreateContainerError   0          114m                                                                                                               
boatload-62                            boatload-62-1-boatload-7f887885f9-ggq9r   0/1     CreateContainerError   0          114m                                                                                                               
boatload-63                            boatload-63-1-boatload-5575ff7c74-ltgw6   0/1     CreateContainerError   0          114m                                                                                                               
boatload-64                            boatload-64-1-boatload-849588d49-stzbv    0/1     CreateContainerError   0          114m                                                                                                               
boatload-65                            boatload-65-1-boatload-5b4475f744-mvmsp   0/1     CreateContainerError   0          114m                                                                                                               
boatload-66                            boatload-66-1-boatload-85c9798587-fptb4   0/1     CreateContainerError   0          114m                                                                                                               
boatload-67                            boatload-67-1-boatload-548c95f754-fqcpf   0/1     CreateContainerError   0          114m                                                                                                               
boatload-69                            boatload-69-1-boatload-5bc965f6-q5td6     0/1     CreateContainerError   0          114m                                                                                                               
boatload-70                            boatload-70-1-boatload-667b9cb8bb-pdbl7   0/1     CreateContainerError   0          114m                                                                                                               
boatload-71                            boatload-71-1-boatload-5dbcd84655-2tjx5   0/1     CreateContainerError   0          114m                                                                                                               
boatload-72                            boatload-72-1-boatload-8559b897b-6hqtv    0/1     CreateContainerError   0          114m
boatload-73                            boatload-73-1-boatload-66886d5989-pwmwh   0/1     CreateContainerError   0          114m
boatload-74                            boatload-74-1-boatload-6d4d649466-fpk5b   0/1     CreateContainerError   0          114m
boatload-75                            boatload-75-1-boatload-77b6b67cf5-stj4h   0/1     CreateContainerError   0          114m
boatload-76                            boatload-76-1-boatload-86b99b5b64-jsx7x   0/1     CreateContainerError   0          114m
boatload-77                            boatload-77-1-boatload-8495bfbb6f-knxq4   0/1     CreateContainerError   0          114m
boatload-78                            boatload-78-1-boatload-6f759c77db-mh8zw   0/1     CreateContainerError   0          114m
boatload-80                            boatload-80-1-boatload-84fccd74f9-jfkrh   0/1     CreateContainerError   0          114m
boatload-81                            boatload-81-1-boatload-57c7d66f5d-kxd5s   0/1     CreateContainerError   0          114m
boatload-82                            boatload-82-1-boatload-75f657467c-r4z4d   0/1     CreateContainerError   0          114m
boatload-84                            boatload-84-1-boatload-7f9c84b775-j56qg   0/1     CreateContainerError   0          114m
boatload-85                            boatload-85-1-boatload-66599bffcd-6rwvs   0/1     CreateContainerError   0          114m
boatload-86                            boatload-86-1-boatload-77bc986765-sqfph   0/1     CreateContainerError   0          114m
boatload-87                            boatload-87-1-boatload-8844bc57b-9frcn    0/1     CreateContainerError   0          114m
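
A listing like the one above can be summarized per status rather than filtered with chained grep calls. This is a minimal sketch, assuming header-less rows in the NAMESPACE NAME READY STATUS RESTARTS AGE column order shown above; `count_by_status` is an illustrative helper, and the sample is the first two rows of the listing.

```python
from collections import Counter

def count_by_status(output: str) -> Counter:
    """Count pods per STATUS from `oc get po -A` style rows (no header line)."""
    counts = Counter()
    for line in output.splitlines():
        fields = line.split()
        # Columns: NAMESPACE NAME READY STATUS RESTARTS AGE
        if len(fields) >= 6:
            counts[fields[3]] += 1
    return counts

sample = """\
boatload-25   boatload-25-1-boatload-c4897845b-zn47f    0/1   CreateContainerError   0   114m
boatload-26   boatload-26-1-boatload-6d798584c9-bf8rl   0/1   CreateContainerError   0   114m
"""
print(count_by_status(sample))
```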

Comment 4 Noreen 2022-02-10 21:32:04 UTC
Reopening after discussing with @pehunt. I was able to reproduce the issue with 170 pods after the custom crio binary was loaded on the node. Journalctl and crio routine logs have been collected and uploaded to the same location.

Comment 5 Noreen 2022-02-11 15:27:18 UTC
Hi @pehunt, I am seeing this issue on both OpenShift SDN and OVN SNOs. It is currently a blocker for me.
Please let me know if you need any more logs.

Thank you,
Noreen

Comment 6 Peter Hunt 2022-02-14 20:03:03 UTC
For posterity: we worked together offline and found an issue with the workload pods being put into their own cpuset, as is expected with SNO.

Comment 7 Noreen 2022-02-24 20:11:25 UTC
@pehunt, any update on this?

Thank you,
Noreen

Comment 8 Peter Hunt 2022-02-24 21:34:24 UTC
Hey Noreen,

What was the result of checking whether the containers were being put into the correct cpuset? Last I heard, they were not.
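
One way to make the cpuset check concrete is to compare the CPU list a container is allowed to run on with the reserved set. The sketch below parses the kernel's CPU list format (for example `0-3,8`). The reserved and container values are hypothetical, since the report does not state them, and `parse_cpu_list` is an illustrative helper.

```python
def parse_cpu_list(spec: str) -> set:
    """Parse a CPU list such as '0-3,8,10-11' into a set of CPU ids."""
    cpus = set()
    for part in spec.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

reserved = parse_cpu_list("0-3")        # hypothetical reserved (housekeeping) CPUs
container = parse_cpu_list("0-3,8-11")  # hypothetical cpuset read for a workload container

overlap = sorted(reserved & container)
print("overlap with reserved CPUs:", overlap)
```

On a node, the container's value would typically come from the cpuset cgroup file (such as `cpuset.cpus`) or from `Cpus_allowed_list` in `/proc/<pid>/status`.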

Comment 9 Noreen 2022-03-01 02:50:38 UTC
Hi Peter,

Before running these tests, I used workload pinning, which allows all of the OCP overhead to be constrained to the "reserved cores" as configured by PAO. This should ensure that all of the non-reserved cores of the server are available to the user for their workloads.
I am trying to recall how it was discovered that the workload pods were utilizing reserved CPUs. Did you notice something in the logs to indicate this?
Also, I don't recall noticing high CPU utilization by CRI-O or the kubelet.
If you would like, I can provide you access to the node so we can take a second look.

Thank you,
Noreen

Comment 10 Peter Hunt 2022-04-01 18:45:25 UTC
Sorry, this fell through the cracks. Are you still working on this, Noreen?

Comment 14 Peter Hunt 2022-05-13 20:07:46 UTC
I think we should close this one, and open a new one if similar issues pop up. It doesn't look like anyone is actively looking at it and since we've all moved on, I don't see a point in keeping it around.

