Bug 1830018

Summary: ocp4.4: Pod Error: context deadline exceeded
Product: OpenShift Container Platform Reporter: Hongkai Liu <hongkliu>
Component: RHCOS Assignee: Colin Walters <walters>
Status: CLOSED NOTABUG QA Contact: Michael Nguyen <mnguyen>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.4 CC: aos-bugs, bbreard, imcleod, jligon, jokerman, miabbott, nstielau, pehunt, walters, zyu
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-06-17 17:36:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Hongkai Liu 2020-04-30 17:48:04 UTC
Description of problem:

oc get machine -n openshift-machine-api build01-9hdwj-worker-us-east-1b-m5d4x-w4fp2 -o wide
NAME                                          PHASE     TYPE          REGION      ZONE         AGE   NODE                           PROVIDERID                              STATE
build01-9hdwj-worker-us-east-1b-m5d4x-w4fp2   Running   m5d.4xlarge   us-east-1   us-east-1b   15d   ip-10-0-146-117.ec2.internal   aws:///us-east-1b/i-0890eb78de6644a83   running

oc get node ip-10-0-146-117.ec2.internal -o wide
NAME                           STATUS   ROLES    AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
ip-10-0-146-117.ec2.internal   Ready    worker   15d   v1.17.1   10.0.146.117   <none>        Red Hat Enterprise Linux CoreOS 44.81.202004260825-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-8.dev.rhaos4.4.git5f5c5e4.el8

This is an m5d.4xlarge worker node from the CI build cluster.

oc get clusterversions.config.openshift.io
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0     True        False         9h      Cluster version is 4.4.0

Several pods on this node show this error (Error: context deadline exceeded) in their pod descriptions.
Sometimes the retries succeed and the pod is eventually up and running.

I would like to confirm that this is expected behavior from kubelet and crio rather than a bug.

I will attach more files later.

Comment 1 Peter Hunt 2020-04-30 17:51:48 UTC
AFAICT this is expected. This is kubelet and crio saying "we are taking a long time to create pods/containers!". If the pods eventually reconcile and become ready, then this is okay. If they don't, the node may be overcommitted.

Comment 7 Micah Abbott 2020-05-14 15:36:11 UTC
I think this was fixed with https://github.com/openshift/release/pull/8715?

Comment 10 Colin Walters 2020-06-17 17:36:12 UTC
I believe this is obsolete; CI isn't using this configuration anymore. Please reopen if that's not correct.