Description of problem:
oc get machine -n openshift-machine-api build01-9hdwj-worker-us-east-1b-m5d4x-w4fp2 -o wide
NAME                                          PHASE     TYPE          REGION      ZONE         AGE   NODE                           PROVIDERID                              STATE
build01-9hdwj-worker-us-east-1b-m5d4x-w4fp2   Running   m5d.4xlarge   us-east-1   us-east-1b   15d   ip-10-0-146-117.ec2.internal   aws:///us-east-1b/i-0890eb78de6644a83   running
oc get node ip-10-0-146-117.ec2.internal -o wide
NAME                           STATUS   ROLES    AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
ip-10-0-146-117.ec2.internal   Ready    worker   15d   v1.17.1   10.0.146.117   <none>        Red Hat Enterprise Linux CoreOS 44.81.202004260825-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-8.dev.rhaos4.4.git5f5c5e4.el8
This is an m5d.4xlarge worker node from the CI build cluster.
oc get clusterversions.config.openshift.io
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0     True        False         9h      Cluster version is 4.4.0
Several pods on this node show this error in their pod description: Error: context deadline exceeded
Sometimes the retries succeed and the pod eventually comes up and runs.
I would like to confirm that this is expected behavior from kubelet and crio rather than a bug.
I will attach more files later.
AFAICT this is expected. This is kubelet and crio saying "we are taking a long time to create pods/containers!". If the pods eventually reconcile and become ready, then this is okay. If they don't, the node may be overcommitted.
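One rough way to check whether the pods on the node do eventually reconcile is to list them and filter out the ones that have settled. The snippet below is only a sketch: the helper name not_ready_pods is made up for illustration, it looks at the STATUS column only (so a pod that is Running but not Ready is not flagged), and it assumes the six-column layout that oc get pods --all-namespaces --no-headers prints.

```shell
# Hypothetical helper (not part of oc): reads the output of
#   oc get pods --all-namespaces --no-headers
# on stdin, where the columns are
#   NAMESPACE NAME READY STATUS RESTARTS AGE
# and prints "namespace/name status" for every pod whose STATUS
# is neither Running nor Completed.
not_ready_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Example usage against the node from this report (requires cluster access):
#   oc get pods --all-namespaces --no-headers \
#     --field-selector spec.nodeName=ip-10-0-146-117.ec2.internal | not_ready_pods
#
# To judge whether the node is overcommitted, compare requests with capacity:
#   oc describe node ip-10-0-146-117.ec2.internal   # see "Allocated resources"
#   oc adm top node ip-10-0-146-117.ec2.internal    # needs cluster metrics
```

If the same pods keep showing up in this list across repeated runs, they are not reconciling, and the node's allocated resources are worth a closer look.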