After restarting all instances of the cluster, some pods are in an error status and the storage operator is not available:

> oc get pod -A | grep -iv -E "Running|Completed"
NAMESPACE                           NAME                                            READY   STATUS                       RESTARTS       AGE
openshift-cluster-csi-drivers       aws-ebs-csi-driver-controller-67d7785f4-qstw7   0/11    CreateContainerConfigError   11             94m
openshift-cluster-csi-drivers       aws-ebs-csi-driver-controller-67d7785f4-xpwsb   0/11    CreateContainerConfigError   15 (90m ago)   94m
openshift-dns-operator              dns-operator-7669b65fb7-kg2qk                   0/2     Error                        2              96m
openshift-machine-config-operator   machine-config-operator-68949bfd47-vxqcn        0/1     Error                        1              96m
openshift-machine-config-operator   machine-config-server-77wv6                     0/1     CreateContainerConfigError   1              93m

Both aws-ebs-csi-driver-controller pods are in CreateContainerConfigError, which causes the storage operator to be unavailable. Events from one of the pods (aws-ebs-csi-driver-controller-67d7785f4-xpwsb):

Events:
  Warning  Preempting  91m  kubelet  Preempted in order to admit critical pod
  Normal   Pulled      91m  kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b850e05e7b99e5f45c8ae2b321e42826d9ea9b92865500ea49b36bb1ddf3c2a0" already present on machine
  Warning  Failed      91m  kubelet  Error: cannot find volume "aws-credentials" to mount into container "csi-driver"
  Normal   Pulled      91m  kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:79ed2aa8d4c6bb63c813bead31c2b7da0f082506315e9b93f6a3bfbc1c44d940" already present on machine
  Warning  Failed      91m  kubelet  Error: cannot find volume "metrics-serving-cert" to mount into container "driver-kube-rbac-proxy"
  Normal   Pulled      91m  kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9143971d7c9550881adbf9604752a43b95ca3d38f091546c0ea342633cc03e7c" already present on machine
  Warning  Failed      91m  kubelet  Error: cannot find volume "socket-dir" to mount into container "csi-provisioner"
  Normal   Pulled      91m  kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:79ed2aa8d4c6bb63c813bead31c2b7da0f082506315e9b93f6a3bfbc1c44d940" already present on machine
  Warning  Failed      91m  kubelet  Error: cannot find volume "metrics-serving-cert" to mount into container "provisioner-kube-rbac-proxy"
  Normal   Pulled      91m  kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:430df9b440e4f9a18aaeab1f520d6dbbb4788d1d0bb72b97c7b1ee379a88eb86" already present on machine
  Warning  Failed      91m  kubelet  Error: cannot find volume "socket-dir" to mount into container "csi-attacher"
  Normal   Pulled      91m  kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bf8f0e9aecca2f1a313a4c7f4332e0a10cb41f4ac094d633dc4ec6f5d49d3282" already present on machine
  Warning  Failed      91m  kubelet  Error: cannot find volume "metrics-serving-cert" to mount into container "attacher-kube-rbac-proxy"
  Normal   Pulled      91m  kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8c329d1fd0553f6b5a8650b1e477ad3cbfa3236a1b6e5aa74b6583ab82922955" already present on machine
  Warning  Failed      91m  kubelet  Error: cannot find volume "socket-dir" to mount into container "csi-resizer"
  Normal   Pulled      91m  kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:79ed2aa8d4c6bb63c813bead31c2b7da0f082506315e9b93f6a3bfbc1c44d940" already present on machine
  Warning  Failed      91m  kubelet  Error: cannot find volume "metrics-serving-cert" to mount into container "resizer-kube-rbac-proxy"
  Normal   Pulled      91m  kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6d3fd86f1a1cf4ec3328b0cbf4a90b799ce8e43c8c4d54f0023619bfbdbd4f46" already present on machine
  Warning  Failed      91m  kubelet  Error: cannot find volume "kube-api-access-xdwpq" to mount into container "csi-snapshotter"
  Normal   Pulled      91m  kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:79ed2aa8d4c6bb63c813bead31c2b7da0f082506315e9b93f6a3bfbc1c44d940" already present on machine
  Normal   Pulled      91m  kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:79ed2aa8d4c6bb63c813bead31c2b7da0f082506315e9b93f6a3bfbc1c44d940" already present on machine
  Warning  Failed      91m (x2 over 91m)  kubelet  (combined from similar events): Error: cannot find volume "kube-api-access-xdwpq" to mount into container "csi-liveness-probe"

> oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.9.0-0.nightly-2021-08-25-010624   True        False         False      52m
baremetal                                  4.9.0-0.nightly-2021-08-25-010624   True        False         False      69m
cloud-controller-manager                   4.9.0-0.nightly-2021-08-25-010624   True        False         False      72m
cloud-credential                           4.9.0-0.nightly-2021-08-25-010624   True        False         False      72m
cluster-autoscaler                         4.9.0-0.nightly-2021-08-25-010624   True        False         False      69m
config-operator                            4.9.0-0.nightly-2021-08-25-010624   True        False         False      71m
console                                    4.9.0-0.nightly-2021-08-25-010624   True        False         False      14m
csi-snapshot-controller                    4.9.0-0.nightly-2021-08-25-010624   True        False         False      70m
dns                                        4.9.0-0.nightly-2021-08-25-010624   True        False         False      69m
etcd                                       4.9.0-0.nightly-2021-08-25-010624   True        False         False      69m
image-registry                             4.9.0-0.nightly-2021-08-25-010624   True        False         False      65m
ingress                                    4.9.0-0.nightly-2021-08-25-010624   True        False         False      14m
insights                                   4.9.0-0.nightly-2021-08-25-010624   True        False         False      64m
kube-apiserver                             4.9.0-0.nightly-2021-08-25-010624   True        False         False      67m
kube-controller-manager                    4.9.0-0.nightly-2021-08-25-010624   True        False         False      68m
kube-scheduler                             4.9.0-0.nightly-2021-08-25-010624   True        False         False      69m
kube-storage-version-migrator              4.9.0-0.nightly-2021-08-25-010624   True        False         False      70m
machine-api                                4.9.0-0.nightly-2021-08-25-010624   True        False         False      66m
machine-approver                           4.9.0-0.nightly-2021-08-25-010624   True        False         False      70m
machine-config                             4.9.0-0.nightly-2021-08-25-010624   True        False         False      68m
marketplace                                4.9.0-0.nightly-2021-08-25-010624   True        False         False      69m
monitoring                                 4.9.0-0.nightly-2021-08-25-010624   True        False         False      63m
network                                    4.9.0-0.nightly-2021-08-25-010624   True        False         False      71m
node-tuning                                4.9.0-0.nightly-2021-08-25-010624   True        False         False      70m
openshift-apiserver                        4.9.0-0.nightly-2021-08-25-010624   True        False         False      66m
openshift-controller-manager               4.9.0-0.nightly-2021-08-25-010624   True        False         False      66m
openshift-samples                          4.9.0-0.nightly-2021-08-25-010624   True        False         False      66m
operator-lifecycle-manager                 4.9.0-0.nightly-2021-08-25-010624   True        False         False      70m
operator-lifecycle-manager-catalog         4.9.0-0.nightly-2021-08-25-010624   True        False         False      70m
operator-lifecycle-manager-packageserver   4.9.0-0.nightly-2021-08-25-010624   True        False         False      66m
service-ca                                 4.9.0-0.nightly-2021-08-25-010624   True        False         False      70m
storage                                    4.9.0-0.nightly-2021-08-25-010624   False       True          False      12m     AWSEBSCSIDriverOperatorCRAvailable: AWSEBSDriverControllerServiceControllerAvailable: Waiting for Deployment

./oc describe storage
Name:         cluster
Namespace:
Labels:       <none>
Annotations:  include.release.openshift.io/ibm-cloud-managed: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
              release.openshift.io/create-only: true
API Version:  operator.openshift.io/v1
Kind:         Storage
Metadata:
  Creation Timestamp:  2021-08-25T07:40:37Z
  Generation:          1
  Managed Fields:
    API Version:  operator.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:include.release.openshift.io/ibm-cloud-managed:
          f:include.release.openshift.io/self-managed-high-availability:
          f:include.release.openshift.io/single-node-developer:
          f:release.openshift.io/create-only:
        f:ownerReferences:
          .:
          k:{"uid":"cb70338d-f4dd-421e-8ad2-d924fb0796f4"}:
      f:spec:
        .:
        f:logLevel:
        f:managementState:
        f:operatorLogLevel:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2021-08-25T07:40:37Z
    API Version:  operator.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:conditions:
        f:generations:
        f:readyReplicas:
    Manager:      cluster-storage-operator
    Operation:    Update
    Subresource:  status
    Time:         2021-08-25T07:42:56Z
  Owner References:
    API Version:     config.openshift.io/v1
    Kind:            ClusterVersion
    Name:            version
    UID:             cb70338d-f4dd-421e-8ad2-d924fb0796f4
  Resource Version:  41493
  UID:               21a2f7b8-96d3-4d8b-a0f3-9510e679e3f0
Spec:
  Log Level:           Normal
  Management State:    Managed
  Operator Log Level:  Normal
Status:
  Conditions:
    Last Transition Time:  2021-08-25T07:42:51Z
    Status:                False
    Type:                  ManagementStateDegraded
    Last Transition Time:  2021-08-25T07:42:51Z
    Status:                False
    Type:                  ConfigObservationDegraded
    Last Transition Time:  2021-08-25T07:42:51Z
    Status:                True
    Type:                  DefaultStorageClassControllerAvailable
    Last Transition Time:  2021-08-25T07:42:51Z
    Status:                False
    Type:                  DefaultStorageClassControllerProgressing
    Last Transition Time:  2021-08-25T07:42:51Z
    Reason:                AsExpected
    Status:                False
    Type:                  DefaultStorageClassControllerDegraded
    Last Transition Time:  2021-08-25T07:42:51Z
    Status:                True
    Type:                  SnapshotCRDControllerUpgradeable
    Last Transition Time:  2021-08-25T07:42:52Z
    Reason:                AsExpected
    Status:                False
    Type:                  VSphereProblemDetectorStarterDegraded
    Last Transition Time:  2021-08-25T07:42:52Z
    Reason:                AsExpected
    Status:                False
    Type:                  SnapshotCRDControllerDegraded
    Last Transition Time:  2021-08-25T07:42:55Z
    Reason:                AsExpected
    Status:                False
    Type:                  CSIDriverStarterDegraded
    Last Transition Time:  2021-08-25T08:40:55Z
    Message:               AWSEBSDriverControllerServiceControllerAvailable: Waiting for Deployment
    Reason:                AWSEBSDriverControllerServiceController_Deploying
    Status:                False
    Type:                  AWSEBSCSIDriverOperatorCRAvailable
    Last Transition Time:  2021-08-25T08:40:55Z
    Message:               AWSEBSDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods
    Reason:                AWSEBSDriverControllerServiceController_Deploying
    Status:                True
    Type:                  AWSEBSCSIDriverOperatorCRProgressing
    Last Transition Time:  2021-08-25T07:42:56Z
    Message:               All is well
    Reason:                AsExpected
    Status:                False
    Type:                  AWSEBSCSIDriverOperatorCRDegraded
    Last Transition Time:  2021-08-25T07:43:02Z
    Status:                False
    Type:                  AWSEBSProgressing
    Last Transition Time:  2021-08-25T07:42:56Z
    Reason:                AsExpected
    Status:                False
    Type:                  AWSEBSCSIDriverOperatorDegraded
    Last Transition Time:  2021-08-25T07:42:56Z
    Reason:                AsExpected
    Status:                False
    Type:                  AWSEBSCSIDriverOperatorDeploymentDegraded
    Last Transition Time:  2021-08-25T07:42:58Z
    Reason:                AsExpected
    Status:                False
    Type:                  AWSEBSCSIDriverOperatorStaticControllerDegraded
  Generations:
    Group:            apps
    Hash:
    Last Generation:  1
    Name:             aws-ebs-csi-driver-operator
    Namespace:        openshift-cluster-csi-drivers
    Resource:         deployments
    Group:            operator.openshift.io
    Hash:
    Last Generation:  1
    Name:             ebs.csi.aws.com
    Namespace:
    Resource:         clustercsidrivers
  Ready Replicas:     0
Events:               <none>

How reproducible:
3/3

Version:
4.9.0-0.nightly-2021-08-25-010624

Platform:
AWS

How to reproduce it (as minimally and precisely as possible)?
1. install a cluster on AWS
2. reboot all instances (one way to script this step is sketched below)
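For anyone reproducing this, a minimal shell sketch of one way to script step 2 with the AWS CLI. It assumes the instances still carry the installer's kubernetes.io/cluster/<infra-id> tag; the variable names are only illustrative.

$ # Look up the cluster's infra ID from the Infrastructure CR
$ INFRA_ID=$(oc get infrastructure cluster -o jsonpath='{.status.infrastructureName}')
$ # Collect all running EC2 instances tagged for this cluster
$ INSTANCE_IDS=$(aws ec2 describe-instances \
      --filters "Name=tag-key,Values=kubernetes.io/cluster/${INFRA_ID}" \
                "Name=instance-state-name,Values=running" \
      --query 'Reservations[].Instances[].InstanceId' --output text)
$ # Reboot every instance at once
$ aws ec2 reboot-instances --instance-ids ${INSTANCE_IDS}

After the reboot, rerunning the `oc get pod -A | grep -iv -E "Running|Completed"` and `oc get co` commands above should show whether the CSI driver controller pods recovered.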
Created attachment 1817432: kubelet_log
This looks similar to bug 1996184 that I'm investigating.
*** Bug 1996184 has been marked as a duplicate of this bug. ***
The pod won't start because some volumes can't be mounted into the containers. Taking the "metrics-serving-cert" volume as an example: this volume is backed by the "aws-ebs-csi-driver-controller-metrics-serving-cert" secret, and from the must-gather output we can see that it exists. However, the kubelet hasn't been able to fetch this secret from the API server:

Aug 25 07:43:15 ip-10-0-61-148 hyperkube[1368]: E0825 07:43:15.536690 1368 secret.go:195] Couldn't get secret openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-metrics-serving-cert: secret "aws-ebs-csi-driver-controller-metrics-serving-cert" not found
Aug 25 07:43:15 ip-10-0-61-148 hyperkube[1368]: E0825 07:43:15.536754 1368 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/secret/6a24646f-5aa1-496d-9f68-59c8ec17f864-metrics-serving-cert podName:6a24646f-5aa1-496d-9f68-59c8ec17f864 nodeName:}" failed. No retries permitted until 2021-08-25 07:43:23.536737498 +0000 UTC m=+101.211117505 (durationBeforeRetry 8s). Error: MountVolume.SetUp failed for volume "metrics-serving-cert" (UniqueName: "kubernetes.io/secret/6a24646f-5aa1-496d-9f68-59c8ec17f864-metrics-serving-cert") pod "aws-ebs-csi-driver-controller-67d7785f4-qstw7" (UID: "6a24646f-5aa1-496d-9f68-59c8ec17f864") : secret "aws-ebs-csi-driver-controller-metrics-serving-cert" not found

Assigning to Node team to help triage this.
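A sketch of how to run the same cross-check on a live cluster. The secret and volume names come from the events above; the node name is the one from this report and is only an example.

$ # Confirm the secret exists in the API
$ oc -n openshift-cluster-csi-drivers get secret aws-ebs-csi-driver-controller-metrics-serving-cert
$ # Check whether the kubelet on the affected node still reports it as "not found"
$ oc adm node-logs ip-10-0-61-148.us-east-2.compute.internal -u kubelet | grep metrics-serving-cert

If the secret exists but the kubelet journal keeps reporting "not found", the failure is on the kubelet side, as described above.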
Elana, any progress on this issue? Thanks.
I believe I have a minimal repro on a nightly AWS cluster:

ehashman@red-dot:~$ cat tmp/manifest.yml
kind: Pod
apiVersion: v1
metadata:
  name: mypod
spec:
  terminationGracePeriodSeconds: 10
  containers:
  - name: busybox
    image: k8s.gcr.io/e2e-test-images/nginx:1.14-2
    command: ["sh", "-c", "fallocate -l 110G file; sleep 100000"]
ehashman@red-dot:~$ oc apply -f tmp/manifest.yml
pod/mypod created
ehashman@red-dot:~$ oc get pod
NAME    READY   STATUS              RESTARTS   AGE
mypod   0/1     ContainerCreating   0          3s
ehashman@red-dot:~$ oc get pod
NAME    READY   STATUS    RESTARTS   AGE
mypod   1/1     Running   0          7s
ehashman@red-dot:~$ oc get pod
NAME    READY   STATUS    RESTARTS   AGE
mypod   1/1     Evicted   0          12s
ehashman@red-dot:~$ oc get pod
NAME    READY   STATUS   RESTARTS   AGE
mypod   0/1     Error    0          23s

Kubelet log from the node for the pod, after waiting for the node to become responsive (note: the pod name in my test run was "mypod2"):

Sep 29 21:44:59 ip-10-0-255-253 hyperkube[1388]: I0929 21:44:59.062780 1388 eviction_manager.go:350] "Eviction manager: must evict pod(s) to reclaim" resourceName="ephemeral-storage"
Sep 29 21:44:59 ip-10-0-255-253 hyperkube[1388]: I0929 21:44:59.063965 1388 eviction_manager.go:368] "Eviction manager: pods ranked for eviction" pods=[openshift-network-diagnostics/network-check-target-sxv6g openshift-console/downloads-86dbddddf-2wwrv openshift-monitoring/prometheus-k8s-0 openshift-monitoring/grafana-54d77c7589-s8lkc openshift-monitoring/telemeter-client-799587c847-gp2p8 openshift-monitoring/alertmanager-main-0 openshift-monitoring/thanos-querier-5ccd9c4865-ztgv8 openshift-ingress/router-default-7658c77c46-mg2jx openshift-monitoring/node-exporter-m89qf openshift-image-registry/node-ca-zdp5x openshift-monitoring/prometheus-adapter-bd96f9697-msc6w openshift-ingress-canary/ingress-canary-p5qlq openshift-sdn/sdn-9jcbh openshift-cluster-node-tuning-operator/tuned-hlhz9 openshift-multus/multus-additional-cni-plugins-ndfs6 openshift-multus/network-metrics-daemon-c2qqk openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-8g2f5 openshift-dns/dns-default-rzl29 openshift-machine-config-operator/machine-config-daemon-d4msj openshift-dns/node-resolver-tjg5h openshift-multus/multus-sphhl default/mypod2]
Sep 29 21:44:59 ip-10-0-255-253 hyperkube[1388]: I0929 21:44:59.108229 1388 kubelet.go:2092] "SyncLoop DELETE" source="api" pods=[openshift-network-diagnostics/network-check-target-sxv6g]
Sep 29 21:44:59 ip-10-0-255-253 hyperkube[1388]: I0929 21:44:59.111131 1388 kubelet.go:2086] "SyncLoop REMOVE" source="api" pods=[openshift-network-diagnostics/network-check-target-sxv6g]
Sep 29 21:44:59 ip-10-0-255-253 hyperkube[1388]: I0929 21:44:59.172057 1388 kubelet.go:2076] "SyncLoop ADD" source="api" pods=[openshift-network-diagnostics/network-check-target-xqqn8]
Sep 29 21:44:59 ip-10-0-255-253 hyperkube[1388]: I0929 21:44:59.172547 1388 eviction_manager.go:168] "Failed to admit pod to node" pod="openshift-network-diagnostics/network-check-target-xqqn8" nodeCondition=[DiskPressure]
Sep 29 21:44:59 ip-10-0-255-253 hyperkube[1388]: I0929 21:44:59.480474 1388 eviction_manager.go:576] "Eviction manager: pod is evicted successfully" pod="openshift-network-diagnostics/network-check-target-sxv6g"
Sep 29 21:44:59 ip-10-0-255-253 hyperkube[1388]: I0929 21:44:59.480519 1388 eviction_manager.go:199] "Eviction manager: pods evicted, waiting for pod to be cleaned up" pods=[openshift-network-diagnostics/network-check-target-sxv6g]
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:00.481736 1388 eviction_manager.go:411] "Eviction manager: pods successfully cleaned up" pods=[openshift-network-diagnostics/network-check-target-sxv6g]
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:00.491421 1388 eviction_manager.go:339] "Eviction manager: attempting to reclaim" resourceName="ephemeral-storage"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:00.491471 1388 container_gc.go:85] "Attempting to delete unused containers"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.504009 1388 kubelet_pods.go:1484] "Pod attempted illegal phase transition" pod="openshift-network-diagnostics/network-check-target-xqqn8" originalStatusPhase=Failed apiStatusPhase=Pending apiStatus="&PodStatus{Phase:Pending,Conditions:[]PodCondition{},Message:,Reason:,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[]ContainerStatus{ContainerStatus{Name:network-check-target-container,State:ContainerState{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,},Running:nil,Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:false,RestartCount:0,Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:da24fe6e3f9bf5bf398ea11d384c9ef4d5e964e827408f5e7eedffa3e67e2b26,ImageID:,ContainerID:,Started:nil,},},QOSClass:Burstable,InitContainerStatuses:[]ContainerStatus{},NominatedNodeName:,PodIPs:[]PodIP{},EphemeralContainerStatuses:[]ContainerStatus{},}"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:00.611130 1388 eviction_manager.go:350] "Eviction manager: must evict pod(s) to reclaim" resourceName="ephemeral-storage"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:00.611933 1388 eviction_manager.go:368] "Eviction manager: pods ranked for eviction" pods=[openshift-console/downloads-86dbddddf-2wwrv openshift-monitoring/prometheus-k8s-0 openshift-monitoring/grafana-54d77c7589-s8lkc openshift-monitoring/alertmanager-main-0 openshift-monitoring/thanos-querier-5ccd9c4865-ztgv8 openshift-monitoring/telemeter-client-799587c847-gp2p8 openshift-ingress/router-default-7658c77c46-mg2jx openshift-monitoring/node-exporter-m89qf openshift-image-registry/node-ca-zdp5x openshift-monitoring/prometheus-adapter-bd96f9697-msc6w openshift-ingress-canary/ingress-canary-p5qlq openshift-sdn/sdn-9jcbh openshift-cluster-node-tuning-operator/tuned-hlhz9 openshift-multus/multus-additional-cni-plugins-ndfs6 openshift-machine-config-operator/machine-config-daemon-d4msj openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-8g2f5 openshift-dns/dns-default-rzl29 openshift-multus/network-metrics-daemon-c2qqk openshift-multus/multus-sphhl openshift-dns/node-resolver-tjg5h default/mypod2]
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.611989 1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-console/downloads-86dbddddf-2wwrv"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612041 1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-monitoring/prometheus-k8s-0"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612068 1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-monitoring/grafana-54d77c7589-s8lkc"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612105 1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-monitoring/alertmanager-main-0"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612143 1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-monitoring/thanos-querier-5ccd9c4865-ztgv8"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612174 1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-monitoring/telemeter-client-799587c847-gp2p8"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612197 1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-ingress/router-default-7658c77c46-mg2jx"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612226 1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-monitoring/node-exporter-m89qf"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612251 1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-image-registry/node-ca-zdp5x"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612274 1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-monitoring/prometheus-adapter-bd96f9697-msc6w"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612298 1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-ingress-canary/ingress-canary-p5qlq"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612322 1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-sdn/sdn-9jcbh"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612345 1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-cluster-node-tuning-operator/tuned-hlhz9"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612365 1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-multus/multus-additional-cni-plugins-ndfs6"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612391 1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-machine-config-operator/machine-config-daemon-d4msj"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612421 1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-8g2f5"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612447 1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-dns/dns-default-rzl29"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612473 1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-multus/network-metrics-daemon-c2qqk"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612494 1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-multus/multus-sphhl"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612517 1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-dns/node-resolver-tjg5h"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:00.612864 1388 kuberuntime_container.go:720] "Killing container with a grace period" pod="default/mypod2" podUID=a66ad821-0748-46c7-b111-591e1cd86071 containerName="busybox" containerID="cri-o://15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c" gracePeriod=10
Sep 29 21:45:10 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:10.612977 1388 eviction_manager.go:574] "Eviction manager: pod failed to evict" err="timeout waiting to kill pod" pod="default/mypod2"
Sep 29 21:45:10 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:10.613019 1388 eviction_manager.go:199] "Eviction manager: pods evicted, waiting for pod to be cleaned up" pods=[default/mypod2]
Sep 29 21:45:11 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:11.539069 1388 generic.go:296] "Generic (PLEG): container finished" podID=a66ad821-0748-46c7-b111-591e1cd86071 containerID="15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c" exitCode=137
Sep 29 21:45:11 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:11.539137 1388 kubelet.go:2114] "SyncLoop (PLEG): event for pod" pod="default/mypod2" event=&{ID:a66ad821-0748-46c7-b111-591e1cd86071 Type:ContainerDied Data:15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c}
Sep 29 21:45:11 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:11.539187 1388 kubelet.go:2114] "SyncLoop (PLEG): event for pod" pod="default/mypod2" event=&{ID:a66ad821-0748-46c7-b111-591e1cd86071 Type:ContainerDied Data:fd70cc494c8ac718f881389f1c0e76741640f2641a18cbd25c2ba87d59b28bdb}
Sep 29 21:45:11 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:11.539231 1388 scope.go:110] "RemoveContainer" containerID="15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c"
Sep 29 21:45:11 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:11.567278 1388 scope.go:110] "RemoveContainer" containerID="15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c"
Sep 29 21:45:11 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:11.567810 1388 remote_runtime.go:334] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = could not find container \"15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c\": container with ID starting with 15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c not found: ID does not exist" containerID="15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c"
Sep 29 21:45:11 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:11.567858 1388 pod_container_deletor.go:52] "DeleteContainer returned error" containerID={Type:cri-o ID:15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c} err="failed to get container status \"15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c\": rpc error: code = NotFound desc = could not find container \"15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c\": container with ID starting with 15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c not found: ID does not exist"
Sep 29 21:45:11 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:11.613937 1388 eviction_manager.go:411] "Eviction manager: pods successfully cleaned up" pods=[default/mypod2]

Pod still shows up in API as in error state.
Events:
  Type     Reason               Age   From               Message
  ----     ------               ----  ----               -------
  Normal   Scheduled            23m   default-scheduler  Successfully assigned default/mypod2 to ip-10-0-255-253.us-east-2.compute.internal
  Normal   AddedInterface       23m   multus             Add eth0 [10.128.2.18/23] from openshift-sdn
  Normal   Pulling              23m   kubelet            Pulling image "k8s.gcr.io/e2e-test-images/nginx:1.14-2"
  Normal   Pulled               23m   kubelet            Successfully pulled image "k8s.gcr.io/e2e-test-images/nginx:1.14-2" in 2.639942044s
  Normal   Created              23m   kubelet            Created container busybox
  Normal   Started              23m   kubelet            Started container busybox
  Warning  Evicted              23m   kubelet            The node was low on resource: ephemeral-storage.
  Normal   Killing              23m   kubelet            Stopping container busybox
  Warning  ExceededGracePeriod  23m   kubelet            Container runtime did not kill the pod within specified grace period.

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-09-29T21:44:49Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-09-29T21:45:11Z"
    message: 'containers with unready status: [busybox]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-09-29T21:45:11Z"
    message: 'containers with unready status: [busybox]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-09-29T21:44:49Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c
    image: k8s.gcr.io/e2e-test-images/nginx:1.14-2
    imageID: k8s.gcr.io/e2e-test-images/nginx@sha256:13616070e3f29de4417eee434a8ef472221c9e51b3d037b5a6b46cef08eb7443
    lastState: {}
    name: busybox
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: cri-o://15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c
        exitCode: 137
        finishedAt: "2021-09-29T21:45:10Z"
        reason: Error
        startedAt: "2021-09-29T21:44:55Z"
  hostIP: 10.0.255.253
  phase: Running
  podIP: 10.128.2.18
  podIPs:
  - ip: 10.128.2.18
  qosClass: BestEffort
  startTime: "2021-09-29T21:44:49Z"
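A quick way to see the inconsistency in the status above (a sketch, using the pod name from this repro): the API should report phase Running even though the only container is terminated with exit code 137.

$ # Pod-level phase as stored in the API
$ oc get pod mypod2 -o jsonpath='{.status.phase}{"\n"}'
$ # Per-container state, which shows the terminated/137 entry
$ oc get pod mypod2 -o jsonpath='{.status.containerStatuses[0].state}{"\n"}'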
crio seems to think it removed the container ok:

sh-4.4# journalctl -u crio | grep 15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c
Sep 29 21:44:55 ip-10-0-255-253 crio[1356]: time="2021-09-29 21:44:55.223939634Z" level=info msg="Created container 15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c: default/mypod2/busybox" id=7281076f-39d3-4427-bf76-5d26d4536403 name=/runtime.v1alpha2.RuntimeService/CreateContainer
Sep 29 21:44:55 ip-10-0-255-253 crio[1356]: time="2021-09-29 21:44:55.224653116Z" level=info msg="Starting container: 15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c" id=478d040e-7c90-4746-9968-a6ceed5ce962 name=/runtime.v1alpha2.RuntimeService/StartContainer
Sep 29 21:44:55 ip-10-0-255-253 crio[1356]: time="2021-09-29 21:44:55.243331022Z" level=info msg="Started container" PID=53822 containerID=15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c description=default/mypod2/busybox id=478d040e-7c90-4746-9968-a6ceed5ce962 name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=fd70cc494c8ac718f881389f1c0e76741640f2641a18cbd25c2ba87d59b28bdb
Sep 29 21:45:00 ip-10-0-255-253 crio[1356]: time="2021-09-29 21:45:00.613225315Z" level=info msg="Stopping container: 15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c (timeout: 10s)" id=f99ed228-260d-4914-96d5-e0911fb8646f name=/runtime.v1alpha2.RuntimeService/StopContainer
Sep 29 21:45:10 ip-10-0-255-253 crio[1356]: time="2021-09-29 21:45:10.638622250Z" level=warning msg="Stopping container 15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c with stop signal timed out: timeout reached after 10 seconds waiting for container process to exit" id=f99ed228-260d-4914-96d5-e0911fb8646f name=/runtime.v1alpha2.RuntimeService/StopContainer
Sep 29 21:45:10 ip-10-0-255-253 crio[1356]: time="2021-09-29 21:45:10.850271824Z" level=info msg="Stopped container 15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c: default/mypod2/busybox" id=f99ed228-260d-4914-96d5-e0911fb8646f name=/runtime.v1alpha2.RuntimeService/StopContainer
Sep 29 21:45:11 ip-10-0-255-253 crio[1356]: time="2021-09-29 21:45:11.540558265Z" level=info msg="Removing container: 15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c" id=6920d056-d456-4df2-8a46-1b845f75ce76 name=/runtime.v1alpha2.RuntimeService/RemoveContainer
Sep 29 21:45:11 ip-10-0-255-253 crio[1356]: time="2021-09-29 21:45:11.566991841Z" level=info msg="Removed container 15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c: default/mypod2/busybox" id=6920d056-d456-4df2-8a46-1b845f75ce76 name=/runtime.v1alpha2.RuntimeService/RemoveContainer

So I think we can conclude we've narrowed this down to the kubelet.
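For anyone who wants to repeat this runtime-side check without a direct shell on the node, a sketch using oc debug (the node name is the one from this run; the container ID placeholder is whatever ID the kubelet reported):

$ # List all containers the runtime knows about and look for the repro container
$ oc debug node/ip-10-0-255-253.us-east-2.compute.internal -- chroot /host crictl ps -a | grep busybox
$ # Grep the CRI-O journal for the same container ID as above
$ oc debug node/ip-10-0-255-253.us-east-2.compute.internal -- chroot /host journalctl -u crio | grep <container-id>

If the runtime no longer lists the container while the API still reports the pod in Error, that again points at the kubelet's status sync rather than CRI-O.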
I believe I have a working fix in the linked WIP GitHub PR https://github.com/kubernetes/kubernetes/pull/105411, although it only addresses the kubelet side. kubectl will still display a status of "ContainerStatusUnknown" once the kubelet gets a chance to sync the final containerStatuses; we may want to post a follow-up PR for that display as well, in case it causes confusion.
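Once the fix lands, a minimal re-check of the repro might look like this (a sketch; it reuses the tmp/manifest.yml shown earlier and assumes the pod again gets evicted for ephemeral-storage). Per the comment above, the expectation is that the pod reaches a terminal state, which kubectl/oc may render as "ContainerStatusUnknown", instead of being stuck reporting Running with an Error container:

$ oc apply -f tmp/manifest.yml
$ # Give the eviction manager time to act
$ sleep 60
$ oc get pod mypod
$ oc get pod mypod -o jsonpath='{.status.phase}{"\n"}'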
*** Bug 1998193 has been marked as a duplicate of this bug. ***
Noting publicly that the Node team wants this fix for the bootstrap use case as well, which means we won't be able to mark this fully delivered/fixed until it lands in an RHCOS boot image bump to `openshift/installer`. I've added this BZ as a dependency to our pending 4.9 boot image bump BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2009653
Since this BZ has been retargeted at 4.10, I'm switching its bootimage bump dependency to the 4.10 tracker. I'm also moving it back to POST, since the CoreOS team process requires bootimage dependencies to be held in POST until the bootimage bump lands. If that clashes with your process too much, let us know and we can clone this to a tracker bug.
After some additional discussion on the CoreOS team, I've created a separate bug 2011960 to track this fix landing in an RHCOS bootimage, and I'm reverting the changes to this BZ that were prompted by trying to fit it into the bootimage bump process. Thanks for your patience, all.
Waiting for the new version of RHCOS (bug 2011960) to be ready; will then verify with the latest RHCOS build.
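A sketch of how one might confirm which RHCOS build the nodes actually booted before re-running the verification (the OS image is reported in the node status; <node-name> is a placeholder):

$ # OS-IMAGE column shows the RHCOS build per node
$ oc get nodes -o wide
$ # Or query a single node's reported OS image directly
$ oc get node <node-name> -o jsonpath='{.status.nodeInfo.osImage}{"\n"}'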
*** Bug 2008745 has been marked as a duplicate of this bug. ***
Tested the rc.7 bits; not hitting an issue like https://bugzilla.redhat.com/show_bug.cgi?id=2008745 anymore.
Verified: PASS. OCP version: 4.10.0-0.nightly-2021-10-12-203611
rhcos-49.84.202107010027-0 is still being used in 4.10; waiting for the new version of RHCOS (bug 2011960) and will verify again.
Verified: PASS. OCP version: 4.10.0-0.nightly-2021-10-23-225921; RHCOS: rhcos-410.84.202110140201-0-x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056