Description of problem:
Set up an openshift-sdn cluster and found one pod in Error status.

$ oc get pod -n openshift-network-diagnostics -o wide
NAME                                    READY   STATUS    RESTARTS   AGE    IP            NODE              NOMINATED NODE   READINESS GATES
network-check-source-75749bc6b4-ggwz2   1/1     Running   1          131m   10.131.0.18   compute-1         <none>           <none>
network-check-target-7prn7              0/1     Error     1          131m   10.128.0.5    control-plane-0   <none>           <none>
network-check-target-gjjh2              1/1     Running   1          123m   10.131.0.12   compute-1         <none>           <none>
network-check-target-gnxxw              1/1     Running   1          131m   10.129.0.21   control-plane-1   <none>           <none>
network-check-target-m8w7m              1/1     Running   1          131m   10.130.0.9    control-plane-2   <none>           <none>
network-check-target-qrc5z              1/1     Running   1          121m   10.128.2.2    compute-0         <none>           <none>

Describing the pod in Error status shows the following events:

  Normal   Scheduled            132m                 default-scheduler  Successfully assigned openshift-network-diagnostics/network-check-target-7prn7 to control-plane-0
  Warning  NetworkNotReady      132m (x9 over 132m)  kubelet            network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
  Normal   AddedInterface       132m                 multus             Add eth0 [10.128.0.2/23] from openshift-sdn
  Normal   Pulling              132m                 kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:73b7930bc2ce99c36902a8d7ee524c68432247b55489000a1d66ce8030078952"
  Normal   Pulled               132m                 kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:73b7930bc2ce99c36902a8d7ee524c68432247b55489000a1d66ce8030078952" in 8.386299402s
  Normal   Created              132m                 kubelet            Created container network-check-target-container
  Normal   Started              132m                 kubelet            Started container network-check-target-container
  Warning  NodeNotReady         91m                  node-controller    Node is not ready
  Normal   AddedInterface       90m                  multus             Add eth0 [10.128.0.5/23] from openshift-sdn
  Normal   Pulled               90m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:73b7930bc2ce99c36902a8d7ee524c68432247b55489000a1d66ce8030078952" already present on machine
  Normal   Created              90m                  kubelet            Created container network-check-target-container
  Normal   Started              90m                  kubelet            Started container network-check-target-container
  Warning  Preempting           89m                  kubelet            Preempted in order to admit critical pod
  Normal   Killing              89m                  kubelet            Stopping container network-check-target-container
  Warning  ExceededGracePeriod  89m                  kubelet            Container runtime did not kill the pod within specified grace period.
  Warning  FailedKillPod        88m (x2 over 89m)    kubelet            error killing pod: failed to "KillPodSandbox" for "077ff051-9c1f-4a48-8df9-1007bc104aa3" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_network-check-target-7prn7_openshift-network-diagnostics_077ff051-9c1f-4a48-8df9-1007bc104aa3_0(a73ed428742d847858f290bdf081ac7f10d1cd70dd97d4697e981ddcbfe5e95c): error removing pod openshift-network-diagnostics_network-check-target-7prn7 from CNI network \"multus-cni-network\": Multus: [openshift-network-diagnostics/network-check-target-7prn7]: error getting pod: an error on the server (\"\") has prevented the request from succeeding (get pods network-check-target-7prn7)"

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-24-203710

How reproducible:
not always

Steps to Reproduce:
1. Set up a cluster with openshift-sdn
2.
3.

Actual results:
One network-check-target pod (network-check-target-7prn7 on control-plane-0) is left in Error status.

Expected results:
All pods in openshift-network-diagnostics are Running.

Additional info:
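For reference, the events above come from the standard describe command (pod name taken from the listing above):

$ oc describe pod network-check-target-7prn7 -n openshift-network-diagnostics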
must-gather logs: http://file.apac.redhat.com/~zzhao/must-gather.local.6574334561935899688.tar.gz
Even though this issue does not happen every time and can be worked around by recreating the pod, I think it would affect the customer experience if it occurred.
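The workaround amounts to deleting the affected pod so that its controller recreates it, e.g. for the pod above:

$ oc delete pod network-check-target-7prn7 -n openshift-network-diagnostics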
******
From the logs, what most probably happened was that the cluster was running out of resources and pod "network-check-target-7prn7" had to be killed as a preemption measure. The deletion of the pod then was not successful since it exceeded the specified grace period (was it set to 0?):

$ omg get events -o wide | grep 7prn7
2h4m    Normal    Scheduled             pod/network-check-target-7prn7   Successfully assigned openshift-network-diagnostics/network-check-target-7prn7 to control-plane-0
2h4m    Warning   NetworkNotReady       pod/network-check-target-7prn7   network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
2h4m    Normal    AddedInterface        pod/network-check-target-7prn7   Add eth0 [10.128.0.2/23] from openshift-sdn
2h4m    Normal    Pulling               pod/network-check-target-7prn7   Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:73b7930bc2ce99c36902a8d7ee524c68432247b55489000a1d66ce8030078952"
2h3m    Normal    Pulled                pod/network-check-target-7prn7   Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:73b7930bc2ce99c36902a8d7ee524c68432247b55489000a1d66ce8030078952" in 8.386299402s
2h3m    Normal    Created               pod/network-check-target-7prn7   Created container network-check-target-container
2h3m    Normal    Started               pod/network-check-target-7prn7   Started container network-check-target-container
1h23m   Warning   NodeNotReady          pod/network-check-target-7prn7   Node is not ready
1h22m   Normal    AddedInterface        pod/network-check-target-7prn7   Add eth0 [10.128.0.5/23] from openshift-sdn
1h22m   Normal    Pulled                pod/network-check-target-7prn7   Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:73b7930bc2ce99c36902a8d7ee524c68432247b55489000a1d66ce8030078952" already present on machine
1h22m   Normal    Created               pod/network-check-target-7prn7   Created container network-check-target-container
1h22m   Normal    Started               pod/network-check-target-7prn7   Started container network-check-target-container
1h21m   Warning   Preempting            pod/network-check-target-7prn7   Preempted in order to admit critical pod
1h21m   Normal    Killing               pod/network-check-target-7prn7   Stopping container network-check-target-container
1h20m   Warning   FailedKillPod         pod/network-check-target-7prn7   error killing pod: failed to "KillPodSandbox" for "077ff051-9c1f-4a48-8df9-1007bc104aa3" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_network-check-target-7prn7_openshift-network-diagnostics_077ff051-9c1f-4a48-8df9-1007bc104aa3_0(a73ed428742d847858f290bdf081ac7f10d1cd70dd97d4697e981ddcbfe5e95c): error removing pod openshift-network-diagnostics_network-check-target-7prn7 from CNI network \"multus-cni-network\": Multus: [openshift-network-diagnostics/network-check-target-7prn7]: error getting pod: an error on the server (\"\") has prevented the request from succeeding (get pods network-check-target-7prn7)"
1h21m   Warning   ExceededGracePeriod   pod/network-check-target-7prn7   Container runtime did not kill the pod within specified grace period.
******

More specifically, the event happened at "2021-08-25T08:53:00Z" and SDN got the CNI_DEL a few seconds later (08:53:04.088948214Z).
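To answer the grace-period question above, one option (assuming the pod spec is present in the must-gather and that omg supports yaml output for it) would be something like:

$ omg get pod network-check-target-7prn7 -n openshift-network-diagnostics -o yaml | grep terminationGracePeriodSeconds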
It seems like multus had a glitch where it didn't get a response for "oc get pod", but SDN eventually got the request:

namespaces/openshift-sdn/pods/sdn-sgmlk/sdn/sdn/logs/current.log:221:2021-08-25T08:53:04.088948214Z I0825 08:53:04.088905 2514 pod.go:542] CNI_DEL openshift-network-diagnostics/network-check-target-7prn7

Given that the condition above is hard to reproduce and there's a workaround (recreating the affected pod), I'd close this BZ for now. If we ever run into the bug again, we can reopen the BZ and possibly give me access to a live cluster in order to debug further.
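(For anyone re-checking the must-gather, that SDN log line can be located from the must-gather root with something along the lines of:

$ grep -rn "CNI_DEL openshift-network-diagnostics/network-check-target-7prn7" namespaces/openshift-sdn/pods/
)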