Description of problem:
Set up an openshift-sdn cluster and found one pod in Error status.

$ oc get pod -n openshift-network-diagnostics -o wide
NAME                                    READY   STATUS    RESTARTS   AGE    IP            NODE              NOMINATED NODE   READINESS GATES
network-check-source-75749bc6b4-ggwz2   1/1     Running   1          131m   10.131.0.18   compute-1         <none>           <none>
network-check-target-7prn7              0/1     Error     1          131m   10.128.0.5    control-plane-0   <none>           <none>
network-check-target-gjjh2              1/1     Running   1          123m   10.131.0.12   compute-1         <none>           <none>
network-check-target-gnxxw              1/1     Running   1          131m   10.129.0.21   control-plane-1   <none>           <none>
network-check-target-m8w7m              1/1     Running   1          131m   10.130.0.9    control-plane-2   <none>           <none>
network-check-target-qrc5z              1/1     Running   1          121m   10.128.2.2    compute-0         <none>           <none>

Describing the pod in Error status shows the following events:

  Normal   Scheduled            132m                 default-scheduler  Successfully assigned openshift-network-diagnostics/network-check-target-7prn7 to control-plane-0
  Warning  NetworkNotReady      132m (x9 over 132m)  kubelet            network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
  Normal   AddedInterface       132m                 multus             Add eth0 [10.128.0.2/23] from openshift-sdn
  Normal   Pulling              132m                 kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:73b7930bc2ce99c36902a8d7ee524c68432247b55489000a1d66ce8030078952"
  Normal   Pulled               132m                 kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:73b7930bc2ce99c36902a8d7ee524c68432247b55489000a1d66ce8030078952" in 8.386299402s
  Normal   Created              132m                 kubelet            Created container network-check-target-container
  Normal   Started              132m                 kubelet            Started container network-check-target-container
  Warning  NodeNotReady         91m                  node-controller    Node is not ready
  Normal   AddedInterface       90m                  multus             Add eth0 [10.128.0.5/23] from openshift-sdn
  Normal   Pulled               90m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:73b7930bc2ce99c36902a8d7ee524c68432247b55489000a1d66ce8030078952" already present on machine
  Normal   Created              90m                  kubelet            Created container network-check-target-container
  Normal   Started              90m                  kubelet            Started container network-check-target-container
  Warning  Preempting           89m                  kubelet            Preempted in order to admit critical pod
  Normal   Killing              89m                  kubelet            Stopping container network-check-target-container
  Warning  ExceededGracePeriod  89m                  kubelet            Container runtime did not kill the pod within specified grace period.
  Warning  FailedKillPod        88m (x2 over 89m)    kubelet            error killing pod: failed to "KillPodSandbox" for "077ff051-9c1f-4a48-8df9-1007bc104aa3" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_network-check-target-7prn7_openshift-network-diagnostics_077ff051-9c1f-4a48-8df9-1007bc104aa3_0(a73ed428742d847858f290bdf081ac7f10d1cd70dd97d4697e981ddcbfe5e95c): error removing pod openshift-network-diagnostics_network-check-target-7prn7 from CNI network \"multus-cni-network\": Multus: [openshift-network-diagnostics/network-check-target-7prn7]: error getting pod: an error on the server (\"\") has prevented the request from succeeding (get pods network-check-target-7prn7)"

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-24-203710

How reproducible:
not always

Steps to Reproduce:
1. Set up a cluster with openshift-sdn
2.
3.

Actual results:
One network-check-target pod (network-check-target-7prn7 on control-plane-0) is left in Error status.

Expected results:
All pods in openshift-network-diagnostics are Running.

Additional info:
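For reference, the events above come from the standard describe command (pod name taken from the listing above):

$ oc describe pod network-check-target-7prn7 -n openshift-network-diagnostics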
must-gather logs: http://file.apac.redhat.com/~zzhao/must-gather.local.6574334561935899688.tar.gz
Even though this issue does not happen every time and can be worked around by recreating the pod, I think it would affect the customer experience if it occurred.
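The workaround amounts to deleting the affected pod so that its controller recreates it, e.g. for the pod above:

$ oc delete pod network-check-target-7prn7 -n openshift-network-diagnostics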
******
From the logs, what most probably happened was that the cluster was running out of resources and pod "network-check-target-7prn7" had to be killed as a preemption measure. The deletion of the pod then was not successful since it exceeded the specified grace period (was it set to 0?):

$ omg get events -o wide | grep 7prn7
2h4m    Normal    Scheduled             pod/network-check-target-7prn7   Successfully assigned openshift-network-diagnostics/network-check-target-7prn7 to control-plane-0
2h4m    Warning   NetworkNotReady       pod/network-check-target-7prn7   network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
2h4m    Normal    AddedInterface        pod/network-check-target-7prn7   Add eth0 [10.128.0.2/23] from openshift-sdn
2h4m    Normal    Pulling               pod/network-check-target-7prn7   Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:73b7930bc2ce99c36902a8d7ee524c68432247b55489000a1d66ce8030078952"
2h3m    Normal    Pulled                pod/network-check-target-7prn7   Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:73b7930bc2ce99c36902a8d7ee524c68432247b55489000a1d66ce8030078952" in 8.386299402s
2h3m    Normal    Created               pod/network-check-target-7prn7   Created container network-check-target-container
2h3m    Normal    Started               pod/network-check-target-7prn7   Started container network-check-target-container
1h23m   Warning   NodeNotReady          pod/network-check-target-7prn7   Node is not ready
1h22m   Normal    AddedInterface        pod/network-check-target-7prn7   Add eth0 [10.128.0.5/23] from openshift-sdn
1h22m   Normal    Pulled                pod/network-check-target-7prn7   Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:73b7930bc2ce99c36902a8d7ee524c68432247b55489000a1d66ce8030078952" already present on machine
1h22m   Normal    Created               pod/network-check-target-7prn7   Created container network-check-target-container
1h22m   Normal    Started               pod/network-check-target-7prn7   Started container network-check-target-container
1h21m   Warning   Preempting            pod/network-check-target-7prn7   Preempted in order to admit critical pod
1h21m   Normal    Killing               pod/network-check-target-7prn7   Stopping container network-check-target-container
1h20m   Warning   FailedKillPod         pod/network-check-target-7prn7   error killing pod: failed to "KillPodSandbox" for "077ff051-9c1f-4a48-8df9-1007bc104aa3" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_network-check-target-7prn7_openshift-network-diagnostics_077ff051-9c1f-4a48-8df9-1007bc104aa3_0(a73ed428742d847858f290bdf081ac7f10d1cd70dd97d4697e981ddcbfe5e95c): error removing pod openshift-network-diagnostics_network-check-target-7prn7 from CNI network \"multus-cni-network\": Multus: [openshift-network-diagnostics/network-check-target-7prn7]: error getting pod: an error on the server (\"\") has prevented the request from succeeding (get pods network-check-target-7prn7)"
1h21m   Warning   ExceededGracePeriod   pod/network-check-target-7prn7   Container runtime did not kill the pod within specified grace period.
******

More specifically, the event happened at "2021-08-25T08:53:00Z" and SDN got the CNI_DEL a few seconds later (08:53:04.088948214Z).
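To answer the grace-period question above, one option (assuming the pod spec is present in the must-gather and that omg supports yaml output for it) would be something like:

$ omg get pod network-check-target-7prn7 -n openshift-network-diagnostics -o yaml | grep terminationGracePeriodSeconds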
It seems like multus had a glitch where it didn't get a response for "oc get pod", but SDN eventually got the request:

namespaces/openshift-sdn/pods/sdn-sgmlk/sdn/sdn/logs/current.log:221:2021-08-25T08:53:04.088948214Z I0825 08:53:04.088905 2514 pod.go:542] CNI_DEL openshift-network-diagnostics/network-check-target-7prn7

Given that the condition above is hard to reproduce and there's a workaround (recreating the affected pod), I'd close this BZ for now. If we ever run into the bug again, we can reopen the BZ and possibly give me access to a live cluster in order to debug further.
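(For anyone re-checking the must-gather, that SDN log line can be located from the must-gather root with something along the lines of:

$ grep -rn "CNI_DEL openshift-network-diagnostics/network-check-target-7prn7" namespaces/openshift-sdn/pods/
)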