Bug 1802481

Summary: Clusters of several matrix/versions hit ovs pod log "unix#... connection dropped (Connection reset by peer)"
Product: OpenShift Container Platform Reporter: Xingxing Xia <xxia>
Component: Networking Assignee: Alexander Constantinescu <aconstan>
Networking sub component: openshift-sdn QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: aconstan, andcosta, aos-bugs, bbennett, igreen, rhowe, sdodson, suchaudh, zzhao
Version: 4.4   
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1811140 Environment:
Last Closed: 2020-07-13 17:14:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1811140    
Attachments:
Description Flags
ovs-pods-logs none

Comment 9 Ben Bennett 2020-03-04 14:34:43 UTC
Moving to 4.5 since it hasn't reproduced in a week. If you hit it, please provide the details and move it back to 4.4.

Comment 21 Xingxing Xia 2020-03-17 09:39:21 UTC
Today I hit a failure that, after analysis, appears to be the same issue. I tried an IPI installation on GCP with 4.5.0-0.nightly-2020-03-17-011909: https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/85156/ (where the kubeconfig link is shown). The installation failed, as shown in the console log at https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/85156/console :
...
E0317 03:09:27.807942     561 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get https://api.xxia0317-3.qe.gcp.devcluster.openshift.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: dial tcp 146.148.47.225:6443: i/o timeout
level=error msg="Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::RouterCerts_NoRouterCertSecret: RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret \"v4-0-config-system-router-certs\" not found\nIngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server"
level=info msg="Cluster operator authentication Progressing is Unknown with NoData: "
level=info msg="Cluster operator authentication Available is Unknown with NoData: "
level=info msg="Cluster operator dns Progressing is True with Reconciling: Not all DNS DaemonSets available."
level=error msg="Cluster operator etcd Degraded is True with InstallerPodContainerWaiting_ContainerCreating::InstallerPodNetworking_FailedCreatePodSandBox::StaticPods_Error: InstallerPodContainerWaitingDegraded: Pod \"installer-2-xxia03-cqhk8-m-0.c.openshift-qe.internal\" on node \"xxia03-cqhk8-m-0.c.openshift-qe.internal\" container \"installer\" is waiting for 38m56.672639188s because \"\"\nInstallerPodNetworkingDegraded: Pod \"installer-2-xxia03-cqhk8-m-0.c.openshift-qe.internal\" on node \"xxia03-cqhk8-m-0.c.openshift-qe.internal\" observed degraded networking: (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-2-xxia03-cqhk8-m-0.c.openshift-qe.internal_openshift-etcd_1c23413d-8db3-4344-935d-8c7e5d9e91fc_0(03c858bd0e2c4d0299ecd9e2118f71b5a5f498139033f85debef6f77bf747645): netplugin failed with no error message\nStaticPodsDegraded: pods \"etcd-xxia03-cqhk8-m-0.c.openshift-qe.internal\" not found\nStaticPodsDegraded: pods \"etcd-xxia03-cqhk8-m-2.c.openshift-qe.internal\" not found\nStaticPodsDegraded: pods \"etcd-xxia03-cqhk8-m-1.c.openshift-qe.internal\" not found"
level=info msg="Cluster operator etcd Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 2"
level=info msg="Cluster operator etcd Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 2"
level=error msg="Cluster operator kube-apiserver Degraded is True with InstallerPodContainerWaiting_ContainerCreating::InstallerPodNetworking_FailedCreatePodSandBox::StaticPods_Error: InstallerPodContainerWaitingDegraded: Pod \"installer-2-xxia03-cqhk8-m-0.c.openshift-qe.internal\" on node \"xxia03-cqhk8-m-0.c.openshift-qe.internal\" container \"installer\" is waiting for 36m45.196622777s because \"\"\nInstallerPodNetworkingDegraded: Pod \"installer-2-xxia03-cqhk8-m-0.c.openshift-qe.internal\" on node \"xxia03-cqhk8-m-0.c.openshift-qe.internal\" observed degraded networking: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-2-xxia03-cqhk8-m-0.c.openshift-qe.internal_openshift-kube-apiserver_4214c57b-231d-409f-8edc-3b11c909e8f5_0(37893ee187d0bfcab4944d58caebafa140990b06bb7ae6520d80120b517308a9): netplugin failed with no error message\nStaticPodsDegraded: pods \"kube-apiserver-xxia03-cqhk8-m-0.c.openshift-qe.internal\" not found\nStaticPodsDegraded: pods \"kube-apiserver-xxia03-cqhk8-m-2.c.openshift-qe.internal\" not found\nStaticPodsDegraded: pods \"kube-apiserver-xxia03-cqhk8-m-1.c.openshift-qe.internal\" not found"
...
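
For triage, the degraded operators can be pulled out of an installer log like the one above with a short script. This is only a sketch: the line layout is inferred from this excerpt, the function name degraded_operators is made up for illustration, and the etcd sample message is shortened.

```python
import re

# Assumed shape of installer lines, inferred only from the excerpt above:
# level=<level> msg="Cluster operator <name> <condition> is <status> with <reason>: <message>"
OPERATOR_LINE = re.compile(
    r'^level=(?P<level>\w+) msg="Cluster operator (?P<operator>\S+) '
    r'(?P<condition>\w+) is (?P<status>\w+) with (?P<reason>\S+?): '
)


def degraded_operators(lines):
    """Return {operator: reason} for lines that report 'Degraded is True'."""
    found = {}
    for line in lines:
        m = OPERATOR_LINE.match(line)
        if m and m["condition"] == "Degraded" and m["status"] == "True":
            found[m["operator"]] = m["reason"]
    return found


sample = [
    r'level=info msg="Cluster operator authentication Progressing is Unknown with NoData: "',
    r'level=info msg="Cluster operator dns Progressing is True with Reconciling: Not all DNS DaemonSets available."',
    # Message text shortened for this sketch; the reason string is copied from the log.
    r'level=error msg="Cluster operator etcd Degraded is True with '
    r'InstallerPodContainerWaiting_ContainerCreating::InstallerPodNetworking_FailedCreatePodSandBox::StaticPods_Error: '
    r'InstallerPodContainerWaitingDegraded: (shortened)"',
]

for operator, reason in degraded_operators(sample).items():
    print(operator, reason)
```

Here the reason for etcd contains InstallerPodNetworking_FailedCreatePodSandBox, which points at pod networking rather than at the operator itself.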

Above it says "installer-2-xxia03-cqhk8-m-0.c.openshift-qe.internal ... failed to create pod network sandbox ... netplugin failed with no error message".
# oc get no # no workers are up yet
NAME                                       STATUS   ROLES    AGE    VERSION
xxia03-cqhk8-m-0.c.openshift-qe.internal   Ready    master   7h4m   v1.17.1
xxia03-cqhk8-m-1.c.openshift-qe.internal   Ready    master   7h3m   v1.17.1
xxia03-cqhk8-m-2.c.openshift-qe.internal   Ready    master   7h3m   v1.17.1
# oc get po -n openshift-kube-apiserver # only shows one pod as below
NAME                                                   READY   STATUS              RESTARTS   AGE
installer-2-xxia03-cqhk8-m-0.c.openshift-qe.internal   0/1     ContainerCreating   0          6h59m
# oc describe po installer-2-xxia03-cqhk8-m-0.c.openshift-qe.internal -n openshift-kube-apiserver
  Warning  FailedCreatePodSandBox  102s (x1461 over 6h44m)  kubelet, xxia03-cqhk8-m-0.c.openshift-qe.internal  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-2-xxia03-cqhk8-m-0.c.openshift-qe.internal_openshift-kube-apiserver_4214c57b-231d-409f-8edc-3b11c909e8f5_0(35073fac92a453debb270e8fdafe7c67324ab1e38e94c3f183971a5248c7c6a6): netplugin failed with no error message
# oc get po -n openshift-sdn -o wide
ovs-cfhzt              1/1     Running   0          6h54m   10.0.0.4   xxia03-cqhk8-m-2.c.openshift-qe.internal   <none> <none>
ovs-sxkkr              1/1     Running   0          6h54m   10.0.0.3   xxia03-cqhk8-m-0.c.openshift-qe.internal   <none> <none>
ovs-tlmkl              1/1     Running   0          6h54m   10.0.0.5   xxia03-cqhk8-m-1.c.openshift-qe.internal   <none> <none>
# oc logs -n openshift-sdn ovs-sxkkr > ovs-sxkkr.log
# vi ovs-sxkkr.log # many errors like the ones below
...
2020-03-17T09:24:30.395Z|04141|jsonrpc|WARN|unix#27609: receive error: Connection reset by peer
2020-03-17T09:24:30.395Z|04142|reconnect|WARN|unix#27609: connection dropped (Connection reset by peer)
...
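
To quantify how often these warnings occur in a saved log such as ovs-sxkkr.log, something like the following could be used. It is a rough sketch: the pipe-separated layout (timestamp, sequence number, module, level, message) is assumed from the two sample lines above, and count_connection_resets is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed line layout, inferred only from the two sample lines above:
# <timestamp>|<sequence>|<module>|<level>|<message>
OVS_LINE = re.compile(
    r"^(?P<ts>\S+)\|(?P<seq>\d+)\|(?P<module>[^|]+)\|(?P<level>[^|]+)\|(?P<msg>.*)$"
)


def count_connection_resets(lines):
    """Count WARN lines mentioning 'Connection reset by peer', keyed by module."""
    counts = Counter()
    for line in lines:
        m = OVS_LINE.match(line.strip())
        if m is None:
            continue  # skip lines that do not fit the assumed layout
        if m["level"] == "WARN" and "Connection reset by peer" in m["msg"]:
            counts[m["module"]] += 1
    return counts


sample = [
    "2020-03-17T09:24:30.395Z|04141|jsonrpc|WARN|unix#27609: receive error: Connection reset by peer",
    "2020-03-17T09:24:30.395Z|04142|reconnect|WARN|unix#27609: connection dropped (Connection reset by peer)",
]

print(count_connection_resets(sample))
```

For a real log, the same function can be fed the file directly, for example count_connection_resets(open("ovs-sxkkr.log")).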

Comment 25 zhaozhanqi 2020-04-24 05:37:18 UTC
Tried many times and this issue was not reproduced in 4.5 (4.5.0-0.nightly-2020-04-21-103613).
Verified this bug.

Comment 32 Andre Costa 2020-06-10 13:11:32 UTC
Created attachment 1696486
ovs-pods-logs

Comment 38 errata-xmlrpc 2020-07-13 17:14:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409