Bug 1802481 - Clusters of several matrix/versions hit ovs pod log "unix#... connection dropped (Connection reset by peer)"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.5.0
Assignee: Alexander Constantinescu
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks: 1811140
Reported: 2020-02-13 09:43 UTC by Xingxing Xia
Modified: 2020-11-12 09:48 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1811140
Environment:
Last Closed: 2020-07-13 17:14:44 UTC
Target Upstream Version:


Attachments
ovs-pods-logs (257.38 KB, text/plain)
2020-06-10 13:11 UTC, Andre Costa


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 544 0 None closed Bug 1802481: Don't override containernetworking binaries in SDN and Kuryr 2021-02-16 00:23:42 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:15:14 UTC

Comment 9 Ben Bennett 2020-03-04 14:34:43 UTC
Moving to 4.5 since it hasn't reproduced in a week.  If you hit it, please provide the details and move it back to 4.4

Comment 21 Xingxing Xia 2020-03-17 09:39:21 UTC
Today I hit an installation failure that, after analysis, appears to be the same issue. I tried an IPI installation on GCP with 4.5.0-0.nightly-2020-03-17-011909: https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/85156/ (where the kubeconfig link is shown). The installation failed; the console output at https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/85156/console shows:
...
E0317 03:09:27.807942     561 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get https://api.xxia0317-3.qe.gcp.devcluster.openshift.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: dial tcp 146.148.47.225:6443: i/o timeout
level=error msg="Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::RouterCerts_NoRouterCertSecret: RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret \"v4-0-config-system-router-certs\" not found\nIngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server"
level=info msg="Cluster operator authentication Progressing is Unknown with NoData: "
level=info msg="Cluster operator authentication Available is Unknown with NoData: "
level=info msg="Cluster operator dns Progressing is True with Reconciling: Not all DNS DaemonSets available."
level=error msg="Cluster operator etcd Degraded is True with InstallerPodContainerWaiting_ContainerCreating::InstallerPodNetworking_FailedCreatePodSandBox::StaticPods_Error: InstallerPodContainerWaitingDegraded: Pod \"installer-2-xxia03-cqhk8-m-0.c.openshift-qe.internal\" on node \"xxia03-cqhk8-m-0.c.openshift-qe.internal\" container \"installer\" is waiting for 38m56.672639188s because \"\"\nInstallerPodNetworkingDegraded: Pod \"installer-2-xxia03-cqhk8-m-0.c.openshift-qe.internal\" on node \"xxia03-cqhk8-m-0.c.openshift-qe.internal\" observed degraded networking: (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-2-xxia03-cqhk8-m-0.c.openshift-qe.internal_openshift-etcd_1c23413d-8db3-4344-935d-8c7e5d9e91fc_0(03c858bd0e2c4d0299ecd9e2118f71b5a5f498139033f85debef6f77bf747645): netplugin failed with no error message\nStaticPodsDegraded: pods \"etcd-xxia03-cqhk8-m-0.c.openshift-qe.internal\" not found\nStaticPodsDegraded: pods \"etcd-xxia03-cqhk8-m-2.c.openshift-qe.internal\" not found\nStaticPodsDegraded: pods \"etcd-xxia03-cqhk8-m-1.c.openshift-qe.internal\" not found"
level=info msg="Cluster operator etcd Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 2"
level=info msg="Cluster operator etcd Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 2"
level=error msg="Cluster operator kube-apiserver Degraded is True with InstallerPodContainerWaiting_ContainerCreating::InstallerPodNetworking_FailedCreatePodSandBox::StaticPods_Error: InstallerPodContainerWaitingDegraded: Pod \"installer-2-xxia03-cqhk8-m-0.c.openshift-qe.internal\" on node \"xxia03-cqhk8-m-0.c.openshift-qe.internal\" container \"installer\" is waiting for 36m45.196622777s because \"\"\nInstallerPodNetworkingDegraded: Pod \"installer-2-xxia03-cqhk8-m-0.c.openshift-qe.internal\" on node \"xxia03-cqhk8-m-0.c.openshift-qe.internal\" observed degraded networking: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-2-xxia03-cqhk8-m-0.c.openshift-qe.internal_openshift-kube-apiserver_4214c57b-231d-409f-8edc-3b11c909e8f5_0(37893ee187d0bfcab4944d58caebafa140990b06bb7ae6520d80120b517308a9): netplugin failed with no error message\nStaticPodsDegraded: pods \"kube-apiserver-xxia03-cqhk8-m-0.c.openshift-qe.internal\" not found\nStaticPodsDegraded: pods \"kube-apiserver-xxia03-cqhk8-m-2.c.openshift-qe.internal\" not found\nStaticPodsDegraded: pods \"kube-apiserver-xxia03-cqhk8-m-1.c.openshift-qe.internal\" not found"
...

The errors above report: "installer-2-xxia03-cqhk8-m-0.c.openshift-qe.internal ... failed to create pod network sandbox ... netplugin failed with no error message".
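
The installer output quoted above is long, and the degraded operators are easy to miss among the info lines. As a rough sketch (not part of the original report), the Python below pulls the operators whose Degraded condition is True out of such output, together with their reason codes. The function name `degraded_operators` and the regular expression are assumptions made for illustration; the pattern is based only on the `level=... msg="Cluster operator ..."` lines shown in this comment and may not cover other installer message shapes.

```python
import re

# Assumed line shape, taken from the installer output quoted in this comment:
#   level=error msg="Cluster operator etcd Degraded is True with <Reason>: <details>"
# Reasons can be chained with "::" and end at the first ": " (colon + space).
OPERATOR_LINE = re.compile(
    r'level=(?P<level>\w+) msg="Cluster operator (?P<operator>\S+) '
    r'(?P<condition>\w+) is (?P<status>\w+) with (?P<reason>\S+?): '
)


def degraded_operators(log_text):
    """Map each operator reporting Degraded=True to its list of reason codes."""
    result = {}
    for match in OPERATOR_LINE.finditer(log_text):
        if match["condition"] == "Degraded" and match["status"] == "True":
            # "A_x::B_y" becomes ["A_x", "B_y"]
            result[match["operator"]] = match["reason"].split("::")
    return result
```

Run against the excerpt above, this would be expected to list authentication, etcd and kube-apiserver, with etcd and kube-apiserver both carrying the InstallerPodNetworking_FailedCreatePodSandBox reason that points at the network plugin.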
# oc get no # no workers are up yet
NAME                                       STATUS   ROLES    AGE    VERSION
xxia03-cqhk8-m-0.c.openshift-qe.internal   Ready    master   7h4m   v1.17.1
xxia03-cqhk8-m-1.c.openshift-qe.internal   Ready    master   7h3m   v1.17.1
xxia03-cqhk8-m-2.c.openshift-qe.internal   Ready    master   7h3m   v1.17.1
# oc get po -n openshift-kube-apiserver # only shows one pod as below
NAME                                                   READY   STATUS              RESTARTS   AGE
installer-2-xxia03-cqhk8-m-0.c.openshift-qe.internal   0/1     ContainerCreating   0          6h59m
# oc describe po installer-2-xxia03-cqhk8-m-0.c.openshift-qe.internal -n openshift-kube-apiserver
  Warning  FailedCreatePodSandBox  102s (x1461 over 6h44m)  kubelet, xxia03-cqhk8-m-0.c.openshift-qe.internal  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-2-xxia03-cqhk8-m-0.c.openshift-qe.internal_openshift-kube-apiserver_4214c57b-231d-409f-8edc-3b11c909e8f5_0(35073fac92a453debb270e8fdafe7c67324ab1e38e94c3f183971a5248c7c6a6): netplugin failed with no error message
# oc get po -n openshift-sdn -o wide
ovs-cfhzt              1/1     Running   0          6h54m   10.0.0.4   xxia03-cqhk8-m-2.c.openshift-qe.internal   <none> <none>
ovs-sxkkr              1/1     Running   0          6h54m   10.0.0.3   xxia03-cqhk8-m-0.c.openshift-qe.internal   <none> <none>
ovs-tlmkl              1/1     Running   0          6h54m   10.0.0.5   xxia03-cqhk8-m-1.c.openshift-qe.internal   <none> <none>
# oc logs -n openshift-sdn ovs-sxkkr > ovs-sxkkr.log
# vi ovs-sxkkr.log # contains many errors like the following
...
2020-03-17T09:24:30.395Z|04141|jsonrpc|WARN|unix#27609: receive error: Connection reset by peer
2020-03-17T09:24:30.395Z|04142|reconnect|WARN|unix#27609: connection dropped (Connection reset by peer)
...
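
To get a sense of how many such warnings a saved OVS log contains, a short script can tally them per module. This is a minimal sketch, assuming every log line follows the `timestamp|sequence|module|level|message` layout seen in the two lines quoted above; the function name `count_reset_warnings` is made up for this example.

```python
from collections import Counter


def count_reset_warnings(lines):
    """Count WARN lines mentioning 'Connection reset by peer', keyed by module."""
    counts = Counter()
    for line in lines:
        # Assumed layout: timestamp|sequence|module|level|message
        parts = line.rstrip("\n").split("|", 4)
        if len(parts) != 5:
            continue  # skip lines that do not match the assumed layout
        _timestamp, _sequence, module, level, message = parts
        if level == "WARN" and "Connection reset by peer" in message:
            counts[module] += 1
    return counts
```

For the log saved above, usage would look like `with open("ovs-sxkkr.log") as f: print(count_reset_warnings(f))`.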

Comment 25 zhaozhanqi 2020-04-24 05:37:18 UTC
Tried many times and could not reproduce this issue on 4.5 (4.5.0-0.nightly-2020-04-21-103613).
Marking this bug as verified.

Comment 32 Andre Costa 2020-06-10 13:11:32 UTC
Created attachment 1696486
ovs-pods-logs

Comment 38 errata-xmlrpc 2020-07-13 17:14:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

