This bug was initially created as a copy of Bug #1870247.

I am copying this bug because: one problem was fixed, but OpenStack is still failing.

OpenStack is failing 25% of the time on this job:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-openstack-serial-4.6

I suspect there's a problem where a long-running connection from client to apiserver to kubelet to CRI-O is getting interrupted. It's specifically a problem on OpenStack, and I don't know where in the chain it happens. This test opens an `oc port-forward` which gets broken.

test:
[sig-api-machinery] API data in etcd should be stored at the correct location and version for all resources [Serial]

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-api-machinery%5C%5D+API+data+in+etcd+should+be+stored+at+the+correct+location+and+version+for+all+resources+%5C%5BSerial%5C%5D

Number one flake on OpenStack, occasional failure on Azure:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-openstack-serial-4.6
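For context, here is a minimal Go sketch of the kind of long-lived tunnel the test relies on. This is not the test's actual code; the namespace, pod name and port numbers are placeholders. The local listener only lives as long as the single SPDY stream through the apiserver and kubelet, so anything that drops that stream kills the forward:

package main

import (
	"fmt"
	"net/http"
	"net/url"
	"os"
	"strings"

	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/portforward"
	"k8s.io/client-go/transport/spdy"
)

func main() {
	// Hypothetical target pod; the real test picks its target dynamically.
	namespace, pod := "openshift-etcd", "etcd-master-0"

	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}

	// Upgrade a POST to the pod's portforward subresource into a SPDY stream.
	roundTripper, upgrader, err := spdy.RoundTripperFor(config)
	if err != nil {
		panic(err)
	}
	path := fmt.Sprintf("/api/v1/namespaces/%s/pods/%s/portforward", namespace, pod)
	hostIP := strings.TrimPrefix(strings.TrimPrefix(config.Host, "https://"), "http://")
	serverURL := url.URL{Scheme: "https", Path: path, Host: hostIP}
	dialer := spdy.NewDialer(upgrader, &http.Client{Transport: roundTripper}, http.MethodPost, &serverURL)

	// Forward local 44753 to etcd's client port 2379 (both ports are placeholders).
	stopCh, readyCh := make(chan struct{}, 1), make(chan struct{}, 1)
	fw, err := portforward.New(dialer, []string{"44753:2379"}, stopCh, readyCh, os.Stdout, os.Stderr)
	if err != nil {
		panic(err)
	}

	// Blocks until the stream breaks; once it returns, 127.0.0.1:44753 starts
	// refusing connections, which matches the etcd client logs below.
	if err := fw.ForwardPorts(); err != nil {
		panic(err)
	}
}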
It seems like the code added in [1] to re-establish the connection in case of a kube-apiserver rollout is never called. The etcd client keeps trying to reconnect to the same port on localhost, and it fails because the port forwarding was interrupted.

STEP: testing authorization.openshift.io/v1, Resource=clusterrolebindings
Sep 8 14:36:13.282: INFO: using old etcd client
STEP: testing authorization.openshift.io/v1, Resource=clusterroles
Sep 8 14:36:13.750: INFO: using old etcd client
STEP: testing authorization.openshift.io/v1, Resource=rolebindingrestrictions
STEP: testing authorization.openshift.io/v1, Resource=rolebindings
Sep 8 14:36:14.244: INFO: using old etcd client
{"level":"warn","ts":"2020-09-08T14:36:14.340+0200","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-8df41d7a-8f1a-4cb8-a650-39fef8c96756/127.0.0.1:44753","attempt":0,"error":"rpc error: code = Unavailable desc = transport is closing"}
W0908 14:36:14.340858 129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
W0908 14:36:15.341261 129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
[...]
W0908 18:19:15.842461 129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
W0908 18:21:35.431275 129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
^C

I could trace it to [2] and [3], where the retry limit is set to 100, but I don't know how to proceed from here.

[1] https://github.com/openshift/origin/pull/25423
[2] https://github.com/openshift/origin/blob/9b828d0/vendor/go.etcd.io/etcd/clientv3/client.go#L254
[3] https://github.com/openshift/origin/blob/9b828d0/vendor/go.etcd.io/etcd/clientv3/options.go#L45
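For anyone reproducing this locally, a minimal Go sketch (assumed, not the test's exact code) of how the suite talks to etcd through the forwarded port follows. The endpoint, certificate paths and key prefix are placeholders. Because the endpoint list is fixed at 127.0.0.1:44753 for the lifetime of the client, once the port-forward dies every retry hits the same dead port, producing the "connection refused ... Reconnecting..." loop above; per [3], the clientv3 retry interceptor retries unary calls up to 100 times by default.

package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/pkg/transport"
)

func main() {
	// TLS material copied out of the etcd pod; these paths are hypothetical.
	tlsInfo := transport.TLSInfo{
		CertFile:      "/tmp/etcd-client.crt",
		KeyFile:       "/tmp/etcd-client.key",
		TrustedCAFile: "/tmp/etcd-ca.crt",
	}
	tlsConfig, err := tlsInfo.ClientConfig()
	if err != nil {
		panic(err)
	}

	cli, err := clientv3.New(clientv3.Config{
		// The forwarded local port. The endpoint list is fixed for the lifetime
		// of the client, so it cannot recover if the port-forward itself breaks.
		Endpoints:   []string{"https://127.0.0.1:44753"},
		DialTimeout: 5 * time.Second,
		TLS:         tlsConfig,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// A unary call like Get goes through the retry interceptor seen in the
	// warning above (clientv3/retry_interceptor.go). The key prefix here is
	// just an example.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	resp, err := cli.Get(ctx, "/kubernetes.io/", clientv3.WithPrefix(), clientv3.WithKeysOnly(), clientv3.WithLimit(5))
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Println(string(kv.Key))
	}
}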
Verified with the latest 4.6 payload image: the installation success ratio has increased and no more failures to connect to the API have been found.
*** Bug 1871814 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196