Bug 1875005
Summary: | [sig-api-machinery] API data in etcd should be stored at the correct location and version for all resources [Serial] | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | David Eads <deads> | |
Component: | Installer | Assignee: | Martin André <m.andre> | |
Installer sub component: | OpenShift on OpenStack | QA Contact: | David Sanz <dsanzmor> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | urgent | |||
Priority: | urgent | CC: | m.andre, pprinett, rlobillo | |
Version: | 4.6 | Keywords: | UpcomingSprint | |
Target Milestone: | --- | |||
Target Release: | 4.6.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | | Doc Type: | Bug Fix | |
Doc Text: | Cause: Unnecessary API VIP moves. Consequence: Client connection errors. Fix: Changed the API VIP healthchecks to limit the number of times the VIP moves. Result: Fewer errors caused by API VIP moves. |
Story Points: | --- | |
Clone Of: | ||||
: | 1881147 | Environment: | |
Last Closed: | 2020-10-27 16:37:14 UTC | Type: | --- | |
Bug Blocks: | 1881147, 1888301 |
Description
David Eads
2020-09-02 17:55:22 UTC
It seems like the code added in [1] to re-establish the connection in case of a kube-apiserver rollout is never called. The etcd client tries to reconnect to the same port on localhost and fails because the port forwarding was interrupted.

STEP: testing authorization.openshift.io/v1, Resource=clusterrolebindings
Sep 8 14:36:13.282: INFO: using old etcd client
STEP: testing authorization.openshift.io/v1, Resource=clusterroles
Sep 8 14:36:13.750: INFO: using old etcd client
STEP: testing authorization.openshift.io/v1, Resource=rolebindingrestrictions
STEP: testing authorization.openshift.io/v1, Resource=rolebindings
Sep 8 14:36:14.244: INFO: using old etcd client
{"level":"warn","ts":"2020-09-08T14:36:14.340+0200","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-8df41d7a-8f1a-4cb8-a650-39fef8c96756/127.0.0.1:44753","attempt":0,"error":"rpc error: code = Unavailable desc = transport is closing"}
W0908 14:36:14.340858 129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
W0908 14:36:15.341261 129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
[...]
W0908 18:19:15.842461 129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
W0908 18:21:35.431275 129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
^C

I could trace it to [2] and [3], where it is set to retry 100 times, but I don't know how to proceed from here.

[1] https://github.com/openshift/origin/pull/25423
[2] https://github.com/openshift/origin/blob/9b828d0/vendor/go.etcd.io/etcd/clientv3/client.go#L254
[3] https://github.com/openshift/origin/blob/9b828d0/vendor/go.etcd.io/etcd/clientv3/options.go#L45

Verified using the latest 4.6 payload image: the installation success ratio has increased and no more failures to connect to the API have been found.

*** Bug 1871814 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196