Bug 1875005 - [sig-api-machinery] API data in etcd should be stored at the correct location and version for all resources [Serial]
Summary: [sig-api-machinery] API data in etcd should be stored at the correct location...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Martin André
QA Contact: David Sanz
URL:
Whiteboard:
Duplicates: 1871814
Depends On:
Blocks: 1881147 1888301
 
Reported: 2020-09-02 17:55 UTC by David Eads
Modified: 2020-10-27 16:37 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Unnecessary API VIP moves.
Consequence: Client connection errors.
Fix: Changed the API VIP health checks to limit the number of times the VIP moves.
Result: Fewer errors caused by API VIP moves.
Clone Of:
Clones: 1881147
Environment:
Last Closed: 2020-10-27 16:37:14 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-config-operator pull 2077 | 0 | None | closed | Bug 1875005: OpenStack: Don't failover api vip if loadbalanced endpoint is responding | 2020-11-03 10:22:07 UTC
Github | openshift machine-config-operator pull 2091 | 0 | None | closed | Bug 1875005: OpenStack: Reverse haproxy and keepalived check timings | 2020-11-03 10:22:07 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:37:32 UTC
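
The first pull request title above ("Don't failover api vip if loadbalanced endpoint is responding") and the Doc Text describe limiting how often the API VIP moves. A minimal sketch of that idea follows, under stated assumptions: the type `vipCheck`, its `fall` threshold, and the `observe` method are hypothetical names for illustration and are not taken from keepalived or the machine-config-operator. The node keeps the VIP while either the local API server or the load-balanced endpoint answers, and only gives it up after several consecutive rounds in which both fail.

```go
package main

import "fmt"

// vipCheck is a hypothetical model of a VIP health check. It is not the
// keepalived or machine-config-operator implementation.
type vipCheck struct {
	fall     int // consecutive failed rounds after which the VIP is released (assumed value)
	failures int // failed rounds seen in a row so far
}

// observe records one probe round and reports whether the node should keep
// the VIP. localOK means the local API server answered; lbOK means the
// load-balanced endpoint answered.
func (c *vipCheck) observe(localOK, lbOK bool) bool {
	if localOK || lbOK {
		// Something is still serving the API: do not move the VIP.
		c.failures = 0
		return true
	}
	c.failures++
	return c.failures < c.fall
}

func main() {
	c := &vipCheck{fall: 3}
	rounds := []struct{ localOK, lbOK bool }{
		{true, true},
		{false, true}, // local API down, load-balanced endpoint still up
		{false, false},
		{false, false},
		{false, false},
	}
	for i, r := range rounds {
		keep := c.observe(r.localOK, r.lbOK)
		fmt.Printf("round %d: local=%v lb=%v keepVIP=%v\n", i+1, r.localOK, r.lbOK, keep)
	}
}
```

Requiring several failed rounds before releasing the VIP trades slower failover for fewer moves, which is the trade-off the Doc Text describes.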

Description David Eads 2020-09-02 17:55:22 UTC
This bug was initially created as a copy of Bug #1870247

I am copying this bug because one problem was fixed, but OpenStack is still failing.

OpenStack is failing 25% of the time on this job: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-openstack-serial-4.6

I suspect that a long-running connection from client to apiserver to kubelet to CRI-O is getting interrupted. It is specifically a problem on OpenStack, and I don't know where in the chain it happens. This test opens an `oc port-forward`, which gets broken.



test:
[sig-api-machinery] API data in etcd should be stored at the correct location and version for all resources [Serial] 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-api-machinery%5C%5D+API+data+in+etcd+should+be+stored+at+the+correct+location+and+version+for+all+resources+%5C%5BSerial%5C%5D


Number one flake on OpenStack, occasional failure on Azure: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-openstack-serial-4.6

Comment 2 Martin André 2020-09-08 17:24:20 UTC
It seems that the code added in [1] to re-establish the connection in case of a kube-apiserver rollout is never called. The etcd client keeps trying to reconnect to the same port on localhost, and this fails because the port forwarding was interrupted.

STEP: testing authorization.openshift.io/v1, Resource=clusterrolebindings
Sep  8 14:36:13.282: INFO: using old etcd client
STEP: testing authorization.openshift.io/v1, Resource=clusterroles
Sep  8 14:36:13.750: INFO: using old etcd client
STEP: testing authorization.openshift.io/v1, Resource=rolebindingrestrictions
STEP: testing authorization.openshift.io/v1, Resource=rolebindings
Sep  8 14:36:14.244: INFO: using old etcd client
{"level":"warn","ts":"2020-09-08T14:36:14.340+0200","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-8df41d7a-8f1a-4cb8-a650-39fef8c96756/127.0.0.1:44753","attempt":0,"error":"rpc error: code = Unavailable desc = transport is closing"}
W0908 14:36:14.340858  129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
W0908 14:36:15.341261  129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
[...]
W0908 18:19:15.842461  129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
W0908 18:21:35.431275  129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
^C

I traced the retry behavior to [2] and [3], where the client is set to retry 100 times, but I don't know how to proceed from here.

[1] https://github.com/openshift/origin/pull/25423
[2] https://github.com/openshift/origin/blob/9b828d0/vendor/go.etcd.io/etcd/clientv3/client.go#L254
[3] https://github.com/openshift/origin/blob/9b828d0/vendor/go.etcd.io/etcd/clientv3/options.go#L45
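
The failure mode in this comment can be sketched with the Go standard library alone: retrying a dial against a local port whose forward has gone away never succeeds, whereas asking for a fresh forward after a refused connection does. This is an illustration under assumptions, not the origin test code or the etcd client API. The names `forwarder`, `dialWithReforward`, and `demo` are hypothetical, and a plain TCP listener stands in for `oc port-forward`.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// forwarder is a hypothetical stand-in for whatever re-creates the
// port-forward and returns its new local address (host:port).
type forwarder func() (string, error)

// dialWithReforward dials addr. When the dial fails, it does not retry the
// same dead address; it asks reforward for a fresh one, at most
// maxReforwards times. It returns the connection and the address used.
func dialWithReforward(addr string, reforward forwarder, maxReforwards int) (net.Conn, string, error) {
	for i := 0; ; i++ {
		conn, err := net.DialTimeout("tcp", addr, time.Second)
		if err == nil {
			return conn, addr, nil
		}
		if i >= maxReforwards {
			return nil, addr, fmt.Errorf("giving up after %d re-forwards: %w", maxReforwards, err)
		}
		newAddr, ferr := reforward()
		if ferr != nil {
			return nil, addr, fmt.Errorf("re-forward failed: %v (last dial error: %w)", ferr, err)
		}
		addr = newAddr
	}
}

// demo simulates a forward that went away and a replacement on a new port.
func demo() error {
	// Reserve a local port, then close it so that dialing it is refused.
	dead, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	deadAddr := dead.Addr().String()
	dead.Close()

	// Each call to reforward opens a new local listener, standing in for
	// a re-established port-forward.
	var listeners []net.Listener
	defer func() {
		for _, l := range listeners {
			l.Close()
		}
	}()
	reforward := func() (string, error) {
		l, err := net.Listen("tcp", "127.0.0.1:0")
		if err != nil {
			return "", err
		}
		listeners = append(listeners, l)
		return l.Addr().String(), nil
	}

	conn, addr, err := dialWithReforward(deadAddr, reforward, 3)
	if err != nil {
		return err
	}
	defer conn.Close()
	fmt.Println("connected through re-established forward at", addr)
	return nil
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("error:", err)
	}
}
```

The design point is that the retry budget is spent on obtaining a new endpoint rather than on re-dialing the old one, which is what the repeated "connection refused" lines above show happening.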

Comment 7 David Sanz 2020-09-22 11:44:49 UTC
Verified with the latest 4.6 payload image: the installation success rate has increased, and no more API connection failures have been found.

Comment 8 Luis Tomas Bolivar 2020-09-23 09:51:18 UTC
*** Bug 1871814 has been marked as a duplicate of this bug. ***

Comment 11 errata-xmlrpc 2020-10-27 16:37:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

