Bug 1875005

Summary: [sig-api-machinery] API data in etcd should be stored at the correct location and version for all resources [Serial]
Product: OpenShift Container Platform
Reporter: David Eads <deads>
Component: Installer
Sub Component: OpenShift on OpenStack
Assignee: Martin André <m.andre>
QA Contact: David Sanz <dsanzmor>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: m.andre, pprinett, rlobillo
Version: 4.6
Keywords: UpcomingSprint
Target Release: 4.6.0
Doc Type: Bug Fix
Doc Text: Cause: unnecessary API VIP moves. Consequence: client connection errors. Fix: changed the API VIP health checks to limit the number of times the VIP moves. Result: fewer errors caused by API VIP moves.
Last Closed: 2020-10-27 16:37:14 UTC
Bug Blocks: 1881147, 1888301    

Description David Eads 2020-09-02 17:55:22 UTC
This bug was initially created as a copy of Bug #1870247

I am copying this bug because one problem was fixed, but OpenStack is still failing.

OpenStack is failing 25% of the time in this job: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-openstack-serial-4.6

I suspect there's a problem where a long-running connection from client to apiserver to kubelet to CRI-O is getting interrupted.  It's specifically a problem on OpenStack and I don't know where in the chain it happens.  This test opens an `oc port-forward` which gets broken.
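
For context, the failing test drives an etcd client through a local `oc port-forward` tunnel. The following is a minimal sketch of that pattern, not the actual origin test code; the local port, the key prefix, and the omitted TLS setup are illustrative assumptions:

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3"
)

func main() {
	// 44753 stands in for whatever local port `oc port-forward` happened to pick.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:44753"},
		DialTimeout: 5 * time.Second,
		// TLS configuration for the etcd serving certificates is omitted here.
	})
	if err != nil {
		fmt.Println("failed to create etcd client:", err)
		return
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	// The test issues reads like this for every resource; once the forwarded
	// port is gone, the client only ever retries the same dead local port.
	resp, err := cli.Get(ctx, "/kubernetes.io/", clientv3.WithPrefix(), clientv3.WithCountOnly())
	if err != nil {
		fmt.Println("get failed:", err)
		return
	}
	fmt.Println("keys under /kubernetes.io/:", resp.Count)
}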



test:
[sig-api-machinery] API data in etcd should be stored at the correct location and version for all resources [Serial] 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-api-machinery%5C%5D+API+data+in+etcd+should+be+stored+at+the+correct+location+and+version+for+all+resources+%5C%5BSerial%5C%5D


Number one flake on OpenStack, occasional failure on Azure: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-openstack-serial-4.6

Comment 2 Martin André 2020-09-08 17:24:20 UTC
It seems like the code added in [1] to re-establish the connection in case of a kube-apiserver rollout is never called. The etcd client keeps trying to reconnect to the same port on localhost, and it fails because the port forwarding was interrupted.

STEP: testing authorization.openshift.io/v1, Resource=clusterrolebindings
Sep  8 14:36:13.282: INFO: using old etcd client
STEP: testing authorization.openshift.io/v1, Resource=clusterroles
Sep  8 14:36:13.750: INFO: using old etcd client
STEP: testing authorization.openshift.io/v1, Resource=rolebindingrestrictions
STEP: testing authorization.openshift.io/v1, Resource=rolebindings
Sep  8 14:36:14.244: INFO: using old etcd client
{"level":"warn","ts":"2020-09-08T14:36:14.340+0200","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-8df41d7a-8f1a-4cb8-a650-39fef8c96756/127.0.0.1:44753","attempt":0,"error":"rpc error: code = Unavailable desc = transport is closing"}
W0908 14:36:14.340858  129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
W0908 14:36:15.341261  129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
[...]
W0908 18:19:15.842461  129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
W0908 18:21:35.431275  129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
^C

I could trace it to [2] and [3], where the client is configured to retry 100 times, but I don't know how to proceed from here.

[1] https://github.com/openshift/origin/pull/25423
[2] https://github.com/openshift/origin/blob/9b828d0/vendor/go.etcd.io/etcd/clientv3/client.go#L254
[3] https://github.com/openshift/origin/blob/9b828d0/vendor/go.etcd.io/etcd/clientv3/options.go#L45
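
For illustration, here is a minimal sketch, not the actual fix and not the origin test code, of the kind of recovery [1] was apparently meant to provide: probe the existing client, and if its endpoint is dead, close it and dial a new client over a freshly re-established port-forward instead of letting the vendored retry logic spin against the stale local port. The helper dialEtcdThroughPortForward and the placeholder port are hypothetical.

package etcdhelpers

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3"
)

// dialEtcdThroughPortForward is a hypothetical helper that (re)runs
// `oc port-forward` to an etcd pod and returns a client bound to the fresh
// local port. The port below is only a placeholder.
func dialEtcdThroughPortForward(ctx context.Context) (*clientv3.Client, error) {
	return clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:44753"},
		DialTimeout: 5 * time.Second,
	})
}

// healthyOrReplace keeps the existing client if a quick Status call still
// succeeds ("using old etcd client" in the test logs); otherwise it closes
// the client and dials a replacement through a new port-forward.
func healthyOrReplace(ctx context.Context, cli *clientv3.Client) (*clientv3.Client, error) {
	probeCtx, cancel := context.WithTimeout(ctx, 3*time.Second)
	defer cancel()
	if _, err := cli.Status(probeCtx, cli.Endpoints()[0]); err == nil {
		log.Println("using old etcd client")
		return cli, nil
	}
	// The vendored retry interceptor only re-dials the same endpoint, up to
	// the 100 attempts noted in [2] and [3], so a broken port-forward has to
	// be replaced explicitly.
	log.Println("old endpoint unreachable, re-establishing port-forward")
	cli.Close()
	return dialEtcdThroughPortForward(ctx)
}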

Comment 7 David Sanz 2020-09-22 11:44:49 UTC
Verified using the latest 4.6 payload image: the installation success rate has increased and no more failures connecting to the API have been found.

Comment 8 Luis Tomas Bolivar 2020-09-23 09:51:18 UTC
*** Bug 1871814 has been marked as a duplicate of this bug. ***

Comment 11 errata-xmlrpc 2020-10-27 16:37:14 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196