1875005 – [sig-api-machinery] API data in etcd should be stored at the correct location and version for all resources [Serial]

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Martin André
QA Contact:	David Sanz
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1871814
Depends On:
Blocks:	1881147 1888301

Reported:	2020-09-02 17:55 UTC by David Eads
Modified:	2020-10-27 16:37 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: Unnecessary API VIP moves. Consequence: Client connection errors. Fix: Changed API VIP healthchecks to limit the number of times the VIP moves. Result: Fewer errors caused by API VIP moves.
Clone Of:
Clones:	1881147
Environment:
Last Closed:	2020-10-27 16:37:14 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 2077	None	closed	Bug 1875005: OpenStack: Don't failover api vip if loadbalanced endpoint is responding	2020-11-03 10:22:07 UTC
Github	openshift machine-config-operator pull 2091	None	closed	Bug 1875005: OpenStack: Reverse haproxy and keepalived check timings	2020-11-03 10:22:07 UTC
Red Hat Product Errata	RHBA-2020:4196	None	None	None	2020-10-27 16:37:32 UTC

Description David Eads 2020-09-02 17:55:22 UTC

This bug was initially created as a copy of Bug #1870247

I am copying this bug because: one problem was fixed, but OpenStack is still failing.

OpenStack is failing 25% of the time here: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-openstack-serial-4.6

I suspect that a long-running connection from client to apiserver to kubelet to crio is getting interrupted. It is specifically a problem on OpenStack, and I don't know where in the chain it happens. This test opens an `oc port-forward`, which gets broken.



test:
[sig-api-machinery] API data in etcd should be stored at the correct location and version for all resources [Serial] 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-api-machinery%5C%5D+API+data+in+etcd+should+be+stored+at+the+correct+location+and+version+for+all+resources+%5C%5BSerial%5C%5D


Number one flake on OpenStack, occasional failure on Azure: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-openstack-serial-4.6

Comment 2 Martin André 2020-09-08 17:24:20 UTC

It seems that the code added in [1] to re-establish the connection in case of a kube-apiserver rollout is never called. The etcd client keeps trying to reconnect to the same port on localhost, and that fails because the port forwarding was interrupted.

STEP: testing authorization.openshift.io/v1, Resource=clusterrolebindings
Sep  8 14:36:13.282: INFO: using old etcd client
STEP: testing authorization.openshift.io/v1, Resource=clusterroles
Sep  8 14:36:13.750: INFO: using old etcd client
STEP: testing authorization.openshift.io/v1, Resource=rolebindingrestrictions
STEP: testing authorization.openshift.io/v1, Resource=rolebindings
Sep  8 14:36:14.244: INFO: using old etcd client
{"level":"warn","ts":"2020-09-08T14:36:14.340+0200","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-8df41d7a-8f1a-4cb8-a650-39fef8c96756/127.0.0.1:44753","attempt":0,"error":"rpc error: code = Unavailable desc = transport is closing"}
W0908 14:36:14.340858  129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
W0908 14:36:15.341261  129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
[...]
W0908 18:19:15.842461  129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
W0908 18:21:35.431275  129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
^C

I could trace the retries to [2] and [3], where the client is set to retry 100 times, but I don't know how to proceed from here.

[1] https://github.com/openshift/origin/pull/25423
[2] https://github.com/openshift/origin/blob/9b828d0/vendor/go.etcd.io/etcd/clientv3/client.go#L254
[3] https://github.com/openshift/origin/blob/9b828d0/vendor/go.etcd.io/etcd/clientv3/options.go#L45

Comment 7 David Sanz 2020-09-22 11:44:49 UTC

Verified with the latest 4.6 payload image: the installation success ratio has increased, and no more failures to connect to the API have been found.

Comment 8 Luis Tomas Bolivar 2020-09-23 09:51:18 UTC

*** Bug 1871814 has been marked as a duplicate of this bug. ***

Comment 11 errata-xmlrpc 2020-10-27 16:37:14 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
