Bug 1875005 - [sig-api-machinery] API data in etcd should be stored at the correct location and version for all resources [Serial]
Summary: [sig-api-machinery] API data in etcd should be stored at the correct location...
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Martin André
QA Contact: David Sanz
URL:
Whiteboard:
Duplicates: 1871814
Depends On:
Blocks: 1881147
 
Reported: 2020-09-02 17:55 UTC by David Eads
Modified: 2020-09-23 09:51 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1881147
Environment:
Last Closed:
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2077 None closed Bug 1875005: OpenStack: Don't failover api vip if loadbalanced endpoint is responding 2020-09-21 06:14:19 UTC
Github openshift machine-config-operator pull 2091 None closed Bug 1875005: OpenStack: Reverse haproxy and keepalived check timings 2020-09-21 06:14:19 UTC

Description David Eads 2020-09-02 17:55:22 UTC
This bug was initially created as a copy of Bug #1870247

I am copying this bug because: one problem was fixed, but OpenStack is still failing.

OpenStack is failing 25% of the time on this test: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-openstack-serial-4.6

I suspect there's a problem where a long-running connection from client to apiserver to kubelet to CRI-O is getting interrupted. It's specifically a problem on OpenStack and I don't know where in the chain it happens. This test opens an `oc port-forward` which gets broken.
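For illustration only, here is a minimal sketch (not the actual origin test code) of how such a test wires the etcd endpoint through `oc port-forward` and then dials the forwarded local port; the namespace, pod name, and ports below are placeholders. It shows why every request rides the client -> apiserver -> kubelet -> CRI-O chain and dies with it:

// Hypothetical sketch, not the origin test: spawn `oc port-forward` against an
// etcd member pod and dial the forwarded local port. Pod name, namespace and
// ports are assumptions for illustration.
package main

import (
	"context"
	"fmt"
	"net"
	"os/exec"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Forward local 44753 to etcd's client port; the traffic goes
	// client -> kube-apiserver -> kubelet -> CRI-O -> pod.
	pf := exec.CommandContext(ctx, "oc", "port-forward",
		"-n", "openshift-etcd", "pod/etcd-master-0", "44753:2379")
	if err := pf.Start(); err != nil {
		panic(err)
	}

	// Wait for the forwarded port to start accepting connections.
	var conn net.Conn
	var err error
	for i := 0; i < 10; i++ {
		if conn, err = net.DialTimeout("tcp", "127.0.0.1:44753", time.Second); err == nil {
			break
		}
		time.Sleep(time.Second)
	}
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// If any hop in the chain drops the stream, this local port starts
	// refusing connections and nothing re-creates it automatically.
	fmt.Println("port-forward established on 127.0.0.1:44753")
}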



test:
[sig-api-machinery] API data in etcd should be stored at the correct location and version for all resources [Serial] 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-api-machinery%5C%5D+API+data+in+etcd+should+be+stored+at+the+correct+location+and+version+for+all+resources+%5C%5BSerial%5C%5D


Number one flake on OpenStack, occasional failure on Azure: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-openstack-serial-4.6

Comment 2 Martin André 2020-09-08 17:24:20 UTC
It seems like the code added in [1] to re-establish the connection in case of a kube-apiserver rollout is never called. The etcd client keeps trying to reconnect to the same port on localhost, and this fails because the port forwarding was interrupted.

STEP: testing authorization.openshift.io/v1, Resource=clusterrolebindings
Sep  8 14:36:13.282: INFO: using old etcd client
STEP: testing authorization.openshift.io/v1, Resource=clusterroles
Sep  8 14:36:13.750: INFO: using old etcd client
STEP: testing authorization.openshift.io/v1, Resource=rolebindingrestrictions
STEP: testing authorization.openshift.io/v1, Resource=rolebindings
Sep  8 14:36:14.244: INFO: using old etcd client
{"level":"warn","ts":"2020-09-08T14:36:14.340+0200","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-8df41d7a-8f1a-4cb8-a650-39fef8c96756/127.0.0.1:44753","attempt":0,"error":"rpc error: code = Unavailable desc = transport is closing"}
W0908 14:36:14.340858  129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
W0908 14:36:15.341261  129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
[...]
W0908 18:19:15.842461  129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
W0908 18:21:35.431275  129249 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:44753  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:44753: connect: connection refused". Reconnecting...
^C

I could trace it to [2] and [3], where the client is configured to retry up to 100 times, but I don't know how to proceed from here.

[1] https://github.com/openshift/origin/pull/25423
[2] https://github.com/openshift/origin/blob/9b828d0/vendor/go.etcd.io/etcd/clientv3/client.go#L254
[3] https://github.com/openshift/origin/blob/9b828d0/vendor/go.etcd.io/etcd/clientv3/options.go#L45
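For illustration, a minimal sketch (not the actual wrapper from [1]) of the pattern the test would need: bound each etcd request with a context timeout so a dead port-forward cannot keep retrying for hours, and rebuild the client together with a fresh port-forward once the forwarded endpoint stops answering. The endpoint address is taken from the log above; the TLS configuration a real cluster requires is omitted here:

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3"
)

// newEtcdClient dials the locally forwarded etcd port. In the real test the
// endpoint is whatever local port `oc port-forward` picked, and client TLS
// certificates must be supplied via clientv3.Config.TLS.
func newEtcdClient(endpoint string) (*clientv3.Client, error) {
	return clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 5 * time.Second,
	})
}

func listKeys(cli *clientv3.Client) error {
	// Without a bounded context the default retry behaviour keeps re-dialing
	// the same dead 127.0.0.1 port, which is exactly what the log above shows.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	resp, err := cli.Get(ctx, "/kubernetes.io/", clientv3.WithPrefix(), clientv3.WithKeysOnly())
	if err != nil {
		return err
	}
	fmt.Println("keys under /kubernetes.io/:", len(resp.Kvs))
	return nil
}

func main() {
	cli, err := newEtcdClient("https://127.0.0.1:44753")
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	if err := listKeys(cli); err != nil {
		// Retrying on this client cannot succeed once the forwarded port is
		// gone; the test has to re-run `oc port-forward` and build a brand
		// new client pointing at the new local port.
		fmt.Println("request failed, need a fresh port-forward and client:", err)
	}
}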

Comment 7 David Sanz 2020-09-22 11:44:49 UTC
Verified with the latest 4.6 payload image: the installation success ratio has increased and no more failures connecting to the API have been found.

Comment 8 Luis Tomas Bolivar 2020-09-23 09:51:18 UTC
*** Bug 1871814 has been marked as a duplicate of this bug. ***

