Bug 1884139 - kuryr-controller stuck if connection to K8s API dies silently
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Michał Dulko
QA Contact: GenadiC
URL:
Whiteboard:
Depends On:
Blocks: 1884192
Reported: 2020-10-01 07:28 UTC by Luis Tomas Bolivar
Modified: 2020-10-27 16:47 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1884192
Environment:
Last Closed: 2020-10-27 16:47:15 UTC
Target Upstream Version:
Embargoed:

Links
System                  ID                                   Private  Priority  Status  Summary                                                     Last Updated
Github                  openshift kuryr-kubernetes pull 359  0        None      closed  Bug 1884139: Set read timeout for any request in K8sClient  2020-10-06 13:35:43 UTC
Launchpad               1897893                              0        None      None    None                                                        2020-10-01 07:28:03 UTC
OpenStack gerrit        755254                               0        None      MERGED  Set read timeout for any request in K8sClient               2020-10-05 13:06:39 UTC
Red Hat Product Errata  RHBA-2020:4196                       0        None      None    None                                                        2020-10-27 16:47:34 UTC

Description Luis Tomas Bolivar 2020-10-01 07:28:04 UTC
Kuryr components often contact the K8s API through a load balancer (e.g. Octavia LB in DevStack deployments, HAProxy in OpenShift), and we've often seen these load balancers drop connections silently, leaving our requests hanging forever. This was fixed in `K8sClient.watch` by setting a read timeout there, which helped a lot, but we now see the same hang with other requests that don't have a read timeout set.
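
To illustrate the failure mode, here is a minimal sketch using only the Python standard library. It is not the actual `K8sClient` code. The helper names (`get_with_timeout`, `start_silent_server`) and the timeout value are hypothetical, chosen for the example. The "silent" server stands in for a load balancer that accepts a connection and then never sends anything back: without a timeout the client would block indefinitely, with one it fails fast and the caller can retry.

```python
import http.client
import socket
import threading


def get_with_timeout(host, port, path, timeout):
    """Send a GET and give up if any socket operation stalls for `timeout` seconds.

    http.client applies `timeout` to the connect and to each blocking
    send/receive, so a peer that goes silent raises socket.timeout
    instead of hanging the caller forever.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()


def start_silent_server():
    """Hypothetical stand-in for a load balancer that drops traffic silently.

    Accepts TCP connections on a free local port and never replies.
    Returns the listening socket and the port it is bound to.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(5)
    held = []  # keep accepted sockets referenced so they stay open

    def accept_forever():
        while True:
            try:
                conn, _ = srv.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)

    threading.Thread(target=accept_forever, daemon=True).start()
    return srv, srv.getsockname()[1]


def demo():
    """Return True if the request against the silent server timed out."""
    srv, port = start_silent_server()
    try:
        get_with_timeout("127.0.0.1", port, "/api", timeout=0.5)
        return False
    except socket.timeout:
        return True
    finally:
        srv.close()
```

Calling `demo()` exercises the timeout path. With the `requests` library, the equivalent knob is the `timeout` argument, which accepts a `(connect, read)` tuple. Whether the merged fix sets it per call or in a shared wrapper is not stated in this report.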

Comment 2 rlobillo 2020-10-02 08:43:44 UTC
Verified on 4.6.0-0.nightly-2020-10-02-001427 over OSP16.1 (RHOS-16.1-RHEL-8-20200917.n.3) with OVN-octavia provider and OSP13 (2020-09-16.1) with amphora provider.

On OSP13, the installation worked fine:

(shiftstack) [stack@undercloud-0 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-10-02-001427   True        False         13m     Cluster version is 4.6.0-0.nightly-2020-10-02-001427
(shiftstack) [stack@undercloud-0 ~]$ oc get pods -n openshift-kuryr
NAME                               READY   STATUS    RESTARTS   AGE
kuryr-cni-2prfj                    1/1     Running   1          45m
kuryr-cni-4l5p9                    1/1     Running   0          45m
kuryr-cni-cgflf                    1/1     Running   0          30m
kuryr-cni-j28qm                    1/1     Running   0          33m
kuryr-cni-k68zp                    1/1     Running   0          33m
kuryr-cni-vtwg2                    1/1     Running   1          45m
kuryr-controller-9999f7ffd-ttsqm   1/1     Running   1          45m

Timing during the installation:

DEBUG Time elapsed per stage:                      
DEBUG     Infrastructure: 1m50s                    
DEBUG Bootstrap Complete: 14m31s                   
DEBUG                API: 2m52s                    
DEBUG  Bootstrap Destroy: 47s                      
DEBUG  Cluster Operators: 23m26s                   
INFO Time elapsed: 41m23s                         

On OSP16.1, the installation also worked fine:

(shiftstack) [stack@undercloud-0 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-10-02-001427   True        False         78s     Cluster version is 4.6.0-0.nightly-2020-10-02-001427
(shiftstack) [stack@undercloud-0 ~]$ oc get pods -n openshift-kuryr
NAME                              READY   STATUS    RESTARTS   AGE
kuryr-cni-89pm9                   1/1     Running   1          17m
kuryr-cni-jfltp                   1/1     Running   4          43m
kuryr-cni-k4j95                   1/1     Running   0          43m
kuryr-cni-l87vw                   1/1     Running   0          17m
kuryr-cni-zhvzc                   1/1     Running   0          17m
kuryr-cni-zpmfv                   1/1     Running   0          43m
kuryr-controller-775ff4bb-bgpml   1/1     Running   1          43m

Timing during the installation:

DEBUG Time elapsed per stage:                      
DEBUG     Infrastructure: 1m47s                    
DEBUG Bootstrap Complete: 27m30s                   
DEBUG                API: 3m41s                    
DEBUG  Bootstrap Destroy: 39s                      
DEBUG  Cluster Operators: 24m28s                   
INFO Time elapsed: 55m14s

Comment 5 errata-xmlrpc 2020-10-27 16:47:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
