Bug 1884139

Summary: kuryr-controller stuck if connection to K8s API dies silently
Product: OpenShift Container Platform Reporter: Luis Tomas Bolivar <ltomasbo>
Component: Networking Assignee: Michał Dulko <mdulko>
Networking sub component: kuryr QA Contact: GenadiC <gcheresh>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: rlobillo
Version: 4.6   
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
: 1884192 (view as bug list) Environment:
Last Closed: 2020-10-27 16:47:15 UTC Type: Bug
Bug Blocks: 1884192    

Description Luis Tomas Bolivar 2020-10-01 07:28:04 UTC
Kuryr components often contact the K8s API through a load balancer (e.g. Octavia LB in DevStack deployments, HAProxy in OpenShift), and we've often seen these load balancers drop connections silently, effectively leaving our requests hanging forever. This was fixed in `K8sClient.watch` by setting a read timeout there, which helped a lot, but we now seem to see it happening with other requests that don't have a read timeout set.
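
To illustrate the failure mode, here is a minimal standard-library sketch, not the actual Kuryr fix. It assumes a hypothetical local server (`start_silent_server`) that accepts connections and never answers, standing in for a load balancer that drops a connection silently, and a hypothetical helper (`get_with_read_timeout`) that sets a socket timeout so the request fails instead of hanging:

```python
import http.client
import socket
import threading


def start_silent_server():
    """Hypothetical stand-in for a load balancer that goes silent:
    accept TCP connections on a free local port and never reply."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    srv.listen(5)
    held = []  # keep accepted sockets open so the client sees no FIN/RST

    def accept_loop():
        while True:
            try:
                conn, _ = srv.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)

    threading.Thread(target=accept_loop, daemon=True).start()
    return srv, srv.getsockname()[1]


def get_with_read_timeout(host, port, path, timeout):
    """Issue a GET with a socket timeout. Return the HTTP status,
    or the string 'timed out' if no response arrives in time."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    except socket.timeout:
        # Without the timeout above, getresponse() would block forever.
        return "timed out"
    finally:
        conn.close()
```

Calling `get_with_read_timeout("127.0.0.1", port, "/api", 0.5)` against the silent server returns after roughly half a second rather than blocking indefinitely. The same idea applies to any HTTP client: every request path needs a read timeout, not only the watch.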

Comment 2 rlobillo 2020-10-02 08:43:44 UTC
Verified on 4.6.0-0.nightly-2020-10-02-001427 over OSP16.1 (RHOS-16.1-RHEL-8-20200917.n.3) with OVN-octavia provider and OSP13 (2020-09-16.1) with amphora provider.

On OSP13, the installation worked fine:

(shiftstack) [stack@undercloud-0 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-10-02-001427   True        False         13m     Cluster version is 4.6.0-0.nightly-2020-10-02-001427
(shiftstack) [stack@undercloud-0 ~]$ oc get pods -n openshift-kuryr
NAME                               READY   STATUS    RESTARTS   AGE
kuryr-cni-2prfj                    1/1     Running   1          45m
kuryr-cni-4l5p9                    1/1     Running   0          45m
kuryr-cni-cgflf                    1/1     Running   0          30m
kuryr-cni-j28qm                    1/1     Running   0          33m
kuryr-cni-k68zp                    1/1     Running   0          33m
kuryr-cni-vtwg2                    1/1     Running   1          45m
kuryr-controller-9999f7ffd-ttsqm   1/1     Running   1          45m

Timing during the installation:

DEBUG Time elapsed per stage:                      
DEBUG     Infrastructure: 1m50s                    
DEBUG Bootstrap Complete: 14m31s                   
DEBUG                API: 2m52s                    
DEBUG  Bootstrap Destroy: 47s                      
DEBUG  Cluster Operators: 23m26s                   
INFO Time elapsed: 41m23s                         

On OSP16.1, the installation also worked fine:

(shiftstack) [stack@undercloud-0 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-10-02-001427   True        False         78s     Cluster version is 4.6.0-0.nightly-2020-10-02-001427
(shiftstack) [stack@undercloud-0 ~]$ oc get pods -n openshift-kuryr
NAME                              READY   STATUS    RESTARTS   AGE
kuryr-cni-89pm9                   1/1     Running   1          17m
kuryr-cni-jfltp                   1/1     Running   4          43m
kuryr-cni-k4j95                   1/1     Running   0          43m
kuryr-cni-l87vw                   1/1     Running   0          17m
kuryr-cni-zhvzc                   1/1     Running   0          17m
kuryr-cni-zpmfv                   1/1     Running   0          43m
kuryr-controller-775ff4bb-bgpml   1/1     Running   1          43m

Timing during the installation:

DEBUG Time elapsed per stage:                      
DEBUG     Infrastructure: 1m47s                    
DEBUG Bootstrap Complete: 27m30s                   
DEBUG                API: 3m41s                    
DEBUG  Bootstrap Destroy: 39s                      
DEBUG  Cluster Operators: 24m28s                   
INFO Time elapsed: 55m14s

Comment 5 errata-xmlrpc 2020-10-27 16:47:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196