
Bug 1751756

Summary: [IPI][OSP][Kuryr] The installer times out when running in OSP 13 with Kuryr sdn
Product: OpenShift Container Platform
Component: Networking
Sub component: kuryr
Reporter: Jon Uriarte <juriarte>
Assignee: Maysa Macedo <mdemaced>
QA Contact: Jon Uriarte <juriarte>
Status: CLOSED CURRENTRELEASE
Severity: low
Priority: unspecified
CC: adahiya, asimonel, bbennett, benl, eduen, itbrown, ltomasbo, mdemaced, mdulko, oblaut, wking, wzheng, xtian
Version: 4.2.z
Target Milestone: ---
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Cloned As: 1758669
Last Closed: 2020-01-09 08:16:13 UTC
Type: Bug
Bug Depends On: 1783258
Bug Blocks: 1758669

Description Jon Uriarte 2019-09-12 14:12:43 UTC
Description of problem:

openshift-install 4.2 is timing out when running on OSP 13 with Kuryr. Some operators need more time to deploy, because the connection to the API is lost from time to time.

Some operators show this message: 
>>> msg="Get https://172.30.0.1:443/api?timeout=32s: net/http: TLS handshake timeoutunable to set up overall controller manager"

and get stuck there for a while.

This results in many pod restarts while the installer tries to make progress, until it finally times out.

We are still investigating and do not know the root cause yet, but I'm raising this bug so we can collect all the information here. I will be adding more logs and details.

Version-Release number of the following components:

Release build: 4.2.0-0.nightly-2019-09-12-034447 (from https://openshift-release.svc.ci.openshift.org/)

OSP 13 2019-08-19.2 puddle

How reproducible: always

Steps to Reproduce:
1. Install OSP 13 with Octavia
2. Run OCP 4.2 installer with Kuryr

install-config.yaml:

apiVersion: v1
baseDomain: shiftstack.com
clusterID:  fdf4ba63-b341-5025-9796-580418e9c594
compute:
- name: worker
  platform: {}
  replicas: 3
controlPlane:
  name: master
  platform: {}
  replicas: 3
metadata:
  name: ostest
networking:
  clusterNetworks:
  - cidr:             10.128.0.0/14
    hostSubnetLength: 9
  serviceCIDR: 172.30.0.0/16
  machineCIDR: 10.196.0.0/16
  type: Kuryr
platform:
  openstack:
    cloud:            shiftstack
    externalNetwork:  nova
    region:           regionOne
    computeFlavor:    m4.xlarge
    lbFloatingIP:     10.0.0.200
pullSecret: xxx
sshKey: xxx

Actual results:
The installer times out

Expected results:
Successful install.

Comment 2 Maysa Macedo 2019-09-18 15:33:06 UTC
Based on previous conversations with the devs, this bug should not be considered a blocker, as it's a matter of timing. The cluster initialization phase of the installer allows 30 min, and Kuryr may sometimes require more. All the operators will eventually be ready and available.
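Since the operators do eventually settle, one workaround is simply to keep waiting past the installer's timeout (`openshift-install wait-for install-complete` is the supported way to resume waiting after `create cluster` gives up). A minimal sketch of such a retry loop; the helper name and intervals are hypothetical, and the `oc wait` invocation in the comment is an example that needs a live cluster:

```shell
# Hypothetical helper: retry a command until it succeeds or a deadline
# (in seconds) passes. Returns non-zero if the deadline is hit first.
retry_until() {
  deadline=$(( $(date +%s) + $1 )); shift
  until "$@"; do
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep 2   # short poll interval for the sketch; a real check would wait longer
  done
}

# Example (assumed invocation, requires a cluster -- not run here):
# retry_until 3600 oc wait clusteroperators --all --for=condition=Available --timeout=30s
```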

Comment 5 Maysa Macedo 2019-09-20 17:52:12 UTC
The environments used for development and QE have a single compute node to run all the VMs for masters, workers and load balancers,
which led us to believe that installation with Kuryr could be failing around 50% of the time due to heavy resource usage.
After adding two more compute nodes, and with the recent addition of a health monitor to the API LB, we've observed that the time
required for installation with Kuryr has decreased significantly compared to our previous tries. We still need more runs
to validate whether those were the only causes of the timeouts.

Time required for the installation today and the image used:

38m10.150s - 4.2.0-0.nightly-2019-09-20-040328
36m55.677s - latest release img 20/09 + Health Monitor
33m3.899s - latest release img 20/09 + Health Monitor

Comment 6 Luis Tomas Bolivar 2019-09-27 12:12:44 UTC
After applying the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1736854 I'm having consistent success with just 1 compute node, taking around 43-48 minutes total time.

Comment 8 W. Trevor King 2019-10-04 17:56:38 UTC
Pull landed in 4.2 [1]:

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.2.0-rc.1 | grep network-operator
  cluster-network-operator                      https://github.com/openshift/cluster-network-operator                      bf92b4ec6a0e1595e9d6fdbab32ed9266f46b3d8
$ git log --oneline bf92b4ec6a0e1595e9d6fdbab32ed9266f46b3d8 | grep bf92b4e
bf92b4ec Merge pull request #233 from dulek/octavia-healthmonitor

But with the blocking bug 1736854 MODIFIED (and lacking a Target Release), maybe it makes sense for this to still be POST?  4.2.0 GA is very close; can we punt this to 4.2.z or get bug 1736854 resolved?  I dunno how to track whether bug 1736854 is MODIFIED or has since been sucked into 4.2 nightlies (or wherever it has to go to become ON_QA).
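One way to answer the "has the fix been sucked into the nightlies" question is to take the per-component commit that `oc adm release info --commits` reports for a payload and test commit ancestry in the operator repo. A sketch under those assumptions; the helper name is hypothetical, and the repo/commit values in the comments are examples (bf92b4ec is from the git log output above):

```shell
# Sketch: decide whether a fix commit is contained in the commit a release
# payload pins for a component. Exit status follows git merge-base:
# 0 if fix_commit is an ancestor of payload_commit, non-zero otherwise.
is_fix_in_release() {
  repo_dir=$1; fix_commit=$2; payload_commit=$3
  git -C "$repo_dir" merge-base --is-ancestor "$fix_commit" "$payload_commit"
}

# Against a real payload this would look like (not run here):
#   oc adm release info --commits "$RELEASE" | grep cluster-network-operator
#   git clone https://github.com/openshift/cluster-network-operator cno
#   is_fix_in_release cno bf92b4ec <payload-commit> && echo "fix present"
```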

Comment 9 Ben Bennett 2019-10-04 19:30:05 UTC
The referenced pull request was for master. So I am making this the master BZ and will clone it for 4.2.z.

Comment 10 W. Trevor King 2019-10-04 19:32:30 UTC
> The referenced pull request was for master.

But it landed in master before the 4.2/4.3 fork, so we only need to verify for 4.2.0.

Comment 11 Ben Bennett 2019-10-04 19:35:45 UTC
You are completely right. I just realized that when I checked the code. Moving to ON_QA.

Comment 12 Ben Bennett 2019-10-04 19:37:31 UTC
(Set to MODIFIED so ART picks it up for errata generation)

Comment 14 Itzik Brown 2019-10-06 08:25:57 UTC
4.2.0-0.nightly-2019-10-02-122541

Comment 15 Jon Uriarte 2019-10-09 07:01:53 UTC
Moving back to ON_QA as it depends on the BZ https://bugzilla.redhat.com/show_bug.cgi?id=1736854,
which is still in MODIFIED status. Once that BZ is ON_QA we will be able to verify this one as well.

Comment 20 Itzik Brown 2019-12-11 15:30:47 UTC
The bug can't be verified because it depends on https://bugzilla.redhat.com/show_bug.cgi?id=1736854, which is not yet in ON_QA status.