Bug 1818029 - CNO keeps waiting for Kuryr's Octavia LB even if it's in ERROR state.
Summary: CNO keeps waiting for Kuryr's Octavia LB even if it's in ERROR state.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Severity: low
Priority: low
Target Milestone: ---
Target Release: 4.5.0
Assignee: rdobosz
QA Contact: Jon Uriarte
URL:
Whiteboard:
Depends On:
Blocks: 1821128
Reported: 2020-03-27 13:38 UTC by Michał Dulko
Modified: 2020-08-04 18:07 UTC (History)
4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1821128 (view as bug list)
Environment:
Last Closed: 2020-08-04 18:07:23 UTC
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 564 0 None closed Bug 1818029: Stop waiting for failed loadbalancer. 2021-02-21 15:01:09 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-08-04 18:07:27 UTC

Description Michał Dulko 2020-03-27 13:38:59 UTC
Description of problem:
During the bootstrap phase of cluster-network-operator, when Kuryr is used, an Octavia load balancer is created. We wait for it to become ACTIVE before continuing with further operations. The problem is that when the LB goes into the ERROR state, we keep waiting even though that no longer makes sense. We should fail early with an error message in that case, so that the next iteration removes the broken LB sooner.
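The fix amounts to breaking out of the wait loop as soon as the provisioning status reports ERROR, instead of polling until the timeout. A minimal Go sketch of that polling logic follows; the waitForLB function and the getStatus callback are hypothetical illustrations, not CNO's actual code:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForLB polls getStatus (a stand-in for an Octavia "show load balancer"
// call) until the LB reaches ACTIVE. Crucially, it returns an error as soon
// as the LB reports ERROR, rather than polling until maxTries is exhausted,
// so the next reconcile loop can delete and re-create the LB sooner.
func waitForLB(getStatus func() (string, error), interval time.Duration, maxTries int) error {
	for i := 0; i < maxTries; i++ {
		status, err := getStatus()
		if err != nil {
			return err
		}
		switch status {
		case "ACTIVE":
			return nil
		case "ERROR":
			// Fail fast: an LB in ERROR will never become ACTIVE on its own.
			return errors.New("LoadBalancer gone in error state")
		}
		time.Sleep(interval)
	}
	return errors.New("timed out waiting for LoadBalancer")
}

func main() {
	// Simulate an LB that becomes ACTIVE after one poll.
	statuses := []string{"PENDING_CREATE", "ACTIVE"}
	i := 0
	err := waitForLB(func() (string, error) {
		s := statuses[i]
		i++
		return s, nil
	}, 0, 10)
	fmt.Println("result:", err)
}
```

With this shape, an LB stuck in ERROR surfaces as a reconcile failure immediately, which matches the "LoadBalancer gone in error state" message seen in the verification logs below.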

Version-Release number of selected component (if applicable):


How reproducible:
Always when LB goes into ERROR state.

Steps to Reproduce:
1. Force LBs to go into ERROR state (e.g. rename amphora image).
2. Run installation.
3. Look into cluster-network-operator logs as soon as it's up.

Actual results:
You can see that CNO waits for that LB even if it immediately goes into ERROR state.

Expected results:
CNO should return an error immediately after the LB becomes ERROR.

Additional info:

Comment 2 rdobosz 2020-04-03 13:00:48 UTC
To trigger an error, besides renaming the image, I had to deactivate it, since it was picked up anyway (it has the "amphora-image" tag), and then disable the load balancer and wait for CNO to reconcile.

Comment 5 Jon Uriarte 2020-04-07 17:05:13 UTC
Verified in 4.5.0-0.nightly-2020-04-07-073926 on top of OSP 13 2020-04-01.3.

In order to verify this BZ, the LBs' status needs to be forced to ERROR.
That can be achieved by deactivating the amphora image (as admin):

$ openstack image set --deactivate <amphora image id>

Run the installer:

The API LB will be in ERROR status:

+--------------------------------------+-------------------------------------+----------------------------------+-------------+---------------------+----------+
| id                                   | name                                | project_id                       | vip_address | provisioning_status | provider |
+--------------------------------------+-------------------------------------+----------------------------------+-------------+---------------------+----------+
| 0b0db0d4-176c-4e75-8962-506494874512 | ostest-wxqqr-kuryr-api-loadbalancer | 01975583501440c1ab1f6e426c1a913e | 172.30.0.1  | ERROR               | octavia  |
+--------------------------------------+-------------------------------------+----------------------------------+-------------+---------------------+----------+

2020/04/07 15:19:34 Creating OpenShift API loadbalancer with IP 172.30.0.1
2020/04/07 15:19:34 Detected Octavia API v2.0.0
2020/04/07 15:19:43 Failed to reconcile platform networking resources: failed to create OpenShift API loadbalancer: Error waiting for LB c730dc75-2a8e-49a1-8cdb-dd78515411fa: LoadBalancer gone in error state
2020/04/07 15:19:43 Updated ClusterOperator with conditions:
- lastTransitionTime: "2020-04-07T15:19:43Z"
  message: 'Internal error while reconciling platform networking resources: failed
    to create OpenShift API loadbalancer: Error waiting for LB c730dc75-2a8e-49a1-8cdb-dd78515411fa:
    LoadBalancer gone in error state'
  reason: BootstrapError
  status: "True"
  type: Degraded
- lastTransitionTime: "2020-04-07T15:19:15Z"
  status: "True"
  type: Upgradeable
2020/04/07 15:19:44 Reconciling Network.operator.openshift.io cluster

CNO will try re-creating the LB, and will succeed once the amphora image is activated again:

$ openstack image set --activate <amphora image id>

2020/04/07 16:17:44 Creating OpenShift API loadbalancer with IP 172.30.0.1
2020/04/07 16:17:44 Detected Octavia API v2.0.0
2020/04/07 16:17:44 Deleting Openstack LoadBalancer: 0b0db0d4-176c-4e75-8962-506494874512
2020/04/07 16:18:57 OpenShift API loadbalancer f400f6bf-e8c6-41d5-90e7-be93408d163d present
2020/04/07 16:18:57 Creating OpenShift API loadbalancer pool
2020/04/07 16:18:58 OpenShift API loadbalancer pool e2036475-83b9-4ccb-aa0e-79e0065e62ee present
2020/04/07 16:18:58 Creating OpenShift API loadbalancer health monitor
2020/04/07 16:18:58 Detected Octavia API v2.0.0
2020/04/07 16:18:58 OpenShift API loadbalancer health monitor 1f2e5729-f61f-487a-adad-f67ad0ab5d3b present
2020/04/07 16:18:58 Creating OpenShift API loadbalancer listener
2020/04/07 16:18:58 Detected Octavia API v2.0.0
2020/04/07 16:19:01 OpenShift API loadbalancer listener 6ba2e500-0bc8-4109-be35-392d1317f876 present
2020/04/07 16:19:01 Creating OpenShift API loadbalancer pool members
2020/04/07 16:19:01 Found port 53c4a08a-c554-4191-8ba8-58e4082f130a with IP 10.196.0.16
2020/04/07 16:19:08 Added member 51563fe8-9516-44bd-988a-f322d8438755 to LB pool e2036475-83b9-4ccb-aa0e-79e0065e62ee
2020/04/07 16:19:08 Found port 875f6fc4-2085-4be3-81ae-01b3f1701ff2 with IP 10.196.0.30
2020/04/07 16:19:15 Added member 568ac5d1-0f65-4d74-93f5-26fb024434a2 to LB pool e2036475-83b9-4ccb-aa0e-79e0065e62ee
2020/04/07 16:19:15 Found port a43f89ef-ce47-450c-83fc-8e8466510d50 with IP 10.196.0.32
2020/04/07 16:19:19 Added member 7c3f5712-19eb-402b-b96f-105b63e1a7ec to LB pool e2036475-83b9-4ccb-aa0e-79e0065e62ee
2020/04/07 16:19:19 Found port f5b4c942-0280-464f-a3d7-3733b1731c6b with IP 10.196.0.31
2020/04/07 16:19:25 Added member 67e34737-6942-432b-ac9f-97573291ea31 to LB pool e2036475-83b9-4ccb-aa0e-79e0065e62ee

The LB in ERROR is removed and a new one is created:

+--------------------------------------+-------------------------------------+----------------------------------+-------------+---------------------+----------+
| id                                   | name                                | project_id                       | vip_address | provisioning_status | provider |
+--------------------------------------+-------------------------------------+----------------------------------+-------------+---------------------+----------+
| f400f6bf-e8c6-41d5-90e7-be93408d163d | ostest-wxqqr-kuryr-api-loadbalancer | 01975583501440c1ab1f6e426c1a913e | 172.30.0.1  | ACTIVE              | octavia  |
+--------------------------------------+-------------------------------------+----------------------------------+-------------+---------------------+----------+

Comment 7 errata-xmlrpc 2020-08-04 18:07:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

