Bug 1821128

Summary: CNO keeps waiting for Kuryr's Octavia LB even if it's in ERROR state.
Product: OpenShift Container Platform
Component: Networking
Sub component: kuryr
Version: 4.4
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Reporter: rdobosz
Assignee: rdobosz
QA Contact: Jon Uriarte <juriarte>
CC: bbennett, gcheresh, juriarte, ltomasbo, mdulko
Status: CLOSED ERRATA
Severity: low
Priority: low
Doc Type: No Doc Update
Clone Of: 1818029
Bug Depends On: 1818029
Last Closed: 2020-05-04 11:48:25 UTC

Description rdobosz 2020-04-06 06:27:30 UTC
+++ This bug was initially created as a clone of Bug #1818029 +++

Description of problem:
During the bootstrap phase of cluster-network-operator, when Kuryr is used, an Octavia load balancer is created. We wait for it to become ACTIVE before continuing with further operations. The problem is that when the LB goes into the ERROR state we keep waiting, even though that no longer makes any sense. In such a case we should stop early with an error message, so that the next iteration removes the broken LB sooner.
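
For illustration, here is a minimal sketch of the fail-fast wait loop described above, written against gophercloud's Octavia v2 load balancer API. The function name, polling interval and timeout are assumptions made for the example; this is not the actual cluster-network-operator code.

package openstack // illustrative sketch, not the real CNO bootstrap package

import (
    "fmt"
    "time"

    "github.com/gophercloud/gophercloud"
    "github.com/gophercloud/gophercloud/openstack/loadbalancer/v2/loadbalancers"
)

// waitForLoadBalancerActive polls the LB's provisioning_status and returns as
// soon as it becomes ACTIVE, but fails fast when Octavia reports ERROR instead
// of waiting until the overall timeout expires.
func waitForLoadBalancerActive(client *gophercloud.ServiceClient, lbID string, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for {
        lb, err := loadbalancers.Get(client, lbID).Extract()
        if err != nil {
            return fmt.Errorf("error getting LB %s: %v", lbID, err)
        }
        switch lb.ProvisioningStatus {
        case "ACTIVE":
            return nil
        case "ERROR":
            // An LB in ERROR will never become ACTIVE, so report it right
            // away and let the next reconciliation delete and recreate it.
            return fmt.Errorf("error waiting for LB %s: LoadBalancer gone in error state", lbID)
        }
        if time.Now().After(deadline) {
            return fmt.Errorf("timed out waiting for LB %s to become ACTIVE", lbID)
        }
        time.Sleep(5 * time.Second)
    }
}

Returning the error instead of silently retrying is what surfaces the BootstrapError/Degraded condition shown in the verification logs below.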

Version-Release number of selected component (if applicable):


How reproducible:
Always, whenever the LB goes into the ERROR state.

Steps to Reproduce:
1. Force the LB to go into the ERROR state (e.g. by renaming the amphora image).
2. Run the installation.
3. Check the cluster-network-operator logs as soon as it is up.

Actual results:
CNO keeps waiting for the LB even when it immediately goes into the ERROR state.

Expected results:
CNO should return an error immediately after the LB goes into the ERROR state.

Additional info:

--- Additional comment from  on 2020-04-03 13:00:48 UTC ---

To trigger the error, besides renaming the image I also had to deactivate it, since it was being picked up anyway (it has the "amphora-image" tag), and then disable the load balancer and wait for CNO to reconcile.

Comment 3 Jon Uriarte 2020-04-20 14:22:30 UTC
Verified in 4.4.0-0.nightly-2020-04-20-051802 on top of OSP 13 2020-04-01.3.

In order to verify this BZ, the LB's status needs to be forced to ERROR.
That can be achieved by deactivating the amphora image (as admin).

$ openstack image set --deactivate <amphora image id>

Run the installer:

The API LB will be in ERROR status:

+--------------------------------------+-------------------------------------+----------------------------------+-------------+---------------------+----------+
| id                                   | name                                | project_id                       | vip_address | provisioning_status | provider |
+--------------------------------------+-------------------------------------+----------------------------------+-------------+---------------------+----------+
| 269010c4-a370-4bfd-bf03-b3625a607cd7 | ostest-76r72-kuryr-api-loadbalancer | 3045f335ea794b479bfca81287307151 | 172.30.0.1  | ERROR               | octavia  |
+--------------------------------------+-------------------------------------+----------------------------------+-------------+---------------------+----------+


2020/04/20 14:00:11 Creating OpenShift API loadbalancer with IP 172.30.0.1
2020/04/20 14:00:12 Detected Octavia API v2.0.0
2020/04/20 14:00:17 Failed to reconcile platform networking resources: failed to create OpenShift API loadbalancer: Error waiting for LB 269010c4-a370-4bfd-bf03-b3625a607cd7: LoadBalancer gone in error state
2020/04/20 14:00:17 Updated ClusterOperator with conditions:
- lastTransitionTime: "2020-04-20T14:00:17Z"
  message: 'Internal error while reconciling platform networking resources: failed
    to create OpenShift API loadbalancer: Error waiting for LB 269010c4-a370-4bfd-bf03-b3625a607cd7:
    LoadBalancer gone in error state'
  reason: BootstrapError
  status: "True"
  type: Degraded
- lastTransitionTime: "2020-04-20T13:59:48Z"
  status: "True"
  type: Upgradeable
2020/04/20 14:00:18 Reconciling Network.operator.openshift.io cluster

CNO will remove the LB and try creating a new one, which will succeed once the amphora image is activated again:

$ openstack image set --activate <amphora image id>

...
2020/04/20 14:00:31 Deleting Openstack LoadBalancer: 269010c4-a370-4bfd-bf03-b3625a607cd7
...
2020/04/20 14:05:54 Creating OpenShift API loadbalancer with IP 172.30.0.1
2020/04/20 14:05:54 Detected Octavia API v2.0.0
2020/04/20 14:05:54 Deleting Openstack LoadBalancer: cb02034d-1d4f-40d7-af89-a2f81acc8705
2020/04/20 14:06:57 OpenShift API loadbalancer 2e506572-da05-45cb-8a41-ea314d607d6d present
2020/04/20 14:06:57 Creating OpenShift API loadbalancer pool
2020/04/20 14:06:59 OpenShift API loadbalancer pool f45962b1-43b6-48a6-bd38-16c9873c0c8b present
2020/04/20 14:06:59 Creating OpenShift API loadbalancer health monitor
2020/04/20 14:06:59 Detected Octavia API v2.0.0
2020/04/20 14:06:59 OpenShift API loadbalancer health monitor 9b1678e0-7775-4628-916f-4b64268cf769 present
2020/04/20 14:06:59 Creating OpenShift API loadbalancer listener
2020/04/20 14:06:59 Detected Octavia API v2.0.0
2020/04/20 14:07:02 OpenShift API loadbalancer listener e0811c5e-2e27-4ea1-91db-45492a87bfa8 present
2020/04/20 14:07:02 Creating OpenShift API loadbalancer pool members
2020/04/20 14:07:02 Found port 331a4203-5d33-4ab9-9183-bc6db5e6d53d with IP 10.196.0.18
2020/04/20 14:07:06 Added member e43a338e-8d75-49b7-aefc-f18c4f77cc97 to LB pool f45962b1-43b6-48a6-bd38-16c9873c0c8b
2020/04/20 14:07:06 Found port 530b4ea9-7f45-4276-81e8-37d90d3e9888 with IP 10.196.0.24
2020/04/20 14:07:10 Added member 05ec0801-c229-444a-b09b-bc58ac01a7d8 to LB pool f45962b1-43b6-48a6-bd38-16c9873c0c8b
2020/04/20 14:07:10 Found port 73ecbd0b-197d-4ed5-a947-a68d44aff6c7 with IP 10.196.0.13
2020/04/20 14:07:14 Added member 7a4df4b9-393c-4472-a1fe-d9e9c0e4ce2a to LB pool f45962b1-43b6-48a6-bd38-16c9873c0c8b
2020/04/20 14:07:14 Found port ebc10280-918b-4da3-93fb-7eb54c713079 with IP 10.196.0.37
2020/04/20 14:07:20 Added member 57600eab-981d-4bc3-9b1d-56b27ccc15fc to LB pool f45962b1-43b6-48a6-bd38-16c9873c0c8b

The LB in ERROR is removed and a new one is created:

+--------------------------------------+-------------------------------------+----------------------------------+-------------+---------------------+----------+
| id                                   | name                                | project_id                       | vip_address | provisioning_status | provider |
+--------------------------------------+-------------------------------------+----------------------------------+-------------+---------------------+----------+
| 2e506572-da05-45cb-8a41-ea314d607d6d | ostest-76r72-kuryr-api-loadbalancer | 3045f335ea794b479bfca81287307151 | 172.30.0.1  | ACTIVE              | octavia  |
+--------------------------------------+-------------------------------------+----------------------------------+-------------+---------------------+----------+
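
As a complement to the wait-loop sketch in the description, the delete-and-recreate behavior verified above could be approximated roughly as follows. The helper name and the cascade-delete call are assumptions made for illustration, not the actual cluster-network-operator code.

package openstack // continuation of the illustrative sketch from the description

import (
    "fmt"

    "github.com/gophercloud/gophercloud"
    "github.com/gophercloud/gophercloud/openstack/loadbalancer/v2/loadbalancers"
)

// ensureUsableLoadBalancer is a hypothetical helper: if the existing LB is
// stuck in ERROR, cascade-delete it (the LB together with its listeners,
// pools and members) and return an error so the next reconciliation can
// create a fresh one.
func ensureUsableLoadBalancer(client *gophercloud.ServiceClient, lbID string) error {
    lb, err := loadbalancers.Get(client, lbID).Extract()
    if err != nil {
        return fmt.Errorf("error getting LB %s: %v", lbID, err)
    }
    if lb.ProvisioningStatus != "ERROR" {
        return nil
    }
    // Mirrors the "Deleting Openstack LoadBalancer" step seen in the logs above.
    if err := loadbalancers.Delete(client, lbID, loadbalancers.DeleteOpts{Cascade: true}).ExtractErr(); err != nil {
        return fmt.Errorf("error deleting LB %s: %v", lbID, err)
    }
    return fmt.Errorf("LB %s was in ERROR state and has been deleted, it will be recreated on the next reconciliation", lbID)
}

Failing the reconcile after the delete is what lets the next pass recreate the load balancer once the amphora image is active again.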

Comment 5 errata-xmlrpc 2020-05-04 11:48:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581