Description of problem:
During the bootstrap phase of cluster-network-operator, in the Kuryr case an Octavia load balancer is created. We wait for it to become ACTIVE before continuing with further operations. The problem is that when the LB goes into ERROR, we keep waiting even though that no longer makes sense. We should stop early with an error message in that case, so that the next iteration removes such an LB sooner.

Version-Release number of selected component (if applicable):

How reproducible:
Always, when the LB goes into the ERROR state.

Steps to Reproduce:
1. Force LBs to go into the ERROR state (e.g. rename the amphora image).
2. Run the installation.
3. Look into the cluster-network-operator logs as soon as it is up.

Actual results:
CNO keeps waiting for the LB even if it immediately goes into the ERROR state.

Expected results:
CNO should return an error immediately after the LB becomes ERROR.

Additional info:
To trigger an error, besides renaming the image, I had to deactivate it, since it was being picked up anyway (it has the tag "amphora-image"), and then disable the load balancer and wait for CNO to reconcile.
Verified in 4.5.0-0.nightly-2020-04-07-073926 on top of OSP 13 2020-04-01.3.

In order to verify this BZ, the LB status needs to be forced to ERROR. That can be achieved by deactivating the amphora image (as admin):

$ openstack image set --deactivate <amphora image id>

Run the installer. The API LB will be in ERROR status:

+--------------------------------------+-------------------------------------+----------------------------------+-------------+---------------------+----------+
| id                                   | name                                | project_id                       | vip_address | provisioning_status | provider |
+--------------------------------------+-------------------------------------+----------------------------------+-------------+---------------------+----------+
| 0b0db0d4-176c-4e75-8962-506494874512 | ostest-wxqqr-kuryr-api-loadbalancer | 01975583501440c1ab1f6e426c1a913e | 172.30.0.1  | ERROR               | octavia  |
+--------------------------------------+-------------------------------------+----------------------------------+-------------+---------------------+----------+

2020/04/07 15:19:34 Creating OpenShift API loadbalancer with IP 172.30.0.1
2020/04/07 15:19:34 Detected Octavia API v2.0.0
2020/04/07 15:19:43 Failed to reconcile platform networking resources: failed to create OpenShift API loadbalancer: Error waiting for LB c730dc75-2a8e-49a1-8cdb-dd78515411fa: LoadBalancer gone in error state
2020/04/07 15:19:43 Updated ClusterOperator with conditions:
- lastTransitionTime: "2020-04-07T15:19:43Z"
  message: 'Internal error while reconciling platform networking resources: failed to create OpenShift API loadbalancer: Error waiting for LB c730dc75-2a8e-49a1-8cdb-dd78515411fa: LoadBalancer gone in error state'
  reason: BootstrapError
  status: "True"
  type: Degraded
- lastTransitionTime: "2020-04-07T15:19:15Z"
  status: "True"
  type: Upgradeable
2020/04/07 15:19:44 Reconciling Network.operator.openshift.io cluster

CNO will try re-creating the LB, and will succeed once the amphora image is activated again:

$ openstack image set --activate <amphora image id>

2020/04/07 16:17:44 Creating OpenShift API loadbalancer with IP 172.30.0.1
2020/04/07 16:17:44 Detected Octavia API v2.0.0
2020/04/07 16:17:44 Deleting Openstack LoadBalancer: 0b0db0d4-176c-4e75-8962-506494874512
2020/04/07 16:18:57 OpenShift API loadbalancer f400f6bf-e8c6-41d5-90e7-be93408d163d present
2020/04/07 16:18:57 Creating OpenShift API loadbalancer pool
2020/04/07 16:18:58 OpenShift API loadbalancer pool e2036475-83b9-4ccb-aa0e-79e0065e62ee present
2020/04/07 16:18:58 Creating OpenShift API loadbalancer health monitor
2020/04/07 16:18:58 Detected Octavia API v2.0.0
2020/04/07 16:18:58 OpenShift API loadbalancer health monitor 1f2e5729-f61f-487a-adad-f67ad0ab5d3b present
2020/04/07 16:18:58 Creating OpenShift API loadbalancer listener
2020/04/07 16:18:58 Detected Octavia API v2.0.0
2020/04/07 16:19:01 OpenShift API loadbalancer listener 6ba2e500-0bc8-4109-be35-392d1317f876 present
2020/04/07 16:19:01 Creating OpenShift API loadbalancer pool members
2020/04/07 16:19:01 Found port 53c4a08a-c554-4191-8ba8-58e4082f130a with IP 10.196.0.16
2020/04/07 16:19:08 Added member 51563fe8-9516-44bd-988a-f322d8438755 to LB pool e2036475-83b9-4ccb-aa0e-79e0065e62ee
2020/04/07 16:19:08 Found port 875f6fc4-2085-4be3-81ae-01b3f1701ff2 with IP 10.196.0.30
2020/04/07 16:19:15 Added member 568ac5d1-0f65-4d74-93f5-26fb024434a2 to LB pool e2036475-83b9-4ccb-aa0e-79e0065e62ee
2020/04/07 16:19:15 Found port a43f89ef-ce47-450c-83fc-8e8466510d50 with IP 10.196.0.32
2020/04/07 16:19:19 Added member 7c3f5712-19eb-402b-b96f-105b63e1a7ec to LB pool e2036475-83b9-4ccb-aa0e-79e0065e62ee
2020/04/07 16:19:19 Found port f5b4c942-0280-464f-a3d7-3733b1731c6b with IP 10.196.0.31
2020/04/07 16:19:25 Added member 67e34737-6942-432b-ac9f-97573291ea31 to LB pool e2036475-83b9-4ccb-aa0e-79e0065e62ee

The LB in ERROR is removed and a new one is created:
+--------------------------------------+-------------------------------------+----------------------------------+-------------+---------------------+----------+
| id                                   | name                                | project_id                       | vip_address | provisioning_status | provider |
+--------------------------------------+-------------------------------------+----------------------------------+-------------+---------------------+----------+
| f400f6bf-e8c6-41d5-90e7-be93408d163d | ostest-wxqqr-kuryr-api-loadbalancer | 01975583501440c1ab1f6e426c1a913e | 172.30.0.1  | ACTIVE              | octavia  |
+--------------------------------------+-------------------------------------+----------------------------------+-------------+---------------------+----------+
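The delete-and-recreate behavior visible in the logs above can be sketched as follows. This is an illustrative pattern, not the operator's real code: the `get`/`del`/`create` callbacks are hypothetical stand-ins for the Octavia API calls, and the function and type names are invented for the example.

```go
package main

import "fmt"

// LoadBalancer is a minimal stand-in for an Octavia LB record.
type LoadBalancer struct {
	ID                 string
	ProvisioningStatus string
}

// ensureLoadBalancer mirrors the recovery flow in the logs: if an
// existing LB is stuck in ERROR, delete it and create a fresh one;
// otherwise reuse the LB that is already present. The early ERROR
// return from the wait loop is what lets the operator reach this
// cleanup path on the next reconcile instead of blocking on a timeout.
func ensureLoadBalancer(
	get func() (*LoadBalancer, error),
	del func(id string) error,
	create func() (*LoadBalancer, error),
) (*LoadBalancer, error) {
	lb, err := get()
	if err != nil {
		return nil, err
	}
	if lb != nil && lb.ProvisioningStatus == "ERROR" {
		if err := del(lb.ID); err != nil {
			return nil, err
		}
		lb = nil
	}
	if lb == nil {
		return create()
	}
	return lb, nil
}

func main() {
	broken := &LoadBalancer{ID: "0b0db0d4", ProvisioningStatus: "ERROR"}
	lb, _ := ensureLoadBalancer(
		func() (*LoadBalancer, error) { return broken, nil },
		func(id string) error { fmt.Println("Deleting Openstack LoadBalancer:", id); return nil },
		func() (*LoadBalancer, error) {
			return &LoadBalancer{ID: "f400f6bf", ProvisioningStatus: "ACTIVE"}, nil
		},
	)
	fmt.Println(lb.ID, lb.ProvisioningStatus)
}
```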
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409