Bug 2000070

Summary: RHOS17 Octavia jobs failing on tempest tests with "show_loadbalancer provisioning_status failed to update to ACTIVE within the required time"
Product: Red Hat OpenStack
Reporter: Sandeep Yadav <sandyada>
Component: openstack-octavia
Assignee: Nate Johnston <njohnston>
Status: CLOSED NOTABUG
QA Contact: Bruna Bonguardo <bbonguar>
Severity: high
Docs Contact:
Priority: unspecified
Version: 17.0 (Wallaby)
CC: gthiemon, ihrachys, lpeer, majopela, scohen
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Last Closed: 2021-09-07 15:24:53 UTC
Type: Bug

Description Sandeep Yadav 2021-09-01 10:18:25 UTC
Description of problem:

RHOS17 Octavia jobs failing on tempest tests with "show_loadbalancer provisioning_status failed to update to ACTIVE within the required time"


The following error is observed in the Octavia worker logs:
~~~
2021-08-30 21:06:42.068 30 ERROR octavia.amphorae.drivers.haproxy.exceptions [-] Amphora agent returned unexpected result code 409 with response {'message': 'Interface already exists'}
~~~

Version-Release number of selected component (if applicable):

RHOS17


How reproducible:

Issue hit in the last 3 runs of the RHOSP17 Integration pipeline.


Actual results:
Load balancers get stuck in the PENDING_CREATE or PENDING_UPDATE state.


Expected results:
Load balancers should transition to ACTIVE instead of getting stuck in the PENDING_CREATE or PENDING_UPDATE state.
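
For reference, the stuck provisioning status can be confirmed directly with the load balancer CLI (the load balancer ID below is a placeholder):

~~~
$ openstack loadbalancer show <lb-id> -c provisioning_status -c operating_status
~~~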

Additional info:

The following errors are observed in tempest:

~~~
{1} setUpClass (octavia_tempest_plugin.tests.scenario.v2.test_pool.PoolScenarioTest) [0.000000s] ... FAILED

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "/usr/lib/python3.6/site-packages/tempest/test.py", line 182, in setUpClass
        six.reraise(etype, value, trace)
      File "/usr/lib/python3.6/site-packages/six.py", line 703, in reraise
        raise value
      File "/usr/lib/python3.6/site-packages/tempest/test.py", line 175, in setUpClass
        cls.resource_setup()
      File "/usr/lib/python3.6/site-packages/octavia_tempest_plugin/tests/scenario/v2/test_pool.py", line 54, in resource_setup
        CONF.load_balancer.lb_build_timeout)
      File "/usr/lib/python3.6/site-packages/octavia_tempest_plugin/tests/waiters.py", line 96, in wait_for_status
        raise exceptions.TimeoutException(message)
    tempest.lib.exceptions.TimeoutException: Request timed out
    Details: (PoolScenarioTest:setUpClass) show_loadbalancer provisioning_status failed to update to ACTIVE within the required time 900. Current status of show_loadbalancer: PENDING_CREATE
~~~


Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "/usr/lib/python3.6/site-packages/octavia_tempest_plugin/tests/scenario/v2/test_pool.py", line 79, in test_HTTPS_LC_pool_with_listener_CRUD
        algorithm=const.LB_ALGORITHM_LEAST_CONNECTIONS)
      File "/usr/lib/python3.6/site-packages/octavia_tempest_plugin/tests/scenario/v2/test_pool.py", line 366, in _test_pool_CRUD
        CONF.load_balancer.build_timeout)
      File "/usr/lib/python3.6/site-packages/octavia_tempest_plugin/tests/waiters.py", line 96, in wait_for_status
        raise exceptions.TimeoutException(message)
    tempest.lib.exceptions.TimeoutException: Request timed out
    Details: (PoolScenarioTest:test_HTTPS_LC_pool_with_listener_CRUD) show_loadbalancer provisioning_status failed to update to ACTIVE within the required time 300. Current status of show_loadbalancer: PENDING_UPDATE
~~~
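
The 900 s and 300 s values above come from the octavia-tempest-plugin timeouts referenced in the tracebacks (CONF.load_balancer.lb_build_timeout and CONF.load_balancer.build_timeout). For reference, they are set in tempest.conf roughly like this; the values shown are the ones reported above, not necessarily the job's exact settings:

~~~
[load_balancer]
# timeout used by setUpClass when waiting for the initial load balancer to go ACTIVE
lb_build_timeout = 900
# timeout used inside the tests when waiting for updates to go ACTIVE
build_timeout = 300
~~~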

Comment 2 Gregory Thiemonge 2021-09-01 12:00:39 UTC
In the Octavia worker logs, we got:

2021-08-30 21:03:23.679 30 DEBUG octavia.amphorae.drivers.haproxy.rest_api_driver [-] request url https://172.24.3.95:9443/1.0/plug/vip/10.1.1.159 request /usr/lib/python3.6/site-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py:678
2021-08-30 21:04:24.066 30 WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='172.24.3.95', port=9443): Read timed out. (read timeout=60.0)
2021-08-30 21:04:44.124 30 WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='172.24.3.95', port=9443): Read timed out. (read timeout=10.0)
2021-08-30 21:05:04.166 30 WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='172.24.3.95', port=9443): Read timed out. (read timeout=10.0)
2021-08-30 21:05:24.217 30 WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='172.24.3.95', port=9443): Read timed out. (read timeout=10.0)
2021-08-30 21:05:44.241 30 WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='172.24.3.95', port=9443): Read timed out. (read timeout=10.0)
2021-08-30 21:06:04.266 30 WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='172.24.3.95', port=9443): Read timed out. (read timeout=10.0)
2021-08-30 21:06:24.297 30 WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='172.24.3.95', port=9443): Read timed out. (read timeout=10.0)
2021-08-30 21:06:42.066 30 DEBUG octavia.amphorae.drivers.haproxy.rest_api_driver [-] Connected to amphora. Response: <Response [409]> request /usr/lib/python3.6/site-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py:699
2021-08-30 21:06:42.068 30 ERROR octavia.amphorae.drivers.haproxy.exceptions [-] Amphora agent returned unexpected result code 409 with response {'message': 'Interface already exists'}


The "Interface already exists" error should not affect the load balancer creation, after this message, the VIP interface (created at 21:03:23.679) should be up and functional.
But this issue is due to the duration between the request from the octavia-worker and the response from the amphora-agent (21:03:23.679 -> 21:06:42.068), it took more than 3 min to set up an interface in the amphora (but it should be pretty quick, basically it calls 'ifup eth1' in the VM).
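
If one of these amphorae is still reachable, a quick check from inside the VM could confirm what the agent did and whether the VIP interface ended up configured (the log path is the amphora image default; eth1 as the VIP interface name is an assumption here):

~~~
# inside the amphora
$ sudo grep -i 'already exists' /var/log/amphora-agent.log
$ ip addr show eth1    # the VIP plug should have brought this interface up
~~~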

Then we can also check the duration of the flow that created the load balancer in Octavia:

in the worker logs:

2021-08-30 20:50:06.315 30 INFO octavia.controller.queue.v1.endpoints [-] Creating load balancer '16495fcc-3f88-41f9-b99d-4fd1d8fe4649'...
2021-08-30 20:50:08.097 30 DEBUG octavia.controller.worker.v1.controller_worker [-] Flow 'octavia-create-loadbalancer-flow' (6f5de613-1820-48c3-8ec1-2c23ba536c72) transitioned into state 'RUNNING' from state 'PENDING' _flow_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:143
2021-08-30 21:06:42.150 30 INFO octavia.controller.worker.v1.tasks.database_tasks [-] Mark ACTIVE in DB for load balancer id: 16495fcc-3f88-41f9-b99d-4fd1d8fe4649
2021-08-30 21:06:42.193 30 DEBUG octavia.controller.worker.v1.controller_worker [-] Flow 'octavia-create-loadbalancer-flow' (6f5de613-1820-48c3-8ec1-2c23ba536c72) transitioned into state 'SUCCESS' from state 'RUNNING' _flow_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:143

It took 16 minutes to create the load balancer, which is longer than any of the timeouts defined in octavia-tempest-plugin; creating a load balancer in OSP should take only about 1 min 30 s (2 min in a busy environment).
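
The same measurement can be reproduced on other runs by grepping the flow transitions in the worker log (path assumed to be the default TripleO container log location):

~~~
$ sudo grep "octavia-create-loadbalancer-flow" /var/log/containers/octavia/worker.log
~~~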


In the nova logs, we see the following configuration value:

2021-08-30 20:38:23.510 7 DEBUG oslo_service.service [req-4fcdfece-e5c4-4fd3-9b31-3cb27ccb953e - - - - -] libvirt.virt_type              = qemu log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2593

Here the CI uses QEMU emulation, which does not work well with Octavia: the amphorae are RHEL/CentOS-based VMs that run very slowly without hardware acceleration.

Do you have a way to enable KVM nested virtualization for those jobs?
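
For reference, a rough sketch of what that would involve: check that nested virtualization is enabled on the bare-metal host running the CI VMs, then deploy the computes with KVM instead of QEMU emulation (NovaComputeLibvirtType is the standard TripleO parameter; adjust to however the job configures nova):

~~~
# on the bare-metal hypervisor hosting the CI VMs (Intel; use kvm_amd on AMD)
$ cat /sys/module/kvm_intel/parameters/nested   # 'Y' or '1' means nested virt is enabled
$ virt-host-validate                            # verifies that /dev/kvm is usable

# then switch the overcloud computes from qemu to kvm, e.g. in a TripleO
# environment file:
parameter_defaults:
  NovaComputeLibvirtType: kvm
~~~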