Description of problem:

Octavia does not handle database connectivity issues gracefully: a short DB outage brings down all running load balancers. I first hit this during a minor RHOSP 13 release update, but it can be reproduced by simply restarting galera-bundle.

Version-Release number of selected component (if applicable):

DockerOctaviaApiImage: 192.168.111.1:8787/rhosp13/openstack-octavia-api:13.0-43
DockerOctaviaConfigImage: 192.168.111.1:8787/rhosp13/openstack-octavia-api:13.0-43
DockerOctaviaHealthManagerImage: 192.168.111.1:8787/rhosp13/openstack-octavia-health-manager:13.0-45
DockerOctaviaHousekeepingImage: 192.168.111.1:8787/rhosp13/openstack-octavia-housekeeping:13.0-45
DockerOctaviaWorkerImage: 192.168.111.1:8787/rhosp13/openstack-octavia-worker:13.0-44

How reproducible:

Every time. The exact results vary, but they are always catastrophic.

Steps to Reproduce:

1. Create a few load balancers:

for i in `seq 0 5` ; do openstack loadbalancer create --vip-subnet-id 2a5c5f64-81d1-4742-a4da-6c9706111f6f --name "loadbalancer-${i}"; done

2. Wait a while for everything to settle:

(overcloud) [stack@director ~]$ openstack loadbalancer list
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| id                                   | name           | project_id                       | vip_address   | provisioning_status | provider |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| 2cc2524a-72ea-4c02-ab1a-3e37ea674c68 | loadbalancer-0 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.110 | ACTIVE              | octavia  |
| e19d1426-c351-4c01-89f6-a9842be03c19 | loadbalancer-1 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.105 | ACTIVE              | octavia  |
| 8b794304-212b-4be2-b9a8-66ef2a3afb5d | loadbalancer-2 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.115 | ACTIVE              | octavia  |
| c3bf6eb7-788e-40a3-954a-351ef9b94b4a | loadbalancer-3 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.112 | ACTIVE              | octavia  |
| 32003402-216f-4c17-a675-b1442c96598a | loadbalancer-4 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.113 | ACTIVE              | octavia  |
| 102d8b48-9794-42eb-9c9d-7e21c4409ddb | loadbalancer-5 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.103 | ACTIVE              | octavia  |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+

(overcloud) [stack@director ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+------------+---------------+---------------+
| id                                   | loadbalancer_id                      | status    | role       | lb_network_ip | ha_ip         |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+---------------+
| 40703e60-c4b8-43df-a9b4-f4e1ddf2237c | e19d1426-c351-4c01-89f6-a9842be03c19 | ALLOCATED | STANDALONE | 172.24.0.17   | 192.168.5.105 |
| 4bec0db5-e381-4c85-b39b-e0b96629060e | 2cc2524a-72ea-4c02-ab1a-3e37ea674c68 | ALLOCATED | STANDALONE | 172.24.0.6    | 192.168.5.110 |
| 5f317218-c0d8-4316-a3cc-863bc130b509 | 8b794304-212b-4be2-b9a8-66ef2a3afb5d | ALLOCATED | STANDALONE | 172.24.0.5    | 192.168.5.115 |
| 94566449-2646-46b5-924c-e44a746ab5f9 | 102d8b48-9794-42eb-9c9d-7e21c4409ddb | ALLOCATED | STANDALONE | 172.24.0.18   | 192.168.5.103 |
| a8155cf9-5594-4402-9ab3-7f9b440c009d | 32003402-216f-4c17-a675-b1442c96598a | ALLOCATED | STANDALONE | 172.24.0.7    | 192.168.5.113 |
| fc7981f7-6c7e-42b0-a724-1a27b0622a8c | c3bf6eb7-788e-40a3-954a-351ef9b94b4a | ALLOCATED | STANDALONE | 172.24.0.19   | 192.168.5.112 |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+---------------+

3. Restart galera-bundle on a controller node:

[root@overcloud-controller-2 octavia]# pcs resource restart galera-bundle

4. Observe the various Octavia services losing their DB connections:

2018-08-28 14:23:47.270 24 ERROR octavia.controller.healthmanager.update_db DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)

5. Observe Octavia trying to fail over the amphorae:

2018-08-28 14:24:16.216 25 INFO octavia.controller.healthmanager.health_manager [-] Stale amphora's id is: 4bec0db5-e381-4c85-b39b-e0b96629060e
2018-08-28 14:24:16.247 25 INFO octavia.controller.healthmanager.health_manager [-] Stale amphora's id is: 94566449-2646-46b5-924c-e44a746ab5f9
2018-08-28 14:24:16.293 25 INFO octavia.controller.healthmanager.health_manager [-] Stale amphora's id is: fc7981f7-6c7e-42b0-a724-1a27b0622a8c
2018-08-28 14:24:16.352 25 INFO octavia.controller.healthmanager.health_manager [-] Waiting for 3 failovers to finish
2018-08-28 14:24:16.385 25 WARNING octavia.controller.worker.controller_worker [-] Failing over amphora with no spares pool may cause delays in failover times while a new amphora instance boots.
2018-08-28 14:24:16.429 25 WARNING octavia.controller.worker.controller_worker [-] Failing over amphora with no spares pool may cause delays in failover times while a new amphora instance boots.
2018-08-28 14:24:16.498 25 WARNING octavia.controller.worker.controller_worker [-] Failing over amphora with no spares pool may cause delays in failover times while a new amphora instance boots.

6. Several different things can happen at this point, to name a few:

- Octavia cannot create a new amphora because Nova is not ready yet after the DB outage: nova-api returns 500, and Octavia deletes the amphora instance and never tries to recreate it.
- Octavia tries to recreate the amphora instance, but it gets stuck in PENDING_CREATE forever.
- Octavia fails completely with DB connection errors, leaving some amphorae in ERROR and some in PENDING_DELETE, as below:

(overcloud) [stack@director ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+
| id                                   | loadbalancer_id                      | status         | role       | lb_network_ip | ha_ip         |
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+
| 40703e60-c4b8-43df-a9b4-f4e1ddf2237c | e19d1426-c351-4c01-89f6-a9842be03c19 | ERROR          | STANDALONE | 172.24.0.17   | 192.168.5.105 |
| 4bec0db5-e381-4c85-b39b-e0b96629060e | 2cc2524a-72ea-4c02-ab1a-3e37ea674c68 | ERROR          | STANDALONE | 172.24.0.6    | 192.168.5.110 |
| 5f317218-c0d8-4316-a3cc-863bc130b509 | 8b794304-212b-4be2-b9a8-66ef2a3afb5d | PENDING_DELETE | STANDALONE | 172.24.0.5    | 192.168.5.115 |
| 94566449-2646-46b5-924c-e44a746ab5f9 | 102d8b48-9794-42eb-9c9d-7e21c4409ddb | PENDING_DELETE | STANDALONE | 172.24.0.18   | 192.168.5.103 |
| a8155cf9-5594-4402-9ab3-7f9b440c009d | 32003402-216f-4c17-a675-b1442c96598a | ERROR          | STANDALONE | 172.24.0.7    | 192.168.5.113 |
| fc7981f7-6c7e-42b0-a724-1a27b0622a8c | c3bf6eb7-788e-40a3-954a-351ef9b94b4a | PENDING_DELETE | STANDALONE | 172.24.0.19   | 192.168.5.112 |
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+
7. Restarting the Octavia containers across the whole cluster does not help. The amphorae are left as they are, or at some point get successfully deleted. There is no way to fail anything over or otherwise recover, and LBs get stuck in PENDING_UPDATE:

(undercloud) [stack@director ~]$ ansible Controller -i /usr/bin/tripleo-ansible-inventory -b -m shell -a "docker ps -f name=octavia -q | xargs -P 5 -I{} docker restart {}"
192.168.111.202 | SUCCESS | rc=0 >>
592833395bb2
1f13973ad6cb
bad8340f4752
097e18890327

192.168.111.201 | SUCCESS | rc=0 >>
c6858d3d8181
7f2b304a7224
3ea76b696667
2abc5096abe1

192.168.111.207 | SUCCESS | rc=0 >>
4a73f0b48578
827cc913c74c
0b8b7dcea332
3a8ce9f1e08a

(overcloud) [stack@director ~]$ openstack loadbalancer list
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| id                                   | name           | project_id                       | vip_address   | provisioning_status | provider |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| 2cc2524a-72ea-4c02-ab1a-3e37ea674c68 | loadbalancer-0 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.110 | ERROR               | octavia  |
| e19d1426-c351-4c01-89f6-a9842be03c19 | loadbalancer-1 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.105 | ERROR               | octavia  |
| 8b794304-212b-4be2-b9a8-66ef2a3afb5d | loadbalancer-2 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.115 | ACTIVE              | octavia  |
| c3bf6eb7-788e-40a3-954a-351ef9b94b4a | loadbalancer-3 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.112 | PENDING_UPDATE      | octavia  |
| 32003402-216f-4c17-a675-b1442c96598a | loadbalancer-4 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.113 | ERROR               | octavia  |
| 102d8b48-9794-42eb-9c9d-7e21c4409ddb | loadbalancer-5 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.103 | ACTIVE              | octavia  |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+

(overcloud) [stack@director ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+
| id                                   | loadbalancer_id                      | status         | role       | lb_network_ip | ha_ip         |
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+
| 40703e60-c4b8-43df-a9b4-f4e1ddf2237c | e19d1426-c351-4c01-89f6-a9842be03c19 | ERROR          | STANDALONE | 172.24.0.17   | 192.168.5.105 |
| 4bec0db5-e381-4c85-b39b-e0b96629060e | 2cc2524a-72ea-4c02-ab1a-3e37ea674c68 | ERROR          | STANDALONE | 172.24.0.6    | 192.168.5.110 |
| 94566449-2646-46b5-924c-e44a746ab5f9 | 102d8b48-9794-42eb-9c9d-7e21c4409ddb | PENDING_DELETE | STANDALONE | 172.24.0.18   | 192.168.5.103 |
| a8155cf9-5594-4402-9ab3-7f9b440c009d | 32003402-216f-4c17-a675-b1442c96598a | ERROR          | STANDALONE | 172.24.0.7    | 192.168.5.113 |
| fc7981f7-6c7e-42b0-a724-1a27b0622a8c | c3bf6eb7-788e-40a3-954a-351ef9b94b4a | PENDING_DELETE | STANDALONE | 172.24.0.19   | 192.168.5.112 |
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+
8. Attempting to delete everything fails as well, due to orphaned ports with associated security groups that Octavia can no longer remove:

(overcloud) [stack@director ~]$ openstack loadbalancer list -f value -c id | xargs -P 6 -I{} openstack loadbalancer delete {}
(overcloud) [stack@director ~]$ openstack loadbalancer list
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| id                                   | name           | project_id                       | vip_address   | provisioning_status | provider |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| 2cc2524a-72ea-4c02-ab1a-3e37ea674c68 | loadbalancer-0 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.110 | PENDING_DELETE      | octavia  |
| e19d1426-c351-4c01-89f6-a9842be03c19 | loadbalancer-1 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.105 | PENDING_DELETE      | octavia  |
| 8b794304-212b-4be2-b9a8-66ef2a3afb5d | loadbalancer-2 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.115 | PENDING_DELETE      | octavia  |
| c3bf6eb7-788e-40a3-954a-351ef9b94b4a | loadbalancer-3 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.112 | PENDING_DELETE      | octavia  |
| 32003402-216f-4c17-a675-b1442c96598a | loadbalancer-4 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.113 | PENDING_DELETE      | octavia  |
| 102d8b48-9794-42eb-9c9d-7e21c4409ddb | loadbalancer-5 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.103 | PENDING_DELETE      | octavia  |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+

2018-08-28 14:44:59.418 23 INFO octavia.network.drivers.neutron.allowed_address_pairs [-] Removing security group 231409b7-216b-4825-8818-304a414fa23c from port 5bb9694a-c444-4abc-b018-269b5a41d364
2018-08-28 14:45:01.108 23 WARNING octavia.network.drivers.neutron.allowed_address_pairs [-] Attempt 1 to remove security group 231409b7-216b-4825-8818-304a414fa23c failed.: Conflict: Security Group 231409b7-216b-4825-8818-304a414fa23c in use.
(...)
2018-08-28 14:45:18.609 23 WARNING octavia.network.drivers.neutron.allowed_address_pairs [-] Attempt 16 to remove security group 231409b7-216b-4825-8818-304a414fa23c failed.: Conflict: Security Group 231409b7-216b-4825-8818-304a414fa23c in use.
2018-08-28 14:45:19.612 23 ERROR octavia.network.drivers.neutron.allowed_address_pairs [-] All attempts to remove security group 231409b7-216b-4825-8818-304a414fa23c have failed.: Conflict: Security Group 231409b7-216b-4825-8818-304a414fa23c in use.
2018-08-28 14:45:19.612 23 ERROR octavia.network.drivers.neutron.allowed_address_pairs Conflict: Security Group 231409b7-216b-4825-8818-304a414fa23c in use.
|__Flow 'octavia-delete-loadbalancer-flow': DeallocateVIPException: All attempts to remove security group 231409b7-216b-4825-8818-304a414fa23c have failed.
2018-08-28 14:45:19.617 23 ERROR octavia.controller.worker.controller_worker DeallocateVIPException: All attempts to remove security group 231409b7-216b-4825-8818-304a414fa23c have failed.
2018-08-28 14:45:19.695 23 ERROR oslo_messaging.rpc.server [-] Exception during message handling: DeallocateVIPException: All attempts to remove security group 231409b7-216b-4825-8818-304a414fa23c have failed.
2018-08-28 14:45:19.695 23 ERROR oslo_messaging.rpc.server DeallocateVIPException: All attempts to remove security group 231409b7-216b-4825-8818-304a414fa23c have failed.
After the failed deletion attempts, all load balancers end up in ERROR:

(overcloud) [stack@director ~]$ openstack loadbalancer list
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| id                                   | name           | project_id                       | vip_address   | provisioning_status | provider |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| 2cc2524a-72ea-4c02-ab1a-3e37ea674c68 | loadbalancer-0 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.110 | ERROR               | octavia  |
| e19d1426-c351-4c01-89f6-a9842be03c19 | loadbalancer-1 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.105 | ERROR               | octavia  |
| 8b794304-212b-4be2-b9a8-66ef2a3afb5d | loadbalancer-2 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.115 | ERROR               | octavia  |
| c3bf6eb7-788e-40a3-954a-351ef9b94b4a | loadbalancer-3 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.112 | ERROR               | octavia  |
| 32003402-216f-4c17-a675-b1442c96598a | loadbalancer-4 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.113 | ERROR               | octavia  |
| 102d8b48-9794-42eb-9c9d-7e21c4409ddb | loadbalancer-5 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.103 | ERROR               | octavia  |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+

9. The only way to delete these load balancers is to first delete the ports that were associated with the amphorae, and then retry the deletion; see the sketch under "Additional info" below.

Actual results:

A transient database outage, which can happen not only on component failure but also during a routine overcloud update, causes Octavia to break all load balancers with no easy path to recovery.

Expected results:

Octavia handles a DB outage properly, ideally without trying to fail over amphorae.

Additional info:
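To expand on step 9, here is what the cleanup looks like, as an unsupported sketch. Step 1 below is an extra escape hatch for load balancers wedged in PENDING_* (the API treats those as immutable); it is my addition rather than something exercised in this reproduction, and it assumes the default octavia DB schema. Step 2 assumes the leftover amphora VIP/VRRP ports follow the allowed_address_pairs driver's "octavia-lb-*" naming convention, so review the matched ports before deleting anything:

# 1. Only if LBs are wedged in PENDING_*: reset them to ERROR directly in
#    the octavia database (with appropriate credentials) so that delete
#    calls are accepted again. Entirely unsupported.
mysql octavia -e "UPDATE load_balancer SET provisioning_status = 'ERROR' WHERE provisioning_status IN ('PENDING_CREATE', 'PENDING_UPDATE', 'PENDING_DELETE');"

# 2. Delete the orphaned amphora ports that keep the security groups "in use".
openstack port list -f value -c ID -c Name | awk '/octavia-lb-/ {print $1}' | xargs -r -I{} openstack port delete {}

# 3. Retry the load balancer deletion.
openstack loadbalancer list -f value -c id | xargs -P 6 -I{} openstack loadbalancer delete {}

Once the ports (and with them the security-group association behind the DeallocateVIPException above) are gone, the delete calls go through.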
Thanks for the well-written report! We will look into the critical issue arising upon DB outage. The port issue is fixed in Octavia 2.0.2 (Queens). I expect OSP13 z3 to include it.
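In the meantime, for anyone needing a stopgap: the mass failover happens because the health manager treats an amphora as stale once its last recorded heartbeat is older than [health_manager]/heartbeat_timeout, and oslo.db ships opt-in reconnect/retry support. The octavia.conf sketch below shows where those knobs live; it is untested against this bug, and whether Octavia's code paths actually benefit from use_db_reconnect is an assumption on my part:

[health_manager]
# Amphorae whose last heartbeat record is older than this many seconds are
# considered stale and failed over (default 60). Raising it above the
# expected galera restart window may avoid the mass failover, at the cost
# of slower detection of genuinely dead amphorae.
heartbeat_timeout = 300

[database]
# Experimental oslo.db options: retry DB operations on a lost connection
# instead of failing the operation immediately.
use_db_reconnect = True
db_max_retries = 30
db_retry_interval = 2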
*** Bug 1609064 has been marked as a duplicate of this bug. ***
*** Bug 1603138 has been marked as a duplicate of this bug. ***
Hi Carlos,

One more customer has reported the same issue: one of their controller nodes was rebooted, the database was unavailable for a few seconds, and Octavia brought down all of their load balancers. The load balancers are now in ERROR state and cannot be deleted because they are "immutable". However, nuking the ports worked for them, and they were then able to delete the load balancers.

Let me know if you guys need some data from their environment, or any other input that could help.
Madhur, thanks for the info. We definitely need to put all our effort into fixing this issue ASAP. Could you confirm whether https://bugzilla.redhat.com/show_bug.cgi?id=1623071 is a duplicate of this one?
I will check the logs of both issues from my end to see if there are any similarities.
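For anyone pulling logs for comparison: assuming the default RHOSP 13 layout, the Octavia containers log to the host under /var/log/containers/octavia/, so the signatures above can be collected with something like:

grep -rE 'DBConnectionError|Stale amphora|DeallocateVIPException' /var/log/containers/octavia/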
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3614