Bug 1623146 - [OSP13] Octavia brings down all LoadBalancers due to a DB outage
Summary: [OSP13] Octavia brings down all LoadBalancers due to a DB outage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-octavia
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z3
Target Release: 13.0 (Queens)
Assignee: Carlos Goncalves
QA Contact: Alexander Stafeyev
URL:
Whiteboard:
Duplicates: 1603138 1609064
Depends On:
Blocks: 1628743
 
Reported: 2018-08-28 14:55 UTC by Rafal Szmigiel
Modified: 2021-12-10 17:17 UTC
CC List: 18 users

Fixed In Version: openstack-octavia-2.0.2-2.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1628743
Environment:
Last Closed: 2018-11-13 23:32:56 UTC
Target Upstream Version:
Embargoed:




Links
System                   ID               Status   Summary                                       Last Updated
OpenStack gerrit         600876           MERGED   Make health checks resilient to DB outages   2020-12-15 09:34:46 UTC
OpenStack gerrit         602430           MERGED   Make health checks resilient to DB outages   2020-12-15 09:34:48 UTC
OpenStack gerrit         602431           MERGED   Make health checks resilient to DB outages   2020-12-15 09:35:17 UTC
Red Hat Issue Tracker    OSP-11576                                                               2021-12-10 17:17:27 UTC
Red Hat Product Errata   RHBA-2018:3614                                                          2018-11-13 23:33:40 UTC
Storyboard               2003575                                                                 2018-08-28 15:51:56 UTC

Description Rafal Szmigiel 2018-08-28 14:55:26 UTC
Description of problem:

Octavia does not handle DB connectivity issues gracefully; when the database becomes unavailable, it brings down all running load balancers.

I first hit the problem during a minor RHOSP 13 release update, but it can be reproduced simply by restarting galera-bundle.

Version-Release number of selected component (if applicable):

  DockerOctaviaApiImage: 192.168.111.1:8787/rhosp13/openstack-octavia-api:13.0-43
  DockerOctaviaConfigImage: 192.168.111.1:8787/rhosp13/openstack-octavia-api:13.0-43
  DockerOctaviaHealthManagerImage: 192.168.111.1:8787/rhosp13/openstack-octavia-health-manager:13.0-45
  DockerOctaviaHousekeepingImage: 192.168.111.1:8787/rhosp13/openstack-octavia-housekeeping:13.0-45
  DockerOctaviaWorkerImage: 192.168.111.1:8787/rhosp13/openstack-octavia-worker:13.0-44


How reproducible:

Always. The exact results vary, but they are always catastrophic.

Steps to Reproduce:
1. Create a few load balancers:

for i in $(seq 0 5); do openstack loadbalancer create --vip-subnet-id 2a5c5f64-81d1-4742-a4da-6c9706111f6f --name "loadbalancer-${i}"; done


2. Wait a while for everything to settle:

(overcloud) [stack@director ~]$ openstack loadbalancer list
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| id                                   | name           | project_id                       | vip_address   | provisioning_status | provider |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| 2cc2524a-72ea-4c02-ab1a-3e37ea674c68 | loadbalancer-0 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.110 | ACTIVE              | octavia  |
| e19d1426-c351-4c01-89f6-a9842be03c19 | loadbalancer-1 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.105 | ACTIVE              | octavia  |
| 8b794304-212b-4be2-b9a8-66ef2a3afb5d | loadbalancer-2 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.115 | ACTIVE              | octavia  |
| c3bf6eb7-788e-40a3-954a-351ef9b94b4a | loadbalancer-3 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.112 | ACTIVE              | octavia  |
| 32003402-216f-4c17-a675-b1442c96598a | loadbalancer-4 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.113 | ACTIVE              | octavia  |
| 102d8b48-9794-42eb-9c9d-7e21c4409ddb | loadbalancer-5 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.103 | ACTIVE              | octavia  |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
(overcloud) [stack@director ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+------------+---------------+---------------+
| id                                   | loadbalancer_id                      | status    | role       | lb_network_ip | ha_ip         |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+---------------+
| 40703e60-c4b8-43df-a9b4-f4e1ddf2237c | e19d1426-c351-4c01-89f6-a9842be03c19 | ALLOCATED | STANDALONE | 172.24.0.17   | 192.168.5.105 |
| 4bec0db5-e381-4c85-b39b-e0b96629060e | 2cc2524a-72ea-4c02-ab1a-3e37ea674c68 | ALLOCATED | STANDALONE | 172.24.0.6    | 192.168.5.110 |
| 5f317218-c0d8-4316-a3cc-863bc130b509 | 8b794304-212b-4be2-b9a8-66ef2a3afb5d | ALLOCATED | STANDALONE | 172.24.0.5    | 192.168.5.115 |
| 94566449-2646-46b5-924c-e44a746ab5f9 | 102d8b48-9794-42eb-9c9d-7e21c4409ddb | ALLOCATED | STANDALONE | 172.24.0.18   | 192.168.5.103 |
| a8155cf9-5594-4402-9ab3-7f9b440c009d | 32003402-216f-4c17-a675-b1442c96598a | ALLOCATED | STANDALONE | 172.24.0.7    | 192.168.5.113 |
| fc7981f7-6c7e-42b0-a724-1a27b0622a8c | c3bf6eb7-788e-40a3-954a-351ef9b94b4a | ALLOCATED | STANDALONE | 172.24.0.19   | 192.168.5.112 |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+---------------+



3. Restart galera-bundle from one of the controller nodes:

[root@overcloud-controller-2 octavia]# pcs resource restart galera-bundle
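
While the bundle restarts, Galera is unavailable cluster-wide. To watch it go down and come back, something like the following works (a convenience suggestion, not part of the original reproduction):

watch -n2 'pcs status resources | grep -A3 galera-bundle'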

4. Observe that the various Octavia services lose their DB connections:

2018-08-28 14:23:47.270 24 ERROR octavia.controller.healthmanager.update_db DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
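
To gauge how widespread the loss is, the errors can be grepped across all Octavia services on a controller (log path assumed for a containerized OSP 13 deployment):

# each Octavia container logs under /var/log/containers/octavia/ on the host
grep -R "DBConnectionError" /var/log/containers/octavia/ | tail -n 20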

5. Observe Octavia trying to fail over the amphorae:

2018-08-28 14:24:16.216 25 INFO octavia.controller.healthmanager.health_manager [-] Stale amphora's id is: 4bec0db5-e381-4c85-b39b-e0b96629060e
2018-08-28 14:24:16.247 25 INFO octavia.controller.healthmanager.health_manager [-] Stale amphora's id is: 94566449-2646-46b5-924c-e44a746ab5f9
2018-08-28 14:24:16.293 25 INFO octavia.controller.healthmanager.health_manager [-] Stale amphora's id is: fc7981f7-6c7e-42b0-a724-1a27b0622a8c
2018-08-28 14:24:16.352 25 INFO octavia.controller.healthmanager.health_manager [-] Waiting for 3 failovers to finish
2018-08-28 14:24:16.385 25 WARNING octavia.controller.worker.controller_worker [-] Failing over amphora with no spares pool may cause delays in failover times while a new amphora instance boots.
2018-08-28 14:24:16.429 25 WARNING octavia.controller.worker.controller_worker [-] Failing over amphora with no spares pool may cause delays in failover times while a new amphora instance boots.
2018-08-28 14:24:16.498 25 WARNING octavia.controller.worker.controller_worker [-] Failing over amphora with no spares pool may cause delays in failover times while a new amphora instance boots.
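
For context, the failover storm is driven by the health manager marking amphorae stale once their heartbeat records are older than the configured timeout; since the DB outage also prevents fresh heartbeats from being recorded, every amphora can look stale the moment the DB returns. The relevant knobs are in the [health_manager] section of octavia.conf (values below are the upstream defaults, quoted for reference only):

[health_manager]
health_check_interval = 3   # seconds between scans for stale amphorae
heartbeat_timeout = 60      # seconds without a recorded heartbeat before an amphora is treated as failed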

6. Multiple things can happen at this point; to name a few:
 - Octavia cannot create a new amphora because Nova is not yet ready after the DB outage. nova-api returns a 500, and Octavia deletes the amphora instance and never retries creating it.

 - Octavia tries to recreate the amphora instance, but it gets stuck in PENDING_CREATE forever.

 - Octavia fails completely, reporting DB connection issues and leaving some amphorae in ERROR and some in PENDING_DELETE, as below:


(overcloud) [stack@director ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+
| id                                   | loadbalancer_id                      | status         | role       | lb_network_ip | ha_ip         |
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+
| 40703e60-c4b8-43df-a9b4-f4e1ddf2237c | e19d1426-c351-4c01-89f6-a9842be03c19 | ERROR          | STANDALONE | 172.24.0.17   | 192.168.5.105 |
| 4bec0db5-e381-4c85-b39b-e0b96629060e | 2cc2524a-72ea-4c02-ab1a-3e37ea674c68 | ERROR          | STANDALONE | 172.24.0.6    | 192.168.5.110 |
| 5f317218-c0d8-4316-a3cc-863bc130b509 | 8b794304-212b-4be2-b9a8-66ef2a3afb5d | PENDING_DELETE | STANDALONE | 172.24.0.5    | 192.168.5.115 |
| 94566449-2646-46b5-924c-e44a746ab5f9 | 102d8b48-9794-42eb-9c9d-7e21c4409ddb | PENDING_DELETE | STANDALONE | 172.24.0.18   | 192.168.5.103 |
| a8155cf9-5594-4402-9ab3-7f9b440c009d | 32003402-216f-4c17-a675-b1442c96598a | ERROR          | STANDALONE | 172.24.0.7    | 192.168.5.113 |
| fc7981f7-6c7e-42b0-a724-1a27b0622a8c | c3bf6eb7-788e-40a3-954a-351ef9b94b4a | PENDING_DELETE | STANDALONE | 172.24.0.19   | 192.168.5.112 |
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+
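
To tell which of these failure modes was hit, it helps to cross-check the amphora VMs on the Nova side; amphora instances are named with an "amphora-" prefix, so as admin something like this (a suggestion, not from the original report) shows what actually got booted or deleted:

openstack server list --all-projects --name '^amphora-'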


7. Restarting the Octavia containers across the whole cluster does not help. Amphorae are left as they are, or at some point are successfully deleted. There is no way to fail over or do anything else; LBs get stuck in PENDING_UPDATE:

(undercloud) [stack@director ~]$ ansible Controller -i /usr/bin/tripleo-ansible-inventory -b -m shell -a "docker ps -f name=octavia -q | xargs -P 5 -I{} docker restart {}"
192.168.111.202 | SUCCESS | rc=0 >>
592833395bb2
1f13973ad6cb
bad8340f4752
097e18890327

192.168.111.201 | SUCCESS | rc=0 >>
c6858d3d8181
7f2b304a7224
3ea76b696667
2abc5096abe1

192.168.111.207 | SUCCESS | rc=0 >>
4a73f0b48578
827cc913c74c
0b8b7dcea332
3a8ce9f1e08a


(overcloud) [stack@director ~]$ openstack loadbalancer list
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| id                                   | name           | project_id                       | vip_address   | provisioning_status | provider |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| 2cc2524a-72ea-4c02-ab1a-3e37ea674c68 | loadbalancer-0 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.110 | ERROR               | octavia  |
| e19d1426-c351-4c01-89f6-a9842be03c19 | loadbalancer-1 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.105 | ERROR               | octavia  |
| 8b794304-212b-4be2-b9a8-66ef2a3afb5d | loadbalancer-2 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.115 | ACTIVE              | octavia  |
| c3bf6eb7-788e-40a3-954a-351ef9b94b4a | loadbalancer-3 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.112 | PENDING_UPDATE      | octavia  |
| 32003402-216f-4c17-a675-b1442c96598a | loadbalancer-4 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.113 | ERROR               | octavia  |
| 102d8b48-9794-42eb-9c9d-7e21c4409ddb | loadbalancer-5 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.103 | ACTIVE              | octavia  |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
(overcloud) [stack@director ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+
| id                                   | loadbalancer_id                      | status         | role       | lb_network_ip | ha_ip         |
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+
| 40703e60-c4b8-43df-a9b4-f4e1ddf2237c | e19d1426-c351-4c01-89f6-a9842be03c19 | ERROR          | STANDALONE | 172.24.0.17   | 192.168.5.105 |
| 4bec0db5-e381-4c85-b39b-e0b96629060e | 2cc2524a-72ea-4c02-ab1a-3e37ea674c68 | ERROR          | STANDALONE | 172.24.0.6    | 192.168.5.110 |
| 94566449-2646-46b5-924c-e44a746ab5f9 | 102d8b48-9794-42eb-9c9d-7e21c4409ddb | PENDING_DELETE | STANDALONE | 172.24.0.18   | 192.168.5.103 |
| a8155cf9-5594-4402-9ab3-7f9b440c009d | 32003402-216f-4c17-a675-b1442c96598a | ERROR          | STANDALONE | 172.24.0.7    | 192.168.5.113 |
| fc7981f7-6c7e-42b0-a724-1a27b0622a8c | c3bf6eb7-788e-40a3-954a-351ef9b94b4a | PENDING_DELETE | STANDALONE | 172.24.0.19   | 192.168.5.112 |
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+

8. Attempting to delete everything fails as well, due to orphaned ports with associated security groups that Octavia can no longer delete:

(overcloud) [stack@director ~]$ openstack loadbalancer list -f value -c id | xargs -P 6 -I{} openstack loadbalancer delete {}
(overcloud) [stack@director ~]$ openstack loadbalancer list 
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| id                                   | name           | project_id                       | vip_address   | provisioning_status | provider |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| 2cc2524a-72ea-4c02-ab1a-3e37ea674c68 | loadbalancer-0 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.110 | PENDING_DELETE      | octavia  |
| e19d1426-c351-4c01-89f6-a9842be03c19 | loadbalancer-1 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.105 | PENDING_DELETE      | octavia  |
| 8b794304-212b-4be2-b9a8-66ef2a3afb5d | loadbalancer-2 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.115 | PENDING_DELETE      | octavia  |
| c3bf6eb7-788e-40a3-954a-351ef9b94b4a | loadbalancer-3 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.112 | PENDING_DELETE      | octavia  |
| 32003402-216f-4c17-a675-b1442c96598a | loadbalancer-4 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.113 | PENDING_DELETE      | octavia  |
| 102d8b48-9794-42eb-9c9d-7e21c4409ddb | loadbalancer-5 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.103 | PENDING_DELETE      | octavia  |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+


2018-08-28 14:44:59.418 23 INFO octavia.network.drivers.neutron.allowed_address_pairs [-] Removing security group 231409b7-216b-4825-8818-304a414fa23c from port 5bb9694a-c444-4abc-b018-269b5a41d364
2018-08-28 14:45:01.108 23 WARNING octavia.network.drivers.neutron.allowed_address_pairs [-] Attempt 1 to remove security group 231409b7-216b-4825-8818-304a414fa23c failed.: Conflict: Security Group 231409b7-216b-4825-8818-304a414fa23c in use.

(...)

2018-08-28 14:45:18.609 23 WARNING octavia.network.drivers.neutron.allowed_address_pairs [-] Attempt 16 to remove security group 231409b7-216b-4825-8818-304a414fa23c failed.: Conflict: Security Group 231409b7-216b-4825-8818-304a414fa23c in use.
2018-08-28 14:45:19.612 23 ERROR octavia.network.drivers.neutron.allowed_address_pairs [-] All attempts to remove security group 231409b7-216b-4825-8818-304a414fa23c have failed.: Conflict: Security Group 231409b7-216b-4825-8818-304a414fa23c in use.
2018-08-28 14:45:19.612 23 ERROR octavia.network.drivers.neutron.allowed_address_pairs Conflict: Security Group 231409b7-216b-4825-8818-304a414fa23c in use.
           |__Flow 'octavia-delete-loadbalancer-flow': DeallocateVIPException: All attempts to remove security group 231409b7-216b-4825-8818-304a414fa23c have failed.
2018-08-28 14:45:19.617 23 ERROR octavia.controller.worker.controller_worker DeallocateVIPException: All attempts to remove security group 231409b7-216b-4825-8818-304a414fa23c have failed.
2018-08-28 14:45:19.695 23 ERROR oslo_messaging.rpc.server [-] Exception during message handling: DeallocateVIPException: All attempts to remove security group 231409b7-216b-4825-8818-304a414fa23c have failed.
2018-08-28 14:45:19.695 23 ERROR oslo_messaging.rpc.server DeallocateVIPException: All attempts to remove security group 231409b7-216b-4825-8818-304a414fa23c have failed.
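
The "in use" conflict means some port still references the security group. One crude but workable way to find the offender is to list ports with their security groups and filter (sketch, run as admin):

# --long adds the Security Groups column to the port listing
openstack port list --long | grep 231409b7-216b-4825-8818-304a414fa23c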

(overcloud) [stack@director ~]$ openstack loadbalancer list 
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| id                                   | name           | project_id                       | vip_address   | provisioning_status | provider |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| 2cc2524a-72ea-4c02-ab1a-3e37ea674c68 | loadbalancer-0 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.110 | ERROR               | octavia  |
| e19d1426-c351-4c01-89f6-a9842be03c19 | loadbalancer-1 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.105 | ERROR               | octavia  |
| 8b794304-212b-4be2-b9a8-66ef2a3afb5d | loadbalancer-2 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.115 | ERROR               | octavia  |
| c3bf6eb7-788e-40a3-954a-351ef9b94b4a | loadbalancer-3 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.112 | ERROR               | octavia  |
| 32003402-216f-4c17-a675-b1442c96598a | loadbalancer-4 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.113 | ERROR               | octavia  |
| 102d8b48-9794-42eb-9c9d-7e21c4409ddb | loadbalancer-5 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.103 | ERROR               | octavia  |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+

9. The only way to delete those load balancers is to first delete the ports that were associated with the amphorae, then retry the deletion; a sketch of that flow follows.
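
Assuming the amphora network driver's "octavia-lb-" port-name prefix (verify the port names in your environment before deleting anything), the cleanup can be scripted like this:

# delete the leftover amphora VIP/VRRP ports, then retry the load balancer deletion
openstack port list -f value -c ID -c Name \
    | awk '$2 ~ /^octavia-lb-/ {print $1}' \
    | xargs -r -n1 openstack port delete
openstack loadbalancer list -f value -c id \
    | xargs -P 6 -I{} openstack loadbalancer delete {}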


Actual results:

Octavia breaks all load balancers, with no easy path to recovery, whenever database connectivity is interrupted. Such interruptions can occur not only during an actual failure but also during a routine overcloud update.

Expected results:

Octavia handles a DB outage gracefully, ideally without attempting to fail over amphorae.
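
One shape such handling could take (a simplified sketch of the idea, not actual Octavia code; the repository and session names are illustrative) is for the health manager to treat a DB connection error during its stale-amphora scan as "no information" rather than "amphorae dead":

import logging

from oslo_db import exception as db_exc

LOG = logging.getLogger(__name__)

def find_stale_amphorae(amp_repo, session):
    """Scan for stale amphorae without letting a DB outage look like amphora death."""
    try:
        stale = amp_repo.get_stale_amphora(session)  # illustrative repository call
    except db_exc.DBConnectionError:
        # An unreachable DB says nothing about amphora health;
        # skip this cycle instead of queueing mass failovers.
        LOG.warning('Database unreachable; skipping stale-amphora scan')
        return []
    return stale or []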

Additional info:

Comment 1 Carlos Goncalves 2018-08-28 15:51:56 UTC
Thanks for the well-written report!

We will look into the critical issue arising upon DB outage.

The port issue is fixed in Octavia 2.0.2 (Queens). I expect OSP13 z3 to include it.

Comment 5 Carlos Goncalves 2018-08-29 19:51:09 UTC
*** Bug 1609064 has been marked as a duplicate of this bug. ***

Comment 6 Carlos Goncalves 2018-08-29 20:01:34 UTC
*** Bug 1603138 has been marked as a duplicate of this bug. ***

Comment 7 Madhur Gupta 2018-08-30 14:07:39 UTC
Hi Carlos,

One more customer has reported the same issue.

One of their controller nodes was rebooted and the database was unavailable for a few seconds; Octavia then brought down all of their load balancers.

The load balancers are now in ERROR state and cannot be deleted because they are "immutable".

However, deleting the ports worked for them, and they were able to delete the load balancers.

Let me know if you need any data from their environment or other input that could help.

Comment 8 Carlos Goncalves 2018-08-30 14:20:17 UTC
Madhur, thanks for the info. We definitely need to put all effort into fixing this issue ASAP.

Could you confirm if https://bugzilla.redhat.com/show_bug.cgi?id=1623071 is a duplicate of this one?

Comment 9 Madhur Gupta 2018-08-31 08:40:39 UTC
I will check the logs of both issues from my end to see if there are any similarities.

Comment 35 errata-xmlrpc 2018-11-13 23:32:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3614

