Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1623146

Summary: [OSP13] Octavia brings down all LoadBalancers due to the DB outage.
Product: Red Hat OpenStack    Reporter: Rafal Szmigiel <rszmigie>
Component: openstack-octavia    Assignee: Carlos Goncalves <cgoncalves>
Status: CLOSED ERRATA    QA Contact: Alexander Stafeyev <astafeye>
Severity: urgent    Priority: urgent
Version: 13.0 (Queens)    Target Release: 13.0 (Queens)
Target Milestone: z3    Keywords: Triaged, ZStream
Hardware: Unspecified    OS: Unspecified
CC: akrzos, alberto.gonzalez, amuller, astafeye, bcafarel, cbolz, cgoncalves, dalvarez, dprince, felix.huettner, ihrachys, lmiccini, lpeer, madgupta, majopela, nyechiel, pmorey, tfreger
Fixed In Version: openstack-octavia-2.0.2-2.el7ost
Last Closed: 2018-11-13 23:32:56 UTC    Type: Bug
Bug Blocks: 1628743

Description Rafal Szmigiel 2018-08-28 14:55:26 UTC
Description of problem:

Octavia does not handle DB connectivity issues properly and brings down all running load balancers.

I experienced the problem during a minor RHOSP13 release update, but it can be reproduced by a simple restart of galera-bundle.

Version-Release number of selected component (if applicable):

  DockerOctaviaApiImage: 192.168.111.1:8787/rhosp13/openstack-octavia-api:13.0-43
  DockerOctaviaConfigImage: 192.168.111.1:8787/rhosp13/openstack-octavia-api:13.0-43
  DockerOctaviaHealthManagerImage: 192.168.111.1:8787/rhosp13/openstack-octavia-health-manager:13.0-45
  DockerOctaviaHousekeepingImage: 192.168.111.1:8787/rhosp13/openstack-octavia-housekeeping:13.0-45
  DockerOctaviaWorkerImage: 192.168.111.1:8787/rhosp13/openstack-octavia-worker:13.0-44


How reproducible:

Every time. The exact results vary, but they are always catastrophic.

Steps to Reproduce:
1. Create a few load balancers:

for i in `seq 0 5` ; do openstack loadbalancer create --vip-subnet-id 2a5c5f64-81d1-4742-a4da-6c9706111f6f --name "loadbalancer-${i}"; done


2. Wait a while for everything to settle:

(overcloud) [stack@director ~]$ openstack loadbalancer list
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| id                                   | name           | project_id                       | vip_address   | provisioning_status | provider |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| 2cc2524a-72ea-4c02-ab1a-3e37ea674c68 | loadbalancer-0 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.110 | ACTIVE              | octavia  |
| e19d1426-c351-4c01-89f6-a9842be03c19 | loadbalancer-1 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.105 | ACTIVE              | octavia  |
| 8b794304-212b-4be2-b9a8-66ef2a3afb5d | loadbalancer-2 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.115 | ACTIVE              | octavia  |
| c3bf6eb7-788e-40a3-954a-351ef9b94b4a | loadbalancer-3 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.112 | ACTIVE              | octavia  |
| 32003402-216f-4c17-a675-b1442c96598a | loadbalancer-4 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.113 | ACTIVE              | octavia  |
| 102d8b48-9794-42eb-9c9d-7e21c4409ddb | loadbalancer-5 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.103 | ACTIVE              | octavia  |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
(overcloud) [stack@director ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+------------+---------------+---------------+
| id                                   | loadbalancer_id                      | status    | role       | lb_network_ip | ha_ip         |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+---------------+
| 40703e60-c4b8-43df-a9b4-f4e1ddf2237c | e19d1426-c351-4c01-89f6-a9842be03c19 | ALLOCATED | STANDALONE | 172.24.0.17   | 192.168.5.105 |
| 4bec0db5-e381-4c85-b39b-e0b96629060e | 2cc2524a-72ea-4c02-ab1a-3e37ea674c68 | ALLOCATED | STANDALONE | 172.24.0.6    | 192.168.5.110 |
| 5f317218-c0d8-4316-a3cc-863bc130b509 | 8b794304-212b-4be2-b9a8-66ef2a3afb5d | ALLOCATED | STANDALONE | 172.24.0.5    | 192.168.5.115 |
| 94566449-2646-46b5-924c-e44a746ab5f9 | 102d8b48-9794-42eb-9c9d-7e21c4409ddb | ALLOCATED | STANDALONE | 172.24.0.18   | 192.168.5.103 |
| a8155cf9-5594-4402-9ab3-7f9b440c009d | 32003402-216f-4c17-a675-b1442c96598a | ALLOCATED | STANDALONE | 172.24.0.7    | 192.168.5.113 |
| fc7981f7-6c7e-42b0-a724-1a27b0622a8c | c3bf6eb7-788e-40a3-954a-351ef9b94b4a | ALLOCATED | STANDALONE | 172.24.0.19   | 192.168.5.112 |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+---------------+



3. Restart galera-bundle on the controller node

[root@overcloud-controller-2 octavia]# pcs resource restart galera-bundle

4. Observe that various Octavia services lose their DB connections:

2018-08-28 14:23:47.270 24 ERROR octavia.controller.healthmanager.update_db DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
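The disconnect itself is expected during a galera restart; what makes it fatal is that it is treated as a permanent failure instead of being retried. A retry with backoff around DB operations would ride out the restart. A minimal pure-Python sketch of that pattern (`DBConnectionError` here is a stand-in class, not the real pymysql/SQLAlchemy exception, and Octavia itself would go through oslo.db rather than hand-rolled code):

```python
import time


class DBConnectionError(Exception):
    """Stand-in for pymysql/SQLAlchemy disconnect errors (illustrative only)."""


def with_db_retry(fn, attempts=5, base_delay=0.1):
    """Run a DB operation, retrying with exponential backoff on disconnects."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except DBConnectionError:
            if attempt == attempts:
                raise  # DB really is gone; give up after the last attempt
            time.sleep(base_delay * 2 ** (attempt - 1))


# Example: a query that fails twice while galera restarts, then succeeds.
calls = {"n": 0}

def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise DBConnectionError("Lost connection to MySQL server during query")
    return "ok"
```

With this in place, a short DB outage costs a few retries instead of failing the whole flow.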

5. Observe Octavia trying to fail over amphorae:

2018-08-28 14:24:16.216 25 INFO octavia.controller.healthmanager.health_manager [-] Stale amphora's id is: 4bec0db5-e381-4c85-b39b-e0b96629060e
2018-08-28 14:24:16.247 25 INFO octavia.controller.healthmanager.health_manager [-] Stale amphora's id is: 94566449-2646-46b5-924c-e44a746ab5f9
2018-08-28 14:24:16.293 25 INFO octavia.controller.healthmanager.health_manager [-] Stale amphora's id is: fc7981f7-6c7e-42b0-a724-1a27b0622a8c
2018-08-28 14:24:16.352 25 INFO octavia.controller.healthmanager.health_manager [-] Waiting for 3 failovers to finish
2018-08-28 14:24:16.385 25 WARNING octavia.controller.worker.controller_worker [-] Failing over amphora with no spares pool may cause delays in failover times while a new amphora instance boots.
2018-08-28 14:24:16.429 25 WARNING octavia.controller.worker.controller_worker [-] Failing over amphora with no spares pool may cause delays in failover times while a new amphora instance boots.
2018-08-28 14:24:16.498 25 WARNING octavia.controller.worker.controller_worker [-] Failing over amphora with no spares pool may cause delays in failover times while a new amphora instance boots.
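The mass "stale amphora" detection follows directly from the design: heartbeats are recorded in the same database that just went away, so during the outage no last-seen timestamps get written, and once connectivity returns every amphora looks stale at once. A toy sketch of that failure mode (names and the timeout value are illustrative, not Octavia's actual code):

```python
import time

HEARTBEAT_TIMEOUT = 10  # seconds; stands in for Octavia's heartbeat timeout


def stale_amphorae(last_seen, now=None):
    """Return ids of amphorae whose last recorded heartbeat is too old.

    last_seen maps amphora id -> timestamp of the last heartbeat that was
    successfully WRITTEN to the DB. During a DB outage no rows are updated,
    so once connectivity returns every amphora looks stale, even though the
    amphorae themselves were healthy the whole time.
    """
    if now is None:
        now = time.time()
    return sorted(aid for aid, ts in last_seen.items()
                  if now - ts > HEARTBEAT_TIMEOUT)


# A 30 s galera restart: no heartbeats get written after outage_start.
outage_start = 1000.0
last_seen = {"amp-1": outage_start,
             "amp-2": outage_start - 2,
             "amp-3": outage_start - 8}
```

At `outage_start` nothing is stale; 30 seconds later every amphora is, which is exactly the mass failover seen in the logs above.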

6. Multiple things can happen at this point; to name a few:
 - Octavia cannot create a new amphora because Nova is not yet ready after the DB outage. nova-api returns HTTP 500, and Octavia deletes the amphora instance and never tries to recreate it.

 - Octavia tries to recreate the amphora instance, but it gets stuck in PENDING_CREATE forever.

 - Octavia fails completely, reporting DB connection issues and leaving some amphorae in ERROR and some in PENDING_DELETE, as below:


(overcloud) [stack@director ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+
| id                                   | loadbalancer_id                      | status         | role       | lb_network_ip | ha_ip         |
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+
| 40703e60-c4b8-43df-a9b4-f4e1ddf2237c | e19d1426-c351-4c01-89f6-a9842be03c19 | ERROR          | STANDALONE | 172.24.0.17   | 192.168.5.105 |
| 4bec0db5-e381-4c85-b39b-e0b96629060e | 2cc2524a-72ea-4c02-ab1a-3e37ea674c68 | ERROR          | STANDALONE | 172.24.0.6    | 192.168.5.110 |
| 5f317218-c0d8-4316-a3cc-863bc130b509 | 8b794304-212b-4be2-b9a8-66ef2a3afb5d | PENDING_DELETE | STANDALONE | 172.24.0.5    | 192.168.5.115 |
| 94566449-2646-46b5-924c-e44a746ab5f9 | 102d8b48-9794-42eb-9c9d-7e21c4409ddb | PENDING_DELETE | STANDALONE | 172.24.0.18   | 192.168.5.103 |
| a8155cf9-5594-4402-9ab3-7f9b440c009d | 32003402-216f-4c17-a675-b1442c96598a | ERROR          | STANDALONE | 172.24.0.7    | 192.168.5.113 |
| fc7981f7-6c7e-42b0-a724-1a27b0622a8c | c3bf6eb7-788e-40a3-954a-351ef9b94b4a | PENDING_DELETE | STANDALONE | 172.24.0.19   | 192.168.5.112 |
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+


7. Restarting the Octavia containers across the whole cluster does not help. Amphorae are left as they are, or are at some point successfully deleted. There is no way to fail over or do anything else - LBs get stuck in PENDING_UPDATE:

(undercloud) [stack@director ~]$ ansible Controller -i /usr/bin/tripleo-ansible-inventory -b -m shell -a "docker ps -f name=octavia -q | xargs -P 5 -I{} docker restart {}"
192.168.111.202 | SUCCESS | rc=0 >>
592833395bb2
1f13973ad6cb
bad8340f4752
097e18890327

192.168.111.201 | SUCCESS | rc=0 >>
c6858d3d8181
7f2b304a7224
3ea76b696667
2abc5096abe1

192.168.111.207 | SUCCESS | rc=0 >>
4a73f0b48578
827cc913c74c
0b8b7dcea332
3a8ce9f1e08a


(overcloud) [stack@director ~]$ openstack loadbalancer list
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| id                                   | name           | project_id                       | vip_address   | provisioning_status | provider |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| 2cc2524a-72ea-4c02-ab1a-3e37ea674c68 | loadbalancer-0 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.110 | ERROR               | octavia  |
| e19d1426-c351-4c01-89f6-a9842be03c19 | loadbalancer-1 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.105 | ERROR               | octavia  |
| 8b794304-212b-4be2-b9a8-66ef2a3afb5d | loadbalancer-2 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.115 | ACTIVE              | octavia  |
| c3bf6eb7-788e-40a3-954a-351ef9b94b4a | loadbalancer-3 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.112 | PENDING_UPDATE      | octavia  |
| 32003402-216f-4c17-a675-b1442c96598a | loadbalancer-4 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.113 | ERROR               | octavia  |
| 102d8b48-9794-42eb-9c9d-7e21c4409ddb | loadbalancer-5 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.103 | ACTIVE              | octavia  |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
(overcloud) [stack@director ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+
| id                                   | loadbalancer_id                      | status         | role       | lb_network_ip | ha_ip         |
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+
| 40703e60-c4b8-43df-a9b4-f4e1ddf2237c | e19d1426-c351-4c01-89f6-a9842be03c19 | ERROR          | STANDALONE | 172.24.0.17   | 192.168.5.105 |
| 4bec0db5-e381-4c85-b39b-e0b96629060e | 2cc2524a-72ea-4c02-ab1a-3e37ea674c68 | ERROR          | STANDALONE | 172.24.0.6    | 192.168.5.110 |
| 94566449-2646-46b5-924c-e44a746ab5f9 | 102d8b48-9794-42eb-9c9d-7e21c4409ddb | PENDING_DELETE | STANDALONE | 172.24.0.18   | 192.168.5.103 |
| a8155cf9-5594-4402-9ab3-7f9b440c009d | 32003402-216f-4c17-a675-b1442c96598a | ERROR          | STANDALONE | 172.24.0.7    | 192.168.5.113 |
| fc7981f7-6c7e-42b0-a724-1a27b0622a8c | c3bf6eb7-788e-40a3-954a-351ef9b94b4a | PENDING_DELETE | STANDALONE | 172.24.0.19   | 192.168.5.112 |
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+

8. An attempt to delete everything fails as well, due to orphaned ports whose associated security groups Octavia can no longer delete:

(overcloud) [stack@director ~]$ openstack loadbalancer list -f value -c id | xargs -P 6 -I{} openstack loadbalancer delete {}
(overcloud) [stack@director ~]$ openstack loadbalancer list 
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| id                                   | name           | project_id                       | vip_address   | provisioning_status | provider |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| 2cc2524a-72ea-4c02-ab1a-3e37ea674c68 | loadbalancer-0 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.110 | PENDING_DELETE      | octavia  |
| e19d1426-c351-4c01-89f6-a9842be03c19 | loadbalancer-1 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.105 | PENDING_DELETE      | octavia  |
| 8b794304-212b-4be2-b9a8-66ef2a3afb5d | loadbalancer-2 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.115 | PENDING_DELETE      | octavia  |
| c3bf6eb7-788e-40a3-954a-351ef9b94b4a | loadbalancer-3 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.112 | PENDING_DELETE      | octavia  |
| 32003402-216f-4c17-a675-b1442c96598a | loadbalancer-4 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.113 | PENDING_DELETE      | octavia  |
| 102d8b48-9794-42eb-9c9d-7e21c4409ddb | loadbalancer-5 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.103 | PENDING_DELETE      | octavia  |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+


2018-08-28 14:44:59.418 23 INFO octavia.network.drivers.neutron.allowed_address_pairs [-] Removing security group 231409b7-216b-4825-8818-304a414fa23c from port 5bb9694a-c444-4abc-b018-269b5a41d364
2018-08-28 14:45:01.108 23 WARNING octavia.network.drivers.neutron.allowed_address_pairs [-] Attempt 1 to remove security group 231409b7-216b-4825-8818-304a414fa23c failed.: Conflict: Security Group 231409b7-216b-4825-8818-304a414fa23c in use.

(...)

2018-08-28 14:45:18.609 23 WARNING octavia.network.drivers.neutron.allowed_address_pairs [-] Attempt 16 to remove security group 231409b7-216b-4825-8818-304a414fa23c failed.: Conflict: Security Group 231409b7-216b-4825-8818-304a414fa23c in use.
2018-08-28 14:45:19.612 23 ERROR octavia.network.drivers.neutron.allowed_address_pairs [-] All attempts to remove security group 231409b7-216b-4825-8818-304a414fa23c have failed.: Conflict: Security Group 231409b7-216b-4825-8818-304a414fa23c in use.
2018-08-28 14:45:19.612 23 ERROR octavia.network.drivers.neutron.allowed_address_pairs Conflict: Security Group 231409b7-216b-4825-8818-304a414fa23c in use.
           |__Flow 'octavia-delete-loadbalancer-flow': DeallocateVIPException: All attempts to remove security group 231409b7-216b-4825-8818-304a414fa23c have failed.
2018-08-28 14:45:19.617 23 ERROR octavia.controller.worker.controller_worker DeallocateVIPException: All attempts to remove security group 231409b7-216b-4825-8818-304a414fa23c have failed.
2018-08-28 14:45:19.695 23 ERROR oslo_messaging.rpc.server [-] Exception during message handling: DeallocateVIPException: All attempts to remove security group 231409b7-216b-4825-8818-304a414fa23c have failed.
2018-08-28 14:45:19.695 23 ERROR oslo_messaging.rpc.server DeallocateVIPException: All attempts to remove security group 231409b7-216b-4825-8818-304a414fa23c have failed.

(overcloud) [stack@director ~]$ openstack loadbalancer list 
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| id                                   | name           | project_id                       | vip_address   | provisioning_status | provider |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+
| 2cc2524a-72ea-4c02-ab1a-3e37ea674c68 | loadbalancer-0 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.110 | ERROR               | octavia  |
| e19d1426-c351-4c01-89f6-a9842be03c19 | loadbalancer-1 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.105 | ERROR               | octavia  |
| 8b794304-212b-4be2-b9a8-66ef2a3afb5d | loadbalancer-2 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.115 | ERROR               | octavia  |
| c3bf6eb7-788e-40a3-954a-351ef9b94b4a | loadbalancer-3 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.112 | ERROR               | octavia  |
| 32003402-216f-4c17-a675-b1442c96598a | loadbalancer-4 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.113 | ERROR               | octavia  |
| 102d8b48-9794-42eb-9c9d-7e21c4409ddb | loadbalancer-5 | ac52d9d546584c80b517fad61f449be5 | 192.168.5.103 | ERROR               | octavia  |
+--------------------------------------+----------------+----------------------------------+---------------+---------------------+----------+

9. The only way to delete those load balancers is to first delete the ports that were associated with the amphorae, then retry the deletion.
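That workaround can be scripted. A hedged sketch below, with the OpenStack calls injected as callables so the logic is testable offline; the `octavia-lb-` port-name prefix is an assumption based on what we observed in this deployment - verify it before deleting anything in a real cloud:

```python
def cleanup_orphaned_lb_ports(lb_id, list_ports, delete_port):
    """Delete leftover Neutron ports belonging to one load balancer.

    list_ports() yields (port_id, port_name) pairs, e.g. backed by
    `openstack port list -f value -c ID -c Name`, and delete_port(port_id)
    would run `openstack port delete <port_id>`. The "octavia-lb-" prefix
    is an assumed naming convention, not a documented API guarantee.
    """
    deleted = []
    for port_id, name in list_ports():
        if name.startswith("octavia-lb-") and lb_id in name:
            delete_port(port_id)
            deleted.append(port_id)
    return deleted
```

Once the matching ports are gone, the security groups are no longer "in use" and `openstack loadbalancer delete` can succeed.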


Actual results:

Octavia breaks all load balancers, with no easy way to recover, whenever database connectivity is lost - which can happen not only during a failure but also during an overcloud update.

Expected results:

Octavia handles a DB outage properly, ideally without trying to fail over amphorae.
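One way to get that behaviour: have the health manager gate failover decisions on its own recent DB connectivity, so staleness observed right after an outage is treated as lost heartbeat writes rather than dead amphorae. A hypothetical sketch of that idea, not the actual upstream fix:

```python
class HealthGate:
    """Only trust amphora staleness if the DB has been healthy long enough."""

    def __init__(self, heartbeat_timeout):
        self.heartbeat_timeout = heartbeat_timeout
        self.last_db_failure = float("-inf")  # no failure seen yet

    def record_db_failure(self, now):
        """Call whenever a DB operation raises a connection error."""
        self.last_db_failure = now

    def should_failover(self, amp_last_seen, now):
        amp_stale = now - amp_last_seen > self.heartbeat_timeout
        # Require a full heartbeat window of healthy DB before trusting
        # staleness: right after an outage, missing heartbeats are more
        # likely lost writes than dead amphorae.
        db_settled = now - self.last_db_failure > self.heartbeat_timeout
        return amp_stale and db_settled
```

With a gate like this, a galera restart delays failover decisions by one heartbeat window instead of triggering a failover storm against healthy amphorae.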

Additional info:

Comment 1 Carlos Goncalves 2018-08-28 15:51:56 UTC
Thanks for the well-written report!

We will look into the critical issue arising upon DB outage.

The port issue is fixed in Octavia 2.0.2 (Queens). I expect OSP13 z3 to include it.

Comment 5 Carlos Goncalves 2018-08-29 19:51:09 UTC
*** Bug 1609064 has been marked as a duplicate of this bug. ***

Comment 6 Carlos Goncalves 2018-08-29 20:01:34 UTC
*** Bug 1603138 has been marked as a duplicate of this bug. ***

Comment 7 Madhur Gupta 2018-08-30 14:07:39 UTC
Hi Carlos,

One more customer has reported the same issue.

One of their controller nodes was rebooted and the database was unavailable for a few seconds, after which Octavia brought down all of their load balancers.

The load balancers are now in ERROR state and cannot be deleted because they are "immutable".

However, deleting the ports worked for them, and they were then able to delete the load balancers.

Let me know if you need any data from their environment, or any other input that could help.

Comment 8 Carlos Goncalves 2018-08-30 14:20:17 UTC
Madhur, thanks for the info. We definitely need to put all our effort into fixing this issue ASAP.

Could you confirm if https://bugzilla.redhat.com/show_bug.cgi?id=1623071 is a duplicate of this one?

Comment 9 Madhur Gupta 2018-08-31 08:40:39 UTC
I will check the logs of both issues from my end to see if there are any similarities.

Comment 35 errata-xmlrpc 2018-11-13 23:32:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3614