Bug 1609064 - Rebooting the cluster causes the load balancers to stop working
Summary: Rebooting the cluster causes the load balancers to stop working
Keywords:
Status: CLOSED DUPLICATE of bug 1623146
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-octavia
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Assaf Muller
QA Contact: Alexander Stafeyev
URL:
Whiteboard:
Duplicates: 1609063
Depends On:
Blocks:
 
Reported: 2018-07-26 21:12 UTC by Alberto Gonzalez
Modified: 2019-09-10 14:09 UTC (History)
9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-29 19:51:09 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Alberto Gonzalez 2018-07-26 21:12:02 UTC
Description of problem:
Rebooting the cluster causes the load balancers to stop working

Version-Release number of selected component (if applicable):
13.0

How reproducible:
Always after reboot


Steps to Reproduce:
1. Create a load balancer with Octavia (see the sketch below)
2. Wait until it is accessible (the amphora is running)
3. Reboot the whole cluster
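
A minimal sketch of steps 1-2 with the OpenStack CLI; lb1, private-subnet, listener1 and pool1 are placeholder names:

$ openstack loadbalancer create --name lb1 --vip-subnet-id private-subnet
$ openstack loadbalancer listener create --name listener1 --protocol TCP --protocol-port 443 lb1
$ openstack loadbalancer pool create --name pool1 --lb-algorithm ROUND_ROBIN --listener listener1 --protocol TCP
$ openstack loadbalancer show lb1 -c provisioning_status -f value   # wait until this reports ACTIVE
$ openstack loadbalancer amphora list                               # the amphora should be ALLOCATED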

Actual results:
$ openstack loadbalancer list
+--------------------------------------+------------------------------------------------+----------------------------------+----------------+---------------------+----------+
| id                                   | name                                           | project_id                       | vip_address    | provisioning_status | provider |
+--------------------------------------+------------------------------------------------+----------------------------------+----------------+---------------------+----------+
| 5174d8f0-15b7-4051-b8db-8cc2216505cd | default/router                                 | 922c89bfc75b43fbb6cb23ae55480a74 | 172.30.205.175 | ACTIVE              | octavia  |
| e7d90234-0a8a-4f13-ae57-9dbbd7af9a9c | openshift-ansible-openshift.example.com-api-lb | 922c89bfc75b43fbb6cb23ae55480a74 | 172.30.0.1     | ACTIVE              | octavia  |
+--------------------------------------+------------------------------------------------+----------------------------------+----------------+---------------------+----------+

$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+
| id                                   | loadbalancer_id                      | status         | role       | lb_network_ip | ha_ip         |
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+
| 59ccc690-a5e7-4a67-86dc-4999d78b01a3 | a77fc719-74fb-4951-855a-d417fb858bb1 | ERROR          | STANDALONE | 172.24.0.5    | 172.30.13.157 |
| 7dc5e169-86ec-4ccc-b7c8-bd337183ff89 | e7d90234-0a8a-4f13-ae57-9dbbd7af9a9c | PENDING_DELETE | STANDALONE | 172.24.0.16   | 172.30.0.1    |
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+



Expected results:
LoadBalancer working and amphora servers running


Additional info:
Most of the time, the only way to remove the load balancers is directly in the database.
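
For comparison, the supported way to remove a load balancer together with its listeners and pools should be a cascade delete, roughly as below (assuming the --cascade flag is available in this client release):

$ openstack loadbalancer delete --cascade <lb-id>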

Comment 1 Alberto Gonzalez 2018-07-27 11:05:07 UTC
health-manager.log (this was another test, with 5 LBs)



2018-07-27 05:55:52.893 22 ERROR octavia.controller.worker.controller_worker [-] Failover exception: Failed to build compute instance due to: {u'message': u'No valid host was found. There are not enough hosts available.', u'code': 500, u'details': u'  File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 1116, in schedule_and_build_instances\n    instance_uuids, return_alternates=True)\n  File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 716, in _schedule_instances\n    return_alternates=return_alternates)\n  File "/usr/lib/python2.7/site-packages/nova/scheduler/utils.py", line 726, in wrapped\n    return func(*args, **kwargs)\n  File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 53, in select_destinations\n    instance_uuids, return_objects, return_alternates)\n  File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 37, in __run_method\n    return getattr(self.instance, __name)(*args, **kwargs)\n  File "/usr/lib/python2.7/site-packages/nova/scheduler/client/query.py", line 42, in select_destinations\n    instance_uuids, return_objects, return_alternates)\n  File "/usr/lib/python2.7/site-packages/nova/scheduler/rpcapi.py", line 158, in select_destinations\n    return cctxt.call(ctxt, \'select_destinations\', **msg_args)\n  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 174, in call\n    retry=self.retry)\n  File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 131, in _send\n    timeout=timeout, retry=retry)\n  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 559, in send\n    retry=retry)\n  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 550, in _send\n    raise result\n', u'created': u'2018-07-27T09:55:37Z'}: ComputeBuildException: Failed to build compute instance due to: {u'message': u'No valid host was found. 
There are not enough hosts available.', u'code': 500, u'details': u'  File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 1116, in schedule_and_build_instances\n    instance_uuids, return_alternates=True)\n  File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 716, in _schedule_instances\n    return_alternates=return_alternates)\n  File "/usr/lib/python2.7/site-packages/nova/scheduler/utils.py", line 726, in wrapped\n    return func(*args, **kwargs)\n  File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 53, in select_destinations\n    instance_uuids, return_objects, return_alternates)\n  File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 37, in __run_method\n    return getattr(self.instance, __name)(*args, **kwargs)\n  File "/usr/lib/python2.7/site-packages/nova/scheduler/client/query.py", line 42, in select_destinations\n    instance_uuids, return_objects, return_alternates)\n  File "/usr/lib/python2.7/site-packages/nova/scheduler/rpcapi.py", line 158, in select_destinations\n    return cctxt.call(ctxt, \'select_destinations\', **msg_args)\n  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 174, in call\n    retry=self.retry)\n  File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 131, in _send\n    timeout=timeout, retry=retry)\n  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 559, in send\n    retry=retry)\n  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 550, in _send\n    raise result\n', u'created': u'2018-07-27T09:55:37Z'}
2018-07-27 05:55:52.917 22 INFO octavia.controller.healthmanager.health_manager [-] Attempted 5 failovers of amphora
2018-07-27 05:55:52.917 22 INFO octavia.controller.healthmanager.health_manager [-] Failed at 5 failovers of amphora


This is probably caused by the compute nodes not being available at that moment, but Octavia does not retry to create the amphorae afterwards.

Comment 2 Alberto Gonzalez 2018-07-30 13:40:37 UTC
More info: after the cluster is rebooted and active, if you restore the "amphora" table inside the "octavia" database, the load balancers are redeployed and after a while they are working again.

Comment 3 Nir Magnezi 2018-08-01 13:45:10 UTC
*** Bug 1609063 has been marked as a duplicate of this bug. ***

Comment 4 Carlos Goncalves 2018-08-01 14:00:23 UTC
Right. Octavia triggered a failover for the amphorae hosted on compute nodes that were still down when you brought the cluster back online, but since there were not enough compute resources at failover time ("No valid host was found", and considering your comment #1), the failover failed. That is valid behavior.

Could you explain what you mean by restoring the amphora table inside the octavia DB?

Comment 5 Alberto Gonzalez 2018-08-01 14:39:54 UTC
echo "delete from amphora;" | ssh heat-admin.2.10 "mysql -uoctavia -pXXX octavia"
cat amphora.sql | ssh heat-admin.2.10 "mysql -uoctavia -pXX octavia"
openstack loadbalancer list -c id -f value|xargs -i openstack loadbalancer failover {}

So when the cluster is up again and the failover is attempted and fails, the original records are lost and you cannot fix it anymore.

Comment 7 Carlos Goncalves 2018-08-15 14:19:42 UTC
Why did you restore the amphora records, i.e. delete all the info from the amphora table? You should not have. Deleting data from the database as a means of repair is not the way to go.

The expected recovery step after restoring the compute nodes is to trigger a failover, either on a set of amphorae ("$ openstack loadbalancer amphora failover <ID>") or on load balancers ("$ openstack loadbalancer failover <ID>") in ERROR state. If that fails, it is a legitimate bug that should be fixed.
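
In practice, a recovery pass over everything left in ERROR state could look like the sketch below. It only uses the -c/-f output options and filters client-side with awk, rather than relying on server-side status filters:

$ openstack loadbalancer list -c id -c provisioning_status -f value | awk '$2 == "ERROR" {print $1}' | xargs -n1 openstack loadbalancer failover
$ openstack loadbalancer amphora list -c id -c status -f value | awk '$2 == "ERROR" {print $1}' | xargs -n1 openstack loadbalancer amphora failover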

Comment 8 Alberto Gonzalez 2018-08-15 14:26:49 UTC
The records in the amphora table are marked as DELETED when the failover is attempted while the compute nodes are down. After that it is no longer possible to run the amphora failover <ID> command; that is why the table needs to be restored with the status ALLOCATED.
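
A minimal way to check what is actually in that table, assuming direct MySQL access to the octavia database as in comment 5 (the column names follow the fields shown in the amphora list output above):

echo "select id, load_balancer_id, status from amphora;" | ssh heat-admin.2.10 "mysql -uoctavia -pXXX octavia"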

Comment 10 Carlos Goncalves 2018-08-29 19:51:09 UTC

*** This bug has been marked as a duplicate of bug 1623146 ***

