Description of problem:
Rebooting the cluster causes the load balancers to stop working.

Version-Release number of selected component (if applicable): 13.0

How reproducible: Always after reboot

Steps to Reproduce:
1. Create a load balancer with Octavia
2. Wait until it is accessible (the amphora is running)
3. Reboot the whole cluster

Actual results:
$ openstack loadbalancer list
+--------------------------------------+------------------------------------------------+----------------------------------+----------------+---------------------+----------+
| id                                   | name                                           | project_id                       | vip_address    | provisioning_status | provider |
+--------------------------------------+------------------------------------------------+----------------------------------+----------------+---------------------+----------+
| 5174d8f0-15b7-4051-b8db-8cc2216505cd | default/router                                 | 922c89bfc75b43fbb6cb23ae55480a74 | 172.30.205.175 | ACTIVE              | octavia  |
| e7d90234-0a8a-4f13-ae57-9dbbd7af9a9c | openshift-ansible-openshift.example.com-api-lb | 922c89bfc75b43fbb6cb23ae55480a74 | 172.30.0.1     | ACTIVE              | octavia  |
+--------------------------------------+------------------------------------------------+----------------------------------+----------------+---------------------+----------+

$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+
| id                                   | loadbalancer_id                      | status         | role       | lb_network_ip | ha_ip         |
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+
| 59ccc690-a5e7-4a67-86dc-4999d78b01a3 | a77fc719-74fb-4951-855a-d417fb858bb1 | ERROR          | STANDALONE | 172.24.0.5    | 172.30.13.157 |
| 7dc5e169-86ec-4ccc-b7c8-bd337183ff89 | e7d90234-0a8a-4f13-ae57-9dbbd7af9a9c | PENDING_DELETE | STANDALONE | 172.24.0.16   | 172.30.0.1    |
+--------------------------------------+--------------------------------------+----------------+------------+---------------+---------------+

Expected results:
Load balancers working and amphora servers running.

Additional info:
Most of the time, the only way to remove the load balancers is directly in the database.
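For reference, a minimal sketch of steps 1-2 (the load balancer name and subnet are placeholders, not values from this environment):

$ openstack loadbalancer create --name test-lb --vip-subnet-id <subnet-id>
$ openstack loadbalancer show test-lb -c provisioning_status -f value   # repeat until it prints ACTIVE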
health-manager.log (this was another test, with 5 LBs):

2018-07-27 05:55:52.893 22 ERROR octavia.controller.worker.controller_worker [-] Failover exception: Failed to build compute instance due to: {u'message': u'No valid host was found. There are not enough hosts available.', u'code': 500, u'details': u' File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 1116, in schedule_and_build_instances\n instance_uuids, return_alternates=True)\n File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 716, in _schedule_instances\n return_alternates=return_alternates)\n File "/usr/lib/python2.7/site-packages/nova/scheduler/utils.py", line 726, in wrapped\n return func(*args, **kwargs)\n File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 53, in select_destinations\n instance_uuids, return_objects, return_alternates)\n File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 37, in __run_method\n return getattr(self.instance, __name)(*args, **kwargs)\n File "/usr/lib/python2.7/site-packages/nova/scheduler/client/query.py", line 42, in select_destinations\n instance_uuids, return_objects, return_alternates)\n File "/usr/lib/python2.7/site-packages/nova/scheduler/rpcapi.py", line 158, in select_destinations\n return cctxt.call(ctxt, \'select_destinations\', **msg_args)\n File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 174, in call\n retry=self.retry)\n File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 131, in _send\n timeout=timeout, retry=retry)\n File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 559, in send\n retry=retry)\n File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 550, in _send\n raise result\n', u'created': u'2018-07-27T09:55:37Z'}: ComputeBuildException: Failed to build compute instance due to: {u'message': u'No valid host was found. There are not enough hosts available.', u'code': 500, u'details': u' File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 1116, in schedule_and_build_instances\n instance_uuids, return_alternates=True)\n File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 716, in _schedule_instances\n return_alternates=return_alternates)\n File "/usr/lib/python2.7/site-packages/nova/scheduler/utils.py", line 726, in wrapped\n return func(*args, **kwargs)\n File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 53, in select_destinations\n instance_uuids, return_objects, return_alternates)\n File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 37, in __run_method\n return getattr(self.instance, __name)(*args, **kwargs)\n File "/usr/lib/python2.7/site-packages/nova/scheduler/client/query.py", line 42, in select_destinations\n instance_uuids, return_objects, return_alternates)\n File "/usr/lib/python2.7/site-packages/nova/scheduler/rpcapi.py", line 158, in select_destinations\n return cctxt.call(ctxt, \'select_destinations\', **msg_args)\n File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 174, in call\n retry=self.retry)\n File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 131, in _send\n timeout=timeout, retry=retry)\n File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 559, in send\n retry=retry)\n File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 550, in _send\n raise result\n', u'created': u'2018-07-27T09:55:37Z'}
2018-07-27 05:55:52.917 22 INFO octavia.controller.healthmanager.health_manager [-] Attempted 5 failovers of amphora
2018-07-27 05:55:52.917 22 INFO octavia.controller.healthmanager.health_manager [-] Failed at 5 failovers of amphora

This is probably caused by the compute nodes not being available at that moment, but Octavia does not retry to create the amphorae afterwards.
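As a sanity check (an assumed recovery step, not part of the original report), the compute services can be confirmed as back up before the failovers are retried:

$ openstack compute service list --service nova-compute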
More info: after the cluster is rebooted and active, if you restore the "amphora" table inside the "octavia" database, the load balancers are deployed again and after a while they are working.
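For reference, the amphora.sql dump used in the restore below can be taken with a standard mysqldump before the failed failover overwrites the records; the host and credentials are placeholders, as in the commands further down:

ssh heat-admin.2.10 "mysqldump -uoctavia -pXXX octavia amphora" > amphora.sql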
*** Bug 1609063 has been marked as a duplicate of this bug. ***
Right, Octavia triggered a failover for the amphorae hosted on compute nodes that were still down when you brought the cluster back online, but since there were not enough compute resources at failover time ("No valid host was found", and considering your comment #1), it failed to do so, and that is valid behavior. Could you explain what you mean by restoring the amphora table inside the octavia DB?
echo "delete from amphora;" | ssh heat-admin.2.10 "mysql -uoctavia -pXXX octavia" cat amphora.sql| ssh heat-admin.2.10 "mysql -uoctavia -pXX octavia" openstack loadbalancer list -c id -f value|xargs -i openstack loadbalancer failover {} So when the cluster is UP again and tries the failover and it fails, the original records are lost and you can not fix it anymore.
Why have you restored the amphora records, i.e. deleted all info from the amphora table? You should not have. Deleting data from the database as a means of repair is not the way to go. The expected recovery step after restoring the compute nodes is to trigger a failover, either on a set of amphorae ("$ openstack loadbalancer amphora failover <ID>") or on the load balancers ("$ openstack loadbalancer failover <ID>") in ERROR state. If that fails, it is a legitimate bug that should be fixed.
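A sketch of that recovery flow; only the two failover commands above come from this discussion, the status filtering via awk is illustrative:

openstack loadbalancer amphora list -c id -c status -f value \
  | awk '$2 == "ERROR" {print $1}' \
  | xargs -n1 openstack loadbalancer amphora failover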
The records in the amphora table are marked as DELETED when Octavia tries the failover while the compute nodes are down. After that it is not possible to run "openstack loadbalancer amphora failover <ID>", which is why the table needs to be restored with the status set back to ALLOCATED.
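For illustration only, a sketch of inspecting that state (and, as a last resort, flipping it back) directly in the database; the column names are assumed from the octavia schema and the host/credentials are placeholders as above:

echo "select id, load_balancer_id, status from amphora;" | ssh heat-admin.2.10 "mysql -uoctavia -pXXX octavia"
echo "update amphora set status='ALLOCATED' where status='DELETED';" | ssh heat-admin.2.10 "mysql -uoctavia -pXXX octavia"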
*** This bug has been marked as a duplicate of bug 1623146 ***