Description of problem:

It was observed that when an Octavia load balancer is configured in Active/Backup topology, Octavia is capable of performing the failover action only twice. Inside the amphora VMs, the two originally created amphorae have health_manager/controller_ip_port_list properly configured, while any newly instantiated ones have the value empty.

controller_ip_port_list is currently configured under "/var/lib/config-data/puppet-generated/octavia/etc/octavia/conf.d/octavia-worker/worker-post-deploy.conf". In the main config file (/var/lib/config-data/puppet-generated/octavia/etc/octavia/octavia.conf) controller_ip_port_list is commented out.

The result is pretty bad and makes the whole A/B topology unusable: any state change inside the amphora VM (service death, OOM, kernel panic, etc.) or affecting the amphora itself (VM stopped, deleted, or suspended; Neutron port deleted or administratively shut down; etc.) is neither reported to nor detected by the health manager. The health manager is not even aware of the new amphorae and will never take any fencing actions.

From tripleo-common analysis, it seems a pretty quick fix: [1] and [2] should point to "{{octavia_confd_prefix}}/etc/octavia/octavia.conf".

[1] https://github.com/openstack/tripleo-common/blob/stable/queens/playbooks/roles/octavia-controller-post-config/tasks/main.yml#L14
[2] https://github.com/openstack/tripleo-common/blob/stable/queens/playbooks/roles/octavia-controller-post-config/tasks/main.yml#L37

Version-Release number of selected component (if applicable):
OSP13z6

How reproducible:
100%

Steps to Reproduce:
1. Create an A/B LB topology
2. Configure the members
3. Delete either the backup or the master LB amphora
4. Wait for the LB to go back into a fully healthy state
5. Delete the other original LB amphora (if the backup was deleted, remove the master)
6. Wait for the LB to go back into a fully healthy state
7. Delete any of the new LB amphorae
8. At this point, the cluster will never recover

Actual results:
The LB cluster is completely down; no corrective actions are taken.

Expected results:
A new amphora is spawned.

Additional info:
With Octavia's config file properly fixed, the failover workflow has been successfully tested more than 10 times.
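The failure mode described above can be illustrated with a minimal sketch: when controller_ip_port_list is left commented out in the config an amphora actually reads, it resolves to an empty endpoint list, so the amphora has nowhere to send heartbeats. This uses plain configparser as a stand-in for oslo.config, and the helper function is illustrative, not Octavia code; the sample values mirror those quoted later in this report.

```python
from configparser import ConfigParser
from io import StringIO

# Simulated contents of the two config states described in the report:
# the puppet-generated octavia.conf leaves the option commented out...
OCTAVIA_CONF = """
[health_manager]
# controller_ip_port_list =
"""

# ...while a correctly rendered amphora config carries the real controller list.
AMPHORA_CONF = """
[health_manager]
controller_ip_port_list = 172.24.0.19:5555, 172.24.0.14:5555
"""

def heartbeat_targets(conf_text):
    """Return the parsed controller endpoints, or [] if the option is unset."""
    cp = ConfigParser()
    cp.read_file(StringIO(conf_text))
    raw = cp.get("health_manager", "controller_ip_port_list", fallback="")
    return [hp.strip() for hp in raw.split(",") if hp.strip()]

# A broken amphora has no targets, so the health manager never hears from it
# and can never take a fencing/failover action on its behalf.
print(heartbeat_targets(OCTAVIA_CONF))   # []
print(heartbeat_targets(AMPHORA_CONF))   # ['172.24.0.19:5555', '172.24.0.14:5555']
```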
This BZ also uncovers another Octavia limitation in the health manager: if an amphora is unable to initially "register" with it, no fencing actions will ever be taken. Consider, for example, any compute-node failure (hardware, network, kernel panic, etc.) occurring while a new amphora is spawning.
Active/Standby topology is currently not supported in OSP; we will look at this bug on a best-effort basis.
(In reply to Gregory Thiemonge from comment #2)
> Active/Standby topology is currently not supported in OSP; we will look at
> this bug on a best-effort basis.

I understand the A/B topology is not supported, but this wrong configuration adversely affects every health_manager feature; A/B is just one of them.
We agree that this is an issue even without Active/Standby: the health manager processes have the wrong controller_ip_port_list configuration. Brent, can you comment on worker-post-deploy.conf?
The reason we use a configuration file other than octavia.conf is that the value of controller_ip_port_list isn't known at the time the puppet-generated configuration is created. controller_ip_port_list is generated through ansible later in the deployment, after the basic service configuration is complete and the core OpenStack services are running and available. If we were to "feed the configuration" back into the puppet-generated configuration, it would interfere with certain configuration/container lifecycle management mechanisms and would also risk being overwritten.

So, to clarify: is the problem that the amphorae do not get the correct list rather than that the Octavia services have the wrong info? Or do both the services and the amphorae have an incorrect list?
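The layering described above can be sketched in a few lines: oslo.config applies repeated --config-file arguments in order, so a post-deploy.conf rendered late in the deployment can supply a value that the puppet-generated octavia.conf leaves unset, without touching the base file. This sketch uses plain configparser to mirror that behavior; the merge helper and the heartbeat_key value are illustrative, not TripleO or Octavia code.

```python
from configparser import ConfigParser
from io import StringIO

# Puppet-generated base config: the option is still commented out because the
# controller IPs are not yet known at this stage of the deployment.
BASE = """
[health_manager]
heartbeat_key = insecure
# controller_ip_port_list =
"""

# Rendered later by ansible, once the controllers are up and reachable.
POST_DEPLOY = """
[health_manager]
controller_ip_port_list = 172.24.0.19:5555, 172.24.0.14:5555, 172.24.0.6:5555
"""

def merged(*sources):
    """Read config fragments in order; later sources override earlier ones,
    mirroring how repeated --config-file arguments are applied."""
    cp = ConfigParser()
    for text in sources:
        cp.read_file(StringIO(text))
    return cp

cp = merged(BASE, POST_DEPLOY)
print(cp.get("health_manager", "controller_ip_port_list"))
print(cp.get("health_manager", "heartbeat_key"))  # earlier values survive the merge
```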
The way I read the original information, not all of the controller services have the complete controller_ip_port_list, so if one of the other controllers provisions an amphora (via failover or otherwise), the amphora ends up with a missing or incomplete list. Specifically, in the example, one of the health manager processes does not have the correct controller_ip_port_list in its octavia.conf.
Patches proposed upstream and under review.
Hi,

We are having the same issue (RH OSP 13). We deployed Octavia and, in order to run some resiliency tests, I created load balancers and then ran 'openstack server stop' on the amphora instance. Sometimes the Octavia health manager detects the problem and rebuilds the amphora, but most of the time Octavia does nothing and the amphora stays in SHUTOFF state forever.

In fact, the failover works only once: after creating a brand new load balancer I shut off the amphora and it triggered a failover within two minutes, but if I stop the freshly recreated amphora, nothing happens.
*** Bug 1773582 has been marked as a duplicate of this bug. ***
I have verified that all of the Octavia controller processes are now configured with the proper controller_ip_port_list setting.

For each controller in the deployment (3 here), I checked the configuration files being used by the Octavia processes:

[root@controller-0 heat-admin]# docker exec 23bcc3e5ebf1 ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
octavia      1     0  0 14:05 ?        00:00:00 dumb-init --single-child -- kolla_start
octavia      8     1  0 14:05 ?        00:00:01 /usr/bin/python2 /usr/bin/octavia-health-manager --config-file /usr/share/octavia/octavia-dist.conf --config-file /etc/octavia/octavia.conf --log-file /var/log/octavia/health-manager.log --config-file /etc/octavia/post-deploy.conf --config-dir /etc/octavia/conf.d/octavia-health-manager
octavia     25     8  0 14:05 ?        00:00:16 /usr/bin/python2 /usr/bin/octavia-health-manager --config-file /usr/share/octavia/octavia-dist.conf --config-file /etc/octavia/octavia.conf --log-file /var/log/octavia/health-manager.log --config-file /etc/octavia/post-deploy.conf --config-dir /etc/octavia/conf.d/octavia-health-manager
octavia     26     8  0 14:05 ?        00:02:03 /usr/bin/python2 /usr/bin/octavia-health-manager --config-file /usr/share/octavia/octavia-dist.conf --config-file /etc/octavia/octavia.conf --log-file /var/log/octavia/health-manager.log --config-file /etc/octavia/post-deploy.conf --config-dir /etc/octavia/conf.d/octavia-health-manager
octavia   5451     0  0 18:09 ?        00:00:00 ps -ef

[root@controller-0 heat-admin]# docker exec b03de348778d ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
octavia      1     0  0 14:05 ?        00:00:00 dumb-init --single-child -- kolla_start
octavia      9     1  0 14:05 ?        00:01:41 /usr/bin/python2 /usr/bin/octavia-housekeeping --config-file /usr/share/octavia/octavia-dist.conf --config-file /etc/octavia/octavia.conf --log-file /var/log/octavia/housekeeping.log --config-file /etc/octavia/post-deploy.conf --config-dir /etc/octavia/conf.d/octavia-housekeeping
octavia   5466     0  0 18:10 ?        00:00:00 ps -ef

[root@controller-0 heat-admin]# docker exec 08fdab77ec65 ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
octavia      1     0  0 14:05 ?        00:00:00 dumb-init --single-child -- kolla_start
octavia      8     1  1 14:05 ?        00:04:33 octavia-worker: master process [/usr/bin/octavia-worker --config-file /usr/share/octavia/octavia-dist.conf --config-file /etc/octavia/octavia.conf --log-file /var/log/octavia/worker.log --config-file /etc/octavia/post-deploy.conf --config-dir /etc/octavia/conf.d/octavia-worker]
octavia     25     8  0 14:05 ?        00:00:09 octavia-worker: ConsumerService worker(0)
octavia   5470     0  0 18:10 ?        00:00:00 ps -ef

This validates that the "--config-file /etc/octavia/post-deploy.conf" parameter is present.

Then for each Octavia controller process I inspected the post-deploy.conf file to validate the proper value for controller_ip_port_list:

[root@controller-0 heat-admin]# docker exec 23bcc3e5ebf1 grep -r controller_ip_port_list /etc/octavia
/etc/octavia/octavia.conf:# controller_ip_port_list example: 127.0.0.1:5555, 127.0.0.1:5555
/etc/octavia/octavia.conf:# controller_ip_port_list =
/etc/octavia/post-deploy.conf:controller_ip_port_list = 172.24.0.19:5555, 172.24.0.14:5555, 172.24.0.6:5555
[root@controller-0 heat-admin]# docker exec b03de348778d grep -r controller_ip_port_list /etc/octavia
/etc/octavia/octavia.conf:# controller_ip_port_list example: 127.0.0.1:5555, 127.0.0.1:5555
/etc/octavia/octavia.conf:# controller_ip_port_list =
/etc/octavia/post-deploy.conf:controller_ip_port_list = 172.24.0.19:5555, 172.24.0.14:5555, 172.24.0.6:5555
[root@controller-0 heat-admin]# docker exec 08fdab77ec65 grep -r controller_ip_port_list /etc/octavia
/etc/octavia/octavia.conf:# controller_ip_port_list example: 127.0.0.1:5555, 127.0.0.1:5555
/etc/octavia/octavia.conf:# controller_ip_port_list =
/etc/octavia/post-deploy.conf:controller_ip_port_list = 172.24.0.19:5555, 172.24.0.14:5555, 172.24.0.6:5555

All three processes on all three controllers showed the correct setting for controller_ip_port_list.
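The verification above boils down to one check per process: the effective controller_ip_port_list must be a non-empty, well-formed list of ip:port endpoints. A hypothetical validator for that check (not part of Octavia; the sample values are the ones observed on the verified deployment):

```python
import re

# Matches a single dotted-quad IPv4 endpoint such as 172.24.0.19:5555.
ENDPOINT = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}:\d{1,5}$")

def validate_port_list(raw):
    """Return True if raw is a non-empty comma-separated list of ip:port entries."""
    entries = [e.strip() for e in raw.split(",") if e.strip()]
    return bool(entries) and all(ENDPOINT.match(e) for e in entries)

# Value observed in post-deploy.conf on the verified deployment:
print(validate_port_list("172.24.0.19:5555, 172.24.0.14:5555, 172.24.0.6:5555"))  # True
# The broken case: the commented-out option reads back as an empty string.
print(validate_port_list(""))  # False
```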
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0760
*** Bug 1834588 has been marked as a duplicate of this bug. ***