Bug 1709925 - Octavia A/B topology - health_manager/controller_ip_port_list is wrongly configured
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Hardware: All
OS: All
Priority: urgent
Severity: high
Target Milestone: z11
Target Release: 13.0 (Queens)
Assignee: Carlos Goncalves
QA Contact: Michael Johnson
Duplicates: 1773582 1834588
Depends On: 1776912
Blocks: 1732834
 
Reported: 2019-05-14 14:49 UTC by Federico Iezzi
Modified: 2023-10-06 18:18 UTC
CC List: 21 users

Fixed In Version: openstack-tripleo-heat-templates-8.4.1-40.el7ost
Doc Type: Bug Fix
Doc Text:
Before this update, there was an issue where some Octavia controller services were not properly configured. With this update, the issue is resolved.
Last Closed: 2020-03-10 11:18:29 UTC




Links:
Launchpad 1836074 (last updated 2019-08-07 14:11:42 UTC)
OpenStack gerrit 687311: MERGED, "Simplify octavia post deploy configs" (2021-02-19 02:45:36 UTC)
OpenStack gerrit 687864: MERGED, "Simplify octavia post deploy configs" (2021-02-19 02:45:37 UTC)
OpenStack gerrit 689878: MERGED, "Don't rely on crudini for octavia config" (2021-02-19 02:45:38 UTC)
OpenStack gerrit 691935: MERGED, "Simplify octavia post deploy configs" (2021-02-19 02:45:38 UTC)
OpenStack gerrit 691936: MERGED, "Simplify octavia post deploy configs" (2021-02-19 02:45:38 UTC)
OpenStack gerrit 701549: MERGED, "Octavia: set selinux contexts on ansible generated configuration" (2021-02-19 02:45:38 UTC)
Red Hat Issue Tracker OSP-23731 (last updated 2023-03-24 16:34:36 UTC)

Internal Links: 1698576

Description Federico Iezzi 2019-05-14 14:49:55 UTC
Description of problem:
It was observed that when an Octavia load balancer is configured in Active/Backup topology, Octavia is only able to perform the failover action twice.
Inside the amphora VMs, the two originally created amphorae have health_manager/controller_ip_port_list properly configured, while any newly instantiated ones have an empty value.

controller_ip_port_list is currently configured under "/var/lib/config-data/puppet-generated/octavia/etc/octavia/conf.d/octavia-worker/worker-post-deploy.conf"
In the main config file (/var/lib/config-data/puppet-generated/octavia/etc/octavia/octavia.conf) controller_ip_port_list is commented out.
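
For reference, a minimal sketch of how the option looks in that post-deploy file; the section name is the real [health_manager] section, but the addresses are illustrative placeholders, not values from this deployment:

[health_manager]
# Comma-separated IP:port list of all controllers running octavia-health-manager
# on the load balancer management network (placeholder addresses, default port 5555).
controller_ip_port_list = 192.0.2.11:5555, 192.0.2.12:5555, 192.0.2.13:5555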

The result is severe and makes the whole A/B topology unusable: any state change inside the amphora VM (service death, OOM, kernel panic, etc.) or affecting the amphora itself (VM stopped, deleted, or suspended; Neutron port deleted or administratively shut down; etc.) is neither reported to nor detected by the health manager.
The health manager is not even aware of the new amphorae, so it will never take any fencing action.

From an analysis of tripleo-common, it looks like a fairly quick fix: [1] and [2] should point to "{{octavia_confd_prefix}}/etc/octavia/octavia.conf".

[1] https://github.com/openstack/tripleo-common/blob/stable/queens/playbooks/roles/octavia-controller-post-config/tasks/main.yml#L14
[2] https://github.com/openstack/tripleo-common/blob/stable/queens/playbooks/roles/octavia-controller-post-config/tasks/main.yml#L37
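
To illustrate the suggested change, here is a hypothetical Ansible sketch (not the actual tripleo-common task) of an ini_file task pointed at the main octavia.conf; octavia_confd_prefix is the variable mentioned above, while the task name and the o_hm_ip_port_list variable are assumed placeholders:

- name: Set controller_ip_port_list in the main Octavia configuration
  ini_file:
    path: "{{ octavia_confd_prefix }}/etc/octavia/octavia.conf"
    section: health_manager
    option: controller_ip_port_list
    value: "{{ o_hm_ip_port_list }}"  # assumed variable holding the comma-separated HM endpoints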

Version-Release number of selected component (if applicable):
OSP13z6

How reproducible:
100%

Steps to Reproduce:
1. Create an A/B LB topology
2. Configure the members
3. Delete either the backup or the master amphora of the LB
4. Wait for the LB to go back into a fully healthy state
5. Delete the other original LB amphora (if backup was deleted, then remove the master)
6. Wait for the LB to go back into a fully healthy state
7. Delete either of the new LB amphorae
8. At this point, the cluster will never recover
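
For completeness, a hedged CLI sketch of the procedure above; the names, subnet, and IDs are placeholders, and the ACTIVE_STANDBY topology is assumed to come from the Octavia configuration ([controller_worker]/loadbalancer_topology) rather than a CLI flag:

# Steps 1-2: create the load balancer (listener/pool/member creation omitted for brevity).
openstack loadbalancer create --name test-lb --vip-subnet-id private-subnet
# Steps 3/5/7: find the amphora VMs and delete one to simulate its loss. If the
# 'loadbalancer amphora' subcommand is unavailable in this client version, the
# VMs can also be found with 'openstack server list --all-projects'.
openstack loadbalancer amphora list
openstack server delete <amphora-vm-id>
# Steps 4/6/8: watch the load balancer; it should return to ACTIVE/ONLINE after each failover.
openstack loadbalancer show test-lb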

Actual results:
The LB cluster is completely down and no corrective action is taken

Expected results:
A new amphora is spawned

Additional info:
With Octavia's config file properly fixed, the failover workflow was successfully tested more than 10 times.

Comment 1 Federico Iezzi 2019-05-14 14:54:09 UTC
This BZ also uncovers another Octavia health manager limitation: if an amphora is unable to initially "register" with the health manager, no fencing action will ever be taken.
For example, consider any type of compute node failure (hardware, network, kernel panic, etc.) while a new amphora is spawning.

Comment 2 Gregory Thiemonge 2019-05-22 14:17:55 UTC
Active/Standby topology is currently not supported in OSP; we will look into this bug on a best-effort basis.

Comment 3 Federico Iezzi 2019-06-04 08:31:02 UTC
(In reply to Gregory Thiemonge from comment #2)
> Active/Standby topology is currently not supported in OSP; we will look into
> this bug on a best-effort basis.

I understand that A/B topology is not supported, but this incorrect configuration adversely affects every health_manager feature.
A/B is just one of them.

Comment 4 Michael Johnson 2019-06-05 14:21:25 UTC
We agree that this is an issue even without active/standby. The issue is that the health manager processes have the wrong controller_ip_port_list configuration.

Brent, can you comment on the worker-post-deploy.conf?

Comment 5 Brent Eagles 2019-06-19 14:41:30 UTC
The reason we use a configuration file other than octavia.conf is that the value for controller_ip_port_list isn't known at the time the puppet-generated configuration is produced. The controller_ip_port_list is generated through Ansible later in the deployment, after the basic service configuration is complete and the core OpenStack services are running and available. If we were to "feed the configuration" back into the puppet-generated configuration, it would interfere with certain configuration/container lifecycle management mechanisms and would also risk being overwritten.
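
For context, a minimal sketch of the layering at work, based on the service command line shown later in comment 37: oslo.config gives precedence to options from config files passed later on the command line, so a value set in the post-deploy file takes effect even though octavia.conf leaves it commented out.

# Later --config-file arguments override earlier ones for the same option,
# so controller_ip_port_list from post-deploy.conf wins over octavia.conf.
octavia-health-manager \
    --config-file /usr/share/octavia/octavia-dist.conf \
    --config-file /etc/octavia/octavia.conf \
    --config-file /etc/octavia/post-deploy.conf \
    --config-dir /etc/octavia/conf.d/octavia-health-manager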

So, to clarify: is the problem not that the octavia services have the wrong info, but that the amphorae do not get the correct list? Or is it that both the services and the amphorae do not have the correct list?

Comment 6 Michael Johnson 2019-06-26 14:16:35 UTC
The way I read the original report, not all of the controller services have the complete controller_ip_port_list, so if one of the other controllers provisions an amphora (via failover or otherwise), that amphora will have a missing or incomplete list configured. Specifically, in the example, one of the health manager processes does not have the correct controller_ip_port_list in its octavia.conf.

Comment 12 Carlos Goncalves 2019-10-10 11:35:52 UTC
Patches proposed upstream and under review.

Comment 19 Laurent Serot 2019-10-29 14:59:57 UTC
Hi 
Having the same issue (RH OSP 13):
We deployed Octavia and, in order to do some resiliency tests, I create load balancers and then do an 'openstack server stop' on the amphora instance.
Sometimes the Octavia health manager detects the problem and rebuilds the amphora, but most of the time Octavia does nothing and the amphora stays in the shutoff state forever.
In fact, the failover works only once: after creating a brand new load balancer I shut off the amphora and it triggered a failover within two minutes; however, if I stop the freshly recreated amphora, nothing happens.

Comment 20 Brian Haley 2019-11-20 15:59:08 UTC
*** Bug 1773582 has been marked as a duplicate of this bug. ***

Comment 37 Michael Johnson 2020-02-18 18:22:00 UTC
I have verified that all of the Octavia controller processes are now configured with the proper controller_ip_port_list setting.

For each controller in the deployment (3 here), I checked the configuration files being used by the Octavia processes:

[root@controller-0 heat-admin]# docker exec 23bcc3e5ebf1 ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
octavia        1       0  0 14:05 ?        00:00:00 dumb-init --single-child -- kolla_start
octavia        8       1  0 14:05 ?        00:00:01 /usr/bin/python2 /usr/bin/octavia-health-manager --config-file /usr/share/octavia/octavia-dist.conf --config-file /etc/octavia/octavia.conf --log-file /var/log/octavia/health-manager.log --config-file /etc/octavia/post-deploy.conf --config-dir /etc/octavia/conf.d/octavia-health-manager
octavia       25       8  0 14:05 ?        00:00:16 /usr/bin/python2 /usr/bin/octavia-health-manager --config-file /usr/share/octavia/octavia-dist.conf --config-file /etc/octavia/octavia.conf --log-file /var/log/octavia/health-manager.log --config-file /etc/octavia/post-deploy.conf --config-dir /etc/octavia/conf.d/octavia-health-manager
octavia       26       8  0 14:05 ?        00:02:03 /usr/bin/python2 /usr/bin/octavia-health-manager --config-file /usr/share/octavia/octavia-dist.conf --config-file /etc/octavia/octavia.conf --log-file /var/log/octavia/health-manager.log --config-file /etc/octavia/post-deploy.conf --config-dir /etc/octavia/conf.d/octavia-health-manager
octavia     5451       0  0 18:09 ?        00:00:00 ps -ef
[root@controller-0 heat-admin]# docker exec b03de348778d ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
octavia        1       0  0 14:05 ?        00:00:00 dumb-init --single-child -- kolla_start
octavia        9       1  0 14:05 ?        00:01:41 /usr/bin/python2 /usr/bin/octavia-housekeeping --config-file /usr/share/octavia/octavia-dist.conf --config-file /etc/octavia/octavia.conf --log-file /var/log/octavia/housekeeping.log --config-file /etc/octavia/post-deploy.conf --config-dir /etc/octavia/conf.d/octavia-housekeeping
octavia     5466       0  0 18:10 ?        00:00:00 ps -ef
[root@controller-0 heat-admin]# docker exec 08fdab77ec65 ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
octavia        1       0  0 14:05 ?        00:00:00 dumb-init --single-child -- kolla_start
octavia        8       1  1 14:05 ?        00:04:33 octavia-worker: master process [/usr/bin/octavia-worker --config-file /usr/share/octavia/octavia-dist.conf --config-file /etc/octavia/octavia.conf --log-file /var/log/octavia/worker.log --config-file /etc/octavia/post-deploy.conf --config-dir /etc/octavia/conf.d/octavia-worker]
octavia       25       8  0 14:05 ?        00:00:09 octavia-worker: ConsumerService worker(0)
octavia     5470       0  0 18:10 ?        00:00:00 ps -ef

This validates that the "--config-file /etc/octavia/post-deploy.conf" parameter is present.

Then for each Octavia controller process I inspected the post-deploy.conf file to validate the proper values for controller_ip_port_list:

[root@controller-0 heat-admin]# docker exec 23bcc3e5ebf1 grep -r controller_ip_port_list /etc/octavia
/etc/octavia/octavia.conf:# controller_ip_port_list example: 127.0.0.1:5555, 127.0.0.1:5555
/etc/octavia/octavia.conf:# controller_ip_port_list =
/etc/octavia/post-deploy.conf:controller_ip_port_list = 172.24.0.19:5555, 172.24.0.14:5555, 172.24.0.6:5555
[root@controller-0 heat-admin]# docker exec b03de348778d grep -r controller_ip_port_list /etc/octavia
/etc/octavia/octavia.conf:# controller_ip_port_list example: 127.0.0.1:5555, 127.0.0.1:5555
/etc/octavia/octavia.conf:# controller_ip_port_list =
/etc/octavia/post-deploy.conf:controller_ip_port_list = 172.24.0.19:5555, 172.24.0.14:5555, 172.24.0.6:5555
[root@controller-0 heat-admin]# docker exec 08fdab77ec65 grep -r controller_ip_port_list /etc/octavia
/etc/octavia/octavia.conf:# controller_ip_port_list example: 127.0.0.1:5555, 127.0.0.1:5555
/etc/octavia/octavia.conf:# controller_ip_port_list =
/etc/octavia/post-deploy.conf:controller_ip_port_list = 172.24.0.19:5555, 172.24.0.14:5555, 172.24.0.6:5555

All three processes in all three controllers showed the correct settings for controller_ip_port_list.
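
As a follow-on check of the amphora side (the original symptom), one could also log into a newly failed-over amphora over the lb-mgmt network and inspect its agent configuration; a hedged sketch, with the SSH key, user, and address as deployment-specific placeholders:

# Key, user, and address are placeholders; /etc/octavia/amphora-agent.conf is the
# standard agent configuration path inside the amphora.
ssh -i <octavia-amphora-ssh-key> <amphora-ssh-user>@<amphora-lb-mgmt-ip> \
    "sudo grep controller_ip_port_list /etc/octavia/amphora-agent.conf"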

Comment 40 errata-xmlrpc 2020-03-10 11:18:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0760

Comment 41 Carlos Goncalves 2020-05-15 14:51:06 UTC
*** Bug 1834588 has been marked as a duplicate of this bug. ***

