Description of problem:

Reported in BZ2057604.

The Octavia health-manager service was killed during an update. The Octavia ansible-playbook restarts the services on configuration change. A 300-second timeout is defined as the grace period for the 3 Octavia controller services (worker, health-manager, housekeeping):
https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/deployment/octavia/octavia-health-manager-container-puppet.yaml#L177

When receiving a SIGTERM, the Octavia services must complete their tasks and then exit gracefully. This means that if a service hasn't exited gracefully after 300 seconds, systemd kills it.

In the original report, a network outage triggered load balancer failovers in Octavia (health-manager). A failover should take less than 2 minutes, but because of the outage it might have taken 10 minutes; the service was killed before the failovers completed, leaving Octavia resources in incorrect states.

The longest task in Octavia can take up to 10 minutes to complete or to fail (https://opendev.org/openstack/octavia/commit/34edb58c12f64f2e62c56b6e3cd9f71de6c6ef2e), so the THT stop_grace_period should not be less than 10 minutes.

Version-Release number of selected component (if applicable):
16.2 (also 16.1 and 17)

How reproducible:


Steps to Reproduce:
1. Deploy OSP with Octavia
2. Create a LB
3. Trigger a network outage
4. Wait for a failover to start
5. Restart the health-manager service
6. When the health-manager service is up, check the provisioning_status of the Octavia resources

Actual results:


Expected results:
The Octavia services should always be restarted/stopped gracefully; resources should not be stuck in a PENDING_* state after a restart.

Additional info:
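For illustration only (the controller prompt and the inspect field are assumptions based on standard podman behavior, not taken from this report): the container's stop timeout is how long podman waits after SIGTERM before sending SIGKILL, and with the current templates it matches the 300-second stop_grace_period. A quick way to see the effective value on a controller node:

# Hypothetical check on a controller node; a value shorter than the longest
# Octavia task (~10 minutes / 600 seconds) risks killing the service mid-failover.
[heat-admin@controller-0 ~]$ sudo podman inspect octavia_health_manager | grep -i stoptimeout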
(overcloud) [stack@undercloud-0 ~]$ cat core_puddle_version
RHOS-16.1-RHEL-8-20221108.n.1

# Deploying a loadbalancer
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer create --vip-subnet-id external_subnet --name lb1
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| admin_state_up      | True                                 |
| created_at          | 2022-11-15T15:49:11                  |
| description         |                                      |
| flavor_id           | None                                 |
| id                  | bcb8baee-cf82-4dc1-ab5e-0ad8ab3d6e39 |
| listeners           |                                      |
| name                | lb1                                  |
| operating_status    | OFFLINE                              |
| pools               |                                      |
| project_id          | 59cbafe7d046473199e180e5960521cf     |
| provider            | amphora                              |
| provisioning_status | PENDING_CREATE                       |
| updated_at          | None                                 |
| vip_address         | 10.0.0.166                           |
| vip_network_id      | f53231d3-cd2d-4ed5-bfe8-247ee1b01101 |
| vip_port_id         | 96344fc7-c56f-48ca-b903-ee82be2767ab |
| vip_qos_policy_id   | None                                 |
| vip_subnet_id       | 9a81e66a-d7eb-4ae3-8047-17a8866066ac |
+---------------------+--------------------------------------+

# Simulating a network outage - I find the amphora's mgmt port and disable it
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+------------+---------------+------------+
| id                                   | loadbalancer_id                      | status    | role       | lb_network_ip | ha_ip      |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+------------+
| fdf5b462-fb0d-40db-83e5-2766193b2d10 | bcb8baee-cf82-4dc1-ab5e-0ad8ab3d6e39 | ALLOCATED | STANDALONE | 172.24.2.65   | 10.0.0.166 |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+------------+
(overcloud) [stack@undercloud-0 ~]$ openstack port list | grep 172.24.2.65
| 72a7ea0c-fb2d-4545-8c24-98c205dc1d5a | | fa:16:3e:f1:38:eb | ip_address='172.24.2.65', subnet_id='222bc0d2-bfc2-4747-a34b-33113595a2f3' | ACTIVE |
(overcloud) [stack@undercloud-0 ~]$ openstack port set --disable 72a7ea0c-fb2d-4545-8c24-98c205dc1d5a

(overcloud) [stack@undercloud-0 ~]$ # I waited until the LB's provisioning status changed to PENDING_UPDATE (because of the failover) and then I restarted the health-manager services in all the controllers:
(overcloud) [stack@undercloud-0 ~]$ for controller in {controller-0.ctlplane,controller-1.ctlplane,controller-2.ctlplane}; do ssh "$controller" sudo podman restart octavia_health_manager; done
Warning: Permanently added 'controller-0.ctlplane,192.168.24.13' (ECDSA) to the list of known hosts.
30bf2fa874e78878eb0abd5802087b8a13727b7366fe4d2c73e2bd9b000f2949
Warning: Permanently added 'controller-1.ctlplane,192.168.24.53' (ECDSA) to the list of known hosts.
0ab6f85944dc24f72baaf99c3a6f8b40c9564ccac0ed523a14ea912eef4fe475
Warning: Permanently added 'controller-2.ctlplane,192.168.24.8' (ECDSA) to the list of known hosts.
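As a hedged sketch of the wait step mentioned above (the one-liner and the 10-second interval are illustrative, not from the report), the failover can be watched by polling the load balancer's provisioning_status until it reaches PENDING_UPDATE:

# Illustrative only: wait for the failover to start before restarting the service
(overcloud) [stack@undercloud-0 ~]$ until [ "$(openstack loadbalancer show lb1 -f value -c provisioning_status)" = "PENDING_UPDATE" ]; do sleep 10; done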
204d81e0049f88bb31345615eac710210815e6b213e42e30b0943ee8d4019f16

# The Octavia resources are still ACTIVE
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer list
+--------------------------------------+------+----------------------------------+-------------+---------------------+----------+
| id                                   | name | project_id                       | vip_address | provisioning_status | provider |
+--------------------------------------+------+----------------------------------+-------------+---------------------+----------+
| bcb8baee-cf82-4dc1-ab5e-0ad8ab3d6e39 | lb1  | 59cbafe7d046473199e180e5960521cf | 10.0.0.166  | ACTIVE              | amphora  |
+--------------------------------------+------+----------------------------------+-------------+---------------------+----------+

This behavior looks good to me. I am moving the BZ status to VERIFIED.
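For completeness, a quick illustrative check (not part of the report) that nothing is left stuck in a PENDING_* provisioning status after the restart; an empty result is the expected outcome:

# Illustrative only: list any load balancer whose provisioning_status is still PENDING_*
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer list -f value -c id -c provisioning_status | grep PENDING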
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenStack 16.1.9 (openstack-tripleo-heat-templates) security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:8796