Bug 2063031

Summary: Octavia services might be killed by systemd on update
Product: Red Hat OpenStack Reporter: Gregory Thiemonge <gthiemon>
Component: openstack-tripleo-heat-templates    Assignee: Gregory Thiemonge <gthiemon>
Status: CLOSED ERRATA QA Contact: Bruna Bonguardo <bbonguar>
Severity: medium Docs Contact:
Priority: medium    
Version: 16.2 (Train)    CC: astupnik, jelynch, joflynn, lpeer, majopela, mburns, michjohn, oschwart, scohen, tmicheli
Target Milestone: z9    Keywords: Triaged
Target Release: 16.1 (Train on RHEL 8.2)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-11.3.2-1.20220901133902.29a02c1.el8ost Doc Type: Bug Fix
Doc Text:
Before this update, systemd killed the Load-balancing service (octavia) processes that exceeded the shutdown grace period, leaving resources in the PENDING_UPDATE status. With this update, the graceful shutdown duration of the Load-balancing services is increased, which prevents the services from being killed by systemd.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-12-07 20:29:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Gregory Thiemonge 2022-03-11 07:36:42 UTC
Description of problem:
Reported in BZ2057604

The Octavia health-manager service was killed during an update.

The Octavia ansible playbook restarts the services on configuration changes.
A 300-second timeout is defined as the grace period for the three Octavia controller services (worker, health-manager, housekeeping):
https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/deployment/octavia/octavia-health-manager-container-puppet.yaml#L177

When they receive a SIGTERM, the Octavia services must complete their current tasks and then exit gracefully.
This means that if a service has not exited gracefully after 300 seconds, systemd kills it.
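
As a quick sanity check, the effective grace period can be read on a controller node. This is a hedged sketch: it assumes the container is named octavia_health_manager (as in comment 7) and, on OSP 16, is wrapped by a tripleo_octavia_health_manager systemd unit; the unit name and the exact inspect field may differ between releases.

# Grace period (in seconds) recorded on the container itself
sudo podman inspect octavia_health_manager | grep -i stoptimeout

# Stop timeout of the wrapping systemd unit, if one exists
systemctl show tripleo_octavia_health_manager | grep -i timeoutstop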

In the original report, a network outage triggered load balancer failovers in Octavia (health-manager). A failover should take less than 2 minutes, but because of the outage it might have taken up to 10 minutes; the service was killed before the failovers completed, leaving Octavia resources in incorrect states.

The longest task in Octavia can take up to 10 minutes to complete or to fail (https://opendev.org/openstack/octavia/commit/34edb58c12f64f2e62c56b6e3cd9f71de6c6ef2e), so the THT stop_grace_period should not be less than 10 minutes.
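
To confirm which grace period the installed templates define, one can grep the THT tree on the undercloud (assuming the default install path /usr/share/openstack-tripleo-heat-templates):

grep -rn stop_grace_period /usr/share/openstack-tripleo-heat-templates/deployment/octavia/

Templates carrying the fix described here are expected to report a value of at least 600 (10 minutes) for the worker, health-manager and housekeeping services.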


Version-Release number of selected component (if applicable):
16.2 (also 16.1 and 17)

How reproducible:


Steps to Reproduce:
1. Deploy OSP with Octavia
2. Create a LB
3. Trigger a network outage
4. Wait for a failover to start
5. Restart the health-manager service
6. When the health-manager service is back up, check the provisioning_status of the Octavia resources (see the sketch below)
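
For illustration, a condensed sketch of steps 3-6 follows (the LB name lb1 and the port ID are placeholders; comment 7 below shows an actual run):

# 3. Simulate a network outage by disabling the amphora management port
openstack loadbalancer amphora list
openstack port list | grep <lb_network_ip>
openstack port set --disable <amphora-mgmt-port-id>

# 4./5. Once a failover starts, restart the health-manager container
ssh controller-0.ctlplane sudo podman restart octavia_health_manager

# 6. Verify the provisioning_status once the service is back up
openstack loadbalancer show lb1 -c provisioning_status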

Actual results:


Expected results:
The Octavia services should always be restarted/stopped gracefully.
Resources should not be stuck in a PENDING_* state after a restart (a quick check follows).
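
A simple way to check for stuck resources after a restart, assuming the provisioning-status filter of python-octaviaclient is available in this release:

openstack loadbalancer list --provisioning-status PENDING_UPDATE

An empty result (together with ACTIVE load balancers in a plain listing) indicates the graceful restart worked as expected.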

Additional info:

Comment 7 Omer Schwartz 2022-11-15 16:51:00 UTC
(overcloud) [stack@undercloud-0 ~]$ cat core_puddle_version 
RHOS-16.1-RHEL-8-20221108.n.1

# Deploying a loadbalancer
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer create --vip-subnet-id external_subnet --name lb1
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| admin_state_up      | True                                 |
| created_at          | 2022-11-15T15:49:11                  |
| description         |                                      |
| flavor_id           | None                                 |
| id                  | bcb8baee-cf82-4dc1-ab5e-0ad8ab3d6e39 |
| listeners           |                                      |
| name                | lb1                                  |
| operating_status    | OFFLINE                              |
| pools               |                                      |
| project_id          | 59cbafe7d046473199e180e5960521cf     |
| provider            | amphora                              |
| provisioning_status | PENDING_CREATE                       |
| updated_at          | None                                 |
| vip_address         | 10.0.0.166                           |
| vip_network_id      | f53231d3-cd2d-4ed5-bfe8-247ee1b01101 |
| vip_port_id         | 96344fc7-c56f-48ca-b903-ee82be2767ab |
| vip_qos_policy_id   | None                                 |
| vip_subnet_id       | 9a81e66a-d7eb-4ae3-8047-17a8866066ac |
+---------------------+--------------------------------------+


# Simulating a network outage - I find the amphora's mgmt port and disable it
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+------------+---------------+------------+
| id                                   | loadbalancer_id                      | status    | role       | lb_network_ip | ha_ip      |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+------------+
| fdf5b462-fb0d-40db-83e5-2766193b2d10 | bcb8baee-cf82-4dc1-ab5e-0ad8ab3d6e39 | ALLOCATED | STANDALONE | 172.24.2.65   | 10.0.0.166 |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+------------+
(overcloud) [stack@undercloud-0 ~]$ openstack port list | grep 172.24.2.65
| 72a7ea0c-fb2d-4545-8c24-98c205dc1d5a |                                                              | fa:16:3e:f1:38:eb | ip_address='172.24.2.65', subnet_id='222bc0d2-bfc2-4747-a34b-33113595a2f3'  | ACTIVE |
(overcloud) [stack@undercloud-0 ~]$ openstack port set --disable 72a7ea0c-fb2d-4545-8c24-98c205dc1d5a
(overcloud) [stack@undercloud-0 ~]$


# I waited until the LB's provisioning status changed to PENDING_UPDATE (because of the failover) and then I restarted the health-manager services on all the controllers:
(overcloud) [stack@undercloud-0 ~]$ for controller in {controller-0.ctlplane,controller-1.ctlplane,controller-2.ctlplane}; do ssh "$controller" sudo podman restart octavia_health_manager; done
Warning: Permanently added 'controller-0.ctlplane,192.168.24.13' (ECDSA) to the list of known hosts.
30bf2fa874e78878eb0abd5802087b8a13727b7366fe4d2c73e2bd9b000f2949
Warning: Permanently added 'controller-1.ctlplane,192.168.24.53' (ECDSA) to the list of known hosts.
0ab6f85944dc24f72baaf99c3a6f8b40c9564ccac0ed523a14ea912eef4fe475
Warning: Permanently added 'controller-2.ctlplane,192.168.24.8' (ECDSA) to the list of known hosts.
204d81e0049f88bb31345615eac710210815e6b213e42e30b0943ee8d4019f16

# The Octavia resources are still ACTIVE
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer list
+--------------------------------------+------+----------------------------------+-------------+---------------------+----------+
| id                                   | name | project_id                       | vip_address | provisioning_status | provider |
+--------------------------------------+------+----------------------------------+-------------+---------------------+----------+
| bcb8baee-cf82-4dc1-ab5e-0ad8ab3d6e39 | lb1  | 59cbafe7d046473199e180e5960521cf | 10.0.0.166  | ACTIVE              | amphora  |
+--------------------------------------+------+----------------------------------+-------------+---------------------+----------+

This behavior looks good to me. I am moving the BZ status to VERIFIED.

Comment 16 errata-xmlrpc 2022-12-07 20:29:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenStack 16.1.9 (openstack-tripleo-heat-templates) security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:8796