Bug 2063031 - Octavia services might be killed by systemd on update
Summary: Octavia services might be killed by systemd on update
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z9
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Gregory Thiemonge
QA Contact: Bruna Bonguardo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-11 07:36 UTC by Gregory Thiemonge
Modified: 2022-12-07 20:29 UTC
CC: 10 users

Fixed In Version: openstack-tripleo-heat-templates-11.3.2-1.20220901133902.29a02c1.el8ost
Doc Type: Bug Fix
Doc Text:
Before this update, systemd stopped the Load-balancing services (octavia) during shutdown, leaving resources in the PENDING_UPDATE status. With this update, the graceful shutdown duration of the Load-balancing services is increased, preventing the services from being stopped by systemd.
Clone Of:
Environment:
Last Closed: 2022-12-07 20:29:31 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 833614 0 None MERGED Increase stop_grace_period for Octavia controller services 2022-03-15 13:41:54 UTC
OpenStack gerrit 833968 0 None MERGED Increase stop_grace_period for Octavia controller services 2022-06-22 08:31:06 UTC
Red Hat Issue Tracker OSP-13509 0 None None None 2022-03-11 07:46:05 UTC
Red Hat Product Errata RHSA-2022:8796 0 None None None 2022-12-07 20:29:52 UTC

Description Gregory Thiemonge 2022-03-11 07:36:42 UTC
Description of problem:
Reported in BZ2057604

The Octavia health-manager service was killed during an update.

The Octavia ansible-playbook restarts the services on configuration change.
There is a 300-second timeout defined as the grace period for the three Octavia controller services (worker, health-manager, housekeeping):
https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/deployment/octavia/octavia-health-manager-container-puppet.yaml#L177
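The configured value can also be checked directly in the templates on the undercloud. A minimal sketch, assuming the default THT install path on OSP 16.x (the exact file and line may differ between releases):

$ grep -rn stop_grace_period /usr/share/openstack-tripleo-heat-templates/deployment/octavia/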

When they receive a SIGTERM, the Octavia services must complete their in-flight tasks and then exit gracefully.
This means that if a service has not exited within 300 seconds, systemd kills it.

In the original report, a network outage triggered load balancer failovers in Octavia (health-manager). A failover should take less than 2 minutes, but because of the outage the failovers may have taken up to 10 minutes; the service was killed before the failovers completed, leaving Octavia resources in incorrect states.

The longest task in Octavia can take up to 10 minutes to complete or to fail (https://opendev.org/openstack/octavia/commit/34edb58c12f64f2e62c56b6e3cd9f71de6c6ef2e), so the THT stop_grace_period should not be less than 10 minutes.
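After deploying templates that include the fix, the effective grace period can be double-checked on a controller. A rough sketch, assuming the tripleo-generated systemd unit name tripleo_octavia_health_manager.service and that podman reports the value as StopTimeout; exact names and output may vary by release:

$ sudo podman inspect octavia_health_manager | grep -i stoptimeout
$ sudo systemctl cat tripleo_octavia_health_manager.service | grep -i stop

With the fix, the reported timeout should cover the 10-minute worst case described above.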


Version-Release number of selected component (if applicable):
16.2 (also 16.1 and 17)

How reproducible:


Steps to Reproduce:
1. Deploy OSP with Octavia
2. Create a LB
3. Trigger a network outage
4. Wait for a failover to start
5. Restart the health-manager service
6. When the health-manager service is back up, check the provisioning_status of the Octavia resources (see the commands sketched below)
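
For step 6, one way to check the provisioning status (a sketch using standard python-octaviaclient commands; the load balancer name lb1 is only an example):

$ openstack loadbalancer list -c id -c name -c provisioning_status
$ openstack loadbalancer show lb1 -c provisioning_status -f value
$ openstack loadbalancer amphora list -c id -c status -c loadbalancer_id

None of the resources should be left in a PENDING_* state.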

Actual results:


Expected results:
The Octavia services should always be restarted/stopped gracefully.
Resources should not be stuck in a PENDING_* state after a restart.

Additional info:

Comment 7 Omer Schwartz 2022-11-15 16:51:00 UTC
(overcloud) [stack@undercloud-0 ~]$ cat core_puddle_version 
RHOS-16.1-RHEL-8-20221108.n.1

# Deploying a loadbalancer
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer create --vip-subnet-id external_subnet --name lb1
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| admin_state_up      | True                                 |
| created_at          | 2022-11-15T15:49:11                  |
| description         |                                      |
| flavor_id           | None                                 |
| id                  | bcb8baee-cf82-4dc1-ab5e-0ad8ab3d6e39 |
| listeners           |                                      |
| name                | lb1                                  |
| operating_status    | OFFLINE                              |
| pools               |                                      |
| project_id          | 59cbafe7d046473199e180e5960521cf     |
| provider            | amphora                              |
| provisioning_status | PENDING_CREATE                       |
| updated_at          | None                                 |
| vip_address         | 10.0.0.166                           |
| vip_network_id      | f53231d3-cd2d-4ed5-bfe8-247ee1b01101 |
| vip_port_id         | 96344fc7-c56f-48ca-b903-ee82be2767ab |
| vip_qos_policy_id   | None                                 |
| vip_subnet_id       | 9a81e66a-d7eb-4ae3-8047-17a8866066ac |
+---------------------+--------------------------------------+


# Simulating a network outage - I find the amphora's mgmt port and disable it
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+------------+---------------+------------+
| id                                   | loadbalancer_id                      | status    | role       | lb_network_ip | ha_ip      |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+------------+
| fdf5b462-fb0d-40db-83e5-2766193b2d10 | bcb8baee-cf82-4dc1-ab5e-0ad8ab3d6e39 | ALLOCATED | STANDALONE | 172.24.2.65   | 10.0.0.166 |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+------------+
(overcloud) [stack@undercloud-0 ~]$ openstack port list | grep 172.24.2.65
| 72a7ea0c-fb2d-4545-8c24-98c205dc1d5a |                                                              | fa:16:3e:f1:38:eb | ip_address='172.24.2.65', subnet_id='222bc0d2-bfc2-4747-a34b-33113595a2f3'  | ACTIVE |
(overcloud) [stack@undercloud-0 ~]$ openstack port set --disable 72a7ea0c-fb2d-4545-8c24-98c205dc1d5a
(overcloud) [stack@undercloud-0 ~]$


# I waited until the LB's provisioning status changed to PENDING_UPDATE (because of the failover) and then I restarted the health-manager services in all the controllers:
(overcloud) [stack@undercloud-0 ~]$ for controller in {controller-0.ctlplane,controller-1.ctlplane,controller-2.ctlplane}; do ssh "$controller" sudo podman restart octavia_health_manager; done
Warning: Permanently added 'controller-0.ctlplane,192.168.24.13' (ECDSA) to the list of known hosts.
30bf2fa874e78878eb0abd5802087b8a13727b7366fe4d2c73e2bd9b000f2949
Warning: Permanently added 'controller-1.ctlplane,192.168.24.53' (ECDSA) to the list of known hosts.
0ab6f85944dc24f72baaf99c3a6f8b40c9564ccac0ed523a14ea912eef4fe475
Warning: Permanently added 'controller-2.ctlplane,192.168.24.8' (ECDSA) to the list of known hosts.
204d81e0049f88bb31345615eac710210815e6b213e42e30b0943ee8d4019f16

# The Octavia resources are still ACTIVE
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer list
+--------------------------------------+------+----------------------------------+-------------+---------------------+----------+
| id                                   | name | project_id                       | vip_address | provisioning_status | provider |
+--------------------------------------+------+----------------------------------+-------------+---------------------+----------+
| bcb8baee-cf82-4dc1-ab5e-0ad8ab3d6e39 | lb1  | 59cbafe7d046473199e180e5960521cf | 10.0.0.166  | ACTIVE              | amphora  |
+--------------------------------------+------+----------------------------------+-------------+---------------------+----------+

This behavior looks good to me. I am moving the BZ status to VERIFIED.

Comment 16 errata-xmlrpc 2022-12-07 20:29:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenStack 16.1.9 (openstack-tripleo-heat-templates) security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:8796

