Description of problem:
During a minor update from 16.1.4 to 16.1.6, a number of containers were restarted as part of the converge step. This caused an outage on the OpenStack workloads. In particular, the restart of tripleo_octavia_health_manager.service took 5 minutes, which caused the converge to fail. This only happened on one controller, not on the remaining 2.

Version-Release number of selected component (if applicable):
OSP update from 16.1.4 to 16.1.6

How reproducible:
Not sure if reproducible outside of customer's environment

Steps to Reproduce:
1.
2.
3.

Actual results:
Containers were restarted during converge, some of them taking a long time to stop and start again, causing the converge step to time out.

Expected results:
I understand containers should not be restarted during the converge step, but if they are, they should definitely not take 5 minutes to complete.

Additional info:
Sosreports from before and after the update are available on the attached case, as well as mistral logs and the content of /var/lib/mistral, plus other troubleshooting files gathered in the last 2 days. Customer case is sev1 and escalated.
Added a commit for tripleo-ansible that prevents the Octavia services from being restarted every time the playbook is run.
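For illustration only (the actual commit is not quoted here), the general idempotency pattern in Ansible looks roughly like the sketch below. The task names, the config file path, and the list of systemd units are assumptions for the example, not the real tripleo-ansible content:

# Minimal sketch, NOT the actual tripleo-ansible change: only restart the
# Octavia units when the rendered configuration actually changed, instead of
# restarting them unconditionally on every playbook run.
# (Paths and unit names below are illustrative assumptions.)
- name: Render octavia configuration
  template:
    src: octavia.conf.j2
    dest: /var/lib/config-data/puppet-generated/octavia/etc/octavia/octavia.conf
  register: octavia_config

- name: Restart octavia services only when the config changed
  systemd:
    name: "{{ item }}"
    state: restarted
  loop:
    - tripleo_octavia_health_manager.service
    - tripleo_octavia_worker.service
  when: octavia_config.changed

The same effect can also be achieved with a handler notified by the template task; either way, the restart only happens when something actually changed rather than on every run.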
In our update job, in a build that ran from 16.1.4 -> 16.1.6, we saw 2 service restarts during the converge stage.

Health manager logs (each restart logs "INFO octavia.common.config [-] /usr/bin/octavia-health-manager version 5.0.3"):
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-octavia-update-16.1_director-rhel-virthost-3cont_2comp_1ipa-ipv4-geneve-tls/97/controller-0/var/log/containers/octavia/health-manager.log.gz

When that step started (2022-11-14 18:44:39.107968) can be seen in the following log:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-octavia-update-16.1_director-rhel-virthost-3cont_2comp_1ipa-ipv4-geneve-tls/97/undercloud-0/home/stack/.tripleo/history.gz

In an update job with the fix, from our current latest_cdn puddle to our current passed_phase2 puddle, we got 1 service restart during the converge stage.

Health manager logs (same "INFO octavia.common.config [-] /usr/bin/octavia-health-manager version 5.0.3" marker):
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-octavia-update-16.1_director-rhel-virthost-3cont_2comp_1ipa-ipv4-geneve-tls/99/controller-0/var/log/containers/octavia/health-manager.log.gz

When that step started (2022-11-18 20:09:07.158376) can be seen in the following log:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-octavia-update-16.1_director-rhel-virthost-3cont_2comp_1ipa-ipv4-geneve-tls/97/undercloud-0/home/stack/.tripleo/history.gz

We didn't see as many restarts as the customer did, but with the fix merged we do see an improvement in the number of unnecessary Octavia service restarts. That looks good to me. I am moving the BZ status to VERIFIED.
Some info about the puddles:
16.1.4: RHOS-16.1-RHEL-8-20210311.n.1
16.1.6: RHOS-16.1-RHEL-8-20210506.n.1
16.1 latest_cdn (used in the aforementioned build): RHOS-16.1-RHEL-8-20220804.n.1
16.1 passed_phase2 (used in the aforementioned build): RHOS-16.1-RHEL-8-20221116.n.1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.9 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8795