Bug 2057604

Summary: Overcloud update converge fails after containers are restarted, some of them taking minutes to shut down and start again
Product: Red Hat OpenStack
Reporter: Eric Nothen <enothen>
Component: tripleo-ansible
Assignee: Gregory Thiemonge <gthiemon>
Status: CLOSED ERRATA
QA Contact: Omer Schwartz <oschwart>
Severity: urgent
Priority: urgent
Version: 16.1 (Train)
CC: bporwal, gthiemon, jelynch, lpeer, majopela, oschwart, scohen
Target Milestone: z9
Keywords: Triaged
Target Release: 16.1 (Train on RHEL 8.2)
Flags: bporwal: needinfo?
Hardware: x86_64
OS: Linux
Fixed In Version: tripleo-ansible-0.5.1-1.20220614153406.902c3c8.el8ost
Doc Type: Bug Fix
Doc Text:
Before this update, the Load-balancing services (octavia) were restarted many times during deployments or updates. With this update, the services are restarted only when required, preventing potential interruptions of the control plane.
Last Closed: 2022-12-07 20:25:58 UTC
Type: Bug

Description Eric Nothen 2022-02-23 17:30:22 UTC
Description of problem:

During a minor update from 16.1.4 to 16.1.6, a number of containers were restarted as part of the converge step, causing an outage for OpenStack workloads. In particular, the restart of tripleo_octavia_health_manager.service took 5 minutes, which caused the converge to fail. This happened on only one controller, not on the remaining two.
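
To quantify how long a unit takes to restart, the stop/start pairs in its journal can be bracketed and diffed. A minimal Python sketch, assuming systemd writes "Stopping"/"Started" lines for the unit and that journalctl is run with sufficient privileges:

    import subprocess
    from datetime import datetime

    UNIT = "tripleo_octavia_health_manager.service"

    # Dump the unit's journal with ISO timestamps; each "Stopping"/"Started"
    # pair brackets one restart, so their delta is the restart duration.
    out = subprocess.run(
        ["journalctl", "-u", UNIT, "-o", "short-iso", "--no-pager"],
        capture_output=True, text=True, check=True,
    ).stdout

    marks = []
    for line in out.splitlines():
        if "Stopping" in line or "Started" in line:
            stamp = datetime.strptime(line.split()[0], "%Y-%m-%dT%H:%M:%S%z")
            marks.append((stamp, "stop" if "Stopping" in line else "start"))

    for (t1, kind1), (t2, kind2) in zip(marks, marks[1:]):
        if kind1 == "stop" and kind2 == "start":
            print(f"restart took {(t2 - t1).total_seconds():.0f}s")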


Version-Release number of selected component (if applicable):
OSP update from 16.1.4 to 16.1.6

How reproducible:
Not sure if reproducible outside of the customer's environment

Steps to Reproduce:
1.
2.
3.

Actual results:
Containers were restarted during converge, some of them taking a long time to stop and start again, causing the converge step to time out.

Expected results:
I understand containers should not be restarted during the converge step, but if they are, they should definitely not take 5 minutes to do so.

Additional info:
Sosreports from before and after the update are attached to the case, along with Mistral logs, the contents of /var/lib/mistral, and other troubleshooting files gathered over the last 2 days.

Customer case is sev1 and escalated.

Comment 15 Gregory Thiemonge 2022-03-11 07:22:46 UTC
Added a commit to tripleo-ansible; it prevents restarting the Octavia services each time the playbook is run.
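
For context, the usual shape of such a fix is to make the restart conditional on an actual configuration change instead of unconditional on every playbook run. A minimal Python sketch of that idempotent-restart idea (the helper names, config path, and unit name below are illustrative, not the actual tripleo-ansible change):

    import hashlib
    import subprocess
    from pathlib import Path

    def digest(path: Path) -> str:
        # SHA-256 of the file contents, or "" if the file is missing.
        return hashlib.sha256(path.read_bytes()).hexdigest() if path.exists() else ""

    def write_config_and_restart(rendered: str, target: Path, unit: str) -> bool:
        # Write the rendered config; restart the unit only if the file changed.
        if digest(target) == hashlib.sha256(rendered.encode()).hexdigest():
            return False  # unchanged config -> no restart
        target.write_text(rendered)
        subprocess.run(["systemctl", "restart", unit], check=True)
        return True

    # Hypothetical usage -- path and unit name for illustration only:
    # write_config_and_restart(new_conf, Path("/etc/octavia/octavia.conf"),
    #                          "tripleo_octavia_health_manager.service")

In Ansible the same effect falls out of a registered changed result or a handler, so a run that leaves the config untouched no longer bounces the service.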

Comment 25 Omer Schwartz 2022-11-22 12:53:48 UTC
In our update job, in a build that ran from 16.1.4 -> 16.1.6, we had 2 service restarts during the converge stage:

For example, the health manager logs, where each start logs "INFO octavia.common.config [-] /usr/bin/octavia-health-manager version 5.0.3":
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-octavia-update-16.1_director-rhel-virthost-3cont_2comp_1ipa-ipv4-geneve-tls/97/controller-0/var/log/containers/octavia/health-manager.log.gz

We can see when that step started in the following log (2022-11-14 18:44:39.107968)
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-octavia-update-16.1_director-rhel-virthost-3cont_2comp_1ipa-ipv4-geneve-tls/97/undercloud-0/home/stack/.tripleo/history.gz



And in an update job with the fix, from our current latest_cdn puddle to our current passed_phase2 puddle, we got 1 service restart during the converge stage:
Health manager logs (same "INFO octavia.common.config [-] /usr/bin/octavia-health-manager version 5.0.3" line):
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-octavia-update-16.1_director-rhel-virthost-3cont_2comp_1ipa-ipv4-geneve-tls/99/controller-0/var/log/containers/octavia/health-manager.log.gz

We can see when that step started in the following log (2022-11-18 20:09:07.158376)
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-octavia-update-16.1_director-rhel-virthost-3cont_2comp_1ipa-ipv4-geneve-tls/97/undercloud-0/home/stack/.tripleo/history.gz
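
The restart counts above come from counting occurrences of the health manager's startup banner at or after the converge start time. A minimal Python sketch of that count, assuming the standard Oslo timestamp prefix on each log line (the file name and cutoff below are taken from the first build):

    import re
    from datetime import datetime

    BANNER = "/usr/bin/octavia-health-manager version"

    def count_restarts(log_path: str, since: datetime) -> int:
        # Count startup banner lines logged at or after `since`.
        count = 0
        with open(log_path) as fh:
            for line in fh:
                if BANNER not in line:
                    continue
                # Oslo log lines start with "YYYY-MM-DD HH:MM:SS.mmm ...".
                m = re.match(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})", line)
                if m and datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S") >= since:
                    count += 1
        return count

    # Converge for the 16.1.4 -> 16.1.6 build started at 2022-11-14 18:44:39:
    print(count_restarts("health-manager.log", datetime(2022, 11, 14, 18, 44, 39)))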


We didn't see as many restarts as the customer did, but with the fix merged we do see an improvement: the excessive Octavia service restarts are gone.

That looks good to me. I am moving the BZ status to VERIFIED.

Comment 26 Omer Schwartz 2022-11-22 12:58:03 UTC
Some info about the puddles:

16.1.4: RHOS-16.1-RHEL-8-20210311.n.1

16.1.6: RHOS-16.1-RHEL-8-20210506.n.1

16.1 latest_cdn which was used in the aforementioned build: RHOS-16.1-RHEL-8-20220804.n.1

16.1 passed_phase2 which was used in the aforementioned build: RHOS-16.1-RHEL-8-20221116.n.1

Comment 32 errata-xmlrpc 2022-12-07 20:25:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.9 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8795