Description of problem:
During a minor update from 16.1.4 to 16.1.6, a number of containers were restarted as part of the converge step. This caused an outage on the OpenStack workloads. In particular, the restart of tripleo_octavia_health_manager.service took 5 minutes, which caused the converge to fail. This only happened on one controller, not on the remaining 2.

Version-Release number of selected component (if applicable):
OSP update from 16.1.4 to 16.1.6

How reproducible:
Not sure if reproducible outside of customer's environment

Steps to Reproduce:
1.
2.
3.

Actual results:
Containers were restarted during converge, some of them taking a long time to stop and start again, causing the converge step to time out.

Expected results:
I understand containers should not be restarted during the converge step, but if they are, they should definitely not take 5 minutes to complete.

Additional info:
Sosreports from before and after the update are available on the attached case, as well as mistral logs and the content of /var/lib/mistral, plus other troubleshooting files gathered in the last 2 days. Customer case is sev1 and escalated.
Added a commit for tripleo-ansible that prevents the Octavia services from being restarted every time the playbook is run.
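For illustration only (the actual commit is not quoted here), the general idempotency pattern in Ansible looks roughly like the sketch below. The task names, the config file path, and the list of systemd units are assumptions for the example, not the real tripleo-ansible content:

# Minimal sketch, NOT the actual tripleo-ansible change: only restart the
# Octavia units when the rendered configuration actually changed, instead of
# restarting them unconditionally on every playbook run.
# (Paths and unit names below are illustrative assumptions.)
- name: Render octavia configuration
  template:
    src: octavia.conf.j2
    dest: /var/lib/config-data/puppet-generated/octavia/etc/octavia/octavia.conf
  register: octavia_config

- name: Restart octavia services only when the config changed
  systemd:
    name: "{{ item }}"
    state: restarted
  loop:
    - tripleo_octavia_health_manager.service
    - tripleo_octavia_worker.service
  when: octavia_config.changed

The same effect can also be achieved with a handler notified by the template task; either way, the restart only happens when something actually changed rather than on every run.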
In our update job, in a build that ran from 16.1.4 -> 16.1.6, we saw 2 service restarts during the converge stage.

Health manager logs (each restart logs "INFO octavia.common.config [-] /usr/bin/octavia-health-manager version 5.0.3"):
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-octavia-update-16.1_director-rhel-virthost-3cont_2comp_1ipa-ipv4-geneve-tls/97/controller-0/var/log/containers/octavia/health-manager.log.gz

When that step started (2022-11-14 18:44:39.107968) can be seen in the following log:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-octavia-update-16.1_director-rhel-virthost-3cont_2comp_1ipa-ipv4-geneve-tls/97/undercloud-0/home/stack/.tripleo/history.gz

In an update job with the fix, from our current latest_cdn puddle to our current passed_phase2 puddle, we got 1 service restart during the converge stage.

Health manager logs (same "INFO octavia.common.config [-] /usr/bin/octavia-health-manager version 5.0.3" marker):
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-octavia-update-16.1_director-rhel-virthost-3cont_2comp_1ipa-ipv4-geneve-tls/99/controller-0/var/log/containers/octavia/health-manager.log.gz

When that step started (2022-11-18 20:09:07.158376) can be seen in the following log:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-octavia-update-16.1_director-rhel-virthost-3cont_2comp_1ipa-ipv4-geneve-tls/97/undercloud-0/home/stack/.tripleo/history.gz

We didn't see as many restarts as the customer did, but with the fix merged we do see an improvement in the number of unnecessary Octavia service restarts. That looks good to me. I am moving the BZ status to VERIFIED.
Some info about the puddles:
16.1.4: RHOS-16.1-RHEL-8-20210311.n.1
16.1.6: RHOS-16.1-RHEL-8-20210506.n.1
16.1 latest_cdn (used in the aforementioned build): RHOS-16.1-RHEL-8-20220804.n.1
16.1 passed_phase2 (used in the aforementioned build): RHOS-16.1-RHEL-8-20221116.n.1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.9 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8795