Bug 2001629

Summary: container-status validation incorrectly fails when some containers exited with code 143
Product: Red Hat OpenStack
Component: openstack-tripleo-validations
Version: 16.2 (Train)
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Luigi Toscano <ltoscano>
Assignee: Gaël Chamoulaud <gchamoul>
QA Contact: nlevinki <nlevinki>
CC: gchamoul, jbuchta, jjoyce, jpodivin, jschluet, slinaber, tvignaud
Keywords: Triaged
Target Milestone: z2
Target Release: 16.2 (Train on RHEL 8.4)
Target Upstream Version: stable/train
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-validations-11.6.1-2.20220114124841.fef7cf7.el8ost
Type: Bug
Bug Blocks: 2040667
Last Closed: 2022-03-23 22:11:37 UTC

Description Luigi Toscano 2021-09-06 15:24:04 UTC
Description of problem:
The container-status validation currently checks whether all the containers exited with status 0.
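For illustration, a minimal sketch of that kind of check is shown below. This is not the actual validation code (the real validation is an Ansible role); the podman invocation and JSON field names are assumptions and may differ between podman versions.

#!/usr/bin/env python3
"""Rough, illustrative sketch of a container-status style check."""
import json
import subprocess

# List all containers (including exited ones) as JSON.
out = subprocess.run(
    ["podman", "ps", "-a", "--format", "json"],
    capture_output=True, text=True, check=True,
).stdout

failed = []
for container in json.loads(out):
    # Any exited container with a non-zero exit code is flagged as failed.
    if container.get("State") == "exited" and container.get("ExitCode", 0) != 0:
        failed.append((container["Names"][0], container["ExitCode"]))

for name, code in failed:
    print(f"Failed container detected: {name} Exited ({code})")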

In a Fast Forward Upgrade scenario from OSP13 to OSP16.2 (and probably to OSP16.1 too), container-status sometimes fails because it finds specific containers, namely neutron-haproxy-qrouter-*, which exited with code 143. The validation reports an error like:

"Failed container detected: neutron-haproxy-qrouter-<foo> Exited (143) 40 minutes ago."

Looking at the logs of the failed jobs, there are not many details. The usual content of the log for such a container (/var/log/containers/stdouts/neutron-haproxy-qrouter-<foo>.log) is more or less:

2021-09-04T10:09:07.741719971+00:00 stderr F [WARNING] 246/100848 (987071) : Exiting Master process...                                                                                      
2021-09-04T10:09:07.742251294+00:00 stderr F [ALERT] 246/100848 (987071) : Current worker 987075 exited with code 143                                                                       
2021-09-04T10:09:07.742251294+00:00 stderr F [WARNING] 246/100848 (987071) : All workers exited. Exiting... (143)

According to networking experts (thanks Slaweq!), each of those containers "is sidecar container to run haproxy for metadata service for neutron router", and it looks like this error message should just be considered a "normal" termination (exit code 143 is 128 + 15, i.e. the process was terminated by SIGTERM), as discussed in:
https://www.mail-archive.com/haproxy@formilux.org/msg30473.html

(It is not clear whether the fix discussed there was backported to haproxy 1.8, the version available on OSP16.x, but probably not.)
 
So the validation should probably accept 143 as a valid termination value, perhaps always, or perhaps only in some specific cases; at least for that specific container it is not an incorrect value.
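A hedged sketch of that suggestion, reusing the illustrative check above: keep an allow-list of exit codes that count as clean termination. The allow-list and the helper name below are assumptions for illustration, not the actual patch.

# 143 = 128 + SIGTERM (15): the container was terminated cleanly on shutdown.
ACCEPTED_EXIT_CODES = {0, 143}

def is_failed_container(state: str, exit_code: int) -> bool:
    """Flag only containers that exited with an unexpected code."""
    return state == "exited" and exit_code not in ACCEPTED_EXIT_CODES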

Version-Release number of selected component (if applicable):
openstack-tripleo-validations-11.6.1-2.20210612074808.8644a02.el8ost.1.noarch
python3-validations-libs-1.1.1-2.20210607091343.04e84c8.el8ost.1.noarch
validations-common-1.1.2-2.20210611010116.el8ost.2.noarch

How reproducible:
Almost always on FFU jobs.


Steps to Reproduce:
openstack tripleo validator run --stack qe-Cloud-0 --validation container-status

Comment 3 Gaël Chamoulaud 2021-09-07 07:05:24 UTC
Systemd has been instructed to accept the 137, 142 and 143 exit status codes.
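For context, the standard systemd way to express this is the SuccessExitStatus directive in the container's service unit; the fragment below illustrates the idea and is not a verbatim copy of the tripleo-ansible/paunch patches (137 = 128 + SIGKILL, 142 = 128 + SIGALRM, 143 = 128 + SIGTERM):

[Service]
# Treat these exit codes as successful termination instead of failure.
SuccessExitStatus=137 142 143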

tripleo-ansible patches:
- https://wimp.usersys.redhat.com/?change_id_input=I8f19a80016a67ccad0371c5d108516aec640f031

python-paunch patches:
- https://wimp.usersys.redhat.com/?change_id_input=Iffcfc8bd18a999ae6921a4131d40241df40050f1

The container-status validation should also accept those exit status codes as valid termination statuses.

Comment 8 Jiri Podivin 2022-02-11 14:05:33 UTC
Manually tested on 2022-02-11 using RHOS 16.2.2, puddle RHOS-16.2-RHEL-8-20220210.n.1

Comment 13 errata-xmlrpc 2022-03-23 22:11:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.2), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1001